Change PYSPARK_PYTHON On Spark Workers
Solution 1:
Browsing through the source code, it looks like the Python driver takes the Python executable path from its Spark context and embeds it in the work items it creates for running Python functions, in _wrap_function() in pyspark/rdd.py:
def _wrap_function(sc, func, deserializer, serializer, profiler=None):
    assert deserializer, "deserializer should not be empty"
    assert serializer, "serializer should not be empty"
    command = (func, profiler, deserializer, serializer)
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes,
                                  sc.pythonExec,  # <-- executable path from the driver's SparkContext
                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
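In other words, whatever executable the driver's SparkContext was started with is what gets serialized alongside each function. Below is a minimal driver-side sketch of that behaviour; the /usr/bin/python3 path is purely illustrative and not from the original code:
import os
from pyspark import SparkConf, SparkContext

# PYSPARK_PYTHON must be set before the SparkContext is created;
# /usr/bin/python3 is just an example path.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

sc = SparkContext(conf=SparkConf().setAppName("pythonExec-demo"))

# The driver records the executable in sc.pythonExec; that exact string is what
# _wrap_function() above embeds into every PythonFunction shipped to the executors.
print(sc.pythonExec)  # -> /usr/bin/python3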
The Python runner, PythonRunner.scala, then uses the path stored in the first work item it receives when launching new interpreter instances:
private[spark] abstract class BasePythonRunner[IN, OUT](
    funcs: Seq[ChainedPythonFunctions],
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends Logging {
  ...
  // <-- the executable path of the first work item is used for all workers
  protected val pythonExec: String = funcs.head.funcs.head.pythonExec
  ...
  def compute(
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): Iterator[OUT] = {
    ...
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    ...
  }
  ...
}
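One consequence worth spelling out: every Python function in an application is wrapped through the same SparkContext, so every work item carries the same path, and the "first work item" inspected above is representative of them all. A small illustrative sketch, using only the public sc.pythonExec attribute:
from pyspark import SparkContext

sc = SparkContext(appName="pythonExec-is-uniform")

# Two unrelated pipelines; both lambdas are wrapped via _wrap_function() with
# the same context, so both carry the same executable path to the executors.
doubled = sc.parallelize(range(4)).map(lambda x: x * 2)
evens = sc.parallelize(range(4)).filter(lambda x: x % 2 == 0)

print(sc.pythonExec)  # the single path every work item of this application carries
print(doubled.collect(), evens.collect())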
Based on that, I'm afraid it does not currently seem possible to have separate configurations for the Python executable on the master and on the workers. Also see the third comment on issue SPARK-26404. Perhaps you should file an RFE with the Apache Spark project.
I'm not a Spark guru, though, and there might still be a way to do it, perhaps by setting PYSPARK_PYTHON to just "python" and then making sure the system PATH is configured accordingly so that your Python executable comes first.
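A rough, untested sketch of that PATH-based idea; the /opt/worker-python/bin directory is hypothetical, and arranging PATH on the worker nodes (for example in conf/spark-env.sh) has to happen outside the script:
import os
from pyspark import SparkConf, SparkContext

# Ship a bare "python" as the executable name and let each machine's own PATH
# decide which interpreter that resolves to. Must be set before the context exists.
os.environ["PYSPARK_PYTHON"] = "python"

sc = SparkContext(conf=SparkConf().setAppName("path-based-python"))
print(sc.pythonExec)  # -> "python" (unqualified)

# On every worker node the desired interpreter would still have to come first on
# PATH, arranged outside Spark, e.g. in conf/spark-env.sh on each node:
#   export PATH=/opt/worker-python/bin:$PATH    # hypothetical directory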