
Change PYSPARK_PYTHON on Spark Workers

We distribute our Python app, which uses Spark, together with a Python 3.7 interpreter (python.exe with all necessary libraries lies next to MyApp.exe). To set PYSPARK_PYTHON we have …
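
For context, PYSPARK_PYTHON is normally exported before the SparkContext/SparkSession is created; a minimal sketch of that, where C:\MyApp\python.exe is an assumed location for the bundled interpreter:

# Sketch: point PYSPARK_PYTHON at the interpreter shipped alongside the app.
# The path below is hypothetical; adjust it to wherever python.exe actually lives.
import os
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = r"C:\MyApp\python.exe"

spark = SparkSession.builder.appName("MyApp").getOrCreate()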

Solution 1:

Browsing through the source code, it looks like the Python driver takes the Python executable path from its SparkContext when creating work items for running Python functions, in pyspark/rdd.py:

def _wrap_function(sc, func, deserializer, serializer, profiler=None):
    assert deserializer, "deserializer should not be empty"
    assert serializer, "serializer should not be empty"
    command = (func, profiler, deserializer, serializer)
    pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
    return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes,
                                  sc.pythonExec,  # <-- executable path taken from the SparkContext
                                  sc.pythonVer, broadcast_vars, sc._javaAccumulator)
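
The value forwarded here is whatever the driver-side SparkContext resolved (typically from PYSPARK_PYTHON). A quick way to inspect it from the driver, as a sketch (pythonExec and pythonVer are internal attributes rather than documented API):

# Sketch: show which executable the driver will ship with Python work items.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
print(sc.pythonExec)  # resolved from PYSPARK_PYTHON on the driver
print(sc.pythonVer)   # e.g. "3.7"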

The Python runner, PythonRunner.scala, then uses the path stored in the first work item it receives to launch new interpreter instances:

private[spark] abstract class BasePythonRunner[IN, OUT](
    funcs: Seq[ChainedPythonFunctions],
    evalType: Int,
    argOffsets: Array[Array[Int]])
  extends Logging {
  ...
  protected val pythonExec: String =
    funcs.head.funcs.head.pythonExec  // <-- path taken from the first Python function
  ...
  def compute(
      inputIterator: Iterator[IN],
      partitionIndex: Int,
      context: TaskContext): Iterator[OUT] = {
    ...
    val worker: Socket = env.createPythonWorker(pythonExec, envVars.asScala.toMap)
    ...
  }
  ...
}
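
So the executors simply launch whatever executable the driver serialized into the first Python function. One way to check which interpreter the workers actually run is a small job like this (a sketch, not part of the code above):

# Sketch: report the Python executables actually used on the executors.
import sys
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
executables = (
    spark.sparkContext
    .parallelize(range(8), 8)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print(executables)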

Based on that, I'm afraid it does not currently seem possible to configure the Python executable separately for the master and for the workers. See also the third comment on issue SPARK-26404. Perhaps you should file an RFE with the Apache Spark project.

I'm not a Spark guru, though, and there might still be a way to do it, for example by setting PYSPARK_PYTHON to just "python" and then making sure the system PATH is configured so that your Python executable comes first.
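
A minimal sketch of that workaround, assuming the bundled interpreter lives in C:\MyApp; for the bare name "python" to resolve to the right executable on the executors, the same PATH adjustment would have to be in place on every worker machine as well:

# Sketch of the PATH-based workaround; C:\MyApp is an assumed install directory.
import os
from pyspark.sql import SparkSession

os.environ["PATH"] = r"C:\MyApp" + os.pathsep + os.environ.get("PATH", "")
os.environ["PYSPARK_PYTHON"] = "python"  # resolved via PATH instead of a fixed path

spark = SparkSession.builder.appName("MyApp").getOrCreate()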
