Adding Custom Jars To Pyspark In Jupyter Notebook
Solution 1:
I've managed to get it working from within the Jupyter notebook which is running from the all-spark container.
I start a Python 3 notebook in JupyterHub and overwrite the PYSPARK_SUBMIT_ARGS environment variable as shown below. The Kafka consumer library was downloaded from the Maven repository and put in my home directory /home/jovyan:
import os

# Point spark-submit at the Kafka assembly jar; this must be set before the
# SparkContext is created, and the trailing "pyspark-shell" is required.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--jars /home/jovyan/spark-streaming-kafka-assembly_2.10-1.6.1.jar pyspark-shell'
)

import pyspark
from pyspark.streaming.kafka import KafkaUtils
from pyspark.streaming import StreamingContext

sc = pyspark.SparkContext()
ssc = StreamingContext(sc, 1)

# Consume the "test1" topic directly from the Kafka broker.
broker = "<my_broker_ip>"
directKafkaStream = KafkaUtils.createDirectStream(
    ssc, ["test1"], {"metadata.broker.list": broker})
directKafkaStream.pprint()
ssc.start()
Note: Don't forget the trailing pyspark-shell in the PYSPARK_SUBMIT_ARGS value!
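In a notebook, ssc.start() returns immediately and the stream keeps running in the background. A small hedged follow-up, assuming you want the stream to run for a bounded time rather than indefinitely:

# Optional: run the stream for a bounded time instead of leaving it running.
ssc.awaitTerminationOrTimeout(60)                      # block for ~60 seconds
ssc.stop(stopSparkContext=False, stopGraceFully=True)  # keep the SparkContext alive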
Extension: If you want to include code from spark-packages, you can use the --packages flag instead. An example of how to do this in the all-spark-notebook can be found here.
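For instance, a minimal sketch of the --packages variant, assuming you want the same Kafka assembly resolved from Maven coordinates instead of a local jar (the coordinates below are illustrative; match them to your Spark/Scala versions):

import os

# Illustrative coordinates -- adjust groupId:artifactId:version to your setup.
os.environ['PYSPARK_SUBMIT_ARGS'] = (
    '--packages org.apache.spark:spark-streaming-kafka-assembly_2.10:1.6.1 pyspark-shell'
)

import pyspark
sc = pyspark.SparkContext()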
Solution 2:
Indeed, there is a way to link the jars dynamically via the SparkConf object when you create the SparkSession, as explained in this answer:
spark = SparkSession \
.builder \
.appName("My App") \
.config("spark.jars", "/path/to/jar.jar,/path/to/another/jar.jar") \
.getOrCreate()
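To check that the jars were actually registered, you can read the setting back from the running session; a quick sanity check, assuming the spark session created above:

# Should print the comma-separated jar paths passed via .config("spark.jars", ...)
print(spark.sparkContext.getConf().get("spark.jars"))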
Solution 3:
You can run your Jupyter notebook with the pyspark command by setting the relevant environment variables:
export PYSPARK_DRIVER_PYTHON=jupyter
export IPYTHON=1
export PYSPARK_DRIVER_PYTHON_OPTS="notebook --port=XXX --ip=YYY"
where XXX is the port you want to use to access the notebook and YYY is the IP address.
Now simply run pyspark and add --jars as a switch, the same as you would with spark-submit.
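For example (the jar paths below are placeholders):

# Launches Jupyter (per the variables above) with both jars on the Spark classpath.
pyspark --jars /path/to/first.jar,/path/to/second.jar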
Solution 4:
In case someone is in the same situation as me: I tried all of the above solutions and none of them worked for me. What I'm trying to do is use Delta Lake in a Jupyter notebook.
Finally, I was able to use from delta.tables import * by first calling SparkContext.addPyFile("/path/to/your/jar.jar"). Although the official Spark docs only mention adding .zip or .py files, I tried a .jar and it worked perfectly.
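A minimal sketch of that sequence, assuming a locally downloaded Delta Lake jar (the path, artifact name, and the extra spark.jars config are my own assumptions, not from the original answer):

from pyspark.sql import SparkSession

# Assumed local path to the delta-core jar -- adjust to wherever you downloaded it.
delta_jar = "/home/jovyan/delta-core_2.12-1.0.0.jar"

spark = SparkSession.builder \
    .appName("Delta in Jupyter") \
    .config("spark.jars", delta_jar) \
    .getOrCreate()

# addPyFile puts the jar's bundled Python package on the driver's sys.path,
# which is what makes the import below resolve.
spark.sparkContext.addPyFile(delta_jar)

from delta.tables import *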
Solution 5:
For working with Spark in a Jupyter notebook, you need to give the location of the external jars before the SparkContext object is created. Running pyspark --jars yourJar.jar will create a SparkContext with the external jars on its classpath.