Read XML Using PySpark in Jupyter Notebook
Solution 1:
As you've surmised, the trick is to get the package loaded so that PySpark will pick it up in your Jupyter context.
Start your notebook with your regular imports:
import pandas as pd
from pyspark.sql import SparkSession
import os
Before you instantiate your session, do:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0 pyspark-shell'
Notes:
- the first part of the package version (the _2.12 in the artifact name) has to match the version of Scala that your Spark was built with - you can find this out by running spark-submit --version from the command line, e.g.
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user centos on 2021-02-16T06:09:22Z
Revision 648457905c4ea7d00e3d88048c63f360045f0714
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
- the second part of the package version just has to be one that has been released for that version of Scala - you can find the available releases here: https://github.com/databricks/spark-xml - so in my case, since my Spark was built with Scala 2.12, the package I needed was com.databricks:spark-xml_2.12:0.12.0
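Once a session exists (one is created just below), you can also double-check the Scala version from inside PySpark. This isn't part of the original answer, and it goes through the private _jvm attribute, so treat it as a convenience trick rather than a stable API:
# Minimal sketch, assuming the sparkSesh session created below is available;
# _jvm is a private py4j gateway attribute, not a documented API.
print(sparkSesh.sparkContext._jvm.scala.util.Properties.versionString())  # e.g. "version 2.12.10"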
Now instantiate your session:
# Creates a session on a local master
sparkSesh = SparkSession.builder.appName("XML_Import") \
.master("local[*]").getOrCreate()
Find a simple .xml file whose structure you know - in my case I used the XML version of nmap output
thisXML = "simple.xml"
The reason for that is so that you can provide appropriate values for 'rootTag' and 'rowTag' below:
someXSDF = sparkSesh.read.format('xml') \
.option('rootTag', 'nmaprun') \
.option('rowTag', 'host') \
.load(thisXML)
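To make the rootTag/rowTag mapping concrete, here is a small self-contained sketch. The XML below is a made-up, heavily trimmed imitation of nmap output (not a real scan), written to a hypothetical file demo.xml:
# Hypothetical, minimal nmap-style XML: <nmaprun> is the outer (root) element,
# and each repeated <host> element becomes one DataFrame row.
sample = """<?xml version="1.0"?>
<nmaprun>
  <host><address addr="10.0.0.1"/><status state="up"/></host>
  <host><address addr="10.0.0.2"/><status state="down"/></host>
</nmaprun>
"""
with open("demo.xml", "w") as f:
    f.write(sample)

demoXSDF = sparkSesh.read.format('xml') \
    .option('rootTag', 'nmaprun') \
    .option('rowTag', 'host') \
    .load("demo.xml")
demoXSDF.show(truncate=False)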
If the file is small enough, you can just do a .toPandas() to review it:
someXSDF.toPandas()[["address", "ports"]][:5]
Then close the session.
sparkSesh.stop()
Closing Notes:
- if you want to test this outside of Jupyter, just go to the command line and run
pyspark --packages com.databricks:spark-xml_2.12:0.12.0
you should see it load up properly in the PySpark shell
- if the package version doesn't match up with the Scala version, you might get this error:
"Exception: Java gateway process exited before sending its port number"
which is a pretty funny way of saying that a package version number is wrong; if the package you loaded was built for a different Scala version than your Spark, you'll likely get this error when you try to read the XML:
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load. : java.lang.NoClassDefFoundError: scala/Product$class
- if the read seems to work but you get an empty dataframe, you probably specified the wrong root tag and/or row tag (a quick sanity check is sketched after these notes)
- if you need to support multiple read types (let's say you also needed to be able to read Avro files in the same notebook), you would list multiple packages with commas (no spaces) separating them, like so:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.1.2 pyspark-shell'
- My version info: Python 3.6.9, Spark 3.0.2
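As promised above, a quick sanity check for the empty-dataframe case (a sketch, reusing the someXSDF dataframe from earlier): if the row tag doesn't match any element, the inferred schema comes back empty and the count is 0.
# Sanity check sketch: an empty schema or a count of 0 usually means
# the rowTag (or the path to the file) is wrong
someXSDF.printSchema()
print(someXSDF.count())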