Reading and writing JSON data is a common big data task. Thankfully, this is very easy to do in Spark using Spark SQL DataFrames.

Spark SQL can automatically infer the schema of a JSON dataset and use it to load the data into a DataFrame. The DataFrame's schema is then used when writing the JSON back out to file.
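
For example, once the JSON file has been read (the read step is shown below), the inferred schema can be inspected with printSchema(). This is a minimal sketch; the fields shown in the commented output assume the input contains first_name and last_name string columns, as in the sample data later in this post:

//print the schema Spark SQL inferred from the JSON dataset
df.printSchema()
// root
//  |-- first_name: string (nullable = true)
//  |-- last_name: string (nullable = true)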

The Scala examples below of reading in and writing out a JSON dataset were run on Spark 1.6.0. If you are using the spark-shell, you can skip the import and sqlContext creation steps.

//imports 
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

//create sqlContext
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

//read json into DataFrame
val df = sqlContext.read.json("bdPerson_v1_1k.json")

//filter columns in DataFrame
val dfSmall = df.select("first_name", "last_name")

//write DataFrame as JSON into the "names" directory
dfSmall.write.json("names")
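
By default, write.json will fail if the target directory already exists. A save mode can be supplied to overwrite it instead; a minimal sketch using the DataFrameWriter mode option:

//overwrite the "names" directory if it already exists
dfSmall.write.mode("overwrite").json("names")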

Sample Output:

{"first_name":"Chase","last_name":"Dotson"}
{"first_name":"Emily","last_name":"Mccall"}
{"first_name":"Mason","last_name":"Haynes"}
{"first_name":"Anna","last_name":"Greer"}
{"first_name":"Sophia","last_name":"Stephenson"}

This example writes the DataFrame to multiple “part” files inside a newly created “names” directory. Together, the “part” files contain all of the data from the DataFrame.
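
If a single output file is preferred, the DataFrame can first be coalesced down to one partition before writing. A minimal sketch (the "names_single" directory name is just an example):

//coalesce to one partition so a single "part" file is written
dfSmall.coalesce(1).write.json("names_single")

Keep in mind that coalescing to one partition funnels all the data through a single task, so this is only practical for small DataFrames.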
