Reading and writing JSON data is a common big data task. Thankfully, this is easy to do in Spark using Spark SQL DataFrames.

Spark SQL can automatically infer the schema of a JSON dataset, and use it to load data into a DataFrame object. A DataFrame’s schema is used when writing JSON out to file.
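For example, once a JSON file has been loaded, the inferred schema can be inspected with printSchema(). A minimal sketch, assuming a sqlContext like the one created below and the sample file used in this article:

```scala
// Spark SQL scans the JSON records and infers a schema automatically
val df = sqlContext.read.json("bdPerson_v1_1k.json")

// print the inferred schema as a tree of field names and types
df.printSchema()
```

This is a quick way to confirm that Spark inferred the field names and types you expected before doing any further processing.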

The (Scala) examples below of reading in and writing out a JSON dataset were done in Spark 1.6.0. If you are using the spark-shell, you can skip the import and sqlContext creation steps.

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

//create sqlContext
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

//read json into DataFrame
val df = sqlContext.read.json("bdPerson_v1_1k.json")

//filter columns in DataFrame
val dfSmall = df.select("first_name", "last_name")

//write DataFrame as JSON into the "names" directory
dfSmall.write.json("names")

This example will write the DataFrame to multiple “part” files inside of a newly created “names” directory. All the “part” files combined will contain all the data from the DataFrame.
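To load the data back, the output directory itself can be passed to the JSON reader, which picks up every "part" file inside it. A sketch, again assuming the sqlContext created above:

```scala
// read the whole "names" directory back into a DataFrame;
// Spark treats all the part files inside it as one dataset
val dfNames = sqlContext.read.json("names")
dfNames.show()
```

If a single output file is preferred, the DataFrame can be reduced to one partition before writing, e.g. dfSmall.coalesce(1).write.json("names"), at the cost of writing through a single task.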

How to Read / Write JSON in Spark