Reading and writing JSON data is a common big data task. Thankfully, this is very easy to do in Spark using Spark SQL DataFrames.
Spark SQL can automatically infer the schema of a JSON dataset and use it to load the data into a DataFrame. The DataFrame's schema is then used when writing JSON back out to file.
The (Scala) examples below of reading in and writing out a JSON dataset were run on Spark 1.6.0. If you are using the spark-shell, you can skip the import and sqlContext creation steps.
//imports
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext
//create sqlContext
val conf = new SparkConf()
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
//read json into DataFrame
val df = sqlContext.read.json("bdPerson_v1_1k.json")
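//optionally print the inferred schema (field names and types) to stdout
df.printSchema()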
//filter columns in DataFrame
val dfSmall = df.select("first_name", "last_name")
//write DataFrame as JSON into the "names" directory
dfSmall.write.json("names")
Sample Output:
{"first_name":"Chase","last_name":"Dotson"}
{"first_name":"Emily","last_name":"Mccall"}
{"first_name":"Mason","last_name":"Haynes"}
{"first_name":"Anna","last_name":"Greer"}
{"first_name":"Sophia","last_name":"Stephenson"}
This example writes the DataFrame to multiple “part” files inside a newly created “names” directory. Combined, the “part” files contain all of the data from the DataFrame.
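If a single output file is preferable, the DataFrame can be coalesced to one partition before writing. Here is a minimal sketch, reusing the dfSmall DataFrame from above (the “names_single” directory name is just an example):
//coalesce to one partition so only a single "part" file is written
//note: this funnels all the data through one task, so avoid it for large DataFrames
dfSmall.coalesce(1).write.json("names_single")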