JSON is a very common way to store data, but real-world JSON can get messy, and parsing deeply nested structures can get tricky. Here are a few examples of parsing nested data structures in JSON using Spark DataFrames (the examples here were done with Spark 1.6.0).
Our sample.json file:
{
  "user": "gT35Hhhre9m",
  "dates": ["2016-01-29", "2016-01-28"],
  "status": "OK",
  "reason": "some reason",
  "content": [
    { "foo": 123, "bar": "val1" },
    { "foo": 456, "bar": "val2" },
    { "foo": 789, "bar": "val3" },
    { "foo": 124, "bar": "val4" },
    { "foo": 126, "bar": "val5" }
  ]
}
Assuming you already have a SQLContext object created, the examples below demonstrate how to parse the nested data from the JSON above. One gotcha: in Spark 1.6, read.json expects one complete JSON object per line, so the sample.json file on disk should contain the whole object on a single line; it is pretty-printed above for readability.
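If you're running outside the spark-shell (which creates sc and sqlContext for you), a minimal sketch for setting one up, assuming an existing SparkContext named sc, looks like this:
//create a SQLContext from an existing SparkContext
scala> import org.apache.spark.sql.SQLContext
scala> val sqlContext = new SQLContext(sc)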
Read the JSON file into a Spark DataFrame:
scala> val df = sqlContext.read.json("sample.json")
df: org.apache.spark.sql.DataFrame = [content: array<struct<bar:string,foo:bigint>>, dates: array<string>, reason: string, status: string, user: string]
//output
scala> df.show
+--------------------+--------------------+-----------+------+-----------+
|             content|               dates|     reason|status|       user|
+--------------------+--------------------+-----------+------+-----------+
|[[val1,123], [val...|[2016-01-29, 2016...|some reason|    OK|gT35Hhhre9m|
+--------------------+--------------------+-----------+------+-----------+
We can see in our output that the “content” field contains an array of structs, while our “dates” field contains an array of strings. The first step to being able to access the data in these data structures is to extract and “explode” the column into a new DataFrame using the explode function.
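Before exploding anything, printSchema is a quick way to confirm the structure Spark inferred; for our sample the output should look roughly like this:
//inspect the inferred schema
scala> df.printSchema
root
 |-- content: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- bar: string (nullable = true)
 |    |    |-- foo: long (nullable = true)
 |-- dates: array (nullable = true)
 |    |-- element: string (containsNull = true)
 |-- reason: string (nullable = true)
 |-- status: string (nullable = true)
 |-- user: string (nullable = true)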
Extracting “dates” into new DataFrame:
//explode lives in org.apache.spark.sql.functions
scala> import org.apache.spark.sql.functions.explode

//explode dates field
scala> val dfDates = df.select(explode(df("dates")))
//output
scala> dfDates.show
+----------+
|       col|
+----------+
|2016-01-29|
|2016-01-28|
+----------+
//rename "col" to "dates"
scala> val dfDates = df.select(explode(df("dates"))).toDF("dates")
//output
scala> dfDates.show
+----------+
|     dates|
+----------+
|2016-01-29|
|2016-01-28|
+----------+
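The exploded values are still plain strings. If you want actual dates to work with, one option (a sketch, assuming the yyyy-MM-dd format in our sample) is the to_date function:
//cast the exploded strings to DateType (sketch)
scala> import org.apache.spark.sql.functions.to_date
scala> val dfRealDates = dfDates.select(to_date(dfDates("dates")).as("dates"))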
Our “content” field contains an array of structs. To access the data in each of these structs, we can use the dot operator.
Extracting data in array of structs:
//explode content field
scala> val dfContent = df.select(explode(df("content")))
dfContent: org.apache.spark.sql.DataFrame = [col: struct<bar:string,foo:bigint>]
//output
scala> dfContent.show
+----------+
|       col|
+----------+
|[val1,123]|
|[val2,456]|
|[val3,789]|
|[val4,124]|
|[val5,126]|
+----------+
//rename "col" to "content"
scala> val dfContent = df.select(explode(df("content"))).toDF("content")
dfContent: org.apache.spark.sql.DataFrame = [content: struct<bar:string,foo:bigint>]
//output
scala> dfContent.show
+----------+
|   content|
+----------+
|[val1,123]|
|[val2,456]|
|[val3,789]|
|[val4,124]|
|[val5,126]|
+----------+
//extracting fields in struct
scala> val dfFooBar = dfContent.select("content.foo", "content.bar")
dfFooBar: org.apache.spark.sql.DataFrame = [foo: bigint, bar: string]
//output
scala> dfFooBar.show
+---+----+
|foo| bar|
+---+----+
|123|val1|
|456|val2|
|789|val3|
|124|val4|
|126|val5|
+---+----+
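One thing to keep in mind: selecting only the exploded column discards the rest of the row, so if you need, say, the user value alongside each struct, select it in the same statement. A sketch of that pattern (dfUserFooBar is a hypothetical name):
//carry a top-level column through the explode (sketch)
scala> val dfUserFooBar = df.select(df("user"), explode(df("content")).as("content")).select("user", "content.foo", "content.bar")
scala> dfUserFooBar.show
+-----------+---+----+
|       user|foo| bar|
+-----------+---+----+
|gT35Hhhre9m|123|val1|
|gT35Hhhre9m|456|val2|
|gT35Hhhre9m|789|val3|
|gT35Hhhre9m|124|val4|
|gT35Hhhre9m|126|val5|
+-----------+---+----+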