Writing to a Database from Spark

One of the great features of Spark is the variety of data sources it can read from and write to. If you already have a database, connecting to it and writing data from Spark is fairly simple. This example shows how to write to a database that supports JDBC connections.

Databases Supporting JDBC Connections

Spark can easily write to databases that support JDBC connections. MySQL, Oracle, and Postgres are common options. In this post we show an example using MySQL.

Downloading the Database JDBC Driver

A JDBC driver is needed to connect your database to Spark. The MySQL JDBC driver can be downloaded from MySQL's download site. MySQL provides ZIP or TAR archives that contain the database driver. Inside each of these archives is a mysql-connector-java-...-bin.jar file. This is the JDBC driver that enables Spark to connect to the database.

Write data from Spark to Database

We now have everything we need to connect Spark to our database. If running within the spark-shell, use the --jars option to provide the location of your JDBC driver jar file on the command line:
spark-shell --jars ./mysql-connector-java-5.0.8-bin.jar

Once the spark-shell has started, we can insert data from a Spark DataFrame into our database. A sample of our DataFrame's contents is shown below.

| id|first_name| last_name|               email|             country|     ip_address|
|  1|   Barbara|       Day|      [email protected]|                null|               |
|  2|  Jennifer|   Roberts| [email protected]|               China|           null|
|  3|     Helen|      Gray| [email protected]|               China|               |
|  4|     Maria|       Day|     [email protected]|          Uzbekistan|               |
|  5|    Bonnie|      Diaz|     [email protected]|            Malaysia|               |
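If you want to follow along without an existing dataset, a comparable DataFrame can be built directly in the spark-shell. The rows below are made-up placeholders with the same schema (the email and ip_address values in the table above are not recoverable, so example.com addresses and a sample IP stand in for them):

```scala
// already in scope inside the spark-shell; shown here for completeness
import sqlContext.implicits._

// made-up sample rows matching the schema shown above
// (None becomes a SQL NULL when the DataFrame is written out)
val rows = Seq(
  (1, "Barbara",  "Day",     "bday0@example.com",  None,          None),
  (2, "Jennifer", "Roberts", "jrob1@example.com",  Some("China"), None),
  (3, "Helen",    "Gray",    "hgray2@example.com", Some("China"), Some("10.0.0.3"))
)

val df = rows.toDF("id", "first_name", "last_name", "email", "country", "ip_address")
```

Using Option values (Some/None) for the nullable columns lets toDF infer nullable string columns without an explicit schema.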

Spark DataFrames (as of Spark 1.4) have a write() method that can be used to write to a database. The write() method returns a DataFrameWriter object. DataFrameWriter objects have a jdbc() method, which is used to save DataFrame contents to an external database table via JDBC. The jdbc() method takes a JDBC URL, destination table name, and a Java Properties object containing other connection information.

Here is an example of putting these various pieces together to write to a MySQL database.

//create properties object
val prop = new java.util.Properties
prop.setProperty("driver", "com.mysql.jdbc.Driver")
prop.setProperty("user", "root")
prop.setProperty("password", "pw") 

//jdbc mysql url - destination database is named "data"
val url = "jdbc:mysql://localhost:3306/data"

//destination database table 
val table = "sample_data_table"

//write data from spark dataframe to database
df.write.mode("append").jdbc(url, table, prop)
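To confirm the rows arrived, the same url, table, and prop values can be reused to read the table back. A quick check, using the sqlContext available in the Spark 1.4-era spark-shell:

```scala
// read the table back over the same JDBC connection and inspect a few rows
val readBack = sqlContext.read.jdbc(url, table, prop)
readBack.show(5)
```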

Notice in the above example we set the mode of the DataFrameWriter to "append" using df.write.mode("append"). The mode() method specifies how to handle the write when the destination table already exists. By default, Spark creates the destination table and inserts the data; if the table already exists, the write fails with a TableAlreadyExists exception. To write to an existing table you must use mode("append"), as in the example above.
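For reference, "append" is one of several save modes the DataFrameWriter accepts. A quick sketch of the alternatives, reusing the url, table, and prop values from above:

```scala
// "overwrite" replaces the existing contents of the destination table
df.write.mode("overwrite").jdbc(url, table, prop)

// "ignore" silently skips the write if the table already exists
df.write.mode("ignore").jdbc(url, table, prop)

// "error" is the default: fail if the table already exists
df.write.mode("error").jdbc(url, table, prop)
```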

