One of the great features of Spark is the variety of data sources it can read from. Loading data from a database into Spark using JDBC requires 3 major steps. First you need a running database that support JDBC connections. Next you will need to download and use the JDBC driver of that database. Finally data is loaded into Spark.

Databases Supporting JDBC Connections

A database that supports JDBC connections is needed to load data from a database into Spark. MySQL, Oracle, and Postgres are common options. In this post we show an example using MySQL.

Downloading the Database JDBC Driver

A JDBC driver is needed to connect your database to Spark. The MySQL JDBC driver can be downloaded at MySQL provides ZIP or TAR archives that contain the database driver. Inside each of these archives will be a mysql-connector-java-...-bin.jar file. This is the JDBC driver that enables Spark to connect to the database.

Connect to Spark to Database and Load Data

We now have everything we need to connect Spark to our database. If running within the spark-shell use the --jars option and provide the location of your JDBC driver jar file on the command line.
spark-shell --jars ./mysql-connector-java-5.0.8-bin.jar

Once the spark-shell has started, we can now load data from the database into a Spark DataFrame. When creating a connection to the database, you need to specify input options such as driver class, database table, and username to configure the connection to the database. This can be done a couple of ways. The two examples below are equivalent.

//add database input options separately 
val df ="jdbc").
  option("url", "jdbc:mysql://localhost:3306/data").
  option("dbtable", "sample_data_1").
  option("driver", "com.mysql.jdbc.Driver").
  option("user", "root").
  option("password", "pw").

//add database input options as Map
val df="jdbc").
      "url" -> "jdbc:mysql://localhost:3306/data", 
      "dbtable" -> "sample_data_1", 
      "driver" -> "com.mysql.jdbc.Driver", 
      "user" -> "root", 
      "password" -> "pw")).

The load() method shown above loads data into a Spark DataFrame. At this point our database data is now available to use in Spark. You can see a preview of this data using the show() method.

| id|first_name| last_name|               email|             country|     ip_address|
|  1|   Barbara|       Day|      [email protected]|                null||
|  2|  Jennifer|   Roberts| [email protected]|               China|           null|
|  3|     Helen|      Gray| [email protected]|               China||
|  4|     Maria|       Day|     [email protected]|          Uzbekistan||
|  5|    Bonnie|      Diaz|     [email protected]|            Malaysia||
|  6|   Jessica|  Gonzalez|jgonzalez5@redcro...|               China||
|  7|    Amanda|     Stone|   [email protected]|            Slovenia||
|  8|     Roger|      Sims|    [email protected]|           Argentina||
|  9|    Angela|Williamson|awilliamson8@drop...|               China||
| 10|   Antonio|      Shaw|ashaw9@studiopres...|              Sweden||
| 11|  Lawrence|   Sanchez|lsancheza@domainm...|              France||
| 12|     Wanda|  Lawrence|wlawrenceb@geocit...|               China||
| 13|      Carl|  Marshall|[email protected]|       United States||
| 14|    Amanda|    Chavez|    [email protected]|Saint Vincent and...||
| 15|    Andrea|  Hamilton|  [email protected]|               Libya||
| 16|     Larry|      Hill|lhillf@craigslist...|               China||
| 17|     Emily|    Nguyen|  [email protected]|            Zimbabwe||
| 18|   Phillip|      Long|      [email protected]|              Brazil||
| 19|     Chris|    Burton|[email protected]...|       United States||
| 20|      Paul|    Brooks|[email protected]|           Indonesia||

Leave a Reply

Loading Data from a Database into Spark