Loading text files in Spark is a very common task, and luckily it is easy to do.

Below are a few examples of loading a text file (located on the Big Datums GitHub repo) into an RDD in Spark. If you have looked at the Spark documentation, you may have noticed that its examples do not include the file:// prefix. This prefix is often needed, however, because many Spark installations use HDFS as the default location for input and output files. The examples below were run in the Spark shell.
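To make the distinction concrete, here is a minimal sketch (the paths are hypothetical): a path without a scheme is resolved against the cluster's default filesystem, while the file:// prefix forces the local filesystem.

// no scheme: resolved against the cluster's default filesystem (often HDFS)
val fromDefaultFs = sc.textFile("/data/people.txt")
// explicit file:// scheme: always reads from the local filesystem
val fromLocal = sc.textFile("file:///data/people.txt")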

Create an RDD from a text file:

val fileData = sc.textFile("file:///home/user1/bddatagen_people_wHeader_v1_5k.txt")
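A quick way to verify the load in the shell is to inspect the data with standard RDD actions (this check is not part of the original example, and assumes the path above exists on the local filesystem):

fileData.first()  // returns the first line (here, the header row)
fileData.count()  // total number of lines in the file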

Create an RDD from a text file and filter out the header row:

val fileData = sc.textFile("file:///home/user1/bddatagen_people_wHeader_v1_5k.txt").
  filter(!_.contains("first_name"))
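Filtering on a known column name works here because the header contains "first_name". When the header's contents are not known in advance, a common alternative is to drop the first line of the first partition with mapPartitionsWithIndex; the sketch below is not from the original post and assumes the header occupies the first line of the file:

val fileDataNoHeader = sc.textFile("file:///home/user1/bddatagen_people_wHeader_v1_5k.txt").
  mapPartitionsWithIndex { (idx, iter) =>
    // the header is the first record of the first partition (index 0)
    if (idx == 0) iter.drop(1) else iter
  }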

Create an RDD of usernames by splitting each record on the field delimiter (in this case "\t") and retaining only the second field (the username); since Scala arrays are zero-indexed, _(1) selects the second field:

val fileData = sc.textFile("file:///home/user1/bddatagen_people_wHeader_v1_5k.txt").
  filter(!_.contains("first_name")).
  map(_.split("\t")).
  map(_(1))
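As a final check, you can print a few of the resulting usernames or count the distinct values; both are standard RDD operations (not part of the original example):

fileData.take(5).foreach(println)  // print the first 5 usernames
fileData.distinct().count()        // number of distinct usernames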
