Sunday, September 2, 2018

Reading resource files using Scala

Often while coding up unit tests in Scala, I need to read from a file that is available in the resources folder. There are a few variations to how this can be done, especially if I am using the contents of the file as a DataFrame in Spark. Here are some examples that I want to keep for myself as notes.

All these examples work with JDK 1.8u144, Scala 2.11.8 and Spark 2.3.1. Test them out if your software versions differ substantially from these.
  • Get the contents of the text file as an Iterator:

    // In the path below, "/" refers to the root of the resources folder
    val fileStream = scala.io.Source.fromInputStream(getClass.getResourceAsStream("/path/to/file.txt"))
    val iterator: Iterator[String] = fileStream.getLines()
    
    // Print the lines. Note that a Source can only be consumed once,
    // so in practice only one of these getLines() traversals will see the data.
    fileStream.getLines().foreach(println)
    
    // To get the complete contents of the file in a string
    val contents = fileStream.getLines().mkString("\n")
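
    A variation on the above, just as a sketch: if the Source should be closed after reading
    (e.g. in a shared test helper), the read can be wrapped in try/finally. The helper name
    resourceToString is made up here.

    def resourceToString(resourcePath: String): String = {
      val source = scala.io.Source.fromInputStream(getClass.getResourceAsStream(resourcePath))
      try source.getLines().mkString("\n")
      finally source.close()
    }
    val fileContents = resourceToString("/path/to/file.txt")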
    

  • Get a text file as Spark RDD[String], individual lines as rows:

    // In the path below, "/" refers to the root of the resources folder
    val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    val fileStream = getClass.getResourceAsStream("/path/to/file.txt")
    val input = session.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)
    
    // Another shorter alternative:
    val input = session.sparkContext.textFile(getClass.getResource("/path/to/file.txt").getPath)
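
    Purely as an illustration of how this fits into a test, a sketch of a ScalaTest suite around
    the snippet above (this assumes scalatest is on the test classpath; the suite and test names
    are made up):

    import org.apache.spark.sql.SparkSession
    import org.scalatest.FunSuite

    class ResourceFileSpec extends FunSuite {
      test("resource file loads into an RDD") {
        val session = SparkSession.builder().appName("resource-file-test").master("local[1]").getOrCreate()
        val fileStream = getClass.getResourceAsStream("/path/to/file.txt")
        val input = session.sparkContext.makeRDD(scala.io.Source.fromInputStream(fileStream).getLines().toList)
        assert(input.count() > 0)
        session.stop()
      }
    }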
    

  • Get a text file as an [n x 1] Spark DataFrame with individual lines as rows:

    // In the path below, "/" refers to the root of the resources folder
    val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    // toDF on an RDD needs the session's implicits in scope
    import session.implicits._
    val input: DataFrame = session.sparkContext.textFile(getClass.getResource("/path/to/file.txt").getPath).toDF("text")
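
    Once loaded, the single-column DataFrame can be used like any other; a small usage sketch
    (the column name "text" simply matches the toDF call above):

    input.printSchema()   // root |-- text: string (nullable = true)
    val lines: Array[String] = input.collect().map(_.getString(0))
    val nonEmptyLines = input.filter("text != ''").count()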
    

  • Read a line-delimited JSON file into a Spark DataFrame:

    // In the path below, "/" refers to the root of the resources folder
    val session: SparkSession = SparkSession.builder().config(conf).getOrCreate()
    // The line below works when the file is in the resources folder and the program is run
    // through an IDE or from the command line. However, it doesn't work when packaged in a jar;
    // it fails with org.apache.spark.sql.AnalysisException: Path does not exist
    val input: DataFrame = session.read.json(getClass.getResource("/path/to/json/file").getPath).select("text")
    
    // The following works even when the resource file is packaged in a jar
    import scala.io.Source
    import session.implicits._
    val ds = session.createDataset[String](Source.fromInputStream(getClass.getResourceAsStream("/path/to/json/file")).getLines().toSeq)
    val input = session.read.json(ds).select("text")
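
    If the JSON schema is known up front, inference can be skipped by supplying it explicitly.
    A sketch, assuming the documents carry a single string field named "text" (the same field
    selected above):

    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    val schema = StructType(Seq(StructField("text", StringType, nullable = true)))
    val typedInput = session.read.schema(schema).json(ds).select("text")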
    

  • The final one is a niche use case where we have a bunch of events in an Avro file and would like to read them. In this case, we use an iterator style, i.e. we stream the file lazily. For more notes on this, see here.

    // In the path below, "/" refers to the root of the resources folder
    import java.io.File
    import java.net.URI
    import java.nio.file.Paths
    import org.apache.avro.file.DataFileReader
    import org.apache.avro.specific.SpecificDatumReader
    import scala.collection.JavaConverters._

    val resourceURI: URI = getClass.getResource("/path/to/events.avro").toURI
    val file: File = Paths.get(resourceURI).toFile
    val datumReader = new SpecificDatumReader[Event]()
    val dataFileReader = new DataFileReader[Event](file, datumReader)

    // DataFileReader behaves like an iterator, i.e., a stream of events.
    // E.g., if we wanted to filter a few events into a separate collection:
    def eventFilter(event: Event): Boolean = ???
    val specialSet = dataFileReader.iterator().asScala.withFilter(eventFilter).foldLeft(...)
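
    And if all the events should simply be materialised rather than folded over, a minimal
    sketch (reusing the JavaConverters import above) that drains the reader into a List and
    closes it afterwards:

    val allEvents: List[Event] =
      try dataFileReader.iterator().asScala.toList
      finally dataFileReader.close()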
    

Update (Sep 16, 2018): Added a way to read a JSON file in Spark through read.json
