Saturday, December 31, 2016

Using ssh tunnels for launching Hadoop or Spark jobs

This is a rather niche topic, so if you are here, you have probably weighed your options, talked with your colleagues, and have enough reason to do just this. Hence, without further ado, we will get right into the topic.

As mentioned in another post, I use an ssh double tunnel and a SOCKS proxy to launch Hadoop commands, and that works quite well. The commands I use to set up the double tunnel and the SOCKS proxy are:
  • ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22
  • ssh -f -ND 8020 -i "</path/to/keyfile>" samikr@localhost -p 5522
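A quick sanity check at this point is to run a trivial command over the forwarded port; if both tunnels are up, this should print the final host's name (same key file and user as in the second command above):
  • ssh -i "</path/to/keyfile>" samikr@localhost -p 5522 hostname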
Next I can run commands like:
  • hadoop fs -ls hdfs:///user/samik
Note that, in order to do this, I need the following:
  • A Hadoop distribution available locally
  • The local Hadoop version must exactly match the version running server-side
  • Appropriate configuration files (a sketch of the relevant part follows this list)
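
A quick note on the configuration files: the piece that makes the SOCKS proxy work is pointing Hadoop's RPC socket factory at it. A minimal core-site.xml sketch, assuming the SOCKS proxy from above is listening on localhost:8020 (these are the standard Hadoop property names, but verify them against your version, and keep fs.defaultFS pointed at the cluster's namenode):

        <property>
          <name>hadoop.rpc.socket.factory.class.default</name>
          <value>org.apache.hadoop.net.SocksSocketFactory</value>
        </property>
        <property>
          <name>hadoop.socks.server</name>
          <value>localhost:8020</value>
        </property>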
I initially thought I would extend this method by keeping a Spark distribution locally and launching Spark jobs on the cluster with it. I progressed quite a bit, but hit an insurmountable roadblock in the form of a bug in the Spark library. The bug report explains the issue in detail. However, if the machine you are trying to execute spark-submit from can resolve the Hadoop namenodes and the YARN master, check out this page and give it a try - it might work for you.

Instead, I had to fall back to the painful process of uploading the jar every time I changed something in the program, which is what I wanted to avoid in the first place. Still, the method is quite clean - here are the steps to achieve this.

    • Create a single (fat) jar of your Spark code
      • One caveat here: if your dependencies include one or more signed jars, you will have to either manually edit the jar to remove the signature files, or use Maven settings to exclude them. Otherwise you will get the following error when you run the code on the cluster: Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes. Follow this StackOverflow thread for more details. The recommended approach is to build the jar with the Maven Shade plugin and filter out the signature files (see the sketch after this list).
    • Push the jar to your Hadoop user folder
      • This is one way to make the jar available to the YARN executor.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file
      • You might have to delete an existing jar file first.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -rm -skipTrash hdfs:///user/samikr/<jarname>.jar"
    • Push the spark-hadoop (Spark assembly) jar to the Hadoop user folder as well. Even better, run a sample Spark application using spark-submit on the remote host (e.g., this page shows running SparkPi on HDP) and note down the path of the spark-hadoop jar being used; it is typically already available somewhere in HDFS. Again, the reason is to avoid copying this jar every time a job is launched.
      • Command: 
        ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/spark-submit --class <complete.classname.for.ExecutableClass> --master yarn-cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar hdfs:///user/samikr/<jarname>.jar"
      • In the above command, “--conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar” specifies the location of the spark-hadoop jar in HDFS. This part should not be required when the command is being run through an ssh tunnel - worth checking out.
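
On the signed-jar caveat from the first step: a minimal sketch of the Maven Shade plugin filter in pom.xml is below. The surrounding build section and the plugin version are assumed; the filter simply strips signature files from every dependency before they go into the fat jar.

        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <configuration>
            <filters>
              <filter>
                <artifact>*:*</artifact>
                <excludes>
                  <exclude>META-INF/*.SF</exclude>
                  <exclude>META-INF/*.DSA</exclude>
                  <exclude>META-INF/*.RSA</exclude>
                </excludes>
              </filter>
            </filters>
          </configuration>
        </plugin>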


    Launching Hadoop jobs (or YARN jobs, for that matter) is more straightforward. In this case, I just had to copy the fat jar to the remote node (not into HDFS) and launch the job. The commands were as follows.
    • scp -P5522 -i "<key file>" /path/to/fat.jar samikr@localhost:.
    • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>"
    More notes:
    • To avoid the compile-upload-test loop, it would be better to set up a local Hadoop/Spark node, test the code there, and then use the above procedure to run the job on the cluster. There are some interesting notes here - possibly a topic for a later post.
    • Using an ssh config file is recommended to shorten some of the long ssh commands above; a sketch is below.
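
    As an example of that last point, here is a rough ~/.ssh/config sketch for the two-hop setup above. The host aliases are made up, and the paths and user names are the placeholders used earlier; adjust to your environment.

        Host intermediate
            HostName intermediate-host
            User username
            IdentityFile /path/to/keyfile
            ForwardAgent yes
            AddressFamily inet
            LocalForward 5522 <final_host>:22

        Host cluster
            HostName localhost
            Port 5522
            User samikr
            IdentityFile /path/to/keyfile

    With this in place, the first tunnel becomes ssh -f -N intermediate, the SOCKS proxy becomes ssh -f -N -D 8020 cluster, and the remaining commands reduce to ssh cluster "<remote command>" and scp /path/to/fat.jar cluster:. respectively.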
