As mentioned in another post, I use an ssh double tunnel and a SOCKS proxy to launch hadoop commands. That works quite well. The commands I use to set up the ssh double tunnel and SOCKS host are:
- ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22
- ssh -f -ND 8020 -i "</path/to/keyfile>" samikr@localhost -p 5522
- hadoop fs -ls hdfs:///user/samik
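What makes the hadoop fs command above work over the SOCKS proxy is Hadoop's SOCKS socket factory. A minimal sketch of passing the relevant settings as per-command overrides rather than editing core-site.xml; the property names are standard Hadoop ones, the namenode host is a placeholder, and 8020 is the local SOCKS port opened by the second ssh command above:
- hadoop fs -D hadoop.rpc.socket.factory.class.default=org.apache.hadoop.net.SocksSocketFactory -D hadoop.socks.server=localhost:8020 -ls hdfs://<namenode-host>/user/samik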
However, launching Spark jobs the same way would need a few more things on the local machine:
- The Hadoop distribution needs to be available locally
- The version of the local Hadoop distribution needs to exactly match the server-side version
- Appropriate configuration files need to be in place
Instead, I had to revert to the painful process of uploading the jar every time I changed something in the program, which is what I wanted to avoid in the first place. The method is quite clean, though; here are the steps to achieve this.
- Create a single jar of your spark code
- One caveat here: if your dependencies include one or more signed jars, you will have to either manually edit the jar to remove the signature files (one way to do that is shown below) or use Maven settings to exclude them. Otherwise you will get the following error when you run the code on the cluster: Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes. Follow this StackOverflow thread for more details. The recommended approach is to build the fat jar with the Maven shade plugin and filter out the signature files there.
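If you go the manual route, stripping the signature entries from the already-built jar is usually enough. A minimal sketch, assuming the Info-ZIP zip tool is available and with the jar path as a placeholder (the Maven shade plugin achieves the same thing at build time by excluding META-INF/*.SF, META-INF/*.DSA and META-INF/*.RSA):
- zip -d /path/to/<jarname>.jar 'META-INF/*.SF' 'META-INF/*.DSA' 'META-INF/*.RSA'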
- Push the jar to your hadoop user folder
- This is one way to make the jar available to the YARN executors.
- ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file
- You might have to delete the existing jar file first:
- ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -rm -skipTrash hdfs:///user/samikr/<jarname>.jar"
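Alternatively, newer Hadoop releases let -put overwrite in place with the -f flag, which saves the separate delete step. A sketch, assuming the cluster's Hadoop version supports it:
- ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put -f - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file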
- Push the spark-hadoop (Spark assembly) jar to the hadoop user folder as well. Even better, run a sample Spark application (e.g., this page shows running SparkPi in HDP) using spark-submit on the remote host and note down the path to the spark-hadoop jar being used; it should typically already be available somewhere in HDFS. Again, the reason is to avoid copying this jar every time the job is launched.
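To check whether the assembly jar is already present in HDFS, listing the usual HDP location over the tunnel works. A sketch, assuming the HDP-style path that also appears in the spark-submit command below:
- ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -ls '/hdp/apps/*/spark/'"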
- Finally, launch the job:
- Command: ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/spark-submit --class <complete.classname.for.ExecutableClass> --master yarn-cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar hdfs:///user/samikr/<jarname>.jar"
- In the above command, "--conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar" specifies the location of the spark-hadoop jar in HDFS, so that it does not get uploaded on every submission. This part should not be required when the command is being run through the ssh tunnel; something to check out.
Launching plain hadoop (or YARN) jobs this way is a bit more straightforward. In this case, I had to just copy the fat jar to the remote node (not into HDFS) and launch the job. The commands were as follows.
- scp -P5522 -i "<key file>" /path/to/fat.jar samikr@localhost:.
- ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>"
- To avoid the compile-upload-test loop, it would be better to create a local Hadoop-Spark node, test the code there, and then use the above procedure to run the job on the cluster. There are some interesting notes here; possibly a topic for a later post.
- Using an ssh config file is recommended to shorten some of the long ssh commands above; a sketch follows.
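A minimal sketch of what that could look like; the host aliases are made up, the key path and intermediate host are the placeholders used above, and the ports match the tunnels set up at the beginning:

    # ~/.ssh/config
    # First hop: forwards local port 5522 to the final host's sshd
    Host cluster-gw
        HostName intermediate-host
        User username
        IdentityFile /path/to/keyfile
        AddressFamily inet
        ForwardAgent yes
        LocalForward 5522 <final_host>:22

    # Second hop: the node reached through the forwarded port,
    # with the SOCKS proxy opened on local port 8020
    Host cluster-edge
        HostName localhost
        Port 5522
        User samikr
        IdentityFile /path/to/keyfile
        DynamicForward 8020

With this in place, the two tunnel commands reduce to ssh -f -N cluster-gw and ssh -f -N cluster-edge, and a job launch becomes, e.g., ssh cluster-edge "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>".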