Saturday, December 31, 2016

Using ssh tunnels for launching hadoop or spark jobs

This is a rather niche topic, so if you are here, you have probably weighed your options, talked with your colleagues, and have enough reason to do just this. Hence, without further ado, we will get right into the topic.

As mentioned in another post, I use an ssh double tunnel and a SOCKS proxy to launch hadoop commands. That works quite well. The commands I use to set up the ssh double tunnel and SOCKS host are:
  • ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22
  • ssh -f -ND 8020 -i "</path/to/keyfile>" samikr@localhost -p 5522
Next I can run commands like:
  • hadoop fs -ls hdfs:///user/samik
Note that, in order to do this, I need to have the following:
  • A hadoop distribution available locally
  • The version of the local hadoop distribution needs to exactly match the server-side version
  • Appropriate configuration files pointing the local client at the cluster (a sketch follows this list)
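For reference, here is a minimal sketch of the kind of client-side configuration override this involves. The two properties below are the standard Hadoop SOCKS settings; whether localhost:8020 is the right proxy address depends on the -D port you picked in the second ssh command above.

  <!-- client-side core-site.xml overrides: route Hadoop RPC through the local SOCKS proxy -->
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:8020</value> <!-- local port of the SOCKS proxy set up above -->
  </property>
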
I initially thought that I would extend this method by keeping the spark distribution locally and launching spark jobs on the cluster the same way. I progressed quite a bit, but hit an insurmountable roadblock in the form of a bug in the spark library. The bug text explains the issue in detail. However, if your remote machine (the one you are trying to execute spark-submit from) can resolve the hadoop namenodes and the YARN master, check out this page and try it out - it might work for you.

Instead I had to revert to the painful process of uploading the jar every time I changed something in the program, which is what I wanted to avoid in the first place. But the method is quite clean - here are the steps to achieve this.

    • Create a single jar of your spark code
      • One caveat here: if your dependencies include one or more signed jars, you will have to either manually edit the jar to remove the signature files, or use maven settings to exclude them. Otherwise you will get the following error when you run the code on the cluster: Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes. Follow this StackOverflow thread for more details. The recommended approach here is to build the jar with the maven shade plugin and filter out the signature files - a sketch of this configuration follows the list below.
    • Push the jar to your hadoop user folder
      • This is one way to make the jar available to the YARN executor.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file
      • You might have to delete an existing jar file first.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -rm -skipTrash hdfs:///user/samikr/<jarname>.jar"
    • Push the spark-hadoop jar to the hadoop user folder as well. Even better, run a sample spark application (e.g., this page shows running SparkPi in HDP) using spark-submit on the remote host, and note down the path of the spark-hadoop jar that is being used - it should typically already be available somewhere in HDFS. Again, the reason is to avoid copying this jar every time the job is launched.
      • Command: 
        ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/spark-submit --class <complete.classname.for.ExecutableClass> --master yarn-cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar hdfs:///user/samikr/<jarname>.jar"
      • In the above command, "--conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar" specifies the location of the spark-hadoop jar in HDFS. This part should not be required when the command is run through an ssh tunnel - something to check out.
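    For the "single jar" step above, here is a minimal sketch of the shade plugin configuration that strips the signature files (it goes under build/plugins in your pom.xml; the version number is only an example):

      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>2.4.3</version>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>shade</goal>
            </goals>
            <configuration>
              <filters>
                <!-- drop signature files from signed dependencies to avoid the SecurityException above -->
                <filter>
                  <artifact>*:*</artifact>
                  <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                  </excludes>
                </filter>
              </filters>
            </configuration>
          </execution>
        </executions>
      </plugin>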


    Launching hadoop jobs (or yarn jobs, for that matter) is a bit more straightforward. In this case, I just had to copy the fat jar to the remote node (not into HDFS) and launch the job. The commands were as follows.
    • scp -P5522 -i "<key file>" /path/to/fat.jar samikr@localhost:.
    • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>"
    More notes:
    • In order to avoid the loop of compile-upload-test, it would be better to create a local hadoop-spark node, test out the code there, and then use the above procedure to run the job on the cluster. There are some interesting notes here - possibly a topic for a later post.
    • Using ssh config files is recommended to shorten some of the long ssh commands above - a sketch follows below.
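    For example, a minimal ~/.ssh/config entry for the tunneled endpoint used throughout this post (the alias name "cluster" is only an illustration, and the tunnel from the first command still needs to be up):

    Host cluster
           HostName localhost
           Port 5522
           User samikr
           IdentityFile </path/to/keyfile>

    With that in place, the commands shorten to things like:
    • ssh cluster "/usr/bin/hadoop fs -ls hdfs:///user/samikr"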

    Thursday, December 22, 2016

    Using ssh: multiple security algorithms and keys

    While using ssh to connect to hosts, I recently faced this interesting issue. I usually use a double ssh tunnel to connect to various internal hosts which are behind a firewall. Typically the way I set up the double tunnel is using the command:

    ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22

    You will notice that the above command sets up a tunnel which forwards the ssh port of the final host to a local port (5522), so that I can run commands. This works pretty well, and I have used this tunnel to run hadoop commands or submit spark jobs.

    I was recently trying to set up a tunnel to a new host to submit spark jobs. The tunnel setup went well, but when I tried to run a hadoop command over the tunnel, I got an error message regarding the host key type being used for the ssh handshake.

    $ ssh -i "</path/to/keyfile>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/datapipes.jar" < datapipes.jar
    Unable to negotiate with 127.0.0.1: no matching host key type found. Their offer: ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519

    I checked the config file and the contents were as follows.


    $ cat config
    Host *
           HostkeyAlgorithms ssh-dss

    Clearly I needed to add one of the accepted key types for this server, but I had some trouble specifying multiple key types on the same line. After some searching, this is what worked (note that there is no space after the comma).

    $ cat config
    Host *
           HostkeyAlgorithms ssh-dss,ssh-rsa

    Now the command seemed to be going through, but I got another error message.

    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the RSA key sent by the remote host is
    SHA256:xxxxxxxxxxxx
    Please contact your system administrator.
    Add correct host key in /home/<user>/.ssh/known_hosts to get rid of this message.
    Offending DSA key in /home/<user>/.ssh/known_hosts:5
    RSA host key for [localhost]:5522 has changed and you have requested strict checking.
    Host key verification failed.

    There was already an entry for localhost port 5522 in known_hosts, but for the ssh-dss algorithm. I was hoping that another line with the new algorithm would get added to the known_hosts file for localhost, but apparently, with strict checking, only one entry per host is allowed. I had to get rid of that line, and then the command went through.
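    For reference, rather than editing known_hosts by hand, the offending entry can also be removed with ssh-keygen - the bracketed host:port form below matches how entries for non-standard ports are stored:

    $ ssh-keygen -R "[localhost]:5522"

    This removes every key recorded for [localhost]:5522 and saves a backup of the old file as known_hosts.old, after which the host's new key can be accepted on the next connection.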