Saturday, December 31, 2016

Using ssh tunnels for launching hadoop or spark jobs

This is a rather niche topic, so if you are here, you have probably weighed your options, talked with your colleagues, and have enough reason to do just this. Hence, without further ado, we will get right into the topic.

As mentioned in another post, I use an ssh double tunnel and a SOCKS proxy to launch hadoop commands. That works quite well. The commands I use to set up the ssh double tunnel and the SOCKS host are:
  • ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22
  • ssh -f -ND 8020 -i "</path/to/keyfile>" samikr@localhost -p 5522
Next I can run commands like:
  • hadoop fs -ls hdfs:///user/samik
Note that, in order to do this, I need the following:
  • The hadoop distribution needs to be available locally
  • The version of the local hadoop distribution needs to exactly match the server-side version
  • Appropriate configuration files
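For reference, the SOCKS part typically requires the local hadoop client configuration to route its RPC traffic through the proxy. A minimal core-site.xml sketch - the two property names are standard Hadoop settings, and the port assumes the `-D 8020` proxy set up above:

```xml
<!-- Route local hadoop client connections through the SOCKS proxy
     (port 8020, as opened by the second ssh command above). -->
<configuration>
  <property>
    <name>hadoop.rpc.socket.factory.class.default</name>
    <value>org.apache.hadoop.net.SocksSocketFactory</value>
  </property>
  <property>
    <name>hadoop.socks.server</name>
    <value>localhost:8020</value>
  </property>
</configuration>
```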
I initially thought that I would extend this method by having the spark distribution locally and launching the spark jobs on the cluster using it. I progressed quite a bit, but hit an insurmountable roadblock in the form of a bug in the spark library. The bug text explains the issue in detail. However, if your remote machine (where you are trying to execute spark-submit from) can resolve the hadoop namenodes and the yarn master, check out this page and try it out - it might work for you.

Instead I had to revert to the painful process of uploading the jar every time I changed something in the program, which is what I wanted to avoid in the first place. Still, the method is quite clean - here are the steps to achieve this.
    • Create a single jar of your spark code
      • One caveat here: if your dependencies include one or more signed jars, you will have to either manually edit the jar to remove the signature files, or use Maven settings to exclude them. Otherwise you will get the following error when you run the code on the cluster: Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes. Follow this StackOverflow thread for more details. The recommended process here is to use a Maven build with the shade plugin.
    • Push the jar to your hadoop user folder
      • This is one way to make the jar available to the YARN executor.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file
      • You might have to delete an existing jar file first.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -rm -skipTrash hdfs:///user/samikr/<jarname>.jar"
    • Push the spark-hadoop jar to the hadoop user folder as well. Even better, run a sample spark application (e.g., this page shows running SparkPi in HDP) using spark-submit on the remote host, and note down the path to the spark-hadoop jar being used - it should typically already be available somewhere in HDFS. Again, the reason is to avoid copying this jar every time the job is launched.
    • Finally, launch the job:
      • Command: 
        ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/spark-submit --class <complete.classname.for.ExecutableClass> --master yarn-cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar hdfs:///user/samikr/<jarname>.jar"
      • In the above command, “--conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar” specifies the location of the spark-hadoop jar in HDFS. This part should not be required when the command is being run through an ssh tunnel - worth checking out.
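For the signed-jar caveat in the first step, the shade-plugin filter usually looks like the following pom.xml fragment - a sketch showing only the relevant part; the wildcard artifact filter with META-INF excludes is the commonly used pattern, and the rest of the plugin configuration is omitted:

```xml
<!-- maven-shade-plugin filter: strip signature files from all dependency
     jars so the fat jar does not fail with
     "Invalid signature file digest for Manifest main attributes". -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <filters>
      <filter>
        <artifact>*:*</artifact>
        <excludes>
          <exclude>META-INF/*.SF</exclude>
          <exclude>META-INF/*.DSA</exclude>
          <exclude>META-INF/*.RSA</exclude>
        </excludes>
      </filter>
    </filters>
  </configuration>
</plugin>
```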

    To launch hadoop jobs (or yarn jobs, for that matter), things are a bit more straightforward. In this case, I had to just copy the fat jar to the remote node (not into HDFS) and launch the job. The commands were as follows.
    • scp -P5522 -i "<key file>" /path/to/fat.jar samikr@localhost:.
    • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>"
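The two commands above can be wrapped in a small helper that just prints the exact scp and ssh invocations for review before running them - a sketch, with the key path, jar, and class name as placeholders:

```shell
# Print the scp + ssh commands for launching a yarn job over the tunnel.
# Review the output, then run the lines (or pipe them through `sh`).
build_yarn_cmds() {
  local key="$1" port="$2" jar="$3" main="$4"
  printf 'scp -P%s -i "%s" %s samikr@localhost:.\n' "$port" "$key" "$jar"
  printf 'ssh -i "%s" samikr@localhost -p %s "/usr/bin/yarn jar %s %s"\n' \
    "$key" "$port" "$(basename "$jar")" "$main"
}

# Example invocation with placeholder values:
build_yarn_cmds "$HOME/.ssh/keyfile" 5522 /path/to/fat.jar com.example.Main
```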
    More notes:
    • In order to avoid the compile-upload-test loop, it would be better to create a local hadoop-spark node, test the code there, and then use the above procedure to run the job on the cluster. There are some interesting notes here - possibly a topic for a later post.
    • Using an ssh config file is recommended to shorten some of the long ssh commands above.
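For example, Host entries like these (host aliases are made up for illustration; hostnames and key paths are the placeholders from the commands above) shorten things considerably:

```
Host tunnel-intermediate
    HostName intermediate-host
    User username
    IdentityFile </path/to/keyfile>
    LocalForward 5522 <final_host>:22

Host cluster
    HostName localhost
    Port 5522
    User samikr
    IdentityFile </path/to/keyfile>
```

With this, the first tunnel command reduces to ssh -4 -A -f -N tunnel-intermediate, and the remote commands to ssh cluster "...".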
    Update (Sep 5, 2018): Added a few relevant links.

    Thursday, December 22, 2016

    Using ssh: multiple security algorithms and keys

    While using ssh to connect to hosts, I recently faced this interesting issue. I usually use a double ssh tunnel to connect to various internal hosts that are behind a firewall. Typically, the way I set up the double tunnel is using the command:

    ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22

    You will notice that the above command sets up a tunnel which forwards the ssh port of the final host to a local port (5522), so that I can run commands. This works pretty well, and I have used this tunnel to run hadoop commands or submit spark jobs.

    I was recently trying to set up a tunnel to a new host to submit spark jobs. The tunnel setup went well, but when I tried to run a hadoop command over the tunnel, I got an error regarding the host key type being used for the ssh handshake.

    $ ssh -i "</path/to/keyfile>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/datapipes.jar" < datapipes.jar
    Unable to negotiate with no matching host key type found. Their offer: ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519

    I checked the config file and the contents were as follows.

    $ cat config
    Host *
           HostkeyAlgorithms ssh-dss

    Clearly I needed to add one of the accepted key types for this server, but I had some trouble specifying multiple key types on the same line. After some searching, this is what worked (note that there is no space after the comma).

    $ cat config
    Host *
           HostkeyAlgorithms ssh-dss,ssh-rsa

    Now the command seemed to be going through, but I got another error message.

    Someone could be eavesdropping on you right now (man-in-the-middle attack)!
    It is also possible that a host key has just been changed.
    The fingerprint for the RSA key sent by the remote host is
    Please contact your system administrator.
    Add correct host key in /home/<user>/.ssh/known_hosts to get rid of this message.
    Offending DSA key in /home/<user>/.ssh/known_hosts:5
    RSA host key for [localhost]:5522 has changed and you have requested strict checking.
    Host key verification failed.

    There was already an entry for localhost/5522 in known_hosts, but for the ssh-dss algorithm. I was hoping that another line with the new algorithm would get added to the known_hosts file for localhost, but apparently, with strict checking, only one entry per host is allowed. I had to get rid of that line, and then the command went through.
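Since the error message gives the exact line (known_hosts:5), one way is to delete just that line. A sketch on a stand-in file, using sed; note that ssh-keygen -R "[localhost]:5522" is the purpose-built alternative that looks the entry up by host instead of line number:

```shell
# Stand-in for ~/.ssh/known_hosts: six entries, one per line.
demo=/tmp/known_hosts.demo
seq 1 6 | sed 's/^/entry /' > "$demo"

# The error pointed at line 5; delete exactly that line,
# keeping the original file as a .bak backup.
sed -i.bak '5d' "$demo"
cat "$demo"
```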

    Sunday, January 17, 2016

    My DisplayLink troubleshooting guide

    I use a Targus DisplayLink adapter for using my Win 10 ASUS laptop with multiple monitors. When things work, they work really well. But often they don't. And the problems are weird. All of a sudden, the adapter just stops working. No matter how many times I try, nothing works, until it starts working again.

    Here are some of my notes from my troubleshooting experience.
    • The adapter needs the correct voltage to get detected when I first insert the USB 3.0 connector. Most of the time the voltage is there, and things work. But when it doesn't work, I have seen things kick in if I do a few of the steps below:
      • Let the laptop's hard drive activity subside, i.e., the HDD LED is no longer active.
      • Power cycle the DisplayLink adapter.
        • This sometimes still doesn't work because the power adapter that powers the DisplayLink adapter also powers the external monitors. 
      • Just detach the mini plug from the power box of the DisplayLink adapter, and put it back in.
        • I have recently discovered this, and this seems to work well so far.
    • Once, after I did the last one, the monitors started misbehaving. Specifically, the two external monitors constantly switched between #1 and #2. I have a Dell 23" monitor connected through HDMI, and a Dell 19" connected through DVI. The switching was rapid (e.g., once every second or so) and very weird; I hadn't seen this before. I had to detach my 2nd monitor and reattach it in order to get things working again.
    • [Update - March 4, 2016] New issue: when I inserted the USB cable, I started getting the following error: "USB Device Malfunctioning", with details like "Device Descriptor request failed". I tried a bunch of stuff, including the usual uninstall-reinstall etc., but what finally worked was power cycling the display adapter.
    • [Update - Nov 6, 2017] New issue: after my corp laptop was upgraded to the Win 10 Creators Update (Version 1607, build 14393.1770), the DisplayLink adapter stopped working again. Repeated power cycling did not help much either. Note that how the DisplayLink adapter works has also changed a bit (read more here).
      • I tried using version 8.3M1, but that didn't seem to work too well. I am now using version 8.4 Alpha, and the following points correspond to this version.
      • In 'Settings -> Devices -> Connected Devices', an 'Unknown Device' was showing up along with 'DisplayLink USB Graphics Device'. Something similar was showing up in the 'Devices and Printers' window as well.
      • What finally helped was uninstalling the unknown device from the 'Connected Devices' screen, multiple times, and then power cycling. After doing this a few times, at one point the device finally got recognized as 'DisplayLink USB Graphics Device'.
      • It still looks like a hit-or-miss situation.