Sunday, March 5, 2017

Using HTCondor for grid computing

For a project I am experimenting with, I wanted to run multiple instances of a program in parallel, each instance dealing with different input data. An example to demonstrate the use case: let’s say I have a folder full of movies in wmv format, all of which I want to convert to mp4 format using the ffmpeg program. I would like to run as many of these conversion jobs in parallel as possible, potentially on multiple computers.
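
For concreteness, a single conversion would be something like the command below (file names are illustrative); the goal is to run many such commands at once, across machines.

ffmpeg -i movie1.wmv movie1.mp4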

Given my current exposure to big data solutions like Hadoop/Spark, I initially thought that one could, for example, use the YARN scheduler and run these jobs on a Spark cluster. But my research indicated that these use cases are not really served by YARN/Spark: these are not cloud computing tasks, they are grid computing tasks.

Now, given my graduate days at the University of Wisconsin, Madison, grid computing for me means Condor (currently called HTCondor). I used Condor for my Ph.D. research, and it was awesome. So I took a look and gave it a try. Since I had never had to set up a cluster before, here are some notes on setting up a cluster, a.k.a. grid, for these jobs.

Note that this is not really a Getting Started guide: it doesn’t show, for example, sample jobs or sample submission files. For that, follow the manual.

Machines Setup

I am currently using a cluster of 3 VMs, each running CentOS 6. I am using v8.6.0 of HTCondor, and the manual is available here.

HTCondor Setup

Here are the commands that install HTCondor on each of these CentOS 6 systems, to be run as the root user. The original instructions can be found here.

cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
rpm --import RPM-GPG-KEY-HTCondor
yum install condor-all

Notes:
  • CentOS <version> is equivalent to RHEL <version>
  • You might have to clear the yum cache using ‘yum clean all’
  • Once the installation is successful, the binaries go into /usr/bin (/usr becomes the “installation folder”), and the config files go into /etc/condor
  • As part of the installation process, a user and a group named ‘condor’ are created.
Here are the corresponding commands that install HTCondor on Debian systems (I ended up setting this up on a VirtualBox VM for experimenting while on the go, and I run a Debian distro there: BunsenLabs Linux). Again, to be run as the root user. The original instructions can be found here.

echo "deb http://research.cs.wisc.edu/htcondor/debian/stable/ jessie contrib" 
                                                     >> /etc/apt/sources.list
wget -qO - http://research.cs.wisc.edu/htcondor/debian/HTCondor-Release.gpg.key 
                                                     | sudo apt-key add -
apt-get update
apt-get install condor

HTCondor Configuration

  • Configuration settings are in condor_config (and condor_config.local) in /etc/condor
  • One of the most important gotchas of configuring HTCondor is whether reverse DNS lookup works properly on your hosts, meaning an ‘nslookup <your ip address>’ returns your full host name. If this works, a lot of trouble is saved; otherwise you have to add additional configuration as required.
  • Changed the following configurations (the first two because of the above reason); a combined sketch of these settings follows this list:
    • For all boxes in the grid, including CM (Central Manager):
      • condor_config: ALLOW_WRITE = *
    • For all boxes in the grid, including CM (Central Manager), if you want to use it for computations:
      • condor_config.local: TRUST_UID_DOMAIN=true
    • For boxes other than the CM, point to the correct CM IP address, or CM host name if reverse DNS is working.
      • condor_config: CONDOR_HOST = x.x.x.x
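
Putting the above together, the relevant settings might look like the sketch below (the CM address is a placeholder, and ALLOW_WRITE = * is a shortcut you may want to tighten for your own security requirements):

# /etc/condor/condor_config (all boxes; on non-CM boxes point CONDOR_HOST at the CM)
CONDOR_HOST = x.x.x.x
ALLOW_WRITE = *

# /etc/condor/condor_config.local (boxes that will run computations)
TRUST_UID_DOMAIN = True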

Running HTCondor on a Single Node

We will start with running HTCondor on a single node, which will eventually become our central manager.
  • Start the condor daemons:
    • On CentOS 6: /sbin/service condor start
    • On Debian jessie: service condor start OR /etc/init.d/condor start
  • Check if the correct processes are running.

[root@... condor]# ps -elf | egrep condor_
5 S condor      4200       1  0  80   0 - 11653 poll_s 14:53 ?        00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
4 S root        4241    4200  0  80   0 -  6379 poll_s 14:53 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 498
4 S condor      4242    4200  0  80   0 - 11513 poll_s 14:53 ?        00:00:00 condor_shared_port -f
4 S condor      4243    4200  0  80   0 - 16380 poll_s 14:53 ?        00:00:00 condor_collector -f
4 S condor      4244    4200  0  80   0 - 11639 poll_s 14:53 ?        00:00:00 condor_negotiator -f
4 S condor      4245    4200  0  80   0 - 16646 poll_s 14:53 ?        00:00:00 condor_schedd -f
4 S condor      4246    4200  0  80   0 - 11760 poll_s 14:53 ?        00:00:00 condor_startd -f
0 S root        4297    4098  0  80   0 - 25254 pipe_w 14:54 pts/0    00:00:00 egrep condor_

    • Check if condor_status returns correctly, with an output similar to this.

      [root@... condor]# condor_status
      Name      OpSys      Arch   State     Activity LoadAv Mem   ActvtyTim

      slot1@... LINUX      X86_64 Unclaimed Idle      0.080 7975  0+00:00:0
      slot2@... LINUX      X86_64 Unclaimed Idle      0.000 7975  0+00:00:2

                          Machines Owner Claimed Unclaimed Matched Preempting  Drain

             X86_64/LINUX        2     0       0         2       0          0      0

                    Total        2     0       0         2       0          0      0

      Notes:

      • If condor_status does not return anything, it is probably because reverse DNS lookup is not working on the system. Follow the details in this thread, and start by setting the ALLOW_WRITE variable in the condor_config file to *.
      • My CentOS 6 VM is dual-core with 16GB of RAM, so HTCondor automatically breaks this up into 2 slots, each with 1 core and 8GB RAM.

      I directly experimented with a java universe job, but a vanilla universe job can be tried as well (a minimal submit file sketch appears after the outputs below). Once the daemons are running and condor_status returns correctly, switch back to the regular user login and submit the job using ‘condor_submit <job props file>’. If the job runs for any noticeable amount of time, condor_status will show it.

      [root@akka01-samik condor]# condor_status
      Name      OpSys      Arch   State     Activity LoadAv Mem   ActvtyTim

      slot1@... LINUX      X86_64 Claimed   Busy      0.000 7975  0+00:00:0
      slot2@... LINUX      X86_64 Unclaimed Idle      0.000 7975  0+00:00:2

                          Machines Owner Claimed Unclaimed Matched Preempting  Drain

             X86_64/LINUX        2     0       1         1       0          0      0

                    Total        2     0       1         1       0          0      0

      The command condor_q would show it too.
      [root@... condor]# condor_q


      -- Schedd: ... : <x.x.x.x:42646> @ 02/24/17 14:23:43
      OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
      samik.r CMD: java    2/24 07:38      _      1      _      1 2.0

      1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
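
      For reference, here is a minimal sketch of the kind of java universe submit file I used; the class, jar, and file names are illustrative rather than the actual ones from my project.

      universe   = java
      executable = MyJob.class
      jar_files  = myjob.jar
      # the first argument must be the name of the class containing main()
      arguments  = MyJob input_file
      should_transfer_files   = yes
      when_to_transfer_output = on_exit
      output     = myjob.out
      error      = myjob.err
      log        = myjob.log
      queue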

      Troubleshooting/Notes:
      • Because of the reverse DNS problem, when I initially submitted the job it was put on hold, and condor_status and condor_q were showing as much. Here are the steps I followed to debug:
        • I followed the instructions available here, specifically using the ‘condor_q -analyze <job id>’ command.
        • Looked at the log file StarterLog.slot1 (under /var/log/condor)
      • Another time, I got a ClassNotFoundException, because the jar file was not available at the proper place. This error showed up in the job’s output folder, in the error file (specified in the condor submission file).

      Running HTCondor on Multiple Nodes

      Once things work on a single node, it is easy to add more nodes to the cluster: just install the HTCondor package, change the configuration as described above, and start the service.

      Notes for Running Jobs

      Some points that were not very obvious to me even after reading the instruction manual:
      • In vanilla universe jobs, if the executable reads from standard input, use the input command, else use the arguments command. Consider the following snippet:
      executable = testprog
      input = input_file

      For this snippet, condor tries to execute something like: “testprog < input_file”. If this is how testprog is supposed to run, great. Otherwise, this will not work. On the other hand, if what you really want is “testprog input_file”, then the following snippet should work.

      executable = testprog
      should_transfer_files = yes
      # Define input file
      my_input_file     = input_file
      transfer_input_files = $(my_input_file)
      arguments      = "<other args if required> $(my_input_file)"

      Note that this possibly works a little bit differently in the java universe, where the input file name shows up as an argument to the executable (java).
      • For executables that are available at the exact same path on all the nodes in the cluster, it is recommended to specify the full path and to switch off transferring the executable (a fuller submit file sketch follows the snippet below).

      executable     = /path/to/executable
      transfer_executable = false
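
      Tying this back to the ffmpeg use case, a complete vanilla universe submit file might look roughly like the sketch below. This assumes ffmpeg is installed at /usr/bin/ffmpeg on every node; the movie file name is illustrative.

      universe    = vanilla
      executable  = /usr/bin/ffmpeg
      transfer_executable = false
      should_transfer_files   = yes
      when_to_transfer_output = on_exit
      # custom macro for the input movie
      my_input_file = movie1.wmv
      transfer_input_files = $(my_input_file)
      arguments   = "-i $(my_input_file) movie1.mp4"
      output      = convert.out
      error       = convert.err
      log         = convert.log
      queue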


      Update (March 9, 2017): Added Debian Jessie related commands, and formatting changes

      Saturday, December 31, 2016

      Using ssh tunnels for launching hadoop or spark jobs

      This is a rather niche topic, so if you are here, you probably have weighed your options, talked with your colleagues, and have enough reason to do just this. Hence, without further ado, we will get right into the topic.

      As mentioned in another post, I use an ssh double tunnel and a SOCKS proxy to launch hadoop commands. That works quite well. The commands I use to set up the ssh double tunnel and SOCKS host are:
      • ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22
      • ssh -f -ND 8020 -i "</path/to/keyfile>" samikr@localhost -p 5522
      Next I can run commands like:
      • hadoop fs -ls hdfs:///user/samik
      Note that, in order to do this, I need to have the following:
      • Hadoop distribution needs to be available locally
      • The version of the local hadoop distribution needs to exactly match the server-side version
      • Appropriate configuration files
      I initially thought that I would extend this method by keeping the spark distribution locally and launching the spark jobs on the cluster through the tunnel. I progressed quite a bit, but hit an insurmountable roadblock in the form of a bug in the spark library; the bug text explains the issue in detail. However, if your remote machine (where you are trying to execute spark-submit from) can resolve the hadoop namenodes and the yarn master, check out this page and try it out - it might work for you.

      Instead I had to revert to the painful process of uploading the jar every time I changed something in the program, which is what I wanted to avoid in the first place. Still, the method is quite clean - here are the steps to achieve this.

        • Create a single jar of your spark code
          • One caveat here: if your dependencies include one or more signed jars, you will have to either manually edit the jar to remove the signature files, or use Maven settings to exclude them. Otherwise you will get the following error when you run the code on the cluster: Exception in thread "main" java.lang.SecurityException: Invalid signature file digest for Manifest main attributes. Follow this StackOverflow thread for more details. The recommended approach is to build the jar with the Maven shade plugin (see the plugin sketch after this list).
        • Push the jar to your hadoop user folder
          • This is one way to make the jar available to the YARN executor.
            • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:///user/samikr/<jarname>.jar" < /path/to/jar/file
          • You might have to delete an existing jar file first.
            • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -rm -skipTrash hdfs:///user/samikr/<jarname>.jar"
        • Push the spark-hadoop jar to the hadoop user folder as well. Even better, run a sample spark application (e.g., this page shows running SparkPi in HDP) using spark-submit on the remote host, and note down the path to the spark-hadoop jar that is being used. It should typically already be available somewhere in HDFS. Again, the reason is to avoid copying this jar every time the job is launched.
          • Command: 
            ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/spark-submit --class <complete.classname.for.ExecutableClass> --master yarn-cluster --driver-memory 4g --executor-memory 2g --executor-cores 1 --conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar hdfs:///user/samikr/<jarname>.jar"
          • In the above command, “--conf spark.yarn.jar=hdfs://production1/hdp/apps/x.x.x/spark/spark-hdp-assembly.jar” specifies the location of the spark-hadoop jar in HDFS. This part may not be required when the command is being run through an ssh tunnel - worth checking.
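
        For the signed-jar caveat mentioned above, the usual Maven recipe is a shade plugin filter along the lines of the sketch below (plugin version and the rest of the pom omitted):

        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <executions>
            <execution>
              <phase>package</phase>
              <goals><goal>shade</goal></goals>
              <configuration>
                <filters>
                  <!-- strip signature files from signed dependency jars -->
                  <filter>
                    <artifact>*:*</artifact>
                    <excludes>
                      <exclude>META-INF/*.SF</exclude>
                      <exclude>META-INF/*.DSA</exclude>
                      <exclude>META-INF/*.RSA</exclude>
                    </excludes>
                  </filter>
                </filters>
              </configuration>
            </execution>
          </executions>
        </plugin>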


        Launching hadoop jobs (or yarn jobs, for that matter) is more straightforward. In this case, I just had to copy the fat jar to the remote node (not into HDFS) and launch the job. The commands were as follows.
        • scp -P5522 -i "<key file>" /path/to/fat.jar samikr@localhost:.
        • ssh -i "<key file>" samikr@localhost -p 5522 "/usr/bin/yarn jar fat.jar <complete.classname.for.ExecutableClass>"
        More notes:
        • In order to avoid the loop of compile-upload-test, it would be better to create a local hadoop-spark node, test out the code there, and then use the above procedure to run the job on the cluster. There are some interesting notes here - possibly a topic for a later post.
        • Using an ssh config file is recommended to shorten some of the long ssh commands above; a sketch follows.
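
        For example, host aliases along these lines would shorten both the tunnel setup and the job commands (the alias names are hypothetical; adjust hosts, user names, and key paths to your setup):

        # ~/.ssh/config
        Host tunnel
            HostName intermediate-host
            User username
            IdentityFile /path/to/keyfile
            ForwardAgent yes
            LocalForward 5522 final_host:22

        Host cluster
            HostName localhost
            Port 5522
            User samikr
            IdentityFile /path/to/keyfile

        # set up the tunnel, then run commands through it
        ssh -f -N tunnel
        ssh cluster "/usr/bin/hadoop fs -ls hdfs:///user/samikr"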

        Thursday, December 22, 2016

        Using ssh: multiple security algorithms and keys

        While using ssh to connect to hosts, I recently faced this interesting issue. I usually use a double ssh tunnel to connect to various internal hosts that are behind a firewall. Typically I set up the double tunnel using the command:

        ssh -4 -A -f -N -i "</path/to/keyfile>" username@intermediate-host -L5522:<final_host>:22

        You will notice that the above command sets up a tunnel which forwards the ssh port of the final host to a local port (5522), so that I can run commands. This works pretty well, and I have used this tunnel to run hadoop commands or submit spark jobs.

        I was recently trying to set up a tunnel to a new host to submit spark jobs. The tunnel setup went well, but when I tried to run a hadoop command over the tunnel, I got an error about the host key type being used for the ssh handshake.

        $ ssh -i "</path/to/keyfile>" samikr@localhost -p 5522 "/usr/bin/hadoop fs -put - hdfs:/
        //user/samikr/datapipes.jar" < datapipes.jar
        Unable to negotiate with 127.0.0.1: no matching host key type found. Their offer: ssh-rsa,ecdsa-sha2-nistp256,ssh-ed25519

        I checked my ssh config file (~/.ssh/config) and the contents were as follows.


        $ cat config
        Host *
               HostkeyAlgorithms ssh-dss

        Clearly I needed to add one of the accepted key types for this server, but I faced some trouble specifying multiple key types on the same line. After some searching, this is what worked (note: no space after the comma).

        $ cat config
        Host *
               HostkeyAlgorithms ssh-dss,ssh-rsa

        Now the command seemed to be going through, but I got another error message.

        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        @    WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!     @
        @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
        IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
        Someone could be eavesdropping on you right now (man-in-the-middle attack)!
        It is also possible that a host key has just been changed.
        The fingerprint for the RSA key sent by the remote host is
        SHA256:xxxxxxxxxxxx
        Please contact your system administrator.
        Add correct host key in /home/<user>/.ssh/known_hosts to get rid of this message.
        Offending DSA key in /home/<user>/.ssh/known_hosts:5
        RSA host key for [localhost]:5522 has changed and you have requested strict checking.
        Host key verification failed.

        There was already an entry for localhost/5522 in known_hosts, but for the ssh-dss algorithm. I was hoping that another line with the new algorithm would get added to the known_hosts file for localhost, but apparently, with strict checking, only one entry per host is allowed. I had to get rid of that line, and then the command went through.
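
        If you prefer not to edit known_hosts by hand, ssh-keygen can remove the stale entry; for a forwarded port like this one, the host is given in the bracketed form that appears in the file:

        ssh-keygen -R "[localhost]:5522"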

        Sunday, January 17, 2016

        My DisplayLink troubleshooting guide

        I use a Targus DisplayLink adapter to connect my Win 10 ASUS laptop to multiple monitors. When things work, they work really well. But often they don't, and the problems are weird. All of a sudden, the adapter just stops working. No matter how many times I try, nothing works, until it suddenly starts working again.

        Here are some of my notes from my troubleshooting experience.
        • The adapter needs the correct voltage to get detected the first time I insert the USB 3.0 cable. Most of the time the voltage is there, and things work. But when it doesn't work, I have seen things kick in if I do a few of the following:
          • Let the laptop hard drive activity subside, i.e., wait until the HDD LED is no longer active.
          • Power cycle the DisplayLink adapter.
            • This sometimes still doesn't work because the power adapter that powers the DisplayLink adapter also powers the external monitors. 
          • Just detach the mini plug from the power box of the DisplayLink adapter, and put it back in.
            • I have recently discovered this, and this seems to work well so far.
        • Once, after I did the last one, the monitors started misbehaving. Specifically, the two external monitors constantly switched between #1 and #2. I have a Dell 23" monitor connected through HDMI, and a Dell 19" connected through DVI. The switching was rapid (e.g., once every second or so) and very weird; I hadn't seen this before. I had to detach my 2nd monitor and reattach it in order to get things working again.
        • [Update - March 4, 2016] New issue: when I inserted the USB cable, I started getting the following error: "USB Device Malfunctioning", and the details said something like "Device Descriptor request failed". I tried a bunch of stuff, including the usual uninstall-reinstall etc., but what finally worked was power cycling the display adapter.

        Thursday, October 29, 2015

        Black Screen of Death on Windows 10

        I have liked Win 10 so far, but was recently hit by the dreaded black screen of death (looks like this is the new blue screen of death), where I was stuck with a black screen and a mouse pointer. There are numerous posts on the Internet about it, but it was actually not that easy to get anywhere with most of them.

        Let me state the issue clearly: I have a few-months-old ASUS Q302L, which was working great. This is how it started:
        • I had the brilliant idea of connecting the ASUS Q302L laptop to Chromecast, so that I can project videos (mostly Youtube) on TV.
        • Installed Chromecast for PC, but realized that it doesn't do much.
        • Installed Chromecast plugin for Chrome browser, and that seemed to do the trick.
        • Uninstalled Chromecast for PC. [I think, this is the root cause for all my woes! Ok, Google!!]
        After a day or so, I noticed that Youtube videos were not working on the notebook any more. The Flash-based player didn't work, and neither did the HTML5-based player. In fact, to my horror, I found out that videos didn't play at all on the notebook, even in standalone software like WMP. Bummer!!
        • Decided to restore the PC to a recent restore point.
          • The most recent restore point errored out, saying something like "Can't change registry settings" etc., and booted back to Win 10 with no video.
          • Used an earlier restore point.
        • That seemed to work, until I was stuck at the black screen with mouse pointer.
        Then started the long process of google searches and tryouts. Here are the steps in the ordeal:
        • Realized soon that I need to get to the 'System Repair' option by using Win 10 installation media.
          • Used another Win 10 PC to create USB Win 10 installation media.
        • Inserted the USB media, but it was not used for boot up. Realized that I have to change the boot order.
        REMINDER #1: When I get a new laptop, always change the boot order to have the USB drive before the hard drive. (More caveats to this later)
        • Realized that unlike a lot of other systems, the nice ASUS logo screen shown when powering up the Q302L doesn't give any hint of how to get into the BIOS menu. More searching followed, including taking a peek at the manual for the ASUS Q302L. It seemed like either F2 or F8 might help. Tried both options, but no dice.
          • Eventually F2 worked; it seemed to be just a matter of timing. Now I actually appreciate having a separate button to boot into the BIOS menu, like Lenovo models have.
        • I tried a bunch of stuff: 
          • Other restore points: they errored out.
          • I had no System Image set up.
        REMINDER #2: A System Image would probably have been an easier restore option when the restore points failed. Creating one is time consuming, however, so it can perhaps be done once a month.
          • I also tried out booting into safe mode, safe mode with networking, with low-res graphics etc., however none of these worked. Every time I got stuck with the same black screen - it was really stubborn.
          • To my chagrin, I found out that every time I tried booting without the Win 10 USB drive to check if things worked, the boot option went back to 'Hard Drive', so I had to start at F2 again.
          • Finally decided to reinstall Win 10.
        • Reinstalled Win 10. Things went reasonably smooth, and I got back the display.
        • Got stuck at activating Win 10 though - the process errored out with an error code: 0xC004C003
          • As per this page, this means: The MAK is blocked on the activation server. 
          • My Win 10 copy was originally an upgrade from Win 8.1 via Windows Update, so there shouldn't have been any problem with a fresh installation at a later point, specifically since there was no hardware change.
          • Contacted Customer Support over chat, and finally got things resolved.
            • It looked like I had not set up my country. This can be set by bringing up the dialog via the 'slui 4' command from Run. It has to be the same country from which you activated Win 10 the first time.
            • That may not have been enough. After that, the CSR needed my Win 8.1 product key. This can be obtained by running the following command in an Admin command prompt: wmic path SoftwareLicensingService get OA3xOriginalProductKey
            • Then the CSR did something in the background, got the Win 10 product key, and used that to activate.
            • My suggestion is to get this done using CSR help.
        A long ordeal. I still have to reinstall all my apps, but I can use the laptop - which is as much as I can ask for :-)

        Some links that were good but not helpful to me (maybe they will be helpful to you):
        - Youtube link
        - MS link 

          Thursday, September 3, 2015

          Setting up Orientdb - a graph database

          I recently found myself dabbling in graph databases for some interesting analytics work. One piece of software we wanted to test out is a document-graph database called Orientdb. The software is interesting because it covers functionality of both document databases (e.g., MongoDB) and pure graph databases (e.g., Neo4j). While trying to set up the server, however, I faced a few gaps in the documentation, hence this post. I am using Orientdb Community Edition v2.1.1.

          Using Custom Port

          By default, the Orientdb binary server runs on port 2424 and the web server runs on port 2480. To use different ports with Orientdb, one has to change the settings in config/orientdb-server-config.xml. There you will find port ranges defined for each server listener (i.e., the binary and web servers). Honestly, I have not seen port ranges for other servers before; presumably the idea is that if a particular port is occupied, the software will use the next available port in the range - kind of curious. Anyway, you will have to change the starting port number of each range to your desired port.
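
          From memory, the relevant listener entries in config/orientdb-server-config.xml look roughly like the excerpt below (shown here with the custom binary port 3306 used in the console examples that follow; verify the exact element and attribute names against your own file):

          <listeners>
            <listener protocol="binary" ip-address="0.0.0.0" port-range="3306-3310" socket="default"/>
            <listener protocol="http"   ip-address="0.0.0.0" port-range="3480-3490" socket="default"/>
          </listeners>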

          Connecting from Orientdb console

          The console examples provided in the documentation do not cover running the servers on custom ports. Here is how you can connect to the server, or to a database.

          orientdb> connect remote:localhost:3306 root root

          Connecting to remote Server instance [remote:localhost:3306] with user 'root'...OK
          orientdb {server=remote:localhost:3306/}>

          orientdb {server=remote:localhost:3306/}> list databases

          Found 1 databases:

          * GratefulDeadConcerts (plocal)
          You can also connect directly to a database, specifying the port, as follows:
          orientdb> connect remote:localhost:3306/GratefulDeadConcerts root root

          Connecting to database [remote:localhost:3306/GratefulDeadConcerts] with user 'root'...OK
          orientdb {db=GratefulDeadConcerts}> classes


          CLASSES
          ----------------------------------------------+------------------------------------+------------+----------------+
           NAME                                         | SUPERCLASS                         | CLUSTERS   | RECORDS        |
          ----------------------------------------------+------------------------------------+------------+----------------+
           E                                            |                                    | 10         |              0 |
           followed_by                                  | [E]                                | 11         |           7047 |

          Will update this post with more notes.