Sunday, March 5, 2017

Using HTCondor for grid computing

For a project I am experimenting with, I wanted to run multiple instances of a program in parallel, each instance dealing with different input data. An example to demonstrate the use case: let’s say I have a folder full of movies in wmv format, all of which I want to convert to mp4 format using the ffmpeg program. I would like to run as many of these conversion jobs in parallel as possible, preferably on multiple computers.
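
For a single file, the conversion itself is just one command per movie; here is a minimal sketch (the exact ffmpeg options would depend on the desired codecs and quality):

ffmpeg -i input.wmv output.mp4

The goal is to fan this command out over all the files in the folder, across multiple machines.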

Given my current exposure to big data solutions like Hadoop/Spark, I initially thought that we could, for example, use the YARN scheduler and run these jobs on a Spark cluster. But my research indicated that these use cases are not really served by YARN/Spark. These are not cloud computing tasks; they are grid computing tasks.

Now, given my graduate days at the University of Wisconsin-Madison, grid computing for me means Condor (now called HTCondor). I used Condor for my Ph.D. research, and it was awesome. So I took a look and gave it a try. Since I had never had to set up a cluster before, here are some notes on setting up a cluster (aka grid) for these jobs.

Note that this is not really a Getting Started guide; it does not show, for example, sample jobs or sample submission files. For that purpose, follow the manual.

Machines Setup

Currently using a cluster of 3 VMs, each running CentOS 6, with v8.6.0 of HTCondor. The manual is available here.

HTCondor Setup

Here are the commands that install HTCondor on each of these CentOS 6 systems, to be run as the root user. The original instructions can be found here.

cd /etc/yum.repos.d/
wget https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-stable-rhel6.repo
wget http://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
rpm --import RPM-GPG-KEY-HTCondor
yum install condor-all

Notes:
  • CentOS <version> is equivalent to RHEL <version>.
  • You might have to clear the yum cache using ‘yum clean all’.
  • Once the installation is successful, the binaries go into /usr/bin (/usr becomes the “installation folder”), and the config files go into /etc/condor.
  • As part of the installation process, a user and group named ‘condor’ are created.
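
A quick way to sanity-check the installation (not part of the original instructions, just a check I find handy) is to print the version and the configuration files in use:

condor_version
condor_config_val -config
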
Here are the corresponding commands that install HTCondor on Debian systems (I ended up setting this up on my VirtualBox VM for experimenting while on the go, where I run a Debian distro: BunsenLabs Linux). Again, to be run as the root user. The original instructions can be found here.

echo "deb http://research.cs.wisc.edu/htcondor/debian/stable/ jessie contrib" 
                                                     >> /etc/apt/sources.list
wget -qO - http://research.cs.wisc.edu/htcondor/debian/HTCondor-Release.gpg.key 
                                                     | sudo apt-key add -
apt-get update
apt-get install condor

HTCondor Configuration

  • Configurations live in condor_config (and condor_config.local) in /etc/condor.
  • One of the most important gotchas of configuring HTCondor is whether reverse DNS lookup works properly on your hosts, meaning an ‘nslookup <your ip address>’ returns your full host name. If this works, it will save you a lot of trouble. Otherwise you have to add additional configs as required.
  • Changed the following configurations (the first two because of the above reason); a sample fragment is sketched after this list:
    • For all boxes in the grid, including the CM (Central Manager):
      • condor_config: ALLOW_WRITE = *
    • For all boxes in the grid, including the CM, if you want to use them for computations:
      • condor_config.local: TRUST_UID_DOMAIN = true
    • For boxes other than the CM, point to the correct CM IP address, or the CM host name if reverse DNS is working:
      • condor_config: CONDOR_HOST = x.x.x.x
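
To make this concrete, here is a sketch of what the local configuration on a compute box that is not the CM could end up looking like. The IP address is a placeholder, and I am collecting everything in condor_config.local here (which is read after condor_config), whereas the list above splits the settings between the two files:

# /etc/condor/condor_config.local on a non-CM compute box
# Placeholder address: use your CM's IP, or its host name if reverse DNS works
CONDOR_HOST = 192.168.1.10
# Open up write access while getting the grid going (see the reverse DNS note above)
ALLOW_WRITE = *
# Needed on boxes that will run computations
TRUST_UID_DOMAIN = true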

Running HTCondor on a Single Node

We will start with running HTCondor on a single node, which will eventually become our central manager.
  • Start the condor daemons:
    • On CentOS 6: /sbin/service condor start
    • On Debian Jessie: service condor start OR /etc/init.d/condor start
  • Check if the correct processes are running.

[root@... condor]# ps -elf | egrep condor_
5 S condor      4200       1  0  80   0 - 11653 poll_s 14:53 ?        00:00:00 condor_master -pidfile /var/run/condor/condor_master.pid
4 S root        4241    4200  0  80   0 -  6379 poll_s 14:53 ?        00:00:00 condor_procd -A /var/run/condor/procd_pipe -L /var/log/condor/ProcLog -R 1000000 -S 60 -C 498
4 S condor      4242    4200  0  80   0 - 11513 poll_s 14:53 ?        00:00:00 condor_shared_port -f
4 S condor      4243    4200  0  80   0 - 16380 poll_s 14:53 ?        00:00:00 condor_collector -f
4 S condor      4244    4200  0  80   0 - 11639 poll_s 14:53 ?        00:00:00 condor_negotiator -f
4 S condor      4245    4200  0  80   0 - 16646 poll_s 14:53 ?        00:00:00 condor_schedd -f
4 S condor      4246    4200  0  80   0 - 11760 poll_s 14:53 ?        00:00:00 condor_startd -f
0 S root        4297    4098  0  80   0 - 25254 pipe_w 14:54 pts/0    00:00:00 egrep condor_

    • Check if condor_status returns correctly, with an output similar to this.

      [root@... condor]# condor_status
      Name      OpSys      Arch   State     Activity LoadAv Mem   ActvtyTim

      slot1@... LINUX      X86_64 Unclaimed Idle      0.080 7975  0+00:00:0
      slot2@... LINUX      X86_64 Unclaimed Idle      0.000 7975  0+00:00:2

                          Machines Owner Claimed Unclaimed Matched Preempting  Drain

             X86_64/LINUX        2     0       0         2       0          0      0

                    Total        2     0       0         2       0          0      0

      Notes:

      • If condor_status does not return anything, it is probably because reverse DNS lookup is not working on the system. Follow the details in this thread, and start by setting the ALLOW_WRITE variable in the condor_config file to *.
      • My CentOS 6 VM is dual core with 16GB of RAM, so HTCondor automatically breaks this up into 2 slots, each having 1 core/8GB RAM. (The slot layout can be overridden; see the sketch below.)
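
      If you would rather present the machine as one big slot (or partition it differently), the slot layout can be overridden in the config. This is a sketch based on HTCondor’s static slot type settings, not something I needed for this setup:

      # In condor_config.local: a single static slot with all of the machine's resources
      SLOT_TYPE_1 = 100%
      NUM_SLOTS_TYPE_1 = 1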

I directly experimented with a java universe job, but a vanilla universe job can be tried as well. Once the daemons are running and condor_status returns correctly, get back to the regular user login and submit the job using ‘condor_submit <job props file>’ (a sketch of such a submit file follows the command outputs below). If the job takes a bit of time to run, condor_status will show it.

      [root@akka01-samik condor]# condor_status
      Name      OpSys      Arch   State     Activity LoadAv Mem   ActvtyTim

      slot1@... LINUX      X86_64 Claimed   Busy      0.000 7975  0+00:00:0
      slot2@... LINUX      X86_64 Unclaimed Idle      0.000 7975  0+00:00:2

                          Machines Owner Claimed Unclaimed Matched Preempting  Drain

             X86_64/LINUX        2     0       1         1       0          0      0

                    Total        2     0       1         1       0          0      0

The command condor_q would show it too.
      [root@... condor]# condor_q


      -- Schedd: ... : <x.x.x.x:42646> @ 02/24/17 14:23:43
      OWNER   BATCH_NAME    SUBMITTED   DONE   RUN    IDLE  TOTAL JOB_IDS
      samik.r CMD: java    2/24 07:38      _      1      _      1 2.0

      1 jobs; 0 completed, 0 removed, 0 idle, 1 running, 0 held, 0 suspended
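
For reference, a java universe submit file could look roughly like the sketch below; the class, jar, and file names are made up for illustration, and my actual submit file differed:

universe              = java
executable            = MyJob.class
# In the java universe the first argument is the main class; the input file
# then shows up as a plain argument to the program
arguments             = MyJob input_file
should_transfer_files = yes
transfer_input_files  = input_file
# If the program needs additional jars, list them with jar_files
output                = myjob.out
error                 = myjob.err
log                   = myjob.log
queue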

Troubleshooting/Notes:
  • Because of the reverse DNS problem, when I initially submitted the job it was put on hold, and condor_status and condor_q were showing as much. Here are the steps I followed for debugging (collected into a snippet after this list):
    • I followed the instructions available here, specifically using the ‘condor_q -analyze <job id>’ command.
    • Looked at the log file StarterLog.slot1.
  • Another time, I got a ClassNotFoundException, since the jar file was not available at the proper place. This error showed up in the output folder of the job, in the error file (mentioned in the condor submission file).
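
Collected in one place, the commands I used while debugging (the job id is just an example, the log location follows from the defaults visible in the ps output earlier, and condor_q -hold is an extra option I did not need but which also prints hold reasons):

condor_q -analyze 2.0                    # detailed match/hold analysis for one job
condor_q -hold                           # list held jobs with their hold reasons
tail /var/log/condor/StarterLog.slot1    # starter log for slot 1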

Running HTCondor on Multiple Nodes

Once things work on a single node, it is easy to add more nodes to the cluster. Just install the HTCondor package, change the configurations as described, and start the service.

Notes for Running Jobs

Some points that were not very obvious to me even after reading the instruction manual:
  • In vanilla universe jobs, if the executable reads from standard input, use the input command; otherwise use the arguments command. Consider the following snippet:

executable = testprog
input = input_file

For this snippet, condor tries to execute something like “testprog < input_file”. If this is how testprog is supposed to run, great. Otherwise, this will not work. On the other hand, if what you really want is “testprog input_file”, then the following snippet should work.

executable            = testprog
should_transfer_files = yes
# Define input file
my_input_file         = input_file
transfer_input_files  = $(my_input_file)
arguments             = "<other args if required> $(my_input_file)"

Note that this possibly works a little differently in the java universe, where the input file name shows up as an argument to the executable (java).
  • For executables that are supposed to be available at the exact same path on all the nodes in the cluster, it is recommended to specify the full path and switch off transferring the executable:

executable          = /path/to/executable
transfer_executable = false
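
Coming back to the movie conversion example from the beginning of the post, a vanilla universe submit file that fans one ffmpeg job out per input movie could be sketched roughly as below. This is an assumption-laden sketch, not something from the manual: it assumes ffmpeg is installed at the same path on every node and that the listed wmv files sit in the submit directory.

universe              = vanilla
# ffmpeg is expected at the same path on every node, so do not transfer it
executable            = /usr/bin/ffmpeg
transfer_executable   = false
should_transfer_files = yes
transfer_input_files  = $(movie).wmv
arguments             = -i $(movie).wmv $(movie).mp4
output                = $(movie).out
error                 = $(movie).err
log                   = convert.log
# One job per movie; the converted mp4 is transferred back on exit
queue movie in (movie1 movie2 movie3)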


Update (March 9, 2017): Added Debian Jessie related commands, and formatting changes
