TORQUE notes

TORQUE is a resource manager that gives processes an organized way to access a system's resources. Together with Maui, a scheduler for its queues, it is one of the main tools installed on an HPC system. This page covers the basics of TORQUE installation and configuration.

There are proprietary alternatives such as PBS Professional, which combines a resource manager and a scheduler, and Moab from Adaptive Computing, another HPC suite. Those are not covered here.

Overview

You can get an overview of TORQUE in the TORQUE architecture page of the TORQUE Administration Guide, but in summary it is as follows:

For a successful TORQUE installation, four daemons should be running:

trqauthd
The daemon that clients connect to when using TORQUE commands (qsub, qdel, qstat, etc.). It then authorizes connections to pbs_server. It opens a UNIX domain socket at /tmp/trqauthd-unix. More information is in the TORQUE Administration Guide.
pbs_server
The daemon that gets in touch with pbs_mom (on the nodes) to run new jobs. Listens on port 15001 by default.
pbs_mom
Responsible for running jobs on the nodes. Communicates with pbs_server. Listens on ports 15002 and 15003 by default.
pbs_sched
Basic scheduler called by pbs_server periodically (see the scheduler_iteration setting in the pbs_server_attributes(7) man page). pbs_sched is a very simple scheduler and is usually replaced by Maui. In the default configuration the scheduler listens on port 15004.
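
A quick way to check that the daemons are running and listening on the expected ports is something like the following (a rough sketch; process names and netstat options may vary between systems):

# ps -e | grep -E 'trqauthd|pbs_server|pbs_mom|pbs_sched'
# netstat -lntp | grep ':1500'
# ls -l /tmp/trqauthd-unix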

The following diagram is a summary of the communication between daemons and user programs:

master node
........................................
:                                      :
:   +--------------------+             :
:   |    user commands   |             :
:   | (qsub, qdel, etc.) |             :
:   +--------------------+             :
:             ^                        :
:             |                        :
:             v                        :
:        +----------+                  :
:        | trqauthd |                  :
:        +----------+                  :
:             ^                        :
:             |                        :
:             v                        :
:       +------------+   +-----------+ :
:       | pbs_server |<->| pbs_sched | :
:       +------------+   +-----------+ :
:             ^                        :
:             |                        :
:......................................:
              |
              |
slave nodes   |
..............................
:             |              :
:             v              :
:       +------------+       :
:       |   pbs_mom  |       :
:       +------------+       :
:                            :
:............................:

Here, "master node" is the computer that has TORQUE administraton tools installed and that makes all the work of enqueing jobs, running the scheduler, allocation of resources, etc. Normally it is not used for processing jobs. "slave nodes" are the nodes used to run processing jobs. Each "slave node" runs a instance of pbs_mom. Both kind of nodes can be ran in one computer (so, therefore running all daemons in just one computer). See Installation in a supercomputer for details.

Installation in a supercomputer

At the time of this writing, TORQUE 4.2.6 was the newest version, so let's work with that. I'm installing it not on a traditional Beowulf cluster, but on a supercomputer with 136 processors (we will use just 134) and more than 260 GB of RAM.

After downloading and unpacking the tarball, let's configure and build it. As the root user, do:

# export TORQUE_HOME=/opt/torque-4.2.6
# ./configure --prefix=$TORQUE_HOME
# make
# make install

We should not forget init scripts. Since our system is SUSE Linux Enterprise Server the commands are:

# cp contrib/init.d/suse.trqauthd /etc/init.d/trqauthd
# chkconfig --add trqauthd
# service trqauthd start
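
The contrib/init.d directory also ships scripts for the other daemons; assuming the file names follow the same suse.* pattern as the trqauthd one, they can be installed the same way:

# cp contrib/init.d/suse.pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/suse.pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_server
# chkconfig --add pbs_mom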

I prefer not to install it among other things in /usr, so I used the --prefix option above. We will refer to that directory as $TORQUE_HOME.

After that, it is important to tell our system where we just installed TORQUE, if it is not a standard location:

# export PATH=$PATH:$TORQUE_HOME/bin:$TORQUE_HOME/sbin

It is a good idea to add it to your rc scripts.
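
For example, a small profile script makes the setting persistent for every login shell (this assumes the system reads /etc/profile.d, as SUSE does; the file name is arbitrary):

# cat > /etc/profile.d/torque.sh <<'EOF'
export TORQUE_HOME=/opt/torque-4.2.6
export PATH=$PATH:$TORQUE_HOME/bin:$TORQUE_HOME/sbin
EOF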

Then, let's create the server database with:

# pbs_server -t create

Note

In the Installing TORQUE section of the TORQUE Administration Guide, we are told to use the ./torque.setup root script, which already creates a basic qmgr setup for us, but we ran into a problem with it (see Error qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host in the Troubleshooting section below).

After that, we should start pbs_server and call the qmgr program to set up the queues:

# qterm
# pbs_server
# qmgr

In the qmgr console, let's configure a basic queue called batch with the following commands:

create queue batch

set queue batch queue_type = Execution
set queue batch resources_max.mem = 100gb
set queue batch resources_max.procs = 100
set queue batch resources_max.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True

set server scheduling = True
set server managers = root@hostname
set server default_queue = batch
set server log_events = 511
set server mail_from = root
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6

We already have a queue set up. Now we need to tell TORQUE to use the computer we installed it on. Configuration and daemons for ordinary nodes differ from the master node's, but since we are using a standalone supercomputer, we run it as both the server and the client.

The node-side configuration goes in $TORQUE_HOME/mom_priv/config. It should just contain:

$pbs_server hostname

Note

Avoid using localhost as the hostname; use the output of the hostname command instead. I've had some problems with localhost (see the Troubleshooting section), probably due to a misconfiguration I didn't notice.
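
A quick sanity check is to confirm that the machine's hostname, the server_name file and the mom configuration all agree:

# hostname
# cat $TORQUE_HOME/server_name
# cat $TORQUE_HOME/mom_priv/config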

And, on the server side, let's specify the nodes in the $TORQUE_HOME/server_priv/nodes file. In our case:

hostname np=134

Where "hostname" is the output of the hostname command. See we specify the number of processors and can also specify other settings.

After that, let's restart the daemons and also start client-side daemons:

# pkill pbs_server
# qterm
# pbs_server
# pbs_mom

Let's see the output of pbsnodes:

# pbsnodes
hostname
     state = job-exclusive
     np = 134
     ntype = cluster
     jobs = 0/12.hostname.domain
     status = rectime=1390849225,varattr=,jobs=12.hostname.domain,state=free,netload=135590624,gres=pbs_server:= hostname,loadave=16.08,ncpus=134,physmem=269558192kb,availmem=273994304kb,totmem=280048592kb,idletime=91,nusers=1,nsessions=4,sessions=147029 163085 164261 173479,uname=Linux hostname 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

And let's check that our queue shows up:

# qstat -q

server: hostname

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch             100gb    --       --      --    1   0 --   E R
                                               ----- -----
                                                   1     0

In this example we already submitted a job, which is running.
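
For reference, a minimal submission to the batch queue looks roughly like this, run as a regular user (the script name and the resource request are arbitrary):

$ cat test.pbs
#!/bin/sh
#PBS -N test
#PBS -q batch
#PBS -l procs=4,walltime=00:10:00
hostname
$ qsub test.pbs
$ qstat -a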

A complex queue setup

This basic installation works fine for one queue, but normally TORQUE is used on a cluster with many nodes, where jobs must be scheduled so that no user gets more priority than the others. A different and more complex setup is the one that follows (the output of qmgr -c 'p s'):

create queue routing
set queue routing queue_type = Route
set queue routing route_destinations = P16
set queue routing route_destinations += P8
set queue routing route_destinations += P4
set queue routing route_destinations += P1-2
set queue routing route_destinations += serial
set queue routing enabled = True
set queue routing started = True

create queue serial
set queue serial queue_type = Execution
set queue serial resources_max.mem = 20gb
set queue serial resources_max.ncpus = 1
set queue serial resources_max.nodes = 1
set queue serial resources_max.procs = 1
set queue serial resources_max.walltime = 720:00:00
set queue serial resources_min.procs = 1
set queue serial max_user_run = 4
set queue serial enabled = True
set queue serial started = True

create queue P1-2
set queue P1-2 queue_type = Execution
set queue P1-2 resources_max.mem = 20gb
set queue P1-2 resources_max.procs = 2
set queue P1-2 resources_max.walltime = 24:00:00
set queue P1-2 resources_min.procs = 1
set queue P1-2 enabled = True
set queue P1-2 started = True

create queue P4
set queue P4 queue_type = Execution
set queue P4 resources_max.mem = 12gb
set queue P4 resources_max.procs = 4
set queue P4 resources_max.walltime = 24:00:00
set queue P4 enabled = True
set queue P4 started = True

create queue P8
set queue P8 queue_type = Execution
set queue P8 resources_max.mem = 24gb
set queue P8 resources_max.procs = 8
set queue P8 resources_max.walltime = 24:00:00
set queue P8 max_user_run = 8
set queue P8 enabled = True
set queue P8 started = True

create queue P16
set queue P16 queue_type = Execution
set queue P16 resources_max.mem = 24gb
set queue P16 resources_max.procs = 16
set queue P16 resources_max.walltime = 24:00:00
set queue P16 max_user_run = 1
set queue P16 enabled = True
set queue P16 started = True

set server scheduling = True
(... rest of server configuration)

We now have a more complex setup, with queues that have different attributes (number of processors, walltime, available memory, etc.). More information about the attributes can be found in the TORQUE Administration Guide. We also have a "routing" queue that routes jobs to the right execution queue. TORQUE alone (with its very simple scheduler, pbs_sched) cannot do all of that for us.
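
Users then submit to the routing queue instead of a specific execution queue, and the job is moved to a destination queue according to its resource request; for instance, a request too big for the smaller queues can only be accepted by P16 (the script name is arbitrary):

$ qsub -q routing -l procs=16,mem=20gb,walltime=12:00:00 job.pbs
$ qstat -a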

In that case, we need to use another scheduler: we are going to use Maui. Take a look at our page about Maui for its installation and setup with TORQUE.

Troubleshooting

Error qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host

For some reason I couldn't figure out (Google didn't help), I got this error when running the ./torque.setup root command as recommended by the TORQUE Administration Guide. So I ran pbs_server -t create and configured the queues manually instead.

pbsnodes showing down host

If the output of pbsnodes is:

# pbsnodes
localhost
     state = down
     np = 134
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

Check the contents of the $TORQUE_HOME/server_priv/nodes and $TORQUE_HOME/mom_priv/config files, as well as the hostname of the host on which you are running the TORQUE server and clients.

Only one job runs at a time, even if there are free resources

There can be different reasons for this problem, like a misconfigured scheduler or queue. In our case, we were trying to configure TORQUE in a supercomputer environment, with lots of CPUs and memory.

We know that in a cluster environment pbs_server runs on a "master node" and pbs_mom on the others. A supercomputer environment is a single computer that runs both pbs_server and pbs_mom, which is not a problem in itself. When we execute the pbsnodes command we see a single node with lots of processors:

# pbsnodes
localhost
     state = down
     np = 134
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

After running a job, TORQUE put the whole computer in the job-exclusive state, preventing other jobs from running:

# pbsnodes
bachianas
     state = job-exclusive
     np = 134
     ntype = cluster
     jobs = 0/36.bachianas
     status = rectime=1391694740,varattr=,jobs=36.bachianas,state=free,netload=17322218165,gres=,loadave=6.02,ncpus=136,physmem=269558192kb,availmem=271858224kb,totmem=280048592kb,idletime=420,nusers=1,nsessions=4,sessions=50921 55356 55540 192338,uname=Linux bachianas 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

We should not treat this machine as a cluster: TORQUE sees the whole machine as a single node and locks it for one job, no matter whether the job uses 1 or 134 CPUs.

In this very specific case the machine supports the NUMA architecture, so we can compile TORQUE with NUMA support to divide the CPUs into logical units. Check the TORQUE on NUMA systems section of the TORQUE Administration Guide for more information.
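
Roughly, that means rebuilding with the NUMA option:

# ./configure --prefix=$TORQUE_HOME --enable-numa-support
# make && make install

and then describing the node boards to the daemons; the two-board layout below is only an illustration, check the guide for the exact file syntax:

# cat $TORQUE_HOME/mom_priv/mom.layout
nodes=0
nodes=1
# cat $TORQUE_HOME/server_priv/nodes
hostname np=134 num_node_boards=2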

After configuring the NUMA support and restarting the daemons, pbsnodes should report one entry per node board instead of a single node, and more than one job can run at a time.

qrun: Unknown node-attribute

This error can have different causes. One is that your scheduler is not running or cannot communicate with pbs_server. The log of my Maui setup showed something like:

02/14 12:00:13 MRMClusterQuery()
02/14 12:00:13 WARNING:  no resources detected
02/14 12:00:13 MRMWorkloadQuery()
02/14 12:00:13 WARNING:  no workload detected

In my case it was a problem with Maui's RMCFG[HOSTNAME] setting. See our page about Maui for more details.

Job falling into the wrong queue

A job falling into the wrong queue can have several causes. The most common one, in my opinion, is a wrong queue configuration, not a problem with the job itself. But another common cause is wrong PBS directives. The following directive:

#PBS nodes=2:ppn=4

is wrong. The right one is:

#PBS -l nodes=2:ppn=4

Note the -l argument: the resource list must be passed with it.

job in 'R' state, but Time Use is always 00:00:00

After queueing a job, I executed qrun(8) on it for testing purposes, because Maui's scheduler was stopped. After running qrun, the job changed to the R state in qstat(1B), but the Time Use column stayed at 00:00:00. The server log showed entries like:

07/02/2014 15:53:53;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 200.XXX.XXX.XXX:1023 (address not trusted - check entry in server_priv/nodes)

Solution 1: In one case, the origin of this problem was that $TORQUE_HOME/server_name correctly contained the hostname of the machine, but in /etc/hosts the hostname was associated with an external IP (200.XXX...) that the nodes could not reach. The solution was to change the IP entry in /etc/hosts to the internal IP address the other nodes can access.
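
Schematically (the internal address below is made up; use whatever address the nodes actually reach):

# /etc/hosts, before: external address the nodes cannot reach
200.XXX.XXX.XXX   bachianas bachianas.ufabc.edu.br

# /etc/hosts, after: internal address reachable by the nodes
192.168.0.1       bachianas bachianas.ufabc.edu.br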

Solution 2: In another case, pbs_mom(8) was simply having problems running on some nodes. Why? The log filesystem (usually /var) was full, and I needed to delete some files.

Job being completed just after submit, with no further information

This error can have several causes. It is better to check the logs in $TORQUE_HOME/server_logs.

There is an error like:

PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from user@node

where user is the user login name and node is the name of the node.

The server was OK and the scheduler was OK too, so it was probably some problem on the node. After logging in to it, I realized that users were having problems with the NFS partitions, since we had made changes to the NFS server and the local firewall. Unmounting and mounting them again was not enough, so we had to reboot all the nodes. That solved it.

Job never enters state R (Run)

This is a very common problem that can have many different reasons.

Reason 1: Filesystem that has logs is full

A problem I had was the following: there were some jobs running on the system, but newer jobs weren't starting. If we check the status of one of them with Maui's checkjob command, it tells us:

# checkjob 3038

(...)

job is deferred.  Reason:  NoResources  (exceeds available partition procs)
Holds:    Batch  Defer  (hold reason:  NoResources)
PE:  16.00  StartPriority:  31
cannot select job 3038 for partition DEFAULT (job hold active)

So we have a Maui hold on it. If we try to release it with releasehold and wait for the scheduler cycle, we see that the hold is put back. The Maui log doesn't help and just tells us that there aren't available resources, just as checkjob did.

pbsnodes tells us the nodes are free. But, if you investigate with care, you will see a different message:

bachianas-1
     state = free
     np = 4
     ntype = cluster
     status = rectime=1413550150,varattr=,jobs=,state=free,netload=? 0,gres=,message=ERROR: torque spool filesystem full,loadave=0.00,ncpus=4,physmem=8077312kb,availmem=7872128kb,totmem=8077312kb,idletime=11,nusers=0,nsessions=0,uname=Linux bachianas 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

There is a message field with the following content: ERROR: torque spool filesystem full. Some time ago the filesystem really was full and we had to delete some files, but we never restarted pbs_mom.

So, in this case, a quick restart of the pbs_mom daemon solved the problem.
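
On the affected node, something like this confirms the state and clears the stale message (the path assumes the spool directory under $TORQUE_HOME):

# df -h $TORQUE_HOME/spool
# pkill pbs_mom
# pbs_mom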

Reason 2: DNS problems

If we use the checkjob command to investigate, we see:

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15085, msg: 'Time out MSG=connection to mom timed out')

This means that the server cannot connect to the mom daemon and vice-versa.

In mom_logs we find the line:

11/05/2014 18:30:47;0008;PBS_Server.23876;Job;3080.bachianas.ufabc.edu.br;unable to run job, send to MOM '3364214663' failed

And a call to the qrun command to force the execution of the job returns:

qrun: Time out MSG=connection to mom timed out 3083.bachianas.ufabc.edu.br

There are also various messages in server_logs saying it is not possible to communicate with the mom.

First, check whether both the pbs_server and pbs_mom daemons are running. If so, there is likely a problem with DNS. In my case, there was an entry for an invalid DNS server in /etc/resolv.conf and it was necessary to remove it. After that, I had to free the jobs from hold with the releasehold command.
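
A quick way to check name resolution from both the server and the nodes (the node name below is illustrative):

# cat /etc/resolv.conf
# getent hosts $(hostname)
# getent hosts bachianas-1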