TORQUE notes

TORQUE is a resource manager that gives processes an organized way to access a system's resources. Together with Maui, a scheduler for its queues, it is one of the main tools installed on an HPC system. This page covers the basics of TORQUE installation and configuration.

There are proprietary alternatives such as PBS Professional, which combines a resource manager and a scheduler, and Moab from Adaptive Computing, another HPC suite. Those are not covered here.

Overview

You can get an overview of TORQUE in the TORQUE architecture page of the TORQUE Administration Guide, but in summary it is as follows:

For a successful TORQUE installation, four daemons should be running:

trqauthd
The daemon that clients connect to when using TORQUE commands (qsub, qdel, qstat, etc.). It then authorizes connections to pbs_server. It opens a UNIX domain socket at /tmp/trqauthd-unix. More information is in the TORQUE Administration Guide.
pbs_server
The daemon that gets in touch with pbs_mom (on the nodes) to run new jobs. Listens on port 15001 by default.
pbs_mom
Responsible for running jobs on the nodes. Communicates with pbs_server. Listens on ports 15002 and 15003 by default.
pbs_sched
Basic scheduler called by pbs_server periodically (see the scheduler_iteration setting in the pbs_server_attributes(7) man page). pbs_sched is a very simple scheduler and is usually replaced by Maui. In the default configuration the scheduler listens on port 15004.
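
A quick way to check that the daemons are running and listening on the expected ports is something like the following (a rough sketch; process names and netstat options may vary between systems):

# ps -e | grep -E 'trqauthd|pbs_server|pbs_mom|pbs_sched'
# netstat -lntp | grep ':1500'
# ls -l /tmp/trqauthd-unix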

The following diagram is a summary of the communication between daemons and user programs:

master node
........................................
:                                      :
:   +--------------------+             :
:   |    user commands   |             :
:   | (qsub, qdel, etc.) |             :
:   +--------------------+             :
:             ^                        :
:             |                        :
:             v                        :
:        +----------+                  :
:        | trqauthd |                  :
:        +----------+                  :
:             ^                        :
:             |                        :
:             v                        :
:       +------------+   +-----------+ :
:       | pbs_server |<->| pbs_sched | :
:       +------------+   +-----------+ :
:             ^                        :
:             |                        :
:......................................:
              |
              |
slave nodes   |
..............................
:             |              :
:             v              :
:       +------------+       :
:       |   pbs_mom  |       :
:       +------------+       :
:                            :
:............................:

Here, "master node" is the computer that has TORQUE administraton tools installed and that makes all the work of enqueing jobs, running the scheduler, allocation of resources, etc. Normally it is not used for processing jobs. "slave nodes" are the nodes used to run processing jobs. Each "slave node" runs a instance of pbs_mom. Both kind of nodes can be ran in one computer (so, therefore running all daemons in just one computer). See Installation in a supercomputer for details.

Installation in a supercomputer

At the time of this writing, TORQUE 4.2.6 was the newest version, so let's work with that. I'm installing it not on a traditional Beowulf cluster, but on a supercomputer with 136 processors (we will use just 134) and more than 260 GB of RAM.

After downloading and unpacking the tarball, let's configure and build it. As the root user, do:

# export TORQUE_HOME=/opt/torque-4.2.6
# ./configure --prefix=$TORQUE_HOME
# make
# make install

We should not forget init scripts. Since our system is SUSE Linux Enterprise Server the commands are:

# cp contrib/init.d/suse.trqauthd /etc/init.d/trqauthd
# chkconfig --add trqauthd
# service trqauthd start
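
The contrib/init.d directory also ships scripts for the other daemons; assuming the file names follow the same suse.* pattern as the trqauthd one, they can be installed the same way:

# cp contrib/init.d/suse.pbs_server /etc/init.d/pbs_server
# cp contrib/init.d/suse.pbs_mom /etc/init.d/pbs_mom
# chkconfig --add pbs_server
# chkconfig --add pbs_mom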

I prefer not to install it among other things in /usr, so I used the --prefix option above. We will refer to that directory as $TORQUE_HOME.

After that, it is important to tell our system where we just installed TORQUE, if it is not a standard location:

# export PATH=$PATH:$TORQUE_HOME/bin:$TORQUE_HOME/sbin

It is a good idea to add it to your rc scripts.
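
For example, a small profile script makes the setting persistent for every login shell (this assumes the system reads /etc/profile.d, as SUSE does; the file name is arbitrary):

# cat > /etc/profile.d/torque.sh <<'EOF'
export TORQUE_HOME=/opt/torque-4.2.6
export PATH=$PATH:$TORQUE_HOME/bin:$TORQUE_HOME/sbin
EOF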

Then, let's create the server database with:

# pbs_server -t create

Note

In the Installing TORQUE section of the TORQUE Administration Guide, we are told to use the ./torque.setup root script, which already creates a basic qmgr setup for us, but we ran into a problem with it (see Error qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host in the Troubleshooting section below).

After that, we should start pbs_server and call the qmgr program to set up the queues:

# qterm
# pbs_server
# qmgr

In the qmgr console, let's configure a basic queue called batch with the following commands:

create queue batch

set queue batch queue_type = Execution
set queue batch resources_max.mem = 100gb
set queue batch resources_max.procs = 100
set queue batch resources_max.walltime = 01:00:00
set queue batch enabled = True
set queue batch started = True

set server scheduling = True
set server managers = root@hostname
set server default_queue = batch
set server log_events = 511
set server mail_from = root
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6

We already have a queue set up. Now we need to tell TORQUE to use the computer we installed it on. Configuration and daemons for ordinary nodes differ from the master node's, but since we are using a standalone supercomputer, we run it as both the server and the client.

The node-side configuration goes in $TORQUE_HOME/mom_priv/config. It should just contain:

$pbs_server hostname

Note

Avoid using localhost as the hostname; use the output of the hostname command instead. I've had some problems with localhost (see the Troubleshooting section), probably due to a misconfiguration I didn't notice.
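
A quick sanity check is to confirm that the machine's hostname, the server_name file and the mom configuration all agree:

# hostname
# cat $TORQUE_HOME/server_name
# cat $TORQUE_HOME/mom_priv/config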

And, on the server side, let's specify the nodes in the $TORQUE_HOME/server_priv/nodes file. In our case:

hostname np=134

Where "hostname" is the output of the hostname command. See we specify the number of processors and can also specify other settings.

After that, let's restart the daemons and also start client-side daemons:

# pkill pbs_server
# qterm
# pbs_server
# pbs_mom

Let's see the output of pbsnodes:

# pbsnodes
hostname
     state = job-exclusive
     np = 134
     ntype = cluster
     jobs = 0/12.hostname.domain
     status = rectime=1390849225,varattr=,jobs=12.hostname.domain,state=free,netload=135590624,gres=pbs_server:= hostname,loadave=16.08,ncpus=134,physmem=269558192kb,availmem=273994304kb,totmem=280048592kb,idletime=91,nusers=1,nsessions=4,sessions=147029 163085 164261 173479,uname=Linux hostname 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

And let's check that our queue shows up:

# qstat -q

server: hostname

Queue            Memory CPU Time Walltime Node  Run Que Lm  State
---------------- ------ -------- -------- ----  --- --- --  -----
batch             100gb    --       --      --    1   0 --   E R
                                               ----- -----
                                                   1     0

In this example we already submitted a job, which is running.
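
For reference, a minimal submission to the batch queue looks roughly like this, run as a regular user (the script name and the resource request are arbitrary):

$ cat test.pbs
#!/bin/sh
#PBS -N test
#PBS -q batch
#PBS -l procs=4,walltime=00:10:00
hostname
$ qsub test.pbs
$ qstat -a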

A complex queue setup

This basic installation works fine for one queue, but normally TORQUE is used on a cluster with many nodes, where jobs must be scheduled so that no user gets more priority than the others. A different and more complex setup is the one that follows (the output of qmgr -c 'p s'):

create queue routing
set queue routing queue_type = Route
set queue routing route_destinations = P16
set queue routing route_destinations += P8
set queue routing route_destinations += P4
set queue routing route_destinations += P1-2
set queue routing route_destinations += serial
set queue routing enabled = True
set queue routing started = True

create queue serial
set queue serial queue_type = Execution
set queue serial resources_max.mem = 20gb
set queue serial resources_max.ncpus = 1
set queue serial resources_max.nodes = 1
set queue serial resources_max.procs = 1
set queue serial resources_max.walltime = 720:00:00
set queue serial resources_min.procs = 1
set queue serial max_user_run = 4
set queue serial enabled = True
set queue serial started = True

create queue P1-2
set queue P1-2 queue_type = Execution
set queue P1-2 resources_max.mem = 20gb
set queue P1-2 resources_max.procs = 2
set queue P1-2 resources_max.walltime = 24:00:00
set queue P1-2 resources_min.procs = 1
set queue P1-2 enabled = True
set queue P1-2 started = True

create queue P4
set queue P4 queue_type = Execution
set queue P4 resources_max.mem = 12gb
set queue P4 resources_max.procs = 4
set queue P4 resources_max.walltime = 24:00:00
set queue P4 enabled = True
set queue P4 started = True

create queue P8
set queue P8 queue_type = Execution
set queue P8 resources_max.mem = 24gb
set queue P8 resources_max.procs = 8
set queue P8 resources_max.walltime = 24:00:00
set queue P8 max_user_run = 8
set queue P8 enabled = True
set queue P8 started = True

create queue P16
set queue P16 queue_type = Execution
set queue P16 resources_max.mem = 24gb
set queue P16 resources_max.procs = 16
set queue P16 resources_max.walltime = 24:00:00
set queue P16 max_user_run = 1
set queue P16 enabled = True
set queue P16 started = True

set server scheduling = True
(... rest of server configuration)

We now have a more complex setup, with queues that have different attributes (number of processors, walltime, available memory, etc.). More information about the attributes can be found in the TORQUE Administration Guide. We also have a "routing" queue that routes jobs to the right execution queue. TORQUE alone (with its very simple scheduler, pbs_sched) cannot do all of that for us.
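
Users then submit to the routing queue instead of a specific execution queue, and the job is moved to a destination queue according to its resource request; for instance, a request too big for the smaller queues can only be accepted by P16 (the script name is arbitrary):

$ qsub -q routing -l procs=16,mem=20gb,walltime=12:00:00 job.pbs
$ qstat -a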

In that case, we need to use another scheduler: we are going to use Maui. Take a look at our page about Maui for its installation and setup with TORQUE.

Troubleshooting

Error qmgr obj= svr=default: Bad ACL entry in host list MSG=First bad host

For some reason I couldn't figure out (Google didn't help), I got this error when running the ./torque.setup root command as recommended by the TORQUE Administration Guide. So I ran pbs_server -t create and configured the queues manually instead.

pbsnodes showing down host

If the output of pbsnodes is:

# pbsnodes
localhost
     state = down
     np = 134
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

Check the contents of the $TORQUE_HOME/server_priv/nodes and $TORQUE_HOME/mom_priv/config files, as well as the hostname of the host on which you are running the TORQUE server and clients.

Only one job runs at a time, even if there are free resources

There can be different reasons for this problem, like a misconfigured scheduler or queue. In our case, we were trying to configure TORQUE in a supercomputer environment, with lots of CPUs and memory.

We know that in a cluster environment pbs_server runs on a "master node" and pbs_mom on the others. A supercomputer environment is a single computer that runs both pbs_server and pbs_mom, which is not a problem in itself. When we execute the pbsnodes command we see a single node with lots of processors:

# pbsnodes
localhost
     state = down
     np = 134
     ntype = cluster
     mom_service_port = 15002
     mom_manager_port = 15003

After running a job, TORQUE put the whole computer in the job-exclusive state, preventing other jobs from running:

# pbsnodes
bachianas
     state = job-exclusive
     np = 134
     ntype = cluster
     jobs = 0/36.bachianas
     status = rectime=1391694740,varattr=,jobs=36.bachianas,state=free,netload=17322218165,gres=,loadave=6.02,ncpus=136,physmem=269558192kb,availmem=271858224kb,totmem=280048592kb,idletime=420,nusers=1,nsessions=4,sessions=50921 55356 55540 192338,uname=Linux bachianas 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

We should not treat this machine as a cluster: TORQUE sees the whole machine as a single node and locks it for one job, no matter whether the job uses 1 or 134 CPUs.

In this very specific case the machine supports the NUMA architecture, so we can compile TORQUE with NUMA support to divide the CPUs into logical units. Check the TORQUE on NUMA systems section of the TORQUE Administration Guide for more information.
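
Roughly, that means rebuilding with the NUMA option:

# ./configure --prefix=$TORQUE_HOME --enable-numa-support
# make && make install

and then describing the node boards to the daemons; the two-board layout below is only an illustration, check the guide for the exact file syntax:

# cat $TORQUE_HOME/mom_priv/mom.layout
nodes=0
nodes=1
# cat $TORQUE_HOME/server_priv/nodes
hostname np=134 num_node_boards=2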

After configuring the NUMA support and restarting the daemons, pbsnodes should report one entry per node board instead of a single node, and more than one job can run at a time.

qrun: Unknown node-attribute

This error can have different causes. One is that your scheduler is not running or cannot communicate with pbs_server. The log of my Maui setup showed something like:

02/14 12:00:13 MRMClusterQuery()
02/14 12:00:13 WARNING:  no resources detected
02/14 12:00:13 MRMWorkloadQuery()
02/14 12:00:13 WARNING:  no workload detected

In my case it was a problem with Maui's RMCFG[HOSTNAME] setting. See our page about Maui for more details.

Job falling into the wrong queue

A job falling into the wrong queue can have several causes. The most common one, in my opinion, is a wrong queue configuration, not a problem with the job itself. But another common cause is wrong PBS directives. The following directive:

#PBS nodes=2:ppn=4

is wrong. The right one is:

#PBS -l nodes=2:ppn=4

Note the -l argument: the resource list must be passed with it.

job in 'R' state, but Time Use is always 00:00:00

After queueing a job, I executed qrun(8) on it for testing purposes, because Maui's scheduler was stopped. After running qrun, the job changed to the R state in qstat(1B), but the Time Use column stayed at 00:00:00. The server log showed entries like:

07/02/2014 15:53:53;0001;PBS_Server;Svr;PBS_Server;LOG_ERROR::is_request, bad attempt to connect from 200.XXX.XXX.XXX:1023 (address not trusted - check entry in server_priv/nodes)

Solution 1: In one case, the origin of this problem was that $TORQUE_HOME/server_name correctly contained the hostname of the machine, but in /etc/hosts the hostname was associated with an external IP (200.XXX...) that the nodes could not reach. The solution was to change the IP entry in /etc/hosts to the internal IP address the other nodes can access.
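
Schematically (the internal address below is made up; use whatever address the nodes actually reach):

# /etc/hosts, before: external address the nodes cannot reach
200.XXX.XXX.XXX   bachianas bachianas.ufabc.edu.br

# /etc/hosts, after: internal address reachable by the nodes
192.168.0.1       bachianas bachianas.ufabc.edu.br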

Solution 2: In another case, pbs_mom(8) was simply having problems running on some nodes. Why? The log filesystem (usually /var) was full, and I needed to delete some files.

Job being completed just after submit, with no further information

This error can have several causes. It is better to check the logs in $TORQUE_HOME/server_logs.

There is an error like:

PBS_Server;Req;req_reject;Reject reply code=15001(Unknown Job Id), aux=0, type=LocateJob, from user@node

where user is the user login name and node is the name of the node.

The server was OK and the scheduler was OK too, so it was probably some problem on the node. After logging in to it, I realized that users were having problems with the NFS partitions, since we had made changes to the NFS server and the local firewall. Unmounting and mounting them again was not enough, so we had to reboot all the nodes. That solved it.

Job never enters state R (Run)

This is a very common problem that can have many different reasons.

Reason 1: Filesystem that has logs is full

A problem I had was the following: there were some jobs running on the system, but newer jobs weren't starting. If we check the status of one of them with Maui's checkjob command, it tells us:

# checkjob 3038

(...)

job is deferred.  Reason:  NoResources  (exceeds available partition procs)
Holds:    Batch  Defer  (hold reason:  NoResources)
PE:  16.00  StartPriority:  31
cannot select job 3038 for partition DEFAULT (job hold active)

So we have a Maui hold on it. If we try to release it with releasehold and wait for the scheduler cycle, we see that the hold is put back. The Maui log doesn't help and just tells us that there aren't available resources, just as checkjob did.

pbsnodes tells us the nodes are free. But, if you investigate with care, you will see a different message:

bachianas-1
     state = free
     np = 4
     ntype = cluster
     status = rectime=1413550150,varattr=,jobs=,state=free,netload=? 0,gres=,message=ERROR: torque spool filesystem full,loadave=0.00,ncpus=4,physmem=8077312kb,availmem=7872128kb,totmem=8077312kb,idletime=11,nusers=0,nsessions=0,uname=Linux bachianas 2.6.16.60-0.42.10-default #1 SMP Tue Apr 27 05:11:27 UTC 2010 ia64,opsys=linux
     mom_service_port = 15002
     mom_manager_port = 15003

There is a message field with the following content: ERROR: torque spool filesystem full. Some time ago the filesystem really was full and we had to delete some files, but we never restarted pbs_mom.

So, in this case, a quick restart of the pbs_mom daemon solved the problem.
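
On the affected node, something like this confirms the state and clears the stale message (the path assumes the spool directory under $TORQUE_HOME):

# df -h $TORQUE_HOME/spool
# pkill pbs_mom
# pbs_mom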

Reason 2: DNS problems

If we use the checkjob command to investigate, we see:

job is deferred.  Reason:  RMFailure  (cannot start job - RM failure, rc: 15085, msg: 'Time out MSG=connection to mom timed out')

This means that the server cannot connect to the mom daemon and vice-versa.

In mom_logs we find the line:

11/05/2014 18:30:47;0008;PBS_Server.23876;Job;3080.bachianas.ufabc.edu.br;unable to run job, send to MOM '3364214663' failed

And a call to the qrun command to force the execution of the job returns:

qrun: Time out MSG=connection to mom timed out 3083.bachianas.ufabc.edu.br

There are also various messages in server_logs saying it is not possible to communicate with the mom.

First, check whether both the pbs_server and pbs_mom daemons are running. If so, there is likely a problem with DNS. In my case, there was an entry for an invalid DNS server in /etc/resolv.conf and it was necessary to remove it. After that, I had to free the jobs from hold with the releasehold command.
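
A quick way to check name resolution from both the server and the nodes (the node name below is illustrative):

# cat /etc/resolv.conf
# getent hosts $(hostname)
# getent hosts bachianas-1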