Maui notes

(2014-01-29)

As said in the "TORQUE notes" page, a scheduler is needed if you want more than a simple batch queue in your HPC environment. That is why we use Maui, an open-source scheduler.

Installation

After downloading Maui, unpack it and run the configure script::

    # tar zxf maui-3.2.6p17.tar.gz
    # cd maui-3.2.6p17
    # ./configure --with-pbs=$TORQUE_HOME

Here $TORQUE_HOME is the path of the TORQUE installation. The option is only needed if TORQUE is not installed in /usr/local.

*Important*: For some reason, the --prefix parameter of the configure script is not respected. With a prefix other than /usr/local/maui, some files are installed in the user-defined directory, but others still go to /usr/local/maui (the path seems to be hardcoded by the developers). So it is better not to use --prefix.

After that, just compile and install it::

    # make
    # make install

In the configuration file, /usr/local/maui/maui.cfg, I needed to make a small change (I don't know exactly why). I changed the following line::

    RMCFG[HOSTNAME] TYPE=PBS@RMNMHOST@

to::

    RMCFG[HOSTNAME] TYPE=PBS
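
The same edit can be scripted. What follows is only a sketch: fix_rmcfg is a name made up for this note, and it assumes the RMCFG line looks exactly like the one above. It keeps a .bak copy next to the file::

    # Lines starting with "#" in this block are shell comments, not root prompts.
    fix_rmcfg() {
        # Default path is the one used in these notes; pass another file as $1.
        cfg="${1:-/usr/local/maui/maui.cfg}"
        # Back up first, then rewrite the file without the "@RMNMHOST@" suffix.
        cp "$cfg" "$cfg.bak" &&
        sed 's/^\(RMCFG\[[^]]*][[:space:]]*TYPE=PBS\)@RMNMHOST@/\1/' "$cfg.bak" > "$cfg"
    }

Calling fix_rmcfg with no argument edits /usr/local/maui/maui.cfg, so it needs root.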

After that, let's make sure TORQUE is not using its own simple scheduler, pbs_sched. (You might also want to remove it from the init scripts, if it is there)::

    # pkill pbs_sched

Then start the Maui daemon (you might also want to add it to the init scripts of your system)::

    # /usr/local/maui/sbin/maui
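
To confirm the switch, something like the following could be used. It is a sketch that assumes pgrep is available and that the daemons show up under the process names pbs_sched and maui; is_running and check_schedulers are hypothetical helpers::

    is_running() {
        # -x makes pgrep match the process name exactly.
        pgrep -x "$1" > /dev/null
    }

    check_schedulers() {
        if is_running pbs_sched; then
            echo "pbs_sched is still running"
        fi
        if is_running maui; then
            echo "maui is running"
        else
            echo "maui is not running"
        fi
    }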

Logging

To make sure Maui is communicating with TORQUE, I preferred to decrease the scheduler_iteration parameter of TORQUE, using qmgr::

    # qmgr -c "set server scheduler_iteration = 30"

Then I watched the log in /usr/local/maui/log/maui.log. TORQUE jobs submitted with qsub should appear in the log at each iteration.
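
A quick way to check this is to count the log lines that mention a given job. The sketch below assumes only that the job ID appears literally in the log lines; count_job_lines is a made-up helper name::

    count_job_lines() {
        # $1: job ID, $2: log file (defaults to the path used in these notes)
        # -F: fixed string, -w: whole word, -c: print the number of matching lines
        grep -c -F -w -- "$1" "${2:-/usr/local/maui/log/maui.log}"
    }

A count that keeps growing from one iteration to the next suggests that Maui is seeing the job.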

Operation

(2014-08-14)

Maui holds

TODO: when do Maui holds happen? Link to the Maui manual.

Besides TORQUE holds, there are also Maui holds. You can check whether a job has a hold with the checkjob command. Example::

    # checkjob 27332

    (...)

    Holds:    Defer  
    Messages:  exceeds available partition procs
    PE:  8.00  StartPriority:  4000
    cannot select job 27332 for partition DEFAULT (job hold active)

See my TORQUE notes for more information.

In this case the job was deferred because I had to turn off nodes and didn't stop the scheduler first (that can be done with schedctl -s). So TORQUE tried to run the jobs through Maui, and Maui rejected them, putting a hold on each one.

After turning the nodes back on, it was necessary to release the jobs with the releasehold command. Do not confuse it with TORQUE's qrls command.
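
To find which jobs still need releasing, the Holds line can be pulled out of the checkjob output with a small filter. This is a sketch based only on the output format shown above; job_holds is a made-up name::

    job_holds() {
        # Reads checkjob output on stdin and prints the hold types, if any.
        awk '/^[[:space:]]*Holds:/ { $1 = ""; sub(/^ +/, ""); print }'
    }

    # Usage on the scheduler host:
    #   checkjob 27332 | job_holds

A job that still prints Defer here is a candidate for releasehold.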

Reservations

(2014-10-15)

Every job submission creates a reservation on the nodes and processors known to Maui (as reported by TORQUE), so that different jobs don't use the same processors. All reservations can be checked with the showres command::

    # showres
    Reservations

    ReservationID       Type S       Start         End    Duration    N/P    StartTime

    29763                Job R -1:03:12:05  1:20:47:55  3:00:00:00    8/64   Tue Oct 14 10:24:29
    29764                Job R -1:03:11:12  1:20:48:48  3:00:00:00    8/64   Tue Oct 14 10:25:22
    29780                Job R    -9:27:48  1:02:32:12  1:12:00:00    1/1    Wed Oct 15 04:08:46
    29781                Job R    -7:12:18  1:04:47:42  1:12:00:00    1/1    Wed Oct 15 06:24:16
    29782                Job R    -5:22:34  1:06:37:26  1:12:00:00    1/1    Wed Oct 15 08:14:00
    29783                Job R    -3:17:20  1:08:42:40  1:12:00:00    1/1    Wed Oct 15 10:19:14
    29784                Job R    -1:45:40  1:10:14:20  1:12:00:00    1/1    Wed Oct 15 11:50:54
    29785                Job R    -1:31:33  1:10:28:27  1:12:00:00    1/1    Wed Oct 15 12:05:01
    29786                Job R    -1:31:33  1:10:28:27  1:12:00:00    1/1    Wed Oct 15 12:05:01
    29807                Job R    -8:46:05  2:15:13:55  3:00:00:00    2/16   Wed Oct 15 04:50:29
    29808                Job I  4:21:23:26  7:21:23:26  3:00:00:00    8/64   Mon Oct 20 12:00:00
    29809                Job R    -1:54:34  1:22:05:26  2:00:00:00    1/8    Wed Oct 15 11:42:00
    SYSTEM.0            User -  2:22:23:26  4:21:23:26  1:23:00:00   22/176  Sat Oct 18 12:00:00

    13 reservations located

Each job reservation uses the job's ID as its own ID. Note, however, the reservation called SYSTEM.0, which reserves all nodes (22) and all processors (176) of the system for a given time interval. This was necessary because the cluster will be turned off during that interval and we don't want jobs to start running within it.
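
When the list is long, the non-job reservations can be filtered out of the showres output. A sketch, assuming the column layout shown above (type in the second column); non_job_reservations is a made-up name::

    non_job_reservations() {
        # Reads showres output on stdin. For every reservation whose Type is
        # not "Job", prints its ID, relative start and node/processor counts.
        awk 'NF >= 7 && $2 != "Type" && $2 != "Job" { print $1, $4, $7 }'
    }

    # Usage: showres | non_job_reservations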

So we created a manual reservation with the setres command::

    setres -s 12:00:00_10/18 -e 12:00:00_10/20 ALL
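
The time arguments appear to follow the pattern HH:MM:SS_MM/DD, as in the command above. A sketch for generating them, assuming GNU date (the -d option is not portable); maui_time is a made-up helper::

    maui_time() {
        # $1: any timestamp GNU date understands, e.g. "2014-10-18 12:00:00"
        date -d "$1" +%H:%M:%S_%m/%d
    }

    # Print the setres command instead of running it.
    echo "setres -s $(maui_time '2014-10-18 12:00:00') -e $(maui_time '2014-10-20 12:00:00') ALL"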
