Maui notes
(2014-01-29)
As said in the "TORQUE notes" page, a scheduler is needed if you want more than a batch queue to work in your HPC environment. That is why we use Maui, which is an open-source scheduler.
Installation
After downloading Maui, unpack it::
# tar zxf maui-3.2.6p17.tar.gz
# cd maui-3.2.6p17
# ./configure --with-pbs=$TORQUE_HOME
Where $TORQUE_HOME is the path of the TORQUE installation if not in /usr/local.
*Important*: For some reason, the --prefix parameter of the configure script is not respected. With a prefix other than /usr/local/maui, it installs some files in the user-defined directory but still insists on installing others in /usr/local/maui (the developers hardcoded the path). So it is better not to use --prefix.
After that, just compile and install it::
# make
# make install
In the configuration file, /usr/local/maui/maui.cfg, I needed to make a small change (I don't know exactly why). I changed the following line::
RMCFG[HOSTNAME] TYPE=PBS@RMNMHOST@
to::
RMCFG[HOSTNAME] TYPE=PBS
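The same edit can be scripted. This is a minimal sketch, assuming the line looks exactly like the one above. The helper name strip_rmnmhost is hypothetical, and it only filters text, so rewriting the file is a separate step:

```shell
# Strip the @RMNMHOST@ suffix from "RMCFG[...] TYPE=PBS@RMNMHOST@" lines.
# Reads maui.cfg text on stdin and writes the fixed text to stdout;
# every other line passes through unchanged.
strip_rmnmhost() {
    sed 's/^\(RMCFG\[[^]]*][[:space:]]*TYPE=PBS\)@RMNMHOST@/\1/'
}

# Possible usage (keeps a backup of the original file):
#   cp /usr/local/maui/maui.cfg /usr/local/maui/maui.cfg.bak
#   strip_rmnmhost < /usr/local/maui/maui.cfg.bak > /usr/local/maui/maui.cfg
```

Filtering from a backup copy keeps the original around in case the change turns out not to be needed on another system.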
After that, let's make sure TORQUE is not using its simple native scheduler. (You might also want to remove it from the init scripts, if it is there)::
# pkill pbs_sched
And start the maui daemon (you might also want to add it to the init scripts of your system)::
# /usr/local/maui/sbin/maui
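To confirm that only one scheduler ended up running, the process list can be checked. This is a sketch, assuming a Linux ps where `ps -e -o comm=` prints bare command names. The helper name running_schedulers is hypothetical:

```shell
# Report which scheduler daemons appear in a list of process names.
# Reads one process name per line on stdin and prints one of:
# "maui", "pbs_sched", "both" or "none".
running_schedulers() {
    awk '
        $1 == "pbs_sched" { pbs = 1 }
        $1 == "maui"      { maui = 1 }
        END {
            if (pbs && maui) print "both"
            else if (pbs)    print "pbs_sched"
            else if (maui)   print "maui"
            else             print "none"
        }'
}

# Possible usage:
#   ps -e -o comm= | running_schedulers
```

The expected answer after the steps above is "maui". "both" would mean pbs_sched is still running, or was restarted by an init script.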
Logging
To make sure maui is communicating with TORQUE, I preferred to decrease the scheduler_iteration parameter of TORQUE::
# qmgr -c "set server scheduler_iteration = 30"
And watched the logs in /usr/local/maui/log/maui.log. TORQUE jobs submitted with qsub should appear in the logs at each iteration.
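When the log is busy, it helps to follow a single job. This sketch makes no assumption about the Maui log format. It just keeps the lines that contain the job ID as a whole word. The helper name lines_for_job is hypothetical:

```shell
# Print only the lines that mention a given job ID as a whole word.
# $1 is the job ID; the log text is read from stdin.
lines_for_job() {
    grep -Fw -- "$1"
}

# Possible usage:
#   tail -f /usr/local/maui/log/maui.log | lines_for_job 27332
```

The -w option keeps 27332 from also matching a longer ID such as 127332.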
Operation
(2014-08-14)
Maui holds
TODO: when do Maui holds happen? Link to the Maui manual.
Besides TORQUE holds, there are also Maui holds.
You can check the existence of a hold for any job with the checkjob command. Example::
# checkjob 27332
(...)
Holds: Defer
Messages: exceeds available partition procs
PE: 8.00 StartPriority: 4000
cannot select job 27332 for partition DEFAULT (job hold active)
See my TORQUE notes for more information.
In this case we see it is deferred because I had to turn off nodes and didn't stop the scheduler (that can be done with schedctl -s). So TORQUE tried to run the jobs with Maui, and Maui rejected them, putting a hold on each one.
After turning the nodes back on, it was necessary to release the jobs with the releasehold command. Do not confuse it with TORQUE's qrls command.
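With many held jobs, checking each one by hand gets tedious. This sketch assumes the checkjob output has a "Holds:" line like the one in the example above. The helper name job_holds is hypothetical, and which releasehold options are needed is not covered here:

```shell
# Print the hold types listed on the "Holds:" line of checkjob output.
# Reads checkjob output on stdin; prints nothing if there is no such line.
job_holds() {
    awk '$1 == "Holds:" { $1 = ""; sub(/^ +/, ""); print }'
}

# Possible usage (the job IDs are placeholders):
#   for job in 27332 27333; do
#       if [ -n "$(checkjob "$job" | job_holds)" ]; then
#           releasehold "$job"
#       fi
#   done
```

Testing for a non-empty result before calling releasehold leaves jobs without a Maui hold untouched.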
Reservations
(2014-10-15)
Every job submission makes reservations on the nodes and processors mapped by Maui (as reported by TORQUE), so that different jobs don't use the same processors. All reservations can be checked with the showres command::
# showres
Reservations
ReservationID  Type  S  Start        End         Duration    N/P     StartTime
29763          Job   R  -1:03:12:05  1:20:47:55  3:00:00:00  8/64    Tue Oct 14 10:24:29
29764          Job   R  -1:03:11:12  1:20:48:48  3:00:00:00  8/64    Tue Oct 14 10:25:22
29780          Job   R  -9:27:48     1:02:32:12  1:12:00:00  1/1     Wed Oct 15 04:08:46
29781          Job   R  -7:12:18     1:04:47:42  1:12:00:00  1/1     Wed Oct 15 06:24:16
29782          Job   R  -5:22:34     1:06:37:26  1:12:00:00  1/1     Wed Oct 15 08:14:00
29783          Job   R  -3:17:20     1:08:42:40  1:12:00:00  1/1     Wed Oct 15 10:19:14
29784          Job   R  -1:45:40     1:10:14:20  1:12:00:00  1/1     Wed Oct 15 11:50:54
29785          Job   R  -1:31:33     1:10:28:27  1:12:00:00  1/1     Wed Oct 15 12:05:01
29786          Job   R  -1:31:33     1:10:28:27  1:12:00:00  1/1     Wed Oct 15 12:05:01
29807          Job   R  -8:46:05     2:15:13:55  3:00:00:00  2/16    Wed Oct 15 04:50:29
29808          Job   I  4:21:23:26   7:21:23:26  3:00:00:00  8/64    Mon Oct 20 12:00:00
29809          Job   R  -1:54:34     1:22:05:26  2:00:00:00  1/8     Wed Oct 15 11:42:00
SYSTEM.0       User  -  2:22:23:26   4:21:23:26  1:23:00:00  22/176  Sat Oct 18 12:00:00
13 reservations located
Each job reservation uses the job ID as its ID. But notice that we have a reservation called SYSTEM.0 that reserves all nodes (22) and all processors (176) of the current system for a given time interval. This was necessary because the cluster will be turned off during this interval and we don't want jobs to begin execution within this period.
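In a long listing, a reservation like SYSTEM.0 is easy to miss. This sketch picks out every reservation whose Type is not Job, assuming the column layout of the showres output above. The helper name non_job_reservations is hypothetical:

```shell
# List reservations whose Type column is not "Job".
# Reads showres output on stdin and prints "ID N/P" for each match.
# The title, header and summary lines are skipped.
non_job_reservations() {
    awk 'NF >= 8 && $1 != "ReservationID" && $2 != "Job" { print $1, $7 }'
}

# Possible usage:
#   showres | non_job_reservations
```

The NF >= 8 test drops the title and the "reservations located" summary, which have fewer fields than a reservation row.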
So we created a manual reservation with the setres_ command::
# setres -s 12:00:00_10/18 -e 12:00:00_10/20 ALL
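The start and end times in that command use the form HH:MM:SS_MM/DD. To avoid typing them by hand, they can be generated from an ordinary date string. This sketch assumes GNU date (for its -d option). The helper name setres_time is hypothetical:

```shell
# Format a timestamp as HH:MM:SS_MM/DD, the form used in the setres
# command above. $1 is any date string that GNU date -d understands.
# The result is in the local time zone of the machine running it.
setres_time() {
    date -d "$1" '+%H:%M:%S_%m/%d'
}

# Possible usage:
#   setres -s "$(setres_time '2014-10-18 12:00:00')" \
#          -e "$(setres_time '2014-10-20 12:00:00')" ALL
```

Running it on the same machine as Maui keeps the generated times in the scheduler's own time zone.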