As mentioned in the "TORQUE notes" page, a scheduler is needed if you want more than a simple batch queue in your HPC environment. That is why we use Maui, an open-source scheduler.
After downloading Maui, unpack and configure it::
    # tar zxf maui-3.2.6p17.tar.gz
    # cd maui-3.2.6p17
    # ./configure --with-pbs=$TORQUE_HOME
$TORQUE_HOME is the path of the TORQUE installation, if it is not in the default location.
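For example, if TORQUE was installed in its default spool directory, the configure call might look like this (the path here is an assumption; adjust it to your installation)::

    # ./configure --with-pbs=/var/spool/torque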
*Important*: For some reason, the --prefix parameter of the configure script is not respected. When using a prefix other than /usr/local/maui, it installs something in the user-defined directory, but still insists on installing to /usr/local/maui (the developers hardcoded it). So it is better not to use --prefix at all.
After that, just compile and install it::
    # make
    # make install
In the configuration file, /usr/local/maui/maui.cfg, I needed to make a small change (I don't know exactly why). I changed the following line::
After that, let's make sure TORQUE is not using its native, simple scheduler (you might also want to remove it from the init scripts if it is there)::
# pkill pbs_sched
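To confirm that the native scheduler is really gone, something like the following can be used (pgrep prints nothing and returns non-zero when no process matches)::

    # pgrep pbs_sched || echo "pbs_sched is not running"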
And start the maui daemon (you might also want to add it to your system's init scripts)::

    # /usr/local/maui/sbin/maui
To make sure Maui is communicating with TORQUE, I preferred to decrease the scheduler_iteration parameter of TORQUE (this is entered at the qmgr prompt)::
# set server scheduler_iteration = 30
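The same change can be made non-interactively with qmgr's -c option, and the log can then be followed to watch the iterations (a sketch; the log path assumes the default prefix used in this page)::

    # qmgr -c "set server scheduler_iteration = 30"
    # tail -f /usr/local/maui/log/maui.log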
And I watched the logs in /usr/local/maui/log/maui.log. Jobs submitted to TORQUE with qsub should appear in the logs on each iteration.
TODO: when do Maui holds happen? Link to the Maui manual.
Besides TORQUE holds, there are also Maui holds. You can check the existence of a hold on any job with the checkjob command::

    # checkjob 27332
    (...)
    Holds:    Defer
    Messages:  exceeds available partition procs
    PE:  8.00  StartPriority:  4000
    cannot select job 27332 for partition DEFAULT (job hold active)
See my TORQUE notes for more information.
In this case we see the job is deferred because I had turned off some nodes and didn't stop the scheduler (which can be done with schedctl -s). So TORQUE tried to run the job through Maui, and Maui rejected it, putting a hold on it.
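If you plan to turn nodes off for maintenance, pausing scheduling beforehand and resuming it afterwards avoids these holds; a sketch with Maui's schedctl command::

    # schedctl -s
    (... do the maintenance ...)
    # schedctl -r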
After turning the nodes back on, it was necessary to release the jobs with the releasehold command. Do not confuse it with TORQUE's own hold-release command, qrls.
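Using the job from the checkjob example above, releasing its Maui holds would look something like this (the -a flag releases all holds on the job)::

    # releasehold -a 27332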
Every job submission makes reservations on the nodes and processors mapped by Maui (as reported by TORQUE), so that different jobs don't use the same processors. All reservations can be checked with the showres command::
    # showres

    Reservations

    ReservationID   Type S       Start          End     Duration     N/P  StartTime

    29763            Job R -1:03:12:05   1:20:47:55   3:00:00:00    8/64  Tue Oct 14 10:24:29
    29764            Job R -1:03:11:12   1:20:48:48   3:00:00:00    8/64  Tue Oct 14 10:25:22
    29780            Job R    -9:27:48   1:02:32:12   1:12:00:00     1/1  Wed Oct 15 04:08:46
    29781            Job R    -7:12:18   1:04:47:42   1:12:00:00     1/1  Wed Oct 15 06:24:16
    29782            Job R    -5:22:34   1:06:37:26   1:12:00:00     1/1  Wed Oct 15 08:14:00
    29783            Job R    -3:17:20   1:08:42:40   1:12:00:00     1/1  Wed Oct 15 10:19:14
    29784            Job R    -1:45:40   1:10:14:20   1:12:00:00     1/1  Wed Oct 15 11:50:54
    29785            Job R    -1:31:33   1:10:28:27   1:12:00:00     1/1  Wed Oct 15 12:05:01
    29786            Job R    -1:31:33   1:10:28:27   1:12:00:00     1/1  Wed Oct 15 12:05:01
    29807            Job R    -8:46:05   2:15:13:55   3:00:00:00    2/16  Wed Oct 15 04:50:29
    29808            Job I  4:21:23:26   7:21:23:26   3:00:00:00    8/64  Mon Oct 20 12:00:00
    29809            Job R    -1:54:34   1:22:05:26   2:00:00:00     1/8  Wed Oct 15 11:42:00
    SYSTEM.0        User -  2:22:23:26   4:21:23:26   1:23:00:00  22/176  Sat Oct 18 12:00:00

    13 reservations located
Each job reservation has the job's ID as its reservation ID. But note the SYSTEM.0 reservation, which reserves all nodes (22) and all processors (176) of the system for a given time interval. This was necessary because the cluster will be turned off during that interval, and we don't want jobs to start executing within it.
So we created a manual reservation with the setres_ command::
    # setres -s 12:00:00_10/18 -e 12:00:00_10/20 ALL
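To verify the manual reservation, or to cancel it if plans change, Maui's releaseres command can be used (the reservation ID here is the one from the showres output above)::

    # showres
    # releaseres SYSTEM.0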