Lychee: Difference between revisions

= Lychee =
<code>lychee</code> is one of our servers. This document provides a full description of its services. If a service is moved to another server, please just move the corresponding description to its page.
== Web Server ==
Lychee acts as a webserver to the cluster (httpd), serving the centos repository. This is configured using chkconfig.


== Software ==


The <code>module</code> command is sourced in the global bash configuration file <code>/etc/bashrc</code> and in the global csh configuration file <code>/etc/csh.local</code>.
==== Specific developmental versions ====
The directories:
# /marksw/gromos.ethz.ch_repos_forcefields/
# /marksw/gromos.ethz.ch_repos_gromos++/
# /marksw/gromos.ethz.ch_repos_gromos96/
# /marksw/gromos.ethz.ch_repos_gromos_doc/
# /marksw/gromos.ethz.ch_repos_gromosXXc++/
# /marksw/gromos.ethz.ch_repos_gromosXXF77/
contain dated subdirectories holding the corresponding versions of the source code and related files from the IGC Group at ETH. To get the latest source code, copy one of the subdirectories and run <code>svn up</code> inside the copy.
The corresponding binaries for the various source codes are in:
# /marksw/gromos++  (corresponding to /marksw/gromos.ethz.ch_repos_gromos++/)
# /marksw/gromosXXc++  (corresponding to /marksw/gromos.ethz.ch_repos_gromosXXc++/)
==== Brisbane MD Group Topologies ====
Topologies created by the Brisbane MD Group (and not yet made public) are found in /marksw/BrisbaneMDGroupForceFields/.


=== Yum ===


==== Repository Software Installed ====
zsh yum-utils OpenIPMI OpenIPMI-tools openbabel openbabel-devel
 
=== /opt ===
 
==== Rasmol ====
* Download source from [http://www.rasmol.org/software/RasMol_Latest.tar.gz]
* tar -xzvf RasMol_Latest.tar.gz
* cd Rasmol_2.7.4.2_23Mar08/src/
* Edit Imakefile:
* LDLIBS = -L/usr/lib64 '...'
* Make sure /usr/lib64/libforms.so exists (package xforms-devel doesn't actually create symbolic links correctly)
* ./build_all.sh
* Move rasmol_32BIT to /opt/bin/rasmol
 
==== Acrobat ====
* rpm -iv http://ardownload.adobe.com/pub/adobe/reader/unix/9.x/9.1/enu/AdbeRdr9.1.0-1_i486linux_enu.rpm
* ln -s /opt/Adobe/Reader9/bin/acroread /opt/bin/acroread
 
==== VMD ====
* Done by Mitch/AJ
 
==== Pymol ====
* Done by Mitch/AJ
 
==== AMD Core Math Library (ACML) ====
* Download from [http://developer.amd.com/Downloads/acml-4-2-0-gfortran-64bit.tgz here]
* Extract & run ./install-acml-4-2-0-gfortran-64bit.sh
** Licence? accept
** Location (/opt/acml4.2.0)? [Enter]
 
=== TwistedConch (Python SSH library) ===
* cd /tmp; wget http://tmrc.mit.edu/mirror/twisted/Twisted/8.2/Twisted-8.2.0.tar.bz2
* tar -xjvf Twisted-8.2.0.tar.bz2; cd Twisted-8.2.0/
* python setup.py install
 
=== OpenBabel Python Bindings ===
* Download http://transact.dl.sourceforge.net/sourceforge/openbabel/openbabel-2.1.1.tar.gz
* Extract & cd to openbabel-2.1.1/scripts/python
* Setup python interface
** python setup.py build
** python setup.py install
* To get pybel "working": edit /usr/lib64/python24/site-packages/pybel.py. The diff is below:
<pre>
47c47
< _descdict = _getplugins(ob.OBDescriptor.FindType, descs)
---
> _descdict = { 'LogP':ob.OBLogP, 'MR':ob.OBMR, 'TPSA':ob.OBPSA } #_getplugins(ob.OBDescriptor.FindType, descs)
53c53
< _forcefields = _getplugins(ob.OBForceField.FindType, forcefields)
---
> _forcefields = { 'mmff94':ob.OBForceField("mmff94") } # _getplugins(ob.OBForceField.FindType, forcefields)
56c56
< _operations = _getplugins(ob.OBOp.FindType, operations)
---
> #_operations = _getplugins(ob.OBOp.FindType, operations)
334,335c334,335
<        if self.dim != 3:
<            self.make3D(forcefield)
---
>        #if self.dim != 3:
>        #    self.make3D(forcefield)
367c367
<        _operations['Gen3D'].Do(self.OBMol)
---
>        print 'Pybel.py: 367: Gen3D not available!' #_operations['Gen3D'].Do(self.OBMol)
</pre>
 
== FedoraDS (LDAP) ==
 
=== Add new user ===
* Find UID that has not been used: /marksw/scripts/0.1/nextUID2.py lists the UIDs that have been used
* Find GID that will be associated with new user: MDGroup = 500, BMMG = 540, SBeatson = 600, KobeLab = 520
* ssh lychee -X
* cd /opt/fedora-ds/
* ./startconsole -u admin -a http://lychee.md.smms.uq.edu.au:60000/
** lychee.md.smms.uq.edu.au -> Directory
*** Directory
**** md -> People -> Add User
***** Fill in info (including UID and GID in the POSIX tab)
****** Note: User names should be all lower case consisting of first name concatenated with the first letter of the surname
* sudo rsync -a /etc/skel/ /home/NEWUSERNAME/
* sudo chown -R NEWUSERNAME:NEWUSERSGID /home/NEWUSERNAME/
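The UID and user-name rules in the steps above can be sketched in Python. This is only an illustration and is not the contents of nextUID2.py: the function names, the <code>start</code> argument and the example values are assumptions. The GID table is copied from the list above.

```python
# Hypothetical helpers illustrating the rules for adding a user.
# Nothing here talks to LDAP; the inputs are plain Python values.

# Group name -> GID, as listed in this document.
GROUP_GIDS = {"MDGroup": 500, "BMMG": 540, "SBeatson": 600, "KobeLab": 520}


def next_free_uid(used_uids, start):
    """Return the lowest UID >= start that is not in used_uids."""
    used = set(used_uids)
    uid = start
    while uid in used:
        uid += 1
    return uid


def make_username(first_name, surname):
    """First name plus the first letter of the surname, all lower case."""
    return (first_name + surname[:1]).lower()


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(next_free_uid([1000, 1001, 1003], start=1000))
    print(make_username("Jane", "Doe"), GROUP_GIDS["MDGroup"])
```

The list of used UIDs would come from the output of nextUID2.py; the starting UID is a site choice and is not stated in this document.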
 
== FlexLM Licence Server ==
Lychee runs the license server for the PGI compilers - see /etc/init.d/lmgrd and /opt/pgi/
 
== Maui Resource Manager ==
 
=== Fairshare ===
 
To enable Fairshare, FSPOLICY must be set
 
    FSPOLICY              DEDICATEDPES
    FSDEPTH              10
    FSINTERVAL            24:00:00
    FSDECAY              0.80
 
With the settings above, each fairshare interval is 24 hours, the 10 most recent intervals are included in the statistics, and each older interval is weighted by a further decay factor of 0.8.
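As a rough illustration of how FSDEPTH and FSDECAY interact, the Python sketch below computes the weight given to each interval and a decayed usage total. It assumes that interval 0 is the current one and that interval i is weighted by FSDECAY to the power i; the function names are made up, and Maui's own normalisation of the result is not reproduced.

```python
# Sketch of decayed fairshare usage under the settings shown above.
FSDEPTH = 10    # number of intervals taken into account
FSDECAY = 0.80  # weight multiplier applied per interval of age


def window_weights(depth=FSDEPTH, decay=FSDECAY):
    """Weight of each interval, newest first: 1, decay, decay**2, ..."""
    return [decay ** i for i in range(depth)]


def decayed_usage(usage_per_window, depth=FSDEPTH, decay=FSDECAY):
    """Weighted sum of per-interval usage (newest first).

    Intervals older than `depth` are ignored.
    """
    weights = window_weights(depth, decay)
    return sum(w * u for w, u in zip(weights, usage_per_window[:depth]))


if __name__ == "__main__":
    for age, weight in enumerate(window_weights()):
        print(f"interval {age}: weight {weight:.3f}")
```

Under these assumptions, usage from the oldest counted interval (nine intervals ago) contributes with a weight of 0.8 to the power 9, roughly 0.13.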
 
    FSWEIGHT                1
    FSGROUPWEIGHT          50
    FSUSERWEIGHT            40
    FSQOSWEIGHT            10
 
The fairshare of GROUP, USER and QOS must also be set.

To check the current fairshare (FS) of queued jobs, use:
    sudo diagnose -p
 
=== Job Preemption ===
 
Preemption is the name Maui gives to the ability to suspend running jobs so that higher-priority jobs can run. Once the higher-priority job has completed, the suspended job is restarted.

Preemption requires several different parameters to be set; if any one is missed, preemption will not work. These are:
 
'''PREEMPTPOLICY'''
 
The PREEMPTPOLICY policy has two possible options, namely -
 
  REQUEUE - This terminates the job and requeues the job
  SUSPEND - This suspends the job and restarts it once the job that causes the
            preemption completes
 
'''BACKFILLPOLICY'''
 
The BACKFILLPOLICY has four possible options namely -
 
  FIRSTFIT - This causes jobs to be scheduled based upon the priorities set on the queues, this is the default
  BESTFIT - This will cause the priority of jobs to be changed to allow for best use of the batch system.
  NONE - Disable BACKFILL
  GREEDY -

'''RESERVATIONPOLICY'''
 
The RESERVATIONPOLICY parameter has three possible options namely -
 
  CURRENTHIGHEST
  HIGHEST
  NONE
 
N.B. For preemption to take place, Maui first sends the suspend signal to the job that has to be suspended; on its next iteration the scheduler starts the preemptor job. Do not be surprised to see jobs suspended in the queue while the preemptor jobs are still waiting to run.
 
'''PREEMPTORs and PREEMPTEEs'''

The preemption facilities within Maui are not as flexible as those in other batch systems such as Sun Grid Engine. A queue can be either a preemptor (i.e., a queue whose jobs can preempt others) or a preemptee (i.e., a queue whose jobs can be preempted), but not both. You can define one queue as a preemptor and several as preemptees, or several queues as preemptors and one as a preemptee.
 
'''QOSCFG'''
 
To configure preemption, first activate QOSWEIGHT by setting it to 1 or greater:

  QOSWEIGHT 1

You can then use the QOSCFG parameters to define which queues are preemptors and which are preemptees:

  QOSCFG[short]  QFLAGS=PREEMPTOR
  QOSCFG[long]  QFLAGS=PREEMPTEE
 
'''CLASSCFG'''
 
Once you have defined which queues are preemptors and which are preemptees, you then use CLASSCFG to define priorities. As with the QOSCFG options, you need to turn on the CLASSCFG options by setting CLASSWEIGHT:
 
  CLASSWEIGHT 1
 
You then define a CLASSCFG entry for each queue, giving it a priority and associating a QOSCFG with it using the QDEF parameter:
 
  CLASSCFG[short] QDEF=short PRIORITY=11000
  CLASSCFG[long] QDEF=long PRIORITY=10000
 
N.B. With priorities the higher the number, the higher the priority
 
You can use this command to check if a job has correct PREEMPTOR/PREEMPTEE flags
 
    sudo checkjob <id> |grep Flag
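Because preemption silently fails when any one of the parameters above is missing, a quick check of the configuration text can help. The Python sketch below is an assumption-laden illustration: the function name and the sample configuration are made up, it assumes one parameter per line with <code>#</code> starting a comment, and it only checks that each parameter named in this section appears, not that its value is sensible.

```python
# Parameters this section says must all be set for preemption to work.
REQUIRED = (
    "PREEMPTPOLICY",
    "BACKFILLPOLICY",
    "RESERVATIONPOLICY",
    "QOSWEIGHT",
    "QOSCFG",
    "CLASSWEIGHT",
    "CLASSCFG",
)


def missing_parameters(cfg_text):
    """Return the required parameter names that never appear in cfg_text."""
    seen = set()
    for raw in cfg_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # "QOSCFG[short]" -> "QOSCFG"; plain names are left unchanged.
        key = line.split()[0].split("[", 1)[0]
        seen.add(key.upper())
    return [name for name in REQUIRED if name not in seen]


# Illustrative sample only; the policy values are picked from the options
# listed above and are not a recommendation.
SAMPLE = """
PREEMPTPOLICY     SUSPEND
BACKFILLPOLICY    FIRSTFIT
RESERVATIONPOLICY CURRENTHIGHEST
QOSWEIGHT         1
QOSCFG[short]     QFLAGS=PREEMPTOR
QOSCFG[long]      QFLAGS=PREEMPTEE
CLASSCFG[short]   QDEF=short PRIORITY=11000
CLASSCFG[long]    QDEF=long PRIORITY=10000
"""

if __name__ == "__main__":
    print(missing_parameters(SAMPLE))
```

In the sample, CLASSWEIGHT has been left out on purpose, so it is the one parameter reported as missing.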


== Install SystemImager Image Server (for Imaging Workstations) ==
** AUTOINSTALL_TARBALL_DIR = /systemimager/tarballs
** AUTOINSTALL_TORRENT_DIR = /systemimager/torrents
* Edit /etc/systemimager/pxelinux.cfg/syslinux.cfg to ensure boot CD has enough memory (otherwise we get <font color="red"><code>swapper: page allocation failure. order: 0, model: 0x20</code></font> errors):
** Change ramdisk_size=120000
* Make sure firewall allows incoming connections to port 873/TCP (for serving images over rsync)
* We don't use the tftp server for network booting the workstations as we don't have access to the DHCP server for the MD VLAN, rather a boot CD is created (or Grub is used). See [[CentosWorkstation]] for details.
** This runs a very basic rsync copy of /home/ to cirrus.hpcu.uq.edu.au
** updatedb updates the file database used by the <code>locate</code> command
== Known Issues/Solving Problems ==
=== Gromacs Excessive Log Files ===
A cron job is run every 20 minutes as root to kill jobs that crash creating excessive log files:
<code><pre>
0,20,40 * * * * /opt/c3-4/cexec '/usr/sbin/lsof | /bin/grep --line-buffered "md[0-9]\+.log" | /bin/awk "{if (\$7>21474836480) print \$2;}" | /usr/bin/xargs --verbose -r /bin/kill -9' >& /var/log/GromacsBigLogfileKill.log
</pre></code>
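The filter in the cron job above picks out processes holding open an <code>md<number>.log</code> file larger than 21474836480 bytes (20 GiB) and passes their PIDs to <code>kill -9</code>. The Python sketch below mirrors that selection logic for illustration only; it assumes the usual lsof column order (PID in field 2, size in field 7), the function name is made up, and unlike awk it simply skips lines whose size field is not a plain number.

```python
import re

SIZE_LIMIT = 21474836480  # bytes, i.e. 20 * 1024**3 (20 GiB)

# Same pattern as the grep in the cron job; note the unescaped "."
# matches any character, exactly as it does there.
LOG_PATTERN = re.compile(r"md[0-9]+.log")


def pids_with_big_logs(lsof_lines, limit=SIZE_LIMIT):
    """Return PIDs (as strings) of processes with an oversized md*.log open."""
    pids = []
    for line in lsof_lines:
        if not LOG_PATTERN.search(line):
            continue
        fields = line.split()
        if len(fields) < 7:
            continue
        try:
            size = int(fields[6])  # awk's $7
        except ValueError:
            continue  # non-numeric size field: skipped in this sketch
        if size > limit:
            pids.append(fields[1])  # awk's $2
    return pids


if __name__ == "__main__":
    # Made-up lsof output lines for illustration.
    sample = [
        "mdrun 1234 alice 5w REG 8,1 30000000000 123456 /home/alice/md1.log",
        "mdrun 2222 bob 5w REG 8,1 1024 99 /home/bob/md2.log",
    ]
    print(pids_with_big_logs(sample))
```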
=== High Load/Monitoring Network Traffic ===
The command /usr/sbin/iftop may be run by anyone in groups BMMG, MDGroup, SBeatson, KobeLab. This is to determine what the network traffic is doing (nfs activity can cause a high load). This was enabled by adding the following lines to /etc/sudoers:
<code><pre>
%BMMG    ALL=/usr/sbin/iftop
%MDGroup  ALL=/usr/sbin/iftop
%SBeatson ALL=/usr/sbin/iftop
%KobeLab  ALL=/usr/sbin/iftop
</pre></code>
To read the network usage on the MD VLAN interface on lychee run:
<code><pre>
sudo /usr/sbin/iftop -i eth0
</pre></code>
To read the network usage on the cluster interface on lychee run:
<code><pre>
sudo /usr/sbin/iftop -i eth2
</pre></code>
=== Fedora-DS ===
Fedora-ds runs out of open file descriptors (default 1024) very quickly. The limit has been raised to 8192, which lets the system survive longer, but it still runs out after some months. When this happens, all workstations freeze. Restarting fedora-ds fixes the problem temporarily (<code>/etc/init.d/fedora-ds restart</code>, then wait a little while).
See [[Lychee_sytem#LDAP_server_open_file_descriptor_problem|here]] for reference.
=== NFS ===
If lychee is rebooted, NFS mounts from clients to lychee's NFS exports get stale. To fix this, you can try on the client:
<code><pre>mount -o remount -a</pre></code>
If this doesn't work the client has to be rebooted. Lychee's NFS can be restarted by <code>/etc/init.d/nfs restart</code>.

Latest revision as of 05:19, 9 October 2009

Lychee

lychee is one of our servers. This document provides a full description of its services. If a service is moved to another server, please just move the corresponding description to its page.

Web Server

Lychee acts as a webserver to the cluster (httpd), serving the centos repository. This is configured using chkconfig.

Software

/marksw

/marksw/ contains a large selection of pre-compiled software for use by the group. Each package is found in /marksw/$NAME/$VERSION/. The source code and details of how the software was built is in /marksw/$NAME/$VERSION/SOURCE. In the bash (csh) configuration file /marksw/BASHRC (/marksw/CSHRC), is a simple shell function, module, which can be used in the following ways:

  1. module avail
  2. module $NAME/$VERSION ...

The first use lists the possible $NAME/$VERSION options for the second use. The second use simply adds for each $NAME/$VERSION argument, /marksw/$NAME/$VERSION/bin and /marksw/$NAME/$VERSION/sbin to the beginning of the contents of the PATH environment variable; /marksw/$NAME/$VERSION/lib64 and /marksw/$NAME/$VERSION/lib to the beginning of the contents of the LD_LIBRARY_PATH environment variable; and /marksw/$NAME/$VERSION/man to the beginning of the contents of the MANPATH environment variable.
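The path handling described above can be sketched in Python. This is not the actual shell function from /marksw/BASHRC: the function names and the package name <code>foo/1.0</code> are hypothetical, and the relative order of the prepended directories (bin before sbin, lib64 before lib) is an assumption.

```python
def module_paths(name_version, root="/marksw"):
    """Directories that `module NAME/VERSION` prepends, per variable."""
    base = f"{root}/{name_version}"
    return {
        "PATH": [f"{base}/bin", f"{base}/sbin"],
        "LD_LIBRARY_PATH": [f"{base}/lib64", f"{base}/lib"],
        "MANPATH": [f"{base}/man"],
    }


def apply_module(env, name_version, root="/marksw"):
    """Return a copy of env with the module's directories prepended."""
    new_env = dict(env)
    for var, dirs in module_paths(name_version, root).items():
        existing = new_env.get(var, "")
        # Keep the old value at the end; skip it if the variable was unset.
        parts = dirs + ([existing] if existing else [])
        new_env[var] = ":".join(parts)
    return new_env


if __name__ == "__main__":
    # "foo/1.0" is a made-up package used only for illustration.
    env = apply_module({"PATH": "/usr/bin"}, "foo/1.0")
    for var in ("PATH", "LD_LIBRARY_PATH", "MANPATH"):
        print(f"{var}={env[var]}")
```

Passing several NAME/VERSION arguments to module corresponds to calling <code>apply_module</code> once per argument on the result of the previous call.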

The module command is sourced in the global bash configuration file /etc/bashrc and in the global csh configuration file /etc/csh.local.

Specific developmental versions

The directories:

  1. /marksw/gromos.ethz.ch_repos_forcefields/
  2. /marksw/gromos.ethz.ch_repos_gromos++/
  3. /marksw/gromos.ethz.ch_repos_gromos96/
  4. /marksw/gromos.ethz.ch_repos_gromos_doc/
  5. /marksw/gromos.ethz.ch_repos_gromosXXc++/
  6. /marksw/gromos.ethz.ch_repos_gromosXXF77/

contain dated subdirectories holding the corresponding versions of the source code and related files from the IGC Group at ETH. To get the latest source code, copy one of the subdirectories and run svn up inside the copy.

The corresponding binaries for the various source codes are in:

  1. /marksw/gromos++ (corresponding to /marksw/gromos.ethz.ch_repos_gromos++/)
  2. /marksw/gromosXXc++ (corresponding to /marksw/gromos.ethz.ch_repos_gromosXXc++/)

Brisbane MD Group Topologies

Topologies created by the Brisbane MD Group (and not yet made public) are found in /marksw/BrisbaneMDGroupForceFields/.


Yum

Yum Repositories

Repository Software Installed

zsh yum-utils OpenIPMI OpenIPMI-tools openbabel openbabel-devel

/opt

Rasmol

  • Download source from [1]
  • tar -xzvf RasMol_Latest.tar.gz
  • cd Rasmol_2.7.4.2_23Mar08/src/
  • Edit Imakefile:
  • LDLIBS = -L/usr/lib64 '...'
  • Make sure /usr/lib64/libforms.so exists (package xforms-devel doesn't actually create symbolic links correctly)
  • ./build_all.sh
  • Move rasmol_32BIT to /opt/bin/rasmol

Acrobat

VMD

  • Done by Mitch/AJ

Pymol

  • Done by Mitch/AJ

AMD Core Math Library (ACML)

  • Download from here
  • Extract & run ./install-acml-4-2-0-gfortran-64bit.sh
    • Licence? accept
    • Location (/opt/acml4.2.0)? [Enter]

TwistedConch (Python SSH library)

OpenBabel Python Bindings

47c47
< _descdict = _getplugins(ob.OBDescriptor.FindType, descs)
---
> _descdict = { 'LogP':ob.OBLogP, 'MR':ob.OBMR, 'TPSA':ob.OBPSA } #_getplugins(ob.OBDescriptor.FindType, descs)
53c53
< _forcefields = _getplugins(ob.OBForceField.FindType, forcefields)
---
> _forcefields = { 'mmff94':ob.OBForceField("mmff94") } # _getplugins(ob.OBForceField.FindType, forcefields)
56c56
< _operations = _getplugins(ob.OBOp.FindType, operations)
---
> #_operations = _getplugins(ob.OBOp.FindType, operations)
334,335c334,335
<         if self.dim != 3:
<             self.make3D(forcefield)
---
>         #if self.dim != 3:
>         #    self.make3D(forcefield)
367c367
<         _operations['Gen3D'].Do(self.OBMol)
---
>         print 'Pybel.py: 367: Gen3D not available!' #_operations['Gen3D'].Do(self.OBMol)

FedoraDS (LDAP)

Add new user

  • Find UID that has not been used: /marksw/scripts/0.1/nextUID2.py lists the UIDs that have been used
  • Find GID that will be associated with new user: MDGroup = 500, BMMG = 540, SBeatson = 600, KobeLab = 520
  • ssh lychee -X
  • cd /opt/fedora-ds/
  • ./startconsole -u admin -a http://lychee.md.smms.uq.edu.au:60000/
    • lychee.md.smms.uq.edu.au -> Directory
      • Directory
        • md -> People -> Add User
          • Fill in info (including UID and GID in the POSIX tab)
            • Note: User names should be all lower case consisting of first name concatenated with the first letter of the surname
  • sudo rsync -a /etc/skel/ /home/NEWUSERNAME/
  • sudo chown -R NEWUSERNAME:NEWUSERSGID /home/NEWUSERNAME/

FlexLM Licence Server

Lychee runs the license server for the PGI compilers - see /etc/init.d/lmgrd and /opt/pgi/

Maui Resource Manager

Fairshare

To enable Fairshare, FSPOLICY must be set

   FSPOLICY              DEDICATEDPES
   FSDEPTH               10
   FSINTERVAL            24:00:00
   FSDECAY               0.80

With the settings above, each fairshare interval is 24 hours, the 10 most recent intervals are included in the statistics, and each older interval is weighted by a further decay factor of 0.8.

   FSWEIGHT                1
   FSGROUPWEIGHT           50
   FSUSERWEIGHT            40
   FSQOSWEIGHT             10

The fairshare of GROUP, USER and QOS must also be set.

To check the current fairshare (FS) of queued jobs, use:

   sudo diagnose -p

Job Preemption

Preemption is the name Maui gives to the ability to suspend running jobs so that higher-priority jobs can run. Once the higher-priority job has completed, the suspended job is restarted.

Preemption requires several different parameters to be set; if any one is missed, preemption will not work. These are:

PREEMPTPOLICY

The PREEMPTPOLICY policy has two possible options, namely -

 REQUEUE - This terminates the job and requeues the job
 SUSPEND - This suspends the job and restarts it once the job that causes the
           preemption completes

BACKFILLPOLICY

The BACKFILLPOLICY has four possible options namely -

FIRSTFIT - This causes jobs to be scheduled based upon the priorities set on the queues, this is the default
BESTFIT - This will cause the priority of jobs to be changed to allow for best use of the batch system.
NONE - Disable BACKFILL 
GREEDY - 

RESERVATIONPOLICY

The RESERVATIONPOLICY parameter has three possible options namely -

 CURRENTHIGHEST
 HIGHEST
 NONE

N.B. For preemption to take place, Maui first sends the suspend signal to the job that has to be suspended; on its next iteration the scheduler starts the preemptor job. Do not be surprised to see jobs suspended in the queue while the preemptor jobs are still waiting to run.

PREEMPTORs and PREEMPTEEs

The preemption facilities within Maui are not as flexible as those in other batch systems such as Sun Grid Engine. A queue can be either a preemptor (i.e., a queue whose jobs can preempt others) or a preemptee (i.e., a queue whose jobs can be preempted), but not both. You can define one queue as a preemptor and several as preemptees, or several queues as preemptors and one as a preemptee.

QOSCFG

To configure preemption, first activate QOSWEIGHT by setting it to 1 or greater:

QOSWEIGHT 1 

You can then use the QOSCFG parameters to define which queues are preemptors and which are preemptees:

QOSCFG[short]  QFLAGS=PREEMPTOR
QOSCFG[long]   QFLAGS=PREEMPTEE

CLASSCFG

Once you have defined which queues are preemptors and which are preemptees, you then use CLASSCFG to define priorities. As with the QOSCFG options, you need to turn on the CLASSCFG options by setting CLASSWEIGHT:

 CLASSWEIGHT 1 

You then define a CLASSCFG entry for each queue, giving it a priority and associating a QOSCFG with it using the QDEF parameter:

 CLASSCFG[short] QDEF=short PRIORITY=11000 
 CLASSCFG[long]	QDEF=long PRIORITY=10000 

N.B. With priorities the higher the number, the higher the priority

You can use this command to check if a job has correct PREEMPTOR/PREEMPTEE flags

   sudo checkjob <id> |grep Flag

Install SystemImager Image Server (for Imaging Workstations)

The following relates to the once-off installation of systemimager for serving images. See CentosWorkstation for information as to the actual use of systemimager for getting the images onto the image server for serving.

Backups

  • Root's crontab on lychee states:
0   23  *   *   sun  /marksw/scripts/0.1/backup_dir /home/
0   22  *   *   sat  updatedb 
    • This runs a very basic rsync copy of /home/ to cirrus.hpcu.uq.edu.au
    • updatedb updates the file database used by the locate command

Known Issues/Solving Problems

Gromacs Excessive Log Files

A cron job is run every 20 minutes as root to kill jobs that crash creating excessive log files:

0,20,40 * * * * /opt/c3-4/cexec '/usr/sbin/lsof | /bin/grep --line-buffered "md[0-9]\+.log" | /bin/awk "{if (\$7>21474836480) print \$2;}" | /usr/bin/xargs --verbose -r /bin/kill -9' >& /var/log/GromacsBigLogfileKill.log

High Load/Monitoring Network Traffic

The command /usr/sbin/iftop may be run by anyone in groups BMMG, MDGroup, SBeatson, KobeLab. This is to determine what the network traffic is doing (nfs activity can cause a high load). This was enabled by adding the following lines to /etc/sudoers:

%BMMG     ALL=/usr/sbin/iftop
%MDGroup  ALL=/usr/sbin/iftop
%SBeatson ALL=/usr/sbin/iftop
%KobeLab  ALL=/usr/sbin/iftop

To read the network usage on the MD VLAN interface on lychee run:

sudo /usr/sbin/iftop -i eth0

To read the network usage on the cluster interface on lychee run:

sudo /usr/sbin/iftop -i eth2

Fedora-DS

Fedora-ds runs out of open file descriptors (default 1024) very quickly. The limit has been raised to 8192, which lets the system survive longer, but it still runs out after some months. When this happens, all workstations freeze. Restarting fedora-ds fixes the problem temporarily (/etc/init.d/fedora-ds restart, then wait a little while).

See here for reference.

NFS

If lychee is rebooted, NFS mounts from clients to lychee's NFS exports get stale. To fix this, you can try on the client:

mount -o remount -a

If this doesn't work the client has to be rebooted. Lychee's NFS can be restarted by /etc/init.d/nfs restart.