Alarm Handling

From MDWiki
Revision as of 01:04, 16 July 2007 by Matt (talk | contribs) (→‎Users:)

What to do if an alarm in 822 (HPC room) goes off

Users:

  1. Mute the alarm
  2. Ring Derek on ext. 53826 or on his mobile (the number is on the 822 door). If he does not answer, leave a message.
  3. Ring the admins (David, Itamar, AJ, Matt, Mitch, or Marlies) on ext. 63996, and leave a message there too if you cannot reach us. Mobile numbers are on the 822 door.
  4. If you cannot reach any of us, ring security on ext. 51234. They can let you into the room (the phone in the computer room can ring mobiles). Security might try to ring Derek or one of us again. If you still cannot get through, switch the rack off (black and white switches on the right-hand side). Then write an e-mail informing at least me and Derek of what has happened and what action you have taken. Admins will sort things out when we are back in.

If there is also an alarm coming from the air con, press the Escape button once to mute it. Note down the time the alarm went off; the text in the air con display window should show it. Then press the Escape button again and note down the temperature it shows. Room 822 is usually at about 18 deg C. Rack power will be cut off automatically if it gets too hot.

Admins with remote access:

  1. Login to grape
  2. Run `/usr/local/bin/node-temperatures.sh' to query the service processors for temperature stats. If the CPUs are above ~55 °C or the hard disks are above ~35 °C, continue with the next step.
  3. Run `sudo cexec uptime' to check which nodes are not being used (alternatively, find the subset of merlot04-30 that is not printed by `qstat -f | grep exec_host') and shut them down with `sudo ssh NODENAME shutdown -h now'.
  4. If the CPU or hard disk temperature is still high after a few minutes, run `sudo cexec shutdown -h now' to shut down all the nodes. As a courtesy to the people whose jobs will be killed, save the outputs of `showq' and `qstat'.
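
The two manual parts of the remote procedure (working out which of merlot04-30 are idle, and saving the queue state) could be scripted roughly as below. This is only a sketch, not something installed on grape: the function names `idle_nodes' and `save_queue_state' are made up for illustration, the node names are assumed to be zero-padded to two digits (merlot04 ... merlot30), and `grep -o' is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Hypothetical helpers; names and file layout are assumptions.

# Read `qstat -f' output on standard input and print every node in
# merlot04..merlot30 that does NOT appear on an exec_host line.
idle_nodes() {
    # Collect the distinct node names that are running jobs.
    busy=$(grep exec_host | grep -o 'merlot[0-9][0-9]*' | sort -u)
    i=4
    while [ "$i" -le 30 ]; do
        node=$(printf 'merlot%02d' "$i")
        # -x matches whole lines only, so merlot04 cannot match merlot040.
        if ! printf '%s\n' "$busy" | grep -qx "$node"; then
            printf '%s\n' "$node"
        fi
        i=$((i + 1))
    done
}

# Save the output of showq and qstat into timestamped files in the
# given directory (default: current directory) and print their paths.
save_queue_state() {
    dir=${1:-.}
    stamp=$(date +%Y%m%d-%H%M%S)
    mkdir -p "$dir" || return 1
    showq > "$dir/showq-$stamp.txt"
    qstat > "$dir/qstat-$stamp.txt"
    printf '%s\n' "$dir/showq-$stamp.txt" "$dir/qstat-$stamp.txt"
}
```

On grape this might be used as `qstat -f | idle_nodes' to get the list of candidates for `sudo ssh NODENAME shutdown -h now', and `save_queue_state ~/alarm-logs' (an example directory) before running `sudo cexec shutdown -h now'. Check the list by eye before shutting anything down.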

Admins with local access:

  1. Check the air conditioning temperature graph; it will give a clue as to any problems with the chilled water.
  2. Follow the remote access procedure above.
  3. Check manually that the nodes have shut down completely. Power off the rack manually using the black and white switches on the right-hand side.

What to do if power/chilled water will be turned off

  1. Contact the admins to turn off the rack.
  2. If they are unreachable, enter the HPC room (call security on ext. 51234 if you need access) and power off the rack manually using the black and white switches on the right-hand side.