Alarm Handling: Difference between revisions

Line 1: Line 1:
== What to do if an alarm in 822 (HPC room) goes off ==
====== Cluster Room Alarm Handling ======
The cluster room (822) gets hot when the air-conditioning stops working. This normally happens because the chilled water supply gets cut off.
There are two high temperature alarms in the room:


=== Users: ===
  - The alarm on the air-conditioning unit (long continuous whine)
# Mute the alarm
  - The alarm for the room (red flashing light + sound)
# Ring Derek on ext. 53826 or on his mobile (number on the 822 door). If he is not there, leave a message.
# Ring the admins (David, Itamar, AJ, Matt, Mitch, or Marlies) on ext. 63996, and leave a message there too if you cannot reach us. Mobile numbers are on the 822 door.
# If you cannot reach any of us, ring security on ext. 51234. They can let you into the room (the phone in the computer room can ring mobiles). Security might try to ring Derek or one of us again. If you still cannot get through, switch the rack off (black and white switches on the right-hand side). Then write an e-mail informing at least me and Derek of what has happened and what action you have taken. Admins will then sort things out when we are back in.


If there is also an alarm coming from the air con, press the Escape button once to mute it. Note down the time it went off; the text on the air con window should tell you that. Then press the Escape button again and note down the temperature it shows. Room 822 is usually at ~18 ºC. Rack power will automatically be cut off if it gets too hot.
If one of the alarms sounds please do the following:


=== Admins with remote access: ===
  - Mute the alarm
# Login to grape
  - Ring someone who is able to shut down the servers: Matt (0424 037 005), Roy (0432 375 635). If they cannot be contacted, or if they are unable to shut down the servers remotely, enter the room and press the power button {{icon-power-button.gif}} on each of the computers in the racks for 1 sec. This should trigger an automatic shutdown within. If some machines are still powered on after 5 mins, hold down the power buttons on those machines until they turn off (usually about 5-10 sec).
# Run `/usr/local/bin/node-temperatures.sh' to query the service processes for temperature stats. If the CPUs are above ~55 ºC or the hard disks are above ~35 ºC, continue with the steps below.
  - If you cannot get access to the room, call security on 51234.
# Run `sudo cexec uptime' to check which nodes are not being used (alternatively, find the subset of merlot04-30 that is not printed by `qstat -f | grep exec_host'), and shut them down with `sudo ssh NODENAME shutdown -h now'.
# If the CPU or hard disk temperature is still high after a few minutes, run `sudo cexec shutdown -h now' to shut down all the nodes. As a courtesy to the people whose jobs will be killed, save the outputs of `showq' and `qstat'.
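
The threshold check and the idle-node lookup in the steps above could be sketched as two small shell helpers. This is only an illustration, not a script that exists on grape: the function names `too_hot` and `idle_nodes` are made up here, the temperatures are passed in by hand because the output format of `node-temperatures.sh' is not described on this page, and node names are assumed to be zero-padded to two digits (merlot04 ... merlot30).

```shell
#!/bin/sh
# Sketch only: helper functions, nothing here touches the cluster.

# too_hot KIND TEMP
# Succeeds if TEMP (whole degrees C) is above the rough limits quoted in the
# steps above (CPU ~55, hard disk ~35). Returns 2 for an unknown KIND.
too_hot() {
    case "$1" in
        cpu)  [ "$2" -gt 55 ] ;;
        disk) [ "$2" -gt 35 ] ;;
        *)    return 2 ;;
    esac
}

# idle_nodes
# Reads text on stdin (for example the exec_host lines from qstat -f) and
# prints every node in merlot04..merlot30 that is NOT mentioned in it.
# Assumption: node names are zero-padded to two digits.
idle_nodes() {
    busy=$(cat)
    i=4
    while [ "$i" -le 30 ]; do
        node=$(printf 'merlot%02d' "$i")
        # -w matches whole words, so merlot04 does not match merlot040.
        if ! printf '%s\n' "$busy" | grep -qw "$node"; then
            printf '%s\n' "$node"
        fi
        i=$((i + 1))
    done
}
```

With these defined, `qstat -f | grep exec_host | idle_nodes' would list the candidate nodes. Review the list by hand before running `sudo ssh NODENAME shutdown -h now' on any of them.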


=== Admins with local access: ===
In the event that there is no one to turn off the machines, the power to the room will automatically be cut off if it gets too hot (>35 ºC).
# Check air conditioning temperature graph -- it will give a clue as to any problems with chilled water.
# Follow remote access procedures above
# Check manually that the nodes have shut down completely. Power off the rack manually using the black and white switches on the right-hand side.
 
== What to do if power/chilled water will be turned off ==
# Contact Admins to turn off the rack. If they are unreachable, then
# Enter HPC room (call security on 51234 if you need access). Power off the rack manually using the black and white switches on the right hand side.

Revision as of 02:14, 2 April 2009

Cluster Room Alarm Handling

The cluster room (822) gets hot when the air-conditioning stops working. This normally happens because the chilled water supply gets cut off. There are two high temperature alarms in the room:

 - The alarm on the air-conditioning unit (long continuous whine)
 - The alarm for the room (red flashing light + sound)

If one of the alarms sounds please do the following:

 - Mute the alarm
 - Ring someone who is able to shut down the servers: Matt (0424 037 005), Roy (0432 375 635). If they cannot be contacted, or if they are unable to shut down the servers remotely, enter the room and press the power button on each of the computers in the racks for 1 sec. This should trigger an automatic shutdown within. If some machines are still powered on after 5 mins, hold down the power buttons on those machines until they turn off (usually about 5-10 sec).
 - If you cannot get access to the room, call security on 51234.

In the event that there is no one to turn off the machines, the power to the room will automatically be cut off if it gets too hot (>35 ºC).