Alarm Handling: Difference between revisions

Revision as of 00:42, 16 July 2007

What to do if an alarm in 822 (HPC room) goes off

Users:

  1. Mute the alarm
  2. Ring Derek on ext. 53826 or mobile 0411015593 (if he is not there, leave a message).
  3. Ring AJ (0417063485), Matt (0424037005), Mitch (0414800280), or Marlies (0404262445) / ext. 63996. Leave a message there too if you cannot reach us.
  4. If you cannot reach any of us, ring security on ext. 51234; they can let you into the room. (The phone in the computer room can ring mobiles.) There is also plenty of information on the door of 822 about what to do. Security might try to ring Derek or one of us again. If you still cannot get through, switch the racks off (RACKS ONLY!). Then write an e-mail informing at least me and Derek of what has happened and what action you have taken. The admins will sort things out when we are back in.

If the air con is also sounding an alarm, press its Escape button once to mute it. Note down the time the alarm went off; the text on the air con display should show this. Then press the Escape button again and note down the temperature it shows. Room 822 is usually at about 18 deg C. The racks will shut down automatically if it gets too hot.

Admins with remote access:

  1. Log in to grape.
  2. Run `/usr/local/bin/node-temperatures.sh' to query the service processes for temperature stats. If the CPUs are above ~55 deg C or the hard disks are above ~35 deg C, then
  3. Run `sudo cexec uptime' to check which nodes are not being used (alternatively, find the subset of merlot04-30 that is not printed by `qstat -f | grep exec_host') and shut them down with `sudo ssh NODENAME shutdown -h now'.
  4. If the CPU or hard disk temperature is still high after a few minutes, run `sudo cexec shutdown -h now' to shut down all the nodes. As a courtesy to the people whose jobs will be killed, save the outputs of `showq' and `qstat' before doing so.
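
The alternative in step 3 (finding the nodes in merlot04-30 that do not appear in any exec_host line) can be sketched as a small shell helper. This is only an illustration, not an installed script: the function names all_nodes, busy_nodes and idle_nodes are made up here, and it assumes that hostnames are zero-padded (merlot04 ... merlot30) and that each exec_host line lists hosts in the form merlot05/0+merlot07/1. Some qstat versions wrap long exec_host lines, in which case nodes on the continuation lines would be missed.

```shell
# Hypothetical helpers; names and output format are assumptions, see above.

# Print the candidate node names merlot04 .. merlot30 (range from step 3).
all_nodes() {
    i=4
    while [ "$i" -le 30 ]; do
        printf 'merlot%02d\n' "$i"
        i=$((i + 1))
    done
}

# Read `qstat -f' output on stdin; print each node named in an exec_host line.
busy_nodes() {
    grep exec_host | grep -o 'merlot[0-9][0-9]*' | sort -u
}

# Read `qstat -f' output on stdin; print the nodes that run no job.
idle_nodes() {
    busy=$(busy_nodes || true)
    for n in $(all_nodes); do
        # Keep the node only if it is not an exact match for a busy node.
        printf '%s\n' "$busy" | grep -qx "$n" || printf '%s\n' "$n"
    done
}

# Intended use on grape (not run here):
#   qstat -f | idle_nodes
```

The list it prints is only a set of candidates; it is worth cross-checking against `sudo cexec uptime' before running `sudo ssh NODENAME shutdown -h now' on any of them.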

Admins with local access:

  1. Check the air conditioning temperature graph; it will give a clue as to any problems with the chilled water.
  2. Follow the remote access procedures above.
  3. Check manually that the nodes have shut down completely, then power off the rack using the black and white switches on the right-hand side.

What to do if power/chilled water will be turned off

  1. Contact AJ, Matt or Mitch to turn off the rack. If they are unreachable, then
  2. Enter the HPC room (call security on 51234 if you need access) and power off the rack manually using the black and white switches on the right-hand side.