Booting the Cage
The cage we are using is cx24y2c2. These steps are performed on the RAS front-end server rscsmw2.
Add CIT to path
Ensure your initialization scripts source /cluster_new/config/cluster.env
rscsmw2:/home/kbferre $ . /cluster_new/config/cluster.env
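For the initialization-script route, a guarded source line avoids login errors on hosts where the file is absent. This is a minimal sketch, assuming a Bourne-style startup file such as ~/.profile. The choice of startup file and the CLUSTER_ENV variable name are assumptions, not part of the site configuration.

```shell
# Hypothetical snippet for a shell startup file such as ~/.profile.
# CLUSTER_ENV is an illustrative variable name; the path is the one given above.
CLUSTER_ENV=/cluster_new/config/cluster.env

# Source the CIT environment only if the file exists and is readable,
# so that logins on other hosts do not fail.
if [ -r "$CLUSTER_ENV" ]; then
    . "$CLUSTER_ENV"
fi
```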
Sudo to Root
rscsmw2:/home/kbferre $ sudo su
Set/Check VM for our Cage
Check to make sure we run a separate VM for this cage
rscsmw2:/home/kbferre $ am --flat --rs_vm cx24y2c2
Set VM for our Cage
rscsmw2:/home/kbferre $ am --set --rs_vm cx24y2c2/ cx24y2c2
Set/Check VM Repository for our cage
Next, make sure the cage runs our VM. The following shows the VM repository the cage will use:
rscsmw2:/home/kbferre $ am --flat --cray_bin cx24y2c2
If this does not point to our software install directory then follow the next step
Set VM Repository for our cage
rscsmw2:/home/kbferre $ am --set --cray_bin "/usr/local5/kbferre/${RS_SRC_DIR}/install" cx24y2c2
Check to ensure the set succeeded:
rscsmw2:/home/kbferre $ am --flat --cray_bin cx24y2c2
Set VM Repository for node 95 in our cage
rscsmw2:/home/kbferre $ am --set --cray_bin "/usr/local5/kbferre/${RS_SRC_DIR}/install" n95.cx24y2
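The check, set, and re-check steps above can be folded into one helper. This is only a sketch: it uses just the am flags shown in this guide, and it assumes that `am --flat --cray_bin` prints the configured path somewhere in its output (the real output format is not shown here). The function name ensure_cray_bin is hypothetical.

```shell
# Sketch: set --cray_bin for a target only when the current value differs.
# Assumption: `am --flat --cray_bin TARGET` prints the configured path
# somewhere in its output; the exact format is not documented in this guide.
ensure_cray_bin() {
    target=$1
    want=$2
    current=$(am --flat --cray_bin "$target")
    case "$current" in
        *"$want"*)
            # Substring match: the wanted path already appears in the output.
            echo "$target: --cray_bin already points to $want"
            ;;
        *)
            am --set --cray_bin "$want" "$target"
            # Re-check so that a failed set is visible right away.
            am --flat --cray_bin "$target"
            ;;
    esac
}
```

It would be called as `ensure_cray_bin cx24y2c2 "/usr/local5/kbferre/${RS_SRC_DIR}/install"`, and again with n95.cx24y2 as the target for the single node.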
Check Node Roles
Ensure that the roles of the nodes are correct. /cluster_new/vms/cx24y2c2/nodelist contains the roles for our cage: i denotes a service node, c denotes a compute node, and n denotes that no node exists in the slot.
Example nodelist file
Use the following to view the role of a node:
rscsmw2:/home/kbferre $ am --flat --rs_role n32.cx24y2
The following sets the role of a node to service
rscsmw2:/home/kbferre $ am --set --rs_role i n32.cx24y2
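When reading roles back, it can help to expand the single-letter codes described above. The helper below is a small sketch. The function name role_name is hypothetical, and it only translates a letter; it does not parse the nodelist file, whose layout is not shown in this guide.

```shell
# Map a role letter (i, c, n) to its meaning as described in this section.
role_name() {
    case "$1" in
        i) echo "service node" ;;
        c) echo "compute node" ;;
        n) echo "no node in slot" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```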
Prepare Linux Node
Now we prepare the software for our Linux Service node: n64.cx24y2
Create the boot parameters file
rscsmw2:/home/kbferre $ makeboot_sandia n64.cx24y2
Move the boot parameters file
rscsmw2:/home/kbferre $ makeboot_parameters_2.6 n64.cx24y2
The boot parameters can be checked by looking at ${RS_SRC_DIR}/install/linux/bootable/parameters. An example is here:
Power Cycle the Cage
A power cycle is not usually required and should be avoided where possible, since it usually causes more problems than it solves. Use it only as a last resort.
Power off the cage
rscsmw2:/home/kbferre $ rs_power --off cx24y2c2
Wait a few minutes (on the suggestion of Cray), and then power back on:
rscsmw2:/home/kbferre $ rs_power --on cx24y2c2
It may take a few minutes for the management processors on the cage to initialize fully.
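The off, wait, on sequence above could be wrapped as follows. This is a sketch under stated assumptions: the helper names run and power_cycle, the DRY_RUN switch, and the POWER_WAIT variable are all illustrative, and the 180-second default is an assumed stand-in for "a few minutes", not a figure from Cray.

```shell
# Run a command, or only print it when DRY_RUN=1 (illustrative helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Power a cage off, wait, and power it back on.
# POWER_WAIT is in seconds; 180 is an assumed default.
power_cycle() {
    cage=$1
    run rs_power --off "$cage"
    run sleep "${POWER_WAIT:-180}"
    run rs_power --on "$cage"
}
```

Setting DRY_RUN=1 before calling `power_cycle cx24y2c2` prints the three commands without touching the hardware, which is a cheap way to confirm the cage name first.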
Start the Harness
Note: If you were already running, make sure your wamd and any console processes are killed before starting over. WAMD will be started in the next step.
Check if the harness is already running
rscsmw2:/home/kbferre $ harness --status cx24y2c2
If the harness is running, continue with the next step to stop it. If it is not running, skip ahead to starting the harness.
Stop the harness software
rscsmw2:/home/kbferre $ harness --stop cx24y2c2
Check to make sure it stopped
rscsmw2:/home/kbferre $ harness --status cx24y2c2
Start the Harness
rscsmw2:/home/kbferre $ harness --init cx24y2c2
Check to make sure it is running
rscsmw2:/home/kbferre $ harness --status cx24y2c2
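For the case where the harness is already running, the stop, check, start, check sequence above can be scripted. This is a minimal sketch: the function name harness_restart and the HARNESS override variable are hypothetical, and the status output still has to be read by eye because this guide does not show what harness --status prints or returns. It does not kill wamd or console processes; that remains a manual step.

```shell
# HARNESS can be overridden (for example with "echo harness") to preview
# the commands; by default the real harness command is used.
HARNESS=${HARNESS:-harness}

# Restart the harness on a cage that currently has it running.
# The two --status calls are there for the operator to inspect.
harness_restart() {
    cage=$1
    $HARNESS --stop "$cage"
    $HARNESS --status "$cage"
    $HARNESS --init "$cage"
    $HARNESS --status "$cage"
}
```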
Start Harness on Slot 7 (nodes 92,93,94,95)
rscsmw2:/home/kbferre $ harness --status cx24y2c2s7
Start WAMD
In a separate window, and not as root, start wamd:
rscsmw2:/home/kbferre $ harness --wamd cx24y2c2
You will not get this window back. Hardware errors on the cage will be logged here.
Initialize the hardware
Initialize cage cx24y2c2
rscsmw2:/home/kbferre $ rs_init cx24y2c2
Initialize single node 80 in cage cx24y2c2
rscsmw2:/home/kbferre $ rs_init n80.cx24y2
Note: If another node exists on the board with node 80, it will also get initialized
As this starts up it prints a bunch of messages to the wamd window of the form
[SNIP]
deadstart (73 2) (77 3)
deadstart (74 2) (78 3)
deadstart (68 2) (72 3)
deadstart (69 2) (73 3)
deadstart (93 0) (94 5)
deadstart (94 0) (95 5)
deadstart (70 2) (74 3)
[SNIP]
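Because initializing one node also initializes any other node on the same board, it can be useful to list a node's neighbours first. The sketch below rests on an inferred layout that this guide does not state outright: 32 nodes per cage and 4 nodes per slot, which is consistent with slot 7 of cage 2 holding nodes 92 to 95. It also assumes that "board" and "slot" refer to the same unit. The function names are hypothetical.

```shell
# Assumed layout (inferred, not documented here): 32 nodes per cage,
# 4 nodes per slot, node numbers counted across the whole cabinet.
NODES_PER_CAGE=32
NODES_PER_SLOT=4

# Print the slot number within its cage for a node number.
node_slot() {
    echo $(( ($1 % NODES_PER_CAGE) / NODES_PER_SLOT ))
}

# Print every node number that shares a slot with the given node.
slot_nodes() {
    first=$(( $1 - ($1 % NODES_PER_SLOT) ))
    i=0
    list=""
    while [ "$i" -lt "$NODES_PER_SLOT" ]; do
        list="$list$(( first + i )) "
        i=$(( i + 1 ))
    done
    # Drop the trailing space before printing.
    echo "${list% }"
}
```

Under these assumptions, `slot_nodes 80` lists the nodes that `rs_init n80.cx24y2` would touch.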
Boot Service Node
rscsmw2:/home/kbferre $ rs_boot --l n64.cx24y2
After a certain amount of time this will look like a typical Linux boot. A good sign is when you see the HOSTNAME output to the console similar to:
[SNIP]
Initrd: done.
**********************************************************
HOSTNAME: rsclogin112
**********************************************************
1: lo: <LOOPBACK> mtu 16436 qdisc noop
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0: <NOARP> mtu 1480 qdisc noop
link/sit 0.0.0.0 brd 0.0.0.0
3: ss: <NOARP> mtu 16000 qdisc noop qlen 8
link/ether 00:00:00:00 brd ff:ff:ff:ff
[SNIP]
You can press Ctrl-C in this window to get the prompt back after the login node has finished booting.
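If the console output has been saved to a file, the HOSTNAME line can be pulled out rather than spotted by eye. This is a sketch only: the function name console_hostname is hypothetical, and how the console output gets captured to a file is left to the operator, since this guide does not describe it.

```shell
# Print the hostname announced in a captured console log.
# Reads the log from standard input; prints nothing if no HOSTNAME line exists.
console_hostname() {
    sed -n 's/^[[:space:]]*HOSTNAME:[[:space:]]*//p' | head -n 1
}
```

For example, `console_hostname < console.log` (console.log being a placeholder file name) prints the hostname once the boot has reached that point, and prints nothing before then.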
Boot Compute Nodes
rscsmw2:/home/kbferre $ rs_boot --qk cx24y2c2
This can take 10-15 minutes and, as far as I can tell, prints little useful information to the screen.
Typical Output
Boot Compute Node 95
rscsmw2:/home/kbferre $ rs_boot --qk n95.cx24y2