Booting the Cage

The cage we are using is cx24y2c2. These steps are performed on the RAS front-end server rscsmw2.

Add CIT to path

Ensure your initialization scripts source /cluster_new/config/cluster.env

rscsmw2:/home/kbferre $ . /cluster_new/config/cluster.env
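
One way to make this permanent is a guarded line in a shell start-up file. The following is only a sketch: the helper name source_cluster_env is made up for illustration, and only the default path comes from this page.

```shell
# Hypothetical helper for a shell start-up file (e.g. ~/.bashrc).
# Only the default path below is taken from this page.
source_cluster_env() {
    env_file="${1:-/cluster_new/config/cluster.env}"
    if [ -r "$env_file" ]; then
        # Source the file into the current shell
        . "$env_file"
    else
        echo "warning: cannot read $env_file" >&2
        return 1
    fi
}

# In the start-up file, call it once:
#   source_cluster_env
```

The guard keeps a login from failing noisily when the file system holding cluster.env is not mounted.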

Sudo to Root

rscsmw2:/home/kbferre $ sudo su

Set/Check VM for our Cage

Check to make sure we run a separate VM for this cage

  rscsmw2:/home/kbferre $ am --flat --rs_vm cx24y2c2

Set VM for our Cage

  rscsmw2:/home/kbferre $ am --set --rs_vm cx24y2c2/ cx24y2c2

Set/Check VM Repository for our cage

We first must make sure we run our VM. The following allows us to view the VM the cage will be using

  rscsmw2:/home/kbferre $ am --flat --cray_bin cx24y2c2

Typical Output

If this does not point to our software install directory then follow the next step

Set VM Repository for our cage

  rscsmw2:/home/kbferre $ am --set --cray_bin "/usr/local5/kbferre/${RS_SRC_DIR}/install" cx24y2c2

Check to ensure the set succeeded:

  rscsmw2:/home/kbferre $  am --flat --cray_bin cx24y2c2

Typical Output

Set VM Repository for node 95 in our cage

  rscsmw2:/home/kbferre $ am --set --cray_bin "/usr/local5/kbferre/${RS_SRC_DIR}/install" n95.cx24y2
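
The set-then-check pattern used in this section can be wrapped in a small function. This is a sketch, not part of CIT: the function name set_cray_bin is hypothetical, and it assumes that `am --flat --cray_bin` prints the configured path somewhere in its output, which this page does not show.

```shell
# Sketch: set --cray_bin for a target, then read the value back.
# ASSUMPTION: `am --flat --cray_bin TARGET` includes the configured path
# in its output. Adjust the check if the real output differs.
set_cray_bin() {
    install_dir="$1"
    target="$2"
    am --set --cray_bin "$install_dir" "$target" || return 1
    if am --flat --cray_bin "$target" | grep -F -q -- "$install_dir"; then
        echo "ok: $target uses $install_dir"
    else
        echo "mismatch: $target does not report $install_dir" >&2
        return 1
    fi
}
```

It would be called with the same arguments as above, for example `set_cray_bin "/usr/local5/kbferre/${RS_SRC_DIR}/install" cx24y2c2`.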

Check Node Roles

Ensure that the roles of the nodes are correct. /cluster_new/vms/cx24y2c2/nodelist contains the roles for our cage: i denotes a service node, c denotes a compute node, and n denotes that no node exists in the slot.

Example nodelist file

Use the following to view the role of a node:

rscsmw2:/home/kbferre $  am --flat --rs_role n32.cx24y2

Typical Output

The following sets the role of a node to service

rscsmw2:/home/kbferre $ am --set --rs_role i n32.cx24y2
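
When reading role letters, a small lookup can help avoid mix-ups. The helper below is hypothetical (role_name is not a CIT command) and only encodes the three letters described above.

```shell
# Hypothetical helper: translate a nodelist role letter into a description.
role_name() {
    case "$1" in
        i) echo "service node" ;;
        c) echo "compute node" ;;
        n) echo "empty slot (no node)" ;;
        *) echo "unknown role: $1" >&2; return 1 ;;
    esac
}
```

For example, `role_name i` prints "service node", and any other letter is reported as unknown on stderr.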

Prepare Linux Node

Now we prepare the software for our Linux Service node: n64.cx24y2

Create the boot parameters file

rscsmw2:/home/kbferre $ makeboot_sandia n64.cx24y2

Typical Output

Move the boot parameters file

rscsmw2:/home/kbferre $ makeboot_parameters_2.6 n64.cx24y2

Typical Output

The boot parameters can be checked by looking at ${RS_SRC_DIR}/install/linux/bootable/parameters. An example is here:

Power Cycle the Cage

A power cycle is not usually required and should be avoided where possible, as it usually causes more problems than it solves. Use it only as a last resort.

Power off the cage

rscsmw2:/home/kbferre $ rs_power --off cx24y2c2

Typical Output

Wait a few minutes (on the suggestion of Cray), and then power back on:

rscsmw2:/home/kbferre $ rs_power --on cx24y2c2

Typical Output

It may take a few minutes for the management processors on the cage to initialize fully.
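
The off, wait, on sequence can be scripted so the pause is not skipped. This is a sketch under assumptions: the function name power_cycle_cage is made up, and the 180-second default merely stands in for "a few minutes".

```shell
# Sketch of the off / wait / on sequence for one cage.
# ASSUMPTION: 180 seconds stands in for "a few minutes"; adjust to local guidance.
power_cycle_cage() {
    cage="$1"
    wait_secs="${2:-180}"
    # Do not continue if the power-off command itself fails
    rs_power --off "$cage" || return 1
    echo "waiting ${wait_secs}s before powering $cage back on"
    sleep "$wait_secs"
    rs_power --on "$cage"
}
```

Usage would be `power_cycle_cage cx24y2c2`, or `power_cycle_cage cx24y2c2 300` for a longer pause.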

Start the Harness

Note: If you already had a session running, make sure your wamd and any console processes are killed before starting over. WAMD will be started in the next step.

Check if the harness is already running

rscsmw2:/home/kbferre $ harness --status cx24y2c2

Typical Output

If the harness is running, continue with the next step and stop it first. If it is not running, skip ahead to starting the harness.

Stop the harness software

rscsmw2:/home/kbferre $ harness --stop cx24y2c2

Check to make sure it stopped

rscsmw2:/home/kbferre $ harness --status cx24y2c2

Start the Harness

rscsmw2:/home/kbferre $ harness --init cx24y2c2

Typical Output

Check to make sure it is running

rscsmw2:/home/kbferre $ harness --status cx24y2c2

Typical Output

Start Harness on Slot 7 (nodes 92,93,94,95)

rscsmw2:/home/kbferre $ harness --init cx24y2c2s7
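
The stop, check, start, check sequence for the cage can be collected into one function. This is a sketch: restart_harness is a hypothetical name, and because this page does not show what `harness --status` prints or returns, the status output is simply shown for the operator to read rather than parsed.

```shell
# Sketch: stop the harness, show its status, start it, show its status again.
# The format and exit codes of `harness --status` are not assumed here;
# the operator is expected to read the output.
restart_harness() {
    target="$1"
    harness --stop "$target"
    harness --status "$target"
    harness --init "$target" || return 1
    harness --status "$target"
}
```

Usage would be `restart_harness cx24y2c2`, run as root like the individual commands above.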

Start WAMD

In a separate window, and not as root, start wamd:

rscsmw2:/home/kbferre $ harness --wamd cx24y2c2

You will not get this window back. Hardware errors on the cage will be logged here.

Initialize the hardware

Initialize cage cx24y2c2

rscsmw2:/home/kbferre $ rs_init cx24y2c2

Initialize single node 80 in cage cx24y2c2

rscsmw2:/home/kbferre $ rs_init n80.cx24y2

Note: If another node exists on the board with node 80, it will also get initialized

Typical Output

As this starts up, it prints a series of messages of the following form to the wamd window:

[SNIP]
deadstart (73 2) (77 3)
deadstart (74 2) (78 3)
deadstart (68 2) (72 3)
deadstart (69 2) (73 3)
deadstart (93 0) (94 5)
deadstart (94 0) (95 5)
deadstart (70 2) (74 3)
[SNIP]
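
The note above about node 80 sharing a board raises the question of which other nodes get initialized along with it. The sketch below works this out under an assumption that is not confirmed on this page: that each board holds four consecutively numbered nodes starting at a multiple of four, as suggested by slot 7 holding nodes 92, 93, 94 and 95. The name board_nodes is hypothetical.

```shell
# ASSUMPTION: four nodes per board, numbered consecutively from a multiple
# of four (inferred from slot 7 = nodes 92-95; not confirmed on this page).
board_nodes() {
    node="$1"
    # Integer division rounds down to the first node on the board
    first=$(( node / 4 * 4 ))
    echo "$first $((first + 1)) $((first + 2)) $((first + 3))"
}
```

Under that assumption, `board_nodes 80` lists 80 81 82 83, so initializing n80.cx24y2 would also touch nodes 81 to 83 if they are present.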

Boot Service Node

rscsmw2:/home/kbferre $ rs_boot --l n64.cx24y2

After a certain amount of time this will look like a typical Linux boot. A good sign is when you see the HOSTNAME output to the console similar to:

[SNIP]
Initrd: done.
**********************************************************
    HOSTNAME: rsclogin112
**********************************************************
1: lo: <LOOPBACK> mtu 16436 qdisc noop
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: sit0: <NOARP> mtu 1480 qdisc noop
    link/sit 0.0.0.0 brd 0.0.0.0
3: ss: <NOARP> mtu 16000 qdisc noop qlen 8
    link/ether 00:00:00:00 brd ff:ff:ff:ff
[SNIP]

Press Ctrl-C in this window to get the prompt back after the login node has finished booting.

Boot Compute Nodes

rscsmw2:/home/kbferre $ rs_boot --qk cx24y2c2

This can take 10-15 minutes and, as far as I can tell, prints little useful information to the screen.

Typical Output

Boot Compute Node 95

rscsmw2:/home/kbferre $ rs_boot --qk n95.cx24y2
 
/var/www/ssl/data/pages/kurt/os_noise/devharness/boot_cage.txt · Last modified: 2008/01/07 12:37 (external edit)