For example, let’s imagine that you have accidentally damaged GRUB, and you’re far away from your office. What should you do — ask someone to enter the server room and recover it, or can you try and fix it remotely? The answer is that you can easily recover GRUB from your current location!
Let’s first understand the scenario, so that you know what I’m talking about. We have a couple of different projects hosted on a single branded (HP ProLiant 580) server whose configuration includes two RAID controllers, two physical Xeon processors and 24 GB of RAM — enough power for a dozen virtual machines. During the initial setup of Debian, the server was configured to boot from and use only the first RAID controller, while the second controller was configured as RAID 5 and left intact.
After a while, more projects were added to the system, and it became clear that more space was needed, so it was decided to use the second RAID array as well. It was highly desirable to have that space as RAID 1+0, so the array was reconfigured. After the first reboot, however, not a single ping came back from the server. What went wrong? Global loss of data, or something trickier?
My first response was to check whether GRUB was still good, or had been destroyed. If the latter, what would be my next step — a full Debian reinstallation, together with all hosted virtual machines, and associated user data? From a rational point of view, this would be a huge loss of time. Maybe I ought to try and restore GRUB first — but how? Fortunately, Hewlett-Packard hardware comes with the iLO service on-board.
iLO — integrated Lights-Out
For years, OEMs of complex computer systems have tried to expand the functionality of their enterprise-level product lines, for example, by offering out-of-band management (a feature in which working traffic goes through a regular network card, while service traffic travels over a dedicated channel). In practice, the server has an additional Ethernet controller that’s intended for service functions only.
Access to the server is organised through this dedicated channel; it can be seen as a secured and completely independent connection to the server. Don’t confuse it with a VLAN subnet, because it’s a completely separate network. The traffic that goes through this dedicated channel is purely for management purposes. So, you can establish a virtual TTY connection to a UNIX server, or set up fully featured graphical access to the server by means of iLO, the integrated Lights-Out service, without fear of overloading other channels.
Basically, iLO is a special portion of firmware (i.e., BIOS) developed by Hewlett-Packard that transforms TCP/IP packets into serial line signals. There are competitors to iLO implemented by other vendors. In particular, Sun Microsystems had developed a feature pack called Advanced Lights Out Management (ALOM). Fujitsu has Integrated Remote Management Controller (iRMC), and Dell has Dell Remote Access Controller, or DRAC. All of them have one goal — to offer a virtual KVM.
The HP iLO implementation is available in two options: you can either connect to the server via SSH and perform all tasks in a text-console session, or you can launch your Web browser and have a full-featured graphical desktop environment in order to manipulate the hardware. In the latter case, you get a far more functional solution, because you can use the local screen and peripherals by your side as virtual devices (the real ones are left in the server room). We’ll be using only the advanced features of iLO.
Let’s get started with Mozilla Firefox, and locate the URL to the server’s iLO home page. You need to have a Web browser with a Java plugin to log in to iLO. After you’ve been authorised via the SSL connection, you will see the iLO main page (Figure 1).
Here, you will find the following tabs: System Status, Remote Console, Virtual Media, Power Management and Administration. Basically, all information about the HP server is available at a glance. However, our primary interest is hidden under the Remote Console and Virtual Media tabs.
From the first tab, you can access a graphical display of the server console, as seen in Figure 2. The Virtual Media tab allows you to connect floppy, USB stick or CD-ROM images as if they were real hardware on the server — see Figure 3. What’s interesting about all this is the fact that those images are on your side — on the same host on which you’re running the Web browser. Handy, isn’t it?
So we have access to the console as if we had a real keyboard and screen in front of us. We can see the configuration of the RAID controller, and the server’s boot priority (the sequence of devices that the server checks in order to boot: CD-ROM, Ethernet card, or directly from HDD). In order to change it, you should press the F9 key right after the initialisation of both RAID controllers (they’re named HP Smart Array P400 Controller). After you press the F9 key, you’re in the Setup Utility (Figure 4).
Here, you need to pay attention to the option Standard Boot Order (IPL). Entering that option gives you a listing of devices, as shown in Figure 5.
The first device to be polled is CD-ROM, which would be a real CD inserted into the CD bay slot, or an image mounted via the network from your PC. If neither is available, the next device to be checked is a hard drive, seen as IPL:2, and so on. Now, you need to check what the boot sequence is for the composite IPL:2 device.
Go back to the main BIOS set-up screen and select the menu option Boot Controller Order. The screen is shown in Figure 6.
It is clear that the boot priority is as follows: first the PCI Embedded HP Smart Array P400 Controller (also known as Slot 0), then the PCI Embedded HP Integrated PCI IDE Controller, and finally the second RAID controller, in Slot 11.
In order to see the layout of the relevant RAID array, you need to do a warm reboot, and press F8 immediately after the initialisation of the respective controller. Figure 7 shows the main setup menu for the RAID controller in Slot 11.
Pay attention to the last option, Select as Boot Controller. This option is responsible for setting the boot priority among RAID disks. If you select this option, the loader will begin using this particular disk for its purposes. Therefore, even if you’ve correctly selected all of the previous options in the ROM-based Setup Utility, but forgotten about this tricky option, expect your system not to boot. But don’t panic. We know how to restore the MBR, and we’ll do just that. Let’s save private GRUB!
Get my data back!
To rescue all user data as well as all the Linux systems configuration, we could grab ourselves any Live CD, boot off it, and copy the files to SAN storage. After that, we could reinstall the entire system from scratch.
However, I think this approach is wrong. Instead of copying several thousand gigabytes from RAID arrays via the local network, we can save our time. The first thing we can avoid is the need to integrate network drivers into the Live CD. This particular HP model has two Gigabit NICs, and since their drivers have already been put into the installed system’s initramfs file, all we need to do is find that file and use it. Otherwise, you’ll have to change the network topology, which means finding a person who will switch cables from one Ethernet port to another. That’s not easy to explain to a stranger who is miles away from you, in the middle of the night, especially when the server room hosts dozens of communication racks and cabinets.
Let’s make sure that the drives on the first RAID controller (where we store the data and the system) are consistent. In order to do this, press the F8 key repeatedly right after the initialisation of the controller in Slot 0 to get into the RAID configuration.
According to the layout visible in Figure 8, the data we need has remained untouched on the disks. This is great; full speed ahead! Our task is to boot the system, then transfer the /boot/ contents onto our PC. Only after that will we create a bootable ISO image with the kernel and initramfs files rescued from the system. This ISO image will load the whole system as it was before the crash. Once the system is loaded fully, we will try to rerun the last command, grub-install. Easier said than done, right? Let’s see.
I started a regular Debian Lenny installation with the kernel option init=/bin/sh. This gives me a root shell, comparable to a console in runlevel 1, once kernel initialisation has taken place. Let’s see exactly what virtual devices are connected to this HP server:
# dmesg | grep HP
usb 5-1: Manufacturer: HP
HP CISS Driver (v 3.6.20)
usb 5-2: Manufacturer: HP
usb 5-2.2: Manufacturer: HP
input: HP Virtual Keyboard as /devices/pci0000:00/0000:00:1e.0/0000:01:04.4/usb5/5-1/5-1:1.0/input/input1
generic-usb 0003:03F0:1027.0001: input,hidraw0: USB HID v1.01 Keyboard [HP Virtual Keyboard] on usb-0000:01:04.4-1/input0
input: HP Virtual Keyboard as /devices/pci0000:00/0000:00:1e.0/0000:01:04.4/usb5/5-1/5-1:1.1/input/input2
generic-usb 0003:03F0:1027.0002: input,hidraw1: USB HID v1.01 Mouse [HP Virtual Keyboard] on usb-0000:01:04.4-1/input1
scsi 2:0:0:0: CD-ROM HP Virtual DVD-ROM PQ: 0 ANSI: 0 CCS
In the above code snippet, the HP RAID driver is registered as “CISS”. Let’s repeat the grep, but with ciss as the filter:
# dmesg | grep ciss
cciss 0000:25:00.0: PCI INT A -> GSI 34 (level, low) -> IRQ 34
cciss 0000:25:00.0: irq 69 for MSI/MSI-X
cciss 0000:25:00.0: irq 70 for MSI/MSI-X
cciss 0000:25:00.0: irq 71 for MSI/MSI-X
cciss 0000:25:00.0: irq 72 for MSI/MSI-X
IRQ 71/cciss0: IRQF_DISABLED is not guaranteed on shared IRQs
cciss0: <0x3230> at PCI 0000:25:00.0 IRQ 71 using DAC
cciss/c0d0: unknown partition table
cciss 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
cciss 0000:02:00.0: irq 73 for MSI/MSI-X
cciss 0000:02:00.0: irq 74 for MSI/MSI-X
cciss 0000:02:00.0: irq 75 for MSI/MSI-X
cciss 0000:02:00.0: irq 76 for MSI/MSI-X
IRQ 75/cciss1: IRQF_DISABLED is not guaranteed on shared IRQs
cciss1: <0x3230> at PCI 0000:02:00.0 IRQ 75 using DAC
cciss/c1d0: p1 p2
cciss/c1d1: unknown partition table
As you see, this gives us two partitions found on one RAID controller (cciss/c1d0), and an empty partition table for the other. There’s obviously a destroyed MBR there. We can try to mount the first partition now:
# mount /dev/cciss/c1d0p1 /tmp/1 -t ext2
EXT2-fs warning (device cciss/c1d0p1): ext2_fill_super: mounting ext3 filesystem as ext2
# ls -l /tmp/1
-rw-r--r-- 1 0 0 1724047 Oct 21 07:45 System.map-2.6.32-4-pve
-rw-r--r-- 1 0 0 106083 Oct 21 07:45 config-2.6.32-4-pve
drwxr-xr-x 2 0 0 4096 Dec 8 06:54 grub
-rw-r--r-- 1 0 0 10404494 Nov 13 07:17 initrd.img-2.6.32-4-pve
-rw-r--r-- 1 0 0 10402231 Nov 8 13:42 initrd.img-2.6.32-4-pve.bak
drwx------ 2 0 0 16384 Nov 8 12:38 lost+found
-rw-r--r-- 1 0 0 124152 Oct 8 2008 memtest86+.bin
-rw-r--r-- 1 0 0 2507616 Oct 21 07:45 vmlinuz-2.6.32-4-pve
The partition that we’ve mounted is actually the original /boot directory. Now we need to transfer the necessary files onto our local Linux machine, and from these, create a mini boot ISO image, specifically designed for our server. How do we do this? We still have an unused virtual USB drive slot. Let’s first prepare an empty image file with an ext3 filesystem on our local system:
# dd if=/dev/zero bs=1M count=40 of=blank.iso
# /sbin/mkfs.ext3 -F blank.iso
Next, we connect the blank.iso image with the ext3 filesystem to the server USB slot via the Java applet (the Virtual Floppy/USBKey option seen in Figure 3). After this, tailing dmesg on the server console shows something like the following:
usb 5-2.1: new full speed USB device using uhci_hcd and address 10
usb 5-2.1: New USB device found, idVendor=03f0, idProduct=1427
usb 5-2.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
usb 5-2.1: Product: Virtual Floppy
usb 5-2.1: Manufacturer: HP
usb 5-2.1: configuration #1 chosen from 1 choice
scsi8 : SCSI emulation for USB Mass Storage devices
scsi 8:0:0:0: Direct-Access HP Virtual Keydrive 0.01 PQ: 0 ANSI: 2
sd 8:0:0:0: Attached scsi generic sg2 type 0
sd 8:0:0:0: [sda] 81920 512-byte logical blocks: (41.9 MB/40.0 MiB)
sd 8:0:0:0: [sda] Write Protect is off
sd 8:0:0:0: [sda] Assuming drive cache: write through
sd 8:0:0:0: [sda] Assuming drive cache: write through
sda: unknown partition table
sd 8:0:0:0: [sda] Assuming drive cache: write through
sd 8:0:0:0: [sda] Attached SCSI removable disk
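A quick way to confirm that the server really sees our image, and not some other device, is to compare the reported size with the file we created: 40 MiB divided into 512-byte sectors should give the 81920 blocks that dmesg reports. The following is only a sketch; image_sectors is a hypothetical helper name, and the image is recreated in a temporary directory purely for the demonstration.

```shell
#!/bin/sh
# Hypothetical helper: print how many 512-byte sectors a file occupies.
image_sectors() {
    size=$(wc -c < "$1")    # file size in bytes
    echo $((size / 512))
}

# Recreate a 40 MiB image the same way as above, but in a temporary directory.
workdir=$(mktemp -d)
dd if=/dev/zero of="$workdir/blank.iso" bs=1M count=40 2>/dev/null

# 40 * 1024 * 1024 / 512 = 81920, the figure shown in the [sda] line of dmesg.
image_sectors "$workdir/blank.iso"

rm -rf "$workdir"
```

If the number printed here differs from the block count in the server’s dmesg output, the device that appeared as sda is probably not the image you attached.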
Now we mount the newly detected (virtual) USB drive:
# mount /dev/sda /tmp/2 -t ext3
We copy the kernel, the initramfs file with its integrated drivers, and the GRUB loader settings into the mounted image. Then we run sync to ensure that buffers are flushed (the data is written across the network onto our local disk), and unmount our virtual USB drive:
# cp /tmp/1/initrd* /tmp/2
# cp /tmp/1/vmlinuz* /tmp/2
# cp /tmp/1/grub/menu* /tmp/2
# sync && umount /tmp/2
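Since the files travel across the network through a virtual USB device, it may be worth checking that they arrived intact. One possible approach, not part of the original procedure, is to store checksums inside the image before unmounting it on the server, and verify them after mounting the image locally. The helper names and the MD5SUMS file name below are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical helpers; MD5SUMS is an assumed file name.

# On the server: record checksums of the kernel and initramfs files
# inside the mounted image directory (for example /tmp/2).
write_sums() {
    ( cd "$1" && md5sum vmlinuz* initrd* > MD5SUMS )
}

# On the local machine: verify the files in the locally mounted image.
# Returns non-zero if any file is missing or differs.
check_sums() {
    ( cd "$1" && md5sum -c MD5SUMS > /dev/null 2>&1 )
}

# Usage (paths are examples):
#   server:  write_sums /tmp/2 && sync && umount /tmp/2
#   local:   check_sums /mnt/blank && echo "files are intact"
```

A mismatch here would suggest repeating the copy before building the boot ISO from a damaged kernel or initramfs.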
On the local machine, we disconnect the ISOs from the virtual USB and virtual CD using the Virtual Media tab of the Java applet. Now, we mount this image file locally, and with the help of the Syslinux utility, create a bootable ISO image.
First, we create a directory structure as follows:
./isolinux/isolinux.bin
./isolinux/isolinux.cfg
./vmlinuz-2.6.32-4-pve
./initrd.img-2.6.32-4-pve
./grub/menu.lst
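The contents of isolinux.cfg are not shown here, so the following is only a sketch of how such a file could be derived from the rescued menu.lst. It assumes a GRUB Legacy menu.lst whose first kernel line carries the options (such as root=) that the system normally boots with; make_isolinux_cfg and the label name rescue are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print an isolinux.cfg built from the first
# "kernel" entry of a GRUB Legacy menu.lst.
# Arguments: path to menu.lst, kernel file name, initramfs file name.
make_isolinux_cfg() {
    menu=$1
    kernel=$2
    initrd=$3
    # Take the first "kernel" line, drop the keyword and the kernel path,
    # and keep the remaining boot options (root=..., ro, and so on).
    args=$(awk '$1 == "kernel" { $1 = ""; $2 = ""; sub(/^ +/, ""); print; exit }' "$menu")
    cat <<EOF
DEFAULT rescue
PROMPT 0
LABEL rescue
  KERNEL /$kernel
  APPEND initrd=/$initrd $args
EOF
}

# Usage, run from the root of the ISO directory tree:
#   make_isolinux_cfg grub/menu.lst vmlinuz-2.6.32-4-pve \
#       initrd.img-2.6.32-4-pve > isolinux/isolinux.cfg
```

Reusing the options from menu.lst keeps the root device identical to what the installed system expects, which is the point of booting the rescued kernel rather than a generic Live CD.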
The generated directory tree for the mini ISO image should conform to Syslinux rules. Next, we launch the mkisofs command with these options:
# mkisofs -o mini_iso.iso -b isolinux/isolinux.bin -c isolinux/boot.cat -no-emul-boot -boot-load-size 4 -boot-info-table CD_root
Booting with the new mini-boot ISO
We connect the new mini_iso.iso as a virtual CD drive on the server, and in the iLO page’s Administration tab, click the Reset button. We switch to the virtual console and watch the system boot messages. As is visible in Figure 9, the system is intact, and boots as it should.
Cool! Now we need to point the boot loader to a new location: to be precise, to the beginning of the /dev/cciss/c1d0 device, where it was expected but was not present. But first, let’s verify what we can see at the beginning of the second disk:
# hexdump -C /dev/cciss/c0d0 | less
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
This tells us that the MBR table for the second RAID array is completely blank. Well, we’ve proved our hypothesis that the system fault was because of a blank MBR. It’s time to run GRUB, to generate an MBR for the new location:
# /usr/sbin/grub-install /dev/cciss/c1d0
Let’s make sure that the MBR is generated and written correctly for the first disk:
# hexdump -v /dev/cciss/c1d0 | less
0000000 48eb 0090 0000 0000 0000 0000 0000 0000
0000010 0000 0000 0000 0000 0000 0000 0000 0000
0000020 0000 0000 0000 0000 0000 0000 0000 0000
0000030 0000 0000 0000 0000 0000 0000 0000 0203
0000040 00ff 2000 0001 0000 0200 90fa f690 80c2
0000050 0275 80b2 59ea 007c 3100 8ec0 8ed8 bcd0
0000060 2000 a0fb 7c40 ff3c 0274 c288 be52 7d7f
0000070 34e8 f601 80c2 5474 41b4 aabb cd55 5a13
0000080 7252 8149 55fb 75aa a043 7c41 c084 0575
0000090 e183 7401 6637 4c8b be10 7c05 44c6 01ff
00000a0 8b66 441e c77c 1004 c700 0244 0001 8966
00000b0 085c 44c7 0006 6670 c031 4489 6604 4489
00000c0 b40c cd42 7213 bb05 7000 7deb 08b4 13cd
00000d0 0a73 c2f6 0f80 ea84 e900 008d 05be c67c
00000e0 ff44 6600 c031 f088 6640 4489 3104 88d2
00000f0 c1ca 02e2 e888 f488 8940 0844 c031 d088
0000100 e8c0 6602 0489 a166 7c44 3166 66d2 34f7
0000110 5488 660a d231 f766 0474 5488 890b 0c44
0000120 443b 7d08 8a3c 0d54 e2c0 8a06 0a4c c1fe
0000130 d108 6c8a 5a0c 748a bb0b 7000 c38e db31
0000140 01b8 cd02 7213 8c2a 8ec3 4806 607c b91e
0000150 0100 db8e f631 ff31 f3fc 1fa5 ff61 4226
0000160 be7c 7d85 40e8 eb00 be0e 7d8a 38e8 eb00
0000170 be06 7d94 30e8 be00 7d99 2ae8 eb00 47fe
0000180 5552 2042 4700 6f65 006d 6148 6472 4420
0000190 7369 006b 6552 6461 2000 7245 6f72 0072
00001a0 01bb b400 cd0e ac10 003c f475 00c3 0000
00001b0 0000 0000 0000 0000 0000 0000 0000 0280
00001c0 0001 8183 8020 0040 0000 0000 0010 8200
00001d0 8001 fe8e ffe0 0040 0010 ac80 087a 0000
00001e0 0000 0000 0000 0000 0000 0000 0000 0000
00001f0 0000 0000 0000 0000 0000 0000 0000 aa55
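Reading a full dump is not strictly necessary. A valid MBR ends with the bytes 55 aa at offset 510; hexdump without -C prints 16-bit words in host byte order, which is why they appear above as the final word aa55. The sketch below checks just those two bytes, plus the “GRUB” string that the boot code embeds; both function names are hypothetical, and the checks work on a device or on an image file alike.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the first sector ends with the
# boot signature bytes 55 aa (offsets 510 and 511).
has_boot_signature() {
    sig=$(dd if="$1" bs=1 skip=510 count=2 2>/dev/null | od -An -tx1 | tr -d ' \n')
    [ "$sig" = "55aa" ]
}

# Hypothetical helper: succeed if the first sector contains the text "GRUB".
mentions_grub() {
    dd if="$1" bs=512 count=1 2>/dev/null | LC_ALL=C grep -q GRUB
}

# Usage (device name as used in this article):
#   has_boot_signature /dev/cciss/c1d0 && mentions_grub /dev/cciss/c1d0 \
#       && echo "MBR looks like a GRUB boot sector"
```

Neither check proves that the loader will actually boot; they only confirm that grub-install wrote something plausible to the first sector.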
Yes, the GRUB loader is generated on the first RAID disk. Now we flush any disk caches and do a warm reboot of the system, meanwhile disconnecting our mini-boot ISO from the virtual CD slot:
# sync && reboot
After reboot, we can enjoy the restored and, more importantly, the completely working system.
Here, I’ve explained in detail how to recover a Linux system with the help of the integrated iLO software from Hewlett-Packard. It no longer matters if you’re far away from the office, so long as you’re able to log in to the company LAN via VPN, and able to access the server’s iLO home page. From there, you can quite easily perform non-trivial tasks like a full reconfiguration of a working system, a complete installation, or a restore after a crash.
Resources
- Wikipedia article on HP iLO
- Resource on HP’s website on iLO
- Wikipedia entry on lights-out management/out-of-band management
- The Syslinux Project wiki