Q: Our company is going to have to power down all the servers in our data center / computer room because of required electrical maintenance. What steps should I take ahead of time to prepare for this?
A: There are 3 main items that need to be looked at in this case:
- System Documentation
- System Backups
- System Hardware Verification
SYSTEM DOCUMENTATION
Much of this is the same information that you should already have available in your Disaster Recovery / Business Continuity plan.
There are many tools available to help you document your operating system configuration. My preferred tool is CFG2HTML as it is available for many different Unix versions & Linux distributions. CFG2HTML is free, is extensible, and is easily customisable as it is based around a collection of scripts that collect the system information.
All documentation should be kept on some sort of removable media, such as a CD / DVD or USB stick, so that it is easily accessible during an outage. It would not hurt to keep a hard copy of the documentation in a binder as well, just in case.
My checklist of documentation (slightly biased toward HP-UX since that is where the majority of my expertise lies):
- Run CFG2HTML on all servers and gather all 3 files that are generated on each server.
- For any LVM-based volume groups on HP-UX, run the command ‘vgexport -p -v -s -m /tmp/<vgname>.map /dev/<vgname>’ to capture the volume group configuration in a map file (the ‘-p’ option previews the export without removing the volume group). This information can be used to recreate your volume groups if disk device files change as a result of the outage / system reboot. If you are running a different volume manager, Veritas Volume Manager (VxVM) for example, and it has an equivalent to the ‘vgexport’ command above, run it as appropriate so you can rebuild your volume groups if need be.
- If you have any NFS servers or clients, make sure you understand the relationships between the servers so that they can be booted in the correct order. For example:
  - NFS server first.
  - Combination of NFS server / NFS client.
  - Strictly NFS client.
- If there are any other dependencies or boot order requirements, make sure those are understood and documented. For example:
  - DNS server
  - Corporate backup server
  - Corporate job scheduler
- Document any special steps that need to be taken, or special instructions that need to be followed, when restarting applications. For example:
  - Starting Oracle databases.
  - Starting applications that talk to the databases.
  - Starting processes that depend on those applications.
- If you have any disk arrays, save a copy of the array configuration so that you know which disks are assigned to each disk group, the size of the configured LUNs, and any zoning information. If the array can generate a configuration file that it can later reload, save a copy of it so that the configuration can be reapplied easily.
- Make a copy of the configuration of your networking equipment. This should include all network switches, routers, firewalls, and any SAN / fibre channel switches.
- Understand the power-on sequence of all equipment in the data center, for example:
  - Network equipment first
  - Disk arrays
  - Tape libraries
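The ordering rules in the checklist above (NFS servers before NFS clients, infrastructure services before the systems that rely on them, network and storage before servers) can be written down as a dependency map and sorted mechanically. The following is only a minimal sketch, assuming Python 3.9 or later on an admin workstation (for the standard-library graphlib module). Every equipment and host name in it is a hypothetical placeholder, so substitute your own inventory.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical inventory: each item maps to the items that must be up before it.
DEPENDS_ON = {
    "core-switch": [],
    "san-switch": ["core-switch"],
    "disk-array": ["san-switch"],
    "tape-library": ["san-switch"],
    "dns01": ["core-switch"],
    "nfs-server": ["disk-array", "dns01"],
    "nfs-mixed": ["nfs-server"],        # acts as both NFS server and NFS client
    "nfs-client": ["nfs-mixed"],
    "backup-server": ["tape-library", "dns01"],
}


def power_on_order(depends_on):
    """Return a start-up order in which every item follows its prerequisites."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, name in enumerate(power_on_order(DEPENDS_ON), start=1):
        print(f"{step:2d}. {name}")
```

Reading the resulting list in reverse gives one reasonable shutdown order, since dependents are then stopped before the things they depend on. A circular dependency raises an error, which is itself worth knowing about before the outage.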
SYSTEM BACKUPS
In a situation like this, backups are extremely important. There are 2 types of backups that should be taken on all servers:
- Operating system image backup. Some tools that can be used to accomplish this:
  - MS Windows: a tool like Acronis
  - IBM AIX: a ‘mksysb’ backup
  - HP-UX: an Ignite/UX backup (network- or tape-based) or a DRD image of the boot disk
  - Linux: a tool like CloneZilla
- A backup of the data. This needs to be separate from the OS image backup, especially if your data is on a SAN. That way if your boot disks fail, you can recover the OS but your data should still be available on the SAN.
You must make absolutely certain that all backups are finished and that they ran successfully before powering down the servers. If the backups were not successful, then you should not proceed unless the risk of data loss is completely understood and management signs off that the risk is acceptable.
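The go / no-go rule above can be expressed as a simple check. This is a rough sketch only: the BackupResult record, the function names, and the server names in the usage example are all hypothetical, and in practice the pass/fail values would have to come from whatever reports your backup tools produce.

```python
from dataclasses import dataclass


@dataclass
class BackupResult:
    server: str
    os_image_ok: bool   # OS image backup finished successfully
    data_ok: bool       # data backup finished successfully


def blocking_servers(results, expected_servers):
    """Return the servers whose backups are missing or did not succeed."""
    by_name = {r.server: r for r in results}
    blocked = []
    for name in expected_servers:
        result = by_name.get(name)
        # A server with no report at all is treated the same as a failure.
        if result is None or not (result.os_image_ok and result.data_ok):
            blocked.append(name)
    return blocked


def ok_to_power_down(results, expected_servers, risk_signed_off=False):
    """Proceed only if every backup succeeded, or management accepted the risk."""
    return not blocking_servers(results, expected_servers) or risk_signed_off
```

For example, with hypothetical servers "db01" and "app01":

```python
results = [BackupResult("db01", True, True), BackupResult("app01", True, False)]
print(blocking_servers(results, ["db01", "app01", "web01"]))
print(ok_to_power_down(results, ["db01", "app01", "web01"]))
```

Here "app01" blocks the power-down because its data backup failed, and "web01" blocks it because no backup was reported for it at all.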
SYSTEM HARDWARE VERIFICATION
Prior to shutting down any systems, all hardware should be checked to make sure there are no undetected failures. This is especially true of disk drives. If there is some form of RAID protection (RAID 1 or RAID 5/6 for example), a disk may have failed without anyone noticing, because the array keeps running on the remaining disks. The longer such a failure goes undetected, the greater the risk that another disk in the same RAID set fails, possibly during the power cycle itself, and data is lost. If there are any failed disk drives in a disk array or in a server, they should be replaced and the data resynchronized (in the case of RAID 1 or RAID 5/6) before the power-down.
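As one illustration of looking for undetected disk failures, the sketch below covers Linux software RAID (md) only. It assumes the usual /proc/mdstat layout, in which each array's status line ends with a member map such as [UU], and an underscore marks a missing or failed member. Hardware RAID controllers, disk arrays, and HP-UX LVM mirrors need their own vendor tools, which are not shown here. The function name and the sample text are made up for the example.

```python
import re

# Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0]"
ARRAY_RE = re.compile(r"^(md\w+)\s*:")
# Member counts and member map on the status line, e.g. "[2/1] [U_]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")


def degraded_arrays(mdstat_text):
    """Return the names of md arrays that have at least one missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = ARRAY_RE.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS_RE.search(line)
        if status and current:
            if "_" in status.group(3):
                degraded.append(current)
            current = None
    return degraded


# Made-up sample in /proc/mdstat format: md0 is healthy, md1 has lost a member.
SAMPLE = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      20860288 blocks [2/1] [U_]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE))
```

On a Linux host that uses md, the same function could be fed the contents of /proc/mdstat instead of the sample text.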
If you have servers that have not been rebooted in a long time, it may be a good idea to reboot them prior to the power-down to make sure they can boot successfully. A successful reboot beforehand should lessen the chances of something going wrong when you power up the servers after maintenance.