As a diagnostic tool, a crash dump becomes the only way of diagnosing
a problem after the fact.
What is a panic?
Put simply, a panic is a system failure. Panics can be caused by
either hardware failure(s) or operating system software failures.
Hardware failures may leave the system unusable until the failing
component can be replaced.
How to take a crash dump:
This section will provide information regarding crash dump procedures
on the 1.2, 1.2.x, 1.3 and 1.3.1 operating systems. For 1.2, see the
section relevant for your machine.
U6000/15/60/65/100/300/520: 1.2, 1.2.1, 1.2.4 :
To obtain a complete dump, the primary swap space must be at least the
size of system memory.
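The size rule above can be checked before a dump is ever needed. The following is a minimal sketch, assuming you already know the memory size and the primary swap size in kilobytes; dump_coverage is a hypothetical helper name, not a system command.

```shell
# dump_coverage MEM_KB SWAP_KB
# Hypothetical helper: reports whether a dump to primary swap could be
# complete, using the rule that swap must be at least the size of memory.
dump_coverage() {
    mem_kb=$1
    swap_kb=$2
    if [ "$swap_kb" -ge "$mem_kb" ]; then
        echo complete
    else
        echo incomplete
    fi
}

dump_coverage 65536 131072    # 64 MB memory, 128 MB swap: prints "complete"
dump_coverage 262144 131072   # 256 MB memory, 128 MB swap: prints "incomplete"
```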
To force a crash dump simply press the soft reset button once. By resetting the system manually, an "NMI Push Button" dump will be created. (U6000/60s have only one reset button which is located on the I/O card just above the ECOM connector. This button should be pressed ONCE for a soft reset.)
By default, the system will attempt to write the following files to the /crash directory of the root file system:
crash.MMDD
sym.MMDD
kernel.MMDD
This is what you will see on the console:
A crash dump of n K is in the swap area.
It will take a while to save the crash dump on the disk / tape / floppy.
Do you want to save the crash dump ? ( y / n) : y
If there is not enough space on the root file system to save the dump, the following menu will be displayed:
Need n to save crash dump.
Root has only n free.
F - write to floppy disk
T - write to tape
S - spawn a shell
X - skip it:S
At this point, you may choose to write to floppy (NOT advisable), write to tape, spawn a shell or skip saving the dump. If you select S, you will see:
Enter Control-D to exit the shell.
#
From this prompt, you have the ability to remove files from the root file system in an attempt to free up enough space. Once you have freed enough space, exit the shell using ctrl-d and the savedump script will start again.*
* Prior to 1.2QT4, there is a problem with the /sbin/savedump script when / is a ufs file system. Please see patch 16423602 for more information.
The only difference on these machines is that there is no NMI push button unless you put a newer board in the machine, in which case you must also apply patch 16393045 so that there is software support to take an NMI dump as well.
To obtain a readable crash dump, please first install patch 16209716B.
The crash dump procedure has been modified to improve reliability and diagnostic value of information contained in the crash dump.
To obtain a more accurate snapshot of the system and improve the reliability of the information in the crash dump, all interrupts that are not required during the dump process are masked. Timeout and STREAMS routines are no longer executed during the crash dump procedure.
When a 1.3 crash dump is taken to disk or restored from tape, the following files are created:
crash.MMDD
sym.MMDD
kernel.MMDD
mtune.MMDD
stune.MMDD
pkginfo.MMDD
mdevice.MMDD
sdevice.MMDD
contents.MMDD
hinv.MMDD
By default, if the files are written to disk, they are written to the /crash directory.
This is what you will see on the system console:
Starting crash dump
Do you wish to take the dump on tape?
Press < t > for tape dump, < s > for swap or < e > to exit:
Entering a t will write the dump to tape, s will write the dump information to swap and e will skip writing the dump.
The system will wait for 10 minutes at the prompt. If no response is entered, the system will proceed to dump to swap. If there is not enough room in swap, the system will display the following:
Main memory is greater than primary swap space.
Main memory dump to primary swap will be partial.
Press < t > for tape dump, < s > for swap:
At this point, if s is entered, the system will take the dump, but it will be only a partial dump.
If t is entered, the following will be displayed:
Insert writable tape in drive#0
Press any key to start dump, < e > to exit.
You need to insert a write-enabled tape into the tape drive and press any key to start the dump. The dump process will write to multiple tapes if necessary.
While the system is processing a crash dump, the system console will display:
Trying to dump n Pages
Elapsed Time = hh:mm:ss Estimated End Time = hh:mm:ss
If you chose to dump directly to tape, you will be asked during the reboot procedure to reinsert the last tape so that other files, such as the symbol and mtune files, can be copied to the same tape. Before rebooting the system, the console will show:
hinv file will be appended to the tape after the system comes up
Do not remove the tape from the drive.
When the system is rebooting, the following sequence of events will occur:
... Normal boot messages
Node: systemV4
Direct dump to tape had been taken earlier
Please load last of the dump tape(s) for dumping unix files
Press Y to continue or N to skip it : Y
The copy of unix files will take some time .. please wait
Saving symbol, unix, contents, mdevice, sdevice, mtune, stune, pkginfo files.
contents
mdevice
sdevice
pkginfo
mtune
stune
symfile
unix
n blocks written
The system then continues to boot, displaying messages as drivers are loaded. The hinv file will be appended to the tape AFTER the system has reached multiuser mode. Do not remove the tape from the drive until AFTER the hinv file has been created! Prior to the setting keyboard message, you will see:
hinv file written to tape
Any fixed block tape device can be configured as the default drive on which a crash dump will be taken. The default can be modified by editing the file /etc/conf/pack.d/kernel/space.c. Change the value of the iDumpTapeMin entry which indicates the minor number of the tape device that will, by default, be used for crash dumps. A newly installed 1.3QT1 system would show:
/* The major and minor device numbers used by sysdump while dumping to
the swap and the tape */
int iDumpTapeMaj= UCST_CMAJOR_0, iDumpTapeMin = 0;
in the /etc/conf/pack.d/kernel/space.c file. To enable the dump procedure to write to a tape other than QIC, change the value in the space.c file, perform an idbuild and then reboot the system.
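The change to iDumpTapeMin can be made by hand in an editor. As an illustration only, the sketch below makes the same change with sed on text read from standard input, so it can be tried on a copy of space.c first. set_dump_tape_minor is a hypothetical helper name, and the minor number 1 is only an example value.

```shell
# set_dump_tape_minor N
# Hypothetical helper: copies space.c text from standard input to standard
# output with the numeric value assigned to iDumpTapeMin replaced by N.
set_dump_tape_minor() {
    sed "s/iDumpTapeMin *= *[0-9][0-9]*/iDumpTapeMin = $1/"
}

# Try it on the line shown above.
echo 'int iDumpTapeMaj= UCST_CMAJOR_0, iDumpTapeMin = 0;' | set_dump_tape_minor 1
# prints: int iDumpTapeMaj= UCST_CMAJOR_0, iDumpTapeMin = 1;

# On a real system (not run here), work on a copy, review it, and only then
# install it, perform the idbuild and reboot:
#   set_dump_tape_minor 1 < /etc/conf/pack.d/kernel/space.c > /tmp/space.c.new
```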
The crash dump process in 1.3.1 is similar to that in 1.3. However, changes have been made so that a selective file dump is possible. In the past, when a crash dump was taken, the system copied the entire memory to swap. As indicated in 1.3, this was changed somewhat to allow for dumping the image directly to tape. This step was not enough once memory configurations grew to 1 gigabyte and more: the crash dump became unmanageable because of the large amount of disk, swap or tape space required to save it.
In order to handle these large dumps more efficiently, the kernel code and the kcrash utility have been modified to dump only the kernel pages required for crash analysis, rather than dumping all of memory. This creates a more manageable dump. Both the size of the dump and the time required to take it decrease.
The new process allows the user to choose to get a complete dump (like those in previous releases) or a selective dump.
When the 1.3.1 system is processing a dump, the console displays:
Do you wish to take a complete or selective dump?
Press < c > for complete dump, < s > for selective dump:
After the type of dump is chosen, the rest of the dump proceeds as in 1.3.
U6000-SVR4 1.4
The crash dump process in 1.4 is somewhat similar to that in 1.3.1 and
1.3.2. In 1.4, the default dump type is selective. However, configuration
of dump type is very different in 1.4.
The default dump type is chosen at operating system installation time. The
crash dump process will use this configuration and the user will not be
prompted to choose the type of dump (selective or complete).
To modify the default dump type, the /etc/conf/cf.d/stune file must be
modified. The name of the kernel configuration parameter for dump type
in 1.4 is RAS_DUMP_METHOD. The following list shows valid types:
0 = Complete Dump
1 = Selective Dump
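As an illustration of the stune change, the sketch below rewrites the RAS_DUMP_METHOD entry in text read from standard input. It assumes that each stune line holds a parameter name followed by whitespace and a value; that layout, the helper name set_dump_method and the NPROC sample line are assumptions made for the example.

```shell
# set_dump_method VALUE
# Hypothetical helper: copies stune text from standard input to standard
# output with RAS_DUMP_METHOD set to VALUE (0 = complete, 1 = selective).
# The entry is appended if the input does not already contain one.
set_dump_method() {
    case $1 in
        0|1) ;;
        *) echo "set_dump_method: value must be 0 or 1" >&2; return 1 ;;
    esac
    awk -v val="$1" '
        $1 == "RAS_DUMP_METHOD" { print "RAS_DUMP_METHOD\t" val; found = 1; next }
        { print }
        END { if (!found) print "RAS_DUMP_METHOD\t" val }
    '
}

# Sample input only; a real stune file holds many more entries.
printf 'NPROC\t400\nRAS_DUMP_METHOD\t1\n' | set_dump_method 0
```

Run against a copy of /etc/conf/cf.d/stune and compare the result with the original before installing it.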
Prior to the 1.4 operating system release, a dump was first taken to swap and then either written to disk or tape. In 1.4, the following prompt will be given:
Select a device to take the dump to.
tape is the default.
Press < t > for tape dump, < s > for swap:
If a t is entered, then a list of configured tape drives will be displayed. This allows for a dump to be written to any tape drive on the system. For example:
Select Tape Drive to save the dump to :
1. Drive 1 ( TDC 4100 )
2. Drive 2 ( SDT-5000 )
X. Exit
Enter your choice :
Also, in 1.4, more files are written to the tape than in 1.3.x. These additional files are:
errlog.MMDD
errptfl.MMDD
The errptfl is derived from 'errpt -a' for the day of the crash.
It is also possible, in 1.4, to take a dump to a raw partition. This is particularly useful on large memory systems. Prior to 1.4, the downtime during a crash dump procedure could be quite long as the system could not be brought into multi-user mode until all the files were saved.
A new utility called mkdumpdisk allows a system to be configured with a raw disk slice as a dump device. The slice, referred to as "dumpdisk" is where all dumps will be saved. In this type of configuration, swap is not involved in the dump process. When the system is rebooted after a dump, the system will be brought into multi-user mode without having to wait for the dump files to be saved.
NOTE: the system console will not be accessible until the dump on
"dumpdisk" has been either discarded or saved to a permanent location. No
other users will be affected.
Restoring a Crash Dump:
To analyze a memory dump, you are going to need root access to an SVR4 system with a 1.4 kcrash executable or a kcrash executable from the same operating system that the dump was created on. You will also need quite a bit of disk space. Often, the dump arrives on a tape with little or no additional documentation. If the tape label indicates the file contents and the format, use the appropriate commands to restore from that format. If the tape format is not indicated, use the sequence suggested below to recover the submitted files from the tape.
Hopefully, the problem report or some other documentation with the dump indicates the system memory size. If it does, locate a file system with enough free disk space (memory size plus 2-4 megabytes). If you do not know the memory size, assume 256MB or more.
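The space estimate can be worked out as follows. This is a minimal sketch that takes the upper end of the 2-4 megabyte allowance and falls back to 256MB when no memory size is given; dump_space_kb is a hypothetical helper name.

```shell
# dump_space_kb [MEM_MB]
# Hypothetical helper: prints the free space to look for, in kilobytes,
# as memory size plus 4 MB. Defaults to 256 MB when no size is given.
dump_space_kb() {
    mem_mb=${1:-256}
    echo $(( ($mem_mb + 4) * 1024 ))
}

dump_space_kb 512   # prints 528384
dump_space_kb       # prints 266240
```

The result can be compared with the free space that df reports for the candidate file system.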
Next, make sure that you set your ulimit high enough so that you can create the files. First use ulimit -a to display the current values of the rlimit structure:
# ulimit -a
time(seconds) unlimited
file(blocks) 146485
data(kbytes) 16384
stack(kbytes) 16384
coredump(blocks) 2048
nofiles(descriptors) 64
vmemory(kbytes) 16384
The value of 'file(blocks) 146485' would allow you to create a file of close to 74MB. To change this value without modifying the kernel, use:
# ulimit -f unlimited
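The file size limit can be converted to kilobytes as sketched below. This assumes that ulimit counts file size in 512-byte blocks, which you should confirm for your release; blocks_to_kb is a hypothetical helper name.

```shell
# blocks_to_kb BLOCKS
# Hypothetical helper: converts a count of 512-byte blocks to kilobytes
# (integer arithmetic, so any half kilobyte is dropped).
blocks_to_kb() {
    echo $(( $1 * 512 / 1024 ))
}

blocks_to_kb 146485   # prints 73242
```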
Set the current directory to the file system that will hold the dump image and create a directory named for the dump, then cd to that directory. This organization helps keep track of materials. As analysis proceeds, notes and additional files can be saved in the dump directory and stay associated with the occurrence even if you must delete the crash dump itself to save space.
It will take some time to restore the crash dump so you may want to start the restore and come back later to start the analysis.
Loading a savedump tape 1.2:
Tapes created directly by the system when booting after a panic
contain three files: a crash image, a symbol file and a kernel
image. The crash image was created by a utility called diskutil.
Do not use dd to load these tapes.
Diskutil places a header on the front of the crash image to accommodate multiple tapes, identify the size of the dump, etc. This header must be removed before kcrash can be used to examine the crash dump.
These dumps should be restored using /sbin/restoredump. This script will restore the crash, symbol and kernel files from tape to disk, but is hard-coded to load these files to the /crash directory. To modify where the files are restored, find the three lines in the restoredump script which read:
C="/crash/crash.`date +%m%d`"
S="/crash/sym.`date +%m%d`"
K="/crash/kernel.`date +%m%d`"
and change them to read:
C="crash.`date +%m%d`"
S="sym.`date +%m%d`"
K="kernel.`date +%m%d`"
Using the modified restoredump script will cause the files to be loaded in the directory from which the command is run. Note that the names of the three files loaded will have the current system date and not the date the crash dump was created.
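The same edit can be made with sed on a copy of the script instead of an editor. The sketch below assumes the three assignments appear exactly as quoted above, optionally indented; strip_crash_prefix is a hypothetical helper name.

```shell
# strip_crash_prefix
# Hypothetical helper: copies script text from standard input to standard
# output, removing the /crash/ prefix from the C=, S= and K= assignments.
strip_crash_prefix() {
    sed 's#^\([[:space:]]*[CSK]="\)/crash/#\1#'
}

# Try it on one of the lines quoted above.
printf '%s\n' 'C="/crash/crash.`date +%m%d`"' | strip_crash_prefix
# prints: C="crash.`date +%m%d`"

# On a real system (not run here), write the result to a private copy and
# leave /sbin/restoredump itself untouched:
#   strip_crash_prefix < /sbin/restoredump > ./restoredump.local
```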
Loading a savedump tape 1.3, 1.3.x and 1.4:
When a tape is created with savedump, it should be restored using restoredump. An enhancement to 1.3 is that a dump can be restored from any tape drive that is attached to your system without you having to edit the /sbin/restoredump script.
You must be root or root-equivalent to run restoredump:
# restoredump
Dump in default directory (/crash) (y/n)?n
- Enter the directory where the crash is to be stored
/home/dumps/n
Select Tape Drive from which dump is to be restored
1. Drive 1 (VIPER_2525_25462 )
X. Exit
Enter your choice : 1
Restore Crash Dump Started.
Insert tape and press RETURN to start
Restoring nn Mb crash dump
Restored n % Approx. End Time : a b d HH:MM:SS yyyy
Restore of Symbol, kernel, contents, mdevice, sdevice,
mtune, stune files started
n blocks
Restore Done.
Restore hinv File Started
hinv file restored
Crash Dump File : /home/dumps/n/crash.MMDD Created.
Rewinding tape
#
The symbol, kernel, contents, mdevice, sdevice, mtune, stune and hinv files will be restored to a directory called cMMDD.nnnn in the current directory.
Since restoredump shows the size of the crash dump, if you do not have enough space in the current directory, simply abort restoredump using ctrl-d.
Loading a dump created with savedump without using restoredump
If you have a raw savedump image on disk, use the diskutil command
to strip the header. (The -R option is not documented on the
diskutil(1M) man page.)
# /etc/diskutil -R infile outfile
Be aware that this will create a second copy of the crash image on disk. There is no way for diskutil to strip the header information in place. It may be necessary due to disk space limitations to load the dump image to tape and then load it back to disk. If this is so, use:
# dd if=dumpfile of=/dev/rmt4
# rm dumpfile
# /etc/diskutil -R tapedevice dumpfile
Check the write-protect setting on the dump image tape and be sure it is set to SAFE. Load the tape into the drive and attempt to read the dump image from it using:
# cpio -icvQdm < /dev/rmt4
Do not use the 'u' option when copying from an unknown tape. You may overwrite files that you really do not want to lose. The cpio command may fail right away with an error message that the tape is not a cpio archive. If so, rewind and try the tar command:
# tar xf /dev/rmt4
If the tape is a tar archive, follow the same steps as for a cpio archive, substituting 'tar' for 'cpio' in all operations. If neither tape archive command recognizes the tape format, load the dump image with dd as described below. If the cpio command works, note the file names that are loaded. If the dump was created with absolute path names, be sure not to overwrite your /unix file. To selectively restore or rename files, use the 'r' option of cpio. If anything goes wrong on the first effort, rewind the tape with:
# tsioctl -c rewind /dev/rmt4
and retry the cpio command.
When the files from the first archive on the tape have been read, check for any further tape archives with the command:
# dd if=/dev/rmt4 of=tapefile.1 bs=64k
Repeat this command, incrementing the tag number on 'tapefile' until dd exits immediately with an 'errno 5: I/O failure', which indicates that you have reached the end of recorded data on the tape. Normally, there will only be one archive on the tape, but sometimes additional supporting files have been written to the tape by appending additional archives. If the dd command exits successfully, check the archive contents using cpio and tar, as you would when starting with an unknown tape:
# cpio -icvd < tapefile.1
Once you have identified the tape format and contents, add the information to the tape label so anyone reloading the tape will not have to use trial and error to read the tape contents.
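Before running cpio or tar against a file saved with dd, its first bytes can give a hint about the format. This is a different technique from the trial-and-error sequence above, offered as a sketch only: it recognizes ASCII cpio headers (which begin with 070701, 070702 or 070707) and POSIX tar headers (which carry the string "ustar" at byte offset 257). Binary cpio archives and older tar archives are not recognized and are reported as unknown. guess_archive is a hypothetical helper name.

```shell
# guess_archive FILE
# Hypothetical helper: prints "cpio", "tar" or "unknown" from the header bytes.
guess_archive() {
    # ASCII cpio archives start with a six-character magic number.
    head6=$(dd if="$1" bs=1 count=6 2>/dev/null | tr -d '\000')
    case $head6 in
        07070[127]) echo cpio; return 0 ;;
    esac
    # POSIX tar headers hold "ustar" at byte offset 257 of the first block.
    magic=$(dd if="$1" bs=1 skip=257 count=5 2>/dev/null | tr -d '\000')
    if [ "$magic" = ustar ]; then
        echo tar
    else
        echo unknown
    fi
}

# Example use (not run here):
#   guess_archive tapefile.1
```

An "unknown" result does not prove that the file is a raw dump image; it only means that neither header was found.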
If no other method works, try dd :
# dd if=/dev/rmt4 of=tapefile.1 bs=64k
To recover any additional files use the above command, but increment the tag number of 'tapefile'. Usually, the first file is the crash file, the second the symbol file and the third the kernel image (/unix).
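Repeating the dd command by hand can be replaced with a small loop. The sketch below assumes that the tape device does not rewind between reads, as the repeated dd commands above imply. It stops when dd fails, and also when dd produces an empty file or a maximum count is reached; those last two stop conditions are additions for safety and are not part of the procedure described above. read_tape_files is a hypothetical helper name.

```shell
# read_tape_files DEVICE PREFIX [MAX]
# Hypothetical helper: saves successive tape files as PREFIX.1, PREFIX.2, ...
# and prints how many files were saved. dd messages are discarded, so the
# end-of-data error is not shown.
read_tape_files() {
    dev=$1
    prefix=$2
    max=${3:-20}
    n=1
    while [ "$n" -le "$max" ]; do
        out="$prefix.$n"
        # Stop on a dd failure (end of recorded data) or on an empty file.
        if ! dd if="$dev" of="$out" bs=64k 2>/dev/null || [ ! -s "$out" ]; then
            rm -f "$out"
            break
        fi
        n=$((n + 1))
    done
    echo $((n - 1))
}

# Example use (not run here):
#   read_tape_files /dev/rmt4 tapefile
```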
Loading absolute path files:
If cpio -r does not work, this method will:
# cp /unix /unix.save
# cpio -icvQum /unix < /dev/rmt4
# cp /unix /crash/sym.foo
# mv /unix.save /unix
# tsioctl -c retension /dev/rmt0
and clean the tape drive using an appropriate drive cleaning kit. If the read error persists, try the same operations on a different tape drive. Then, if the tape is still not readable, you can confidently say there is a hard read error on the tape. Note whether the error occurs immediately, indicating there is nothing written on the tape, or whether it occurs after reading some data.