Troubleshooting SEVM/VxVM 3.x
This article walks you through common Sun StorEdge Volume Manager (SEVM)/Veritas Volume Manager (VxVM) issues and how to troubleshoot them on a Solaris-based system. It is not intended to turn you into a VxVM specialist overnight, and it does not cover clustered configurations; it is simply a starting point for troubleshooting common volume manager problems.
Mapping SEVM to VxVM
Traditionally, Sun Microsystems has rebadged the Veritas Volume Manager packages and released them as its own product. The following table maps the SEVM versions to the corresponding Veritas releases:
Sun Version | Veritas Version
---|---
SSA Volume Manager 2.3 | 2.3
Sun Volume Manager 2.4 | 2.4
Sun Enterprise Volume Manager 2.5 | 2.5
Sun StorEdge Volume Manager 2.6 | 2.5.3
3.0.2 | 3.0.2
3.0.3 | 3.0.3
3.0.4 | 3.0.4
3.1 | 3.1
3.1.1 | 3.1.1
3.2 (imminent) | 3.2
As a consequence, any packages badged as Sun Volume Manager have SUNW package names; from Volume Manager 3.x onwards, Sun is simply reselling the Veritas release, so the packages carry VRTS names.
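To confirm which flavour is installed on a given system, the package database can be checked; the package name pattern below is only an illustrative assumption and may need adjusting:
s4m-vm# pkginfo | egrep 'SUNWvxv|VRTSvx'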
Troubleshooting and Common Procedures
Dead disk
Disk problems are most commonly discovered because a user has logged a fault after losing access to their data, the root user has received an email, or someone has noticed a problem in the GUI or in vxprint output. The vxprint output is the first place to look for the affected volume:
v  smurfp3      fsgen        DISABLED ACTIVE   40960    SELECT    smurfp3-01
pl smurfp3-01   smurfp3      DISABLED NODEVICE 41456    STRIPE    2/128     RW
sd disk08-09    smurfp3-01   disk08   450240   20720    0/0       -         NDEV
sd disk10-10    smurfp3-01   disk10   636160   20720    1/0       c2t3d1    ENA
From the output it is clear that there is a problem with the volume smurfp3. The volume is a stripe, so there is no redundancy; the volume is in the DISABLED ACTIVE state and the plex is in the DISABLED NODEVICE state. We can see why by looking at the associated subdisks: one of them is in the NDEV (no device) state.
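To display just the affected volume, something along these lines can be used (the disk group name here is assumed from the vxdisk list output below):
s4m-vm# vxprint -g datadg -ht smurfp3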
If we now take a look at the output from "vxdisk list":
s4m-vm# vxdisk list
DEVICE       TYPE      DISK         GROUP        STATUS
c0t0d0s2     sliced    rootdisk     rootdg       online
c2t0d0s2     sliced    rootmirror   rootdg       online
c2t0d1s2     sliced    disk01       datadg       online
c2t0d2s2     sliced    datadg02     datadg       online
c2t1d0s2     sliced    -            -            error
c2t1d1s2     sliced    datadg03     datadg       online
c2t1d2s2     sliced    disk11       datadg       online
c2t2d0s2     sliced    disk06       datadg       online
c2t2d1s2     sliced    -            -            online
c2t2d2s2     sliced    disk88       rootdg       online
c2t3d0s2     sliced    disk01       app_dg       online
c2t3d1s2     sliced    disk10       datadg       online
c2t3d2s2     sliced    disk03       datadg       online
c2t4d0s2     sliced    disk05       datadg       online
c2t4d1s2     sliced    -            -            online
c2t4d2s2     sliced    -            -            online
c2t5d0s2     sliced    disk04       datadg       online
c2t5d1s2     sliced    disk02       app_dg       online
c2t5d2s2     sliced    app_dg01     app_dg       online
-            -         disk08       datadg       failed was: c2t4d2s2
The disk that was associated with the subdisk has been moved into an error state and now has no disk name or assigned group. At the bottom of the list we can see the disk in question has failed, so check the messages file for any SCSI or Fibre Channel errors from this disk. If the disk is part of an SSA or a Photon (StorEdge A5x00), the output from luxadm may provide important information about the condition of the disk.
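As a rough sketch, those checks might look like the following (the device name is taken from the example above; the enclosure name is a placeholder):
s4m-vm# grep -i c2t4d2 /var/adm/messages
s4m-vm# luxadm probe
s4m-vm# luxadm display <enclosure>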
You can use the format command to check whether the disk is contained in a storage array; for example, a disk from an SSA will show the following:
1. c2t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80> /sbus@1f,0/SUNW,soc@1,0/SUNW,pln@a0000000,7411f0/ssd@0,0
An SSA will have the soc and pln drivers in the physical device path.
If the drive is in a Photon attached to a PCI system, the device path should look like this:
2. c2t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248> /pci@1f,2000/SUNW,ifp@1/ssd@w2200002037207492,0
or on an SBUS system:
2. c2t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248> /sbus@1f,0/SUNW,socal@1/sf@1,0/ssd@w2200002037207492,0
To remove the disk from VxVM control, use vxdiskadm option 4 ("Remove a disk for replacement"), replace the disk using whatever procedure is appropriate for the disk type, and then run the following command:
# vxdctl enable
This forces VM to go out and probe all the attached drives. On completion, the entry in "vxdisk list" will show a state of error since the new disk will have no private region.
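If the replacement drive does not appear at all, the Solaris device tree may need rebuilding first; on Solaris 7 and later this can be done with devfsadm (older releases use drvconfig and disks) before re-running vxdctl enable:
s4m-vm# devfsadm
s4m-vm# vxdctl enable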
To bring the disk back under VM control, use vxdiskadm option 5:
# vxdiskadm
Volume Manager Support Operations
Menu: VolumeManager/Disk

 1      Add or initialize one or more disks
 2      Encapsulate one or more disks
 3      Remove a disk
 4      Remove a disk for replacement
 5      Replace a failed or removed disk
 6      Mirror volumes on a disk
 7      Move volumes from a disk
 8      Enable access to (import) a disk group
 9      Remove access to (deport) a disk group
 10     Enable (online) a disk device
 11     Disable (offline) a disk device
 12     Mark a disk as a spare for a disk group
 13     Turn off the spare flag on a disk
 list   List disk information

 ?      Display help about menu
 ??     Display help about the menuing system
 q      Exit from menus

Select an operation to perform: 5

Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
Use this menu operation to specify a replacement disk for a disk that you
removed with the "Remove a disk for replacement" menu operation, or that
failed during use. You will be prompted for a disk name to replace and a
disk device to use as a replacement. You can choose an uninitialized disk,
in which case the disk will be initialized, or you can choose a disk that
you have already initialized using the Add or initialize a disk menu
operation.

Select a removed or failed disk [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

Disk group: app_dg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

Disk group: datadg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
dm datadg01     -            -        -        -        NODEVICE
dm datadg03     -            -        -        -        REMOVED

Select a removed or failed disk [<disk>,list,q,?] datadg03

Select disk device to initialize [<address>,list,q,?] list

DEVICE       DISK         GROUP        STATUS
c0t0d0       rootdisk     rootdg       online
c2t0d0       rootmirror   rootdg       online
c2t0d1       disk01       datadg       online
c2t0d2       datadg02     datadg       online
c2t1d0       -            -            error
c2t1d1       -            -            error
c2t1d2       disk11       datadg       online
c2t2d0       disk06       datadg       online
c2t2d1       -            -            online
c2t2d2       disk88       rootdg       online
c2t3d0       disk01       app_dg       online
c2t3d1       disk10       datadg       online
c2t3d2       disk03       datadg       online
c2t4d0       disk05       datadg       online
c2t4d1       -            -            online
c2t4d2       disk08       datadg       online
c2t5d0       disk04       datadg       online
c2t5d1       disk02       app_dg       online
c2t5d2       app_dg01     app_dg       online

Select disk device to initialize [<address>,list,q,?] c2t1d1

The requested operation is to initialize disk device c2t1d1 and to then
use that device to replace the removed or failed disk datadg03 in disk
group datadg.

Continue with operation? [y,n,q,?] (default: y) y

Replacement of disk datadg03 in group datadg with disk device c2t1d1
completed successfully.

Replace another disk? [y,n,q,?] (default: n) n
A check of vxdisk list will now show that the disk has been put back under VM control and into the disk group:
c2t1d1s2 sliced datadg03 datadg online
A check of the volume will now show that the plex is in the RECOVER state, as shown below:
V  NAME         USETYPE      KSTATE   STATE    LENGTH   READPOL   PREFPLEX
PL NAME         VOLUME       KSTATE   STATE    LENGTH   LAYOUT    NCOL/WID  MODE
SD NAME         PLEX         DISK     DISKOFFS LENGTH   [COL/]OFF DEVICE    MODE

v  smurfvol     fsgen        DISABLED ACTIVE   1024000  SELECT    smurfvol-01
pl smurfvol-01  smurfvol     DISABLED RECOVER  1026032  STRIPE    3/128     RW
sd disk01-01    smurfvol-01  disk01   64400    341600   0/0       c2t0d1    ENA
sd disk11-01    smurfvol-01  disk11   54320    341600   1/0       c2t1d1s2  ENA
sd disk06-01    smurfvol-01  disk06   56240    341600   2/0       c2t2d0    ENA
This is because the volume requires a complete recovery of the data, either from another plex or from backups (remember that this is a simple stripe, i.e. no redundancy). If the disk was not actually replaced, it is possible that the data still resides on the disk, since the initialisation process does not, in reality, delete the data; it just rewrites the private region.
However, before we can restore the data to this volume we need to get both the plex and the volume back into the ENABLED ACTIVE state. Referring to the flow chart in Appendix C, we can see that we need to run the following commands:
s4m-vm# vxmend -g datadg fix stale smurfvol-01
s4m-vm# vxmend -g datadg fix clean smurfvol-01
s4m-vm# vxvol -g datadg start smurfvol
The volume is now in the ENABLED ACTIVE state and is ready for a newfs and a restore from backups.
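As a rough sketch, assuming the volume holds a UFS file system, a mount point of /data and a restore from tape (all assumptions), this might look like:
s4m-vm# newfs /dev/vx/rdsk/datadg/smurfvol
s4m-vm# mount /dev/vx/dsk/datadg/smurfvol /data
s4m-vm# cd /data; ufsrestore rvf /dev/rmt/0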
Online failing
Often a disk shows its state as online failing. This is not the same as the failed disk in the example above; it does, however, indicate that VM has had a problem with the disk. The state is seen in the output from "vxdisk list" and appears in the form below:
s4m-vm# vxdisk list
DEVICE       TYPE      DISK         GROUP        STATUS
c0t0d0s2     sliced    rootdisk     rootdg       online
c2t0d0s2     sliced    rootmirror   rootdg       online
c2t0d1s2     sliced    disk01       datadg       online
c2t0d2s2     sliced    datadg02     datadg       online
c2t1d0s2     sliced    -            -            error
c2t1d1s2     sliced    datadg03     datadg       online failing
c2t1d2s2     sliced    disk11       datadg       online
In this case the volume will still be accessible. The failing flag (held in the private region) is set because VM has had a problem writing to the public region of the disk. Once this has occurred, VM tries to read the private region; if that succeeds, it sets the status of the disk to "online failing".
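A quick sanity check is to look at the drive's error counters (the device name below is taken from the example above):
s4m-vm# iostat -En c2t1d1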
In this situation you should use the usual methods to confirm whether or not there really is a fault, and if there is not, simply turn off the flag with the following command:
s4m-vm# vxedit -g datadg set failing=off datadg03
Disabled diskgroup
When a diskgroup becomes disabled, first run vxdg list to check the state of the groups:
s4m-vm# vxdg list
NAME         STATE           ID
rootdg       enabled         964110771.1025.s4m-vm
app_dg       disabled        927324467.2928.s4m-vm
datadg       enabled         931533699.3474.s4m-vm
First try to deport the group using the following:
s4m-vm# vxdg deport app_dg
If this works then attempt to re-import the group using the following command:
s4m-vm# vxdg -fC import app_dg
This command differs from the usual import in that we have supplied flags to force the import (-f) and to clear any locks on the diskgroup (-C).
"vxdg list" will confirm the result.
If this doesn’t work it may be because the vxconfigd has become "confused" and needs to be killed and restarted with the following command:
s4m-vm# vxconfigd -k -x cleartempdir
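Once vxconfigd has restarted, retry the import and check the result:
s4m-vm# vxdg -fC import app_dg
s4m-vm# vxdg list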
No valid configuration copies in the rootdg
When running vxdctl enable you get the error:
No valid configuration copies in the rootdg
Now is not the time to reboot!
First check the hostid field in the volboot file (this should be the same as the hostname for the server):
s4m-vm# more /etc/vx/volboot
volboot 3.1 0.1
hostid s4m-vm
end
###############################################################
###############################################################
#############################
Now compare this with the hostid recorded in the private region of one of the disks:
s4m-vm# /etc/vx/diag.d/vxprivutil scan /dev/dsk/c2t1d1s3
diskid:  996858917.2138.s4m-vm
group:   name=datadg id=931533699.3474.s4m-vm
flags:   private autoimport
hostid:  s4m-vm
version: 2.1
iosize:  512
public:  slice=4 offset=0 len=1043840
private: slice=3 offset=1 len=1119
update:  time: 996858919 seqno: 0.5
headers: 0 248
configs: count=1 len=795
logs:    count=1 len=120
If the two hostids differ then run the following command to reset them:
s4m-vm# vxdctl init <hostname>
s4m-vm# vxdctl enable
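The change can then be verified with vxdctl list, which displays the contents of the volboot file, including the hostid:
s4m-vm# vxdctl list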
Unrecoverable diskgroup configuration
In the event that a diskgroup cannot be imported because the private regions on some of the disks have become corrupt, you may still be able to find a valid private region on another disk which holds the configuration data needed to rebuild the volumes.
First use vxprivutil to check the disks for a valid private region:
s4m-vm# /etc/vx/diag.d/vxprivutil scan /dev/dsk/c2t2d2s3
diskid:  985018813.1916.s4m-vm
group:   name=rootdg id=964110771.1025.s4m-vm
flags:   private autoimport
hostid:  s4m-vm
version: 2.1
iosize:  512
public:  slice=4 offset=0 len=1043840
private: slice=3 offset=1 len=1119
update:  time: 996243427 seqno: 0.22
headers: 0 248
configs: count=1 len=795
logs:    count=1 len=120
Provided you can find a valid private region, you can check the configuration that it holds using the following:
s4m-vm# /etc/vx/diag.d/vxprivutil dumpconfig /dev/dsk/c2t2d2s3 | vxprint -D - -Ath
This produces the usual output that you would see when you run vxprint -Ath, only this time it is taken directly from the disk surface.
Now that we have confirmed the contents of the private region on disk c2t2d2, we can use the same vxprivutil command to dump the config to a file:
s4m-vm# /etc/vx/diag.d/vxprivutil dumpconfig /dev/dsk/c2t2d2s3 | \
        vxprint -D - -hmQqspv > rootdg.file
The file rootdg.file will now contain all of the vxprint output in a database format which can be used with vxmake to rebuild the diskgroup and its volumes. Before we can do this, however, we need to recreate the diskgroup and reinitialise all of its disks with the same names they had before the problem occurred.
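As an illustration only, recreating an ordinary (non-rootdg) diskgroup and its disks might look like the following; the group, disk and device names here are assumptions and must match those recorded in the dumped configuration:
s4m-vm# /etc/vx/bin/vxdisksetup -i c2t3d1
s4m-vm# /etc/vx/bin/vxdisksetup -i c2t3d2
s4m-vm# vxdg init datadg disk10=c2t3d1
s4m-vm# vxdg -g datadg adddisk disk03=c2t3d2
Once the diskgroup and its disks are back in place, the following command rebuilds the volume structures around the data: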
s4m-vm# vxmake -d rootdg.file
This will rebuild all the volumes in the rootdg. If we only wish to recover a single volume then we need to extract a small amount of data from the file rootdg.file:
s4m-vm# cat rootdg.file | vxprint -D - -hmQqspv <volume> > vol.file
Now when we run vxmake on this file, it will only try to rebuild our volume.
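A hedged example, assuming the volume is called smurfvol and the data on disk is known to be good (so no resynchronisation is wanted):
s4m-vm# vxmake -g rootdg -d vol.file
s4m-vm# vxvol -g rootdg init active smurfvol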
Veritas recommends that, particularly on large, mission-critical systems, a cron job is created to periodically dump the contents of the private region to a file using the following command:
s4m-vm# vxprint -hmQqspv > <file>
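A minimal sketch of such a crontab entry, assuming vxprint lives in /usr/sbin and the backup is written under /var/tmp (adjust paths and schedule to suit):
0 2 * * * /usr/sbin/vxprint -hmQqspv > /var/tmp/vxconfig.backup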