Troubleshooting SEVM/VxVM 3.x

This article is designed to walk you through various common Sun StorEdge Volume Manager/Veritas Volume Manager issues and to help you troubleshoot them on a Solaris based system. It is not intended to turn you into a VxVM specialist overnight, and it does not cover clustered configurations; it is simply a starting point for troubleshooting common volume manager issues.

Mapping SEVM to VxVM

Traditionally, Sun Microsystems has rebadged the Veritas Volume Manager packages and released them as its own product. The following table maps the SEVM versions to the corresponding Veritas releases:

Sun Version                            Veritas Version
SSA Volume Manager 2.3                 2.3
Sun Volume Manager 2.4                 2.4
Sun Enterprise Volume Manager 2.5      2.5
Sun StorEdge Volume Manager 2.6        2.5.3
                                       3.0.2
                                       3.0.3
                                       3.0.4
                                       3.1
                                       3.1.1
                                       3.2 (imminent)

As a consequence, any packages badged as Sun Volume Manager have SUNW package names; once we get to Volume Manager version 3.x, Sun is merely reselling the Veritas version (VRTS packages).
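
A quick way to check which flavour is installed on a given system is to query the package database. This is only a sketch; the base package names below (SUNWvxvm for the Sun-badged releases, VRTSvxvm for the Veritas-badged 3.x releases) are the usual ones, but the exact set of packages varies between releases:

s4m-vm# pkginfo | grep -i vxvm
s4m-vm# pkginfo -l VRTSvxvm | grep VERSION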

Troubleshooting and Common Procedures

Dead disk

Disk problems most commonly come to light because a user has called in a fault after losing access to their data, because the root user has received an email, or because someone looking at the GUI or vxprint output has noticed a problem. The vxprint output is the first place to look for the affected volume:

v  smurfp3      fsgen        DISABLED ACTIVE     40960   SELECT   smurfp3-01
pl smurfp3-01   smurfp3      DISABLED NODEVICE   41456   STRIPE   2/128       RW
sd disk08-09    smurfp3-01   disk08   450240     20720   0/0      -           NDEV
sd disk10-10    smurfp3-01   disk10   636160     20720   1/0      c2t3d1      ENA

From the output, it is clear that there is a problem with the volume smurfp3. The volume is a stripe, so there is no redundancy; the volume is in the DISABLED ACTIVE state and the plex is in the DISABLED NODEVICE state. We can see why they are in this state by looking at the associated subdisks, where we find that one of them is in the NDEV (no device) state.
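
If you already know which volume is affected, you can restrict the vxprint output to just that volume. A hedged example, assuming the volume lives in a disk group named datadg:

s4m-vm# vxprint -g datadg -ht smurfp3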

If we now take a look at the output from "vxdisk list":

s4m-vm# vxdisk list
DEVICE       TYPE      DISK         GROUP       STATUS
c0t0d0s2     sliced    rootdisk     rootdg      online
c2t0d0s2     sliced    rootmirror   rootdg      online
c2t0d1s2     sliced    disk01       datadg      online
c2t0d2s2     sliced    datadg02     datadg      online
c2t1d0s2     sliced    -            -           error
c2t1d1s2     sliced    datadg03     datadg      online
c2t1d2s2     sliced    disk11       datadg      online
c2t2d0s2     sliced    disk06       datadg      online
c2t2d1s2     sliced    -            -           online
c2t2d2s2     sliced    disk88       rootdg      online
c2t3d0s2     sliced    disk01       app_dg      online
c2t3d1s2     sliced    disk10       datadg      online
c2t3d2s2     sliced    disk03       datadg      online
c2t4d0s2     sliced    disk05       datadg      online
c2t4d1s2     sliced    -            -           online
c2t4d2s2     sliced    -            -           online
c2t5d0s2     sliced    disk04       datadg      online
c2t5d1s2     sliced    disk02       app_dg      online
c2t5d2s2     sliced    app_dg01     app_dg      online
-            -         disk08       datadg      failed was: c2t4d2s2

The device that was associated with the subdisk no longer has a disk media name or disk group assigned to it, and at the bottom of the list we can see that the disk in question (disk08) has failed. Check the messages file for any SCSI or fibre channel errors from this disk; if the disk is part of an SSA or a Photon, the output from luxadm may also provide important information about its condition.
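
A few hedged examples of those checks (the enclosure name is a placeholder, and the messages file normally logs the disk under its sd/ssd driver instance rather than the cXtXdX name):

s4m-vm# grep -i ssd /var/adm/messages
s4m-vm# luxadm probe
s4m-vm# luxadm display <enclosure_name>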

You can use format to check whether the disk is contained in a storage array, for example a disk from an SSA will show the following:

1. c2t0d0 <SUN2.1G cyl 2733 alt 2 hd 19 sec 80>
          /sbus@1f,0/SUNW,soc@1,0/SUNW,pln@a0000000,7411f0/ssd@0,0

An SSA will have the soc and pln drivers in the physical device path.

If the drive is in a Photon attached to a PCI system, the device path should look like this:

2. c2t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
          /pci@1f,2000/SUNW,ifp@1/ssd@w2200002037207492,0

or on an SBUS system:

2. c2t0d0 <SUN18G cyl 7506 alt 2 hd 19 sec 248>
          /sbus@1f,0/SUNW,socal@1/sf@1,0/ssd@w2200002037207492,0

To remove the disk from VxVM control, use vxdiskadm option 4 (Remove a disk for replacement), replace the disk using whatever procedure is appropriate for the disk type, and then run the following command:

# vxdctl enable

This will force VM to go out and probe all the attached drives. On completion, the entry in "vxdisk list" will show a state of error, since the new disk will have no private region.
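
For example, assuming the replacement went in at the same target as the failed disk above, something like the following should now show the device with no disk name and an error status:

s4m-vm# vxdisk list | grep c2t4d2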

To bring the disk back under VM control use vxdiskadm option 5.

# vxdiskadm
Volume Manager Support Operations
Menu: VolumeManager/Disk
 1     Add or initialize one or more disks
 2     Encapsulate one or more disks
 3     Remove a disk
 4     Remove a disk for replacement
 5     Replace a failed or removed disk
 6     Mirror volumes on a disk
 7     Move volumes from a disk
 8     Enable access to (import) a disk group
 9     Remove access to (deport) a disk group
 10    Enable (online) a disk device
 11    Disable (offline) a disk device
 12    Mark a disk as a spare for a disk group
 13    Turn off the spare flag on a disk
 list  List disk information
 ?     Display help about menu
 ??    Display help about the menuing system
 q     Exit from menus

Select an operation to perform: 5

Replace a failed or removed disk
Menu: VolumeManager/Disk/ReplaceDisk
Use this menu operation to specify a replacement disk for a disk that you
removed with the "Remove a disk for replacement" menu operation, or that
failed during use. You will be prompted for a disk name to replace and a
disk device to use as a replacement. You can choose an uninitialized disk,
in which case the disk will be initialized, or you can choose a disk that
you have already initialized using the Add or initialize a disk menu
operation.

Select a removed or failed disk [<disk>,list,q,?] list

Disk group: rootdg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

Disk group: app_dg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE

Disk group: datadg

DM NAME         DEVICE       TYPE     PRIVLEN  PUBLEN   STATE
dm datadg01     -            -        -        -        NODEVICE
dm datadg03     -            -        -        -        REMOVED

Select a removed or failed disk [<disk>,list,q,?] datadg03

Select disk device to initialize [<address>,list,q,?] list

DEVICE       DISK         GROUP       STATUS
c0t0d0       rootdisk     rootdg      online
c2t0d0       rootmirror   rootdg      online
c2t0d1       disk01       datadg      online
c2t0d2       datadg02     datadg      online
c2t1d0       -            -           error
c2t1d1       -            -           error
c2t1d2       disk11       datadg      online
c2t2d0       disk06       datadg      online
c2t2d1       -            -           online
c2t2d2       disk88       rootdg      online
c2t3d0       disk01       app_dg      online
c2t3d1       disk10       datadg      online
c2t3d2       disk03       datadg      online
c2t4d0       disk05       datadg      online
c2t4d1       -            -           online
c2t4d2       disk08       datadg      online
c2t5d0       disk04       datadg      online
c2t5d1       disk02       app_dg      online
c2t5d2       app_dg01     app_dg      online

Select disk device to initialize [<address>,list,q,?] c2t1d1

The requested operation is to initialize disk device c2t1d1 and to then
use that device to replace the removed or failed disk datadg03 in disk
group datadg.

Continue with operation? [y,n,q,?] (default: y) y

Replacement of disk datadg03 in group datadg with disk device c2t1d1
completed successfully.

Replace another disk? [y,n,q,?] (default: n) n

A check of "vxdisk list" will now show that the disk has been put back under VM control:

c2t1d1s2     sliced   datadg03   datadg      online

A check of the volume will now show that the plex is in the RECOVER state, as shown below:

V  NAME         USETYPE     KSTATE    STATE    LENGTH  READPOL   PREFPLEX
PL NAME         VOLUME      KSTATE    STATE    LENGTH  LAYOUT    NCOL/WID   MODE
SD NAME         PLEX        DISK      DISKOFFS LENGTH  [COL/]OFF DEVICE     MODE

v  smurfvol     fsgen       DISABLED  ACTIVE   1024000 SELECT    smurfvol-01
pl smurfvol-01  smurfvol    DISABLED  RECOVER  1026032 STRIPE    3/128      RW
sd disk01-01    smurfvol-01 disk01    64400    341600  0/0       c2t0d1     ENA
sd disk11-01    smurfvol-01 disk11    54320    341600  1/0       c2t1d1s2   ENA
sd disk06-01    smurfvol-01 disk06    56240    341600  2/0       c2t2d0     ENA

This is because the plex requires a complete recovery of its data, either from another plex or from backups (remember that this is a simple stripe, i.e. no redundancy). If the disk was not actually replaced, it is possible that the data still resides on the disk, since the initialisation process does not, in reality, delete the data; it just rewrites the private region.

However, before we can restore the data to this volume we need to get both the plex and the volume back into the ENABLED ACTIVE state. This is done by referring to the flow chart in appendix C, from which we can see that we need to run the following commands:

s4m-vm# vxmend -g datadg fix stale smurfvol-01
s4m-vm# vxmend -g datadg fix clean smurfvol-01
s4m-vm# vxvol -g datadg start smurfvol

The volume is now in the ENABLED ACTIVE state and is ready for a newfs and a restore from backups.
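
As a sketch only, assuming the volume is to hold a UFS filesystem mounted on /mnt and the backup is a ufsdump image on the default tape drive:

s4m-vm# newfs /dev/vx/rdsk/datadg/smurfvol
s4m-vm# mount /dev/vx/dsk/datadg/smurfvol /mnt
s4m-vm# cd /mnt
s4m-vm# ufsrestore rf /dev/rmt/0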

Online failing

Often a disk shows its state as online failing. This is not the same as the failed disk in the example above; it does, however, indicate that VM has had a problem with the disk. The error is seen in the output from "vxdisk list" and appears in the form shown below:

s4m-vm# vxdisk list
DEVICE       TYPE      DISK         GROUP       STATUS
c0t0d0s2     sliced    rootdisk     rootdg      online
c2t0d0s2     sliced    rootmirror   rootdg      online
c2t0d1s2     sliced    disk01       datadg      online
c2t0d2s2     sliced    datadg02     datadg      online
c2t1d0s2     sliced    -            -           error
c2t1d1s2     sliced    datadg03     datadg      online failing
c2t1d2s2     sliced    disk11       datadg      online

In this case the volume will still be accessible. The failing flag (in the private region) is set because VM has had a problem writing to the public region of the disk. Once this has occurred, VM will try to read the private region; if this succeeds, it sets the status of the disk to "online failing".
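
A hedged way of checking whether the drive really is reporting errors is to look at its error counters (and at /var/adm/messages, as before):

s4m-vm# iostat -En | grep c2t1d1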

On this occasion you should use the usual methods to confirm whether or not there really is a fault, and if there is not, simply turn off the flag with the following command:

s4m-vm# vxedit -g datadg set failing=off datadg03

Disabled diskgroup

When a diskgroup becomes disabled, run vxdg list first to check on the state of the groups:

s4m-vm# vxdg list
NAME         STATE     ID
rootdg       enabled   964110771.1025.s4m-vm
app_dg       disabled  927324467.2928.s4m-vm
datadg       enabled   931533699.3474.s4m-vm

First try to deport the group using the following:

s4m-vm# vxdg deport app_dg

If this works then attempt to re-import the group using the following command:

s4m-vm# vxdg -fC import app_dg

This command differs from the usual import command in that we have supplied flags to force the import (-f) and to clear any import locks on the diskgroup (-C).

"vxdg list" will confirm the result.

If this doesn’t work, it may be because the vxconfigd daemon has become "confused" and needs to be killed and restarted with the following command:

s4m-vm# vxconfigd -k -x cleartempdir
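
Once vxconfigd has restarted, retry the forced import and confirm the result; a brief recap using the same group:

s4m-vm# vxdg -fC import app_dg
s4m-vm# vxdg list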

No valid configuration copies in the rootdg

When running vxdctl enable you get the error:

No valid configuration copies in the rootdg

Now is not the time to reboot!

First check the hostid field in the volboot file (this should be the same as the hostname for the server):

s4m-vm# more /etc/vx/volboot
volboot 3.1 0.1
hostid s4m-vm
end
###############################################################
###############################################################
#############################

Now compare this to the hostid recorded in the private region of one of the disks:

s4m-vm# /etc/vx/diag.d/vxprivutil scan /dev/dsk/c2t1d1s3
diskid:  996858917.2138.s4m-vm
group:   name=datadg id=931533699.3474.s4m-vm
flags:   private autoimport
hostid:  s4m-vm
version: 2.1
iosize:  512
public:  slice=4 offset=0 len=1043840
private: slice=3 offset=1 len=1119
update:  time: 996858919 seqno: 0.5
headers: 0 248
configs: count=1 len=795
logs:    count=1 len=120

If the two hostids differ, run the following commands to reset them:

s4m-vm# vxdctl init <hostname>
s4m-vm# vxdctl enable
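
To confirm that the new hostid has been written to the volboot file, you can either look at the file again or use vxdctl list, which prints its contents:

s4m-vm# vxdctl list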

Unrecoverable diskgroup configuration

In the event that a diskgroup cannot be imported because the private regions on a couple of its disks have become corrupt, you may still be able to find a valid private region on another disk which holds the data needed to rebuild the volumes.

First use vxprivutil to check the disks for a valid private region:

s4m-vm# /etc/vx/diag.d/vxprivutil scan /dev/dsk/c2t2d2s3
diskid:  985018813.1916.s4m-vm
group:   name=rootdg id=964110771.1025.s4m-vm
flags:   private autoimport
hostid:  s4m-vm
version: 2.1
iosize:  512
public:  slice=4 offset=0 len=1043840
private: slice=3 offset=1 len=1119
update:  time: 996243427 seqno: 0.22
headers: 0 248
configs: count=1 len=795
logs:    count=1 len=120
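
If you are not sure which disk still holds a good copy, a simple shell loop can scan all of the candidate disks. This is only a sketch and assumes sliced-format disks on controller c2 with the private region on slice 3, as in the examples above:

s4m-vm# for d in /dev/dsk/c2t*d*s3
> do
>   echo "=== $d ==="
>   /etc/vx/diag.d/vxprivutil scan $d 2>/dev/null | egrep 'diskid|group|update'
> done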

Provided you can find a valid private region, you can check the configuration that it holds using the following:

s4m-vm# /etc/vx/diag.d/vxprivutil dumpconfig /dev/dsk/c2t2d2s3 | vxprint -D - -Ath

This produces the same output that you would see when you run vxprint -Ath, only this time the information is taken directly from the disk.

Now that we have confirmed the contents of the private region on disk c2t2d2, we can use the same vxprivutil command to dump the config to a file:

s4m-vm# /etc/vx/diag.d/vxprivutil dumpconfig /dev/dsk/c2t2d2s3 | \
vxprint -D - -hmQqspv > rootdg.file

The file rootdg.file will now contain all of the vxprint output in a database format which can be used with vxmake to rebuild the diskgroup/volumes. Before we can do this, however, we need to rebuild the diskgroup and reinitialise all of its disks with the same names they had before the problem occurred (see the sketch after the single-volume example below). Once this has been done, we can use the following command to rebuild the volume structures around the data:

s4m-vm# vxmake -d rootdg.file

This will rebuild all the volumes in the rootdg. If we only wish to recover a single volume then we need to extract a small amount of data from the file rootdg.file:

s4m-vm# cat rootdg.file | vxprint -D - -hmQqspv <volume> > vol.file

Now when we run vxmake on this file it will only try to rebuild our volume.
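
Putting the pieces together, the sketch below shows what the rebuild of a data disk group might look like. The device names, disk media names, volume name and the file name datadg.file are purely illustrative and must be replaced with the names from the original configuration (rebuilding rootdg itself is a more involved procedure and is not covered here):

s4m-vm# /etc/vx/bin/vxdisksetup -i c2t0d1
s4m-vm# /etc/vx/bin/vxdisksetup -i c2t0d2
s4m-vm# vxdg init datadg disk01=c2t0d1
s4m-vm# vxdg -g datadg adddisk datadg02=c2t0d2
s4m-vm# vxmake -g datadg -d datadg.file
s4m-vm# vxvol -g datadg init active smurfvol

The vxvol init active step marks the volume and its plexes as ACTIVE without a resynchronisation, which is appropriate when the data on disk is believed to be intact.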

Veritas recommend that, particularly on large mission critical systems, a cron job is created to periodically dump out the contents of the private region (the volume manager configuration) to a file using the following command:

s4m-vm# vxprint -hmQqspv > <file>
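
A hedged example of such a cron entry, added to root's crontab with crontab -e; the schedule, disk group and output path are only illustrative:

0 2 * * * /usr/sbin/vxprint -g datadg -hmQqspv > /var/tmp/datadg.config 2>&1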