Troubleshooting ZFS

Most ZFS errors I've experienced fall into one of three categories:

Category           Description
missing devices    Missing devices are placed in a "faulted" state.
damaged devices    Caused by things like transient errors from the disk or controller, driver bugs, or accidental overwrites (usually on misconfigured devices).
data corruption    Data damage to top-level devices; usually requires a restore. Since ZFS is transactional, this only happens as a result of driver bugs, hardware failure, or filesystem misconfiguration.

It is important to check for all three categories of errors: one type of problem is often connected to another, so fixing a single problem is usually not sufficient.

Data integrity can be checked by running a manual scrub:

# zpool scrub <pool-name>
# zpool status -v <pool-name>

The last command checks the status after the scrubbing is complete.

The zpool status command also reports recovery suggestions for any errors it finds. These are reported in the action section of the output.

To diagnose a problem, use the output of the zpool status command and the fmd messages in /var/adm/messages.
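
For example, to dump the fault manager's error log and skim the system log for ZFS-related entries (the grep pattern is only an illustration):

# fmdump -ev
# grep -i zfs /var/adm/messages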

The config section of the status output reports the state of each device. The state can be:

Device State    Comments
ONLINE          Normal
FAULTED         Missing, damaged, or mis-seated device
DEGRADED        Device being resilvered
UNAVAILABLE     Device cannot be opened
OFFLINE         Administrative action

The status command also reports READ, WRITE or CHKSUM errors.
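
The OFFLINE state is normally entered and left deliberately. A minimal example, assuming a pool named tank and a disk c1t0d0 (both names are hypothetical):

# zpool offline tank c1t0d0
# zpool online tank c1t0d0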

To check whether any pools have problems, use:

# zpool status -x

This command only reports problem pools.
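If no problems are found, it typically just reports that all pools are healthy.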

If a pool's ZFS configuration becomes damaged, it can often be fixed by exporting and re-importing the pool.
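
For example, for a pool named tank (the name is hypothetical):

# zpool export tank
# zpool import tank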

Devices can fail for any of several reasons:

Reason                                        Comments
"Bit rot"                                     Corruption caused by random environmental effects.
Misdirected Reads/Writes                      Firmware or hardware faults cause reads or writes to be addressed to the wrong part of the disk.
Administrative Error
Intermittent, Sporadic or Temporary Outages   Caused by flaky hardware or administrator error.
Device Offline                                Usually caused by administrative action.

Once the problems have been fixed, transient errors should be cleared:

# zpool clear <pool-name>
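
zpool clear also accepts a device argument, which is useful when only one disk saw transient errors:

# zpool clear <pool-name> <device>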

In the event of a panic-reboot loop caused by a ZFS software bug, the system can be instructed to boot without the ZFS filesystems:

ok boot -m milestone=none

When the system is up, remount / read-write and remove the file /etc/zfs/zpool.cache.
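
One way to do this, assuming a UFS root (the exact mount options are an assumption and may vary by configuration):

# mount -o remount,rw /
# rm /etc/zfs/zpool.cache

The remainder of the boot can then proceed with: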

# svcadm milestone all

At that point import the good pools. The damaged pools may need to be re-initialized.
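
With no arguments, zpool import lists the pools available for import; a pool is then imported by name (tank is a hypothetical example):

# zpool import
# zpool import tank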