Troubleshooting ZFS
Most ZFS errors I've experienced generally fall into one of three categories:
Category | Description |
---|---|
missing devices | Missing devices are placed in a "faulted" state. |
damaged devices | Caused by things like transient errors from the disk or controller, driver bugs or accidental overwrites (usually on misconfigured devices). |
data corruption | Data damage to top-level devices; usually requires a restore. Since ZFS is transactional, this only happens as a result of driver bugs, hardware failure or filesystem misconfiguration. |
It is important to check for all three categories of errors: one type of problem is often connected to a problem from another category, so fixing a single problem is usually not sufficient.
Data integrity can be checked by running a manual scrub:
# zpool scrub <pool-name>
# zpool status -v <pool-name>
The second command checks the status after the scrub is complete.
The zpool status command also reports recovery suggestions for any errors it finds. These are reported in the action section of its output.
To diagnose a problem, use the output of the zpool status command and the fmd messages in /var/adm/messages.
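A quick way to gather that information is shown below (a minimal sketch; fmadm and fmdump are part of the Solaris fault manager, and tank is a hypothetical pool name):
# zpool status -v tank
# fmadm faulty
# fmdump -eV
# grep -i zfs /var/adm/messages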
The config section of the status output reports the state of each device. The state can be:
Device State | Comments |
---|---|
ONLINE | Normal |
FAULTED | Missing, damaged, or mis-seated device |
DEGRADED | Device being resilvered |
UNAVAILABLE | Device cannot be opened |
OFFLINE | Administrative action |
The status command also reports per-device READ, WRITE, and CKSUM error counts.
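For example, a mirror with a missing device might be reported roughly as follows (a hedged illustration only; the exact wording of the status and action messages varies between releases, and tank, c1t0d0, and c1t1d0 are hypothetical names):
# zpool status tank
  pool: tank
 state: DEGRADED
status: One or more devices could not be opened. Sufficient replicas exist for the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
config:
        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          mirror    DEGRADED     0     0     0
            c1t0d0  ONLINE       0     0     0
            c1t1d0  UNAVAIL      0     0     0  cannot open
errors: No known data errors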
To check if any problem pools exist, use
# zpool status -x
This command only reports problem pools.
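When no problems are found, it prints a one-line confirmation (the wording may vary slightly between releases):
# zpool status -x
all pools are healthy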
If a ZFS configuration becomes damaged, it can often be fixed by exporting and re-importing the pool.
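A minimal sketch of that sequence, assuming a pool named tank:
# zpool export tank
# zpool import tank
Running zpool import with no arguments lists any exported pools that are available for import.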
Devices can fail for any of several reasons:
Reason | Comments |
---|---|
"Bit rot" | Corruption caused by random environmental effects. |
Misdirected Reads/Writes | Firmware or hardware faults cause reads or writes to be addressed to the wrong part of the disk. |
Administrative Error | Accidental overwrites or misconfiguration, such as using a disk that is already part of a pool for another purpose. |
Intermittent, Sporadic or Temporary Outages | Caused by flaky hardware or administrator error. |
Device Offline | Usually caused by administrative action. |
Once the problems have been fixed, transient errors should be cleared:
# zpool clear <pool-name>
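The clear subcommand can also be limited to a single device when only one disk logged transient errors (a sketch with hypothetical names):
# zpool clear tank c1t1d0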
In the event of a panic-reboot loop caused by a ZFS software bug, the system can be instructed to boot without the ZFS filesystems:
ok> boot -m milestone=none
When the system is up, remount / as rw and remove the file /etc/zfs/zpool.cache. The remainder of the boot can proceed with:
# svcadm milestone all
At that point import the good pools. The damaged pools may need to be re-initialized.
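Pulled together, the cleanup after booting to the none milestone looks roughly like this (a sketch; the remount syntax assumes a UFS root file system, and tank is a hypothetical pool to re-import):
# mount -o remount,rw /
# rm /etc/zfs/zpool.cache
# svcadm milestone all
# zpool import tank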