Troubleshooting ZFS

Most ZFS errors I've experienced fall into one of three categories:

Category           Description
missing devices    Missing devices are placed in a "faulted" state.
damaged devices    Caused by things like transient errors from the disk or controller, driver bugs, or accidental overwrites (usually on misconfigured devices).
data corruption    Data damage to top-level devices; usually requires a restore. Since ZFS is transactional, this only happens as a result of driver bugs, hardware failure, or filesystem misconfiguration.

It is important to check for all three categories of errors: one type of problem is often connected to another, so fixing a single problem is usually not sufficient.

Data integrity can be checked by running a manual scrub:

# zpool scrub <pool-name>
# zpool status -v <pool-name>

The last command checks the status after the scrubbing is complete.

The zpool status command also reports recovery suggestions for any errors it finds. These are reported in the action section of the output.

To diagnose a problem, use the output of the zpool status command and the fmd messages in /var/adm/messages.
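
For example, to dump the fault manager's error log and skim the system log for ZFS-related entries (the grep pattern is only an illustration):

# fmdump -ev
# grep -i zfs /var/adm/messages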

The config section of the status output reports the state of each device. The state can be:

Device State    Comments
ONLINE          Normal
FAULTED         Missing, damaged, or mis-seated device
DEGRADED        Device being resilvered
UNAVAILABLE     Device cannot be opened
OFFLINE         Administrative action

The status command also reports READ, WRITE or CHKSUM errors.
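
The OFFLINE state is normally entered and left deliberately. A minimal example, assuming a pool named tank and a disk c1t0d0 (both names are hypothetical):

# zpool offline tank c1t0d0
# zpool online tank c1t0d0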

To check whether any pools have problems, use:

# zpool status -x

This command only reports problem pools.
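If no problems are found, it typically just reports that all pools are healthy.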

If a pool's ZFS configuration becomes damaged, it can often be fixed by exporting and re-importing the pool.
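
For example, for a pool named tank (the name is hypothetical):

# zpool export tank
# zpool import tank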

Devices can fail for any of several reasons:

Reason                                        Comments
"Bit rot"                                     Corruption caused by random environmental effects.
Misdirected Reads/Writes                      Firmware or hardware faults cause reads or writes to be addressed to the wrong part of the disk.
Administrative Error
Intermittent, Sporadic or Temporary Outages   Caused by flaky hardware or administrator error.
Device Offline                                Usually caused by administrative action.

Once the problems have been fixed, transient errors should be cleared:

# zpool clear <pool-name>
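
zpool clear also accepts a device argument, which is useful when only one disk saw transient errors:

# zpool clear <pool-name> <device>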

In the event of a panic-reboot loop caused by a ZFS software bug, the system can be instructed to boot without the ZFS filesystems:

ok boot -m milestone=none

When the system is up, remount / read-write and remove the file /etc/zfs/zpool.cache.
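
One way to do this, assuming a UFS root (the exact mount options are an assumption and may vary by configuration):

# mount -o remount,rw /
# rm /etc/zfs/zpool.cache

The remainder of the boot can then proceed with: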

# svcadm milestone all

At that point import the good pools. The damaged pools may need to be re-initialized.
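
With no arguments, zpool import lists the pools available for import; a pool is then imported by name (tank is a hypothetical example):

# zpool import
# zpool import tank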