Solaris Fault Management Architecture

Solaris Fault Management Architecture (FMA) introduces a new software architecture and methodology for fault management across Sun's product line. The Solaris fault manager is a dynamically extensible framework designed to record errors and faults and to assist in automating fault diagnosis and recovery.

The Solaris fault manager framework includes a fault management daemon and libraries of fault management APIs and interfaces that support plug-in extensions of the daemon. The fault management daemon records all error and fault events it processes to persistent logs.

The daemon assists in fault diagnosis by consuming error report events from an error channel and dispatching those events to appropriate diagnosis engines. Diagnosis engines (DEs) are plug-in extensions of the fault management daemon, dynamically loaded and run as though part of the daemon. The daemon assists in fault recovery by publishing the results of diagnosis and resource status to layered administrative tools and agent software that can respond to the diagnosed fault.

The glossary below defines the key error and fault management terms.

  • Error – An unexpected condition, result, signal or datum.
  • Fault – A defect that may produce an error.
  • Error report – Data generated as the result of observing or detecting an error. Sometimes used as a synonym for error event.
  • Error event – An instance of an error report encoded in the protocol.
  • Fault event – An instance of a fault diagnosis encoded in the protocol.
  • FMRI – A Fault Managed Resource Identifier (FMRI) is the resource name used to identify components for the purposes of fault and error event propagation. An FMRI may be represented as a set of name-value pairs or as a formatted text string.
  • Diagnosis engine – Software that can infer the existence of a specific fault from observations of errors.
  • Fault manager – Software component responsible for fault diagnosis via one or more diagnosis engines and state management.
  • Fault region – Logical partition of hardware or software elements that can enumerate a specific set of faults that are defined to be contained within the enclosing regional boundary.
  • Fault management plug-in – A dynamically loaded library which extends the capabilities of the fault manager daemon.
  • Fault management exercise – The end-to-end process from error observation through fault diagnosis to the action that eliminates the fault.
  • Resource – A set of data described in the Solaris FMA resource identification protocol. The resource's description in the protocol is termed an FMRI (Fault Managed Resource Identifier).
  • Suspect list – A group of one or more fault events and supporting data such as a unique identifier and diagnosis code string, encoded in the Solaris FMA event protocol.
  • ASRU – The Automated System Reconfiguration Unit or ASRU is the part of the system which can be disabled or unconfigured by software or hardware. This reconfiguration is performed after fault diagnosis to prevent additional errors.
  • FRU – The Field Replaceable Unit or FRU is a part that must be repaired in order to remove the fault from the system and can be changed by service personnel.
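To illustrate the FMRI definitions above, the sketch below shows a hypothetical resource identifier in both forms described in the glossary. The scheme and member names are illustrative only, not the exact Solaris FMA encoding:

```python
# A hypothetical FMRI for a CPU, first as name-value pairs, then rendered
# as a formatted string. The scheme and member names are illustrative.
fmri_pairs = {
    "scheme": "hc",        # hardware-component-style scheme (illustrative)
    "motherboard": 0,
    "cpu": 1,
}

def fmri_to_string(pairs):
    """Render the name-value pairs as a path-style FMRI string."""
    path = "/".join(f"{k}={v}" for k, v in pairs.items() if k != "scheme")
    return f"{pairs['scheme']}:///{path}"

print(fmri_to_string(fmri_pairs))  # hc:///motherboard=0/cpu=1
```

Either representation names the same resource; the pair form is convenient for programmatic matching, while the string form is convenient for display and logging.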

Solaris FMA Features

The Solaris Fault Management Architecture model defines three activities with which fault management code must concern itself. These activities are:

  • Error handling – the immediate synchronous handling of the error
  • Fault diagnosis – the asynchronous correlation of the error report with other error reports and related data in order to diagnose and react to the underlying fault
  • Response – the autonomic response of fault management software, which performs isolation and self-healing tasks

Error handling begins upon detection of an error. Error handlers are expected to capture sufficient data to diagnose the underlying problem, isolate the effects of the error, and generate an appropriate error report to describe the error to other software such as a diagnosis engine.

Fault diagnosis is the asynchronous analysis of a problem or defect in the system. Typically, a diagnosis is inferred from a stream of error report telemetry. Once a diagnosis of the problem has been determined, system fault management software autonomically responds to perform isolation and self-healing tasks. For example, a hardware component called out in a diagnosis can be automatically reconfigured or disabled until it is repaired and reintegrated into the system. Other responses to a diagnosis include printing a message, failing over a service, or paging an administrator. A flowchart of fault management activities is shown in Figure 1 below. An ereport event denotes the asynchronous transmission of an error report to a piece of software responsible for its diagnosis. A fault event denotes the asynchronous transmission of a fault diagnosis to agent software responsible for taking some action in response to the diagnosis.
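The flow just described, with ereport events feeding a diagnosis engine and fault events feeding an agent, can be sketched as a minimal model. The event class names, the threshold, and the FMRI below are hypothetical, and the simple counting rule stands in for whatever inference a real diagnosis engine performs:

```python
# A minimal model of the flow in Figure 1: error handlers publish ereport
# events, a diagnosis engine correlates them, and a resulting fault event
# is delivered to an agent. All names and thresholds are hypothetical.
from collections import Counter

class DiagnosisEngine:
    """Infers a fault once enough matching error reports are seen."""
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.counts = Counter()

    def consume_ereport(self, ereport):
        """Return a fault event once the threshold is crossed, else None."""
        fmri = ereport["resource"]
        self.counts[fmri] += 1
        if self.counts[fmri] == self.threshold:
            return {"class": "fault.memory.page", "resource": fmri}
        return None

def agent_respond(fault_event):
    # A real agent might retire the page or notify syslogd.
    return f"retiring {fault_event['resource']}"

de = DiagnosisEngine(threshold=3)
fault = None
for _ in range(3):
    fault = de.consume_ereport({"class": "ereport.memory.ce",
                                "resource": "mem:///pa=0x1000"}) or fault
print(agent_respond(fault))  # retiring mem:///pa=0x1000
```

The important structural point is the decoupling: the error handler, the diagnosis engine, and the agent communicate only through ereport and fault events, so each can be replaced or extended independently.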

The Solaris FMA program is designed to build the infrastructure necessary to move our software model to a fault-centric model in which error reports are systematically correlated into a binary telemetry flow described by an event protocol. This telemetry is dispatched to appropriate diagnosis engine software, which diagnoses the corresponding fault and produces another telemetry stream of fault events that describe each fault to the agent software. Using data from a fault event, an agent can then describe the fault impact and corrective action to an administrator or field engineer. Whenever possible, the appropriate corrective action is performed automatically in response to the fault and in conjunction with administrator-defined policies. The FMA program delivers a common infrastructure that permits all Solaris platforms to export a common administrative and support model, and provides structured data-gathering facilities so that FMA capabilities can be measured and improved.

Solaris FMA models a system as a recursive hierarchical or overlapping set of fault regions. Each fault region is a logical partition of hardware or software elements that can enumerate a specific set of faults that are defined to be contained within the regional boundary. Many models of a system are possible, depending on the set of faults and the fault regions that are defined.

Figure 2, for example, illustrates how a DRAM fault is considered to be contained in enclosing fault regions for the physical page in memory, system board and hardware domain.
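The containment relationship in Figure 2 can be modeled as a simple chain of enclosing regions; the region names below are illustrative stand-ins for the page, board, and domain regions in the figure:

```python
# Nested fault regions from Figure 2, modeled as a containment chain:
# a DRAM fault is contained in the page, board, and domain regions.
# The region names are illustrative.
regions = {
    "dram": "page",     # each region maps to its enclosing region
    "page": "board",
    "board": "domain",
    "domain": None,     # outermost region
}

def enclosing_regions(region):
    """Return the chain of regions that contain the given region."""
    chain = []
    r = regions[region]
    while r is not None:
        chain.append(r)
        r = regions[r]
    return chain

print(enclosing_regions("dram"))  # ['page', 'board', 'domain']
```

A diagnosis made at any level of the chain implicates every enclosing region, which is what allows responses such as retiring a single page rather than an entire board.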

Solaris FMA Functions

The system administrator can configure the Fault Management Architecture and display its status information using the following daemon and commands.

The fmd Daemon

The fmd daemon runs in the background on each system running the Solaris Operating System. The fmd daemon receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components. When appropriate, the fault manager also sends a message to the syslogd service to notify an administrator that a problem has been detected. The message directs administrators to a knowledge article on Sun's web site, http://www.sun.com/msg/, which explains more about the problem's impact and appropriate responses.
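Messages of this kind follow a structured, field-per-line format. The sample below is illustrative only; the message ID, event ID, platform, and other field values are hypothetical, not output from a real diagnosis:

```
SUNW-MSG-ID: EXAMPLE-8000-XX, TYPE: Fault, VER: 1, SEVERITY: Major
EVENT-TIME: Tue Dec  2 14:32:01 PST 2003
PLATFORM: example-platform, CSN: -, HOSTNAME: examplehost
SOURCE: example-diagnosis, REV: 1.0
EVENT-ID: 00000000-0000-0000-0000-000000000000
DESC: Refer to http://www.sun.com/msg/EXAMPLE-8000-XX for more information.
AUTO-RESPONSE: The faulty component has been disabled.
IMPACT: System performance may be degraded.
REC-ACTION: Schedule a repair procedure to replace the affected component.
```

The SUNW-MSG-ID field is the key an administrator uses to locate the corresponding knowledge article on the web site named above.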

/usr/lib/fm/fmd/fmd [-V] [-f file] [-o opt=val] [-R dir]

The fmadm Command

The fmadm utility can be used by administrators and service personnel to view and modify system configuration parameters maintained by the Solaris Fault Manager, fmd(1M).

/usr/sbin/fmadm [-q] [subcommand [arguments]]

The following are subcommands of the fmadm utility program:

  • fmadm config – display fault manager configuration
  • fmadm faulty [-ai] – display list of faulty resources
  • fmadm flush <fmri> ... – flush cached state for resource
  • fmadm load path – load specified fault manager module
  • fmadm repair fmri – record repair to resource
  • fmadm reset [-s serd] module – reset module or sub-component
  • fmadm rotate logname – rotate log file
  • fmadm unload module – unload specified fault manager module

The Fault Manager associates the following states with every resource for which telemetry information has been received:

  • ok – The resource is present and in use and Fault Manager detects no known problems.
  • unknown – The resource is not present or not usable but has no known problems. This might indicate the resource has been disabled or unconfigured by an administrator.
  • degraded – The resource is present and usable, but Fault Manager has diagnosed one or more problems in the resource.
  • faulted – The resource is present but is not usable because one or more problems have been diagnosed by the Fault Manager. The resource has been disabled to prevent further damage to the system.
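As a rough model of the states above, the sketch below shows how a resource might move between them as diagnoses and repairs are recorded. The event names and transition rules are simplified illustrations, not the fmd implementation:

```python
# A simplified model of the resource states maintained by the Fault
# Manager. The event names and transition rules here are illustrative.
STATES = ("ok", "unknown", "degraded", "faulted")

def next_state(state, event):
    if event == "diagnosed-usable":    # problem found, resource still usable
        return "degraded"
    if event == "diagnosed-disabled":  # problem found, resource disabled
        return "faulted"
    if event == "repaired":            # e.g. a repair recorded via fmadm
        return "ok"
    return state

s = "ok"
s = next_state(s, "diagnosed-usable")  # -> degraded
s = next_state(s, "repaired")          # -> ok
print(s)  # ok
```

The key distinction the states encode is usability: degraded resources remain in service despite a diagnosis, while faulted resources have been taken out of service to prevent further damage.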

The fmdump Command

The fmdump utility can be used to display the contents of any of the log files associated with the Solaris Fault Manager, fmd(1M). The Fault Manager runs in the background on each Solaris system.

/usr/sbin/fmdump [-efvV] [-c class] [-R dir] [-t time]
[-T time] [-u uuid] [file]

For example:

# fmdump -u uuid1 -u uuid2 -t 02Dec03

The -u option selects fault diagnosis events that exactly match the specified UUID. Each diagnosis is associated with a Universal Unique Identifier (UUID) for identification purposes.

In the above example, information on the fault diagnosis events matching uuid1 or uuid2, occurring on or after 02Dec03, is displayed.
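The selection logic of that example can be sketched as a filter: repeated -u options are combined with OR, and -t adds a lower time bound that is ANDed with the UUID match. The event records below are hypothetical, not real log contents:

```python
# Sketch of the fmdump selection semantics shown above: repeated -u
# options are ORed together, and -t adds a lower time bound.
# These event records are hypothetical, not real fmd log contents.
from datetime import datetime

events = [
    {"uuid": "uuid1", "time": datetime(2003, 12, 1)},  # too early
    {"uuid": "uuid1", "time": datetime(2003, 12, 3)},  # matches
    {"uuid": "uuid3", "time": datetime(2003, 12, 4)},  # wrong uuid
]

def select(events, uuids, after):
    """Keep events whose uuid is in the set and whose time is >= after."""
    return [e for e in events if e["uuid"] in uuids and e["time"] >= after]

matches = select(events, {"uuid1", "uuid2"}, datetime(2003, 12, 2))
print(len(matches))  # 1
```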

The fmstat Command

The fmstat utility can be used by administrators and service personnel to report statistics associated with the Solaris Fault Manager, fmd(1M) and its associated set of modules.

The Fault Manager runs in the background on each system running the Solaris Operating System. It receives telemetry information relating to problems detected by the system software, diagnoses these problems, and initiates proactive self-healing activities such as disabling faulty components.

fmstat [-asz] [-m module] [interval [count]]

For example:

# fmstat
module             ev_recv ev_acpt wait  svc_t    %w  %b  open solve  memsz  bufsz
cpumem-retire            0       0  0.0 10010.0    0   0     0     0      0      0
fmd-self-diagnosis     393       0  0.0   25.5     0   0     0     0      0      0
syslog-msgs              2       0  0.0 3337.2     0   0     0     0      0      0