NASA-HDBK-1002

- May 14, 2016 Create Date
- May 14, 2016 Last Updated
Fault Management Handbook
1. SCOPE
Fault Management (FM) is an engineering activity; it is the part of systems engineering (SE)
focused on the detection of faults and accommodation for off-nominal behavior of a system, as
well as a subsystem that has to be designed, developed, integrated, tested and operated. FM
encompasses functions that enable an operational system to prevent, detect, isolate, diagnose,
and respond to anomalous and failed conditions interfering with intended operations. From a
methodological perspective, FM includes processes to analyze, specify, design, verify, and
validate these functions. From a technological perspective, FM includes the hardware and
control elements, often embodied in software and procedures, of an operational system by which
the capability is realized and a situation awareness capability such as caution/warning functions
to notify operators and crew of anomalous conditions, hazards, and automated responses. The
goal of FM is the preservation of system assets, including crew, and of intended system
functionality (via design or active control) in the presence of predicted or existing failures.
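As a rough illustration of the functions listed above, the following Python sketch chains detection, isolation, and response for a hypothetical system, and records caution/warning messages for operators and crew. All names (`Monitor`, `FaultManager`, the telemetry channels, limits, components, and the two response types) are illustrative assumptions and are not taken from this handbook. Prevention and diagnosis are omitted, and a real FM design spans flight, ground, and operations elements rather than a single software loop.

```python
from dataclasses import dataclass, field
from enum import Enum


class Response(Enum):
    # Hypothetical automated responses (illustrative only).
    SWITCH_TO_BACKUP = "switch_to_backup"  # preserve function via a redundant unit
    SAFE_MODE = "safe_mode"                # preserve assets via a minimal safe configuration


@dataclass
class Monitor:
    """Hypothetical limit check on one telemetry channel."""
    channel: str
    low: float
    high: float
    component: str  # component implicated when this monitor trips


@dataclass
class FaultManager:
    monitors: list[Monitor]
    has_backup: dict[str, bool]  # component -> is a redundant unit available?
    alerts: list[str] = field(default_factory=list)

    def detect(self, telemetry: dict[str, float]) -> list[Monitor]:
        """Detection: return monitors whose reading is outside [low, high]."""
        return [m for m in self.monitors
                if not (m.low <= telemetry[m.channel] <= m.high)]

    def isolate(self, tripped: list[Monitor]) -> set[str]:
        """Isolation: map tripped monitors to the components they implicate."""
        return {m.component for m in tripped}

    def respond(self, suspects: set[str]) -> dict[str, Response]:
        """Response: switch to a backup if one exists, otherwise enter safe mode."""
        actions: dict[str, Response] = {}
        for comp in sorted(suspects):
            action = (Response.SWITCH_TO_BACKUP if self.has_backup.get(comp, False)
                      else Response.SAFE_MODE)
            actions[comp] = action
            # Caution/warning: tell operators about the anomaly and the automated response.
            self.alerts.append(f"WARNING: {comp} off-nominal; response={action.value}")
        return actions

    def step(self, telemetry: dict[str, float]) -> dict[str, Response]:
        """One pass of the detect -> isolate -> respond chain."""
        return self.respond(self.isolate(self.detect(telemetry)))


if __name__ == "__main__":
    fm = FaultManager(
        monitors=[
            Monitor("bus_voltage", 26.0, 34.0, "power_supply"),
            Monitor("wheel_speed", -6000.0, 6000.0, "reaction_wheel"),
        ],
        has_backup={"power_supply": False, "reaction_wheel": True},
    )
    print(fm.step({"bus_voltage": 28.1, "wheel_speed": 7200.0}))
    print(fm.alerts)
```

Keeping detection, isolation, and response as separate steps mirrors the functional breakdown in the text and lets each step be specified and verified on its own, although in practice these functions are often distributed across hardware, software, and operational procedures.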
FM demands a system-level perspective, as it is not merely a localized concern. A system’s
design is not complete until potential failures are addressed, and comprehensive FM relies on the
cooperative design and operation of separately deployed system elements (e.g., in the space
systems domain: flight, ground, and operations deployments) to achieve overall reliability,
availability, and safety objectives. Like all other system elements, FM is constrained by
programmatic and operational resources. Thus, FM practitioners are challenged to identify,
evaluate, and balance risks to these objectives against the cost of designing, developing,
validating, deploying, and operating additional FM functionality.
FM has emerged and developed along several paths in response to NASA’s mission needs (e.g.,
deep space vs. Earth orbiters vs. human spaceflight) as reflected by the different approaches used
in many organizations (e.g., JPL vs. GSFC vs. JSC), and by the ongoing activities to gain
community consensus on the nomenclature. In fact, the term “fault management” is in itself
something of a misnomer—the discipline of FM is concerned with failures in general and not
just faults (which are failure causes rooted within the system as described in section 4).
However, present use of the term “fault management” is synergistic with usage in the field of
network management, where the International Organization for Standardization (ISO) defines
FM as “the set of functions that detect, isolate, and correct malfunctions....” Likewise, the
above-stated goal of FM (i.e., preservation of system assets and intended system functionality in
the presence of failures) is consistent with the ISO-stated goal of having “a dependable/reliable
system in the context of faults.”
1.1 Relevance
FM provides a system’s response to off-nominal conditions, which is crucial to the successful
design, development, and operation of all critical systems (e.g., communications networks,
transportation systems, and power generation and distribution grids). However, the
architectures, processes, and technologies driving FM designs are sensitive to the needs and
nature of the development organization, the risk posture, the type of system under development,