7.23 - Software Fault Prevention and Tolerance

1. Software Fault Prevention and Tolerance

Mission- or safety-critical spaceflight systems should be developed both to reduce the likelihood of software faults pre-flight and to detect and mitigate the effects of software errors should they occur in-flight. New data is available that categorizes software errors from significant historic spaceflight software incidents, with implications and considerations for developing and designing software that both minimizes and tolerates the most likely software failures. 436

1.1 New Historical Data Compilation Summary

Previously unquantified in this manner, this data characterizes a set of 55 high-impact historic aerospace "software failure" incidents. 437 Key findings are that software is much more likely to fail by producing erroneous output than by failing silent, and that rebooting is largely ineffective at clearing these erroneous conditions.


                           Erroneous    Fail-Silent
Error Manifestations          85%           15%
Reboot Effectiveness           2%           38%


The origin of each error is categorized in order to focus specific development, test, and validation techniques on error prevention in each category. This new data focuses on manifestations of unexpected flight software behavior, independent of ultimate root cause. It is provided for consideration to improve software design, test, and operations for resilience to the most common software errors and to augment established processes for NASA software development.

Error Origin                 % of Total
Code / Logic                    58%
Configurable Data               16%
Unexpected Sensor Input         15%
Command/Operator Input          11%


Forty percent (40%) of software errors were due to the absence of code, which includes missing requirements or capabilities and the inability to handle unanticipated situations. Only 18% of these incidents fall within the computer science discipline itself, with no incidents related to the choice of platform or toolset.

Other Categories                % of Total (individually)
Absence of Code                    40%
Unknown-unknowns                   16%
Computer Science Discipline        18%

1.2 Implications and Considerations

These findings indicate that for software fault tolerance, primary consideration should be given to software behaving erroneously rather than going silent, especially at critical moments, and that reboot recoverability can be unreliable. Special care should be taken to validate configurable data and commands prior to each use.
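As one illustration of validating configurable data prior to each use, the minimal C sketch below re-checks a parameter table (an integrity check plus engineering range checks) every control cycle and falls back to safe defaults on failure. All names, limits, and the checksum routine are hypothetical placeholders, not taken from any project; a flight project would substitute its own coding standard, CRC, and fault-annunciation services.

    /* Minimal sketch, hypothetical names: re-validate configurable data
     * (integrity + range checks) each time it is used, not only at upload. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        double   control_gain;      /* expected range 0.0 .. 10.0 (assumed) */
        uint32_t telemetry_rate_hz; /* expected range 1 .. 100 (assumed)    */
        uint32_t checksum;          /* stored when the table was uploaded   */
    } ConfigTable;

    /* Simple placeholder checksum; a flight project would use its standard CRC. */
    static uint32_t checksum32(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;
        while (len--)
            sum = (sum << 1 | sum >> 31) ^ *p++;
        return sum;
    }

    static bool config_is_valid(const ConfigTable *cfg)
    {
        if (checksum32(cfg, offsetof(ConfigTable, checksum)) != cfg->checksum)
            return false;                       /* corrupted since upload   */
        if (cfg->control_gain < 0.0 || cfg->control_gain > 10.0)
            return false;                       /* out of engineering range */
        if (cfg->telemetry_rate_hz < 1 || cfg->telemetry_rate_hz > 100)
            return false;
        return true;
    }

    void control_cycle(const ConfigTable *cfg, const ConfigTable *safe_defaults)
    {
        const ConfigTable *active = cfg;
        if (!config_is_valid(cfg)) {
            /* Reject the data, annunciate the fault, fall back to safe values. */
            fprintf(stderr, "config table failed validation; using safe defaults\n");
            active = safe_defaults;
        }
        /* ... use 'active' to drive the control loop ... */
        (void)active;
    }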

“Test-like-you-fly” approaches, including sensor hardware-in-the-loop, combined with robust off-nominal testing, should be used to uncover missing logic arising from unanticipated situations. Some best practice strategies to emphasize pre-flight and during operations, based on this data, are shown below.

Software Error Prevention Strategies
  • Utilize a disciplined software engineering and assurance approach with applicable standards
    • NPR 7150.2, NASA Software Engineering Requirements   083
    • NASA Software Assurance and Software Safety Standard   278
  • Perform off-nominal scenario, fault, and input testing to expose missing code not covered by requirements alone, with multidisciplinary involvement
  • Employ logic for handling off-nominal sensor and data input, handling exceptions, and performing check-point restart (a sketch follows this list)
  • Validate mission data prior to each use
  • “Test like you fly” with hardware-in-the-loop, especially sensors, over expected mission durations if possible
  • Employ two-stage commanding with operator implication acknowledgement for critical commands
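As a sketch of the off-nominal input handling called out above, the hypothetical C fragment below range-checks each sensor sample, holds the last known-good value for a bounded number of cycles, and annunciates a fault rather than passing implausible data to downstream logic. The sensor, limits, and persistence count are illustrative assumptions only.

    /* Minimal sketch, hypothetical names: tolerate off-nominal sensor input by
     * range-checking each sample, substituting the last known-good value for a
     * bounded number of cycles, and declaring a fault if the condition persists. */
    #include <stdbool.h>
    #include <stdio.h>

    #define ALT_MIN_M        (-500.0)  /* plausible altimeter range (assumed) */
    #define ALT_MAX_M       (50000.0)
    #define MAX_BAD_SAMPLES        5   /* persistence limit before faulting   */

    typedef struct {
        double last_good_m;
        int    consecutive_bad;
        bool   faulted;
    } AltFilter;

    double filter_altitude(AltFilter *f, double raw_m, bool sensor_ok)
    {
        bool plausible = sensor_ok && raw_m >= ALT_MIN_M && raw_m <= ALT_MAX_M;

        if (plausible) {
            f->last_good_m = raw_m;
            f->consecutive_bad = 0;
            return raw_m;
        }

        /* Off-nominal input: hold the last good value for a bounded time,
         * then annunciate so the fault is visible rather than silent. */
        if (++f->consecutive_bad >= MAX_BAD_SAMPLES && !f->faulted) {
            f->faulted = true;
            fprintf(stderr, "altimeter input rejected %d cycles; annunciating fault\n",
                    f->consecutive_bad);
        }
        return f->last_good_m;
    }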

1.3 Best Practices for Safety-Critical Software Design

Although best efforts can be made prior to flight, software behavior reflects a model of real-world events that cannot be fully proven or predicted, and traditional system design usually employs only one primary flight software load, even if it is replicated on multiple strings. Just as avionics systems are designed to protect against radiation and mistrusted ("Byzantine") communication, safety-critical systems must be designed for resilience to erroneous software behavior. NASA Human-Rating requirements   024 call for in-flight mitigation of hazardous erroneous software behavior, detection and annunciation of critical software faults, manual override of automation, and at least single fault tolerance to software errors without the use of emergency systems. Each project/designer must evaluate these requirements against safety hazards and time-to-effect and then invoke appropriate automation fail-down strategies. Common mitigation techniques during flight are shown below.

In-Flight Software Error Detection and Mitigation Strategies
  • Provide crew/ground insight, control, and override
  • Employ independent monitoring of critical vehicle automation (a sketch follows this list)
    • Manual or automated detection, followed by response
  • Employ software backups (targeted to full) which are:
    •  Simple (compared to primary flight software)
    •  Dissimilar (especially in requirements and test)
  • Enter safe mode (reduced capability primary software subset)
    • Examples: restore power/communication, conserve fuel
  • Uplink new software and/or data (time permitting)
  • Design system to reduce/eliminate dependency on software
  • Reboot (often ineffective for logic/data errors)
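To illustrate independent monitoring with an automation fail-down, the simplified C sketch below checks critical vehicle state against independently derived limits and commands a reduced-capability safe mode when the primary automation appears erroneous or goes silent. The state variables, limits, and mode names are hypothetical and would be driven by each project's hazard analysis and time-to-effect.

    /* Minimal sketch, hypothetical names: an independent monitor, kept much
     * simpler than the primary flight software it observes, checks critical
     * state against simple bounds and commands a safe-mode fail-down. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        double attitude_rate_dps;  /* body rate reported by the primary software */
        double bus_voltage_v;
        bool   heartbeat_ok;       /* primary software alive and sequencing      */
    } VehicleState;

    typedef enum { MODE_PRIMARY, MODE_SAFE } FlightMode;

    static bool primary_looks_erroneous(const VehicleState *s)
    {
        /* Independently derived limits; values here are illustrative only. */
        if (!s->heartbeat_ok)            return true;  /* fail-silent case */
        if (s->attitude_rate_dps > 5.0)  return true;  /* erroneous output */
        if (s->bus_voltage_v < 24.0)     return true;  /* power at risk    */
        return false;
    }

    FlightMode monitor_step(const VehicleState *s, FlightMode current)
    {
        if (current == MODE_PRIMARY && primary_looks_erroneous(s)) {
            /* Reduced-capability subset: restore power/comm, conserve fuel,
             * and await crew/ground commanding. */
            fprintf(stderr, "monitor: commanding safe mode\n");
            return MODE_SAFE;
        }
        return current;
    }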

See also SWE-134 - Safety-Critical Software Design Requirements, SWE-023 - Software Safety-Critical Requirements, SWE-219 - Code Coverage for Safety Critical Software

1.4 Summary

Significant software failures have occurred steadily since software was first used in space. New data has characterized the behavior of these failures to better understand their manifestation patterns and origins. The strategies outlined here should be considered during vehicle design and throughout the software development and operations lifecycle to minimize the occurrence and impact of errant software behavior.

1.5 Terminology

  • Software Failure – Software behaving in an unexpected manner causing loss of life, injury, loss/end of mission, or significant close-call
  • Byzantine – Active, but possibly corrupted/untrusted communication

1.6 Additional Guidance

Links to Additional Guidance materials for this subject have been compiled in the Relevant Links table. See the Additional Guidance in the Resources tab.

2. Resources

2.1 References

2.2 Tools


Tools to aid in compliance with this SWE, if any, may be found in the Tools Library in the NASA Engineering Network (NEN). 

NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN. 

The list is informational only and does not represent an “approved tool list”, nor an endorsement of any particular tool. The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.

2.3 Additional Guidance

Additional guidance related to this requirement may be found in the following materials in this Handbook:

2.4 Center Process Asset Libraries

SPAN - Software Processes Across NASA
SPAN contains links to Center managed Process Asset Libraries. Consult these Process Asset Libraries (PALs) for Center-specific guidance including processes, forms, checklists, training, and templates related to Software Development. See SPAN in the Software Engineering Community of NEN. Available to NASA only. https://nen.nasa.gov/web/software/wiki  197

See the following link(s) in SPAN for process assets from contributing Centers (NASA Only). 

SPAN Links


