7.23 - Software Fault Prevention and Tolerance

1. Software Fault Prevention and Tolerance

Mission- or safety-critical spaceflight systems should be developed both to reduce the likelihood of software faults pre-flight and to detect and mitigate the effects of software errors should they occur in-flight. New data is available that categorizes software errors from significant historic spaceflight software incidents, with implications and considerations for developing and designing software that both minimizes and tolerates the most likely software failures. 436

1.1 New Historical Data Compilation Summary

Previously unquantified in this manner, this data characterizes a set of 55 high-impact historic aerospace "software failure" incidents. 437 Key findings are that software is much more likely to fail by producing erroneous output than by failing silent, and that rebooting is largely ineffective at clearing these erroneous conditions.


                           Erroneous    Fail-Silent
Error Manifestations          85%           15%
Reboot Effectiveness           2%           38%


The origin of each error is categorized in order to focus specific development, test, and validation techniques on error prevention in each category. This new data focuses on manifestations of unexpected flight software behavior, independent of ultimate root cause. It is provided for consideration to improve software design, test, and operations for resilience to the most common software errors and to augment established processes for NASA software development.

Error Origin                 % of Total
Code / Logic                    58%
Configurable Data               16%
Unexpected Sensor Input         15%
Command/Operator Input          11%


Forty percent (40%) of software errors were due to the absence of code, which includes missing requirements or capabilities and the inability to handle unanticipated situations. Only 18% of these incidents fall within the computer science discipline itself, with no incidents related to the choice of platform or toolset.

Other Categories                % of Total (individually)
Absence of Code                    40%
Unknown-unknowns                   16%
Computer Science Discipline        18%

1.2 Implications and Considerations

These findings indicate that for software fault tolerance, primary consideration should be given to software behaving erroneously rather than going silent, especially at critical moments, and that reboot recoverability can be unreliable. Special care should be taken to validate configurable data and commands prior to each use.
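As one illustration of validating configurable data prior to each use, the minimal C sketch below re-checks a parameter table (an integrity check plus engineering range checks) every control cycle and falls back to safe defaults on failure. All names, limits, and the checksum routine are hypothetical placeholders, not taken from any project; a flight project would substitute its own coding standard, CRC, and fault-annunciation services.

    /* Minimal sketch, hypothetical names: re-validate configurable data
     * (integrity + range checks) each time it is used, not only at upload. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct {
        double   control_gain;      /* expected range 0.0 .. 10.0 (assumed) */
        uint32_t telemetry_rate_hz; /* expected range 1 .. 100 (assumed)    */
        uint32_t checksum;          /* stored when the table was uploaded   */
    } ConfigTable;

    /* Simple placeholder checksum; a flight project would use its standard CRC. */
    static uint32_t checksum32(const void *data, size_t len)
    {
        const uint8_t *p = data;
        uint32_t sum = 0;
        while (len--)
            sum = (sum << 1 | sum >> 31) ^ *p++;
        return sum;
    }

    static bool config_is_valid(const ConfigTable *cfg)
    {
        if (checksum32(cfg, offsetof(ConfigTable, checksum)) != cfg->checksum)
            return false;                       /* corrupted since upload   */
        if (cfg->control_gain < 0.0 || cfg->control_gain > 10.0)
            return false;                       /* out of engineering range */
        if (cfg->telemetry_rate_hz < 1 || cfg->telemetry_rate_hz > 100)
            return false;
        return true;
    }

    void control_cycle(const ConfigTable *cfg, const ConfigTable *safe_defaults)
    {
        const ConfigTable *active = cfg;
        if (!config_is_valid(cfg)) {
            /* Reject the data, annunciate the fault, fall back to safe values. */
            fprintf(stderr, "config table failed validation; using safe defaults\n");
            active = safe_defaults;
        }
        /* ... use 'active' to drive the control loop ... */
        (void)active;
    }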

“Test-like-you-fly” approaches, including sensor hardware-in-the-loop, combined with robust off-nominal testing, should be used to uncover missing logic arising from unanticipated situations. Some best practice strategies to emphasize pre-flight and during operations, based on this data, are shown below.

Software Error Prevention Strategies
  • Utilize a disciplined software engineering and assurance approach with applicable standards
    • NPR 7150.2, NASA Software Engineering Requirements   083
    • NASA Software Assurance and Software Safety Standard   278
  • Perform off-nominal scenario, fault, and input testing to expose missing code not covered by requirements alone, with multidisciplinary involvement
  • Employ logic for handling off-nominal sensor and data input, handling exceptions, and performing check-point restart (a sketch follows this list)
  • Validate mission data prior to each use
  • “Test like you fly” with hardware-in-the-loop, especially sensors, over expected mission durations if possible
  • Employ two-stage commanding with operator implication acknowledgement for critical commands
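As a sketch of the off-nominal input handling called out above, the hypothetical C fragment below range-checks each sensor sample, holds the last known-good value for a bounded number of cycles, and annunciates a fault rather than passing implausible data to downstream logic. The sensor, limits, and persistence count are illustrative assumptions only.

    /* Minimal sketch, hypothetical names: tolerate off-nominal sensor input by
     * range-checking each sample, substituting the last known-good value for a
     * bounded number of cycles, and declaring a fault if the condition persists. */
    #include <stdbool.h>
    #include <stdio.h>

    #define ALT_MIN_M        (-500.0)  /* plausible altimeter range (assumed) */
    #define ALT_MAX_M       (50000.0)
    #define MAX_BAD_SAMPLES        5   /* persistence limit before faulting   */

    typedef struct {
        double last_good_m;
        int    consecutive_bad;
        bool   faulted;
    } AltFilter;

    double filter_altitude(AltFilter *f, double raw_m, bool sensor_ok)
    {
        bool plausible = sensor_ok && raw_m >= ALT_MIN_M && raw_m <= ALT_MAX_M;

        if (plausible) {
            f->last_good_m = raw_m;
            f->consecutive_bad = 0;
            return raw_m;
        }

        /* Off-nominal input: hold the last good value for a bounded time,
         * then annunciate so the fault is visible rather than silent. */
        if (++f->consecutive_bad >= MAX_BAD_SAMPLES && !f->faulted) {
            f->faulted = true;
            fprintf(stderr, "altimeter input rejected %d cycles; annunciating fault\n",
                    f->consecutive_bad);
        }
        return f->last_good_m;
    }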

1.3 Best Practices for Safety-Critical Software Design

Although best efforts can be made prior to flight, software behavior reflects a model of real-world events that cannot be fully proven or predicted, and traditional system design usually employs only one primary flight software load, even if it is replicated on multiple strings. Just as avionics systems are designed to protect against radiation and mistrusted ("Byzantine") communication, safety-critical systems must be designed for resilience to erroneous software behavior. NASA Human-Rating requirements   024 call for in-flight mitigation of hazardous erroneous software behavior, detection and annunciation of critical software faults, manual override of automation, and at least single fault tolerance to software errors without the use of emergency systems. Each project/designer must evaluate these requirements against safety hazards and time-to-effect and then invoke appropriate automation fail-down strategies. Common mitigation techniques during flight are shown below.

In-Flight Software Error Detection and Mitigation Strategies
  • Provide crew/ground insight, control, and override
  • Employ independent monitoring of critical vehicle automation (a sketch follows this list)
    • Manual or automated detection, followed by response
  • Employ software backups (targeted to full) which are:
    •  Simple (compared to primary flight software)
    •  Dissimilar (especially in requirements and test)
  • Enter safe mode (reduced capability primary software subset)
    • Examples: restore power/communication, conserve fuel
  • Uplink new software and/or data (time permitting)
  • Design system to reduce/eliminate dependency on software
  • Reboot (often ineffective for logic/data errors)
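To illustrate independent monitoring with an automation fail-down, the simplified C sketch below checks critical vehicle state against independently derived limits and commands a reduced-capability safe mode when the primary automation appears erroneous or goes silent. The state variables, limits, and mode names are hypothetical and would be driven by each project's hazard analysis and time-to-effect.

    /* Minimal sketch, hypothetical names: an independent monitor, kept much
     * simpler than the primary flight software it observes, checks critical
     * state against simple bounds and commands a safe-mode fail-down. */
    #include <stdbool.h>
    #include <stdio.h>

    typedef struct {
        double attitude_rate_dps;  /* body rate reported by the primary software */
        double bus_voltage_v;
        bool   heartbeat_ok;       /* primary software alive and sequencing      */
    } VehicleState;

    typedef enum { MODE_PRIMARY, MODE_SAFE } FlightMode;

    static bool primary_looks_erroneous(const VehicleState *s)
    {
        /* Independently derived limits; values here are illustrative only. */
        if (!s->heartbeat_ok)            return true;  /* fail-silent case */
        if (s->attitude_rate_dps > 5.0)  return true;  /* erroneous output */
        if (s->bus_voltage_v < 24.0)     return true;  /* power at risk    */
        return false;
    }

    FlightMode monitor_step(const VehicleState *s, FlightMode current)
    {
        if (current == MODE_PRIMARY && primary_looks_erroneous(s)) {
            /* Reduced-capability subset: restore power/comm, conserve fuel,
             * and await crew/ground commanding. */
            fprintf(stderr, "monitor: commanding safe mode\n");
            return MODE_SAFE;
        }
        return current;
    }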

See also SWE-134 - Safety-Critical Software Design Requirements, SWE-023 - Software Safety-Critical Requirements, SWE-219 - Code Coverage for Safety Critical Software

1.4 Summary

Significant software failures have occurred steadily since software was first used in space. New data has characterized the behavior of these failures to better understand their manifestation patterns and origins. The strategies outlined here should be considered during vehicle design and throughout the software development and operations lifecycle to minimize the occurrence and impact of errant software behavior.

1.5 Terminology

  • Software Failure – Software behaving in an unexpected manner causing loss of life, injury, loss/end of mission, or significant close-call
  • Byzantine – Active, but possibly corrupted/untrusted communication

1.6 Additional Guidance

Links to Additional Guidance materials for this subject have been compiled in the Relevant Links table. See the Additional Guidance in the Resources tab.

2. Resources

2.1 References

2.2 Tools


Tools to aid in compliance with this SWE, if any, may be found in the Tools Library in the NASA Engineering Network (NEN). 

NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN. 

The list is informational only and does not represent an “approved tool list”, nor an endorsement of any particular tool. The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.

2.3 Additional Guidance

Additional guidance related to this requirement may be found in the following materials in this Handbook:

2.4 Center Process Asset Libraries

SPAN - Software Processes Across NASA
SPAN contains links to Center managed Process Asset Libraries. Consult these Process Asset Libraries (PALs) for Center-specific guidance including processes, forms, checklists, training, and templates related to Software Development. See SPAN in the Software Engineering Community of NEN. Available to NASA only. https://nen.nasa.gov/web/software/wiki  197

See the following link(s) in SPAN for process assets from contributing Centers (NASA Only). 

SPAN Links


