


1. Requirements
4.3.1 The space system shall provide at least single failure tolerance to catastrophic events, with specific levels of failure tolerance and implementation (similar or dissimilar redundancy) derived via an integration of the design and safety analysis (required by NPR 8705.2). 024
1.1 Notes
Redundancy alone does not meet the intent of this requirement. When a critical system fails because of improper or unexpected performance due to unanticipated conditions, similar redundancy can be ineffective at preventing the complete loss of the system. Dissimilar redundancy can be very effective provided there is sufficient separation among the redundant legs. (For example, dissimilar redundancy where the power for all redundant capability was routed through a common conduit would not survive a failure where the conduit was severed). It is also highly desirable that the spaceflight system performance degrades in a predictable fashion to allow sufficient time for failure detection and, when possible, system recovery even when experiencing multiple failures.
There are examples of dissimilar redundancy in current systems. For Earth reentry, the Soyuz spacecraft has a dissimilar backup ballistic entry mode to protect for loss of the primary attitude control system and a backup parachute for landing. Other examples include backup batteries for critical systems that protect from loss of the primary electrical system and the use of pressure suits during reentry to protect for loss of cabin pressure.
Ultimately, the program and Technical Authorities evaluate and agree on the failure scenarios/modes and determine the appropriate level of failure tolerance and the practicality of using dissimilar redundancy or backup systems to protect for common cause failures.
Where failure tolerance is not the appropriate approach to control hazards, specific measures need to be employed to:
- Identify applicable hazards and their associated controls;
- Ensure the robustness of the design; and
- Ensure adequate attention/focus is being applied to the design, manufacture, test, analysis, and inspection of the items.
In the area of design, in addition to the application of specifically approved standards and specifications, these measures can include the identification of specific design features that minimize the probability of occurrence of failure modes, such as the application of stringent factors of safety or other design margins.
For manufacturers, these measures can include establishing special process controls and documentation, special handling, and highlighting the importance of the item for those involved in the manufacturing process.
For testing, this can include accelerated life testing, fleet leader testing program, testing to understand failure modes, or other testing to establish additional confidence and margin in the design.
For analysis (in lieu of tests), these measures can include correlation with the testing representative of the actual configuration and the collection, management, and analysis of data used in trending failures, verifying loss of crew requirements, and evaluating flight anomalies.
For inspection, these measures can include the identification of specific inspection criteria to be applied to the item or the application of Government Mandatory Inspection Points or similar audits for important characteristics of the item.
This approach to hazard control takes advantage of existing standards or standards approved by the Technical Authorities to control hazards associated with the physical properties of the hardware and is typically controlled via the application of margin to the environments experienced by the design or system properties affected by the environment. Acceptance of these approaches by the Technical Authorities avoids processing waivers for numerous hazard causes where failure tolerance is not the appropriate approach. This includes, but is not limited to, Electro-Magnetic Interference, Ionizing Radiation, Micrometeoroid Orbital Debris, structural failure, pressure vessel failure, and aerothermal shell shape for flight.
1.2 History
1.3 Applicability Across Classes
Class A B C D E F Applicable?
Key: - Applicable |
- Not Applicable
2. Rationale
The objective is to arrive at the safest practical design to accomplish a mission. Since space system development will always have mass, volume, schedule, and cost constraints, choosing where and how to apply failure tolerance requires integrated analyses at the system level to assess safety and mission risks, guided by a commonly understood level of risk tolerance at the system and local (individual hazard) levels.
3. Guidance
Failure tolerance is applied first and foremost at the overall system level and covers all of the system's capabilities, including automation and software. While failure tolerance is frequently used to describe minimum acceptable redundancy, it may also be used to describe two similar systems, dissimilar systems, cross-strapping, or functional interrelationships that ensure minimally acceptable system performance despite failures, or additional features designed to mitigate the effects of failures. When assessing failure tolerance at the integrated system level, the increased complexity and the additional system resources that it requires must be considered. For software systems, failure tolerance must cover software that performs erroneously as well as software that ceases to perform during critical situations. More detailed considerations and strategies for software failure tolerance are summarized in NESC Technical Bulletin 23-06: Considerations for Software Fault Prevention and Tolerance 687.
Failure of primary structure, structural failure of pressure vessel walls, and structural failure of pressurized lines are exempted from the failure tolerance requirement provided the potentially catastrophic failures are controlled through a defined process in which approved standards and margins are implemented that account for the absence of failure tolerance.
Other potentially catastrophic hazards that cannot be controlled using failure tolerance are exempted from the failure tolerance requirements with mandatory concurrence (as required by NPR 8705.2) from the Technical Authorities and the Director, JSC (for crew risk acceptance) provided the hazards are controlled through a defined process in which approved standards and margins are implemented that account for the absence of failure tolerance.
The levied requirements above are applied, and assessed, against the entire system and do not specifically call out hardware or software. There must not be any single point of failure that could lead to a catastrophic hazard (aside from the exemptions listed in the requirements). The difference between software and hardware is moot as they both are held to the same requirement.
In typical spacecraft designs, the capabilities needed to perform critical functions are predominantly implemented in flight software. Software malfunctions that are the result of hardware faults can usually be isolated to a specific device and are detected by such means as self-checking logic on multiple processors and output voting used with redundant computer sets. The most insidious failure mode is the result of a software common mode failure in which every instance of the same software image executing on multiple devices encounters the same unexpected condition simultaneously (task overrun, stack overflow, divide by zero, etc.) or in a cascading sequence, adversely affecting the proper operation of the software and preventing redundancy management schemes from isolating the problem to a specific device. In a worst-case scenario, the only option for restoring complete functionality might be to restart one or more flight computers, which, depending upon the mission phase in which the fault is encountered, may take longer than the time to effect for catastrophic hazards. Even with rigorous software development processes and an extensive verification and validation campaign, past missions have shown that software-related failures can and do occur.
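The difference between a hardware fault and a software common mode failure can be illustrated with a toy output voter. This is a minimal sketch, not a flight design: the names `majority_vote`, `correct_law`, and `defective_law` are hypothetical, and the "control law" is an arbitrary placeholder.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def majority_vote(outputs: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the output shared by a strict majority of channels, or None."""
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if 2 * count > len(outputs) else None


def correct_law(x: int) -> int:
    """Hypothetical control law behaving as specified."""
    return 2 * x


def defective_law(x: int) -> int:
    """The same law with a latent defect present in the shared software image."""
    return 2 * x + 100


# One channel returns garbage because of a hardware fault: the voter masks it.
hardware_fault = majority_vote([correct_law(5), correct_law(5), -1])

# All three channels run the same defective image: the wrong answer is unanimous,
# so the voter has nothing to isolate.
common_mode = majority_vote([defective_law(5)] * 3)

print(hardware_fault, common_mode)
```

In the first case the voter recovers the expected value. In the second case it passes the erroneous value through, which is why similar redundancy by itself does not control this failure mode.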
See Topic 7.24 - Human Rated Software Requirements for other Software Requirements related to Human Rated Software.
3.1 Control/Mitigation Strategies
To address potential software common mode failures, Table 1 lists the available control/mitigation strategies in order of effectiveness. These strategies do not displace the software development and software safety requirements, which are mandatory to minimize the risk of software common mode failures. A control/mitigation strategy is selected for each catastrophic hazard that can be caused by a software common mode failure. For example, a system that lacks redundancy but can recover from a software common mode failure prior to the time to effect is considered failure tolerant.
Table 1: Control/Mitigation Strategies
Software Common Mode Failure Hazard Control Strategy | Failure Tolerant Requirement Compliance |
---|---|
System Failure Tolerance | Yes |
Recover/Repair | Yes |
Accept Risk | NCR/Variance Required |
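The selection logic implied by Table 1 can be sketched as a small decision function. This is an illustrative reading of the table only: the function name, its parameters, and the idea of reducing the decision to two inputs are assumptions, and the actual determination is made through the hazard analysis and by the Technical Authorities.

```python
from enum import Enum
from typing import Optional


class Strategy(Enum):
    SYSTEM_FAILURE_TOLERANCE = "System Failure Tolerance"
    RECOVER_REPAIR = "Recover/Repair"
    ACCEPT_RISK = "Accept Risk"


# Compliance column of Table 1.
COMPLIANCE = {
    Strategy.SYSTEM_FAILURE_TOLERANCE: "Yes",
    Strategy.RECOVER_REPAIR: "Yes",
    Strategy.ACCEPT_RISK: "NCR/Variance Required",
}


def select_strategy(
    has_independent_control: bool,
    recovery_time_s: Optional[float],
    time_to_effect_s: float,
) -> Strategy:
    """Pick the most effective strategy available for one catastrophic hazard.

    has_independent_control: the system design fully controls the hazard even
        if the software common mode failure occurs (hypothetical input).
    recovery_time_s: time needed to recover or repair, or None if no recovery
        path exists (hypothetical input).
    """
    if has_independent_control:
        return Strategy.SYSTEM_FAILURE_TOLERANCE
    if recovery_time_s is not None and recovery_time_s < time_to_effect_s:
        return Strategy.RECOVER_REPAIR
    return Strategy.ACCEPT_RISK


# A system with no redundancy that reboots in 2 s against a 10 s time to effect.
choice = select_strategy(False, recovery_time_s=2.0, time_to_effect_s=10.0)
print(choice.value, "->", COMPLIANCE[choice])
```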
3.2 Minimize Risk
It is mandatory to minimize the risk of software common mode failures by software development processes and software/avionics designs that are robust and include considerations for safety critical software. Meeting the requirements will ensure the software development activities meet industry best practices, will ensure software defects are minimized, and provide the primary control for software common mode failures.
This activity includes extending the hazard analysis process to assess the sensitivity of the specific avionics and software architecture to software common mode failures within the operational scenarios and mission phases defined in the concept of operations. Software common mode failures are credible, and their impact on a potential hazard must be documented as part of the corresponding hazard report(s). This analysis must also identify the time to criticality of each applicable hazard to determine whether the software can be recovered or repaired before a catastrophic hazard is realized. The hazard analysis performed on the software and avionics must be reflected in hazard reports. If the entire analysis is too detailed or complex to be reflected within a hazard report, it is acceptable to supplement the hazard report content with additional analysis that verifies the “software will work as designed” control of the software common mode failure cause(s); that analysis is delivered as verification evidence.
The hazard analysis must identify and control potential software failures. This activity has a direct impact on the overall safety and reliability of the system. Design For Minimum Risk (DFMR) is not recognized as an acceptable hazard control strategy, but it is acknowledged that, when properly implemented, it can provide an acceptable Variance/Non-Compliance Report rationale. Some examples of these design controls/mitigations are a robust multilayered fault management architecture, minimized fault containment regions, and distributed architectures. The applicable design controls and mitigations should also be documented in the respective hazard report(s) along with appropriate verifications.
3.3 System Failure Tolerance
- What
  - A system design that fully controls a hazard should a software common mode failure occur
- When to use
  - Whenever possible, this is the preferred strategy. Providing a fully functional redundant set of capabilities to avoid a catastrophic hazard (which may include dissimilar software) is viewed as an equivalent risk and meets the intent of failure tolerance requirements.
- How to Implement
  - Incorporate functional redundancy within the system to completely control the respective catastrophic hazard
  - Document as control(s) in respective hazard report(s) with appropriate verifications
- Examples:
  - Implementation of a subset of critical vehicle system functions by dissimilar software, hardware, or combination of hardware and software
  - Isolate and distribute software functions through such technologies as distributed network based architectures or virtual partitions (e.g. ARINC 653: Avionics Application Standard Software Interface).
  - Provide capabilities for the vehicle to avoid catastrophic hazards by being able to safely operate in a limited or degraded state in the event of a software common mode failure
  - Provide capabilities for the crew to bypass software functions and manually control flight systems/effectors directly upon the occurrence of a software failure.
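As a rough illustration of the dissimilar-software and degraded-state examples above, the sketch below falls back to a simpler, independently written backup law when the primary law raises an arithmetic error or produces an implausible output. Everything here is hypothetical: `primary_guidance`, `backup_guidance`, the plausibility limit, and the numbers are placeholders, and in a real system the selector itself would need to be analyzed as a potential common failure point.

```python
import math
from typing import Tuple


def primary_guidance(altitude_m: float) -> float:
    """Hypothetical primary law with a latent defect at zero altitude."""
    return 100.0 / altitude_m  # raises ZeroDivisionError when altitude_m == 0


def backup_guidance(altitude_m: float) -> float:
    """Hypothetical dissimilar backup: simpler logic, no code shared with primary."""
    return 5.0 if altitude_m < 1000.0 else 1.0


def command(altitude_m: float, limit: float = 50.0) -> Tuple[float, str]:
    """Use the primary output when it is plausible, otherwise the backup."""
    try:
        value = primary_guidance(altitude_m)
        if math.isfinite(value) and abs(value) <= limit:
            return value, "primary"
    except ArithmeticError:
        pass  # primary failed outright; fall through to the backup
    return backup_guidance(altitude_m), "backup"


for altitude in (100.0, 1.0, 0.0):
    print(altitude, command(altitude))
```

The backup gives reduced performance but keeps producing a bounded command, which is the intent of operating safely in a limited or degraded state.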
3.4 Recover/Repair
- What
  - Process and design mitigation to recover and/or repair the software and system state after a software common mode failure occurs and before a catastrophic hazard is realized
- When to use
  - When the time to criticality before a catastrophic hazard occurs is sufficient to recover the software
- How to Implement
  - The analysis performed as part of the risk reduction activities includes the time to criticality determination.
  - Document recovery/repair plan and supporting design features if applicable in the respective hazard report(s) with appropriate verifications
- Examples:
  - Reboot and reinitialize; system watchdog function
  - Runtime software audits that repair damaged data structures
  - ‘Fault-down/safe mode’ embedded software
  - Passive aborts
  - Backup systems (based on separate requirements, no shared code)
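The watchdog and reboot examples above can be sketched with a simulated clock. This is a simplified model under stated assumptions: the `Watchdog` class, the 100 ms task period, and the 300 ms timeout are all hypothetical, the reset is treated as instantaneous, and a real recovery time would also include reboot and reinitialization.

```python
from typing import List, Optional


class Watchdog:
    """Expires when the monitored task has not kicked it within timeout_ms."""

    def __init__(self, timeout_ms: int, now_ms: int = 0) -> None:
        self.timeout_ms = timeout_ms
        self.last_kick_ms = now_ms

    def kick(self, now_ms: int) -> None:
        self.last_kick_ms = now_ms

    def expired(self, now_ms: int) -> bool:
        return now_ms - self.last_kick_ms > self.timeout_ms


def simulate(hang_at_ms: Optional[int], end_ms: int,
             period_ms: int = 100, timeout_ms: int = 300) -> List[int]:
    """Return the times at which the watchdog forces a reset."""
    dog = Watchdog(timeout_ms)
    hung = False
    resets: List[int] = []
    for now in range(0, end_ms + 1, period_ms):
        if now == hang_at_ms:
            hung = True  # the task stops running (e.g., stuck in a loop)
        if not hung:
            dog.kick(now)
        elif dog.expired(now):
            resets.append(now)  # reboot and reinitialize
            hung = False
            dog.kick(now)
    return resets


resets = simulate(hang_at_ms=500, end_ms=1500)
detection_latency_ms = resets[0] - 500
print(resets, detection_latency_ms)
```

The detection latency measured this way is one input to the comparison against the time to criticality for the hazard in question.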
3.5 Accept Risk (NCR/Variance)
- What
  - Submit a Non-Compliance Report (NCR) and a program variance to characterize the remaining risk and to provide flight rationale
- When to use
  - If it is unreasonable to implement failure tolerance at the system level
  - If, as part of the Hazard Analysis processes, it is determined that there are flight phases in which hazard controls and mitigations do not provide sufficient time to prevent a catastrophic event (‘blackout zones’ due to time to effect)
- How to Implement
  - Leverage the process and design controls implemented as part of the Minimize Risk activity as a risk acceptance rationale
  - NCR and Variance content must be directed at specific hazard(s). A general blanket NCR/variance will not be approved.
  - The NCR and Variance should mature in concert with the development of Hazard Report Phases I, II, and III.
    - Phase 0/I: Identify the need for an NCR and Variance (blackout zone analysis)
    - Phase II: Submission and approval of NCR(s) and Variance(s)
    - Phase III: Provide risk acceptance verification evidence in the hazard report(s)
For the Safety Review Panel (SRP)/Program Safety Technical Review Board (STRB) to recommend approval of an NCR/Variance, items listed below should be considered by the Provider. Each of these items includes aspects to be enforced throughout the entire software life cycle.
- Rigorously enforced coding standards, as demonstrated by objective evidence gathered by Software Quality Assurance (SQA), with the following characteristics: industry recognized, verifiable by static analysis, Software Safety approved, mandatory inspection points, and code audits for legacy software. The coding standards should be supplemented with the appropriate Computer-Based Control Systems (CBCS) requirements called out in SSP 50808 and NASA-NPR-7150.2, including recovery, safe termination/initialization, and input validation.
- Utilization of static code analysis tools to verify compliance with coding standards. Utilization of code path analysis that reports the results for critical phases (blackout zones) and the analysis configuration. Code paths that are not covered should be identified and explained. Assessments of additional testing strategies that can be applied to blackout zones. Examples include Parameter Value Coverage (PVC), mission duration testing, stress, Monte Carlo, and performance end-to-end testing.
- Implement a clearly defined software severity classification system for all software non-conformances that could impact flight safety (including tools and applicable ground systems). At a minimum, classes should include loss of life or loss of vehicle, mission success, visible to the user with operational workarounds, and an ‘other’ class that does not meet the previous criteria (e.g., Severity 1 through 4).
- Providers should clearly define their ‘safety-critical’ software (per NASA-STD-8739.8 278) and are strongly encouraged to list the rationale for their classification (e.g., tied to a particular hazard report, processes hazardous commands).
- Changes, including high-severity non-conformances that have the potential to impact flight-critical software, should require formal board approval led by Provider Management with representatives from Software Development, Test, Requirements, Software Safety, Operations/Crew, IV&V, Subsystem Owner, and NASA Organizations.
- Mandatory assessments of vendor-reported defects for all Commercial-Off-The-Shelf (COTS)/Modified-Off-The-Shelf (MOTS) based software that have a direct impact on Flight Critical Software. This includes Operating Systems, run-time systems, flight-critical device drivers/firmware, code generators, compilers, math libraries, IT security vulnerabilities, and build and Configuration Management (CM) tools. Performed pre-flight, with mandatory code audits for critical defects.
- Mandatory process assessments for all high-severity safety-critical software issues (closed loop process). For high-severity process escapes, identify the earliest process step where the escape could have been identified and all subsequent steps, including steps beyond where the escape was discovered, and determine what process step update(s) are needed to preclude such an escape at that step again (Implementation of SSP 50808 Section 3.3.11.3.3 – NONCONFORMANCE/PROBLEM REPORTING AND CORRECTIVE ACTION). The rigorous process with corrective action feedback needs to be in place at initial code development, with evidence of effective process changes no later than the required SSP 50808 NASA software quality assurance audit. In-progress development requires a legacy code audit by the provider.
- Mandatory code audit assessments for all high-severity software issues; determine where similar escapes could have occurred, define audit criteria, perform audit and share results with NASA.
- Collect metrics with objective evidence demonstrating that escapes are being caught earlier in the development process and that the process modifications are effective.
- IV&V provider must assess their own escapes to determine similar process changes.
- Independent Moderator role: Mandatory for all critical code reviews and requirements change process, separate reporting chain (can be an internal SQA person).
- Implement clearly defined quality and performance software metrics.
  - Metrics should provide management with an assessment of the quality of the software development processes (e.g., whether the trend is improving (fewer errors), worsening, or about the same). Metrics should also take into account the severity of the issues.
  - Performance metrics should include processor and memory utilization, as well as peak and margin measurements.
- The provider reports, as a recurring flight-by-flight verification, metrics supporting compliance with the processes established in the NCR. For identified deviations or escapes from these processes, the provider shall provide a corrective action plan that will prevent the recurrence of the identified process failure.
- For software issues discovered after verification testing has been initiated, or on a released flight software product, completion of the analysis and flight rationale is required before the flight readiness review milestone. Issues after this point should be reported and worked as part of the mission operations functions, including on-orbit attached phases.
- Define and provide a design strategy to assure control for the occurrence of software common mode failures. The implementation needs to be able to handle transient errors, preclude the manifestation of cascading failures in the case of non-transient errors, and ensure that software can be recovered regardless of the faulted state. Strategies should address items such as task restart and exception handling on a flight phase basis. This should be addressed as part of process implementation (Implementation of SSP 50808 requirement 3.3.11.1.1.7.1.7 – RECOVER FOR CBCS ANOMALY).
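The performance metrics item in the list above asks for utilization with peak and margin measurements. A minimal sketch of that arithmetic follows, assuming utilization samples and capacity are expressed in the same unit; the function name and the sample values are hypothetical, and margin is taken here to mean the capacity left over at peak load.

```python
from typing import Dict, Sequence


def utilization_metrics(samples: Sequence[float], capacity: float) -> Dict[str, float]:
    """Summarize resource usage as percentages of capacity."""
    if not samples or capacity <= 0:
        raise ValueError("need at least one sample and a positive capacity")
    peak = max(samples)
    mean = sum(samples) / len(samples)
    return {
        "mean_pct": 100.0 * mean / capacity,
        "peak_pct": 100.0 * peak / capacity,
        # Margin is measured against the peak, not the mean.
        "margin_pct": 100.0 * (capacity - peak) / capacity,
    }


# Hypothetical CPU time per 100 ms frame, in milliseconds.
cpu = utilization_metrics([40.0, 55.0, 70.0], capacity=100.0)
print(cpu)
```

Reporting margin against the peak rather than the mean is the more conservative choice for conditions such as task overrun, which tend to occur at peak load.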
The level and type of redundancy (similar or dissimilar) is an important and often controversial aspect of system design. Redundancy alone does not make a system safe. It is the responsibility of the engineering and safety teams to determine when redundancy must be included and how best to implement it, or to describe fully whether a single-string design approach is sufficient, and to communicate the associated risks. The resulting design should optimize safety given the mission requirements and constraints. For software/automation, redundancy strategies can include human involvement and automated or manual backup strategies. See References for more details.
Redundancy alone does not meet the intent of this requirement. When a critical system fails because of improper or unexpected performance due to unanticipated conditions, similar redundancy can be ineffective at preventing the complete loss of the system. Dissimilar redundancy can be very effective provided there is sufficient separation among the redundant legs. For example, dissimilar redundancy in which a crew member overrides automated control with manual control, but the manual path uses the same software that is behaving erroneously, or a portion of it, may not survive the failure because the two approaches share a common failure point. Areas in the system architecture should be analyzed for this commonality. It is also highly desirable that the spaceflight system performance degrades predictably to allow sufficient time for failure detection and, when possible, system recovery even when experiencing multiple failures.
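The commonality analysis described above can be approximated, at its simplest, by intersecting the dependency sets of the redundant legs. The sketch below uses made-up component names for the manual-override example; a real analysis would draw on the actual architecture and would also consider indirect dependencies.

```python
from typing import Dict, Set


def common_dependencies(legs: Dict[str, Set[str]]) -> Set[str]:
    """Return the components that every redundant leg depends on."""
    dependency_sets = list(legs.values())
    if not dependency_sets:
        return set()
    return set.intersection(*dependency_sets)


# Hypothetical dependency lists for an automated leg and a manual override leg.
legs = {
    "automated_control": {"flight_software_image", "imu", "power_bus_a"},
    "manual_override": {"hand_controller", "flight_software_image", "power_bus_b"},
}

shared = common_dependencies(legs)
print(shared)
```

Any component in the result is a candidate common failure point: the legs are dissimilar at the top level but not separated with respect to that component.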
There are examples of dissimilar redundancy in current systems. For Earth reentry, the Soyuz spacecraft has a dissimilar backup ballistic entry mode to protect from loss of the primary attitude control system and a backup parachute for landing. Other examples include backup batteries for critical systems that protect from loss of the primary electrical system and the use of pressure suits during reentry to protect from loss of cabin pressure.
3.7 Level Of Fault Tolerance
Ultimately, the program and technical authorities evaluate and agree on the failure scenarios/modes and determine the appropriate level of failure tolerance and the practicality of using dissimilar redundancy or backup systems to protect against common cause failures.
Where failure tolerance is not the appropriate approach to control hazards, specific measures need to be employed to:
- Identify applicable hazards and their associated controls;
- Ensure robustness of the design; and
- Ensure adequate attention/focus is being applied to the design, manufacture, test, analysis, and inspection of the items.
In the area of design, in addition to the application of specifically approved standards and specifications, these measures can include the identification of specific design features that minimize the probability of occurrence of failure modes, such as the application of stringent factors of safety or other design margins.
For manufacturers, these measures can include establishing special process controls and documentation, special handling, and highlighting the importance of the item for those involved in the manufacturing process.
For testing, this can include accelerated life testing, fleet leader testing program, testing to understand failure modes, or other testing to establish additional confidence and margin in the design.
For analysis (instead of tests), these measures can include correlation with the testing representative of the actual configuration and the collection, management, and analysis of data used in trending failures, verifying loss of crew requirements, and evaluating flight anomalies.
For inspection, these measures can include the identification of specific inspection criteria to be applied to the item or the application of Government Mandatory Inspection Points or similar audits for important characteristics of the item.
This approach to hazard control takes advantage of existing standards or standards approved by the Technical Authorities to control hazards associated with the physical properties of the hardware and is typically controlled via the application of margin to the environments experienced by the design or system properties affected by the environment. Acceptance of these approaches by the Technical Authorities avoids processing waivers for numerous hazard causes where failure tolerance is not the appropriate approach. This includes, but is not limited to, electromagnetic interference, ionizing radiation, micrometeoroid orbital debris, structural failure, pressure vessel failure, and aerothermal shell shape for flight.
To ensure that the space system provides at least single failure tolerance to catastrophic events, with specific levels of failure tolerance and implementation derived via an integration of the design and safety analysis as required by NPR 8705.2 024, the following software tasks should be implemented:
- Failure Tolerance Analysis: Conduct a thorough failure tolerance analysis to identify the catastrophic events that could result if software fails, considering both erroneous behavior and silent failure, and determine the necessary levels of failure tolerance for each situation. This analysis should consider various failure modes and their impact on the system's safety and mission success.
- Redundancy Implementation: Implement redundancy in the system design to achieve the required levels of failure tolerance. For software, this should include dissimilar redundancy to ensure that single-point software failures do not lead to catastrophic outcomes. The choice of redundancy strategy should be based on the results of the safety analysis.
- Safety Analysis Integration: Integrate safety analysis methods such as Fault Tree Analysis (FTA) and Failure Modes and Effects Analysis (FMEA) considering software failures into the design process. These analyses help identify critical areas where redundancy is needed and ensure that the implemented redundancy effectively mitigates the identified risks.
- Independent Verification and Validation (IV&V): Conduct IV&V activities to ensure that the failure tolerance and redundancy measures are correctly implemented and function as intended. IV&V should verify that the system can tolerate single erroneous software failures without leading to catastrophic events.
- Simulation and Testing: Perform extensive simulations and testing to validate the failure tolerance of the system. This includes testing the system under various failure scenarios to ensure that the redundancy mechanisms work correctly and the system can recover from single failures.
- Configuration Management: Maintain strict configuration management to ensure that all redundant components and systems are correctly configured and managed. This reduces the risk of errors due to incorrect or inconsistent configurations that could compromise the system's failure tolerance.
- Real-time Monitoring and Alerts: Implement real-time monitoring systems to detect failures and initiate appropriate responses. These systems should provide alerts for any conditions that may compromise the system's redundancy and failure tolerance.
- Documentation and Training: Provide comprehensive documentation and training for all personnel involved in the operation and maintenance of the space system. This includes procedures for handling failures, troubleshooting guides, and emergency protocols to ensure that the team is well-prepared to manage any situation.
- Continuous Improvement: Establish a process for continuous improvement based on lessons learned from previous missions and testing. This includes updating the failure tolerance analysis and redundancy implementation as new information becomes available.
By implementing these tasks, the space system can be designed to provide at least single failure tolerance to catastrophic events, supporting mission success and safety.
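The Fault Tree Analysis task listed above can be illustrated with a very small evaluator that looks for basic events which, on their own, cause the top event. This is a sketch only: the tuple-based tree encoding, the event names, and the two example trees are hypothetical, and real fault trees are normally analyzed with dedicated tools and full minimal cut sets.

```python
from typing import Set, Union

# A node is either a basic event name or a tuple ("AND" | "OR", child, child, ...).
Node = Union[str, tuple]


def occurs(node: Node, failed: Set[str]) -> bool:
    """Evaluate whether the event at this node occurs given the failed basic events."""
    if isinstance(node, str):
        return node in failed
    gate, *children = node
    results = [occurs(child, failed) for child in children]
    if gate == "AND":
        return all(results)
    if gate == "OR":
        return any(results)
    raise ValueError(f"unknown gate: {gate}")


def basic_events(node: Node) -> Set[str]:
    """Collect every basic event name in the tree."""
    if isinstance(node, str):
        return {node}
    return set().union(*(basic_events(child) for child in node[1:]))


def single_point_failures(top: Node) -> Set[str]:
    """Basic events that cause the top event with no other failure present."""
    return {event for event in basic_events(top) if occurs(top, {event})}


# Two redundant computers running one software image.
similar = ("OR", ("AND", "computer_a_hw", "computer_b_hw"), "shared_sw_defect")

# The same primary set, plus a dissimilar backup that must also fail.
dissimilar = ("AND", similar, ("OR", "backup_hw", "backup_sw_defect"))

print(single_point_failures(similar), single_point_failures(dissimilar))
```

In the first tree the shared software defect is a single point of failure despite the hardware redundancy. In the second tree no single basic event reaches the top event, provided the backup truly shares nothing with the primary set.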
3.8 Additional Guidance
Additional guidance related to this requirement may be found in the following materials in this Handbook:
Related Links |
---|
3.9 Center Process Asset Libraries
SPAN - Software Processes Across NASA
SPAN contains links to Center managed Process Asset Libraries. Consult these Process Asset Libraries (PALs) for Center-specific guidance including processes, forms, checklists, training, and templates related to Software Development. See SPAN in the Software Engineering Community of NEN. Available to NASA only. https://nen.nasa.gov/web/software/wiki 197
See the following link(s) in SPAN for process assets from contributing Centers (NASA Only).
SPAN Links |
---|
To be developed later. |
3.10 Summary
To control and mitigate software common mode failures, there are required risk minimization activities as well as three options for control/mitigation strategies (System Failure Tolerance, Recover/Repair, and Risk Acceptance). If Risk Acceptance is the selected option, the Variance and NCR acceptance rationale should describe the risk minimization activities and address how these eight items are being implemented or provide compatible alternatives that meet their intent.
4. Small Projects
No additional guidance is available for small projects. The community of practice is encouraged to submit guidance candidates for this paragraph.
5. Resources
5.1 References
5.2 Tools
NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN.
The list is informational only and does not represent an “approved tool list”, nor does it represent an endorsement of any particular tool. The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.
6. Lessons Learned
6.1 NASA Lessons Learned
No Lessons Learned have currently been identified for this requirement.
6.2 Other Lessons Learned
No other Lessons Learned have currently been identified for this requirement.
7. Software Assurance
7.1 Tasking for Software Assurance
- Confirm implementation of redundancy in the software design to achieve the required levels of failure tolerance.
- Confirm that the hazard reports or safety data packages contain all known software contributions or events where software, either by its action, inaction, or incorrect action, leads to a hazard.
- Assess that the hazard reports identify the software components associated with the system hazards per the criteria defined in NASA-STD-8739.8, Appendix A.
- Ensure that redundancy management is robust and supports fault tolerance requirements.
7.2 Software Assurance Products
- 8.54 - Software Requirements Analysis
- 8.55 - Software Design Analysis
- 8.58 - Software Safety and Hazard Analysis
Objective Evidence
- Software design showing the required levels of failure tolerance and common cause analysis.
- Software design that shows how the system design meets the required levels of failure tolerance.
- Completed Hazard Analyses and Hazard Reports identifying all of the potential hazard faults with their associated single failure tolerance and redundancy
- Completed software safety and hazard analysis results
- Software Requirements Analysis results
7.3 Metrics
To ensure that the space system provides at least a single failure tolerance to catastrophic events, with specific levels of failure tolerance and implementation derived via an integration of the design and safety analysis, the following considerations and requirements need to be addressed:
- Failure Tolerance Requirement:
- The system must be designed to tolerate at least one failure without leading to a catastrophic event. This involves incorporating redundancy (either similar or dissimilar) into the system design.
- Design and Safety Analysis Integration:
- The specific levels of failure tolerance and the implementation approach (similar or dissimilar redundancy) should be determined through a thorough integration of the design and safety analysis. This integration is required by NPR 8705.2, Human-Rating Requirements for Space Systems.
- Hazard Analysis:
- Conduct a comprehensive hazard analysis to identify potential hazards that could lead to catastrophic events. The hazard analysis should consider the software’s ability, by design, to cause or control a given hazard. This includes analyzing software common-mode failures and ensuring redundancy management supports fault tolerance requirements.
- Redundancy Management:
- Ensure that redundancy management is robust and supports fault tolerance requirements. This may involve similar redundancy (identical components) or dissimilar redundancy (different types of components performing the same function to avoid common-mode failures).
- Verification and Validation (V&V):
- Perform rigorous verification and validation to ensure that the failure tolerance and redundancy mechanisms are correctly implemented and effective. This includes independent verification and validation (IV&V) activities to provide an objective assessment of the system’s ability to tolerate failures.
- Software Assurance and Safety:
- Follow the guidelines in the Software Assurance and Software Safety Standard (NASA-STD-8739.8) to ensure that all safety-critical software requirements are met. This includes conducting software safety analysis to supplement the system hazard analysis and identifying must-work and must-not-work functions.
- Documentation and Reporting:
- Maintain thorough documentation of the design, analysis, and verification processes. This includes hazard reports, test plans, and results, as well as records of any tailoring or deviations from standard requirements, with appropriate risk evaluations and justifications.
- Continuous Monitoring and Improvement:
- Continuously monitor the system for any potential failures and update the safety and design analysis as needed. Ensure that any identified defects or issues are addressed promptly and that the system remains compliant with the established safety and failure tolerance requirements.
Example metrics:
- % of traceability completed for all hazards to software requirements and test procedures
- # of hazards with completed test procedures/cases vs. total # of hazards over time
- # of Non-Conformances identified while confirming hazard controls are verified through test plans/procedures/cases
- # of Hazards containing software that has been tested vs. total # of Hazards containing software
- # of safety-related Non-Conformances
- # of Safety Critical tests executed vs. # of Safety Critical tests witnessed by SA
- Software code/test coverage percentages for all identified safety-critical components (e.g., # of paths tested vs. total # of possible paths)
- # of safety-critical requirement verifications completed vs. total # of safety-critical requirement verifications
- Test coverage data for all identified safety-critical software components
These steps help ensure that the space system meets the requirement for single failure tolerance to catastrophic events, with specific levels of failure tolerance and implementation derived through integrated design and safety analysis. For detailed guidance, refer to NPR 8705.2 and the Software Assurance and Software Safety Standard (NASA-STD-8739.8).
See also Topic 8.18 - SA Suggested Metrics
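Several of the example metrics above are simple ratios over hazard and test records. The following is a minimal sketch of how they might be computed, assuming a project keeps hazard data in a form similar to the hypothetical `Hazard` record below. The field names, identifiers, and sample values are illustrative assumptions, not a Handbook-defined data model.

```python
from dataclasses import dataclass, field


@dataclass
class Hazard:
    """One hazard record. All field names here are hypothetical."""
    hazard_id: str
    contains_software: bool
    linked_requirements: list = field(default_factory=list)
    linked_test_procedures: list = field(default_factory=list)
    software_tested: bool = False


def percent(numerator: int, denominator: int) -> float:
    """Return a percentage, or 0.0 when there is nothing to count."""
    return 100.0 * numerator / denominator if denominator else 0.0


def traceability_percent(hazards) -> float:
    """% of hazards traced to at least one requirement and one test procedure."""
    traced = sum(
        1 for h in hazards if h.linked_requirements and h.linked_test_procedures
    )
    return percent(traced, len(hazards))


def software_hazards_tested(hazards):
    """Return (# of software hazards tested, total # of software hazards)."""
    sw = [h for h in hazards if h.contains_software]
    return sum(1 for h in sw if h.software_tested), len(sw)


def witnessed_percent(executed: int, witnessed: int) -> float:
    """% of executed safety-critical tests that were witnessed by SA."""
    return percent(witnessed, executed)


# Illustrative sample data only.
hazards = [
    Hazard("HZ-001", True, ["SRS-10"], ["TP-3"], software_tested=True),
    Hazard("HZ-002", True, ["SRS-11"], []),
    Hazard("HZ-003", False, ["SRS-12"], ["TP-7"]),
    Hazard("HZ-004", True),
]

print(f"Traceability: {traceability_percent(hazards):.1f}%")
print("Software hazards tested vs. total:", software_hazards_tested(hazards))
print(f"Tests witnessed by SA: {witnessed_percent(executed=40, witnessed=30):.1f}%")
```

Tracking these values at each milestone, rather than once, gives the "over time" trend that several of the metrics call for.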
7.4 Guidance
To ensure that the space system provides at least a single failure tolerance to catastrophic events, with specific levels of failure tolerance and implementation derived via an integration of the design and safety analysis as required by NPR 8705.2 024, the following software tasks should be implemented:
Failure Tolerance and Common Cause Analysis: Conduct a thorough failure tolerance analysis to identify potential catastrophic events and determine the necessary levels of failure tolerance. This analysis should consider various failure modes and their impact on software safety and mission success. The level and type of redundancy (similar or dissimilar) is an important and often controversial aspect of system design. Redundancy by itself does not make a system safe. It is the responsibility of the engineering and safety teams to determine when redundancy must be included and how best to implement it, or to justify fully why a single-string design approach is sufficient, and to communicate the associated risks. The resulting design should optimize safety given the mission requirements and constraints. For software/automation, redundancy strategies can include human involvement and automated or manual backup strategies.
Redundancy alone does not meet the intent of this requirement. When a critical system fails because of improper or unexpected performance due to unanticipated conditions, similar redundancy can be ineffective at preventing the complete loss of the system. Dissimilar redundancy can be very effective provided there is sufficient separation among the redundant legs. For example, dissimilar redundancy where a crew member overrides automated control with manual control, but the manual path uses all or a portion of the same software that is behaving erroneously, may not survive the failure because the two approaches share a common failure point. Areas in the system architecture should be analyzed for this commonality. It is also highly desirable that the spaceflight system performance degrades predictably to allow sufficient time for failure detection and, when possible, system recovery even when experiencing multiple failures.
There are examples of dissimilar redundancy in current systems. For Earth reentry, the Soyuz spacecraft has a dissimilar backup ballistic entry mode to protect from loss of the primary attitude control system and a backup parachute for landing. Other examples include backup batteries for critical systems that protect from loss of the primary electrical system and the use of pressure suits during reentry to protect from loss of cabin pressure.
Ultimately, the program and technical authorities evaluate and agree on the failure scenarios/modes and determine the appropriate level of failure tolerance and the practicality of using dissimilar redundancy or backup systems to protect against common cause failures.
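The commonality check described above can be illustrated with a small sketch. It assumes each redundant leg is described by the set of resources it depends on (software builds, power buses, processors) and flags any resource whose loss would remove every leg at once. The leg names and resource names are hypothetical, and the model is deliberately simplified: it is an aid to thinking about shared failure points, not a substitute for the hazard and common cause analyses.

```python
def common_cause_candidates(legs):
    """Return the resources that every redundant leg depends on.

    `legs` maps a leg name to the set of resource names it needs.
    """
    if not legs:
        return set()
    return set.intersection(*(set(deps) for deps in legs.values()))


def surviving_legs(legs, failed_resource):
    """Return the legs that do not depend on the failed resource."""
    return [name for name, deps in legs.items() if failed_resource not in deps]


def is_single_failure_tolerant(legs):
    """True if at least one leg survives the loss of any single resource."""
    all_resources = set().union(*legs.values())
    return bool(legs) and all(surviving_legs(legs, r) for r in all_resources)


# Manual override that still runs through the same (hypothetical) guidance software.
legs = {
    "automated_control": {"flight_computer_1", "guidance_software", "power_bus_A"},
    "manual_override": {"hand_controller", "guidance_software", "power_bus_B"},
}

# Same function with the manual path separated from the shared software.
separated_legs = {
    "automated_control": {"flight_computer_1", "guidance_software", "power_bus_A"},
    "manual_override": {"hand_controller", "backup_control_software", "power_bus_B"},
}

print(common_cause_candidates(legs))
print(is_single_failure_tolerant(legs))
print(is_single_failure_tolerant(separated_legs))
```

In the first configuration the shared software is a common failure point, so the two legs do not provide real failure tolerance even though they look dissimilar. The second configuration removes that shared dependency.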
- Redundancy Implementation: Ensure implementation of redundancy in the system and software design to achieve the required levels of failure tolerance. This can include both similar and dissimilar redundancy to ensure that single-point failures do not lead to catastrophic outcomes. The choice of redundancy type should be based on the results of the safety analysis.
- Software Safety and Hazard Analysis: Develop and maintain a Software Safety Analysis throughout the software development life cycle. Assess that the Hazard Analyses (including hazard reports) identify the software components associated with the system hazards per the criteria defined in NASA-STD-8739.8, Appendix A (see SWE-205 - Determination of Safety-Critical Software). Perform these analyses on all new requirements, requirement changes, and software defects to determine their impact on the software system's reliability and safety. Confirm that all safety-critical requirements related to single failure tolerance and redundant capabilities have been implemented and adequately tested to prevent catastrophic events during mission-critical operations. It may be necessary to discuss these findings during the Safety Review so the reviewers can weigh the impact of implementing the changes. (See Topic 8.58 – Software Safety and Hazard Analysis.)
- Hazard Analysis/Hazard Reports: Confirm that a comprehensive hazard analysis was conducted to identify potential hazards that could result from critical software behavior. This analysis should include evaluating existing and potential hazards and recommending mitigation strategies for identified hazards. The Hazard Reports should contain the results of the analyses and proposed mitigations. (See Topic 5.24 - Hazard Report Minimum Content.)
- Software Safety Analysis: To develop this analysis, utilize safety analysis techniques such as 8.07 - Software Fault Tree Analysis and 8.05 - SW Failure Modes and Effects Analysis to identify potential hazards and failure modes. This helps in designing controls and mitigations for the operation of critical functions. When generating this SA product, see Topic 8.09 - Software Safety Analysis for additional guidance.
- Simulation and Testing: Ensure that extensive simulations and testing are performed to validate the failure tolerance of the software. This includes testing the system and software under various failure scenarios to confirm that the redundancy mechanisms work correctly and the system can recover from single failures.
- Configuration Management: Maintain strict configuration management so that all redundant software components and systems are correctly configured and managed. This reduces the risk of errors due to incorrect or inconsistent configurations that could compromise the system's failure tolerance.
- Real-time Monitoring and Alerts: Ensure that the software design and software requirements implement real-time monitoring systems to detect failures and initiate appropriate responses. The software systems should provide alerts for any conditions that may compromise the system's redundancy and failure tolerance.
- Training and Documentation: Ensure that comprehensive documentation and training are provided for all personnel involved in the operation and maintenance of the space system software. This includes procedures for handling failures, troubleshooting guides, and emergency protocols so that the team is well-prepared to manage any situation.
- Continuous Improvement: Ensure that a process for continuous improvement is in place and is based on lessons learned from previous missions and testing. This includes updating the failure tolerance analysis and redundancy implementation as new information becomes available.
By implementing these tasks, the space system software can be designed to provide at least a single failure tolerance to catastrophic events, ensuring mission success and safety.
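As one illustration of the redundancy management and real-time monitoring tasks above, the sketch below selects a value from redundant channels and raises alerts when a channel stops reporting, disagrees with the selected value, or when redundancy itself has been lost. The channel names, the tolerance, and the "fewer than two healthy channels" rule are assumptions made for the example. Flight redundancy management typically adds persistence filtering, channel isolation, and recovery logic that are not shown here.

```python
from statistics import median


def select_and_monitor(readings, tolerance):
    """Select a value from redundant channels and report redundancy alerts.

    `readings` maps a channel name to a float, or to None when the channel
    is not reporting. Returns (selected_value, alerts).
    """
    valid = {ch: v for ch, v in readings.items() if v is not None}
    alerts = [f"{ch}: no data" for ch, v in readings.items() if v is None]

    if not valid:
        return None, alerts + ["all channels lost"]

    # With three valid channels this is a mid-value select; with two valid
    # channels statistics.median returns their average.
    selected = median(valid.values())

    healthy = []
    for ch, v in valid.items():
        if abs(v - selected) > tolerance:
            alerts.append(f"{ch}: disagrees with selected value")
        else:
            healthy.append(ch)

    # Alert on loss of redundancy, not only on loss of function.
    if len(healthy) < 2:
        alerts.append("redundancy lost: fewer than two healthy channels")

    return selected, alerts


# One channel has failed to an out-of-family value.
value, alerts = select_and_monitor({"A": 10.0, "B": 10.2, "C": 55.0}, tolerance=0.5)
print(value, alerts)

# Two channels have stopped reporting; the function still works but is now zero-fault tolerant.
value2, alerts2 = select_and_monitor({"A": 10.0, "B": None, "C": None}, tolerance=0.5)
print(value2, alerts2)
```

The second case shows why alerts should cover conditions that compromise redundancy: the selected value is still usable, but the next failure would no longer be tolerated.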
7.5 Additional Guidance
Additional guidance related to this requirement may be found in the following materials in this Handbook:
Related Links |
---|