

SWE-134 - Safety-Critical Software Design Requirements

1. Requirements

3.7.3 If a project has safety-critical software or mission-critical software, the project manager shall implement the following items in the software: 

a. The software is initialized, at first start and restarts, to a known safe state.
b. The software safely transitions between all predefined known states.
c. Termination performed by software functions is performed to a known safe state.
d. Operator overrides of software functions require at least two independent actions by an operator.
e. Software rejects commands received out of sequence when execution of those commands out of sequence can cause a hazard.
f. The software detects inadvertent memory modification and recovers to a known safe state.
g. The software performs integrity checks on inputs and outputs to/from the software system.
h. The software performs prerequisite checks prior to the execution of safety-critical software commands.
i. No single software event or action is allowed to initiate an identified hazard.
j. The software responds to an off-nominal condition within the time needed to prevent a hazardous event.
k. The software provides error handling.
l. The software can place the system into a safe state.

1.1 Notes

These requirements apply to components that reside in a mission-critical or safety-critical system and that control, mitigate, or contribute to a hazard, as well as to software used to command hazardous operations/activities.

1.2 History

SWE-134 - Last used in rev NPR 7150.2D

Rev     SWE Statement
A

2.2.12 When a project is determined to have safety-critical software, the project shall ensure the following items are implemented in the software:

a. Safety-critical software is initialized, at first start and at restarts, to a known safe state.
b. Safety-critical software safely transitions between all predefined known states.
c. Termination performed by software of safety-critical functions is performed to a known safe state.
d. Operator overrides of safety-critical software functions require at least two independent actions by an operator.
e. Safety-critical software rejects commands received out of sequence, when execution of those commands out of sequence can cause a hazard.
f.  Safety-critical software detects inadvertent memory modification and recovers to a known safe state.
g. Safety-critical software performs integrity checks on inputs and outputs to/from the software system.
h. Safety-critical software performs prerequisite checks prior to the execution of safety-critical software commands.
i.  No single software event or action is allowed to initiate an identified hazard.
j.  Safety-critical software responds to an off nominal condition within the time needed to prevent a hazardous event.
k. Software provides error handling of safety-critical functions.
l.  Safety-critical software has the capability to place the system into a safe state.
m. Safety-critical elements (requirements, design elements, code components, and interfaces) are uniquely identified as safety-critical.
n.  Incorporate requirements in the coding methods, standards, and/or criteria to clearly identify safety-critical code and data within source code comments.

Difference between A and B

No change

B

3.7.2 When a project is determined to have safety-critical software, the project manager shall implement the following items in the software:

a. Safety-critical software is initialized, at first start and at restarts, to a known safe state.
b. Safety-critical software safely transitions between all predefined known states.
c. Termination performed by software of safety-critical functions is performed to a known safe state.
d. Operator overrides of safety-critical software functions require at least two independent actions by an operator.
e. Safety-critical software rejects commands received out of sequence, when execution of those commands out of sequence can cause a hazard.
f. Safety-critical software detects inadvertent memory modification and recovers to a known safe state.
g. Safety-critical software performs integrity checks on inputs and outputs to/from the software system.
h. Safety-critical software performs prerequisite checks prior to the execution of safety-critical software commands.
i. No single software event or action is allowed to initiate an identified hazard.
j. Safety-critical software responds to an off nominal condition within the time needed to prevent a hazardous event.
k. Software provides error handling of safety-critical functions.
l. Safety-critical software has the capability to place the system into a safe state.
m. Safety-critical elements (requirements, design elements, code components, and interfaces) are uniquely identified as safety-critical.
n. Requirements are incorporated in the coding methods, standards, and/or criteria to clearly identify safety-critical code and data within source code comments.

Difference between B and C

Changed "When a project is determined to have" to "If a project has" safety-critical software;
Added mission-critical software to the requirement;
Removed "Safety-Critical" from items a. - l. as the entire requirement pertains to it;
Changed "has the capability to"  to "can" in item l.;
Deleted items m. and n. 
C

3.7.3 If a project has safety-critical software or mission-critical software, the project manager shall implement the following items in the software:

a. The software is initialized, at first start and restarts, to a known safe state.

b. The software safely transitions between all predefined known states.

c. Termination performed by the software functions is performed to a known safe state.

d. Operator overrides of software functions require at least two independent actions by an operator.

e. The software rejects commands received out of sequence when the execution of those commands out of sequence can cause a hazard.

f. The software detects inadvertent memory modification and recovers to a known safe state.

g. The software performs integrity checks on inputs and outputs to/from the software system.

h. The software performs prerequisite checks prior to the execution of safety-critical software commands.

i. No single software event or action is allowed to initiate an identified hazard.

j. The software responds to an off-nominal condition within the time needed to prevent a hazardous event.

k. The software provides error handling.    

l. The software can place the system into a safe state.

Difference between C and D

No change
D

3.7.3 If a project has safety-critical software or mission-critical software, the project manager shall implement the following items in the software: 

a. The software is initialized, at first start and restarts, to a known safe state.
b. The software safely transitions between all predefined known states.
c. Termination performed by software functions is performed to a known safe state.
d. Operator overrides of software functions require at least two independent actions by an operator.
e. Software rejects commands received out of sequence when execution of those commands out of sequence can cause a hazard.
f. The software detects inadvertent memory modification and recovers to a known safe state.
g. The software performs integrity checks on inputs and outputs to/from the software system.
h. The software performs prerequisite checks prior to the execution of safety-critical software commands.
i. No single software event or action is allowed to initiate an identified hazard.
j. The software responds to an off-nominal condition within the time needed to prevent a hazardous event.
k. The software provides error handling.
l. The software can place the system into a safe state.



1.3 Applicability Across Classes

Class          A      B      C      D      E      F

Applicable?

Key: ✓ - Applicable | ✗ - Not Applicable


1.4 Related Activities

This requirement is related to the following Activities:

2. Rationale

Implementing safety-critical software or mission-critical software design requirements helps ensure that the systems are safe and that the safety-critical software or mission-critical software requirements and processes are followed.

Safety-critical and mission-critical software are integral to ensuring the reliability and safety of projects where failure could result in catastrophic consequences, including loss of life, equipment, data, or mission objectives. Implementing the specified requirements ensures that the software operates in a robust, predictable, and safe manner even in adverse or unexpected circumstances.

Each provision of this requirement is designed to reduce risks associated with software malfunctions or operator errors in environments where safety and mission success are paramount. Implementing these measures protects against hazards, preserves asset integrity, and ensures compliance with industry safety standards for critical software systems. Failure to incorporate these safety mechanisms could lead to catastrophic outcomes, including harm to personnel, equipment failure, mission compromise, or environmental hazards. This requirement reinforces a culture of safety, reliability, and robust design within projects of high consequence.

The rationale for each sub-requirement follows:

2.1 The software is initialized, at first start and restarts, to a known safe state.

  • Ensures that the system begins operation from a predefined stable, hazard-free condition, mitigating risks associated with unpredictable or unsafe states during boot-up or restart.

2.2  The software safely transitions between all predefined known states.

  • Guarantees that transitions between operational states occur in a controlled, safe manner, reducing the risk of unexpected states leading to hazards or system instability.

2.3  Termination performed by software functions is performed to a known safe state.

  • Ensures safe system shutdown or termination, preventing erratic behavior or lingering hazardous conditions during cessation of operations.

2.4  Operator overrides of software functions require at least two independent actions by an operator.

  • Reduces the likelihood of accidental or unintended overrides, ensuring intentional and deliberate operator input before changing critical functions, thus minimizing human error.

2.5  Software rejects commands received out of sequence when execution of those commands out of sequence can cause a hazard.

  • Prevents unsafe outcomes from commands being executed in a potentially hazardous order, maintaining control and integrity of the system's operations.

2.6  The software detects inadvertent memory modification and recovers to a known safe state.

  • Inadvertent memory modifications caused by errors or hardware faults could lead to unpredictable system behavior. Detection and recovery ensure the system returns to a stable and non-hazardous state, preserving safety.

2.7  The software performs integrity checks on inputs and outputs to/from the software system.

  • Validates the correctness of critical data that influences system behavior, ensuring erroneous or corrupted data does not lead to unsafe operations or states.

2.8  The software performs prerequisite checks prior to the execution of safety-critical software commands.

  • Verifies that all preconditions are met before executing critical commands, avoiding scenarios where premature execution could result in hazards or compromised reliability.

2.9  No single software event or action is allowed to initiate an identified hazard.

  • Introduces redundancy and barriers to prevent system hazards from being caused by a single point of failure, thereby increasing fault tolerance and safety.

2.10 The software responds to an off-nominal condition within the time needed to prevent a hazardous event.

  • Timely responses to abnormal conditions ensure proactive mitigation of risks and prevent escalation into hazardous events, preserving the system's safe operation.

2.11  The software provides error handling.

  • Robust error handling prevents unhandled faults from destabilizing the system and propagating unsafe conditions, ensuring reliable continuance or controlled recovery.

2.12 The software can place the system into a safe state.

  • Provides an ultimate safeguard by ensuring the system can be intentionally moved into a safe condition under any circumstances, mitigating risks from unforeseen failures or hazards.

3. Guidance

3.1 Safety-Critical Software and Mission-Critical Software

This requirement applies to both safety-critical software and mission-critical software. The items above are design practices that should be followed when developing such software.


Software safety requirements are contained in NASA-STD-8739.8B. The software assurance tasks below are derived from NPR 7150.2D, para. 3.7.3, SWE-134: Table 1, SA Tasks 1-6.

1. Analyze the software requirements and the software design and work with the project to implement NPR 7150.2 requirement items "a" through "l."

2. Assess that the source code satisfies the conditions in the NPR 7150.2 requirement "a" through "l" for safety-critical and mission-critical software at each code inspection, test review, safety review, and project review milestone.

3. Confirm that the values of the safety-critical loaded data, uplinked data, rules, and scripts that affect hazardous system behavior have been tested.

4. Analyze the software design to ensure the following:
   a. Use of partitioning or isolation methods in the design and code,
   b. That the design logically isolates the safety-critical design elements and data from those that are non-safety-critical.

5. Participate in software reviews affecting safety-critical software products.

6. Ensure the SWE-134 implementation supports and is consistent with the system hazard analysis.

See the software assurance tab for additional guidance material. 

See also SWE-023 - Software Safety-Critical Requirements and Topic 7.24 - Human Rated Software Requirements.

3.2 Requirement Notes

The following notes clarify each item and provide specific measures to help ensure compliance.

Item a: (The software is initialized, at first start, and restarts, to a known safe state.)

  • Improved Guidance:
    A known safe state is a system state where hazards are mitigated, and the system is ready for reliable operation. To establish this state, ensure that the following components are inspected and verified:
    • Hardware state: Ensure hardware configuration matches predefined, verified initialization parameters. Include hardware self-test routines during the startup sequence.
    • Software state: Confirm critical software modules are loaded correctly, and any volatile settings are initialized to nominal values.
    • Operational phase: Validate initial operational mode (e.g., standby mode vs. active operation) based on system requirements.
    • Device capability: Verify hardware fault indicators and ensure initial settings are compatible with device tolerances.
    • Configuration: Ensure network, file system, and device configurations are consistent with system specifications.
    • Memory integrity: Perform integrity checks on boot code and file allocation tables before system initialization.
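As an illustrative sketch, a startup routine following this guidance might force every output to an inert value and verify self-tests before declaring the safe state. The C example below uses hypothetical names (`init_to_safe_state`, `sys_context_t`) and placeholder self-test stubs; a real system would probe hardware and check memory integrity at these points:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical system context for illustration. */
typedef enum { STATE_UNKNOWN = 0, STATE_SAFE, STATE_OPERATIONAL } sys_state_t;

typedef struct {
    sys_state_t state;
    bool        actuators_enabled;
    int16_t     commanded_thrust;   /* nominal startup value: 0 */
} sys_context_t;

/* Placeholder self-tests; a real system would probe hardware here. */
static bool hardware_self_test_ok(void) { return true; }
static bool memory_check_ok(void)       { return true; }

/* Initialize to the known safe state; called at first start AND every
   restart. Returns false if any startup check fails, leaving the
   system inert. */
bool init_to_safe_state(sys_context_t *ctx)
{
    ctx->state = STATE_UNKNOWN;
    ctx->actuators_enabled = false;   /* outputs inert until checks pass */
    ctx->commanded_thrust = 0;

    if (!hardware_self_test_ok() || !memory_check_ok())
        return false;                 /* remain inert on any failed check */

    ctx->state = STATE_SAFE;
    return true;
}
```

Calling the same routine on every restart path guarantees the system re-enters the same known safe state regardless of how it went down.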

Item d: (Operator overrides of software functions require at least two independent actions by an operator.)

  • Improved Guidance:
    Requiring multiple, independent actions reduces the risks of accidental overrides by human operators. Examples include:
    • Physical actions, such as pressing two separate buttons simultaneously or sequentially within a confirmed timeframe.
    • Logical actions, such as entering an override confirmation in the software followed by user authentication or a verification code.
    • Independent actions must require distinct input mechanisms (e.g., touchscreen + physical button) or software contexts (e.g., separate windows or modes).
      This approach minimizes the probability of unintended actions resulting from human error or confusion.
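A minimal sketch of such a two-action gate, assuming a hypothetical `override_gate_t` fed by two distinct input mechanisms (for example, a physical arm switch and a separate software confirmation):

```c
#include <stdbool.h>

/* Hypothetical two-action override latch: an ARM step and a distinct
   CONFIRM step, each from a different input mechanism, must both occur
   before the override takes effect. */
typedef struct {
    bool armed;       /* set by a physical switch, for example */
    bool confirmed;   /* set by a separate software acknowledgment */
} override_gate_t;

void override_reset(override_gate_t *g) { g->armed = false; g->confirmed = false; }
void override_arm(override_gate_t *g)   { g->armed = true; }

/* Confirmation only counts if the gate was already armed. */
void override_confirm(override_gate_t *g) { if (g->armed) g->confirmed = true; }

bool override_active(const override_gate_t *g)
{
    return g->armed && g->confirmed;  /* both independent actions required */
}
```

Because confirmation is ignored unless the gate is already armed, a single accidental input can never activate the override on its own.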

Item f: (The software detects inadvertent memory modification and recovers to a known safe state.)

  • Improved Guidance:
    Inadvertent memory modifications can have catastrophic impacts, particularly in systems exposed to extreme environments (e.g., radiation environments). The following strategies can mitigate these risks:
    • Detection mechanisms:
      • Implement error detection codes (EDC), such as parity checks, cyclic redundancy checks (CRC), and checksums.
      • Use memory protection hardware features, such as memory access control (lock bits) and write protection.
      • Monitor dynamic memory access for anomalies using runtime validation routines.
    • Recovery mechanisms:
      • Utilize error-correcting codes (ECC) to correct bit errors where feasible.
      • Periodically scrub memory (e.g., refresh memory cells and correct errors in non-volatile memory).
    • Prevention strategies:
      • Employ software authentication (e.g., verifying code and data integrity before execution).
      • Design memory partitioning to isolate critical portions from non-critical areas.
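The detect-and-recover pattern can be sketched as follows. The simple rolling checksum and the names (`guarded_params_t`, `params_seal`) are illustrative only; flight code would typically use a CRC or hardware ECC instead:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simple rolling checksum for illustration; real systems would use a CRC. */
static uint32_t checksum(const void *data, size_t len)
{
    const uint8_t *bytes = data;
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = sum * 31u + bytes[i];
    return sum;
}

typedef struct {
    int32_t  limits[4];   /* safety-critical parameters */
    uint32_t crc;         /* stored checksum over limits */
} guarded_params_t;

void params_seal(guarded_params_t *p)
{
    p->crc = checksum(p->limits, sizeof p->limits);
}

/* Returns true if intact; on corruption, restores known-safe defaults
   and re-seals, recovering to a known safe state. */
bool params_check_or_recover(guarded_params_t *p, const int32_t safe_defaults[4])
{
    if (checksum(p->limits, sizeof p->limits) == p->crc)
        return true;
    memcpy(p->limits, safe_defaults, sizeof p->limits);  /* recover */
    params_seal(p);
    return false;
}
```

Checks like this are typically run periodically (memory scrubbing) and before any use of the guarded data.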

Item g: (The software performs integrity checks on inputs and outputs to/from the software system.)

  • Improved Guidance:
    Input-output integrity is essential for safe system behavior. Design guidelines include:
    • Input validation:
      • Validate nominal input ranges during run-time and reject out-of-spec inputs.
      • Detect and mitigate transient startup input anomalies using low-pass filters or stabilization logic.
    • Output verification:
      • Implement sanity checks on output settings before external actuation (e.g., throttle limits or safety valve thresholds).
    • Interface documentation:
      • Provide detailed interface specifications including expected ranges, timing requirements, and fault conditions.
      • Include clear contingency actions in case of interface errors or anomalies (e.g., fallback routines).
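As a minimal illustration, assuming hypothetical range limits documented in the interface specification, run-time input validation and output clamping might look like:

```c
#include <stdbool.h>

/* Hypothetical range check: reject out-of-spec sensor inputs and clamp
   actuator outputs to documented interface limits. */
typedef struct { double lo, hi; } range_t;

bool input_in_range(double v, range_t r)
{
    return v >= r.lo && v <= r.hi;   /* reject anything out of spec */
}

/* Clamp an output command to its safe actuation envelope before it
   reaches hardware. */
double clamp_output(double v, range_t r)
{
    if (v < r.lo) return r.lo;
    if (v > r.hi) return r.hi;
    return v;
}
```

Whether an out-of-range input is rejected, substituted, or triggers a fault response is a design decision that belongs in the interface documentation.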

Item h: (The software performs prerequisite checks prior to the execution of safety-critical software commands.)

  • Improved Guidance:
    Prerequisite checks ensure that commands are executed in the correct operational sequence, mode, and state. To achieve this:
    • Sequence validation:
      • Design command sequencing rules to identify inappropriate or unsafe sequences. For example, transitions between modes or states must follow predefined workflows.
    • Mode/state verification:
      • Verify device or system state compatibility before executing safety-critical commands (e.g., ensure a "disarmed state" before entering maintenance mode).
    • Command gating:
      • Implement logic that prohibits execution of commands until all prerequisites are explicitly met. This includes environmental conditions, sensor validation, and operator confirmation.
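Command gating can be sketched as an explicit conjunction of prerequisites, as in this hypothetical example for an imagined hazardous FIRE command:

```c
#include <stdbool.h>

typedef enum { MODE_OFF, MODE_STANDBY, MODE_ARMED } op_mode_t;

typedef struct {
    op_mode_t mode;
    bool      sensors_valid;        /* sensor validation complete */
    bool      operator_confirmed;   /* operator confirmation received */
} precheck_ctx_t;

/* Gate a hypothetical FIRE command: every prerequisite must hold
   explicitly; any missing condition rejects the command. */
bool fire_command_permitted(const precheck_ctx_t *c)
{
    return c->mode == MODE_ARMED
        && c->sensors_valid
        && c->operator_confirmed;
}
```

Listing prerequisites as an explicit conjunction makes the gating condition reviewable against the hazard analysis line by line.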

Item j: (The software responds to an off-nominal condition within the time needed to prevent a hazardous event.)

  • Improved Guidance:
    The system must proactively detect and respond to off-nominal conditions within required timeframes to prevent hazards. Key considerations include:
    • Detection mechanisms: Sensors must identify anomalies such as temperature thresholds, out-of-bounds motion, or voltage variations.
    • Response timing:
      • Perform real-time timing analysis to ensure mitigation routines execute within specified windows.
      • Include fail-safe mechanisms that trigger mitigation when fault detection exceeds real-time processing capability.
    • Mitigation strategies:
      • Correct faults when possible (e.g., failed sensor recalibration).
      • Transition to a known safe state when fault correction or continuation is infeasible.
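One way to reason about response timing is an explicit deadline check, sketched below with hypothetical fields; the `time_to_hazard_ms` value would come from the project's hazard analysis and `mitigation_cost_ms` from worst-case execution time analysis:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical deadline monitor: given when an off-nominal condition was
   detected and the hazard's time-to-effect, decide whether the normal
   mitigation can still complete in time or the immediate fail-safe must
   trigger instead. Times are milliseconds since boot. */
typedef struct {
    uint32_t detected_at_ms;
    uint32_t time_to_hazard_ms;    /* from hazard analysis */
    uint32_t mitigation_cost_ms;   /* worst-case mitigation execution time */
} offnominal_t;

bool mitigation_fits_deadline(const offnominal_t *e, uint32_t now_ms)
{
    uint32_t elapsed = now_ms - e->detected_at_ms;
    return elapsed + e->mitigation_cost_ms <= e->time_to_hazard_ms;
}
```

When this check fails, the design should fall through to an immediate safe-state transition rather than attempt the full mitigation.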

Item k: (The software provides error handling.)

  • Improved Guidance:
    Robust error handling ensures runtime faults or failures do not escalate into hazards. Design considerations for error handling include:
    • Fault detection:
      • Integrate mechanisms to detect operational, hardware, and software anomalies.
    • Error isolation:
      • Segment error-handling routines to prevent propagation of faults across system boundaries (e.g., isolate affected subsystems).
    • Error recovery:
      • Choose recovery techniques including rollback operations, system resets, or fallback modes based on fault criticality.
    • Minimization of common failure modes:
      • Apply design redundancy to eliminate single points of failure (e.g., replicate critical subsystems).
      • Explicitly handle all exception cases within software, rather than relying on default behavior.
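A minimal sketch of severity-based recovery selection, with hypothetical severity levels and actions, showing every case handled explicitly and the most conservative action as the default:

```c
#include <stdbool.h>

typedef enum { SEV_MINOR, SEV_MAJOR, SEV_CRITICAL } severity_t;
typedef enum { ACT_LOG_AND_CONTINUE, ACT_RETRY_FALLBACK, ACT_SAFE_STATE } action_t;

/* Map fault severity to a recovery action; every case is handled
   explicitly, and the default is the most conservative action rather
   than relying on unspecified behavior. */
action_t select_recovery(severity_t s)
{
    switch (s) {
    case SEV_MINOR:    return ACT_LOG_AND_CONTINUE;
    case SEV_MAJOR:    return ACT_RETRY_FALLBACK;
    case SEV_CRITICAL: return ACT_SAFE_STATE;
    default:           return ACT_SAFE_STATE;   /* no unhandled case */
    }
}
```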

Item l: (The software can place the system into a safe state.)

  • Improved Guidance:
    A safe state ensures all hazards are neutralized while retaining limited system functionality if feasible. Guidelines for achieving a safe state include:
    • Safe state definition:
      • Establish specific safe states early in system design. Examples include "disarmed" mode, "power off", or "standby mode".
    • Sensor design:
      • Design fault-tolerant sensors capable of robustly monitoring hazardous conditions and accurately transitioning the system into a safe state during detection.
    • Verification:
      • Validate safe-state transitions through rigorous testing and failure mode analysis to ensure all paths lead to non-hazardous conditions.
    • Timed response:
      • Ensure software can achieve the safe state transition within the time constraints imposed by the hazardous condition.
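An illustrative safe-state entry routine (names are hypothetical); the key design point is ordering: hazardous outputs are disabled before anything else, so a fault mid-transition still leaves the system inert:

```c
#include <stdbool.h>

typedef enum { S_INIT, S_STANDBY, S_ACTIVE, S_SAFE } state_t;

typedef struct {
    state_t state;
    bool    outputs_enabled;
} system_ctx_t;

/* Enter the safe state from ANY current state: neutralize hazardous
   outputs first, then record the state, so an interruption partway
   through still leaves outputs inert. */
void enter_safe_state(system_ctx_t *sys)
{
    sys->outputs_enabled = false;   /* neutralize hazards first */
    sys->state = S_SAFE;
}
```

Because the routine takes no preconditions, it can serve as the terminal action for every fault path in the system.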

Additional Safety-Critical Software Design Guidelines


  1. Minimize complexity: Keep safety-critical code simple and testable. Review code with complexity metrics above threshold (e.g., cyclomatic complexity > 15).
  2. Avoid unsafe constructs: Disallow recursion, goto, and infinite loops. Ensure bounds are fixed for loops.
  3. Heap memory: Avoid dynamic memory allocation at runtime; preallocate memory during initialization.
  4. Assertions: Use at least two runtime assertions per function to enforce assumptions and invariants.
  5. Pointer usage: Restrict pointer use; avoid function pointers and multiple levels of dereferencing where possible to reduce error risks.
  6. Compile rigorously: Enable all compiler warnings, and resolve them prior to software release.
  7. Security: Apply security best practices at all development levels, including data validation, boundary checks, and threat mitigation measures.
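Guidelines 2 and 4 can be illustrated together in a short C sketch: a loop with a fixed upper bound plus runtime assertions on the preconditions (the function and bound names are hypothetical):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrates guidelines 2 and 4: a loop with a fixed upper bound and
   at least two runtime assertions enforcing the function's assumptions. */
#define MAX_SAMPLES 64

int bounded_max(const int *samples, size_t n)
{
    assert(samples != NULL);            /* precondition: valid buffer */
    assert(n > 0 && n <= MAX_SAMPLES);  /* fixed upper bound on the loop */

    int best = samples[0];
    for (size_t i = 1; i < n; i++)      /* bound provably <= MAX_SAMPLES */
        if (samples[i] > best)
            best = samples[i];
    return best;
}
```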

Improving safety-critical software design reduces risks and builds resilience in systems where safety, reliability, and performance are paramount.

See also Topic 8.01 - Off Nominal Testing and Topic 8.04 - Additional Requirements Considerations for Use with Safety-Critical Software. If software is acquired from a supplier, see Topic 7.03 - Acquisition Guidance. See also Topic 7.21 - Multi-condition Software Requirements and Topic 7.23 - Software Fault Prevention and Tolerance.

See also SWE-184 - Software-related Constraints and Assumptions

Participate in peer reviews on all software changes and software defects affecting safety-critical software and hazardous functionality (see HR-33 - Inadvertent Operator Action).

3.3 Checklist for Safety-Critical or Mission-Critical Software

The checklist in PAT-035 - Checklist for Safety-Critical or Mission-Critical Software may be used to review the items in this requirement. Download the worksheet and make any modifications necessary for your project. Use the checklist as many times as necessary.


3.4 Additional Guidance

Additional guidance related to this requirement may be found in the following materials in this Handbook:

3.5 Center Process Asset Libraries

SPAN - Software Processes Across NASA
SPAN contains links to Center managed Process Asset Libraries. Consult these Process Asset Libraries (PALs) for Center-specific guidance including processes, forms, checklists, training, and templates related to Software Development. See SPAN in the Software Engineering Community of NEN. Available to NASA only. https://nen.nasa.gov/web/software/wiki

See the following link(s) in SPAN for process assets from contributing Centers (NASA Only). 

SPAN Links

4. Small Projects

The following simplified guidance is tailored for small projects where resources, complexity, and scope are more limited. The focus for small projects is maximizing safety with streamlined processes and practical implementation.

4.1 General Approach for Small Projects

  1. Prioritize Safe States: Even in small projects, define concrete safe states early and ensure these states are achievable during failures or hazards.
  2. Simplify Design: Minimize software complexity while ensuring all safety-critical functions are deterministic, reliable, and testable.
  3. Automate Safety Features: When possible, automate detection, recovery, and safe-state transitions to minimize reliance on operator intervention.
  4. Leverage Tools: Utilize existing software libraries, tools, or hardware capabilities to implement safety features (e.g., memory protection or error correction) without reinventing systems.

4.2 Item-Specific Guidance for Small Projects

Item a: (The software is initialized, at first start and restarts, to a known safe state.)

  • Simplify initialization routines by ensuring the system starts in a default, non-hazardous state (e.g., powered off motors, communications disabled).
  • Perform basic checks, such as verifying hardware status (e.g., sensors reading nominal values) and initializing software configurations to default, tested settings.

Item d: (Operator overrides of software functions require at least two independent actions by an operator.)

  • Require two distinct actions, such as:
    • Physically pressing two separate buttons or switches.
    • Software and hardware confirmation, such as entering an override password and simultaneously toggling a hardware switch.
  • Avoid complex override procedures; keep it simple but effective (e.g., "Confirm action" prompts with a second verification step).

Item f: (The software detects inadvertent memory modification and recovers to a known safe state.)

  • Use basic error detection mechanisms available in hardware (e.g., parity checks or watchdog timers).
  • Perform regular memory integrity checks on critical variables, especially after system interruptions or major events like resets.
  • Define a fallback routine that resets or reinitializes the system in case memory corruption is detected.

Item g: (The software performs integrity checks on inputs and outputs to/from the software system.)

  • Validate all inputs as a simple check (e.g., confirm values are within expected ranges) before acting on them.
  • Ensure outputs are validated before execution (e.g., limit physical actuation commands to avoid damaging hardware or creating unsafe conditions).
  • Document and test all interface specifications, defining acceptable inputs/outputs explicitly.

Item h: (The software performs prerequisite checks prior to the execution of safety-critical commands.)

  • Implement basic checks for prerequisites, such as ensuring the system is in the right state (e.g., standby mode) before executing a command.
    • Example: If a command requires the system to be powered on, reject commands when the system is off.
  • Use simple rules that prevent unsafe sequencing, such as "command A must always precede command B."

Item j: (The software responds to an off-nominal condition within the time needed to prevent a hazardous event.)

  • Detect abnormal conditions (e.g., sensor failure or hardware faults) and respond immediately by:
    • Logging the fault and disabling unsafe system components.
    • Transitioning to a predefined safe state, such as pausing operations safely.
  • Use simple timers to ensure mitigation routines execute within required time limits.

Item k: (The software provides error handling.)

  • Simplify error handling by focusing on:
    • Detection: Include basic fault checks, such as unavailable hardware or invalid inputs.
    • Isolation: Deactivate affected components (e.g., disable a jammed actuator).
    • Recovery: Reset operations to a known safe state or restart the system as a fallback.

Item l: (The software can place the system into a safe state.)

  • Define safe states early such as “stop all motion,” “power down non-critical systems,” or “disable unsafe components.”
  • Ensure basic sensors can identify hazards (e.g., temperature sensors for overheating) that trigger safe-state transitions.
  • Verify through testing that the system successfully transitions to safe states under various fault conditions.

4.3 Additional Streamlined Design Guidelines for Small Projects

  1. Simplify Code:
    • Code complexity should remain low (e.g., cyclomatic complexity < 10). Avoid recursive functions and excessive branching.
    • Use small, modular functions with clear entry and exit points to make testing and debugging easier.
  2. Limit Dependencies:
    • Use simple constructs (e.g., if and switch instead of goto).
    • Avoid relying on dynamic memory allocation (use arrays or preallocated buffers).
  3. Error Prevention:
    • Always check function return values (for errors) and handle them appropriately.
    • Use assertions to enforce conditions during execution (e.g., "value X must always be positive").
  4. Testing:
    • Test all code paths for both nominal and off-nominal conditions.
    • Automate safety-critical tests where possible to reduce errors during validation.
  5. Safety Documentation:
    • Create simple documentation defining safe states, prerequisites, interfaces, and mitigation actions.
    • Keep records streamlined but explicit, focusing on actions necessary to maintain safety.

4.4 Simplified Off-the-Shelf Solutions

Small projects can leverage existing tools or hardware/software features to meet safety requirements efficiently:

  • Error detection/recovery: Use processors or microcontrollers with built-in ECC, memory parity checking, and interrupts.
  • Software validation: Many development platforms (e.g., Arduino, Raspberry Pi) provide libraries for input/output validation, timers, and fail-safe measures.
  • State handling: Develop state machines using modular approaches, making transitions between known states easy to program and test.
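A table-driven state machine is one simple way to make transitions between known states explicit and testable. In this sketch the states are hypothetical, and only the transitions listed in the table are legal; everything else is rejected:

```c
#include <stdbool.h>

typedef enum { ST_IDLE, ST_RUNNING, ST_SAFE, ST_COUNT } st_t;

/* Transition table: legal[from][to]. Anything not listed is rejected. */
static const bool legal[ST_COUNT][ST_COUNT] = {
    /* from \ to    IDLE   RUNNING  SAFE  */
    /* IDLE    */ { false, true,    true  },
    /* RUNNING */ { true,  false,   true  },
    /* SAFE    */ { false, false,   false },  /* safe state is terminal here */
};

bool try_transition(st_t *current, st_t next)
{
    if (!legal[*current][next])
        return false;          /* reject undefined transitions */
    *current = next;
    return true;
}
```

The table doubles as documentation: reviewers can compare it row by row against the project's state diagram, and tests can exhaustively cover every cell.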

By leveraging simple, streamlined strategies and focusing on safety-critical aspects, small projects can achieve compliance with these requirements without unnecessary complexity or resource strain.

5. Resources

5.1 References

5.2 Tools


Tools to aid in compliance with this SWE, if any, may be found in the Tools Library in the NASA Engineering Network (NEN). 

NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN. 

The list is informational only and does not represent an “approved tool list”, nor does it represent an endorsement of any particular tool.  The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.


5.3 Process Asset Templates

Click on a link to download a usable copy of the template. 




6. Lessons Learned

6.1 NASA Lessons Learned

Early planning and coordination between Software Engineering, Software Safety, and Software Assurance teams regarding the applicability and implementation of the SWE-134 software safety requirements are crucial for identifying and mitigating software-related risks. This proactive approach reduces schedule impacts, prevents costly late-phase rework, and ensures consistent adherence to safety and mission-critical requirements. Effective early collaboration establishes clear roles and responsibilities for identifying safety-critical software, performing hazard analysis, and verifying compliance with software safety standards.

Deficiencies in Mission Critical Software Development for Mars Climate Orbiter (MCO) (1999)
Lesson Number 0740: "Ensure Consistent Unit Conventions and Formal Software Review Processes"

  • Context: The Mars Climate Orbiter was lost due to a conversion error in the "Sm_forces" program, which output data in English units (pounds-force seconds) instead of the required metric units (Newton-seconds). This discrepancy caused navigational errors that ultimately resulted in mission failure.

  • Summary of Lessons Learned:

    • The loss was directly tied to insufficient software safety planning and software engineering integration. Software Assurance and Safety teams were not fully engaged early in the software lifecycle, resulting in the omission of critical checks for unit consistency. This highlights the importance of rigorous software reviews, staff participation in design walkthroughs, and adequate training on software safety practices.
    • Mission-critical software must be clearly identified, and both software and operational requirements need robust traceability to safety requirements like SWE-134.
    • Formal unit consistency verifications should be explicitly tested and reviewed to prevent errors during design and implementation.
  • Recommendations:

    1. Identify Mission-Critical Software Early: Collaborate between software engineering and the software safety team early to identify mission-critical and safety-critical software through a structured review process.
    2. Enforce Rigorous Software Review Practices: Ensure participation of software assurance and safety staff in major software lifecycle events (design reviews, code walkthroughs, and test result reviews).
    3. Standardize Units and Parameters: Verify the consistency of all engineering units and key system parameters to prevent misinterpretations during data handling between systems.
    4. Train Key Personnel: Provide training for staff to identify safety risks, perform software reviews, conduct hazard analyses, and comply with SWE-134 safety requirements.
    5. Include Unit Validation in Test Plans: Integrate unit validation, consistency checks, and data flow validation as part of formal software tests.
  • Key Takeaway: The Mars Climate Orbiter mission underscores the importance of proactive software safety planning and coordination. SWE-134 requirements (e.g., ensuring hazard analyses are performed and validated, consistent engineering practices, error detection, and fault handling) need to be embedded early in the software lifecycle to avoid mission-critical errors and delays.

Broader Applications from This Lesson:

The MCO lessons emphasize the following additional actions to implement safety-critical software effectively and mitigate failure risks:

  • Cross-Disciplinary Coordination: Create a collaborative environment where software engineers, systems engineers, and hazard analysts work together to ensure all safety-critical software is properly identified, validated, and reviewed.
  • Integrated Safety and Mission Assurance: Leverage SMA input during software testing and V&V processes to identify risks related to data handling, unit conversions, and fault handling.
  • Automated Verification Tools: Use validation tools to check for critical issues like parameter mismatches, inconsistent interfaces, or unit errors during development, testing, and integration.

By addressing these lessons early in the planning and development phases, project managers can avoid the devastating consequences of inadequate software safety practices, as seen in the Mars Climate Orbiter mission.

6.2 Other Lessons Learned

  • Demonstration of Autonomous Rendezvous Technology (DART) spacecraft Type A Mishap 432: "NASA has completed its assessment of the DART MIB (Mishap Investigation Board) report, which included a classification review by the Department of Defense. The report was NASA-sensitive but unclassified because it contained information restricted by International Traffic in Arms Regulations (ITAR) and Export Administration Regulations (EAR). As a result, the DART mishap investigation report was deemed not releasable to the public." The LL also "provides an overview of publicly releasable findings and recommendations regarding the DART mishap."

  • The Goddard Space Flight Center (GSFC) Lessons Learned online repository 695 contains the following lessons learned related to software requirements identification, development, documentation, approval, and maintenance based on analysis of customer and other stakeholder requirements and the operational concepts. Select the titled link below to access the specific Lessons Learned:

    • Test all commands to GNC simulated hardware. Lesson Number 345: The recommendation states: "Once the Spacecraft Command and Telemetry database is established, test all defined GNC hardware commands against the spacecraft simulator."


7. Software Assurance

SWE-134 - Safety-Critical Software Design Requirements
3.7.3 If a project has safety-critical software or mission-critical software, the project manager shall implement the following items in the software: 

a. The software is initialized, at first start and restarts, to a known safe state.
b. The software safely transitions between all predefined known states.
c. Termination performed by software functions is performed to a known safe state.
d. Operator overrides of software functions require at least two independent actions by an operator.
e. Software rejects commands received out of sequence when execution of those commands out of sequence can cause a hazard.
f. The software detects inadvertent memory modification and recovers to a known safe state.
g. The software performs integrity checks on inputs and outputs to/from the software system.
h. The software performs prerequisite checks prior to the execution of safety-critical software commands.
i. No single software event or action is allowed to initiate an identified hazard.
j. The software responds to an off-nominal condition within the time needed to prevent a hazardous event.
k. The software provides error handling.
l. The software can place the system into a safe state.

7.1 Tasking for Software Assurance

Software safety requirements contained in NASA-STD-8739.8B

Derived from NPR 7150.2D para 3.7.3 SWE 134: Table 1, SA Tasks 1 - 6

1. Analyze the software requirements and the software design and work with the project to implement NPR 7150.2 requirement items "a" through "l."

2. Assess that the source code satisfies the conditions in the NPR 7150.2 requirement "a" through "l" for safety-critical and mission-critical software at each code inspection, test review, safety review, and project review milestone.

3. Confirm that the values of the safety-critical loaded data, uplinked data, rules, and scripts that affect hazardous system behavior have been tested.

4. Analyze the software design to ensure the following:
   a. Use of partitioning or isolation methods in the design and code,
   b. That the design logically isolates the safety-critical design elements and data from those that are non-safety-critical.

5. Participate in software reviews affecting safety-critical software products.

6. Ensure the SWE-134 implementation supports and is consistent with the system hazard analysis.

7.2 Software Assurance Products

The following software assurance products provide evidence of compliance with items "a" through "l" and support early integration, risk management, and consistent assessment throughout the software life cycle.

7.2.1 Software Assurance Status Reports:

  • Provide regular updates on the implementation status and compliance of safety-critical design elements related to items "a" through "l."
  • Include progress tracking, identified risks, and mitigation strategies to help maintain visibility and accountability throughout the life cycle.

7.2.2 Software Design Analysis:

  • Conduct thorough analysis of software requirements and design to confirm implementation of items "a" through "l." This includes checking for logical consistency, feasibility, and safety within the project's operational context.
  • Specifically, analyze the design to ensure compliance with:
    • Items a and b: Verify that initial states, safe states, and transitions between them are well-defined, tested, and achievable under all error conditions.
    • Logical isolation of safety-critical components to prevent interference from non-safety-critical systems.

7.2.3 Source Code Analysis:

  • Perform detailed code inspections to confirm that safety-critical requirements (items "a" through "l") are appropriately implemented:

    • Verify adherence to safety-related coding practices, including handling errors, input/output validation, memory protection, and state management.
    • Identify and document any risks or issues associated with the code during reviews.
  • Assess the rationale provided by developers if requirements are not fully met, documenting these gaps and associated mitigations.

7.2.4 Verification Activities Analysis:

  • Confirm that all verification tests (e.g., unit tests, integration tests) meet test standards for coverage, complexity, and thoroughness, including support files that affect hazardous systems.
  • Validate simulation data and test environments to ensure accurate replication of real-world conditions and off-nominal scenarios.

7.2.5 Evidence of Testing for Safety-Critical Elements:

  • Verify that safety-critical loaded data, uplinked data, scripted rules, and automated sequences affecting hazardous system behavior are thoroughly tested.
  • Confirm that test results demonstrate proper behavior under nominal and degraded conditions and document validation as part of deliverables.

7.2.6 Requirements Mapping and Authority Approvals:

  • Ensure NPR 7150.2 and NASA-STD-8739.8 requirements mapping matrices are signed by both engineering and SMA technical authorities for each development organization.
  • Document traceability from requirements items "a" through "l" to specific design, implementation, and testing artifacts.

7.3 Metrics

Establish clear and actionable metrics to monitor and evaluate the progress and quality of safety-critical software assurance activities. Aim for trends that reduce risks and improve compliance with requirements over time.

  1. Software Product Non-Conformances:

    • Track the number of non-conformances (issues and defects) identified by life cycle phase to monitor and proactively address recurring problems early in development.
  2. Review Non-Conformances:

    • Record metrics from reviews (open vs. closed issues) and track issue resolution times (days open).
    • Focus on reducing the duration that safety-related non-conformances remain unresolved.
  3. Safety-Related Requirement Issues:

    • Capture metrics on safety-related requirement issues (open vs. closed) to measure alignment with hazard mitigation strategies.
  4. Hazard Testing Completion:

    • Monitor the number of hazards containing software that have been tested versus the total number of hazards containing software to ensure comprehensive risk analysis.
  5. Source Code Testing Coverage:

    • Record the ratio of source lines of code (SLOC) tested versus total SLOC to ensure adequate verification of all software components impacting safety.

See also Topic 8.18 - SA Suggested Metrics.  

7.4 Guidance

7.4.1 General Approach:

  • The sub-requirements listed in items "a" through "l" constitute industry best practices for implementing safety-critical software. They apply to software components that control, mitigate, or contribute to hazards, as well as software that commands hazardous operations or activities.
  • Software engineering and software assurance teams must collaborate closely to deliver high-quality work products that comply with safety, reliability, and quality standards while meeting project objectives.

Steps for Implementation:

Step 1: Analysis of Software Requirements

  • Conduct an early and comprehensive evaluation of software requirements against items "a" through "l." Identify gaps or inconsistencies and collaboratively address them with the development team.

Step 2: Analysis of Software Design

  • Verify that the software design supports the implementation of items "a" through "l" and integrates safety features such as error handling, safe state transitions, and memory protection.
  • Ensure the design isolates and prioritizes safety-critical components logically and physically from non-safety-critical elements.

Step 3: Testing Safety-Critical Data, Rules, and Scripts

  • Confirm that safety-critical data (e.g., loaded values, uplinked updates) and automated rules/scripts are tested under all expected conditions, including off-nominal states.
  • Prioritize hazardous operations in testing plans to mitigate risks early.

Step 4: Partitioning and Isolation

  • Implement partitioning or isolation methods in design and code to:
    • Prevent interference between safety-critical and non-safety-critical components.
    • Ensure safety-critical elements maintain integrity under adverse conditions, such as memory faults or external perturbations.

Step 5: Participation in Software Reviews

  • Actively participate in all life cycle reviews (including code inspections, test reviews, safety reviews, and project milestones) to assess compliance with safety requirements.
  • Provide actionable feedback and independently assess risks associated with any deviations from items "a" through "l."

Step 6: Alignment with System Hazard Analysis

  • Ensure that the software's implementation aligns with and supports the system hazard analysis for managing safety risks.
  • Use data from hazard analysis to refine testing, design, and integration strategies.

7.4.2 Additional Guidance

Early Planning:

  • Incorporating safety features and requirements early in the development life cycle reduces project costs, schedule disruptions, and risks. Integrated design approaches improve reliability and simplicity.

Implementation Trade-Offs:

  • Projects may use different failure philosophies (e.g., fault tolerance, control-path separation) depending on system requirements and constraints. Ensure trade-offs are documented and assessed for their impact on safety, performance, and compliance.

Conflict Resolution:

  • If software-specific features or requirements conflict with program safety requirements, prioritize the program safety requirements to align with broader system-level safety objectives.

7.4.3 Tools and Techniques for Small Projects

  • Static Code Analysis Tools: Use tools to automate code inspections for compliance with items "a" through "l."
  • Simulation Frameworks: Test safety-critical operations and off-nominal conditions using simulation environments to replicate real-world scenarios.
  • Requirement Traceability Tools: Ensure each item in "a" through "l" is mapped to design and test stages using traceability matrices for transparency.

This guidance streamlines assurance practices while maintaining a robust focus on actionable measures to reduce risks and promote safety throughout the software life cycle.

 See also 8.04 - Additional Requirements Considerations for Use with Safety-Critical Software

7.5 Checklist for Safety-Critical or Mission-Critical Software

The checklist in PAT-035 - Checklist for Safety-Critical or Mission-Critical Software may be used to review the items in this requirement. Download the worksheet and make any modifications necessary for your project. Use the checklist as many times as necessary. 

PAT-035 - Checklist for Safety-Critical or Mission-Critical Software
Click on the image to preview the file. From the preview, click on Download to obtain a usable copy. 

7.6 Additional Guidance

Additional guidance related to this requirement may be found in the following materials in this Handbook:

8. Objective Evidence

Requirement 3.7.3 ensures that safety-critical or mission-critical software meets stringent safety, integrity, and fault-handling requirements. Below is comprehensive guidance on specific objective evidence that can be provided to demonstrate compliance with each of the listed sub-requirements.

Objective evidence is an unbiased, documented fact showing that an activity was confirmed or performed by the software assurance/safety person(s). The evidence for confirmation of the activity can take any number of different forms, depending on the activity in the task. Examples are:
  • Observations, findings, issues, risks found by the SA/safety person and may be expressed in an audit or checklist record, email, memo or entry into a tracking system (e.g. Risk Log).
  • Meeting minutes with attendance lists or SA meeting notes or assessments of the activities and recorded in the project repository.
  • Status report, email or memo containing statements that confirmation has been performed with date (a checklist of confirmations could be used to record when each confirmation has been done!).
  • Signatures on SA reviewed or witnessed products or activities, or
  • Status report, email or memo containing a short summary of information gained by performing the activity. Some examples of using a “short summary” as objective evidence of a confirmation are:
    • To confirm that: “IV&V Program Execution exists”, the summary might be: IV&V Plan is in draft state. It is expected to be complete by (some date).
    • To confirm that: “Traceability between software requirements and hazards with SW contributions exists”, the summary might be x% of the hazards with software contributions are traced to the requirements.
  • The specific products listed in the Introduction of 8.16 are also objective evidence as well as the examples listed above.

8.1 Objective Evidence Categories and Examples

a. The software is initialized, at first start and restarts, to a known safe state.

Must Include:
  • Design documentation explaining the implemented safe initialization process.
  • Test results demonstrating that software initializes or restarts into a verified predefined safe state under nominal and faulty conditions.
  • Evidence for recovery processes after power-on or after unexpected application restart.
Examples of Evidence:
  • Software Design Document (SDD): Description of initialization logic to establish a safe state (e.g., disable actuators, hold commands, etc.).
  • Test Records: Show results of cold and warm start conditions transitioning the system into a safe state. Example: "At first start, software places all propulsion actuators in a safe OFF state."
  • Code Review Reports: Documentation proving initialization routines were implemented and verified in the source code.
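An initialization routine of the kind this evidence would describe might look like the following C sketch. The output structure and field names are hypothetical, chosen only to illustrate driving all hazardous effectors to safe values on every start path:

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical output image for illustration only. */
typedef struct {
    bool actuator_enabled;
    int  valve_cmd;        /* 0 = closed */
    bool init_complete;
} outputs_t;

/* Called on every entry point: power-on, watchdog reset, and software
 * restart. All hazardous effectors are driven to their safe values
 * before any mode logic runs, so the system enters a known safe state
 * regardless of what caused the (re)start. */
void init_to_safe_state(outputs_t *out) {
    memset(out, 0, sizeof *out);   /* deterministic baseline */
    out->actuator_enabled = false; /* explicit: actuators OFF */
    out->valve_cmd = 0;            /* explicit: valves closed */
    out->init_complete = true;
}
```

Calling this routine from both the cold-start and warm-restart paths, and testing both, is what produces the cold/warm start test records listed above.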

b. The software safely transitions between all predefined known states.

Must Include:
  • State diagrams showing all predefined safe states and transitions.
  • Test artifacts demonstrating successful state transition scenarios, including edge cases and failure conditions.
  • Evidence of automated monitoring or checks during state transitions.
Examples of Evidence:
  • State Transition Diagrams: Graphical or tabular representation showing all controlled transitions (e.g., Start → Idle → Active → Safe Failure Mode).
  • Test Reports: Results confirming that state transitions succeed safely under nominal and fault conditions.
  • Scenario Simulations: Evidence from simulations showing transitions between operational, standby, and failure recovery states.

c. Termination performed by software functions is performed to a known safe state.

Must Include:
  • Requirements defining termination processes and behaviors to achieve a safe state.
  • Test evidence for software-initiated or operator-initiated termination leading to safe conditions.
Examples of Evidence:
  • Verification Test Cases: Test results showing all possible termination scenarios result in controlled transitions to safe states.
  • Algorithms Documentation: Description of termination routines and their relationship to safe state logic.

d. Operator overrides of software functions require at least two independent actions by an operator.

Must Include:
  • Operator input/override sequence logic and instructions that enforce the two-action rule.
  • Verification tests confirming successful dual-action overrides and rejection of single-action overrides.
Examples of Evidence:
  • Software Requirements Specification (SRS): Description of two independent actions required for overrides.
  • Test Results: Examples: "Single-action overrides for emergency shutdown rejected; two-step confirmation override successfully executed."
  • Procedures for Crew Actions: Operator guides explaining override actions (e.g., unique button combinations or input sequences).
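One common way to enforce the two-independent-action rule is an arm/execute latch, sketched below in C. The function names are illustrative, not from any NASA codebase:

```c
#include <stdbool.h>

/* Two independent operator actions: an explicit ARM followed by a
 * distinct EXECUTE. A single action can never trigger the override. */
typedef struct { bool armed; } override_ctx_t;

void override_arm(override_ctx_t *ctx)    { ctx->armed = true; }
void override_cancel(override_ctx_t *ctx) { ctx->armed = false; }

/* Returns true only if the override was previously armed; the arm
 * latch is consumed so each execution requires a fresh arm action. */
bool override_execute(override_ctx_t *ctx) {
    if (!ctx->armed)
        return false;      /* single-action attempt rejected */
    ctx->armed = false;    /* consume the arm */
    return true;
}
```

Verification evidence would then show single-action attempts rejected and the arm-then-execute sequence accepted, matching the test results described above.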

e. Software rejects commands received out of sequence when execution of those commands out of sequence can cause a hazard.

Must Include:
  • Command sequence control descriptions in software requirements and design documents.
  • Verification test cases demonstrating rejection of out-of-sequence commands.
Examples of Evidence:
  • Command Handling Test Reports: Results showing invalid or out-of-sequence commands are rejected.
  • Command Execution Logs: Evidence that hazard-inducing commands received out of sequence were ignored/rejected (e.g., "Command A rejected because Command B was not yet executed").
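Sequence enforcement can be as simple as tracking the next expected command, as in this hedged C sketch. The command names describe a hypothetical pressurize/open/ignite sequence, not an actual flight procedure:

```c
#include <stdbool.h>

/* Hypothetical hazardous sequence: PRESSURIZE must precede OPEN_VALVE,
 * which must precede IGNITE. Commands arriving out of order are
 * rejected without changing system state. */
typedef enum { CMD_PRESSURIZE = 0, CMD_OPEN_VALVE, CMD_IGNITE } cmd_t;

typedef struct { int next_expected; } seq_ctx_t;

bool seq_accept(seq_ctx_t *ctx, cmd_t cmd) {
    if ((int)cmd != ctx->next_expected)
        return false;          /* out of sequence: reject */
    ctx->next_expected++;      /* advance to the next allowed command */
    return true;
}
```

A rejection should also be logged, producing the command execution log evidence described above.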

f. The software detects inadvertent memory modification and recovers to a known safe state.

Must Include:
  • Description of memory integrity tests and fault recovery functionality in software design.
  • Memory corruption injection tests showing transition to a safe state upon detection.
Examples of Evidence:
  • Fault-Injection Test Results: Example: "Test of simulated memory corruption triggered recovery to a predefined safe baseline state."
  • Dynamic Memory Integrity Check Reports: Logs showing runtime detection of memory modification events.
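A software-level memory guard might seal critical data with a checksum and restore safe defaults on mismatch, as in this sketch. The additive hash is for illustration; flight code would more likely use a CRC or hardware EDAC:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdbool.h>

/* Simple polynomial checksum for illustration only. */
static uint32_t checksum(const void *p, size_t n) {
    const uint8_t *b = p;
    uint32_t sum = 0;
    while (n--) sum = sum * 31u + *b++;
    return sum;
}

typedef struct {
    int critical_params[4];
    uint32_t crc;              /* stored checksum over critical_params */
} guarded_t;

void guard_seal(guarded_t *g) {
    g->crc = checksum(g->critical_params, sizeof g->critical_params);
}

/* Periodic scrub: on mismatch, restore known-safe defaults and reseal.
 * Returns false when corruption was detected and recovered. */
bool guard_check_and_recover(guarded_t *g, const int safe_defaults[4]) {
    if (checksum(g->critical_params, sizeof g->critical_params) == g->crc)
        return true;                                    /* intact */
    memcpy(g->critical_params, safe_defaults, sizeof g->critical_params);
    guard_seal(g);
    return false;                                       /* corrupted */
}
```

Fault-injection tests that flip a byte in the guarded region and observe recovery are exactly the evidence described above.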

g. The software performs integrity checks on inputs and outputs to/from the software system.

Must Include:
  • Input/output validation logic descriptions.
  • Test records showing integrity checks detect and handle invalid data.
Examples of Evidence:
  • Test Results: Tests demonstrating input validation (e.g., incorrect data formats are flagged and ignored).
  • Boundary Analysis Reports: Evidence supporting edge-case validation.
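Input and output integrity checks often reduce to range and format validation at the software boundary, as in this sketch. The sensor range and command limits are assumed values for illustration:

```c
#include <stdbool.h>

/* Input side: reject readings outside the physically plausible range
 * (assumed -60..125 C for a hypothetical temperature sensor) before
 * they reach control logic. */
typedef struct { double value; bool valid; } reading_t;

reading_t validate_temp_c(double raw) {
    reading_t r = { raw, false };
    if (raw >= -60.0 && raw <= 125.0)
        r.valid = true;
    return r;
}

/* Output side: commands leaving the software are range-checked so a
 * corrupted or miscomputed value is caught before it reaches hardware. */
bool output_in_range(int cmd, int lo, int hi) {
    return cmd >= lo && cmd <= hi;
}
```

Boundary-value tests at and just beyond these limits provide the edge-case evidence listed above.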

h. The software performs prerequisite checks prior to the execution of safety-critical software commands.

Must Include:
  • Documentation of prerequisite check logic and implementation.
  • Test records showing validation of required conditions for critical commands.
Examples of Evidence:
  • Test Reports: Example: "Prerequisite check for ‘Engine Start’ rejected due to unsafe system state."
  • Code Inspection Reports: Proof that checks were implemented for required preconditions.
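A prerequisite check gates a hazardous command on a snapshot of required system conditions. The conditions below (tank pressure, crew clearance, interlock) are hypothetical examples:

```c
#include <stdbool.h>

/* Hypothetical system snapshot used by prerequisite checks. */
typedef struct {
    bool tank_pressurized;
    bool crew_clear;
    bool interlock_released;
} sys_state_t;

/* Every prerequisite must hold before the hazardous command runs; any
 * failure blocks execution and should be reported to the operator. */
bool engine_start_prereqs_ok(const sys_state_t *s) {
    return s->tank_pressurized
        && s->crew_clear
        && s->interlock_released;
}

bool cmd_engine_start(const sys_state_t *s, bool *started) {
    if (!engine_start_prereqs_ok(s)) {
        *started = false;
        return false;          /* command rejected, system unchanged */
    }
    *started = true;           /* all preconditions met */
    return true;
}
```

Tests that exercise each failing precondition individually produce the rejection evidence described above.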

i. No single software event or action is allowed to initiate an identified hazard.

Must Include:
  • Analysis of redundancy mechanisms ensuring no single action leads to hazard initiation.
  • Test results proving protection through multi-layer checks and redundancy.
Examples of Evidence:
  • Failure Mode and Effects Analysis (FMEA): Showing mitigations for single-event hazards.
  • Test Success Reports: Example: "Single failures (e.g., sensor reading error) did not trigger hazard."

j. The software responds to an off-nominal condition within the time needed to prevent a hazardous event.

Must Include:
  • Time-critical requirements for off-nominal conditions.
  • Testing results of event response latency compared to hazard prevention thresholds.
Examples of Evidence:
  • Latency and Response Test Results: Example: "Anomaly detected and response initiated in 500ms, within threshold of 1 second."
  • Real-Time Simulation Results: Showing software reacting appropriately in simulated off-nominal fault scenarios.
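A deadline monitor is one way to bound the response time: if an off-nominal condition is not cleared before the hazard window closes, the software forces the safe state autonomously. The window value below is illustrative; the real margin must come from the hazard analysis, not the code:

```c
#include <stdbool.h>
#include <stdint.h>

/* deadline_ms = detection time + hazard-prevention window,
 * both in milliseconds on a monotonic clock. */
typedef struct { uint32_t deadline_ms; bool safed; } monitor_t;

/* Called periodically after an off-nominal condition is detected: if
 * the condition is not cleared before the window closes, transition to
 * the safe state rather than continue waiting. */
void monitor_tick(monitor_t *m, uint32_t now_ms, bool condition_cleared) {
    if (condition_cleared || m->safed)
        return;
    if (now_ms >= m->deadline_ms)
        m->safed = true;   /* autonomous transition to safe state */
}
```

Latency test logs would then compare the measured detect-to-safe time against the analyzed hazard window, as in the evidence examples above.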

k. The software provides error handling.

Must Include:
  • Error-handling routines documented in design artifacts.
  • Test results demonstrating the proper handling of runtime errors.
Examples of Evidence:
  • Error Handling Logs: Evidence showing consistent software behavior during runtime errors.
  • Validation Documents: Test cases proving errors are addressed appropriately without crashing the system.
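Consistent error handling usually means every status code flows through one handler that counts, logs, and maps each error class to a defined action instead of silently ignoring it. The status codes and severity policy below are assumptions for illustration:

```c
#include <stdbool.h>

typedef enum { OK = 0, ERR_RANGE, ERR_TIMEOUT, ERR_CHECKSUM } status_t;

/* Centralized handler state: counts every error and latches safe mode
 * for severe error classes (policy is hypothetical). */
typedef struct { int error_count; bool safe_mode; } err_ctx_t;

void handle_status(err_ctx_t *ctx, status_t s) {
    if (s == OK)
        return;
    ctx->error_count++;                 /* never drop an error silently */
    if (s == ERR_CHECKSUM || s == ERR_TIMEOUT)
        ctx->safe_mode = true;          /* severe classes force safe mode */
}
```

Error-handling logs from such a handler, showing consistent behavior across runtime errors, are the evidence described above.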

l. The software can place the system into a safe state.

Must Include:
  • Description of safe-state entry logic linked to system controls.
  • Evidence showing successful execution of safe-state commands.
Examples of Evidence:
  • Test Reports: Example: "Tested emergency abort functionality transitioned spacecraft to contingency mode safely during engine failure."
  • Safe-State Recovery Simulations: Simulated failure scenarios validating transitions.

Summary of Objective Evidence 

Sub-Requirement | Examples of Evidence
a | SDD for initialization logic, test results for cold/warm start, review logs of the safe state mechanism.
b | State transition diagrams, test case reports, state machine simulation evidence.
c | Termination test logs, verified contingency mode design descriptions.
d | SRS for override logic, dual-action override test results.
e | Command sequence test reports, command rejection logs.
f | Fault injection results for memory corruption, memory integrity check reports.
g | Input/output validation test reports, boundary value analysis.
h | Prerequisite check test cases, code review documentation.
i | FMEA showing redundancy, validation demonstrating no single-point failure leads to hazards.
j | Real-time response latency test logs, simulated hazard prevention data.
k | Error-handling reports, validation logs for runtime scenarios.
l | Contingency mode test results, verification reports for automated safe-state transitions.

This comprehensive set of evidence demonstrates that safety-critical and mission-critical software is compliant with Requirement 3.7.3 and is resilient, fault-tolerant, and meets NASA’s safety assurance standards.