9.07 Fault Detection and Response

1. Principle

Control systems incorporate fault detection and response mechanisms in order to respond properly to off-nominal conditions.

1.1 Rationale

Without the capability to safely recover from credible faults, the system could lose data, harm an instrument or, in the worst case, cause loss of life or end of mission. Although some low-criticality systems incorporate a single generic response to any detected error, this is generally not sufficient for control systems.

2. Examples and Discussion

Identifying potential causes of anomalies typically requires specialized knowledge of the system components, material properties, wear mechanics, environmental effects, and other factors. Understanding how each potential local anomaly will affect the rest of the system, and developing appropriate and effective responses, is typically the domain of systems engineering. Thus, a complete specification of fault detection and responses for a system is usually the result of a collaboration among lower-level domain experts, designers, and systems engineers. Although the software team does not develop the overall fault protection system, it is usually charged with implementing the higher-level behaviors, and it plays an important role in shaping the approach based on what has proven effective in the past. Some of the design features and considerations that have proven useful on NASA missions are discussed in the paragraphs that follow.

Because fault detection and correction approaches change with mission phase and system state, the capability to enable and disable fault scenarios, as well as to update them completely, is an important feature of the software architecture. Fault management architectures usually include some type of escalating response to persistent failures. For example, the software might issue a notification upon the first failure, take a precautionary action if the failure continues, and shut down equipment if the failure persists beyond that.

Fault definitions and responses may be implemented in software as ground-modifiable data tables. Typically, the operator needs control over the specific fault responses and the ability to modify them. Capabilities can include enabling and disabling fault detections and/or responses and modifying a table of potential faults and planned responses.

Detection methods that have been used include checking the entire input data set against expected ranges. Values are typically checked against the range expected for the current state of the system. Other fault detection mechanisms have been used as well.

One widely used failure response implementation reports the fault and maintains a count of the number of times the specific failure has occurred. When a predefined, but modifiable, maximum number of occurrences is exceeded, a more severe response is initiated. More severe responses that have been used include resetting components, shutting down faulty components, switching to backup systems, or a complete system restart.

A software architecture choice that must be considered is whether to design a centralized error checking system, a distributed system, or a combination of the two. As examples, a centralized implementation might limit-check all system measurements in one compact set of looping code, whereas a distributed implementation would limit-check individual (or small sets of) measurements as they are read.
Previous experience has shown that a centralized system simplifies maintenance of the error checking code and produces reusable code. However, some environments would require multiple data moves across software interfaces to gather measurement data for limit checking (e.g., a partitioned operating system with measurement data spread across several partitions), impacting the performance of the system. In extreme cases, a measurement may need to be limit-checked as soon as possible after it is read in order to identify faults in time. The software architects need to make this decision based on careful consideration of the requirements and system constraints.

Building in capabilities to continue operating in the presence of (possibly permanent) faults is an important aspect of software system design. While the fault detection and response design specifies how the system will respond to faults as they are discovered, long-term solutions may be required that extend beyond ensuring the immediate safety of the system. For example, the permanent failure of an inertial measurement unit (IMU) on a robotic spacecraft might be compensated for by implementing a portion of its capabilities in software using a star tracker. In this example, the immediate response to the IMU failure would probably involve using a sun sensor to ensure there was enough sun on the solar arrays to keep the spacecraft alive, while the alternate functionality would be the long-term solution.
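The following is a minimal, illustrative sketch of the table-driven approach described above: ground-modifiable limit checks with persistence counts and an escalating response, evaluated in one centralized loop. All names, thresholds, and the escalation scheme are hypothetical and are not taken from any particular mission.

```c
#include <stdio.h>
#include <stdbool.h>

/* Escalating responses, from least to most severe (illustrative). */
typedef enum { RESP_REPORT, RESP_RESET_COMPONENT, RESP_SWITCH_TO_BACKUP, RESP_SAFE_MODE } response_t;

/* One ground-modifiable fault table entry: a limit check plus persistence handling. */
typedef struct {
    const char *name;        /* measurement being monitored                   */
    bool        enabled;     /* detection can be enabled/disabled from ground */
    double      low, high;   /* expected range for the current system state   */
    int         count;       /* consecutive out-of-range samples seen         */
    int         max_count;   /* modifiable persistence limit                  */
    response_t  escalation;  /* response taken once max_count is exceeded     */
} fault_entry_t;

/* Centralized limit check: one loop over the whole measurement set. */
static void check_faults(fault_entry_t *table, const double *meas, int n)
{
    for (int i = 0; i < n; i++) {
        fault_entry_t *f = &table[i];
        if (!f->enabled)
            continue;
        if (meas[i] < f->low || meas[i] > f->high) {
            f->count++;
            printf("FAULT: %s out of range (%.2f), count %d\n", f->name, meas[i], f->count);
            if (f->count > f->max_count)
                printf("  escalating: response %d for %s\n", f->escalation, f->name);
        } else {
            f->count = 0;   /* a healthy sample clears the persistence count */
        }
    }
}

int main(void)
{
    fault_entry_t table[] = {
        { "battery_temp_C",  true, -10.0, 40.0,   0, 3, RESP_SAFE_MODE },
        { "wheel_speed_rpm", true,   0.0, 6000.0, 0, 5, RESP_SWITCH_TO_BACKUP },
    };
    double sample[] = { 45.2, 3200.0 };      /* battery temperature is out of range   */
    for (int cycle = 0; cycle < 4; cycle++)  /* persistent fault escalates on cycle 4 */
        check_faults(table, sample, 2);
    return 0;
}
```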
Table 1. Sample Fault Types

Overheating: Certain thermal conditions could feasibly occur that could be corrected autonomously by turning off a subsystem or part of a subsystem.

Pointing violation: Certain instruments may be harmed if pointed erroneously. This should be detected by the fault management system, and corrective commands should be sent to the attitude control system (ACS).

Memory failures: Certain memory areas could be failing, and work-arounds could be put in place to ensure that those memory locations are not used.

Power loss: A loss of power can be detected, and the system can autonomously be put into safehold by the software.

Table 2. Example life cycle phased implementation of fault detection and response

Phase A
  Activities: 1. Document a flight software (FSW) Fault Detection and Correction Plan.
  Verification: Verify at Mission Definition Review (MDR).

Phase B
  Activities: 1. An analysis of the entire system must be performed to identify all potential faults and the proper response to each fault. 2. With the systems engineering team and instrument managers, agree on the FSW Fault Detection and Correction Plan.
  Verification: 1. Verify at FSW Software Requirements Review (SRR) and FSW Preliminary Design Review (PDR). 2. Verify at System Design Review (SDR) and PDR.

Phase C
  Activities: 1. Implement the FSW Fault Detection and Correction Plan. 2. Test all defined scenarios.
  Verification: 1. Verify at Critical Design Review (CDR). 2. Verify at FSW CDR.

Phase D
  Activities: 1. Update and analyze the documentation. 2. Refine the scenarios as necessary.
  Verification: 1. Verify at FSW Acceptance Test Review. 2. Verify at Pre-Ship Review (PSR) and Flight Readiness Review (FRR).

3. Inputs
3.1 ARC
Note: A safe state is a state in which the spacecraft thermal condition and inertial orientation are stable, the spacecraft is commandable and transmitting a downlink signal, and no immediate commanding is required to ensure spacecraft health and safety or to preserve vital spacecraft resources. The safe state shall be power-positive.
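As an illustration only (the condition names below are hypothetical, not drawn from this document), a safe-state assessment can be expressed as the conjunction of independently monitored conditions in the note above:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical snapshot of the conditions named in the safe-state definition. */
typedef struct {
    bool thermal_stable;      /* spacecraft thermal condition is stable       */
    bool attitude_stable;     /* inertial orientation is stable               */
    bool commandable;         /* command path is available                    */
    bool downlink_active;     /* a downlink signal is being transmitted       */
    bool power_positive;      /* generation exceeds load (power-positive)     */
} safe_state_status_t;

/* The vehicle is in a safe state only when every condition holds. */
static bool is_safe_state(const safe_state_status_t *s)
{
    return s->thermal_stable && s->attitude_stable && s->commandable &&
           s->downlink_active && s->power_positive;
}

int main(void)
{
    safe_state_status_t status = { true, true, true, true, false }; /* power-negative */
    printf("safe state: %s\n", is_safe_state(&status) ? "yes" : "no");
    return 0;
}
```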
3.2 GSFC

3.3 JPL
a. Anomaly - any unexpected occurrence.
Note: Applied to indicate any of faults, failures, and their observable symptoms.
b. Error - an observable symptom of a fault.
Note: Not all faults give rise to observable symptoms.
c. Failure - the inability of a system or component to perform its required function.
d. Fault - a physical defect or occurrence that causes the loss of required functionality.
Note: Fault protection is not intended as a remedy for faults that result from design error and/or designs that are inadequate for the specified environment.
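To make the distinction among these terms concrete, the following small sketch (field names are hypothetical, not part of the JPL definitions) models an anomaly record in which a fault may or may not produce an observable error and may or may not lead to a failure:

```c
#include <stdbool.h>
#include <stdio.h>

/* Illustrative taxonomy based on the definitions above; field names are hypothetical. */
typedef struct {
    bool error_observed;   /* a fault does not always produce an observable symptom (error) */
    bool function_lost;    /* only the loss of a required function constitutes a failure    */
} anomaly_record_t;

/* A failure is declared only when required functionality is actually lost,
   regardless of whether an observable error was seen. */
static bool is_failure(const anomaly_record_t *a)
{
    return a->function_lost;
}

int main(void)
{
    anomaly_record_t latent_fault = { false, false };  /* fault with no symptom yet */
    printf("latent fault is a failure: %s\n", is_failure(&latent_fault) ? "yes" : "no");
    return 0;
}
```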
Rationale: Designs should not be required to handle unrealistic scenarios, yet based on our operational experience, we know they must handle 'unknown-unknowns'. Fault protection scope must include function preservation (loss of functionality) as well as identified fault scenarios. This notion of a 'safety-net' would include (but not be limited to) uplink command loss, attitude control loss, attitude knowledge loss, ephemeris errors, excess system momentum, system power/energy deficiency, and system over/under temperature.
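As an illustration of such a safety net, the sketch below registers a small set of function-preserving monitors alongside the fault-specific ones. The monitor names mirror the list above; the responses and data structure are assumptions for illustration only.

```c
#include <stdio.h>

/* Hypothetical safety-net monitors covering loss of function rather than
   specific identified faults, mirroring the examples listed above. */
typedef struct {
    const char *monitor;     /* condition watched                 */
    const char *response;    /* function-preserving response      */
} safety_net_entry_t;

static const safety_net_entry_t safety_net[] = {
    { "uplink command loss",       "swap receivers, run command-loss response"  },
    { "attitude control loss",     "enter safe mode on backup control path"     },
    { "attitude knowledge loss",   "re-initialize attitude from sun sensor"     },
    { "ephemeris error",           "fall back to onboard propagated ephemeris"  },
    { "excess system momentum",    "autonomous momentum unload"                 },
    { "power/energy deficiency",   "shed non-essential loads, sun-point arrays" },
    { "over/under temperature",    "switch to survival heater configuration"    },
};

int main(void)
{
    for (unsigned i = 0; i < sizeof safety_net / sizeof safety_net[0]; i++)
        printf("%-26s -> %s\n", safety_net[i].monitor, safety_net[i].response);
    return 0;
}
```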
Rationale: Even after prior faults have been dealt with, it is important to preserve the remaining options for mission success. The likelihood of faults in functionally independent system elements is undiminished. Coincident faults, however, are generally of sufficiently low likelihood to justify making no overt provisions for them in the design.
Note: A newly discovered latent fault (e.g., one exposed by a recovery action for a recent fault) should not be considered a concurrent fault. Only the error detection is concurrent in this case. Operational mitigations to reveal latent faults in a timely manner may obviate the need to deal with such concurrent errors.
Note: Fault containment regions represent the smallest level of concern to fault protection. See 4.12.1.6.
Rationale: Redundancy schemes can be seriously compromised in the absence of fault containment. Also, failure to contain faults can complicate recovery by confusing diagnosis and requiring more complex response actions.
Rationale: False alarms are inevitable, so the system should not be vulnerable to them.
Example: An incorrectly set monitor threshold may cause an unexpected response; this should not severely degrade the mission or create an operational hardship.
Example: Fault protection objectives or behavior may vary depending on mission phase. This variation should be based on system modes or circumstances, rather than changed via a separate set of commands that alter fault protection enables/disables, thresholds, etc. System-wide behavioral changes should be made in a more atomic, integrated fashion, making the overall design less susceptible to operator error or timing vulnerabilities.
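For instance, tying the fault protection configuration to the system mode lets the whole behavior change atomically with a single mode transition instead of many individual enable/disable commands. The sketch below is illustrative only; the mode names, monitors, and threshold values are assumptions.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { MODE_LAUNCH, MODE_CRUISE, MODE_ORBIT_INSERTION, NUM_MODES } system_mode_t;

/* Per-mode fault protection configuration: selected as a unit when the mode
   changes, rather than by separate enable/disable commands. */
typedef struct {
    bool   thruster_monitors_enabled;
    bool   star_tracker_monitors_enabled;
    double battery_low_voltage_threshold;   /* volts, hypothetical values */
} fp_config_t;

static const fp_config_t fp_config_by_mode[NUM_MODES] = {
    [MODE_LAUNCH]          = { false, false, 24.0 },
    [MODE_CRUISE]          = { true,  true,  26.0 },
    [MODE_ORBIT_INSERTION] = { true,  true,  25.0 },
};

/* The active configuration follows the mode atomically. */
static const fp_config_t *active_fp_config(system_mode_t mode)
{
    return &fp_config_by_mode[mode];
}

int main(void)
{
    system_mode_t mode = MODE_ORBIT_INSERTION;
    printf("battery threshold in mode %d: %.1f V\n",
           mode, active_fp_config(mode)->battery_low_voltage_threshold);
    return 0;
}
```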
Rationale: If fault protection response actions must choose between completing a time-critical event and protecting spacecraft health and/or safety, it is better to attempt to complete the event (accepting the risk to health and/or safety) in order to save the mission than to not complete the event and lose the mission.
Note: Mission-critical events are those that, if not executed properly and in a timely manner, could result in failure to achieve mission success (e.g., orbit insertion; entry, descent, and landing (EDL)). A trajectory correction maneuver (TCM) is not mission-critical unless it must execute properly in the time scheduled for it, i.e., it cannot be delayed.
Note: The safing mode may be a single state or more than one state. The RF downlink signal need not be continuous, but must be predictable in its timing.
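A minimal sketch of the critical-event rationale above (the names and decision structure are assumptions for illustration): during a declared mission-critical event, responses that would abandon the event are deferred in favor of pressing on when redundancy allows.

```c
#include <stdbool.h>
#include <stdio.h>

typedef enum { RESP_PRESS_ON_REDUNDANT, RESP_ENTER_SAFE_MODE } response_t;

/* During a mission-critical event (e.g., orbit insertion), prefer a response
   that keeps the event going over one that abandons it, even at some risk. */
static response_t select_response(bool critical_event_active, bool redundancy_available)
{
    if (critical_event_active && redundancy_available)
        return RESP_PRESS_ON_REDUNDANT;   /* risk health/safety to save the mission */
    return RESP_ENTER_SAFE_MODE;          /* otherwise protect the vehicle and wait */
}

int main(void)
{
    printf("during insertion: %s\n",
           select_response(true, true) == RESP_PRESS_ON_REDUNDANT ? "press on" : "safe mode");
    printf("during cruise:    %s\n",
           select_response(false, true) == RESP_PRESS_ON_REDUNDANT ? "press on" : "safe mode");
    return 0;
}
```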
Rationale: The spacecraft must autonomously recover from a detected fault when the function(s) affected by the fault threaten spacecraft or instrument survival (e.g., functions necessary to maintain Safe mode). Ensure spacecraft survivability and viability by preserving vital spacecraft resources (e.g., thermal, power), while enabling ground interaction (e.g., command and downlink) for recovery operations. It is not enough merely to diagnose and isolate faults, or to restore lost functionality, if the resulting system state still threatens the rest of the mission (e.g., through stress, loss of consumables, or unresponsiveness to operator control).
Note: A missed tracking pass should not be reason to declare a spacecraft emergency, which would require rescheduling of tracking resources.
Note: 14 days is a typical duration based on the interval between ground contacts, but can be project and mission phase dependent.
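One common mechanism tied to such a ground-contact interval is a command-loss timer. The sketch below is a simplified, assumed illustration (not a design prescribed by this note): if no valid command is received within the configured window, an autonomous recovery response is triggered.

```c
#include <stdio.h>
#include <stdbool.h>

#define SECONDS_PER_DAY 86400L

/* Simplified command-loss timer: project dependent; 14 days used here as in
   the note above. The timeout would typically be ground-modifiable. */
typedef struct {
    long timeout_s;        /* command-loss window                   */
    long last_command_s;   /* mission time of last valid command    */
} cmd_loss_timer_t;

static void note_command(cmd_loss_timer_t *t, long now_s) { t->last_command_s = now_s; }

static bool command_loss_expired(const cmd_loss_timer_t *t, long now_s)
{
    return (now_s - t->last_command_s) > t->timeout_s;
}

int main(void)
{
    cmd_loss_timer_t t = { 14 * SECONDS_PER_DAY, 0 };
    note_command(&t, 0);
    long now = 15 * SECONDS_PER_DAY;       /* 15 days with no commands */
    if (command_loss_expired(&t, now))
        printf("command-loss response: swap uplink hardware, enter safe mode\n");
    return 0;
}
```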
Rationale: Transition to safing may be due to an operational mistake, and the system should still be single fault tolerant while awaiting ground recovery.
Note: Autonomous completion implies restoring the functionality needed to complete the mission-critical event. See 4.9.1.2 and 4.9.1.3.
Rationale: For certain mission-critical events, a ground response may not be possible, and the autonomous fault protection design must ensure completion of the event in the presence of a single fault.

3.4 MSFC
Rationale: The determination of caution and warning is performed by processing logic that uses a defined subset of health and status measurements. This includes the logic for the determination of abort conditions. Sensor latency is not included because it is application dependent and can vary widely.
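As a purely illustrative sketch (the measurement names, limits, and logic are assumptions), caution and warning determination reduces to boolean logic evaluated over a defined subset of health and status measurements:

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical subset of health and status measurements used for C&W. */
typedef struct {
    double cabin_pressure_kpa;
    double tank_pressure_kpa;
    bool   engine_controller_ok;
} health_status_t;

typedef enum { CW_NONE, CW_CAUTION, CW_WARNING, CW_ABORT } cw_level_t;

/* Caution/warning/abort levels derived from simple logic over the measurements. */
static cw_level_t evaluate_cw(const health_status_t *h)
{
    if (!h->engine_controller_ok && h->tank_pressure_kpa > 5000.0)
        return CW_ABORT;                        /* abort condition logic            */
    if (h->cabin_pressure_kpa < 95.0)
        return CW_WARNING;                      /* crew-safety warning              */
    if (h->tank_pressure_kpa > 4500.0)
        return CW_CAUTION;                      /* out-of-family but not yet unsafe */
    return CW_NONE;
}

int main(void)
{
    health_status_t h = { 101.3, 4700.0, true };
    printf("C&W level: %d\n", evaluate_cw(&h));
    return 0;
}
```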
Rationale: Potential faults and the action taken must be defined and determined so that actions taken upon error detection do not set off a chain reaction leading to more serious fault conditions, e.g., issuance of questionable commands to actuators as a result of a fault condition that exacerbates the problem.
Rationale: All error conditions must be logged in a manner that makes the ground and crew aware of vehicle performance, whether in real time or during post-flight analysis.
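A minimal illustration (field names and sizes are assumptions) of an error log entry that supports both real-time downlink and post-flight reconstruction:

```c
#include <stdio.h>
#include <stdint.h>

/* Hypothetical error log entry: enough context for real-time awareness on the
   ground and for reconstruction during post-flight analysis. */
typedef struct {
    uint64_t mission_time_ms;  /* when the error was detected       */
    uint16_t source_id;        /* subsystem or component identifier */
    uint16_t error_code;       /* specific error condition          */
    uint8_t  severity;         /* e.g., 0 = info ... 3 = critical   */
    uint8_t  response_taken;   /* response identifier, if any       */
} error_log_entry_t;

static void log_error(const error_log_entry_t *e)
{
    /* In flight this would go to a downlinked telemetry buffer and to
       nonvolatile storage; here it is simply printed. */
    printf("t=%llu src=%u code=%u sev=%u resp=%u\n",
           (unsigned long long)e->mission_time_ms, (unsigned)e->source_id,
           (unsigned)e->error_code, (unsigned)e->severity, (unsigned)e->response_taken);
}

int main(void)
{
    error_log_entry_t e = { 123456789ULL, 12, 407, 2, 3 };
    log_error(&e);
    return 0;
}
```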
Rationale: This requirement ensures that the operators participate in the decision to activate sequences involving hazardous commands. Operator inhibits or enables can be considered one of the functionally independent parameters.
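For illustration only (the parameter names are assumptions), a hazardous command sequence can be gated on functionally independent parameters, one of which is the operator enable:

```c
#include <stdbool.h>
#include <stdio.h>

/* Functionally independent parameters gating a hazardous command sequence.
   The operator enable is one of them, so the sequence cannot fire on a
   single onboard condition alone. */
typedef struct {
    bool operator_enable;      /* set only by ground/crew command          */
    bool onboard_precondition; /* e.g., correct mode and valid sensor set  */
} hazard_gate_t;

static bool hazardous_sequence_permitted(const hazard_gate_t *g)
{
    return g->operator_enable && g->onboard_precondition;
}

int main(void)
{
    hazard_gate_t gate = { false, true };   /* operator has not enabled it */
    printf("hazardous sequence permitted: %s\n",
           hazardous_sequence_permitted(&gate) ? "yes" : "no");
    return 0;
}
```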
4. Resources

4.1 References

None
5. Lessons Learned
5.1 NASA Lessons Learned

Lessons that appear in the NASA LLIS 439 or Center Lessons Learned Databases:
Contact was lost with the Mars Global Surveyor (MGS) spacecraft in November 2006 during its 4th extended mission. A routine memory load command sent to an incorrect address 5 months earlier corrupted positioning parameters, and their subsequent activation placed MGS in an attitude that fatally overheated a battery and depleted spacecraft power. The report by the independent MGS Operations Review Board listed 10 key recommendations to strengthen operational procedures and processes, correct spacecraft design weaknesses, and assure that economies implemented late in the course of long-lived missions do not impose excessive risks.