9.01 Software Design Principles

Incorrect use of memory can lead to catastrophic failure. Memory access errors generically fall into two types: inappropriate reads and writes.

Spacecraft have several types of memory, with different characteristics, residing in different devices. All spacecraft will have memory on the processor board, and a form of non-volatile memory for boot code. There are often other devices that provide storage for extra copies of the flight software, system data, telemetry, and science data. While there are many types of memory technology in use, they can roughly be grouped into three categories:

Which technology is chosen depends on the persistence requirements of the data that needs to be stored.

Processor memory is typically organized with areas reserved for code storage, compiled-in data storage, stack space, and dynamic (heap) storage. Depending on processor architecture, these areas may be known by different terms, and be organized in physical memory in different ways, but regardless of architecture, each area in memory has a specific purpose, and the boundaries between each must be strictly observed.

The specific techniques employed to prevent inappropriate memory use vary according to the underlying technology, frequency of access, criticality of the contents, and other application-specific considerations. The choice of techniques should balance the need for reliability against the cost, complexity, and risk of proposed solutions. For example, data replication with a simple checksum may be sufficient to guarantee an acceptable level of data integrity. In other situations, data replication might need to be combined with error detection and correction (either hardware or software-based), and possibly write protection mechanisms to achieve the desired reliability.
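The replication-with-checksum approach mentioned above can be sketched as follows. This is a minimal illustration, not a flight implementation: the record layout, the replication factor of three, the additive checksum, and all function names (`param_store`, `param_load`, `record_checksum`) are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_COPIES 3  /* assumed replication factor */

/* Hypothetical parameter record; the fields are illustrative only. */
typedef struct {
    int32_t  gain;
    int32_t  offset;
    uint32_t checksum;  /* covers every byte before this field */
} param_record_t;

/* Simple additive checksum over the payload bytes. The non-zero seed
 * means an all-zero (erased) record does not validate. */
static uint32_t record_checksum(const param_record_t *r)
{
    const uint8_t *p = (const uint8_t *)r;
    uint32_t sum = 0xA5A5A5A5u;
    for (size_t i = 0; i < offsetof(param_record_t, checksum); i++) {
        sum += p[i];
    }
    return sum;
}

/* Write the same checksummed record into every copy. */
void param_store(param_record_t copies[NUM_COPIES], int32_t gain, int32_t offset)
{
    param_record_t r;
    memset(&r, 0, sizeof r);
    r.gain = gain;
    r.offset = offset;
    r.checksum = record_checksum(&r);
    for (int i = 0; i < NUM_COPIES; i++) {
        copies[i] = r;
    }
}

/* Return the index of the first copy whose checksum validates, or -1 if
 * none does. Copies that fail validation are repaired from the good one. */
int param_load(param_record_t copies[NUM_COPIES], param_record_t *out)
{
    int good = -1;
    for (int i = 0; i < NUM_COPIES; i++) {
        if (copies[i].checksum == record_checksum(&copies[i])) {
            good = i;
            break;
        }
    }
    if (good < 0) {
        return -1;
    }
    *out = copies[good];
    for (int i = 0; i < NUM_COPIES; i++) {
        if (copies[i].checksum != record_checksum(&copies[i])) {
            copies[i] = *out;
        }
    }
    return good;
}
```

A caller would treat a return value of -1 as an unrecoverable fault and a non-zero index as a corrected fault, either of which would normally be reported in telemetry. An additive checksum is weak against some multi-bit errors; a CRC or hardware EDAC may be more appropriate where the criticality of the data warrants it.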
Additional techniques that might be considered, either alone or in conjunction with other techniques, include:

Memory contents in both code and data segments may become corrupted due to the effects of the space environment. This is a special case of incorrect write access. The techniques employed to detect and correct memory corruption include many of the techniques described above as well as systematic processes such as:

Whatever techniques are used, when safeguard measures are invoked to prevent an undesirable action, or correct a detected fault, reporting the event in telemetry alerts ground operators to the situation and allows them to take appropriate action.

See also the 9.08 Flight Software Modification and 9.16 Thread Safety design principles for related discussion.

None

The NASA Lesson Learned 439 database contains the following lessons learned related to incorrect memory use or access:

In addition, a report prepared by the Near Earth Asteroid Rendezvous (NEAR) Anomaly Review Board contains information related to incorrect memory use or access. The report is entitled "The NEAR Rendezvous Burn Anomaly of December 1998" 678
1. Principle
1.1 Rationale
2. Examples and Discussion
3. Inputs
3.1 ARC
a. Execution in data areas, unused areas, and areas not intended for execution.
b. Unintended over-writing of code areas.
c. The updating of code/software be limited to a single target memory device under user ground control and monitoring at a time. If dual memory units are incorporated in the design, under no circumstances are the prime and redundant memories to be modified concurrently, or before the operational performance of the change is properly assured in a single unit.
a. Flight software shall be designed to verify uplinked commands, data, or loads.
b. Flight software shall ignore and log incorrectly formatted commands, data, or loads and provide notification that they were incorrectly formatted.
c. Flight software shall be designed to send acknowledgement of command receipt to the source with indication of acceptance or rejection of command. For rejected commands, the acknowledgement message shall include a reason for rejection in the transmitted message.
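Requirements a through c can be illustrated with a small command verifier that checks format and checksum, counts rejections for logging, and returns an acknowledgement carrying the reason for rejection. This is a sketch under stated assumptions: the packet layout, the XOR checksum, the opcode range, and every identifier (`cmd_verify`, `cmd_ack_t`, `CMD_REJ_*`) are hypothetical and are not taken from any NASA Center's standard.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical acceptance/rejection codes returned in the acknowledgement. */
typedef enum {
    CMD_ACCEPTED = 0,
    CMD_REJ_TOO_SHORT,
    CMD_REJ_BAD_LENGTH,
    CMD_REJ_BAD_CHECKSUM,
    CMD_REJ_UNKNOWN_OPCODE
} cmd_status_t;

/* Acknowledgement sent back to the command source. */
typedef struct {
    uint8_t      opcode;  /* echoed opcode, 0 if the packet was too short */
    cmd_status_t status;  /* acceptance, or the reason for rejection */
} cmd_ack_t;

#define CMD_HEADER_LEN 2u     /* assumed header: opcode + payload length */
#define CMD_MAX_OPCODE 0x0Fu  /* assumed highest valid opcode */

static uint32_t rejected_count;  /* would be reported in telemetry */

static uint8_t xor_checksum(const uint8_t *buf, size_t n)
{
    uint8_t x = 0u;
    for (size_t i = 0; i < n; i++) {
        x ^= buf[i];
    }
    return x;
}

/* Assumed packet layout: [opcode][payload_len][payload...][checksum].
 * Malformed packets are never executed; they are counted and rejected. */
cmd_ack_t cmd_verify(const uint8_t *pkt, size_t len)
{
    cmd_ack_t ack = { 0u, CMD_ACCEPTED };

    if (pkt == NULL || len < CMD_HEADER_LEN + 1u) {
        ack.status = CMD_REJ_TOO_SHORT;
    } else {
        ack.opcode = pkt[0];
        if ((size_t)pkt[1] + CMD_HEADER_LEN + 1u != len) {
            ack.status = CMD_REJ_BAD_LENGTH;
        } else if (xor_checksum(pkt, len - 1u) != pkt[len - 1u]) {
            ack.status = CMD_REJ_BAD_CHECKSUM;
        } else if (pkt[0] > CMD_MAX_OPCODE) {
            ack.status = CMD_REJ_UNKNOWN_OPCODE;
        }
    }
    if (ack.status != CMD_ACCEPTED) {
        rejected_count++;
    }
    return ack;
}

uint32_t cmd_rejected_count(void)
{
    return rejected_count;
}
```

A memory load command would typically add further checks in the same style, for example confirming that the destination address range lies entirely within the region the load is permitted to modify.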
Note: For example, flight computer designs have included Error Detection And Correction (EDAC) logic on EEPROMs, and the load process has been designed to detect and respond to failure if the EDAC detects an uncorrectable bit error. Software designs have included checksum logic and periodic verification of memory to detect command, data, or load, and memory faults. For example, a command handler should check whether a received command is appropriate for the current system mode, and a software module should check whether a command is appropriate for its local state.
3.2 GSFC
3.3 JPL
a. Flight software shall be designed to detect and respond safely to corrupted commands, data, or loads, and memory faults allocated to the software, such as stuck bits or single event effects (SEE).
Note: For example, flight computer designs have included Error Detection And Correction (EDAC) logic on EEPROMs, and the load process has been designed to detect and respond to failure if the EDAC detects an uncorrectable bit error. Software designs have included checksum logic and periodic verification of memory to detect command, data, or load, and memory faults.
Note: Protection is typically provided by intentionally enabling a write operation before modifying the software; at all other times, write operations are disabled to protect the software from unintended modifications. Unintended modifications can be introduced through configuration management, design, and operations flaws as well as physics.
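The enable-before-write pattern described in this note might look like the sketch below. Real hardware would usually expose write protection through a device register or a physical switch; here it is modeled as a flag in a structure so the example is self-contained. The `code_store_t` type and all function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CODE_STORE_SIZE 64u  /* assumed size of the protected region */

/* Software model of a write-protected code store. */
typedef struct {
    uint8_t  bytes[CODE_STORE_SIZE];
    bool     write_enabled;   /* false at all times except during an update */
    uint32_t blocked_writes;  /* would be reported in telemetry */
} code_store_t;

typedef enum { WR_OK, WR_PROTECTED, WR_OUT_OF_RANGE } wr_status_t;

void code_store_enable_write(code_store_t *s)  { s->write_enabled = true; }
void code_store_disable_write(code_store_t *s) { s->write_enabled = false; }

/* Low-level write: refused and counted unless writes are enabled. */
wr_status_t code_store_write(code_store_t *s, size_t off,
                             const uint8_t *src, size_t n)
{
    if (!s->write_enabled) {
        s->blocked_writes++;
        return WR_PROTECTED;
    }
    if (off > CODE_STORE_SIZE || n > CODE_STORE_SIZE - off) {
        return WR_OUT_OF_RANGE;
    }
    memcpy(&s->bytes[off], src, n);
    return WR_OK;
}

/* Intentional update: enable, write, then disable again on every path. */
wr_status_t code_store_patch(code_store_t *s, size_t off,
                             const uint8_t *src, size_t n)
{
    code_store_enable_write(s);
    wr_status_t st = code_store_write(s, off, src, n);
    code_store_disable_write(s);
    return st;
}
```

Routing every intentional modification through a single wrapper keeps the window in which writes are enabled as short as possible, and the blocked-write counter gives ground operators visibility into attempted unintended modifications.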
a. Execution in data areas, unused areas, and other areas not intended for execution caused by branching into non-code areas.
b. Using code as data.
c. Unintended, harmful over-writing of code areas.
Note: Wherever possible, features of the processor and/or operating system should be utilized to protect against incorrect memory use. For example, disabling the “write enable” switch such that writing to protected memory areas causes interrupts. Software techniques may also be used such as initializing unused memory to illegal instructions that cause interrupts when executed.
3.4 MSFC
Rationale: To ensure the integrity of the process, all updates to the program of reprogrammable devices, e.g., flight software loads/updates and sequence memory loads, must be verified by a memory readout or checksum readout. Depending on the application and mission/system consequence, single or multiple readouts may be considered. Program here refers to both loading software and configuration files.
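Verification of a load by checksum readout can be sketched as below. The example assumes the expected checksum is computed on the ground and supplied with the load, and it uses the common reflected CRC-32 (polynomial 0xEDB88320) as the checksum; the source does not prescribe a particular algorithm. The function names `crc32_compute` and `load_and_verify` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bitwise CRC-32 (reflected, polynomial 0xEDB88320, initial value and
 * final XOR of 0xFFFFFFFF). Slow but table-free. */
uint32_t crc32_compute(const uint8_t *data, size_t n)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < n; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++) {
            crc = (crc & 1u) ? (crc >> 1) ^ 0xEDB88320u : (crc >> 1);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Write the image into the target memory, then read the target back and
 * compare its checksum with the value supplied alongside the load.
 * Returns true only if the readout matches. */
bool load_and_verify(uint8_t *target, const uint8_t *image, size_t n,
                     uint32_t expected_crc)
{
    memcpy(target, image, n);
    return crc32_compute(target, n) == expected_crc;
}
```

The checksum is computed over the target memory after the write, not over the source buffer, so a failed or partial write is detected. Where mission consequence warrants, the readout could be repeated or replaced by a full memory readout compared on the ground.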
Note: Incorrect use of memory includes execution in data areas, unused areas, and other areas not intended for execution, as well as unintended over-writing of code areas. Wherever possible, features of the processor and/or operating system should be utilized to protect against incorrect memory use. For example, disabling the “write enable” switch such that writing to protected memory areas causes interrupts. Software techniques may also be used such as initializing unused memory to illegal instructions that cause interrupts when executed.
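The software technique of initializing unused memory can be sketched as follows. The fill value `TRAP_WORD` is a placeholder: on a real target it would be an opcode that the specific processor treats as illegal, which is architecture dependent and is not specified in the source. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Placeholder fill value. On real hardware this must be an instruction
 * word that the target CPU decodes as illegal so that executing it
 * raises an exception. */
#define TRAP_WORD 0xDEADC0DEu

/* Fill every word beyond the used portion of a region with the trap
 * pattern. */
void fill_unused(uint32_t *region, size_t words, size_t used_words)
{
    for (size_t i = used_words; i < words; i++) {
        region[i] = TRAP_WORD;
    }
}

/* Count unused words that no longer hold the trap pattern. A non-zero
 * result indicates a stray write or corruption in the unused area. */
size_t check_unused(const uint32_t *region, size_t words, size_t used_words)
{
    size_t bad = 0;
    for (size_t i = used_words; i < words; i++) {
        if (region[i] != TRAP_WORD) {
            bad++;
        }
    }
    return bad;
}
```

A known fill pattern serves two purposes: a branch into the unused area traps immediately, and a periodic check of the pattern can reveal unintended writes into an area that should never change.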
Rationale: Techniques such as partitioning of code and data help minimize the potential for unintended modification. It is a technique for providing isolation between functionally independent software elements and may even reduce the verification effort and increase its fidelity.
Rationale: Memory modifications may occur due to radiation-induced errors, uplink errors, configuration errors, or other causes, so the computing system must be able to detect the problem and recover to a safe state. For example, there have been cases aboard the International Space Station in which the computing system’s memory checksum function detected incorrect data loads and enabled isolation and safe recovery. Memory correction features may be used to enable such a recovery. As an example, computing systems may implement error detection and correction, software executable and data load authentication, periodic memory scrub, and space partitioning to provide protection against inadvertent memory modification.
4. Resources
4.1 References
5. Lessons Learned
5.1 NASA Lessons Learned
Contact was lost with the Mars Global Surveyor (MGS) spacecraft in November 2006 during its 4th extended mission. A routine memory load command sent to an incorrect address 5 months earlier corrupted positioning parameters, and their subsequent activation placed MGS in an attitude that fatally overheated a battery and depleted spacecraft power. The report by the independent MGS Operations Review Board listed 10 key recommendations to strengthen operational procedures and processes, correct spacecraft design weaknesses, and assure that economies implemented late in the course of long-lived missions do not impose excessive risks.
9.09 Incorrect Memory Use or Access