Design software to protect against incorrect use of memory.
Incorrect use of memory can lead to catastrophic failure.
2. Examples and Discussion
Memory access errors generally fall into two types: inappropriate reads and inappropriate writes.
Spacecraft have several types of memory, with different characteristics, residing in different devices. All spacecraft will have memory on the processor board, and a form of non-volatile memory for boot code. There are often other devices that provide storage for extra copies of the flight software, system data, telemetry, and science data. While there are many types of memory technology in use, they can roughly be grouped into three categories:
Read-only memory that cannot be modified once programmed (e.g., PROM),
Persistent (non-volatile) memory that can be modified (e.g., EEPROM), and
Volatile memory that can be modified freely, but which does not guarantee content on reset or power cycle (e.g., DRAM, SDRAM, etc.).
Which technology is chosen depends on the persistence requirements of the data that needs to be stored.
Processor memory is typically organized with areas reserved for code storage, compiled-in data storage, stack space, and dynamic (heap) storage. Depending on processor architecture, these areas may be known by different terms, and be organized in physical memory in different ways, but regardless of architecture, each area in memory has a specific purpose, and the boundaries between each must be strictly observed.
The specific techniques employed to prevent inappropriate memory use vary according to the underlying technology, frequency of access, criticality of the contents, and other application-specific considerations. The choice of techniques should balance the need for reliability against the cost, complexity, and risk of proposed solutions. For example, data replication with a simple checksum may be sufficient to guarantee an acceptable level of data integrity. In other situations, data replication might need to be combined with error detection and correction (either hardware or software-based), and possibly write protection mechanisms to achieve the desired reliability.
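As a rough illustration of data replication with a simple checksum, the sketch below stores a CRC-32 with each copy of a record and reads back the first copy whose checksum still matches. The function names (`pack_record`, `unpack_record`, `read_replicated`) and the record layout are assumptions made for this example, not part of any referenced standard, and Python is used only for readability.

```python
import zlib

def pack_record(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC-32 to the payload (hypothetical layout)."""
    crc = zlib.crc32(payload) & 0xFFFFFFFF
    return payload + crc.to_bytes(4, "big")

def unpack_record(record: bytes):
    """Return the payload if its stored CRC matches, otherwise None."""
    if len(record) < 4:
        return None
    payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
    return payload if (zlib.crc32(payload) & 0xFFFFFFFF) == stored else None

def read_replicated(copies):
    """Return (payload, rejected) using the first copy that passes its check.

    `rejected` lists the indices of copies that failed before a good one was found,
    so the caller can report them.
    """
    rejected = []
    for index, record in enumerate(copies):
        payload = unpack_record(record)
        if payload is not None:
            return payload, rejected
        rejected.append(index)
    return None, rejected
```

A caller would typically write the same packed record to two or more devices and treat a non-empty `rejected` list as an event worth reporting, even when the read succeeds.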
Additional techniques that might be considered, either alone or in conjunction with other techniques, include:
Operating system or hardware-based memory partitioning,
The ability to roll back a failed change operation,
Use of multiple memory technologies to prevent common-mode failures,
Software checks on computed addresses or pointers to ensure they refer to valid memory locations,
Use of memory write protection,
Initializing unused memory to illegal instructions that cause interrupts when executed, and
Initializing unused memory to values other than zero.
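One of the techniques listed above, a software check on computed addresses, might look like the following sketch. The memory map, region names, and addresses are hypothetical; in a real system they would come from the linker map or board definition, and the check would sit in front of any routine that dereferences a computed or uplinked address.

```python
from typing import NamedTuple

class Region(NamedTuple):
    name: str
    start: int      # first valid address
    size: int       # length in bytes
    writable: bool

# Hypothetical memory map, for illustration only.
MEMORY_MAP = (
    Region("code", 0x0000_0000, 0x0004_0000, writable=False),
    Region("data", 0x2000_0000, 0x0001_0000, writable=True),
)

def check_access(address: int, length: int, write: bool) -> Region:
    """Return the region that fully contains the access, or raise ValueError."""
    if length <= 0:
        raise ValueError("length must be positive")
    for region in MEMORY_MAP:
        end = region.start + region.size
        # The whole span must lie inside one region; straddling a boundary is rejected.
        if region.start <= address and address + length <= end:
            if write and not region.writable:
                raise ValueError(f"write to read-only region {region.name!r}")
            return region
    raise ValueError(f"access at {address:#010x} (+{length}) is outside the memory map")
```

Rejecting accesses that straddle a region boundary, rather than clipping them, keeps the boundaries between code, data, stack, and heap strictly observed.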
Memory contents in both code and data segments may become corrupted due to the effects of the space environment. This is a special case of incorrect write access. The techniques employed to detect and correct memory corruption include many of the techniques described above as well as systematic processes such as:
Hardware or software error detection and correction (EDAC) and memory scrubbing
Use of checksums on important memory segments
Read-after-write to ensure that critical data have been correctly transferred to memory
Tracking of bad bits and/or memory blocks, with associated memory management to prevent use of physical areas of memory with known problems.
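A minimal sketch of software error correction combined with scrubbing is shown below. It uses bitwise majority voting across three copies of each word, which is a simpler stand-in for the Hamming-style codes that hardware EDAC usually implements. The names `vote` and `scrub` and the three-bank layout are assumptions for this example.

```python
def vote(a: int, b: int, c: int) -> int:
    """Bitwise majority of three words: each output bit agrees with at least two inputs."""
    return (a & b) | (a & c) | (b & c)

def scrub(copies) -> int:
    """Repair three equal-length word lists in place.

    Returns the number of individual words that had to be rewritten, which a
    caller could report as an event count.
    """
    corrected = 0
    for i, (a, b, c) in enumerate(zip(*copies)):
        good = vote(a, b, c)
        for bank in copies:
            if bank[i] != good:
                bank[i] = good      # write the voted value back (the "scrub")
                corrected += 1
    return corrected
```

Voting corrects any upset confined to one copy of a word. Two copies corrupted in the same bit position would outvote the good copy, which is why scrubbing is run periodically, before errors can accumulate.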
Whatever techniques are used, when safeguard measures are invoked to prevent an undesirable action, or correct a detected fault, reporting the event in telemetry alerts ground operators to the situation and allows them to take appropriate action.
Excerpts from three documents are included below but no information on the documents that the excerpts were taken from is available. These documents should be properly referenced.
18.104.22.168 Protection Against Incorrect Memory Use - Software shall be designed to protect against incorrect use of memory with the following considerations:
a. Execution in data areas, unused areas, and areas not intended for execution.
b. Unintended over-writing of code areas.
c. The updating of code/software be limited to a single target memory device under user ground control and monitoring at a time. If dual memory units are incorporated in the design, under no circumstances are the prime and redundant memories to be modified concurrently, or before the operational performance of the change is properly assured in a single unit.
22.214.171.124.3 Protection from Unintended Software Modification -The flight software architecture should be designed to protect flight software that is intended to be modifiable during flight from unintended modifications.
126.96.36.199.1 Command Validation and Acknowledgement
a. Flight software shall be designed to verify uplinked commands, data, or loads.
b. Flight software shall ignore, and log incorrectly formatted commands, data, or loads and provide notification that they were incorrectly formatted.
c. Flight software shall be designed to send acknowledgement of command receipt to the source with indication of acceptance or rejection of command. For rejected commands, the acknowledgement message shall include a reason for rejection in the transmitted message.
Note: For example, flight computer designs have included Error Detection And Correction (EDAC) logic on EEPROMs, and the load process has been designed to detect and respond to failure if the EDAC detects an uncorrectable bit error. Software designs have included check sum logic and periodic verification of memory to detect command, data, or load, and memory faults. For example, a command handler should check whether a received command is appropriate for the current system mode, and a software module should check whether a command is appropriate for its local state.
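The validation and acknowledgement behavior described in this excerpt can be sketched as follows. The packet layout (one opcode byte, arguments, one additive checksum byte), the command table, and the mode names are all hypothetical and chosen only to keep the example short. A real command format would be mission-specific and would normally use a stronger check than an 8-bit sum.

```python
from dataclasses import dataclass

# Hypothetical command table: opcode -> (argument length, modes in which it is allowed).
COMMAND_TABLE = {
    0x01: (0, {"SAFE", "NOMINAL"}),   # e.g. a no-op
    0x10: (4, {"NOMINAL"}),           # e.g. a payload operation
}

@dataclass
class Ack:
    accepted: bool
    reason: str = ""

def validate_command(packet: bytes, mode: str, log: list) -> Ack:
    """Check format, checksum, and mode; log and reject anything that fails."""
    def reject(reason: str) -> Ack:
        log.append(reason)              # rejected commands are logged, not executed
        return Ack(False, reason)       # the acknowledgement carries the reason

    if len(packet) < 2:
        return reject("packet too short")
    opcode, args, checksum = packet[0], packet[1:-1], packet[-1]
    if sum(packet[:-1]) & 0xFF != checksum:
        return reject("checksum mismatch")
    if opcode not in COMMAND_TABLE:
        return reject(f"unknown opcode {opcode:#04x}")
    arg_len, modes = COMMAND_TABLE[opcode]
    if len(args) != arg_len:
        return reject("wrong argument length")
    if mode not in modes:
        return reject(f"not allowed in mode {mode}")
    return Ack(True)
```

The checks run from cheapest and most general (length, checksum) to most specific (mode), so a corrupted packet is rejected before its contents are interpreted.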
188.8.131.52 Response to incorrect commands, loads, data, or memory a. Flight software shall be designed to detect and respond safely to corrupted commands, data, or loads, and memory faults allocated to the software, such as stuck bits or single event effects (SEE).
Note: For example, flight computer designs have included Error Detection And Correction (EDAC) logic on EEPROMs, and the load process has been designed to detect and respond to failure if the EDAC detects an uncorrectable bit error. Software designs have included check sum logic and periodic verification of memory to detect command, data, or load, and memory faults.
184.108.40.206 Protection from unintended software modification - Flight software that is modifiable during flight shall be protected from unintended modifications including those caused by operations errors, single event effects, and hardware problems.
Note: Protection is typically provided by intentionally enabling a write operation before modifying the software; at all other times, write operations are disabled to protect the software from unintended modifications. Unintended modifications can be introduced through configuration management, design, and operations flaws as well as physics.
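The enable-before-write pattern in this note can be modeled in software as shown in the sketch below. The `ProtectedMemory` class is hypothetical. On real hardware the enable step would usually be a register write or a memory protection unit setting, and a blocked write would raise an interrupt rather than a Python exception.

```python
from contextlib import contextmanager

class ProtectedMemory:
    """Byte store that rejects writes unless write access was deliberately enabled."""

    def __init__(self, size: int):
        self._data = bytearray(size)
        self._write_enabled = False

    @contextmanager
    def write_enabled(self):
        """Enable writes for the duration of a `with` block, then disable them again."""
        self._write_enabled = True
        try:
            yield self
        finally:
            # Protection is restored even if the update raises part-way through.
            self._write_enabled = False

    def write(self, offset: int, value: bytes) -> None:
        if not self._write_enabled:
            raise PermissionError("write attempted while write protection is active")
        if offset < 0 or offset + len(value) > len(self._data):
            raise IndexError("write outside protected region")
        self._data[offset:offset + len(value)] = value

    def read(self, offset: int, length: int) -> bytes:
        return bytes(self._data[offset:offset + length])
```

Scoping the enabled state to a single block keeps the window in which an unintended write could succeed as short as the intended update itself.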
220.127.116.11 Protection against incorrect memory use - Software shall be designed to protect against incorrect use of memory:
a. Execution in data areas, unused areas, and other areas not intended for execution caused by branching into non-code areas.
b. Using code as data.
c. Unintended, harmful over-writing of code areas.
Note: Wherever possible, features of the processor and/or operating system should be utilized to protect against incorrect memory use. For example, disabling the “write enable” switch such that writing to protected memory areas causes interrupts. Software techniques may also be used such as initializing unused memory to illegal instructions that cause interrupts when executed.
18.104.22.168 Flight software loads, updates and sequence memory loads shall be capable of verification after upload. Rationale: To ensure the integrity of the process, all updates to the program of reprogrammable devices, e.g., flight software loads/updates and sequence memory loads, must be verified by a memory readout or checksum readout. Depending on the application and mission/system consequence, single or multiple readouts may be considered. Program here refers to both loading software and configuration files.
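Verification by checksum readout, as called for in this rationale, might be sketched as follows. The function names, the chunk size, and the use of CRC-32 are assumptions for illustration. The excerpt does not prescribe a particular checksum or load procedure.

```python
import zlib

def verify_load(memory: bytearray, offset: int, image: bytes) -> bool:
    """Read back the loaded region and compare its CRC-32 with the image's CRC-32."""
    if offset < 0 or offset + len(image) > len(memory):
        return False
    readback = bytes(memory[offset:offset + len(image)])
    return zlib.crc32(readback) == zlib.crc32(image)

def load_and_verify(memory: bytearray, offset: int, image: bytes, chunk: int = 4) -> bool:
    """Write the image in chunks, then verify the whole load by readout."""
    if offset < 0 or offset + len(image) > len(memory):
        return False                      # refuse loads that would not fit
    for start in range(0, len(image), chunk):
        piece = image[start:start + chunk]
        memory[offset + start:offset + start + len(piece)] = piece
    return verify_load(memory, offset, image)
```

For loads with severe mission consequences, a full memory readout compared byte for byte, or repeated readouts, could replace the single checksum comparison shown here.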
22.214.171.124 Software shall be designed to protect against incorrect use of memory. Note: Incorrect use of memory such as execution in data areas, unused areas, and other areas not intended for execution, unintended over-writing of code areas. Wherever possible, features of the processor and/or operating system should be utilized to protect against incorrect memory use. For example, disabling the “write enable” switch such that writing to protected memory areas causes interrupts. Software techniques may also be used such as initializing unused memory to illegal instructions that cause interrupts when executed.
Rationale: Techniques such as partitioning of code and data help minimize the potential for unintended modification. It is a technique for providing isolation between functionally independent software elements and may even reduce the verification effort and increase its fidelity.
126.96.36.199 Flight software shall detect inadvertent memory modification and recover to a known safe state. Rationale: Memory modifications may occur due to radiation-induced errors, uplink errors, configuration errors, or other causes, so the computing system must be able to detect the problem and recover to a safe state. For example, there have been cases aboard the International Space Station in which the computing system’s memory check sum function detected incorrect data loads and enabled isolation and safe recovery. Memory correction features may be used to enable such a recovery. As an example, computing systems may implement error detection and correction, software executable and data load authentication, periodic memory scrub and space partitioning to provide protection against inadvertent memory modification.
5. Lessons Learned
5.1 NASA Lessons Learned
The NASA Lessons Learned database contains the following lessons learned related to incorrect memory use or access:
MER Spirit Flash Memory Anomaly (2004). Lesson Learned 1483:
"Shortly after the commencement of science activities on Mars, an MER rover lost the ability to execute any task that requested memory from the flight computer. The cause was incorrect configuration parameters in two operating system software modules that control the storage of files in system memory and flash memory. Seven recommendations cover enforcing design guidelines for COTS software, verifying assumptions about software behavior, maintaining a list of lower priority action items, testing flight software internal functions, creating a comprehensive suite of tests and automated analysis tools, providing downlinked data on system resources, and avoiding the problematic file system and complex directory structure."
Mars Global Surveyor (MGS) Spacecraft Loss of Contact. Lesson Learned 1805:
Contact was lost with the Mars Global Surveyor (MGS) spacecraft in November 2006 during its 4th extended mission. A routine memory load command sent to an incorrect address 5 months earlier corrupted positioning parameters, and their subsequent activation placed MGS in an attitude that fatally overheated a battery and depleted spacecraft power. The report by the independent MGS Operations Review Board listed 10 key recommendations to strengthen operational procedures and processes, correct spacecraft design weaknesses, and assure that economies implemented late in the course of long-lived missions do not impose excessive risks.
In addition, a report prepared by the Near Earth Asteroid Rendezvous (NEAR) Anomaly Review Board contains information related to incorrect memory use or access. The report is entitled "The NEAR Rendezvous Burn Anomaly of December 1998".