As NASA organizations use Artificial Intelligence (AI) to make decisions, the need for robust Software Assurance (SA) becomes increasingly important.
AI has emerged as an option for projects, changing the way we interact with and use data. At this time, the recommendation is to limit the use of AI to non-safety-critical applications.
Software Assurance plays a crucial role in the AI development lifecycle. It involves a systematic process of monitoring, assessing, and improving the software development and AI implementation processes. In the context of AI, SA aims to:
evaluate the use of AI in software development activities,
validate the accuracy of algorithms,
ensure model robustness, and
ensure that the software meets the highest standards of performance.
AI introduces unique challenges to traditional SA methodologies. Unlike conventional software, many AI systems continue to learn and adapt based on new data. This dynamic nature makes it difficult to establish fixed criteria for testing and validation. SA for AI must evolve alongside the models it scrutinizes, which calls for a more iterative and adaptive approach.
1.1 Data Quality
The quality of the data used to train and test AI models directly influences their performance and reliability. Garbage in, garbage out (GIGO) holds in the world of AI, emphasizing the critical importance of high-quality data. SA should also ensure the security and configuration control aspects of the data used to train and test AI. Data quality encompasses several dimensions, including accuracy, completeness, consistency, timeliness, and relevance. In the context of AI, accuracy is particularly vital, as even small inaccuracies in the training data can lead to significant errors in predictions. Ensuring that data is representative and unbiased is also crucial to avoid reinforcing existing biases within AI models.
There are a number of key considerations for software assurance on AI models:
1.1.1 Validating The Training Data
SA needs to review (evaluate and approve) how the project and engineering are testing the data used to train the AI model to ensure it is accurate, representative, and unbiased. This helps identify and mitigate potential issues with the model's performance and predictions. Effective data preprocessing and cleaning are foundational steps in ensuring data quality for AI. This involves identifying and addressing missing values, handling outliers, and normalizing data to create a standardized and reliable dataset. SA processes must include rigorous checks at these stages to guarantee that the data fed into AI models is of the highest quality. SA should analyze and think through all possible scenarios to make sure that the AI actions are correct.
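The preprocessing checks described above (missing values, outliers, normalization) can be sketched as follows. This is a minimal illustration using only the Python standard library; the function names and the 1.5 x IQR outlier rule are assumptions chosen for the example, not a method prescribed by this guidance.

```python
from statistics import quantiles


def profile_column(values):
    """Count missing entries and flag IQR outliers in one numeric column.

    Hypothetical helper: missing values are represented as None, and the
    1.5 * IQR fence is one common outlier convention among several.
    """
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    q1, _, q3 = quantiles(present, n=4)  # quartiles of the non-missing values
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}


def min_max_normalize(values):
    """Rescale values to the range [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

A reviewer could run such a profile on each training feature and compare the counts against whatever thresholds the project has documented.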
Unlike traditional software, AI models that are generative or continually learning may require ongoing testing even after deployment, because the model continues to learn and evolve with new real-world data. SA needs to ensure that the software engineering process includes repeatedly testing the model's behavior and performance over time and evaluating the results. Testing AI models is a multifaceted process: it involves validating the accuracy of predictions, assessing model generalization to new data, and evaluating performance under diverse conditions. Rigorous testing not only ensures the reliability of AI applications but also contributes to building trust among end users.
The dynamic nature of some AI techniques also necessitates continuous monitoring and feedback loops throughout the development lifecycle. SA processes should include mechanisms for monitoring AI applications in development, in testing, and in production. This ongoing evaluation helps identify and address issues promptly, ensuring that the AI system remains accurate and effective as it encounters new data.
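One way to monitor whether new data has drifted away from the data the model was validated on is the Population Stability Index (PSI). The sketch below is an illustration only: the helper name, the bin edges, and the floor used for empty bins are assumptions, and any alert threshold (0.25 is a common rule of thumb) would need to be set by the project.

```python
import math


def population_stability_index(baseline, current, edges):
    """PSI between two numeric samples using shared bin edges.

    Hypothetical helper. `edges` are ascending cut points; values below the
    first edge fall in bin 0, values at or above the last edge in the last bin.
    """
    def fractions(sample):
        counts = [0] * (len(edges) + 1)
        for x in sample:
            idx = sum(1 for e in edges if x >= e)  # index of the bin holding x
            counts[idx] += 1
        # A small floor avoids log(0) when a bin is empty (assumed convention).
        return [max(c / len(sample), 1e-6) for c in counts]

    base, cur = fractions(baseline), fractions(current)
    return sum((c - b) * math.log(c / b) for b, c in zip(base, cur))
```

Identical distributions give a PSI of zero, and the value grows as the two samples diverge.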
Comprehensive documentation and traceability are essential aspects of AI applications. Documenting the entire development and testing process, along with the data used, provides transparency and allows for effective debugging and auditing. In the event of issues or unexpected outcomes, traceability enables developers and software assurance teams to identify and rectify problems efficiently. SA needs to ensure that sufficient documentation and traceability exist for the AI application and its associated data. In particular, SA should ensure that all inputs are documented so that, if the model needs to be recreated, they can be fed into the AI tool again and produce the same, correct results.
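A simple way to make model inputs traceable is to record them in a manifest that includes cryptographic hashes. The sketch below is a hedged illustration: the field names (`dataset_sha256`, `code_version`, and so on) and the manifest layout are assumptions made for this example, not a required format.

```python
import hashlib
import json


def training_manifest(dataset_bytes, hyperparameters, seed, code_version):
    """Build a reproducibility record for one training run.

    Hypothetical helper: hashes the raw dataset bytes and then hashes the
    canonical JSON form of the record so any change to an input is detectable.
    """
    record = {
        "dataset_sha256": hashlib.sha256(dataset_bytes).hexdigest(),
        "hyperparameters": hyperparameters,
        "seed": seed,
        "code_version": code_version,
    }
    # Sorted keys and fixed separators give a stable byte representation.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["manifest_sha256"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Two runs with identical inputs produce the same `manifest_sha256`, which gives auditors a quick check that a recreated model used the documented inputs.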
Automated tools and machine learning models can be used to analyze large codebases, identify patterns, and predict potential issues more efficiently than manual review. SA needs to analyze the engineering data or perform independent static code analysis to check for code defects and to assess software quality objectives, code coverage objectives, software complexity values, and software security objectives. SA needs to confirm that the static analysis tools are used with checkers that identify security and coding errors and defects, and that the project assesses and addresses the errors and defects reported by the static analysis tools used by software assurance, software safety, engineering, or the project. SA should also confirm that Software Quality Objectives, or software quality threshold levels, are defined for static code analysis defects and for software security objectives.
It's important to establish clear quality criteria and variables to assess the AI model's performance, quality, risk, security, maintainability, and compliance with requirements. The SA process should confirm that these criteria are defined and aim to continuously improve the code quality towards all of the criteria.
While the goal is to maximize code quality, projects must also consider factors like features, costs, and schedule. The SA process should help identify the critical vulnerabilities to address based on objectives, risks, and schedules.
Protecting sensitive data used to train and operate the AI model is important. Appropriate access controls and security measures must be in place. SA should confirm that the project has the proper security controls and configuration management in place.
One of the most significant challenges in AI software assurance is addressing bias. Biased training data can lead to incorrect outcomes. Software assurance should look at how the project identifies and mitigates bias in the data and the algorithms. SA should also ensure ongoing monitoring and adjustment are used to ensure objectivity in AI applications.
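As one concrete example of a bias check, the rate of positive predictions can be compared across groups (often called demographic parity). The sketch below is illustrative only: the function names are hypothetical, and demographic parity is just one of several fairness measures a project might choose.

```python
from collections import defaultdict


def selection_rates(groups, predictions):
    """Fraction of positive predictions per group (hypothetical helper)."""
    totals, positives = defaultdict(int), defaultdict(int)
    for group, prediction in zip(groups, predictions):
        totals[group] += 1
        positives[group] += 1 if prediction else 0
    return {group: positives[group] / totals[group] for group in totals}


def demographic_parity_gap(groups, predictions):
    """Largest difference in selection rate between any two groups."""
    rates = selection_rates(groups, predictions)
    return max(rates.values()) - min(rates.values())
```

A gap near zero suggests similar treatment across groups on this measure; what gap is acceptable is a project decision that SA would expect to see documented.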
1.1.9 Requirements Verification And Validation
AI applications introduce the unique challenge of nondeterministic behavior: the behavior of an AI system may not be predictably repeatable, which limits the ability to verify and validate it with conventional, repeatable tests. SA should work to develop verification and validation criteria that address the probabilistic nature of AI systems and to establish qualitative measures for validation and verification test acceptance, pass, and fail criteria.
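Where repeated trials are feasible, a statistical pass criterion can complement such measures: instead of requiring every run to pass, require that a confidence bound on the pass rate exceeds a stated threshold. The sketch below uses the Wilson score lower bound; the function names, the 95% level (z = 1.96), and the example thresholds are assumptions for illustration.

```python
import math


def wilson_lower_bound(successes, trials, z=1.96):
    """Lower end of the Wilson score interval for a pass rate."""
    if trials == 0:
        return 0.0
    p = successes / trials
    denom = 1 + z * z / trials
    center = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (center - margin) / denom


def meets_acceptance(successes, trials, required_rate, z=1.96):
    """Pass only if the lower confidence bound reaches the required rate."""
    return wilson_lower_bound(successes, trials, z) >= required_rate
```

With a required rate of 0.98, 990 passes in 1000 trials meets the criterion, while 99 passes in 100 trials does not, even though both have an observed rate of 99%. The smaller sample simply does not provide enough evidence.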
In summary, software assurance for AI models requires a comprehensive approach that focuses on validating training data, continuous testing, leveraging AI/ML tools, defining quality benchmarks, balancing project goals and risks, ensuring data security and privacy, and addressing ethical considerations. This helps ensure the AI system is accurate, robust, secure, and compliant.
In the era of artificial intelligence, software quality assurance emerges as a linchpin for ensuring the reliability, accuracy, and ethicality of AI applications. By focusing on data quality as a foundational element, SA plays a pivotal role in mitigating challenges such as bias, ensuring transparency, and building trust in AI systems. As technology continues to advance, a proactive and adaptive approach to SA will be crucial in unlocking the full potential of AI while safeguarding against unintended consequences. Through continuous improvement, collaboration, and a commitment to ethical standards, the marriage of SA and AI promises a future where intelligent systems enhance our lives responsibly and reliably.
1.3 Additional Guidance
Links to Additional Guidance materials for this subject have been compiled in the Relevant Links table. See the Additional Guidance in the Resources tab.
2. Generative AI Metrics
Defining metrics for the complexity of generative AI models requires tailoring your evaluation to the unique characteristics of these systems—such as their architecture, size, computational requirements, capabilities, and usability. Unlike traditional software complexity metrics (e.g., cyclomatic complexity), generative AI complexity is often evaluated through model-specific engineering, mathematical, and operational characteristics. Below is a comprehensive guide to key metrics that you can use:
2.1 Architectural Complexity
Metrics:
Number of Layers: The depth or number of layers in the neural network architecture (e.g., 12 layers for the smallest GPT-2 model, 96 layers for the largest GPT-3 model).
Parameters: Total number of trainable parameters (e.g., billions for most large language models).
Attention Heads: In transformer models, attention heads drive complexity. Evaluate the number of heads per layer and their interactions.
Non-Linearity: Measure the types and number of activation functions (e.g., ReLU, GELU).
Why Important:
These metrics indicate the model's capacity to learn complex patterns. Larger and deeper architectures typically have higher expressivity but come with increased computational cost.
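The parameter count of a decoder-only transformer can be approximated from a few architectural numbers. The sketch below is a rough estimate under stated assumptions: a GPT-2 style layout, a feed-forward hidden size of four times the model width, learned positional embeddings, tied input/output embeddings, and biases and layer-norm weights ignored. The function name is hypothetical.

```python
def estimate_transformer_parameters(n_layers, d_model, vocab_size, context_length):
    """Approximate trainable parameters of a GPT-2 style decoder.

    Per layer: about 4*d^2 for attention projections plus 8*d^2 for a
    feed-forward block with a 4x hidden size, so 12*d^2 in total.
    """
    per_layer = 12 * d_model * d_model
    embeddings = (vocab_size + context_length) * d_model
    return n_layers * per_layer + embeddings


# Smallest GPT-2 configuration: 12 layers, width 768, 50257 tokens, 1024 positions.
gpt2_small = estimate_transformer_parameters(12, 768, 50257, 1024)
print(f"{gpt2_small:,}")  # 124,318,464, close to the commonly quoted 124M
```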
2.2 Computational Complexity
Metrics:
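Floating-Point Operations (FLOPs): The number of floating-point computations required to perform training and inference. This is a count of operations and should not be confused with FLOPS (operations per second), which measures hardware throughput.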
Memory Requirements: GPU or RAM usage during training and inference—especially significant for deployment on constrained systems.
Inference Time: Latency in generating outputs. Faster inference models are considered less complex and more efficient.
Power Consumption: Energy required for training and inference, relevant for sustainable AI practices.
Why Important:
These metrics determine the model's scalability and operational costs for deployment and training. For example, models with high FLOPs and memory requirements are often harder to scale.
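For dense transformer models, compute and memory can be estimated to first order from the parameter count. The sketch below uses widely cited rules of thumb (about 6 FLOPs per parameter per training token, about 2 FLOPs per parameter per generated token) and counts only the memory needed to hold the weights. The function names are hypothetical, and real costs depend on architecture, sequence length, and hardware.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: ~6 FLOPs per parameter per token."""
    return 6 * n_params * n_tokens


def inference_flops_per_token(n_params):
    """Rule-of-thumb forward-pass compute: ~2 FLOPs per parameter per token."""
    return 2 * n_params


def weight_memory_gib(n_params, bytes_per_param=2):
    """Memory to hold the weights only (2 bytes per parameter for 16-bit)."""
    return n_params * bytes_per_param / 2**30
```

These estimates exclude activations, optimizer state, and caches, so they are a lower bound on what deployment or training will actually need.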
2.3 Model Representational Complexity
Metrics:
Expressive Power: The ability of the model to learn and represent complex functions or dynamics.
Entropy of Outputs: Capturing the diversity and unpredictability of model outputs during inference.
Embedding Space Size: The dimensionality of the embeddings used internally (e.g., 768 for the smallest GPT-2 model, 12288 for the largest GPT-3 model).
Why Important:
These metrics highlight how effectively the model can generalize across diverse tasks and inputs while maintaining rich representations.
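Entropy of outputs can be estimated empirically from a sample of generated tokens. The sketch below computes Shannon entropy in bits from token frequencies; the function name is hypothetical, and an estimate from a small sample is only a rough indicator of output diversity.

```python
import math
from collections import Counter


def shannon_entropy_bits(tokens):
    """Empirical Shannon entropy (in bits) of a sequence of tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

Four equally frequent tokens give 2 bits, while a sequence that repeats a single token gives 0 bits.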
2.4 Training Complexity
Metrics:
Dataset Size: The volume of training data required (e.g., tokens or examples in billions for large models).
Training Iterations: Number of epochs or updates needed to achieve convergence.
Learning Rate Dynamics: The adaptation of learning rates during training, which impacts convergence speed.
Optimization Complexity: Evaluate the type of optimizer used (e.g., Adam vs. AdaFactor) and its configuration.
Why Important:
High training complexity can imply longer development times and greater hardware requirements to train a model properly.
2.5 Fine-Tuning and Adaptation Complexity
Metrics:
Number of Parameters Adapted: How much of the model can or must be fine-tuned for specific tasks (e.g., fine-tuning full models vs. adapter layers in PEFT [Parameter-Efficient Fine-Tuning]).
Data Requirements for Fine-Tuning: The amount of task-specific data required to adapt the model.
Domain Generalization: The model’s ability to generalize across new domains without full retraining.
Why Important:
Assessing fine-tuning complexity helps determine the model’s usability for downstream applications.
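To illustrate how few parameters a parameter-efficient method may adapt, the sketch below counts the trainable parameters added by low-rank adapters (LoRA, one PEFT technique). LoRA is used here only as an example; the function name, the choice of rank 8, and the choice of 24 adapted matrices are assumptions.

```python
def lora_trainable_parameters(d_in, d_out, rank, n_matrices):
    """Parameters added by low-rank adapters.

    Each adapted d_in x d_out weight gains two small matrices,
    d_in x rank and rank x d_out, so rank * (d_in + d_out) parameters.
    """
    return n_matrices * rank * (d_in + d_out)


# Hypothetical setup: rank-8 adapters on 24 square 768 x 768 projections
# of a model with roughly 124 million parameters.
adapted = lora_trainable_parameters(768, 768, 8, 24)
fraction = adapted / 124_000_000
print(adapted, f"{fraction:.2%}")
```

In this hypothetical configuration, well under 1% of the model's parameters are trained, which is the kind of figure the "Number of Parameters Adapted" metric is meant to capture.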
2.6 Output Complexity
Metrics:
Sequence Length: The maximum number of tokens or characters the model can process or generate in a single inference step.
Coherence Score: How logically connected the outputs are over long sequences (subjective or algorithmic measures).
Temperature and Diversity: Configurations used during inference and their influence on creativity or randomness of generative outputs.
Why Important:
Output complexity impacts the quality and usability of generative and conversational results, especially for tasks requiring coherence, relevance, or creativity.
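The effect of temperature on output randomness can be seen in a small sketch of temperature-scaled softmax. This is a generic illustration, not the sampling code of any particular model, and the function name is hypothetical.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities, sharpening or flattening by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the highest-scoring token (more deterministic output), while higher temperatures flatten the distribution (more diverse output).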
2.7 Interpretability Complexity
Metrics:
Explainability: How easy it is to understand the internal workings of the model (e.g., decision-making pathways or attention distributions).
Saliency Maps: Highlights in the input that influence the outputs, which are useful for interpretability tools.
Layer Contribution Analysis: Understanding which layers contribute most to model performance.
Bias and Fairness Audits: The complexity of detecting and mitigating bias in the model outputs.
Why Important:
Interpretability metrics are crucial for ethical AI deployment and trust-building in sensitive applications.
2.8 Real-World Deployment Complexity
Metrics:
Scalability: How easy it is to scale up or down the model architecture for different hardware configurations.
Latency: The time taken for the model to respond or process input in real-world usage scenarios.
API Complexity: The ease or difficulty of integrating the model into applications (e.g., REST APIs vs. custom libraries).
Security and Robustness: Complexity of ensuring the model is robust to adversarial attacks or misuse.
Why Important:
Deployment complexity plays a significant role in practical utility, customer satisfaction, and security of generative AI solutions.
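Latency is usually reported as percentiles rather than as an average, because occasional slow responses matter in operation. The sketch below times a callable and computes nearest-rank percentiles; the function names are hypothetical, and `fn` stands in for whatever call invokes the deployed model.

```python
import math
import time


def measure_latencies(fn, inputs):
    """Wall-clock seconds taken by fn on each input."""
    latencies = []
    for item in inputs:
        start = time.perf_counter()
        fn(item)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Reporting, for example, the 50th and 95th percentiles over a representative set of inputs gives a more complete picture than a single mean value.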
2.9 Best Practices for Defining Metrics
Task-Specific Design: Tailor metrics to your specific use case, whether it's text generation, image generation, or conversational AI.
Benchmarking: Use standard benchmarks such as GLUE and SuperGLUE, automatic metrics such as BLEU and ROUGE, or human evaluation to assess performance alongside complexity.
Holistic View: Combine several complexity metrics for a more complete picture (architectural, computational, and deployment complexity).
Comparative Analysis: Compare your model against others (e.g., GPT, BERT, DALL-E) to contextualize complexity scores.
2.10 Tools and Frameworks for Complexity Evaluation
Example: You can use tools like these for computation-heavy components:
Weights & Biases (W&B): For tracking FLOPs, memory use, and other training metrics.
Hugging Face Benchmarking Tools: For evaluating inference performance.
Explainability Libraries: Captum, SHAP, or LIME for interpretability complexity.
Energy Usage Estimators: Like CodeCarbon, to assess power consumption.
By defining and measuring these complexity metrics, you can assess generative AI models more effectively, ensure performance optimization, and improve deployment decisions.
This checklist provides comprehensive data and evidence required to certify software for human-rated missions.
It ensures compliance with applicable safety standards, regulatory requirements (NASA NPR 7150.2D, SSP 50038, FAA, NASA-STD-8739.8B), mission-critical functionality, and stakeholder acceptance of residual risks, demonstrating that the software is safe, reliable, and mission-ready for crewed spaceflight operations.
PAT for Comprehensive Checklist for Software Certification in Human-Rated Missions
2. Key Compliance Data Needs
2.1 Summary Table of Key Compliance Data Needs
| Category | Key Data/Documentation |
| Requirements | System/Software Requirements Traceability, Hazard Control Requirements |
| | Control Sequences, OCAD Validation, Manual Safing Data |
2.2 Key Compliance Data Needs
Software Requirements
High-level system/software requirements
Detailed software requirements (or the equivalent artifacts used by the developer)
All known software safety constraints
Software bi-directional traceability data
Specifications for internal and external software interfaces definition and testing
Encryption protocols, authentication mechanisms, secure coding practices, and access control procedures.
Software Design
Description of the software design
Hardware design data on safety-critical subsystems
Data Dictionary: input/output data formats, telemetry parameters, and command sequences.
Software Development
All software analysis results
Completed Time-to-effect (TTE) analysis
Completed Fault Tree Analyses
Completed Failure Mode and Effects Analysis
Software process audit results
Developer software process training records
Software Verification and Validation (software testing)
Software test data
Safety-critical requirements test results
Fault injection test results
End-to-end integration testing results
Penetration testing results (resilience testing and telemetry plans against unauthorized system access and cyberattacks)
Test results and data showing command execution timing within acceptable limits
Test results and data confirming adequate system resource margins
Detailed description of the software test environments
software interfaces (internal and external) test results
Code test coverage data
Software static analysis results reports
Number and types of static analysis tools used.
Results of a Security Vulnerability Analysis: detected and resolved vulnerabilities in the software's security framework.
All Independent Verification and Validation (IV&V) assessment results
Data showing that the safety-critical software components meet complexity thresholds
Evidence that the code structural quality has low risks.
Hazards
Hazards and mitigation controls that include software
List of any unresolved hazards
Configuration Management (CM)
Processes used for version control, change tracking, and baseline management.
Identification of flight-ready software configurations.
Flight Readiness and Operations
Clear understanding of the operational environment for the mission.
Operational procedures for updating the software and data
Any software-related threats in the operational environment that could affect software operation
List of and access to all open software defects
List of and access to all open and closed high-risk software defects.
Stakeholder-approved sign-off on any unavoidable operational software-related risks.
Evidence of adherence to validated development processes, coding guidelines, and testing protocols.
Deliverables required for regulatory certification
Software Version Description Document (VDD)
FRR Exit Criteria Sign-Off for software
Crew software user guides, operational procedures, and troubleshooting documentation.
Documentation showing mechanisms to handle errors, recover failures, and preserve system operation under degraded conditions.
3. Safety Case for Human-Rated Software Certification
This safety case demonstrates that the software used in this human-rated mission adheres to rigorous safety, quality, and regulatory standards. Based on the evidence provided, the software is flight-ready and capable of supporting critical mission operations while ensuring the safety of the crew and spacecraft under both nominal and adverse conditions.
1. Requirements and Traceability
Argument: The software requirements are clearly defined, traceable, and aligned with safety-critical mission needs.
Evidence:
Comprehensive Software Requirements Specification (SRS) covering high-level mission-critical systems (e.g., navigation, propulsion, anomaly detection, life support, and abort operations).
Verified safety requirements (fault tolerance, redundancy, and safe initialization/termination).
Acceptable quality of detailed low-level safety-critical requirements, including specifics like algorithm designs and timing constraints.
A completed and validated Requirements Traceability Matrix (RTM) showing bi-directional traceability from requirements through design, code, and test results.
Reviewed system-level safety analyses to document "Must Work" (MWF) and "Must Not Work" (MNWF) requirements, prerequisite checks for hazardous commands, and mitigation strategies.
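A bi-directional traceability check of the kind an RTM supports can be automated in a few lines. The sketch below is illustrative: the function name, the identifier format, and the input structure (a mapping from test IDs to the requirement IDs they verify) are assumptions, not a prescribed RTM format.

```python
def traceability_gaps(requirement_ids, test_to_requirements):
    """Find requirements with no test and tests that trace to no known requirement."""
    known_requirements = set(requirement_ids)
    covered = set()
    orphan_tests = []
    for test_id, traced in test_to_requirements.items():
        matched = [r for r in traced if r in known_requirements]
        if not matched:
            orphan_tests.append(test_id)  # forward trace exists, backward does not
        covered.update(matched)
    return {
        "untested_requirements": sorted(known_requirements - covered),
        "orphan_tests": sorted(orphan_tests),
    }
```

An empty result in both lists indicates that every requirement traces forward to at least one test and every test traces back to a requirement.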
2. Software Design and Architecture
Argument: The software architecture is resilient, modular, and designed for fault tolerance and safety-critical operations.
Evidence:
Architecture documentation detailing modular fault isolation, redundancy, and resiliency mechanisms.
Block diagrams illustrating fault containment, fail-safe control paths, and separation of critical functions.
Documentation and analysis of safety-critical subsystems (e.g., propulsion, crew displays, navigation) with clearly defined responsibilities.
Verified Interface Control Documents (ICDs), ensuring compatibility between internal software, hardware systems, and external interactions.
Safety validation evidence for safeguards like fault containment, error detection, operator validation, integrity checks, and anomaly recovery processes.
Independent redundant system designs ensuring physical and logical separation to mitigate single points of failure.
Validation of fault-tolerant mechanisms, including cosmic radiation protection in CPU designs.
3. Hazard Analysis and Safety Evidence
Argument: All hazards associated with software functionality are identified, analyzed, and mitigated to acceptable levels of risk.
Evidence:
A complete Hazard Analysis Report (HAR) identifying software-related hazards and the mitigation strategies in place.
Fault Tree Analysis (FTA) and Failure Mode and Effects Analysis (FMEA) showing robust fault prevention and recovery mechanisms.
Time-to-effect (TTE) analyses ensuring hazardous conditions can be addressed by safing systems within operational thresholds.
Residual risk documentation showing resolution or acceptance of remaining risks by stakeholders.
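The core arithmetic of a time-to-effect check is a comparison of the worst-case safing response time against the TTE of the hazard. The sketch below is a simplified illustration: splitting the response into detection, decision, and actuation times is an assumption made for this example, and the function names and millisecond values are hypothetical.

```python
def safing_margin_ms(time_to_effect_ms, detection_ms, decision_ms, actuation_ms):
    """Time remaining after the worst-case detect, decide, and actuate sequence."""
    worst_case_response_ms = detection_ms + decision_ms + actuation_ms
    return time_to_effect_ms - worst_case_response_ms


def meets_tte(time_to_effect_ms, detection_ms, decision_ms, actuation_ms,
              required_margin_ms=0):
    """True when the safing response finishes with at least the required margin."""
    margin = safing_margin_ms(time_to_effect_ms, detection_ms, decision_ms, actuation_ms)
    return margin >= required_margin_ms
```

For a hypothetical hazard with a 500 ms TTE and a 270 ms worst-case response, the margin is 230 ms, which passes with no required margin but fails if the project requires 250 ms of margin.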
4. Verification and Validation (V&V) Evidence
Argument: Rigorous testing, validation, and coverage analyses demonstrate software compliance with safety-critical requirements.
Evidence:
Unit testing, system integration testing, end-to-end validation, and operational flight simulations confirming that expected functional performance aligns with safety goals.
Validation of reused components (COTS, GOTS, OSS, MOTS) to ensure compatibility and reliable integration into human-rated environments.
Coverage analysis demonstrating:
100% Statement Coverage.
100% Decision Coverage.
100% Modified Condition/Decision Coverage (MC/DC) for safety-critical components.
Static analysis reports showing compliance with coding standards and identification/remediation of software defects.
Fault injection testing results validating responses to corrupted data, anomalies during power disruptions, and memory errors.
Worst-case response timing analysis confirming safing systems meet TTE requirements under degraded conditions.
5. Configuration Management and Change Tracking
Argument: Configuration management processes ensure version control and traceability for all software changes.
Evidence:
Documentation showing version-controlled baselines for flight-ready software, including configuration hashes and release notes.
Audit records verifying modifications, regression testing, impact analyses, and stakeholder approvals.
6. Cybersecurity and Security Validation
Argument: The software architecture incorporates robust cybersecurity measures to mitigate threats in operational environments.
Evidence:
Penetration testing results validating resilience against cyberattacks and unauthorized system access during pre-launch and flight.
Vulnerability analysis reports confirming detection, resolution, and closure of security-related risks.
7. Defect Management and Residual Risks
Argument: All software defects have been resolved or mitigated to acceptable levels of residual risk.
Evidence:
Defect reports showing all open and closed defects categorized by severity and justifications for acceptance of residual risks.
Logs documenting defect resolutions and testing data validating the outcomes of mitigation measures.
Residual risk acceptance documentation signed off by stakeholders, with sufficient evidence showing safe system behavior despite unresolved minor risks.
8. Resource Utilization and Performance Metrics
Argument: The software demonstrates sufficient resource margins and acceptable performance under normal and worst-case conditions.