

8.25 - Artificial Intelligence And Software Assurance

1. Introduction

As NASA organizations use Artificial Intelligence (AI) to make decisions, the need for robust Software Assurance (SA) becomes increasingly important. AI has emerged as a project option, changing the way we interact with and use data. At this time, it is recommended that the use of AI be limited to non-safety-critical applications.

Software Assurance plays a crucial role in the AI development lifecycle. It involves a systematic process of monitoring, assessing, and improving the software development and AI implementation processes. In the context of AI, SA aims to:

  • evaluate the use of AI in software development activities,
  • validate the accuracy of algorithms,
  • ensure model robustness, and
  • ensure that the software meets the highest standards of performance.

AI introduces unique challenges to traditional SA methodologies. Unlike conventional software, AI systems continuously learn and adapt based on new data. This dynamic nature poses challenges in establishing fixed criteria for testing and validation. SA in AI must evolve alongside the models it scrutinizes, necessitating a more iterative and adaptive approach.

1.1 Data Quality

The quality of the data used to train and test AI models directly influences their performance and reliability. Garbage in, garbage out (GIGO) holds true in the world of AI, emphasizing the critical importance of high-quality data. SA should also assess the security and configuration control of the data used to train and test AI. Data quality encompasses several dimensions, including accuracy, completeness, consistency, timeliness, and relevance. In the context of AI, accuracy is particularly vital, as even small inaccuracies in the training data can lead to significant errors in predictions. Ensuring that data is representative and unbiased is also crucial to avoid reinforcing existing biases within AI models.

There are a number of key considerations for software assurance on AI models:

1.1.1 Validating The Training Data

SA needs to review (evaluate and approve) how the project and engineering are testing the data used to train the AI model to ensure it is accurate, representative, and unbiased. This helps identify and mitigate potential issues with the model's performance and predictions. Effective data preprocessing and cleaning are foundational steps in ensuring data quality for AI. This involves identifying and addressing missing values, handling outliers, and normalizing data to create a standardized and reliable dataset. SA processes must include rigorous checks at these stages to guarantee that the data fed into AI models is of the highest quality. SA should also analyze and reason through the credible operational scenarios to confirm that the resulting AI behavior is correct.
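
For illustration, the following sketch shows the kind of automated data-quality checks SA might confirm are in place. It is a minimal example using pandas; the column names and the three-standard-deviation outlier threshold are illustrative assumptions.

    import pandas as pd

    def check_training_data(df: pd.DataFrame, label_column: str) -> dict:
        """Run basic data-quality checks on a training dataset."""
        report = {}
        # Completeness: fraction of missing values per column.
        report["missing_fraction"] = df.isna().mean().to_dict()
        # Duplicates can silently inflate apparent accuracy.
        report["duplicate_rows"] = int(df.duplicated().sum())
        # Label balance: a highly skewed label distribution may indicate bias.
        report["label_distribution"] = df[label_column].value_counts(normalize=True).to_dict()
        # Simple outlier screen on numeric columns (values beyond 3 standard deviations).
        numeric = df.select_dtypes(include="number")
        z_scores = (numeric - numeric.mean()) / numeric.std()
        report["outlier_counts"] = (z_scores.abs() > 3).sum().to_dict()
        return report

    # Example usage with a hypothetical dataset:
    # report = check_training_data(pd.read_csv("training_data.csv"), label_column="label")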

See also SWE-193 - Acceptance Testing for Affected System and Software Behavior

1.1.2 Continuous Testing

Unlike traditional software, AI models that are generative or continually learning may require ongoing testing even after deployment if the model continues to learn and evolve with new real-world data. SA needs to ensure that the software engineering process includes repeatedly testing the model's behavior and performance over time and evaluating the results. Testing AI models is a multifaceted process: it involves validating the accuracy of predictions, assessing how well the model generalizes to new data, and evaluating performance under diverse conditions. Rigorous testing not only ensures the reliability of AI applications but also contributes to building trust among end users. Because of this dynamic nature, SA processes should include mechanisms for monitoring AI applications in development, testing, and production, with feedback loops. This ongoing evaluation helps identify and address issues promptly, ensuring that the AI system remains accurate and effective as it encounters new data.
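
One possible mechanism is sketched below: a rolling-window monitor that tracks a deployed model's accuracy and flags degradation for SA review. The window size and accuracy threshold are illustrative assumptions.

    from collections import deque

    class PerformanceMonitor:
        """Track rolling accuracy of a deployed model and flag degradation."""

        def __init__(self, window_size: int = 500, min_accuracy: float = 0.95):
            self.results = deque(maxlen=window_size)  # 1 = correct, 0 = incorrect
            self.min_accuracy = min_accuracy

        def record(self, prediction, ground_truth) -> None:
            self.results.append(1 if prediction == ground_truth else 0)

        def rolling_accuracy(self) -> float:
            return sum(self.results) / len(self.results) if self.results else 1.0

        def degraded(self) -> bool:
            # Flag only once the window has enough samples to be meaningful.
            return len(self.results) == self.results.maxlen and \
                   self.rolling_accuracy() < self.min_accuracy

    # Example: monitor.record(model_output, label); if monitor.degraded(): raise an alert.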

See also SWE-066 - Perform Testing, SWE-211 - Test Levels of Non-Custom Developed Software

1.1.3 Documentation And Traceability

Comprehensive documentation and traceability are essential aspects of AI applications. Documenting the entire development and testing process, along with the data used, facilitates transparency and allows for effective debugging and auditing. In the event of issues or unexpected outcomes, traceability enables developers and software assurance teams to identify and rectify problems efficiently. SA needs to ensure that sufficient documentation and traceability exist for the AI application and associated data. SA should also ensure that, if the model ever needs to be recreated, all of the inputs are documented so that they can be fed back into the AI tool to reproduce the same results.
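
One way to support this is to capture a reproducibility manifest alongside each trained model. The sketch below records hashes of the training data files, the hyperparameters, and the random seed; the file names and fields are illustrative.

    import hashlib
    import json
    from pathlib import Path

    def file_sha256(path: str) -> str:
        """Return the SHA-256 digest of a file, used to pin the exact training data."""
        return hashlib.sha256(Path(path).read_bytes()).hexdigest()

    def write_manifest(data_files, hyperparameters, random_seed, out_path="model_manifest.json"):
        """Write a manifest capturing the inputs needed to recreate the model."""
        manifest = {
            "data_files": {f: file_sha256(f) for f in data_files},
            "hyperparameters": hyperparameters,
            "random_seed": random_seed,
        }
        Path(out_path).write_text(json.dumps(manifest, indent=2))
        return manifest

    # Example usage with hypothetical inputs:
    # write_manifest(["train.csv"], {"learning_rate": 1e-4, "epochs": 10}, random_seed=42)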

See also SWE-070 - Models, Simulations, Tools

1.1.4 Leveraging AI/ML For SA

Automated tools and machine learning models can be used to analyze large codebases, identify patterns, and predict potential issues more efficiently than manual review. SA needs to analyze the engineering data or perform independent static code analysis to check for code defects, software quality objectives, code coverage objectives, software complexity values, and software security objectives. SA needs to confirm the static analysis tool(s) are used with checkers that identify security and coding errors and defects. SA needs to confirm that the project addresses the errors and defects and assesses the results from the static analysis tools used by software assurance, software safety, engineering, or the project. SA should confirm that Software Quality Objectives, or software quality threshold levels, are defined and set for static code analysis defects and software security objectives.
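
The sketch below illustrates the idea of using a simple machine-learning model to flag potentially defect-prone modules from code metrics. The metrics, training data, module names, and review threshold are hypothetical, and such a predictor complements rather than replaces static analysis.

    from sklearn.linear_model import LogisticRegression

    # Hypothetical historical data: [cyclomatic complexity, lines of code, past defect count]
    features = [[12, 340, 2], [3, 80, 0], [25, 900, 5], [7, 150, 1]]
    had_defect = [1, 0, 1, 0]  # whether a defect was later found in the module

    model = LogisticRegression().fit(features, had_defect)

    # Score new (hypothetical) modules and flag those above a review threshold.
    new_modules = {"nav_filter.c": [18, 620, 3], "telemetry.c": [4, 95, 0]}
    for name, metrics in new_modules.items():
        risk = model.predict_proba([metrics])[0][1]
        if risk > 0.5:  # illustrative threshold
            print(f"{name}: predicted defect risk {risk:.2f} - prioritize for review")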

See also SWE-146 - Auto-generated Source Code, SWE-205 - Determination of Safety-Critical Software.

1.1.5 Defining Quality Criteria

It is important to establish clear quality criteria and variables to assess the AI model's performance, quality, risk, security, maintainability, and compliance with requirements. The SA process should confirm that these criteria are defined and that the project works to continuously improve quality against all of them.
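
For instance, quality criteria can be captured as explicit, machine-checkable thresholds that SA can confirm and track over time. The specific metrics and values below are illustrative.

    # Illustrative quality criteria for an AI model; actual values are project-specific.
    QUALITY_CRITERIA = {
        "accuracy": 0.97,            # minimum acceptable accuracy on the validation set
        "false_negative_rate": 0.02, # maximum acceptable miss rate
        "max_inference_ms": 50.0,    # maximum acceptable latency per inference
        "open_security_findings": 0, # unresolved high-severity security findings
    }

    def evaluate_against_criteria(measured: dict) -> dict:
        """Compare measured results to the defined criteria and report failures."""
        failures = {}
        if measured["accuracy"] < QUALITY_CRITERIA["accuracy"]:
            failures["accuracy"] = measured["accuracy"]
        if measured["false_negative_rate"] > QUALITY_CRITERIA["false_negative_rate"]:
            failures["false_negative_rate"] = measured["false_negative_rate"]
        if measured["max_inference_ms"] > QUALITY_CRITERIA["max_inference_ms"]:
            failures["max_inference_ms"] = measured["max_inference_ms"]
        if measured["open_security_findings"] > QUALITY_CRITERIA["open_security_findings"]:
            failures["open_security_findings"] = measured["open_security_findings"]
        return failures  # an empty dict means all criteria are met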

See also SWE-034 - Acceptance Criteria

1.1.6 Balancing Quality Goals

While the goal is to maximize code quality, projects must also consider factors like features, costs, and schedule. The SA process should help identify the critical vulnerabilities to address based on objectives, risks, and schedules.

See also SWE-033 - Acquisition vs. Development Assessment, SWE-151 - Cost Estimate Conditions, and SWE-086 - Continuous Risk Management

1.1.7 Ensuring Data Security And Privacy

Protecting sensitive data used to train and operate the AI model is important. Appropriate access controls and security measures must be in place. SA should confirm that the project has the proper security controls and configuration management in place.

See also SWE-156 - Evaluate Systems for Security Risks

1.1.8 Addressing Bias Considerations

One of the most significant challenges in AI software assurance is addressing bias. Biased training data can lead to incorrect outcomes. Software assurance should look at how the project identifies and mitigates bias in the data and the algorithms. SA should also confirm that ongoing monitoring and adjustment are used to maintain objectivity in AI applications.
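
A simple, commonly used check compares outcome rates across groups in the data or in the model's predictions. The sketch below computes a demographic-parity gap; the group labels, sample data, and acceptable gap are illustrative.

    import numpy as np

    def demographic_parity_gap(predictions: np.ndarray, groups: np.ndarray) -> float:
        """Largest difference in positive-prediction rate between any two groups."""
        rates = [predictions[groups == g].mean() for g in np.unique(groups)]
        return max(rates) - min(rates)

    # Hypothetical example: 1 = favorable outcome, groups "A" and "B".
    preds = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    grps = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])
    gap = demographic_parity_gap(preds, grps)
    if gap > 0.1:  # illustrative tolerance
        print(f"Potential bias: positive-rate gap of {gap:.2f} between groups")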

1.1.9 Requirements Verification And Validation

AI applications introduce the unique challenge of nondeterministic behavior: the ability to verify and validate the behavior of AI systems may not be predictably repeatable. SA should work to develop verification and validation criteria that address the probabilistic nature of AI systems and establish qualitative measures for validation and verification test acceptance, pass, and fail criteria.
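
One way to handle this is to define acceptance criteria statistically, for example requiring a minimum pass rate over repeated trials with a confidence bound. The sketch below uses a normal approximation to the binomial; the required rate, confidence level, and trial counts are illustrative.

    import math

    def passes_statistical_acceptance(num_passed: int, num_trials: int,
                                      required_rate: float = 0.99,
                                      z: float = 1.645) -> bool:
        """Accept only if the lower one-sided 95% confidence bound on the pass
        rate meets the required rate (normal approximation to the binomial)."""
        observed = num_passed / num_trials
        margin = z * math.sqrt(observed * (1 - observed) / num_trials)
        return (observed - margin) >= required_rate

    # Example: 996 of 1000 repeated inference trials met the expected behavior.
    print(passes_statistical_acceptance(996, 1000))  # True: lower bound is about 0.993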

See also SWE-066 - Perform Testing, SWE-068 - Evaluate Test Results, SWE-070 - Models, Simulations, Tools, and SWE-193 - Acceptance Testing for Affected System and Software Behavior

1.2 Approach

In summary, software assurance for AI models requires a comprehensive approach that focuses on validating training data, continuous testing, leveraging AI/ML tools, defining quality benchmarks, balancing project goals and risks, ensuring data security and privacy, and addressing ethical considerations. This helps ensure the AI system is accurate, robust, secure, and compliant.

In the era of artificial intelligence, software quality assurance emerges as a linchpin for ensuring the reliability, accuracy, and ethicality of AI applications. By focusing on data quality as a foundational element, SA plays a pivotal role in mitigating challenges such as bias, ensuring transparency, and building trust in AI systems. As technology continues to advance, a proactive and adaptive approach to SA will be crucial in unlocking the full potential of AI while safeguarding against unintended consequences. Through continuous improvement, collaboration, and a commitment to ethical standards, the marriage of SA and AI promises a future where intelligent systems enhance our lives responsibly and reliably.

1.3 Additional Guidance

Links to Additional Guidance materials for this subject have been compiled in the Relevant Links table. Click here to see the Additional Guidance in the Resources tab.

2. Generative AI Metrics

Defining metrics for the complexity of generative AI models requires tailoring your evaluation to the unique characteristics of these systems—such as their architecture, size, computational requirements, capabilities, and usability. Unlike traditional software complexity metrics (e.g., cyclomatic complexity), generative AI complexity is often evaluated through model-specific engineering, mathematical, and operational characteristics. Below is a comprehensive guide to key metrics that you can use:

2.1 Architectural Complexity

Metrics:

  • Number of Layers: The depth or number of layers in the neural network architecture (e.g., 12 layers for GPT-2 small, 96 layers for GPT-3 175B).
  • Parameters: Total number of trainable parameters (e.g., billions for most large language models).
  • Attention Heads: In transformer models, attention heads drive complexity. Evaluate the number of heads per layer and their interactions.
  • Non-Linearity: Measure the types and number of activation functions (e.g., ReLU, GELU).

Why Important:

These metrics indicate the model's capacity to learn complex patterns. Larger and deeper architectures typically have higher expressivity but come with increased computational cost.
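
These quantities can often be read directly from a model's configuration and parameters. For example, assuming the Hugging Face transformers library and the public GPT-2 checkpoint are available, a quick inventory might look like this:

    from transformers import AutoModel

    model = AutoModel.from_pretrained("gpt2")  # the 124M-parameter GPT-2 checkpoint

    num_parameters = sum(p.numel() for p in model.parameters())
    config = model.config

    print(f"Layers:          {config.n_layer}")   # 12 for GPT-2 small
    print(f"Attention heads: {config.n_head}")    # 12 per layer
    print(f"Hidden size:     {config.n_embd}")    # 768
    print(f"Parameters:      {num_parameters / 1e6:.1f}M")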

2.2 Computational Complexity

Metrics:

  • Floating-Point Operations (FLOPs): The total number of computations required for training and inference.
  • Memory Requirements: GPU or RAM usage during training and inference—especially significant for deployment on constrained systems.
  • Inference Time: Latency in generating outputs. Faster inference models are considered less complex and more efficient.
  • Power Consumption: Energy required for training and inference, relevant for sustainable AI practices.

Why Important:

These metrics determine the model's scalability and operational costs for deployment and training. For example, models with high FLOPs and memory requirements are often harder to scale.
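
A first-order picture can be obtained by timing inference and estimating the memory footprint of the weights directly, as in the minimal PyTorch sketch below; tools such as those listed in Section 2.10 provide more detailed FLOP and memory accounting.

    import time
    import torch

    def inference_latency_ms(model: torch.nn.Module, example_input: torch.Tensor,
                             runs: int = 20) -> float:
        """Average wall-clock inference latency over several runs."""
        model.eval()
        with torch.no_grad():
            model(example_input)              # warm-up run
            start = time.perf_counter()
            for _ in range(runs):
                model(example_input)
            return (time.perf_counter() - start) / runs * 1000.0

    def parameter_memory_mb(model: torch.nn.Module) -> float:
        """Rough memory footprint of the model weights alone."""
        return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

    # Example with a small stand-in model:
    model = torch.nn.Linear(1024, 1024)
    print(inference_latency_ms(model, torch.randn(1, 1024)))
    print(parameter_memory_mb(model))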

2.3 Model Representational Complexity

Metrics:

  • Expressive Power: The ability of the model to learn and represent complex functions or dynamics.
  • Entropy of Outputs: Capturing the diversity and unpredictability of model outputs during inference.
  • Embedding Space Size: The dimensionality of the embeddings used internally (e.g., 768 for GPT-2 small, 12,288 for GPT-3 175B).

Why Important:

These metrics highlight how effectively the model can generalize across diverse tasks and inputs while maintaining rich representations.
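
For example, the entropy of the model's next-token distribution can be computed from its output logits; higher entropy indicates more diverse, less predictable outputs. The sketch below is a minimal NumPy implementation.

    import numpy as np

    def output_entropy(logits: np.ndarray) -> float:
        """Shannon entropy (in bits) of the probability distribution given by logits."""
        logits = logits - logits.max()              # for numerical stability
        probs = np.exp(logits) / np.exp(logits).sum()
        probs = probs[probs > 0]
        return float(-(probs * np.log2(probs)).sum())

    # A peaked distribution has low entropy; a flat one has high entropy.
    print(output_entropy(np.array([10.0, 0.0, 0.0, 0.0])))  # close to 0 bits
    print(output_entropy(np.zeros(4)))                      # 2.0 bits (uniform over 4)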

2.4 Training Complexity

Metrics:

  • Dataset Size: The volume of training data required (e.g., tokens or examples in billions for large models).
  • Training Iterations: Number of epochs or updates needed to achieve convergence.
  • Learning Rate Dynamics: The adaptation of learning rates during training, which impacts convergence speed.
  • Optimization Complexity: Evaluate the type of optimizer used (e.g., Adam vs. AdaFactor) and its configuration.

Why Important:

High training complexity can imply longer development times and greater hardware requirements to train a model properly.
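
A widely used rule of thumb estimates total training compute for transformer language models as roughly 6 × (number of parameters) × (number of training tokens) floating-point operations. The sketch below applies that approximation; the model size and token count are illustrative.

    def estimated_training_flops(num_parameters: float, num_training_tokens: float) -> float:
        """Rough training-compute estimate for transformer language models
        using the common ~6 * N * D approximation (forward plus backward pass)."""
        return 6.0 * num_parameters * num_training_tokens

    # Illustrative example: a 1-billion-parameter model trained on 100 billion tokens.
    flops = estimated_training_flops(1e9, 100e9)
    print(f"~{flops:.1e} FLOPs")  # ~6.0e+20 FLOPs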

2.5 Fine-Tuning and Adaptation Complexity

Metrics:

  • Number of Parameters Adapted: How much of the model can or must be fine-tuned for specific tasks (e.g., fine-tuning full models vs. adapter layers in PEFT [Parameter-Efficient Fine-Tuning]).
  • Data Requirements for Fine-Tuning: The amount of task-specific data required to adapt the model.
  • Domain Generalization: The model’s ability to generalize across new domains without full retraining.

Why Important:

Assessing fine-tuning complexity helps determine the model’s usability for downstream applications.
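
Fine-tuning cost can be made concrete by comparing trainable to total parameters after freezing the base model. The sketch below uses stand-in modules and a hypothetical adapter layer to show the calculation.

    import torch

    # Stand-in for a large pretrained model plus a small task-specific adapter.
    base_model = torch.nn.Sequential(torch.nn.Linear(1024, 1024), torch.nn.Linear(1024, 1024))
    adapter = torch.nn.Linear(1024, 16)

    for param in base_model.parameters():
        param.requires_grad = False  # freeze the pretrained weights

    total = sum(p.numel() for p in list(base_model.parameters()) + list(adapter.parameters()))
    trainable = sum(p.numel() for p in adapter.parameters())
    print(f"Trainable parameters: {trainable} of {total} ({100 * trainable / total:.2f}%)")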

2.6 Output Complexity

Metrics:

  • Sequence Length: The maximum number of tokens or characters the model can process or generate in a single inference step.
  • Coherence Score: How logically connected the outputs are over long sequences (subjective or algorithmic measures).
  • Temperature and Diversity: Configurations used during inference and their influence on creativity or randomness of generative outputs.

Why Important:

Output complexity impacts the quality and usability of generative and conversational results, especially for tasks requiring coherence, relevance, or creativity.
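
Temperature, for example, rescales the output distribution before sampling: lower temperatures concentrate probability on the most likely tokens, while higher temperatures increase diversity. The sketch below shows a minimal implementation with illustrative logits.

    import numpy as np

    def sample_with_temperature(logits: np.ndarray, temperature: float, rng=None) -> int:
        """Sample a token index from logits after temperature scaling."""
        rng = rng or np.random.default_rng()
        scaled = logits / temperature
        scaled = scaled - scaled.max()          # numerical stability
        probs = np.exp(scaled) / np.exp(scaled).sum()
        return int(rng.choice(len(probs), p=probs))

    logits = np.array([2.0, 1.0, 0.5, 0.1])
    # Low temperature -> nearly deterministic; high temperature -> more random.
    print(sample_with_temperature(logits, temperature=0.2))
    print(sample_with_temperature(logits, temperature=1.5))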

2.7 Interpretability Complexity

Metrics:

  • Explainability: How easy it is to understand the internal workings of the model (e.g., decision-making pathways or attention distributions).
  • Saliency Maps: Highlighting the parts of the input that most influence the outputs; a common basis for interpretability tools.
  • Layer Contribution Analysis: Understanding which layers contribute most to model performance.
  • Bias and Fairness Audits: The complexity of detecting and mitigating bias in the model outputs.

Why Important:

Interpretability metrics are crucial for ethical AI deployment and trust-building in sensitive applications.
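
A basic gradient-based saliency computation illustrates the idea: the magnitude of the gradient of the output with respect to each input feature indicates how strongly that feature influences the result. The PyTorch sketch below uses a stand-in model; libraries such as Captum provide more complete implementations.

    import torch

    # Stand-in model; in practice this would be the trained network under review.
    model = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))

    x = torch.randn(1, 8, requires_grad=True)   # a single input example
    output = model(x)
    output.backward()                           # gradient of the output w.r.t. the input

    saliency = x.grad.abs().squeeze()
    print("Most influential input feature:", int(saliency.argmax()))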

2.8 Real-World Deployment Complexity

Metrics:

  • Scalability: How easy it is to scale up or down the model architecture for different hardware configurations.
  • Latency: The time taken for the model to respond or process input in real-world usage scenarios.
  • API Complexity: The ease or difficulty of integrating the model into applications (e.g., REST APIs vs. custom libraries).
  • Security and Robustness: Complexity of ensuring the model is robust to adversarial attacks or misuse.

Why Important:

Deployment complexity plays a significant role in practical utility, customer satisfaction, and security of generative AI solutions.

2.9 Best Practices for Defining Metrics

  1. Task-Specific Design: Tailor metrics to your specific use case, whether it's text generation, image generation, or conversational AI.
  2. Benchmarking: Use standard benchmarks such as GLUE, SuperGLUE, BLEU, ROUGE, or human evaluation to assess performance alongside complexity.
  3. Holistic View: Combine several complexity metrics for a more complete picture (architectural, computational, and deployment complexity).
  4. Comparative Analysis: Compare your model against others (e.g., GPT, BERT, DALL-E) to contextualize complexity scores.

2.10 Tools and Frameworks for Complexity Evaluation

For example, tools like these can be used to evaluate computation-heavy components:

  • Weights & Biases (W&B): For tracking FLOPs, memory use, and other training metrics.
  • Hugging Face Benchmarking Tools: For evaluating inference performance.
  • Explainability Libraries: Captum, SHAP, or LIME for interpretability complexity.
  • Energy Usage Estimators: Like CodeCarbon, to assess power consumption.

By defining and measuring these complexity metrics, you can assess generative AI models more effectively, ensure performance optimization, and improve deployment decisions.


3. Resources

3.1 References

No references have currently been identified for this Topic. If you wish to suggest a reference, please leave a comment below.


3.2 Tools


Tools to aid in compliance with this SWE, if any, may be found in the Tools Library in the NASA Engineering Network (NEN). 

NASA users find this in the Tools Library in the Software Processes Across NASA (SPAN) site of the Software Engineering Community in NEN. 

The list is informational only and does not represent an “approved tool list”, nor does it represent an endorsement of any particular tool.  The purpose is to provide examples of tools being used across the Agency and to help projects and centers decide what tools to consider.


3.3 Additional Guidance

Additional guidance related to this requirement may be found in the following materials in this Handbook:

3.4 Center Process Asset Libraries

SPAN - Software Processes Across NASA
SPAN contains links to Center managed Process Asset Libraries. Consult these Process Asset Libraries (PALs) for Center-specific guidance including processes, forms, checklists, training, and templates related to Software Development. See SPAN in the Software Engineering Community of NEN. Available to NASA only. https://nen.nasa.gov/web/software/wiki

See the following link(s) in SPAN for process assets from contributing Centers (NASA Only). 

SPAN Links



3.5 Related Activities

This Topic is related to the following Life Cycle Activities:

Related Links