- Vishnu K
- May 22, 2024
Enhancing ISO26262 Functional Safety Validation: Challenges and Best Practices
Introduction to functional safety
Functional safety is a vital discipline in engineering, focusing on ensuring system safety in areas impacting human safety or the environment. It aims to prevent or minimize the consequences of system failures, faults, or errors leading to hazardous situations.
The complexity of modern systems in critical industries such as aerospace, automotive, and medical devices necessitates rigorous functional safety verification. This verification process ensures correct implementation of safety measures and addresses potential hazards before deployment. It involves comprehensive testing, analysis, and simulation to validate system behavior under normal and abnormal conditions.
Examples highlight the significance of functional safety across industries. For instance, in automotive applications, airbag systems must reliably deploy to protect occupants during accidents. Similarly, medical devices like pacemakers require continuous and reliable operation to avoid life-threatening consequences for patients.
In automotive engineering, functional safety is paramount due to the interconnected nature of vehicle systems. Electronic control units (ECUs) manage critical functions like braking and steering, necessitating seamless system interaction for safety and performance. Compliance with stringent safety standards, such as ISO 26262, is not only a regulatory requirement but also crucial for protecting lives on the road.
The ISO 26262 standard provides guidelines for assessing the severity of each hazardous situation and defines a safety rating system called the Automotive Safety Integrity Level (ASIL), which ranges from ASIL A (least stringent) to ASIL D (most stringent).
For ASIL B, the safety mechanisms (SMs) defined for a design should together be capable of detecting at least 90% of the single-point faults occurring in that design.
Fault injection
Fault injection is a technique used in functional safety verification to assess how a system responds to various faults or errors intentionally introduced into its operation. By simulating faults, engineers can evaluate the system’s robustness and its ability to detect, diagnose, and recover from potential failures.
ISO 26262 functional safety verification requires extensive fault injection campaigns and complex manual analysis.
Faults can be broadly classified as permanent or transient. A permanent fault persists until it is actively corrected or repaired. It can result from inherent functional errors in the design, component degradation, manufacturing defects, physical damage, electromagnetic interference, and similar causes. A transient fault is a temporary deviation from normal system behavior caused by external factors or temporary conditions. Unlike a permanent fault, a transient fault typically resolves on its own once the influencing factor diminishes. Typical causes are voltage spikes, electrostatic discharge, environmental factors, radiation or cosmic rays, and interference from nearby equipment.
A permanent fault can be simulated by forcing a node to 0 or 1, which gives the stuck-at-0 and stuck-at-1 fault models. A gate-level netlist is used for fault injection with the permanent fault model. For a transient fault, the signal is inverted and the modified value remains for a short time. Transient faults are classified into two types:
- Single event upset (SEU): the faulty value remains until a new value is assigned.
- Single event transient (SET): the faulty value is held for a specific period of time.
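The fault models above can be pictured on a cycle-by-cycle trace of a single-bit node. The following is a minimal Python sketch for illustration only; the function names and the per-cycle trace representation are assumptions made here and are not part of any fault simulator.

```python
from typing import List, Optional


def register_trace(writes: List[Optional[int]], init: int = 0,
                   seu_cycle: Optional[int] = None) -> List[int]:
    """Value held by a 1-bit register in each cycle.

    writes[i] is the value written in cycle i, or None to hold.
    If seu_cycle is given, the stored bit is flipped in that cycle
    (after any write) and stays flipped until the next write (SEU).
    """
    value = init
    trace = []
    for cycle, written in enumerate(writes):
        if written is not None:
            value = written          # a new assignment clears the upset
        if cycle == seu_cycle:
            value ^= 1               # single event upset: flip stored bit
        trace.append(value)
    return trace


def stuck_at(trace: List[int], value: int) -> List[int]:
    """Permanent fault: the node is forced to `value` in every cycle."""
    return [value] * len(trace)


def single_event_transient(trace: List[int], start: int, duration: int) -> List[int]:
    """SET: the node is inverted for `duration` cycles starting at `start`."""
    return [bit ^ 1 if start <= i < start + duration else bit
            for i, bit in enumerate(trace)]


writes = [1, None, None, 0, None, 1]
golden = register_trace(writes)                    # fault-free behavior
sa0 = stuck_at(golden, 0)                          # stuck-at-0
seu = register_trace(writes, seu_cycle=1)          # upset held until cycle 3 write
set_ = single_event_transient(golden, start=1, duration=1)
```

Comparing `seu` and `set_` against `golden` shows the difference between the two transient models: the SEU persists until the register is rewritten, while the SET disappears after its fixed duration.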
FCM flow
The Fault Campaign Manager (FCM) oversees the entirety of the fault injection campaign, managing every step from planning to execution. It relies on critical engines such as the Xcelium Fault Simulator (XFS) and the Jasper Functional Safety Verification App (FSV) to create a robust and thorough functional safety solution. These tools work together seamlessly within the FCM framework, allowing for comprehensive fault injection testing, data analysis, and reporting.
The FCM ensures that fault scenarios are configured accurately, simulations are run effectively, and results are analysed comprehensively to assess system resilience, fault tolerance, and safety mechanisms. By integrating these core engines, the FCM streamlines the end-to-end flow of functional safety verification, enabling engineers to validate safety requirements and enhance system reliability efficiently. The steps for FCM flow are given below:
- PREP: Create the campaign directory structure.
- O_EXEC, O_RANK: Execute all test cases from the user's list and rank them for fault injection (FI), based on the toggle coverage percentage contributed by each test case.
- G_ELAB: Elaborate the design with fault information, and create the Xcelium snapshot and fault database.
- FST: Reduce the fault space using testability analysis and cone of influence (COI).
- FSV_TC: Prune faults with constant analysis.
- F_EXEC, F_EXEC_C: Simulate each fault with the selected ranked test cases using the concurrent and serial engines.
- F_RANK: Generate the final report with fault details.
Challenges and Solutions
Using the FCM flow on bigger IPs (with large fault lists) poses several challenges. The challenges and effective solutions are described below:
- Addressing runtime issues
- Optimizing fault list
- Test case prioritization
- Utilizing engineering judgment
- Selection of window of opportunity
- Understanding tool limitations
Addressing runtime issues
Runtime issues fall into two categories:
- Huge runtime for complex designs (large fault lists)
- Handling non-simulatable (NS) faults
Huge runtime for complex designs
Designs containing fewer than 50,000 faults can complete the campaign relatively quickly.
However, designs exceeding 50,000 faults, such as hardware accelerator designs with over 1 million faults, will require more time. It is possible to reduce runtime by configuring parameters appropriately.
Grouping is a crucial parameter in fault injection campaigns. Consider a scenario with 5000 faults and 10 test cases, resulting in 50000 fault simulations (5000 faults * 10 test cases). Executing such a large number of simulations can significantly prolong the campaign duration. Grouping faults offers an effective solution to this challenge.
For instance, if we group 1000 faults per simulation, only 5 simulations will be necessary for each test case, reducing the total simulations to 50 for the entire fault injection campaign. This grouping strategy significantly decreases the campaign’s runtime. Moreover, if there are sufficient licenses, all 50 simulations can run concurrently, further reducing the overall runtime.
However, it is essential to note a potential drawback of grouping: simulating 1000 faults in a single simulation may take longer than simulating 1 fault per simulation. This trade-off between the number of faults per simulation and simulation runtime should be carefully considered based on the specific requirements and constraints of the fault injection testing process.
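The grouping arithmetic above can be checked with a few lines of Python. This is a rough sketch only: the function names are hypothetical, and the "waves" model (every simulation takes one license and all simulations in a wave run in parallel) is a simplifying assumption that ignores the longer runtime of larger groups.

```python
import math


def simulation_count(num_faults: int, num_tests: int, faults_per_group: int = 1) -> int:
    """Number of fault simulations when faults are grouped per simulation."""
    groups_per_test = math.ceil(num_faults / faults_per_group)
    return groups_per_test * num_tests


def waves(total_simulations: int, licenses: int) -> int:
    """Sequential rounds needed if `licenses` simulations can run at once."""
    return math.ceil(total_simulations / licenses)


ungrouped = simulation_count(5000, 10)                       # 1 fault per simulation
grouped = simulation_count(5000, 10, faults_per_group=1000)  # 1000 faults per simulation
rounds = waves(grouped, licenses=50)                         # all grouped runs in parallel
```

With the numbers from the text, `ungrouped` is 50000, `grouped` is 50, and 50 licenses are enough to run every grouped simulation in a single round.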
We need to determine the optimal value for fault grouping based on the number of faults, test cases, and available fault simulator licenses. Conducting experiments to find the most effective grouping value is crucial. Below are the details of the experiments conducted and the findings regarding the best value for fault grouping.
Total faults: 100000
Total test cases: 50
- Exp1: FS_MAX_FAULTS_PER_GROUP = 200, FS_SERIAL_MAX_FAULTS_PER_GROUP = 1
- Exp2: FS_MAX_FAULTS_PER_GROUP = 10000, FS_SERIAL_MAX_FAULTS_PER_GROUP = 40
- Exp3: FS_MAX_FAULTS_PER_GROUP = 2000, FS_SERIAL_MAX_FAULTS_PER_GROUP = 40
Handling non-simulatable faults
A limitation exists with the concurrent engine regarding faults labelled as non-simulatable, particularly when these faults propagate through RTL constructs (e.g., behavioural memory code). These non-simulatable (NS) faults are executed by the serial engine, which can extend the runtime.
To mitigate this issue and improve runtime, reducing the number of NS faults is essential. One approach is to define a fault boundary, which delineates the extent of fault propagation. For example, when analysing a specific IP (Intellectual Property), it is beneficial to align the fault boundary with the IP's boundary or with the hierarchy where the checker strobes and functional strobes are located. This strategy effectively reduces non-simulatable faults, thereby decreasing the need for serial runs and optimizing overall runtime.
Optimizing fault list
After the simulation phase, numerous unobserved undetected (UU) faults may remain, that is, faults that are not observed at either the functional or the checker strobes. Generally, faults become UU for one of two reasons: either there are no test cases that exercise the fault path, or the fault itself is safe. To streamline fault simulation and reduce the total number of faults, identifying safe faults is crucial. This involves checking design constraints and coverage waivers to ensure that these faults do not propagate to functional strobes.
Additionally, in fault injection campaigns, some blocks may be instantiated multiple times. It is more efficient to run the campaign on one instance and extrapolate the results to the other instances, since fault analysis and propagation paths remain the same for all instances.
To manage exclusions or convert UU faults to safe faults effectively, we use the JasperGold Functional Safety Verification App (FSV). FSV classifies safe faults, which significantly reduces runtime. Within the Fault Campaign Manager (FCM), the FSV phase performs structural analysis, which includes:
- Out of COI analysis: Removes faults on diagnostic logic based on COI.
- Activatability analysis: Aids in removing tied-off logic.
- Propagability analysis: Waives off faults that cannot propagate to functional strobes based on user-defined assumptions.
Examples illustrating these analyses are provided below to demonstrate their efficacy in optimizing fault simulation and improving overall fault management in functional safety verification processes.
1. Since design for testability (DFT) logic is not covered by fault injection (FI), we need to include the following statement in the FSV Tcl file:
assume -env {DUT_WRAPPER … DFT_sen == 1'b0}
In this statement, DFT_sen is constrained to 0, so any stuck-at-0 (sa0) fault on this signal and on the signals driven by DFT_sen will be considered a safe fault.
2. To mask untargeted logic, you can utilize a barrier:
check_fsv -barrier -add {hierarchy}
This command will exclude faults on the specified node and its inputs.
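The idea behind the out-of-COI analysis listed earlier can be pictured as a reachability check on the netlist graph: a fault on a node that has no structural path to any functional strobe cannot be observed there. The sketch below is a conceptual illustration in Python, not how FSV is implemented; the toy netlist, the node names, and the function name are all hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


def cone_of_influence(fanout: Dict[str, List[str]], strobes: Iterable[str]) -> Set[str]:
    """Return every node that has a structural path to at least one strobe."""
    # Invert the fanout map so we can walk backwards from the strobes.
    fanin: Dict[str, Set[str]] = {}
    for src, dsts in fanout.items():
        for dst in dsts:
            fanin.setdefault(dst, set()).add(src)

    coi = set(strobes)
    queue = deque(coi)
    while queue:
        node = queue.popleft()
        for driver in fanin.get(node, ()):
            if driver not in coi:
                coi.add(driver)
                queue.append(driver)
    return coi


# Toy netlist: node -> nodes it drives (hypothetical names).
fanout = {
    "a": ["g1"],
    "b": ["g1", "dbg"],
    "g1": ["out"],
    "dbg": ["dbg_reg"],   # debug-only logic, never reaches "out"
}
coi = cone_of_influence(fanout, strobes=["out"])
all_nodes = set(fanout) | {dst for dsts in fanout.values() for dst in dsts}
out_of_coi = sorted(all_nodes - coi)   # candidates for pruning from the fault list
```

In this toy example, `dbg` and `dbg_reg` fall outside the cone of the functional strobe `out`, so faults on them would not need to be simulated against that strobe.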
Test case prioritization
The lack of stimulus can result in numerous UU faults in the Fault Campaign Manager (FCM) campaign. Before initiating fault injection (FI) activities, it’s crucial to ensure that the available test cases provide full coverage of the fault target, especially in terms of toggle coverage. Often, there are many redundant test cases that increase the number of fault runs and consequently, the runtime.
To address this issue, we need to create a targeted set of test cases that offer maximum coverage, thereby reducing the number of fault runs. The FCM flow includes an optional phase for test case selection, where tests with higher coverage are chosen for the fault injection campaign. The test case ranking phase prioritizes test cases based on their fault coverage, from highest to lowest. Additionally, the test drop feature, when used in conjunction, can significantly enhance efficiency and reduce runtimes.
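One common way to build such a targeted test set is a greedy ranking by incremental coverage: repeatedly pick the test that toggles the most not-yet-covered nodes, and drop tests that add nothing. The Python sketch below illustrates that idea under the assumption that per-test toggle coverage is available as a set of node names; it is not the ranking algorithm used by FCM, and the test and node names are made up.

```python
from typing import Dict, List, Set, Tuple


def rank_tests(toggle_cov: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Greedy ranking: returns (test, newly covered nodes) in selection order.

    Tests that contribute no new toggle coverage are left out entirely.
    """
    remaining = dict(toggle_cov)
    covered: Set[str] = set()
    ranked: List[Tuple[str, int]] = []
    while remaining:
        # Iterate in sorted order so ties are broken deterministically by name.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        gain = len(remaining[best] - covered)
        if gain == 0:
            break                      # everything left is redundant
        covered |= remaining.pop(best)
        ranked.append((best, gain))
    return ranked


toggle_cov = {
    "t_basic": {"n1", "n2", "n3"},
    "t_irq": {"n3", "n4"},
    "t_dup": {"n1", "n2"},             # fully covered by t_basic
}
ranking = rank_tests(toggle_cov)
```

Here `t_basic` is selected first, `t_irq` adds one new node, and `t_dup` is dropped as redundant, so only two of the three tests would be used for fault runs.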
Utilizing engineering judgment
After applying all assumptions and barriers, if fault coverage remains insufficient, manual classification of faults with proper analysis and justification is necessary. Engineering Judgment (EJ) involves a set of rules (Unobserved Safeness Factor and Unobserved Detection Factor) for classifying unobserved faults as safe or detected. Justification is based on functional coverage (FC) and diagnostic coverage (DC).
UOSF (Unobserved Safeness Factor) and UODF (Unobserved Detection Factor) are calculated based on:
Functional Coverage (FC): Indicates the effectiveness of workload for branch, toggle, or expression coverage. Higher FC implies well-simulated design, leading to a high UOSF (Unobserved Safeness Factor). Depending on the type of safety mechanism, high FC may also indicate that unobserved faults could potentially be detected (high UODF).
The calculation strategy for UOSF and UODF is as follows:
Measured dangerous = dangerous undetected (DU) + dangerous detected (DD)
In summary, EJ involves evaluating unobserved faults using UOSF and UODF, with justification derived from functional and diagnostic coverage. Higher FC contributes to a higher UOSF and potentially a higher UODF, indicating the safety and detectability of unobserved faults by safety mechanisms.
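To make the role of the two factors concrete, the sketch below apportions a pool of unobserved faults in Python. Only the relation "measured dangerous = DU + DD" comes from the text above. Treating UOSF and UODF as fractions between 0 and 1, applying UOSF to all unobserved faults and UODF to the remainder, and the fault counts themselves are assumptions made purely for illustration.

```python
from typing import Dict


def engineering_judgment(dd: int, du: int, unobserved: int,
                         uosf: float, uodf: float) -> Dict[str, float]:
    """Apportion unobserved faults using UOSF and UODF (assumed fractions)."""
    if not (0.0 <= uosf <= 1.0 and 0.0 <= uodf <= 1.0):
        raise ValueError("UOSF and UODF must lie in [0, 1]")

    measured_dangerous = dd + du                     # from the text: DU + DD
    assumed_safe = uosf * unobserved                 # assumption: safe share
    assumed_dangerous = unobserved - assumed_safe    # assumption: the rest
    assumed_detected = uodf * assumed_dangerous      # assumption: detected share

    total_dangerous = measured_dangerous + assumed_dangerous
    detected_fraction = ((dd + assumed_detected) / total_dangerous
                         if total_dangerous else 1.0)
    return {
        "measured_dangerous": measured_dangerous,
        "assumed_safe": assumed_safe,
        "assumed_detected": assumed_detected,
        "detected_fraction": detected_fraction,
    }


# Hypothetical campaign numbers, not taken from any real design.
result = engineering_judgment(dd=800, du=50, unobserved=150, uosf=0.8, uodf=0.5)
```

Under these assumptions, a higher UOSF moves more unobserved faults into the safe bucket, and a higher UODF raises the detected fraction of the remaining ones, which mirrors the qualitative relationship described above.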
Selection of window of opportunity
Transient analysis is used to examine faults that exist for a short period, known as transient faults. The time window during which faults are activated for propagation is termed the window of opportunity (WoO). Analysing the WoO for each node is critical yet complex, requiring a thorough review of waveforms and a deep understanding of data flow.
Sync events can aid in this process by serving as inputs to the Xcelium Fault Set Generator (XFSG) tool. By using clock signals as sync events, faults are injected immediately after the positive edge of the sync event signal. Identifying clocks in the design and utilizing them as sync events is essential since all flops in the design are activated based on different clock inputs. The XFSG tool takes a timing configuration file and waveform data (.shm file) as inputs to generate the fault list, considering the sync event timings.
For the initial fault injection (FI) run, the start time (after reset de-assertion), the end time, and the time interval between fault injections can be specified. The fault generator injects the same fault in different time windows based on these parameters. Analysing the campaign output provides insights into which time windows activate or deactivate the most faults. This data aids in further analysis and minimizes the number of UU faults. Subsequent analysis focuses on identifying and understanding the WoO for the remaining UU faults.
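The start, end, and interval parameters described above amount to generating a list of candidate injection times and then aligning each one with a positive edge of the sync clock. The following Python sketch illustrates that calculation; it does not reproduce the XFSG timing configuration format, and the function names, time units, and clock period are hypothetical.

```python
import bisect
from typing import List


def injection_times(start: int, end: int, interval: int) -> List[int]:
    """Candidate injection times from `start` to `end` (inclusive)."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    return list(range(start, end + 1, interval))


def snap_to_sync_edges(times: List[int], posedges: List[int]) -> List[int]:
    """Move each time to the first positive clock edge at or after it.

    `posedges` must be sorted. Times beyond the last edge are dropped,
    and duplicates are removed while keeping the original order.
    """
    snapped: List[int] = []
    for t in times:
        i = bisect.bisect_left(posedges, t)
        if i < len(posedges) and posedges[i] not in snapped:
            snapped.append(posedges[i])
    return snapped


# Hypothetical numbers: reset is released before t=100, clock period is 40.
requested = injection_times(start=100, end=500, interval=150)
clock_posedges = list(range(20, 600, 40))
aligned = snap_to_sync_edges(requested, clock_posedges)
```

In this example the requested times 100, 250, and 400 are aligned to the clock edges at 100, 260, and 420. Narrowing the interval in a later run around the windows that activated the most faults is one way to home in on the WoO.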
Understanding tool limitations
The sync event flow only supports clock inputs; other signals cannot be used as sync events.
The serial fault engine considers all faults for the campaign, whereas it should normally only consider NS (Non-Simulatable) faults. This results in longer runtimes due to a higher number of fault simulations.
The FCM flow misbehaves when adding test cases supported by analog models, leading to fault hierarchy being skipped.
Summary and future enhancements
Completing fault injection and achieving comprehensive diagnostic coverage on complex IPs within tight timelines presents considerable challenges. Our experiments and observations detailed above are aimed at overcoming these hurdles by optimizing runtimes and simplifying the analysis process.
The solutions discussed are not only applicable to the specific scenarios outlined but can be extrapolated to address similar challenges encountered with any IP. By implementing the strategies outlined, analysts can navigate through the complexities of fault injection and diagnostic coverage with greater ease and efficiency.
Looking ahead, there is potential for further enhancements in fault simulator tools. Developing a refined workflow that specifically targets UU faults for future campaigns, while excluding DD, DU, and safe faults, holds promise in reducing the overall fault set, streamlining runtimes, and facilitating more straightforward analysis processes.
Moreover, emphasizing the enhancement of FSV capabilities, especially in terms of visualizing stimulus to cover corner cases, is essential for ensuring a thorough fault analysis and achieving comprehensive diagnostic coverage across various IP designs.
In conclusion, by leveraging the insights and strategies discussed in this blog, the problem statement can be simplified, ultimately leading to more robust and reliable IP designs.
References
- Fault Campaign Manager User Guide
- Xcelium Fault Simulator User Guide
- ISO 26262-1:2018 – Road vehicles — Functional safety
- Felipe Augusto da Silva, Ahmet Cagri Bagbaba, Said Hamdioui, Christian Sauer, Efficient Methodology for ISO26262 Functional Safety Verification, 2019 IEEE 25th International Symposium on On-Line Testing and Robust System Design (IOLTS), July 2019