The operation of complex systems such as nuclear power plants calls for advanced diagnostic capabilities, particularly as part of a broader autonomous operation framework. However, while the demand for diagnostics is driven by the need to achieve cost efficiency and enhanced responses, a diagnostic tool’s effectiveness largely depends on the ability of operators to understand and trust the information presented. In safety-critical environments like nuclear power plants, where operators must make informed decisions, that understanding and trust are of paramount importance. It is not sufficient to be told that something is wrong; operators need to understand why and how it is wrong in order to take the most effective corrective action. A paper by Akshay J. Dave, Tat Nghia Nguyen, and Richard B. Vilim of the Nuclear Science and Engineering Division of Argonne National Laboratory explores this theme in a project supported by the US Department of Energy, Office of Nuclear Energy. Their analysis concludes that two requirements must be met to address this problem: first, the provision of a diagnostics capability and, second, the use of a computational agent that can explain the diagnoses that emerge. Their paper, titled ‘Integrating LLMs for Explainable Fault Diagnosis in Complex Systems’, presents a system incorporating both aspects for explainable fault diagnostics.
While the concept of explainability lacks a universally agreed-upon definition, their work employs a physics-based diagnostic model to derive the causal relationships between potential faults and fault symptoms. This approach, the authors say, allows precise rationales for any offered diagnoses.
Creating a diagnostic framework
One of the diagnostic framework’s challenges is constructing a physics-based model of the target system. At the component level, approximations of the physics, including the mass, momentum, and energy conservation equations, are utilised to formulate the component models. Each component model may contain several model parameters to be determined by fitting against training data. The model training process for each model has a minimum sensor set requirement. To allow the construction of diagnostic models for maximum coverage, the concept of virtual sensors was also introduced to be used in place of missing sensors for model training.
In this case a physics-based model was developed and implemented in ANL’s Parameter-Free Reasoning Operator for Automated Identification and Diagnosis (PRO-AID), which performs real-time monitoring and diagnostics for an engineering system. The framework relies on normal (fault-free) models of a system such that any anomalies due to faults can be detected from the inconsistencies between the normal behaviours predicted by the models and observed data. When the system is fault-free, the observed data must satisfy relations imposed by the fault-free system model. Such constraint relations between observations and the expected normal behaviours are formally defined as analytical redundancy relations (ARRs). Each analytical redundancy relation may involve a subset of sensor data and certain parts of the system. In PRO-AID, each ARR is represented by an equation establishing the constraint among the involved sensor readings. The difference between the two sides of the ARR equation is defined as the residual. A non-zero residual indicates a violation of the ARR, implying that at least one of the involved sensors or parts of the system is in a fault state.
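To make the idea of a residual concrete, the sketch below evaluates one ARR for a generic counterflow heat exchanger: the heat given up by the hot stream should equal the heat gained by the cold stream. The equation, the class and function names, and the specific-heat value are illustrative assumptions and are not taken from PRO-AID.

```python
from dataclasses import dataclass

# Illustrative placeholder for the specific heat of the fluid, J/(kg K).
# This is an assumed value, not a PRO-AID model parameter.
CP_FLUID = 1270.0


@dataclass
class ExchangerReadings:
    """Hypothetical sensor readings for one heat exchanger."""
    flow_kg_s: float   # same mass flow assumed on both sides
    t_hot_in: float    # degrees C
    t_hot_out: float
    t_cold_in: float
    t_cold_out: float


def heat_balance_residual(r: ExchangerReadings) -> float:
    """Residual of the ARR 'heat lost by hot side = heat gained by cold side'.

    Returns left-hand side minus right-hand side, in watts.
    """
    heat_lost = r.flow_kg_s * CP_FLUID * (r.t_hot_in - r.t_hot_out)
    heat_gained = r.flow_kg_s * CP_FLUID * (r.t_cold_out - r.t_cold_in)
    return heat_lost - heat_gained


healthy = ExchangerReadings(0.1, 300.0, 200.0, 150.0, 250.0)
# Same plant state, but the hot-side outlet sensor reads 10 degrees C high.
biased = ExchangerReadings(0.1, 300.0, 210.0, 150.0, 250.0)

print(heat_balance_residual(healthy))
print(heat_balance_residual(biased))
```

With consistent readings the two sides cancel; a bias on any one of the involved sensors leaves a non-zero difference, which is the symptom the diagnostic reasoning works from.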
The set of model residuals forms the basis for fault diagnostics. Each non-zero residual serves as a fault symptom that implicates certain system faults. That set of relevant faults – the possible causes of the non-zero residuals – can be identified from the underlying ARR. Based on the set of all the residuals generated for a system, various reasoning approaches, including both deterministic and probabilistic reasoning, can be employed to produce fault diagnoses. Faults may be detected and diagnosed at a given time based on observed symptoms. In the presence of measurement and modelling uncertainty, a statistical tool may be needed to determine whether a residual is (statistically) non-zero.
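The article does not say which statistical tool PRO-AID uses. As one possible illustration, the sketch below applies a simple z-test to the mean of a window of residual samples, assuming the noise standard deviation is known from fault-free data. The function name, the threshold, and the sample values are assumptions.

```python
import math
import statistics


def residual_is_nonzero(window, noise_sigma, z_crit=3.0):
    """Return True if the window mean differs from zero beyond noise.

    window      : recent residual samples
    noise_sigma : standard deviation of the residual when fault-free (assumed known)
    z_crit      : critical z value; 3.0 is an arbitrary illustrative choice
    """
    if not window:
        raise ValueError("window must contain at least one sample")
    if noise_sigma <= 0:
        raise ValueError("noise_sigma must be positive")
    mean = statistics.fmean(window)
    # Standard error of the mean shrinks as more samples accumulate.
    standard_error = noise_sigma / math.sqrt(len(window))
    return abs(mean) / standard_error > z_crit


noise_only = [0.1, -0.1, 0.05, -0.05]
with_offset = [1.0, 1.1, 0.9, 1.0]

print(residual_is_nonzero(noise_only, noise_sigma=0.1))
print(residual_is_nonzero(with_offset, noise_sigma=0.1))
```

Because the test needs enough samples to separate a genuine offset from noise, a scheme of this kind naturally introduces a short delay between the onset of a fault and the moment the residual is flagged.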
To address the objective of integrating LLMs with PRO-AID, a system was designed consisting of four major components:
- Diagnostics Agent: A large language model that has been engineered to contextualise the plant and data generated by PRO-AID, then query the symbolic engine if additional information is needed by the operator.
- Symbolic Engine: A graph of information that forms the basis of the plant knowledge available to the LLM. Provides various functions to the LLM to query PRO-AID or Plant data.
- PRO-AID: A state-of-the-art monitoring & diagnostics tool. Component and sensor faults are determined by physics-based models that are calibrated with plant data.
- Plant: The physical system that is monitored using sensors. The system is represented digitally via a graph generated from the Piping and Instrumentation Diagram (P&ID).
The explainability of the PRO-AID diagnostic results stemmed from the use of physics-based analytical models. The causal relations between potential faults in the system and possible fault symptoms (non-zero residuals) are derived from the physics-based ARR and stored within PRO-AID. Fault diagnoses are obtained by logical inference based on observed symptoms, knowing the possible causes of each symptom. In the reverse direction, the fault diagnosis can be presented along with the observed symptoms and explained intuitively by causality. Forward chaining can be used by an operator to test that a diagnosis given by the algorithm is logically consistent with the symptoms that led to the diagnosis. The natural tendency of an operator is to do just that and this approach can be facilitated by making available the information an operator will use to check for logical consistency.
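The consistency check described above can be sketched as follows. Given a table of which faults can cause which residuals, forward chaining from a proposed diagnosis predicts the residuals that should be active, and the prediction is compared with what was observed. The table contents and all names here are invented for illustration; they do not reproduce the causal relations stored in PRO-AID.

```python
# Hypothetical causal table: residual -> faults that can make it non-zero.
CAUSES = {
    "r1": {"sensor_A", "component_X"},
    "r2": {"sensor_A", "sensor_B"},
    "r3": {"sensor_B"},
}


def predicted_symptoms(fault, causes=CAUSES):
    """Forward-chain from a fault to the residuals it would activate."""
    return {residual for residual, faults in causes.items() if fault in faults}


def is_consistent(fault, observed_active, causes=CAUSES):
    """A diagnosis is consistent if it predicts exactly the observed symptoms."""
    return predicted_symptoms(fault, causes) == set(observed_active)


observed = {"r1", "r2"}
for candidate in ("sensor_A", "sensor_B", "component_X"):
    print(candidate, is_consistent(candidate, observed))
```

This sketch assumes a single fault and that every implicated residual actually fires; a real diagnostic engine would need to handle multiple simultaneous faults and uncertain symptoms.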

Using LLMs to explain results
In their research, the authors embed a large language model (LLM) machine learning agent inside a system to explain fault diagnoses to the operator. LLMs can perform various natural language processing tasks and have been embedded in various online portals as chatbots to accommodate arbitrary text-based interactions with humans. While any LLM could be embedded in the diagnostics framework, in this case GPT-4, developed by OpenAI, was used. Given that an LLM is pretrained, any external knowledge about the plant or diagnostics framework must be provided to it in the form of context. Carefully managing an LLM’s context is imperative to align it with the requisite knowledge for the specific task.
An acknowledged problem with LLMs is that their knowledge is limited to the data they have been trained on and they are also susceptible to “hallucinating” inaccuracies and presenting them as facts. Thus, the authors attempt to address the hallucination issues by constraining the output of the LLM.
To explain the PRO-AID diagnosis, the LLM needs the background information of the physical system and real-time updates of the observations and diagnostic results in PRO-AID. More specifically, the background information consists of an inventory of physical and virtual sensors, all possible sensor and component faults, and residuals generated for the system. For each residual, details on its dependency structure are also provided. The real-time updates from PRO-AID include recent sensor data, an updated list of zero and non-zero residuals (fault symptoms), and updated diagnostic results. The data exchange between PRO-AID and the LLM can be done periodically using data files produced by a PRO-AID output module.
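One way to picture this exchange is a function that combines the static background and the latest update into a single block of context text for the LLM. The sketch below is an assumption about how such context might be assembled; the field names, the instruction wording, and the JSON layout are hypothetical and do not describe the PRO-AID output files or the authors' prompts.

```python
import json


def build_context(background, update):
    """Combine static plant background and the latest update into one context string."""
    header = (
        "You explain fault diagnoses to a plant operator. "
        "Answer only from the data below. "
        "If the data does not contain the answer, say so."
    )
    return "\n\n".join([
        header,
        "BACKGROUND:\n" + json.dumps(background, indent=2, sort_keys=True),
        "LATEST UPDATE:\n" + json.dumps(update, indent=2, sort_keys=True),
    ])


# Hypothetical contents, for illustration only.
background = {
    "sensors": ["TC_A", "TC_B", "virtual_T1"],
    "faults": ["SensorFault-TC_A", "SensorFault-TC_B"],
    "residuals": {"r1": ["TC_A", "virtual_T1"], "r2": ["TC_A", "TC_B"]},
}
update = {
    "nonzero_residuals": ["r1", "r2"],
    "zero_residuals": [],
    "diagnosis": ["SensorFault-TC_A"],
}

print(build_context(background, update))
```

Restricting the model to the supplied data is one simple way of constraining its output; regenerating the context on each periodic update keeps the explanation aligned with the current diagnostic state.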
The METL experimental facility
To demonstrate the diagnostics system’s capabilities, the proposed framework was applied to the purification loop of the Mechanisms Engineering Test Loop (METL) liquid sodium facility at the Argonne National Laboratory. The sodium purification system consists of two electromagnetic (EM) pumps, an economiser, a cold trap, and a plugging meter. Each EM pump is equipped with a flow meter. The cold trap and the plugging meter are each equipped with a pair of pressure transducers at their inlet and outlet. Additionally, each piping segment of the system is equipped with a heater and a thermocouple (or two thermocouples for piping segments longer than 3 ft, about 1 m) mounted on the outer surface, under an insulation layer, for temperature control. During operation, the EM pump directs a small fraction of sodium from the main piping system to the cold trap, where the sodium flow is cooled to just above the freezing point. Due to the reduction of the solubility at lower temperatures, impurities precipitate out of the sodium flow and adsorb at the stainless-steel mesh filter within the jacket of the cold trap. An economiser acts as a counterflow heat exchanger to pre-cool the incoming sodium flow to the cold trap and pre-heat the purified outgoing sodium flow.
Since direct measurements of the sodium temperatures are not available, heat balance and heat transfer performance models of the economiser were developed using the thermocouples mounted on the outside piping surface at the economiser’s inlet and outlet ports, and were implemented in PRO-AID. These models and the associated virtual sensors allow PRO-AID to detect and differentiate between component faults within the economiser and the surrounding sensor faults.
The experiment involved injecting a fault in the TC 117 thermocouple at the economiser’s hot-side outlet: the METL operators added a 10.0°C bias to the sensor output. Several residuals were activated about a minute after the bias was injected; the delay arises because a residual must reach statistical significance before it is flagged, so that measurement noise alone does not trigger it. Using logical inference, the residual signature corresponds to a unique fault diagnosis, in this case a fault for sensor TC 117.
Fault deductions
Once the fault is introduced, the Diagnostics Agent employs logical reasoning to deduce the fault diagnosis and then shares this diagnosis with the user. Notably, it accurately identified a sensor fault in TC 117. Additionally, the agent indicated the unique set of sensors responsible for triggering the various residuals listed. This information aids operators in closely examining each implicated sensor. In the outputs the fault signature was identified as ‘F6’, which corresponds to the fault ‘SensorFault-economizer.hot:temp:out’.
In this system, each fault has a unique signature that is identified by a specific set of residuals. When a fault occurs, it triggers certain residuals while leaving others inactive. The active residuals form the fault signature. In this case the other potential faults (here F1, F2, F3, F4, F7, F8, F9) were ruled out because each would have triggered a different set of residuals from the observed signature, ‘F6’.
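The matching and exoneration steps can be sketched as a lookup against a table of signatures. The fault identifiers and residual sets below are invented for illustration and are not the METL signatures; the function name is likewise hypothetical.

```python
# Hypothetical signature table: fault id -> residuals it activates.
SIGNATURES = {
    "FA": {"r1", "r2"},
    "FB": {"r2", "r3"},
    "FC": {"r1"},
}


def match_signature(active, signatures=SIGNATURES):
    """Return the faults whose signature equals the active set, plus the reasons
    every other fault is ruled out."""
    active = frozenset(active)
    matches = []
    ruled_out = {}
    for fault, expected in signatures.items():
        expected = frozenset(expected)
        if expected == active:
            matches.append(fault)
        else:
            ruled_out[fault] = {
                # Residuals the fault should have triggered but did not.
                "expected_but_inactive": sorted(expected - active),
                # Residuals that fired but that the fault cannot explain.
                "active_but_unexpected": sorted(active - expected),
            }
    return matches, ruled_out


matches, ruled_out = match_signature({"r2", "r3"})
print(matches)
print(ruled_out)
```

Recording why each alternative fails, rather than only reporting the match, gives an explaining agent the material it needs to answer questions about exoneration.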
After being provided with a list of potentially faulty sensors, the operator can examine the actual data these sensors recorded. The function ‘query_sensor_data’ facilitates this by allowing access to the relevant sensor data such as the analysis for TC 117. The Agent highlights anomalies in both statistical and spectral metrics. To support this function, the system maintains a buffer of past sensor readings. An inbuilt function then analyses various metrics from the batched data, with the results interpreted by the Agent.
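The article names ‘query_sensor_data’ but does not describe its implementation. The sketch below is an assumed illustration of the general pattern only: a fixed-length buffer of past readings, from which simple statistical metrics and one spectral metric (the dominant frequency bin of a plain discrete Fourier transform) are computed. The class name, the choice of metrics, and the test signal are all hypothetical.

```python
import cmath
import math
import statistics
from collections import deque


class SensorBuffer:
    """Fixed-length buffer of past readings with summary metrics."""

    def __init__(self, maxlen=256):
        self._data = deque(maxlen=maxlen)  # oldest samples drop off automatically

    def append(self, value):
        self._data.append(float(value))

    def metrics(self):
        x = list(self._data)
        n = len(x)
        if n < 2:
            raise ValueError("need at least two samples")
        mean = statistics.fmean(x)
        std = statistics.stdev(x)
        centred = [v - mean for v in x]
        # Magnitudes of DFT bins 1..n//2 of the mean-removed signal.
        mags = []
        for k in range(1, n // 2 + 1):
            s = sum(c * cmath.exp(-2j * math.pi * k * i / n)
                    for i, c in enumerate(centred))
            mags.append(abs(s))
        dominant_bin = 1 + max(range(len(mags)), key=mags.__getitem__)
        return {"mean": mean, "std": std, "dominant_bin": dominant_bin}


buf = SensorBuffer(maxlen=64)
for i in range(64):
    # Synthetic reading: a 200 degree C baseline with a small oscillation.
    buf.append(200.0 + math.sin(2 * math.pi * 4 * i / 64))

print(buf.metrics())
```

A numeric summary of this kind is compact enough to pass to a language model as context, which can then describe in words what looks anomalous.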
To aid explainability, the operator must also be able to query the Diagnostics Agent. This capability is available through the ‘custom_query’ function; in this case the Agent was asked to explain the exoneration process and did so in the context of fault F6 being activated. There is an option to save the Agent’s output in the context buffer.
Smarter explanation
Explainability in diagnostic tools for complex systems is crucial, enabling operators to comprehend not only the presence of a fault but also its origins and implications. The authors argue that purely data-driven approaches may fall short in providing useful explainability, adding that physics-model-based diagnostic tools offer a more effective solution with their inherent causal relationship mapping. Incorporating an LLM enhances this by translating the technical details from the physics-based model into understandable explanations for operators, and accommodating arbitrary queries about the system. However, the authors are explicit that care must be taken to constrain the LLM to prevent the dissemination of inaccurate information. Nonetheless, using data from a liquid sodium facility, this study demonstrates that an AI diagnostic Agent can explain the relationships between faults and the sensors involved, respond to queries, and analyse historical sensor measurements to draw conclusions.