Last Updated May 10, 2026
Explainable AI and model interpretability address one of the central tensions in artificial intelligence systems: the need to combine predictive power with epistemic transparency, human understanding, auditability, and accountable decision-making. As machine learning models become more powerful, especially deep neural networks, ensemble systems, large language models, and complex pipeline-based architectures, their internal reasoning often becomes more difficult to inspect. This opacity creates practical, scientific, ethical, and institutional problems in domains where automated or AI-assisted decisions must be justified, contested, validated, and governed.
The central argument of this article is that explainability should be understood as a theory of governed understanding. A useful explanation is not merely a visual output, feature ranking, saliency map, generated rationale, or simplified story. It is an evidence structure that helps people inspect model behavior, evaluate uncertainty, identify spurious patterns, support human judgment, document risk, enable contestability, and connect technical systems to institutional accountability.
Interpretability is therefore not only a technical feature of a model. It is a systems requirement. A model may produce accurate predictions and still fail as an institutional tool if users cannot understand when to trust it, auditors cannot inspect its behavior, affected people cannot contest its conclusions, and system owners cannot identify whether the model is relying on meaningful structure or spurious correlations. Explainable AI sits at the intersection of machine learning, statistics, causal inference, decision theory, human-computer interaction, risk management, governance, and the philosophy of science.
This article treats explainable AI and model interpretability as an advanced topic within the Artificial Intelligence Systems knowledge series. It covers the black-box problem, interpretability as a systems property, intrinsic and post-hoc explanation, feature attribution, SHAP, LIME, local surrogate models, counterfactual explanations, causal versus associational explanation, deep learning interpretability, explanation stability, human usability, contestability, governance documentation, and explanation failure modes. Selected Python and R examples appear here, while the full GitHub repository contains expanded computational scaffolding for interpretable modeling, feature-attribution analysis, local surrogate explanations, counterfactual search, stability testing, explanation audit tables, SQL metadata, governance documentation, and reproducible outputs.
Why Explainability Matters
Explainability matters because artificial intelligence systems increasingly operate in environments where predictions must be understood, justified, reviewed, challenged, or corrected. A model used for image tagging, product ranking, or entertainment recommendation may create limited harm when its logic is opaque. A model used in medicine, finance, public administration, infrastructure, education, employment, scientific research, environmental monitoring, or legal decision support raises a deeper problem: people need to know not only what the system predicted, but why the system produced that output and whether the reasoning should be trusted.
The black-box problem is not simply that complex models are difficult to understand. The deeper issue is that opaque systems can concentrate power without producing reviewable evidence. They may rely on features that are statistically predictive but morally inappropriate, causally weak, institutionally biased, unstable under small changes, or impossible for affected people to contest. An explanation can therefore support more than curiosity. It can support debugging, safety, calibration, accountability, recourse, audit, scientific interpretation, and responsible deployment.
Explainability also matters because trust without understanding is fragile. Users may over-trust a model because it appears sophisticated, or under-trust it because they cannot see how it works. Both failures matter. Meaningful interpretability helps users know when a system is reliable, when it is uncertain, when it is outside its intended use, and when human judgment should override automated output.
\[
Prediction \neq Explanation
\]
Interpretation: A model can produce a prediction without providing a faithful, useful, causal, or contestable explanation of how that prediction was generated.
| System Context | Why Prediction Alone Is Not Enough | Explainability Question | Governance Concern |
|---|---|---|---|
| Healthcare | Clinical decisions require professional judgment, uncertainty review, and patient context. | Which factors drove the recommendation, and are they clinically meaningful? | Automation bias, patient safety, and accountability. |
| Finance | Scores can affect credit, insurance, fraud review, and economic opportunity. | Can the decision be explained, challenged, and corrected? | Fairness, transparency, and recourse. |
| Infrastructure | Risk scores may guide maintenance, inspection, and emergency response. | Does the explanation connect model output to physical mechanisms and sensor evidence? | Reliability, resilience, and operational safety. |
| Public administration | Automated support systems may affect access to services, benefits, or enforcement. | Can affected people understand why a decision occurred? | Due process, contestability, and institutional legitimacy. |
| Scientific workflows | Models may discover patterns that require interpretation, replication, and causal reasoning. | Does the model reveal structure or exploit artifacts? | Scientific validity and reproducibility. |
Note: Explainability becomes most important when model outputs affect people, institutions, infrastructure, knowledge claims, rights, resources, or public trust.
The Black-Box Problem in Modern AI
Modern machine learning systems often learn complex nonlinear relationships between inputs and outputs. A predictive model may classify images, rank applicants, forecast infrastructure risk, detect fraud, recommend medical follow-up, summarize documents, estimate environmental change, or support strategic decisions. In many of these systems, the model’s internal representation is difficult to inspect directly.
A simplified model can be written as:
\[
f_{\theta}: x \rightarrow \hat{y}
\]
Interpretation: A model \(f_{\theta}\), parameterized by \(\theta\), maps input data \(x\) to a prediction \(\hat{y}\). The black-box problem arises when this mapping is accurate but difficult for humans to inspect, justify, or contest.
The black-box problem becomes especially important when model outputs affect real-world choices. In a low-stakes setting, opacity may be tolerable if the model performs well and errors are harmless. In a high-stakes setting, opacity becomes a governance problem. A hospital, bank, infrastructure agency, court, school district, scientific laboratory, or public institution may need to know not only what the model predicted, but why it predicted it, whether the reasoning was valid, and whether the system should be trusted in a specific case.
Opacity also creates operational risk. If a model fails, system owners may not know whether the problem came from data drift, feature leakage, spurious correlation, biased training data, subgroup failure, adversarial manipulation, or misuse outside the intended domain. Without interpretability, failure analysis becomes slower, less reliable, and less accountable.
\[
High\ Accuracy + Low\ Inspectability \rightarrow Governance\ Risk
\]
Interpretation: A highly accurate model can still be risky when its reasoning cannot be inspected, validated, monitored, or challenged in the decision context where it is used.
Interpretability as a System Property
Interpretability is often discussed as though it belongs only to a model. In practice, interpretability belongs to the full AI system. The explanation a user receives depends on the model, data, interface, documentation, workflow, user training, governance process, and institutional purpose of the system.
A technical explanation may be mathematically valid but useless to a frontline decision-maker. A simplified explanation may be understandable but misleading. A local explanation may help explain one prediction but fail to reveal global failure patterns. A feature attribution score may identify which variables influenced the model but say nothing about whether those variables are morally appropriate, causally meaningful, or legally defensible.
Interpretability must therefore be evaluated through several questions:
- Model question: What does the model use to generate predictions?
- Data question: Are the features meaningful, reliable, and well measured?
- Causal question: Do the explanation patterns reflect mechanisms or only correlations?
- User question: Can the intended user understand and act on the explanation?
- Governance question: Can the explanation support audit, contestability, and accountability?
- Reliability question: Does the explanation remain stable across similar cases?
This systems view prevents explainability from becoming decorative transparency. An explanation should not merely make a black box appear more understandable. It should improve the system’s capacity for responsible use, review, correction, and governance.
| Layer | Interpretability Requirement | Practical Question | Risk if Ignored |
|---|---|---|---|
| Model | The model’s behavior must be inspectable or explainable. | What patterns does the model rely on? | Spurious logic remains hidden. |
| Data | Features, labels, and provenance must be documented. | Are inputs meaningful, valid, and appropriate? | Explanations may legitimize bad data. |
| Interface | Explanations must be usable by the intended audience. | Can users understand and act on the explanation? | Technical transparency fails in practice. |
| Workflow | Explanations must fit decision roles and escalation paths. | When should a human override or investigate? | Oversight becomes symbolic. |
| Governance | Explanation artifacts must support review and contestation. | Can decisions be audited and challenged? | Transparency becomes decorative rather than accountable. |
Note: Explainability should be designed for the full decision system, not only for the model artifact.
\[
Interpretability = Model\ Evidence + User\ Understanding + Governance\ Use
\]
Interpretation: Interpretability becomes meaningful when technical explanations support human understanding and institutional action.
A Taxonomy of Explainability Methods
Explainability methods can be organized along several dimensions. Each dimension answers a different question about what kind of understanding is being sought. Some methods explain the overall model. Others explain a single prediction. Some models are interpretable by design. Others require post-hoc explanation. Some explanations focus on feature importance. Others focus on examples, concepts, counterfactuals, or causal mechanisms.
| Dimension | Type | Question Answered | Example Methods |
|---|---|---|---|
| Scope | Global explanation | How does the model behave overall? | Decision trees, global feature importance, partial dependence, accumulated local effects. |
| Scope | Local explanation | Why did the model make this prediction? | LIME, SHAP, local surrogate models, counterfactuals. |
| Model relation | Intrinsic interpretability | Is the model transparent by design? | Linear models, sparse rule lists, small decision trees, generalized additive models. |
| Model relation | Post-hoc explanation | Can an opaque model be explained after training? | SHAP, LIME, permutation importance, saliency maps. |
| Target | Feature-level explanation | Which input variables mattered? | Attribution scores, permutation importance. |
| Target | Example-level explanation | Which cases are most similar or influential? | Nearest neighbors, prototypes, criticisms, influence functions. |
| Target | Concept-level explanation | Which human concepts are represented? | TCAV, concept activation vectors. |
| Logic | Counterfactual explanation | What would need to change for a different outcome? | Minimal recourse, contrastive explanations. |
| Logic | Causal explanation | What intervention would change the outcome? | Causal graphs, structural causal models, do-calculus. |
Note: No single explanation method solves interpretability. Different methods serve different users, risks, and governance purposes.
A risk review may require global model understanding. A contested decision may require a local explanation. A scientific application may require causal reasoning. A user interface may require simplified explanations that support correct action without overwhelming the user. Explainable AI should therefore be treated as a toolkit, not a single technique.
Intrinsic Interpretability and Post-Hoc Explanation
Intrinsic interpretability refers to models whose structure is understandable by design. A sparse linear model, small decision tree, rule list, scoring system, or carefully constrained generalized additive model can often be inspected directly. The advantage is that explanation is built into the model rather than approximated after the fact.
Post-hoc explanation refers to techniques applied after a model has been trained. These methods attempt to explain a complex model without replacing it. Post-hoc explanation is especially common for random forests, gradient boosting machines, deep neural networks, and large AI systems whose internal structure is too complex for direct human inspection.
The distinction matters because post-hoc explanations may not perfectly reflect the model’s actual logic. They may approximate behavior locally, highlight associations, or produce unstable explanations under small perturbations. Intrinsic interpretability can be preferable in high-stakes settings when a simpler transparent model performs sufficiently well. But complex models may still be justified when they produce substantially better performance, capture nonlinear structure, or operate in domains where direct interpretability is not feasible.
| Approach | Meaning | Strength | Limitation |
|---|---|---|---|
| Intrinsic interpretability | The model is understandable by design. | Explanation is part of the model structure. | May sacrifice performance when relationships are complex. |
| Post-hoc explanation | An opaque model is explained after training. | Can be applied to powerful black-box models. | May approximate rather than faithfully reveal model logic. |
| Transparent baseline | A simpler interpretable model is used for comparison. | Helps assess whether complexity is justified. | May miss nonlinear interactions or high-dimensional structure. |
| Hybrid approach | Complex models are paired with explanation and governance tools. | Balances performance with inspectability. | Requires careful validation of explanation quality. |
Note: The practical question is whether model complexity is justified by performance gains, stakes, explanation quality, monitoring capacity, and governance controls.
\[
Complexity\ Gain > Interpretability\ Cost
\]
Interpretation: In high-stakes settings, complex models should justify their opacity through meaningful performance gains and strong explanation, monitoring, and governance controls.
Feature Attribution, SHAP, and LIME
Feature attribution methods estimate how much each input variable contributed to a prediction. They are among the most widely used explainability tools because they produce intuitive outputs: a ranked list of features that pushed a prediction higher or lower.
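SHAP, or SHapley Additive exPlanations, connects feature attribution to Shapley values from cooperative game theory. Its strength is that it provides a unified additive framework for explaining model predictions. Its limitations include computational cost (exact Shapley values require evaluating a number of feature coalitions that grows exponentially with the number of features, so practical tools rely on approximations or model-specific shortcuts), sensitivity to the background distribution, feature-dependence assumptions, and possible confusion between association and causation.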
LIME, or Local Interpretable Model-Agnostic Explanations, explains a prediction by fitting a simpler interpretable model near a specific case. It perturbs the input, observes the black-box model’s behavior, weights nearby samples more heavily, and fits a local surrogate. LIME is flexible and intuitive, but its explanations can vary depending on sampling, distance metrics, local approximation quality, and the neighborhood definition around the case.
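The perturb, weight, and fit steps described above can be sketched in a few lines. This is a minimal illustration of the local-surrogate idea, not the LIME library itself: `black_box`, the Gaussian perturbation scale, and the exponential proximity kernel are all assumptions chosen for the example.

```python
import math
import random

def black_box(x):
    """Hypothetical opaque model: a nonlinear score over two features."""
    return 1.0 / (1.0 + math.exp(-(x[0] * x[0] - 2.0 * x[1])))

def solve_linear_system(a, b):
    """Solve a small dense system a @ w = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def local_surrogate(model, x0, n_samples=2000, scale=0.3, kernel_width=0.5, seed=0):
    """Fit a proximity-weighted linear model around x0.

    Returns [intercept, w_1, ..., w_d] for features centered at x0.
    """
    rng = random.Random(seed)
    d = len(x0)
    xtwx = [[0.0] * (d + 1) for _ in range(d + 1)]
    xtwy = [0.0] * (d + 1)
    for _ in range(n_samples):
        # 1. Perturb the case of interest.
        z = [xi + rng.gauss(0.0, scale) for xi in x0]
        # 2. Weight the sample by its proximity to x0.
        dist2 = sum((zi - xi) ** 2 for zi, xi in zip(z, x0))
        weight = math.exp(-dist2 / (kernel_width ** 2))
        # 3. Accumulate the weighted least-squares normal equations.
        row = [1.0] + [zi - xi for zi, xi in zip(z, x0)]
        y = model(z)
        for i in range(d + 1):
            xtwy[i] += weight * row[i] * y
            for j in range(d + 1):
                xtwx[i][j] += weight * row[i] * row[j]
    return solve_linear_system(xtwx, xtwy)

x0 = [1.0, 0.5]
coefs = local_surrogate(black_box, x0)
print("intercept and local weights:", [round(c, 3) for c in coefs])
```

Changing `seed`, `scale`, or `kernel_width` changes the fitted weights, which is the sampling and neighborhood sensitivity the paragraph above warns about.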
Both SHAP and LIME can be useful, but both require careful interpretation. A feature attribution score does not prove that a variable caused an outcome. It indicates how the model used information in a specific predictive context. In governance settings, feature attribution should be paired with data documentation, causal analysis, subgroup review, and explanation stability testing.
| Method | Core Idea | Useful For | Governance Caution |
|---|---|---|---|
| Permutation importance | Measures performance loss when a feature is shuffled. | Global feature importance. | Can be distorted by correlated features. |
| SHAP | Estimates feature contributions using Shapley-value logic. | Local and global additive explanations. | Depends on background distribution and feature-dependence assumptions. |
| LIME | Fits a local interpretable surrogate near one prediction. | Case-level explanation for black-box models. | Can be unstable under sampling or neighborhood choices. |
| Partial dependence | Shows average model response as one feature changes. | Global pattern inspection. | Can be misleading when features are dependent. |
| Accumulated local effects | Estimates feature effects while reducing extrapolation problems. | Feature effect analysis under dependence. | Still requires careful interpretation and domain review. |
Note: Feature attribution explains model behavior; it does not automatically explain the real-world causal process.
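The permutation importance row in the table above can be illustrated with a short sketch. The `model` function, the synthetic data, and the use of mean squared error as the performance measure are assumptions made for the example; a real audit would use the deployed model and a held-out dataset.

```python
import random

def model(row):
    """Hypothetical fitted model: strong use of feature 0, weak use of feature 1, none of feature 2."""
    return 3.0 * row[0] + 0.5 * row[1]

def mean_squared_error(rows, targets, predict):
    return sum((predict(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(rows, targets, predict, n_repeats=10, seed=0):
    """Average increase in MSE when a single feature column is shuffled."""
    rng = random.Random(seed)
    baseline = mean_squared_error(rows, targets, predict)
    importances = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the target
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mean_squared_error(shuffled, targets, predict) - baseline)
        importances.append(sum(increases) / n_repeats)
    return importances

# Synthetic, independent features (an assumption; correlated features distort the result).
data_rng = random.Random(42)
rows = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(500)]
targets = [3.0 * r[0] + 0.5 * r[1] + data_rng.gauss(0, 0.1) for r in rows]

importances = permutation_importance(rows, targets, model)
for j, value in enumerate(importances):
    print(f"feature {j}: MSE increase {value:.3f}")
```

Because the features here are generated independently, the ranking is easy to read. With correlated features, shuffling one column creates unrealistic rows, which is the distortion the table flags.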
Counterfactual and Contrastive Explanations
Counterfactual explanations ask how a case would need to change in order to receive a different prediction. Instead of explaining the entire model, they explain a decision through contrast: “This outcome occurred rather than that outcome because these conditions were different.”
Counterfactual explanations are especially useful when affected people need actionable recourse. A loan applicant, patient, student, infrastructure manager, or policy analyst may not need to understand every parameter in a model. They may need to know what would have changed the outcome and whether those changes are plausible, fair, and within their control.
Good counterfactual explanations should be:
- Valid: the counterfactual actually changes the model output;
- Close: the proposed change is minimal or reasonably small;
- Plausible: the counterfactual could occur in the real world;
- Actionable: the change concerns variables the affected person or system can influence;
- Ethically appropriate: the explanation does not suggest changing protected, immutable, or morally inappropriate attributes;
- Stable: similar cases should receive similar recourse guidance.
Counterfactual explanation connects explainability to institutional justice. A system that can explain why a decision occurred but provides no meaningful path to contest or correct it may still fail as an accountable decision system.
\[
Explanation + Recourse + Contestability = Accountable\ Decision\ Support
\]
Interpretation: In high-stakes systems, explanations should help affected people or responsible institutions understand, challenge, correct, or improve decisions.
Causal versus Associational Explanation
Most machine learning explanations are associational. They describe how features relate to predictions inside a model. Causal explanations ask a stronger question: what would happen if an intervention were made in the world?
\[
P(y \mid x) \neq P(y \mid do(x))
\]
Interpretation: Observing \(x\) is not the same as intervening to set \(x\). Associational explanations describe patterns in data; causal explanations require assumptions about mechanisms, interventions, and confounding.
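A small simulation can make the gap between observing and intervening concrete. The sketch below assumes a hypothetical binary confounder `z` that raises both the chance of `x` and the chance of `y`, while `x` itself has no causal effect on `y`. All probabilities are invented for illustration.

```python
import random

def simulate(n, intervene_x=None, seed=0):
    """Draw (x, y) pairs from a toy world where z causes both x and y."""
    rng = random.Random(seed)
    records = []
    for _ in range(n):
        z = rng.random() < 0.5                       # hidden confounder
        if intervene_x is None:
            x = rng.random() < (0.8 if z else 0.2)   # observational: x depends on z
        else:
            x = intervene_x                          # do(x): x is set from outside
        y = rng.random() < (0.7 if z else 0.1)       # y depends on z only, never on x
        records.append((x, y))
    return records

def prob_y_given_x(records, x_value):
    matching = [y for x, y in records if x == x_value]
    return sum(matching) / len(matching)

observational = simulate(100_000)
interventional = simulate(100_000, intervene_x=True, seed=1)

p_obs = prob_y_given_x(observational, True)   # estimates P(y = 1 | x = 1)
p_do = prob_y_given_x(interventional, True)   # estimates P(y = 1 | do(x = 1))
print(f"P(y | x=1)     ~ {p_obs:.3f}")
print(f"P(y | do(x=1)) ~ {p_do:.3f}")
```

Under these assumed probabilities the analytic values are 0.58 for the observational quantity and 0.40 for the interventional one. A predictive model would correctly treat `x` as informative, yet forcing `x` to change would not move `y` at all.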
This distinction is critical in high-stakes AI. A model may learn that a variable is predictive because it reflects historical patterns, institutional bias, measurement artifacts, or downstream effects of earlier decisions. Explaining that the variable influenced the model does not establish that changing the variable would causally change the real outcome.
For example, a risk model may treat missed appointments, residential location, or prior service usage as predictive. An attribution method may correctly show that these features influenced the prediction. But a causal analysis may reveal that the features reflect access barriers, infrastructure inequality, institutional neglect, or prior decision policies. In such cases, feature attribution alone can produce a technically faithful but socially incomplete explanation.
Explainable AI should therefore be connected to causal inference when decisions involve intervention, responsibility, recourse, or institutional reform.
| Explanation Type | Question | Evidence Needed | Risk if Confused |
|---|---|---|---|
| Associational explanation | Which features influenced the model prediction? | Model behavior, attribution values, feature effects. | Correlation may be mistaken for causal mechanism. |
| Causal explanation | What intervention would change the outcome? | Causal assumptions, domain theory, experimental or quasi-experimental evidence. | Interventions may fail or cause harm if based only on prediction. |
| Recourse explanation | What could change the model decision? | Counterfactual search, feasibility constraints, actionability review. | Users may be given unrealistic or unfair guidance. |
| Governance explanation | Is the model’s reasoning appropriate for the decision context? | Feature review, legal review, stakeholder review, audit evidence. | Technically faithful explanations may legitimize inappropriate models. |
Note: A faithful explanation of model behavior is not the same as a valid explanation of the world.
Deep Learning Interpretability: Gradients, Concepts, and Representations
Deep learning interpretability includes methods designed to inspect neural networks, embeddings, attention patterns, latent representations, saliency, gradients, and concept-level behavior. These methods can reveal useful patterns, but they must be interpreted carefully.
Gradient-based methods estimate how changes in input affect the output. Integrated Gradients, for example, attributes prediction differences by integrating gradients along a path from a baseline input to the actual input. Concept-based methods, such as TCAV, attempt to explain internal representations in terms of human-understandable concepts rather than low-level features.
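Integrated Gradients can be sketched without a deep learning framework when the gradient is available in closed form. The example below assumes a hypothetical logistic model with fixed weights, a zero baseline, and a midpoint Riemann sum along the straight-line path; real networks would obtain gradients through automatic differentiation.

```python
import math

# Hypothetical model parameters chosen for illustration only.
WEIGHTS = [1.5, -2.0, 0.5]
BIAS = 0.1

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def model(x):
    return sigmoid(sum(w * xi for w, xi in zip(WEIGHTS, x)) + BIAS)

def gradient(x):
    """Analytic gradient of the logistic model with respect to its inputs."""
    p = model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]

def integrated_gradients(x, baseline, steps=200):
    """Approximate the path integral of gradients from baseline to x."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(point)):
            total[i] += g
    # Scale the averaged gradient by the input difference for each feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

x = [1.0, 0.5, -1.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
completeness_gap = abs(sum(attributions) - (model(x) - model(baseline)))
print("attributions:", [round(a, 4) for a in attributions])
print("completeness gap:", completeness_gap)
```

The completeness check is a useful sanity test, but it does not validate the choice of baseline, which remains a modeling decision that can change the attributions.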
For large language models and generative systems, interpretability becomes more difficult because explanations may involve distributed representations, prompt context, retrieval systems, tool calls, hidden internal reasoning processes, safety layers, and post-processing steps. In these systems, explanation should not be confused with a generated rationale. A model can produce a plausible explanation that does not faithfully describe the internal process that produced the output.
Deep learning interpretability is therefore most useful when treated as evidence, not proof. Saliency maps, attention patterns, attribution values, and generated rationales should be validated against perturbation tests, counterfactuals, benchmarks, human review, and system-level audits.
\[
Plausible\ Rationale \neq Faithful\ Explanation
\]
Interpretation: A generated explanation may sound convincing without accurately representing the internal process or evidence that produced the model output.
Explainability in Decision and Infrastructure Systems
Explainability becomes most important when AI systems are embedded in decision workflows. A model that produces a score is only one part of a larger process that includes users, thresholds, interfaces, review rules, documentation, escalation paths, and institutional consequences.
In decision support systems, explanations should help users understand:
- why the model produced a recommendation;
- how confident or uncertain the system is;
- which features influenced the output;
- whether the case resembles the training data;
- what alternatives or counterfactuals exist;
- when human review is required;
- how to record disagreement or override the model.
In infrastructure systems, explainability supports reliability and debugging. When an AI system prioritizes maintenance, detects anomalies, manages energy loads, flags environmental risk, or supports emergency response, system operators need explanations that connect model output to measurable conditions, operational thresholds, sensor quality, and physical mechanisms.
Explainability should therefore be integrated into the design of interfaces and workflows. An explanation hidden in a technical report may satisfy documentation requirements but fail to support safe real-time use.
| System Need | Explanation Function | Example Artifact | Governance Value |
|---|---|---|---|
| Operational review | Shows why a recommendation was made. | Local feature attribution or counterfactual explanation. | Supports human judgment and override. |
| Reliability debugging | Identifies whether model behavior depends on unstable inputs. | Drift-linked explanation review. | Supports monitoring and incident response. |
| Contestability | Helps affected people challenge or correct decisions. | Decision explanation, recourse summary, appeal record. | Supports procedural fairness and accountability. |
| Infrastructure management | Connects model outputs to physical or operational evidence. | Sensor-based explanation and threshold report. | Supports reliability and resilience planning. |
| Audit and oversight | Creates reviewable evidence about model behavior. | Explanation logs, stability tests, feature-use reports. | Supports governance, compliance, and institutional learning. |
Note: Explanations should support the decisions, users, and accountability structures in which the AI system actually operates.
Governance, Documentation, and Accountability
Explainability supports governance when it creates evidence that can be reviewed, audited, challenged, and improved. Governance-oriented explainability should be connected to model cards, datasheets, audit logs, decision records, risk registers, evaluation reports, and incident reviews.
A mature explainability governance process should document:
- which explanation methods are used and why;
- whether explanations are local, global, counterfactual, causal, or user-facing;
- how explanation fidelity and stability are tested;
- which users receive explanations and in what format;
- how affected people can contest decisions;
- which features are prohibited, sensitive, or restricted;
- how explanation artifacts are logged;
- how explanation failures trigger review or remediation.
Regulatory and standards environments increasingly treat transparency, documentation, accountability, and risk management as lifecycle responsibilities. Explainability should not be bolted onto a model after deployment. It should be designed into the system as part of responsible AI governance.
| Governance Area | Question | Evidence Needed | Failure Mode |
|---|---|---|---|
| Purpose | What decision does the explanation support? | Use-case documentation and decision workflow map. | Explanations are generated without practical relevance. |
| Fidelity | Does the explanation reflect the model’s behavior? | Surrogate tests, perturbation tests, stability analysis. | Explanations are persuasive but misleading. |
| Audience | Who needs to understand the explanation? | User research, interface review, role-specific documentation. | Explanations are technically correct but unusable. |
| Contestability | Can the explanation support challenge or correction? | Appeal workflows, decision logs, recourse reports. | Affected people cannot meaningfully respond. |
| Accountability | Who acts when explanations reveal risk? | Risk registers, escalation paths, incident-review procedures. | Known issues persist without correction. |
Note: Explainability becomes accountable only when explanation evidence is connected to institutional responsibility and corrective authority.
Limits and Failure Modes of Explainable AI
Explainable AI has significant limits. The existence of an explanation does not guarantee that the model is fair, safe, causal, lawful, or trustworthy. Explanations can fail in several ways.
First, explanations can be unfaithful. A local surrogate may approximate the model poorly. A generated rationale may sound plausible while failing to reflect actual model behavior. A saliency map may highlight visually intuitive regions without proving that the model relied on them in a robust way.
Second, explanations can be unstable. Similar cases may produce different explanations, especially when methods depend on sampling, perturbation, background distributions, or local approximation. Instability can make explanations unreliable for audit or contestability.
Third, explanations can confuse association with causation. A feature may influence the model because it is predictive, but that does not mean it is causally appropriate or ethically acceptable. Explanation methods can reveal what the model learned without validating whether the learned pattern should be used.
Fourth, explanations can create automation bias. A user may over-trust a model because the explanation appears coherent. In this sense, explanation can increase risk if it persuades users without improving understanding.
Fifth, explanations can be strategically incomplete. A system may disclose enough information to appear transparent while withholding the information needed for meaningful contestation, independent audit, or institutional accountability.
For these reasons, explainability should be treated as one component of responsible AI systems, not as a substitute for validation, monitoring, causal analysis, security, fairness review, user research, and governance.
\[
Explanation \neq Accountability
\]
Interpretation: Explanations support accountability only when they are faithful, usable, contestable, documented, and connected to authority for correction.
Mathematical Lens: Attribution, Surrogates, and Counterfactuals
Explainability methods often translate opaque model behavior into simpler mathematical objects: attribution scores, surrogate models, counterfactual examples, concept directions, or causal effects. These objects do not fully replace the original model. They provide structured ways to interrogate it.
\[
f(x) \approx g(x)
\]
Interpretation: A surrogate explanation model \(g\) approximates the behavior of a more complex model \(f\). The surrogate may be global, explaining broad behavior, or local, explaining behavior near a specific input.
Feature attribution methods assign contribution values to input variables. In an additive explanation model, the prediction is decomposed into a baseline plus feature contributions.
\[
f(x) \approx \phi_0 + \sum_{i=1}^{n} \phi_i
\]
Interpretation: The prediction is represented as a baseline value \(\phi_0\) plus feature attributions \(\phi_i\). This structure underlies many local explanation techniques, including Shapley-value-based approaches.
Shapley values come from cooperative game theory and estimate the marginal contribution of each feature across possible feature coalitions.
\[
\phi_i =
\sum_{S \subseteq N \setminus \{i\}}
\frac{|S|!\,(n-|S|-1)!}{n!}
\left[
f(S \cup \{i\}) - f(S)
\right]
\]
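Interpretation: The Shapley value \(\phi_i\) is a weighted average of the marginal contribution of feature \(i\) across all subsets \(S\) of the other features, where \(N\) is the full set of \(n\) features and \(f(S)\) denotes the model output when only the features in \(S\) are treated as known. This provides a principled attribution framework, though interpretation still depends on feature dependence, background data, and modeling assumptions.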
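The formula can be evaluated exactly when the number of features is small. The following sketch enumerates every coalition for a hypothetical three-feature model with one interaction term. It assumes a single fixed baseline input to define \(f(S)\): features in \(S\) take the case's values, and all others take the baseline's. That is one common convention among several, and the choice affects the resulting values.

```python
from itertools import combinations
from math import factorial


def model(x):
    """Hypothetical model: one additive term plus one interaction term."""
    return 2.0 * x[0] + x[1] * x[2]


def coalition_value(subset, x, baseline):
    """f(S): features in S come from x, the rest from the baseline."""
    mixed = [x[i] if i in subset else baseline[i] for i in range(len(x))]
    return model(mixed)


def shapley_values(x, baseline):
    """Exact Shapley values by enumerating all coalitions (cost grows as 2^n)."""
    n = len(x)
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi_i = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = coalition_value(set(subset) | {i}, x, baseline)
                without_i = coalition_value(set(subset), x, baseline)
                phi_i += weight * (with_i - without_i)
        values.append(phi_i)
    return values


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(x, baseline)
print(phi)
print(sum(phi), model(x) - model(baseline))
```

Two properties are visible in this toy case: the attributions sum to the difference between the prediction and the baseline prediction, and the two features that only matter jointly share the interaction's contribution equally. Exact enumeration is impractical beyond a handful of features, which is why practical tools rely on sampling or model-specific shortcuts.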
Counterfactual explanation asks what minimal change to an input would change the model’s output.
\[
x^{*} =
\arg\min_{x'}
d(x, x')
\quad \text{subject to} \quad
f(x') = y^{*}
\]
Interpretation: A counterfactual \(x^{*}\) is the nearest alternative input \(x'\), measured by the distance \(d\), that produces the desired output \(y^{*}\). This supports recourse-oriented explanations: what would need to change for the decision to change?
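When the model is linear on the logit scale and the distance is a scaled Euclidean norm, the optimization has a closed-form solution. The sketch below is a simplified illustration under several assumptions: the coefficients and feature scales mirror the article's synthetic generator, the case is hypothetical, only two features are treated as actionable, and valid ranges for the features are not enforced.

```python
import math

# Coefficients and scales follow the article's synthetic generator (assumed).
INTERCEPT = -3.0
WEIGHTS = {"asset_age": 0.10, "sensor_load": 2.2, "maintenance_gap": 0.012,
           "weather_stress": 1.3, "inspection_score": -1.6}
SCALES = {"asset_age": 3.0, "sensor_load": 0.15, "maintenance_gap": 30.0,
          "weather_stress": 0.2, "inspection_score": 0.15}
ACTIONABLE = ("sensor_load", "maintenance_gap")  # assumption for this sketch
THRESHOLD = 0.70


def logit_score(features):
    return INTERCEPT + sum(WEIGHTS[name] * features[name] for name in WEIGHTS)


def probability(features):
    return 1.0 / (1.0 + math.exp(-logit_score(features)))


def nearest_counterfactual(features, threshold=THRESHOLD):
    """Smallest scaled change to actionable features that reaches the threshold.

    Minimizes sum((delta_i / scale_i)^2) subject to the logit dropping by
    exactly the margin above the threshold; solved with a Lagrange multiplier.
    """
    target_logit = math.log(threshold / (1.0 - threshold))
    margin = logit_score(features) - target_logit
    if margin <= 0:
        return dict(features)  # already at or below the threshold
    denominator = sum((WEIGHTS[n] * SCALES[n]) ** 2 for n in ACTIONABLE)
    counterfactual = dict(features)
    for name in ACTIONABLE:
        step = margin * WEIGHTS[name] * SCALES[name] ** 2 / denominator
        counterfactual[name] = features[name] - step
    return counterfactual


case = {"asset_age": 14.0, "sensor_load": 0.8, "maintenance_gap": 150.0,
        "weather_stress": 0.6, "inspection_score": 0.75}
cf = nearest_counterfactual(case)
print(f"original probability:       {probability(case):.3f}")
print(f"counterfactual probability: {probability(cf):.3f}")
for name in ACTIONABLE:
    print(f"{name}: {case[name]:.3f} -> {cf[name]:.3f}")
```

The solution lands exactly on the threshold because the constraint is an equality; a recourse recommendation would usually step slightly past it. The result also depends on the scaling inside \(d\): without it, the feature with the largest raw units would dominate the suggested change.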
Explanation stability can be evaluated by comparing explanations for similar inputs.
\[
d(x_a, x_b) \leq \epsilon
\quad \Rightarrow \quad
d(E(x_a), E(x_b)) \leq \delta
\]
Interpretation: If two inputs are similar, their explanations should usually be similar. Large explanation differences for nearly identical cases may indicate instability, brittleness, or a misleading explanation method.
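The stability condition can be turned into a direct check. In the sketch below, both models and the explanation function are hypothetical: \(E(x)\) is a finite-difference sensitivity vector, and the values of \(\epsilon\) and \(\delta\) are arbitrary choices for illustration. A smooth model passes the check for two nearby inputs, while a tree-like model with a jump near those inputs fails it.

```python
import math


def smooth_model(x):
    """Hypothetical smooth score: a logistic function of two features."""
    return 1.0 / (1.0 + math.exp(-(2.0 * x[0] - 1.0 * x[1])))


def stepped_model(x):
    """Hypothetical tree-like score that jumps when feature 0 crosses 0.5."""
    return 0.9 if x[0] > 0.5 else 0.2


def sensitivity_explanation(model, x, step=0.01):
    """E(x): forward finite-difference sensitivity of the score per feature."""
    base = model(x)
    explanation = []
    for i in range(len(x)):
        shifted = list(x)
        shifted[i] += step
        explanation.append((model(shifted) - base) / step)
    return explanation


def euclidean(a, b):
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def is_stable(model, x_a, x_b, epsilon, delta):
    """Check d(E(x_a), E(x_b)) <= delta for inputs with d(x_a, x_b) <= epsilon."""
    if euclidean(x_a, x_b) > epsilon:
        raise ValueError("inputs are not within epsilon of each other")
    distance = euclidean(
        sensitivity_explanation(model, x_a),
        sensitivity_explanation(model, x_b),
    )
    return distance <= delta


x_a, x_b = [0.495, 0.30], [0.485, 0.30]
smooth_ok = is_stable(smooth_model, x_a, x_b, epsilon=0.02, delta=0.5)
stepped_ok = is_stable(stepped_model, x_a, x_b, epsilon=0.02, delta=0.5)
print("smooth model stable: ", smooth_ok)
print("stepped model stable:", stepped_ok)
```

A failed check does not say whether the model or the explanation method is at fault. Here the instability comes from the model's discontinuity, but a noisy sampling-based explainer applied to a smooth model could fail the same test.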
Variables and System Interpretation
| Symbol or Term | Meaning | System Interpretation | Explainability Relevance |
|---|---|---|---|
| \(x\) | Input features | Observed data used by the model | Explanation depends on feature definition, measurement quality, and context. |
| \(\hat{y}\) | Predicted output | Model estimate, classification, ranking, score, or recommendation | The object being explained. |
| \(f\) | Original model | Possibly opaque prediction function | May require post-hoc explanation. |
| \(g\) | Surrogate model | Simplified model approximating \(f\) | Used for local or global explanation. |
| \(\phi_i\) | Feature attribution | Contribution assigned to feature \(i\) | Supports local importance analysis. |
| \(\phi_0\) | Baseline prediction | Reference value before feature contributions | Helps explain how a case differs from baseline. |
| \(S\) | Feature subset | Coalition of variables used in Shapley calculation | Determines marginal contribution logic. |
| \(x^{*}\) | Counterfactual case | Alternative input producing a desired outcome | Supports recourse and contestability. |
| \(d(x, x')\) | Distance function | Measure of how different two cases are | Important for plausible counterfactuals. |
| \(E(x)\) | Explanation for input \(x\) | Attribution, rule, counterfactual, or concept explanation | Can be audited for stability and fidelity. |
| \(P(y \mid do(x))\) | Causal intervention probability | Outcome probability after intervention | Distinguishes causal explanation from association. |
Note: Explanation variables should be interpreted as part of a decision system, not only as mathematical objects.
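The last row of the table, \(P(y \mid do(x))\), differs from the ordinary conditional probability \(P(y \mid x)\) whenever a common cause influences both the feature and the outcome. The simulation below is a toy illustration with an assumed causal structure and assumed probabilities: a hidden binary factor \(z\) raises both the chance that \(x\) is present and the chance of the outcome. Comparing observed cases where \(x\) is present with cases where \(x\) is forced to be present shows the gap.

```python
import random


def simulate(n, force_x=None, seed=0):
    """Draw (x, y) pairs from an assumed structure z -> x, z -> y, x -> y.

    When force_x is given, x is set by intervention and no longer depends on z.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                      # hidden common cause
        if force_x is None:
            x = rng.random() < (0.8 if z else 0.2)  # x depends on z
        else:
            x = force_x                             # do(x): cut the z -> x link
        p_y = 0.1 + (0.6 if z else 0.0) + (0.1 if x else 0.0)
        y = rng.random() < p_y
        rows.append((x, y))
    return rows


def outcome_rate(rows):
    return sum(1 for _, y in rows if y) / len(rows)


observed = simulate(50_000)
p_assoc = outcome_rate([row for row in observed if row[0]])  # P(y | x = 1)

intervened = simulate(50_000, force_x=True, seed=1)
p_do = outcome_rate(intervened)                              # P(y | do(x = 1))

print(f"P(y | x = 1)     ~ {p_assoc:.3f}")
print(f"P(y | do(x = 1)) ~ {p_do:.3f}")
```

In this assumed structure the exact values are 0.68 for the conditional probability and 0.50 for the interventional one. A feature attribution computed from observational data reflects the first quantity, so it can overstate what changing the feature would actually achieve.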
Worked Example: Explaining a Risk Prediction
Consider an AI system that estimates whether a piece of public infrastructure is at elevated risk of failure. The model uses asset age, sensor load, maintenance gap, recent weather stress, and prior inspection results. It produces a risk score of 0.78, above the intervention threshold of 0.70.
A local feature attribution explanation might show:
| Feature | Direction | Interpretation | Governance Question |
|---|---|---|---|
| Sensor load | Increases risk | Recent readings are unusually high. | Are sensors calibrated and reliable? |
| Maintenance gap | Increases risk | The asset has not been serviced recently. | Does this reflect neglect, budget constraints, or true lower priority? |
| Asset age | Increases risk | The asset is older than baseline. | Is age causally related to failure in this asset class? |
| Inspection result | Decreases risk | The most recent inspection was acceptable. | Was the inspection recent, comparable, and well documented? |
| Weather stress | Increases risk | Recent extreme conditions increase failure likelihood. | Is the model calibrated for climate stress? |
Note: A local explanation identifies model logic, but governance review must still ask whether the logic is reliable, causal, fair, and operationally useful.
A counterfactual explanation might say that the score would fall below 0.70 if the maintenance gap were reduced by 40 days and sensor load returned to its historical range. A causal review would then ask whether maintenance actually reduces failure risk, whether sensor load is a reliable sign of stress, and whether the system is under-prioritizing communities with historically weaker inspection coverage.
This example shows why explanations require layered interpretation. Feature attribution identifies model logic. Counterfactuals support recourse. Causal analysis asks whether interventions are meaningful. Governance asks whether the model should be used as designed.
\[
\text{Attribution} + \text{Counterfactual} + \text{Causal Review} + \text{Governance} = \text{Stronger Explanation}
\]
Interpretation: A responsible explanation combines model evidence, possible alternatives, causal reasoning, and institutional review.
Computational Modeling
Computational modeling for explainable AI should produce artifacts that help people inspect, compare, evaluate, and govern model behavior. A useful explanation workflow should not merely generate a plot. It should produce reviewable evidence: feature-importance summaries, local explanations, counterfactual cases, explanation stability metrics, model-version records, and audit-ready tables.
A practical explanation workflow should answer several questions:
- Which features are most important globally?
- Which features drove a specific prediction?
- Would a similar case receive a similar explanation?
- What minimally different case would produce a different outcome?
- Are suggested counterfactuals plausible and actionable?
- Does the explanation remain stable under perturbation?
- Can the explanation be stored, reviewed, and audited?
| Artifact | Purpose | Governance Value |
|---|---|---|
| Global feature importance | Identifies broad patterns in model behavior. | Supports model review and feature governance. |
| Local explanation | Explains one prediction or decision. | Supports user understanding and case-level review. |
| Counterfactual record | Shows what change would alter an outcome. | Supports recourse and contestability. |
| Stability metric | Tests whether explanations are consistent across similar cases. | Supports reliability and audit confidence. |
| Explanation audit table | Documents explanation method, purpose, audience, and limitations. | Supports accountability and lifecycle governance. |
Note: Explainability workflows should produce evidence that can be reviewed by technical teams, domain experts, governance bodies, and affected stakeholders.
Python Workflow: Local Explanations, Counterfactuals, and Stability
The following Python workflow demonstrates a practical explanation audit. It trains a simple model, computes permutation-style feature importance, builds a local linear surrogate around one case, searches for a simple counterfactual, and tests explanation stability for nearby cases. The code uses synthetic data so the workflow can be adapted to real model logs.
"""
Explainable AI and Model Interpretability
Python workflow: feature importance, local surrogate explanation,
counterfactual search, and explanation stability.
This script uses synthetic data and dependency-light methods so the
logic can be adapted to real model logs, model registry outputs,
or governed decision-support systems.
"""
from __future__ import annotations
from dataclasses import dataclass
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)
OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)
@dataclass
class ExplanationResult:
    """Container for local explanation results."""

    case_index: int
    prediction_probability: float
    local_coefficients: pd.Series
    counterfactual: pd.Series
    counterfactual_probability: float
    explanation_stability: float
def make_synthetic_risk_data(n: int = 3000) -> pd.DataFrame:
    """
    Create synthetic infrastructure risk-prediction data.

    The variables are intentionally interpretable so the explanation
    workflow can be reviewed by technical and non-technical audiences.
    """
    asset_age = rng.normal(10, 3, n).clip(0)
    sensor_load = rng.normal(0.5, 0.15, n).clip(0, 1)
    maintenance_gap = rng.normal(90, 30, n).clip(0)
    weather_stress = rng.normal(0.4, 0.2, n).clip(0, 1)
    inspection_score = rng.normal(0.7, 0.15, n).clip(0, 1)

    # Linear logit with a small amount of unexplained noise.
    logit = (
        -3.0
        + 0.10 * asset_age
        + 2.2 * sensor_load
        + 0.012 * maintenance_gap
        + 1.3 * weather_stress
        - 1.6 * inspection_score
        + rng.normal(0, 0.25, n)
    )
    probability = 1 / (1 + np.exp(-logit))
    outcome = rng.binomial(1, probability)

    return pd.DataFrame(
        {
            "asset_age": asset_age,
            "sensor_load": sensor_load,
            "maintenance_gap": maintenance_gap,
            "weather_stress": weather_stress,
            "inspection_score": inspection_score,
            "outcome": outcome,
        }
    )
def permutation_importance(
    model: RandomForestClassifier,
    x_valid: pd.DataFrame,
    y_valid: pd.Series,
) -> pd.Series:
    """
    Calculate simple permutation feature importance.

    A feature is important when shuffling it reduces predictive performance.
    """
    baseline = accuracy_score(y_valid, model.predict(x_valid))
    scores: dict[str, float] = {}

    for column in x_valid.columns:
        x_permuted = x_valid.copy()
        x_permuted[column] = rng.permutation(x_permuted[column].to_numpy())
        permuted_score = accuracy_score(y_valid, model.predict(x_permuted))
        scores[column] = baseline - permuted_score

    return pd.Series(scores).sort_values(ascending=False)
def local_surrogate_explanation(
    model: RandomForestClassifier,
    x_train: pd.DataFrame,
    case: pd.Series,
    samples: int = 500,
    noise_scale: float = 0.10,
) -> pd.Series:
    """
    Fit a local linear surrogate around one case.

    This is a simplified LIME-like demonstration:
    perturb a case, score the black-box model, then fit a local linear model.
    """
    feature_std = x_train.std().replace(0, 1)
    perturbations = []

    for _ in range(samples):
        # Scale the noise by each feature's spread so perturbations are comparable.
        noise = rng.normal(0, noise_scale, size=len(case)) * feature_std.to_numpy()
        perturbations.append(case.to_numpy() + noise)

    local_x = pd.DataFrame(perturbations, columns=x_train.columns)
    local_y = model.predict_proba(local_x)[:, 1]

    surrogate = LinearRegression()
    surrogate.fit(local_x, local_y)

    return pd.Series(surrogate.coef_, index=x_train.columns).sort_values(
        key=np.abs,
        ascending=False,
    )
def find_counterfactual(
    model: RandomForestClassifier,
    case: pd.Series,
    target_threshold: float = 0.50,
    max_steps: int = 60,
) -> tuple[pd.Series, float]:
    """
    Search for a simple counterfactual by reducing actionable risk features.

    This is a demonstration, not an optimized counterfactual solver.
    """
    candidate = case.copy()

    for _ in range(max_steps):
        probability = model.predict_proba(candidate.to_frame().T)[0, 1]
        if probability < target_threshold:
            return candidate, float(probability)

        # Take one small step on each adjustable feature, never below zero.
        candidate["maintenance_gap"] = max(candidate["maintenance_gap"] - 4, 0)
        candidate["sensor_load"] = max(candidate["sensor_load"] - 0.015, 0)
        candidate["weather_stress"] = max(candidate["weather_stress"] - 0.012, 0)

    # Step budget exhausted: return the last candidate and its probability.
    probability = model.predict_proba(candidate.to_frame().T)[0, 1]
    return candidate, float(probability)
def explanation_stability(
    model: RandomForestClassifier,
    x_train: pd.DataFrame,
    case: pd.Series,
    repeats: int = 12,
) -> float:
    """
    Estimate stability of local surrogate explanations.

    Higher values indicate that repeated local explanations are more similar.
    """
    explanations = []

    for _ in range(repeats):
        coefficients = local_surrogate_explanation(model, x_train, case, samples=300)
        explanations.append(coefficients.reindex(x_train.columns).to_numpy())

    matrix = np.vstack(explanations)
    mean_vector = matrix.mean(axis=0)
    distances = np.linalg.norm(matrix - mean_vector, axis=1)

    # Map the mean distance to (0, 1]; 1.0 means identical explanations.
    stability_score = 1 / (1 + distances.mean())
    return float(stability_score)
def main() -> None:
    """Run the explanation workflow and save governance artifacts."""
    data = make_synthetic_risk_data()
    x = data.drop(columns=["outcome"])
    y = data["outcome"]

    x_train, x_valid, y_train, y_valid = train_test_split(
        x,
        y,
        test_size=0.30,
        random_state=RANDOM_SEED,
        stratify=y,
    )

    model = RandomForestClassifier(
        n_estimators=250,
        max_depth=6,
        random_state=RANDOM_SEED,
    )
    model.fit(x_train, y_train)

    validation_accuracy = accuracy_score(y_valid, model.predict(x_valid))
    global_importance = permutation_importance(model, x_valid, y_valid)

    case_index = 10
    case = x_valid.iloc[case_index]
    probability = float(model.predict_proba(case.to_frame().T)[0, 1])

    local_coefficients = local_surrogate_explanation(model, x_train, case)
    counterfactual, cf_probability = find_counterfactual(model, case)
    stability = explanation_stability(model, x_train, case)

    explanation = ExplanationResult(
        case_index=case_index,
        prediction_probability=probability,
        local_coefficients=local_coefficients,
        counterfactual=counterfactual,
        counterfactual_probability=cf_probability,
        explanation_stability=stability,
    )

    explanation_summary = pd.DataFrame(
        {
            "metric": [
                "validation_accuracy",
                "case_index",
                "prediction_probability",
                "counterfactual_probability",
                "explanation_stability",
            ],
            "value": [
                validation_accuracy,
                explanation.case_index,
                explanation.prediction_probability,
                explanation.counterfactual_probability,
                explanation.explanation_stability,
            ],
        }
    )

    local_explanation_table = explanation.local_coefficients.reset_index()
    local_explanation_table.columns = ["feature", "local_surrogate_coefficient"]

    counterfactual_table = pd.DataFrame(
        {
            "feature": x_train.columns,
            "original_value": case.reindex(x_train.columns).to_numpy(),
            "counterfactual_value": explanation.counterfactual.reindex(x_train.columns).to_numpy(),
        }
    )
    counterfactual_table["change"] = (
        counterfactual_table["counterfactual_value"]
        - counterfactual_table["original_value"]
    )

    global_importance.to_csv(OUTPUT_DIR / "python_global_feature_importance.csv")
    local_explanation_table.to_csv(OUTPUT_DIR / "python_local_explanation.csv", index=False)
    counterfactual_table.to_csv(OUTPUT_DIR / "python_counterfactual.csv", index=False)
    explanation_summary.to_csv(OUTPUT_DIR / "python_explanation_summary.csv", index=False)

    memo = f"""# Explainable AI Audit Memo
## Model and Case Summary
Validation accuracy: {validation_accuracy:.3f}
Explained case index: {explanation.case_index}
Prediction probability: {explanation.prediction_probability:.3f}
Counterfactual probability: {explanation.counterfactual_probability:.3f}
Explanation stability score: {explanation.explanation_stability:.3f}
## Interpretation
- Global feature importance identifies broad model behavior.
- Local surrogate coefficients approximate the model near one case.
- Counterfactual output identifies actionable changes that alter the prediction.
- Explanation stability estimates whether repeated explanations are consistent.
- Governance review should evaluate whether explanations are faithful, stable,
  causally meaningful, actionable, and appropriate for the decision context.
"""
    (OUTPUT_DIR / "python_explainable_ai_audit_memo.md").write_text(memo)

    print(memo)
    print("\nGlobal feature importance")
    print(global_importance)
    print("\nLocal explanation")
    print(local_explanation_table)
    print("\nCounterfactual")
    print(counterfactual_table)


if __name__ == "__main__":
    main()
This workflow illustrates the core logic behind explanation auditing: global importance, local approximation, actionable counterfactuals, and stability testing. In a production system, these outputs should be stored with model version, data version, explanation method, parameter settings, reviewer decisions, and governance status.
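One way to keep those fields together is a small, serializable record written alongside each explanation. The sketch below is only an assumed shape for such a record: the class name, field names, and example values are hypothetical and would need to match an organization's own model registry and review process. The parameter values shown are the defaults of the local surrogate function in the workflow above.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExplanationAuditRecord:
    """Hypothetical audit record stored with each generated explanation."""

    model_version: str
    data_version: str
    explanation_method: str
    parameters: dict[str, float]
    reviewer_decision: str
    governance_status: str
    limitations: list[str] = field(default_factory=list)


record = ExplanationAuditRecord(
    model_version="risk-model-0.1",          # placeholder identifier
    data_version="synthetic-2026-05",        # placeholder identifier
    explanation_method="local_linear_surrogate",
    parameters={"samples": 500, "noise_scale": 0.10},
    reviewer_decision="pending",
    governance_status="under_review",
    limitations=["synthetic data", "surrogate is valid only near the case"],
)

# Serialize with sorted keys so records diff cleanly between versions.
serialized = json.dumps(asdict(record), indent=2, sort_keys=True)
restored = ExplanationAuditRecord(**json.loads(serialized))

print(serialized)
```

Storing the explanation parameters with the record matters because the same method can give different explanations under different settings, and a reviewer needs to be able to reproduce what was shown.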
R Workflow: Interpretable Models and Explanation Audits
The following R workflow uses base R to fit an interpretable logistic model, compare feature effects, calculate local contribution-style values, and produce an explanation audit table. It is useful as a transparent baseline against which more complex models can be compared.
# Explainable AI and Model Interpretability
# R workflow: transparent logistic model and explanation audit.

set.seed(42)

if (!dir.exists("outputs")) {
  dir.create("outputs")
}

n <- 3000

asset_age <- pmax(rnorm(n, mean = 10, sd = 3), 0)
sensor_load <- pmin(pmax(rnorm(n, mean = 0.5, sd = 0.15), 0), 1)
maintenance_gap <- pmax(rnorm(n, mean = 90, sd = 30), 0)
weather_stress <- pmin(pmax(rnorm(n, mean = 0.4, sd = 0.2), 0), 1)
inspection_score <- pmin(pmax(rnorm(n, mean = 0.7, sd = 0.15), 0), 1)

logit <- -3.0 +
  0.10 * asset_age +
  2.2 * sensor_load +
  0.012 * maintenance_gap +
  1.3 * weather_stress -
  1.6 * inspection_score +
  rnorm(n, mean = 0, sd = 0.25)

probability <- 1 / (1 + exp(-logit))
outcome <- rbinom(n, size = 1, prob = probability)

data <- data.frame(
  asset_age = asset_age,
  sensor_load = sensor_load,
  maintenance_gap = maintenance_gap,
  weather_stress = weather_stress,
  inspection_score = inspection_score,
  outcome = outcome
)

model <- glm(
  outcome ~ asset_age + sensor_load + maintenance_gap + weather_stress + inspection_score,
  data = data,
  family = binomial()
)

model_summary <- summary(model)
# Local contribution-style explanation for one case.
case_index <- 10
case <- data[case_index, ]

coefficients <- coef(model)
feature_names <- names(coefficients)[names(coefficients) != "(Intercept)"]

contributions <- coefficients[feature_names] * as.numeric(case[feature_names])

local_explanation <- data.frame(
  feature = feature_names,
  value = as.numeric(case[feature_names]),
  coefficient = as.numeric(coefficients[feature_names]),
  contribution_to_logit = as.numeric(contributions)
)

local_explanation <- local_explanation[
  order(abs(local_explanation$contribution_to_logit), decreasing = TRUE),
]

predicted_probability <- predict(model, newdata = case, type = "response")
# Simple counterfactual: reduce actionable risk features.
counterfactual <- case

for (step in 1:60) {
  cf_probability <- predict(model, newdata = counterfactual, type = "response")

  if (cf_probability < 0.50) {
    break
  }

  counterfactual$maintenance_gap <- max(counterfactual$maintenance_gap - 4, 0)
  counterfactual$sensor_load <- max(counterfactual$sensor_load - 0.015, 0)
  counterfactual$weather_stress <- max(counterfactual$weather_stress - 0.012, 0)
}

counterfactual_probability <- predict(model, newdata = counterfactual, type = "response")

counterfactual_summary <- data.frame(
  feature = feature_names,
  original_value = as.numeric(case[feature_names]),
  counterfactual_value = as.numeric(counterfactual[feature_names]),
  change = as.numeric(counterfactual[feature_names]) - as.numeric(case[feature_names])
)
# Explanation audit table.
audit_table <- data.frame(
  audit_item = c(
    "Model is intrinsically interpretable",
    "Feature coefficients are inspectable",
    "Local explanation generated",
    "Counterfactual generated",
    "Actionable variables used in counterfactual",
    "Explanation requires causal review",
    "Explanation should be logged for governance"
  ),
  status = c(
    TRUE,
    TRUE,
    TRUE,
    TRUE,
    TRUE,
    TRUE,
    TRUE
  )
)

explanation_summary <- data.frame(
  metric = c(
    "case_index",
    "predicted_probability",
    "counterfactual_probability"
  ),
  value = c(
    case_index,
    round(predicted_probability, 4),
    round(counterfactual_probability, 4)
  )
)
write.csv(local_explanation, "outputs/r_local_explanation.csv", row.names = FALSE)
write.csv(counterfactual_summary, "outputs/r_counterfactual_summary.csv", row.names = FALSE)
write.csv(audit_table, "outputs/r_explanation_audit_table.csv", row.names = FALSE)
write.csv(explanation_summary, "outputs/r_explanation_summary.csv", row.names = FALSE)

memo <- paste0(
  "# Explainable AI Audit Memo\n\n",
  "Explained case index: ", case_index, "\n",
  "Predicted probability: ", round(predicted_probability, 3), "\n",
  "Counterfactual probability: ", round(counterfactual_probability, 3), "\n\n",
  "Interpretation:\n",
  "- The logistic model provides an intrinsically interpretable baseline.\n",
  "- Local contributions show how each feature contributes to the logit score.\n",
  "- The counterfactual summary identifies actionable changes that reduce risk.\n",
  "- Governance review should evaluate whether the explanation is faithful, ",
  "stable, causal, actionable, and appropriate for the decision context.\n"
)

writeLines(memo, "outputs/r_explainable_ai_audit_memo.md")

print(model_summary)
print("Local explanation")
print(local_explanation)
print("Counterfactual summary")
print(counterfactual_summary)
print("Explanation audit table")
print(audit_table)
cat(memo)
This R example reinforces a practical governance principle: interpretable baselines matter. Even when a complex model is ultimately deployed, a transparent baseline can help auditors understand whether complexity is justified and whether explanation patterns are plausible.
GitHub Repository
The article body includes selected computational examples so the conceptual and mathematical argument remains readable. The full repository contains expanded computational infrastructure: advanced notebooks, Python and R workflows, SQL metadata, explanation audit tools, counterfactual search, local surrogate models, stability diagnostics, governance documentation, and reproducible outputs.
Complete Code Repository
The full code distribution for this article includes Python, R, SQL, Julia, explanation-audit documentation, interpretable-modeling workflows, feature-attribution examples, counterfactual reasoning tools, local surrogate diagnostics, explanation stability checks, reproducible outputs, and governance scaffolding for studying explainable AI and model interpretability.
From Explanation to Governed Understanding
Explainable AI shows that intelligence is not only prediction, classification, generation, or optimization. It is also evidence. When AI systems shape decisions, explanations become part of the infrastructure of trust, accountability, contestability, and institutional learning. A model that cannot be understood may still be useful in some settings, but its use becomes more fragile as the stakes rise.
The central lesson is that explanation must be governed. Feature attribution, counterfactuals, surrogate models, saliency maps, concept vectors, and generated rationales are not automatically trustworthy. They must be tested for fidelity, stability, usefulness, causal relevance, and governance value. Explanations can clarify, but they can also mislead. They can support accountability, but they can also create a false sense of transparency.
The strongest explainability programs therefore treat explanations as part of a larger system: data documentation, model validation, uncertainty review, user-centered design, workflow integration, appeal processes, audit trails, risk registers, monitoring, and corrective authority. Explainability becomes meaningful when it helps people understand when a system should be trusted, when it should be questioned, and when it should be changed.
Within the Artificial Intelligence Systems knowledge series, this article belongs near the following articles: Trust, Interpretability, and User-Centered AI Systems; AI Safety and System Reliability; Model Validation, Benchmarking, and Generalization Theory; Data Governance, Provenance, and Lineage in AI Systems; Causal Inference and Experimental Design in AI Systems; Human-AI Interaction and Interface Design; and Systemic Risk, Feedback Loops, and Cascading Failures in AI Systems. It provides the interpretability and evidence layer for understanding how AI systems can be inspected, challenged, corrected, and governed.
The final point is institutional. Explainability is not a decorative feature added after deployment. It is part of the architecture of responsible AI. A system becomes more accountable when its outputs can be explained, its explanations can be tested, its assumptions can be challenged, and its decisions can be corrected.
Related Articles
- Trust, Interpretability, and User-Centered AI Systems
- AI Safety and System Reliability
- Model Validation, Benchmarking, and Generalization Theory
- Data Governance, Provenance, and Lineage in AI Systems
- Causal Inference and Experimental Design in AI Systems
- Human-AI Interaction and Interface Design
- Systemic Risk, Feedback Loops, and Cascading Failures in AI Systems
Further Reading
- Molnar, C. (2025) Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Available at: https://christophm.github.io/interpretable-ml-book/
- Rudin, C. (2019) ‘Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead’, Nature Machine Intelligence. Available at: https://www.nature.com/articles/s42256-019-0048-x
- Doshi-Velez, F. and Kim, B. (2017) ‘Towards A Rigorous Science of Interpretable Machine Learning’. Available at: https://arxiv.org/abs/1702.08608
- Wachter, S., Mittelstadt, B. and Russell, C. (2017) ‘Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR’. Available at: https://arxiv.org/abs/1711.00399
- Lundberg, S.M. and Lee, S.-I. (2017) ‘A Unified Approach to Interpreting Model Predictions’, NeurIPS. Available at: https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
- Ribeiro, M.T., Singh, S. and Guestrin, C. (2016) ‘“Why Should I Trust You?”: Explaining the Predictions of Any Classifier’, KDD. Available at: https://arxiv.org/abs/1602.04938
- Kim, B. et al. (2018) ‘Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)’, ICML. Available at: https://arxiv.org/abs/1711.11279
References
- Doshi-Velez, F. and Kim, B. (2017) ‘Towards A Rigorous Science of Interpretable Machine Learning’. Available at: https://arxiv.org/abs/1702.08608
- European Union (2024) Regulation (EU) 2024/1689 laying down harmonised rules on artificial intelligence. Available at: https://eur-lex.europa.eu/eli/reg/2024/1689/oj/eng
- Gebru, T. et al. (2021) ‘Datasheets for Datasets’, Communications of the ACM, 64(12), pp. 86–92. Available at: https://dl.acm.org/doi/10.1145/3458723
- Kim, B. et al. (2018) ‘Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)’, ICML. Available at: https://arxiv.org/abs/1711.11279
- Lundberg, S.M. and Lee, S.-I. (2017) ‘A Unified Approach to Interpreting Model Predictions’, NeurIPS. Available at: https://papers.nips.cc/paper/7062-a-unified-approach-to-interpreting-model-predictions
- Mitchell, M. et al. (2019) ‘Model Cards for Model Reporting’, Proceedings of the Conference on Fairness, Accountability, and Transparency. Available at: https://dl.acm.org/doi/10.1145/3287560.3287596
- Molnar, C. (2025) Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Available at: https://christophm.github.io/interpretable-ml-book/
- National Institute of Standards and Technology (2023) Artificial Intelligence Risk Management Framework (AI RMF 1.0). Available at: https://www.nist.gov/itl/ai-risk-management-framework
- Pearl, J. (2009) Causality: Models, Reasoning, and Inference. Cambridge: Cambridge University Press. Available at: https://www.cambridge.org/core/books/causality/12A5D0E3A2B61A91C55D7A8B2F65D24B
- Ribeiro, M.T., Singh, S. and Guestrin, C. (2016) ‘“Why Should I Trust You?”: Explaining the Predictions of Any Classifier’, KDD. Available at: https://arxiv.org/abs/1602.04938
- Rudin, C. (2019) ‘Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead’, Nature Machine Intelligence. Available at: https://www.nature.com/articles/s42256-019-0048-x
- Shapley, L.S. (1953) ‘A Value for n-Person Games’, in Kuhn, H.W. and Tucker, A.W. (eds.) Contributions to the Theory of Games II. Princeton: Princeton University Press. Available at: https://www.rand.org/pubs/papers/P295.html
- Sundararajan, M., Taly, A. and Yan, Q. (2017) ‘Axiomatic Attribution for Deep Networks’, ICML. Available at: https://arxiv.org/abs/1703.01365
- Wachter, S., Mittelstadt, B. and Russell, C. (2017) ‘Counterfactual Explanations without Opening the Black Box: Automated Decisions and the GDPR’. Available at: https://arxiv.org/abs/1711.00399
