Overfitting, Underfitting, and Model Error: How Machine Learning Models Fail

Last Updated June 21, 2026

Overfitting, underfitting, and model error explain why machine-learning systems can fail even when they appear mathematically sophisticated. A model may be too flexible, learning noise and accidents in the training data. It may be too simple, missing important structure in the problem. Or it may be evaluated with an error metric that hides the kinds of failure that matter most in practice.

These are not merely technical defects. They are reasoning failures. Overfitting confuses memorization with learning. Underfitting confuses simplicity with reliability. Model error reminds us that every learned system is an approximation shaped by data, assumptions, features, labels, objectives, optimization choices, and deployment conditions.

This article explains overfitting, underfitting, bias, variance, irreducible error, model capacity, regularization, validation, cross-validation, leakage, error analysis, uncertainty, distribution shift, governance, and representation risk. It shows why responsible computational reasoning requires more than choosing the model with the highest score: it requires understanding what kind of error a model makes, where it fails, and whether those failures are acceptable for the intended use.

Figure: Overfitting, underfitting, and model error shown through competing model behaviors: too simple, appropriately fitted, and overly complex patterns compared against observed data.

Beyond those topics, the article covers training and validation curves, noisy labels, residual analysis, and model selection. It emphasizes that model error is not just a score to minimize; it is evidence about how a computational system sees, simplifies, and misrepresents the world.

Why Model Error Matters

Model error matters because a machine-learning system can be wrong in several different ways. A model may fail because it is too simple, too complex, poorly evaluated, trained on narrow data, exposed to leakage, distorted by noisy labels, or deployed in a setting unlike the one it was designed for. A single performance score can hide these differences.

For computational reasoning, error is evidence. It tells us whether a model is learning durable structure, memorizing accidents, missing important relationships, or misrepresenting certain groups, contexts, or edge cases. Understanding error is therefore part of understanding the model itself.

| Question | Weak review | Stronger review |
| --- | --- | --- |
| Model fit | Does the model score well? | Does it generalize beyond the data used to choose it? |
| Overfitting | Is training error low? | Is validation or test error much worse than training error? |
| Underfitting | Is the model simple? | Is it too simple to represent the structure of the task? |
| Error metric | Which metric is highest? | Which errors matter for the actual decision context? |
| Failure pattern | What is average error? | Where, for whom, and under what conditions does error concentrate? |
| Governance | Was a score reported? | Were error limits, uncertainty, and use boundaries documented? |

A model is not reliable because it fits data. It is reliable only when its errors are understood, bounded, and acceptable for the intended use.

Model Error Defined

Model error is the difference between what a model predicts and what should be predicted, measured, classified, ranked, or estimated under a defined task. In regression, error may be the difference between predicted and observed values. In classification, error may involve false positives, false negatives, calibration failure, or ranking mistakes. In policy settings, error may include downstream harm that is not captured by ordinary metrics.

Model error is not only a mathematical quantity. It is also a design signal. It depends on how the target was defined, which observations were included, how features were measured, which metric was chosen, and what consequences follow from mistakes.

| Error concept | Meaning | Review question |
| --- | --- | --- |
| Training error | Error on data used to fit the model. | Is the model learning the training sample too closely? |
| Validation error | Error on data used for model selection or tuning. | Does tuning improve held-out performance? |
| Test error | Error on protected held-out data. | Does performance hold after design choices are fixed? |
| Generalization error | Expected error on new cases from the intended population. | Does the model travel beyond the development sample? |
| Deployment error | Error after the model is used in real systems. | Does the model remain valid under operational conditions? |
| Social error | Harmful failure not captured by technical scores. | Are institutional consequences included in review? |

Model error is the trace left by approximation. It shows where computational simplification meets the complexity of the world.

Overfitting Defined

Overfitting occurs when a model fits the training data too closely, including noise, quirks, accidental correlations, duplicated records, leakage, or sample-specific patterns that do not generalize. An overfit model may look excellent on training data while failing on validation, test, external, or future data.

Overfitting often appears when model capacity is high relative to the amount and quality of data. But it can also occur through repeated experimentation, excessive feature search, test-set reuse, or informal model shopping. The model does not need to be visually complex to overfit; even a simple model can overfit if the evaluation process is compromised.

| Overfitting signal | How it appears | Review response |
| --- | --- | --- |
| Large train-test gap | Training error is low but test error is high. | Compare learning curves and held-out performance. |
| High variance | Small data changes produce large model changes. | Use resampling, regularization, or simpler models. |
| Noise learning | Model follows random fluctuation rather than signal. | Audit residuals, data quality, and label noise. |
| Feature oversearch | Many features are tried until one appears predictive. | Protect validation/test sets and document feature selection. |
| Test-set contamination | Final test results shape further model choices. | Use a new final test or external validation. |
| Brittle performance | Score collapses across time, group, or setting. | Evaluate under shift and monitor deployment drift. |

Overfitting is not just too much complexity. It is a failure to distinguish durable pattern from accidental fit.
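
The feature-oversearch signal can be reproduced with a small simulation. The sketch below is illustrative only: it assumes purely random binary features and labels, so no feature is truly predictive, and the helper names `make_examples` and `accuracy` are hypothetical rather than part of any library. Selecting the best of many features on a small validation set still yields an impressive-looking score that a fresh sample does not confirm.

```python
import random


def make_examples(rng, n_examples, n_features):
    # Features and labels are independent coin flips (an assumption of this sketch),
    # so any apparent predictive power is accidental.
    return [
        ([rng.randint(0, 1) for _ in range(n_features)], rng.randint(0, 1))
        for _ in range(n_examples)
    ]


def accuracy(examples, feature_index):
    # Treat one feature as the "model": predict the label to equal that feature.
    hits = sum(1 for features, label in examples if features[feature_index] == label)
    return hits / len(examples)


rng = random.Random(7)
N_FEATURES = 200
validation = make_examples(rng, 50, N_FEATURES)
fresh_test = make_examples(rng, 2000, N_FEATURES)

# "Model shopping": keep whichever feature happens to score best on validation.
best_feature = max(range(N_FEATURES), key=lambda j: accuracy(validation, j))
best_validation_accuracy = accuracy(validation, best_feature)
fresh_test_accuracy = accuracy(fresh_test, best_feature)

print(f"selected feature: {best_feature}")
print(f"validation accuracy after search: {best_validation_accuracy:.3f}")
print(f"accuracy on fresh data: {fresh_test_accuracy:.3f}")
```

Because the features carry no signal, any validation accuracy well above 0.5 is produced by the search itself, which is why the fresh-sample accuracy falls back toward chance.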

Underfitting Defined

Underfitting occurs when a model is too limited to capture the structure of the task. It may use insufficient features, an overly rigid functional form, weak optimization, inappropriate representation, or an objective that misses important relationships. An underfit model often performs poorly on both training and test data.

Underfitting is sometimes mistaken for responsible simplicity. Simpler models can be valuable, especially when interpretability, stability, and governance matter. But simplicity becomes underfitting when the model systematically misses meaningful patterns that are relevant to the task and the intended use.

| Underfitting signal | How it appears | Review response |
| --- | --- | --- |
| High training error | Model cannot fit even the development data. | Improve representation, features, or model class. |
| Systematic residuals | Errors follow a pattern rather than random scatter. | Inspect residual plots and subgroup errors. |
| Missing interactions | Model ignores relationships among variables. | Consider interaction terms or richer models. |
| Rigid assumptions | Linear model is used for nonlinear structure. | Test alternative functional forms. |
| Weak features | Inputs do not capture relevant information. | Review measurement, feature engineering, and construct validity. |
| Unresolved task ambiguity | Model is asked to learn a poorly defined target. | Clarify the target before increasing model complexity. |

Underfitting reminds us that a model can be stable and still be wrong because it is too limited to represent the problem.

Bias, Variance, and Irreducible Error

The bias-variance framework helps explain model error. Bias refers to error introduced by simplifying assumptions. Variance refers to sensitivity to the particular training data. Irreducible error refers to noise or uncertainty that cannot be eliminated by changing the model alone.

A high-bias model may underfit because it cannot express the true structure. A high-variance model may overfit because it reacts too strongly to sample-specific fluctuations. Good modeling often requires balancing these forces rather than minimizing one in isolation.

| Error component | Meaning | Typical failure |
| --- | --- | --- |
| Bias | Error from simplifying assumptions. | Underfitting, systematic residual patterns. |
| Variance | Error from sensitivity to training data variation. | Overfitting, unstable predictions. |
| Noise | Randomness or unmeasured variability in outcomes. | Limits on achievable accuracy. |
| Measurement error | Distortion in features, labels, or outcomes. | Misleading targets and degraded learning. |
| Specification error | Wrong model form or omitted structure. | Biased conclusions or poor predictions. |
| Deployment error | Mismatch between development and real use. | Performance collapse after release. |

The bias-variance framework is useful because it connects technical fit to a deeper question: what kind of error is the model making?
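
Bias and variance can be estimated empirically by refitting a model on many independently drawn training sets and examining its predictions at one fixed input. The sketch below is a minimal illustration under stated assumptions: the data-generating function, the noise level, the query point `X0`, and the two toy predictors (a global mean and a one-nearest-neighbor rule) are all chosen for this example and are not taken from any particular system.

```python
import math
import random
from statistics import mean, pvariance


def true_function(x):
    # Assumed ground truth for this sketch only.
    return math.sin(2.4 * x)


def sample_training_set(rng, n=30, noise_sd=0.3):
    xs = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    return [(x, true_function(x) + rng.gauss(0.0, noise_sd)) for x in xs]


def predict_global_mean(train, x0):
    # Rigid model: ignores x0 and always predicts the average training label.
    return mean(y for _, y in train)


def predict_nearest_neighbor(train, x0):
    # Flexible model: copies the label of the single closest training point.
    return min(train, key=lambda pair: abs(pair[0] - x0))[1]


def bias_variance_at(predictor, x0, repeats=500, seed=0):
    # Refit on many fresh training sets and summarize the predictions at x0.
    rng = random.Random(seed)
    predictions = [predictor(sample_training_set(rng), x0) for _ in range(repeats)]
    bias_squared = (mean(predictions) - true_function(x0)) ** 2
    variance = pvariance(predictions)
    return bias_squared, variance


X0 = 1.0
mean_bias2, mean_var = bias_variance_at(predict_global_mean, X0)
nn_bias2, nn_var = bias_variance_at(predict_nearest_neighbor, X0)

print(f"global mean:      bias^2={mean_bias2:.3f}  variance={mean_var:.3f}")
print(f"nearest neighbor: bias^2={nn_bias2:.3f}  variance={nn_var:.3f}")
```

In this setup the global mean should show the larger squared bias and the nearest-neighbor rule the larger variance, which is the pattern the table associates with underfitting and overfitting respectively.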

Model Capacity and Complexity

Model capacity is the ability of a model class to represent a wide range of patterns. A low-capacity model may miss important structure. A high-capacity model may fit training data extremely well, including noise. Capacity is not inherently good or bad; it must match data quality, sample size, task complexity, regularization, and evaluation design.

Complexity also includes workflow complexity. A model-selection process can be complex even if the final model looks simple. Repeated feature search, hyperparameter tuning, threshold adjustment, subgroup selection, and post-hoc reporting can create hidden capacity in the development process.

| Complexity source | Possible benefit | Possible risk |
| --- | --- | --- |
| Flexible model class | Captures nonlinear or interaction structure. | Learns noise or unstable patterns. |
| Many features | Provides richer information. | Increases leakage, proxy risk, and spurious correlation. |
| Hyperparameter search | Improves model configuration. | Overfits validation data through repeated tuning. |
| Feature engineering | Makes important structure visible. | Builds target information into inputs accidentally. |
| Ensembles | Can reduce variance or improve accuracy. | May become opaque and difficult to govern. |
| Deep architectures | Can learn high-level representations. | Require large data, careful validation, and monitoring. |

Capacity must be governed. The question is not whether a model can fit, but whether its fit survives accountable evaluation.

Training, Validation, and Test Signals

Training, validation, and test results provide diagnostic signals. Low training error and high validation error suggest overfitting. High training and validation error suggest underfitting. Similar and acceptably low errors suggest better generalization, although external validation may still be required.

These signals must be interpreted carefully. A model can show acceptable aggregate error while failing in subgroups, rare cases, extreme values, or shifted environments. A strong evaluation workflow looks beyond averages to the pattern and consequence of error.

| Training error | Validation/test error | Likely interpretation |
| --- | --- | --- |
| Low | Low | Possible good fit, pending external and subgroup review. |
| Low | High | Likely overfitting or leakage-sensitive evaluation. |
| High | High | Likely underfitting, weak features, or poorly defined task. |
| Moderate | Moderate | May be acceptable if errors are bounded and documented. |
| Unstable | Unstable | Dataset may be too small, noisy, shifted, or poorly sampled. |
| Good aggregate | Poor subgroup | Average performance may hide inequitable or dangerous failure. |

Model evaluation is diagnostic reasoning. Scores are symptoms; the analyst must infer the failure mode.
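
The reasoning in the table can be sketched as a small rule. This is an illustration under stated assumptions, not a standard diagnostic: the function name `diagnose_fit`, the tolerance `acceptable_error`, and the gap limit `max_gap` are hypothetical and would need to be set for the task at hand. The status labels mirror those used in the Python workflow later in this article.

```python
def diagnose_fit(train_error, validation_error, acceptable_error, max_gap):
    """Return a coarse, provisional label for a (training, validation) error pair."""
    gap = validation_error - train_error
    if train_error > acceptable_error:
        # The model cannot even fit the data it was trained on.
        return "possible_underfitting"
    if gap > max_gap:
        # Training looks fine, but held-out error is much worse.
        return "possible_overfitting"
    if validation_error > acceptable_error:
        # No large gap, yet held-out error still exceeds the tolerance.
        return "inconclusive"
    return "candidate_fit"


examples = [
    (0.02, 0.30),  # low training error, high validation error
    (0.40, 0.42),  # both high
    (0.04, 0.05),  # both low
]
for train_error, validation_error in examples:
    label = diagnose_fit(train_error, validation_error, acceptable_error=0.10, max_gap=0.05)
    print(train_error, validation_error, label)
```

A label from a rule like this is only a starting point. A `candidate_fit` result still needs subgroup, edge-case, and external review before it supports any claim about reliability.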

Regularization and Constraints

Regularization adds constraints or penalties that discourage overly complex fits. It can reduce overfitting by limiting model flexibility, shrinking parameters, pruning trees, stopping training early, adding dropout, using simpler features, or enforcing stability. Regularization does not guarantee responsibility, but it can improve generalization when properly validated.

Constraints should be chosen with the task in mind. A penalty that improves average error may still worsen minority-case performance. A simpler model may be easier to audit, but it may underfit. Regularization is therefore part of a larger evaluation process, not a substitute for it.

| Regularization practice | How it works | Review concern |
| --- | --- | --- |
| L1 penalty | Encourages sparse coefficients. | May drop features that matter for smaller groups. |
| L2 penalty | Shrinks coefficients toward smaller values. | May smooth over real heterogeneity. |
| Tree pruning | Limits overly specific splits. | May remove important edge-case structure. |
| Early stopping | Stops training before validation performance degrades. | Depends on valid monitoring data. |
| Dropout | Reduces co-adaptation in neural networks. | Does not solve poor measurement or biased labels. |
| Simpler feature set | Reduces dimensionality and leakage risk. | Can underfit if important information is excluded. |

Regularization disciplines model flexibility, but it does not remove the need for judgment.
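
The shrinking effect of an L2 penalty is easiest to see in the simplest possible case. The sketch below assumes a single feature, no intercept, and an unnormalized objective (sum of squared errors plus `penalty` times the squared slope), for which the minimizing slope has the closed form sum(x*y) / (sum(x*x) + penalty). The function name `ridge_slope` and the toy data are hypothetical.

```python
def ridge_slope(xs, ys, penalty):
    """Slope w minimizing sum((y - w*x)**2) + penalty * w**2, with no intercept."""
    if penalty < 0:
        raise ValueError("penalty must be non-negative")
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    return sum_xy / (sum_xx + penalty)


# Toy data lying close to the line y = 2x (illustrative values only).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

penalties = (0.0, 10.0, 100.0)
slopes = [ridge_slope(xs, ys, penalty) for penalty in penalties]
for penalty, slope in zip(penalties, slopes):
    print(f"penalty={penalty:6.1f}  slope={slope:.4f}")
```

With no penalty the slope is the ordinary least-squares estimate; as the penalty grows, the slope is pulled toward zero. Whether that shrinkage helps or hurts is an empirical question that only held-out evaluation can answer.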

Cross-Validation and Model Selection

Cross-validation estimates model performance by repeatedly splitting data into training and validation folds. It can make better use of limited data and provide insight into variability across splits. It is especially useful for model selection, hyperparameter tuning, and comparing alternatives.

But cross-validation is not magic. If folds are not designed around time, groups, institutions, households, patients, locations, or other dependency structures, validation can be overly optimistic. Cross-validation must respect the way new cases will actually appear.

| Cross-validation issue | Risk | Better practice |
| --- | --- | --- |
| Random folds with repeated entities | Same person or institution leaks across folds. | Use grouped cross-validation. |
| Random folds with time series | Future information helps predict the past. | Use time-aware validation. |
| Many hyperparameter trials | Validation data become overfit. | Use nested validation or protected final test data. |
| Small datasets | Results vary strongly by split. | Report uncertainty across folds. |
| Imbalanced classes | Rare cases may be missing in some folds. | Use stratification and rare-case review. |
| Institutional deployment | Validation does not match real use. | Use site, time, or external validation where possible. |

Cross-validation helps estimate generalization only when the validation design reflects the question being asked.
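
Grouped cross-validation, the first practice in the table, can be sketched in a few lines. The version below is a simplified illustration rather than a replacement for a library implementation: the function name `grouped_folds` and the patient identifiers are hypothetical, and the greedy size balancing is one reasonable choice among several. It assumes there are at least as many distinct groups as folds.

```python
from collections import defaultdict


def grouped_folds(group_ids, n_folds):
    """Yield (train_indices, validation_indices) so no group spans both sides."""
    members = defaultdict(list)
    for index, group in enumerate(group_ids):
        members[group].append(index)

    # Greedy balancing: place the largest groups first, each into the smallest fold.
    fold_indices = [[] for _ in range(n_folds)]
    for group in sorted(members, key=lambda g: len(members[g]), reverse=True):
        smallest_fold = min(fold_indices, key=len)
        smallest_fold.extend(members[group])

    all_indices = set(range(len(group_ids)))
    for held_out in fold_indices:
        validation = sorted(held_out)
        train = sorted(all_indices - set(held_out))
        yield train, validation


# Hypothetical records: several rows belong to the same patient.
patients = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5", "p5", "p6"]
folds = list(grouped_folds(patients, n_folds=3))

for train, validation in folds:
    held_out_groups = sorted({patients[i] for i in validation})
    print(f"validation groups: {held_out_groups}  train rows: {len(train)}")
```

Each patient appears in exactly one validation fold, so a model is never scored on a person whose other records it saw during fitting.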

Error Analysis and Residuals

Error analysis examines where and how the model fails. Residuals, false positives, false negatives, calibration gaps, ranking errors, missed edge cases, and subgroup disparities all reveal different aspects of model behavior. A model with acceptable average error may still fail in ways that matter.

A serious error review asks whether errors are random, systematic, concentrated, explainable, preventable, or harmful. It also asks whether the error metric matches the decision context. In many systems, a false negative and a false positive do not carry the same consequence.

| Error-analysis practice | What it reveals | Governance question |
| --- | --- | --- |
| Residual plots | Systematic under- or over-prediction. | Are errors patterned across the prediction range? |
| Confusion matrix | False positives and false negatives. | Which type of error carries greater harm? |
| Calibration analysis | Whether predicted probabilities match observed rates. | Are scores interpretable for action thresholds? |
| Subgroup evaluation | Error concentration across groups or contexts. | Does performance differ where consequences are high? |
| Edge-case review | Behavior near boundaries or rare cases. | Should the model abstain or escalate uncertainty? |
| Post-deployment monitoring | Error changes after release. | Is the model still valid under current conditions? |

Error analysis turns model evaluation from a scorecard into an interpretive audit.
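
A confusion matrix combined with subgroup evaluation can show how an acceptable aggregate score hides concentrated failure. The records below are invented purely for illustration, and the group labels `A` and `B` and the helper names are hypothetical; the sketch only demonstrates the bookkeeping.

```python
from collections import defaultdict

# Hypothetical records: (group, actual_label, predicted_label), with 1 = positive.
records = (
    [("A", 1, 1)] * 4 + [("A", 0, 0)] * 4
    + [("B", 1, 1)] * 1 + [("B", 1, 0)] * 2 + [("B", 0, 0)] * 1
)


def confusion_counts(rows):
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for _, actual, predicted in rows:
        if actual == 1:
            counts["tp" if predicted == 1 else "fn"] += 1
        else:
            counts["fp" if predicted == 1 else "tn"] += 1
    return counts


def accuracy(counts):
    return (counts["tp"] + counts["tn"]) / sum(counts.values())


def false_negative_rate(counts):
    positives = counts["tp"] + counts["fn"]
    return counts["fn"] / positives if positives else float("nan")


by_group = defaultdict(list)
for row in records:
    by_group[row[0]].append(row)

overall_accuracy = accuracy(confusion_counts(records))
group_fnr = {
    group: false_negative_rate(confusion_counts(rows))
    for group, rows in by_group.items()
}

print(f"overall accuracy: {overall_accuracy:.3f}")
for group, rate in sorted(group_fnr.items()):
    print(f"group {group}: false negative rate = {rate:.3f}")
```

Here ten of twelve predictions are correct overall, yet every missed positive case falls in group B. Whether that pattern is tolerable depends on what a missed positive costs in the decision context.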

Data Leakage, Noise, and Measurement Error

Data leakage occurs when information is available during training or evaluation that would not be available at the moment of real prediction. Leakage can make a model seem powerful while destroying real-world validity. It is one of the most common causes of misleadingly low error.

Noise and measurement error also shape model failure. Labels may be inconsistent. Features may be proxies. Historical records may reflect institutional practice rather than the underlying phenomenon. The model may be blamed for error that begins in the data-generation process.

| Data problem | How it distorts error | Review response |
| --- | --- | --- |
| Target leakage | Inputs contain information from the outcome. | Audit feature timing and data lineage. |
| Duplicate leakage | Near-identical records appear in train and test sets. | Deduplicate and split by entity. |
| Noisy labels | Model learns inconsistent targets. | Estimate label reliability and audit annotation rules. |
| Proxy variables | Model learns substitutes rather than the intended construct. | Review construct validity and social meaning. |
| Missingness | Absent data reflects institutional processes. | Analyze missingness mechanisms. |
| Historical bias | Past decisions shape labels and outcomes. | Separate prediction accuracy from normative validity. |

Before asking whether a model fits data, ask what the data are actually measuring.
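
Two of the review responses in the table, auditing feature timing and splitting by entity, lend themselves to simple automated checks. The sketch below is a hedged illustration: the feature names, timestamps, and entity identifiers are invented, and the helper functions `features_recorded_after` and `shared_entities` are hypothetical. It assumes each feature carries a recording timestamp, which real data lineage may not provide.

```python
from datetime import datetime


def features_recorded_after(prediction_time, feature_times):
    """Names of features whose values were recorded after the prediction moment."""
    return sorted(
        name for name, recorded_at in feature_times.items()
        if recorded_at > prediction_time
    )


def shared_entities(train_ids, test_ids):
    """Entity identifiers that appear on both sides of a train/test split."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical example: predicting at 09:00 on the day of admission.
prediction_time = datetime(2024, 3, 1, 9, 0)
feature_times = {
    "age_at_admission": datetime(2024, 3, 1, 8, 30),
    "first_lab_result": datetime(2024, 3, 1, 8, 45),
    "discharge_code": datetime(2024, 3, 4, 16, 0),  # only known days later
}

late_features = features_recorded_after(prediction_time, feature_times)
overlap = shared_entities(["p1", "p2", "p3"], ["p3", "p4"])

print("possible target leakage:", late_features)  # ['discharge_code']
print("entities in both splits:", overlap)        # ['p3']
```

Checks like these catch only the leakage that timestamps and identifiers make visible. Leakage through proxies or derived features still requires a manual review of how each input was produced.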

Distribution Shift and Deployment Error

Distribution shift occurs when the data seen during deployment differ from the data used in development. This can happen because populations change, policies change, measurement systems change, incentives change, behavior adapts, or the model itself alters the environment.

A model can be neither overfit nor underfit on historical data and still fail after deployment. That is why model error must be reviewed across time, sites, groups, and operational conditions. Deployment is not the end of evaluation; it is where new forms of error become visible.

| Shift type | Example | Review response |
| --- | --- | --- |
| Population shift | The served population changes. | Monitor feature and outcome distributions. |
| Temporal shift | Patterns change over months or years. | Use time-aware validation and drift monitoring. |
| Policy shift | Rules alter who is observed or treated. | Revalidate after institutional changes. |
| Measurement shift | Data collection systems change. | Track schema, instruments, and coding practices. |
| Feedback shift | Model decisions change future data. | Monitor feedback loops and perform impact review. |
| Adversarial shift | People adapt behavior to the model. | Stress-test against gaming and strategic response. |

Generalization is not a one-time property. It must be maintained under changing conditions.
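
Monitoring a feature distribution for shift can be sketched with the population stability index (PSI), one commonly used drift summary. The version below is a minimal illustration, not a reference implementation: it assumes equal-width bins over the reference range, clamps out-of-range values into the end bins, and applies a small floor to empty bins to avoid taking the logarithm of zero. The synthetic samples and the function name are chosen for this example.

```python
import math
import random


def population_stability_index(reference, current, n_bins=10, floor=1e-6):
    """Sum over bins of (current share - reference share) * ln(current / reference)."""
    low, high = min(reference), max(reference)
    width = (high - low) / n_bins

    def bin_shares(values):
        counts = [0] * n_bins
        for value in values:
            index = int((value - low) / width) if width > 0 else 0
            # Clamp values outside the reference range into the first or last bin.
            counts[min(max(index, 0), n_bins - 1)] += 1
        # Floor empty bins so the logarithm below is always defined.
        return [max(count / len(values), floor) for count in counts]

    reference_shares = bin_shares(reference)
    current_shares = bin_shares(current)
    return sum(
        (cur - ref) * math.log(cur / ref)
        for ref, cur in zip(reference_shares, current_shares)
    )


rng = random.Random(11)
development = [rng.gauss(0.0, 1.0) for _ in range(2000)]
same_population = [rng.gauss(0.0, 1.0) for _ in range(2000)]
shifted_population = [rng.gauss(1.0, 1.0) for _ in range(2000)]

psi_same = population_stability_index(development, same_population)
psi_shifted = population_stability_index(development, shifted_population)

print(f"PSI, same population:    {psi_same:.4f}")
print(f"PSI, shifted population: {psi_shifted:.4f}")
```

The index should stay near zero when the two samples come from the same distribution and rise when the mean moves. What value should trigger revalidation is a governance decision, and a drift statistic on inputs says nothing by itself about whether prediction error has changed.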

Governance and Responsible Use

Model error requires governance because errors are not evenly distributed and do not carry equal consequences. In a low-stakes recommendation setting, some error may be acceptable. In health, finance, education, employment, public services, infrastructure, or legal settings, error can affect access, safety, opportunity, and accountability.

Governance should define acceptable error, unacceptable error, escalation rules, monitoring obligations, documentation standards, review rights, and conditions under which the model should not be used. A model should not be deployed merely because it improves an aggregate metric.

| Governance concern | Error question | Documentation |
| --- | --- | --- |
| Use boundary | Where should the model not be used? | Scope and limitation statement. |
| Error tolerance | Which errors are acceptable? | Error-budget and risk-threshold record. |
| Subgroup impact | Where does error concentrate? | Disaggregated performance report. |
| Human review | When should uncertainty trigger escalation? | Escalation and contestability pathway. |
| Monitoring | How will error be detected after deployment? | Post-deployment monitoring plan. |
| Retirement | When should the model be revised or withdrawn? | Lifecycle and decommissioning criteria. |

Responsible model governance treats error as an institutional responsibility, not just a technical artifact.

Representation Risk

Representation risk appears when model error is described in ways that make the system seem more reliable than it is. A single accuracy score may hide overfitting. A small average error may hide severe subgroup harm. A clean validation result may hide leakage. A technical discussion of bias and variance may hide the politics of measurement.

Another risk is error laundering: using formal metrics to make a contested system appear objective. A model can be statistically well evaluated and still be inappropriate for its institutional purpose if the task, labels, action pathway, or consequences are poorly defined.

| Representation risk | How it appears | Review response |
| --- | --- | --- |
| Score overclaiming | One metric is treated as proof of reliability. | Report multiple metrics and error distributions. |
| Hidden overfitting | Model-selection process is not disclosed. | Document tuning, feature search, and test-set use. |
| Underfitting as fairness | Weak model is defended because it is simple. | Separate interpretability from adequacy. |
| Aggregate masking | Average performance hides group or edge-case failure. | Use subgroup, edge-case, and external validation. |
| Error normalization | Known harms are treated as unavoidable noise. | Define unacceptable error and escalation rules. |
| Technical authority | Metrics discourage challenge or appeal. | Preserve contestability and human judgment. |

Error should make a model more accountable, not more mysterious.

Examples of Overfitting, Underfitting, and Model Error

The examples below show how model error appears across technical, institutional, scientific, and policy settings.

Polynomial regression

A very high-degree curve can follow every training point while failing to approximate the underlying pattern.

Medical risk scoring

A model may perform well on one hospital’s historical data but fail after deployment in a different clinical setting.

Hiring prediction

A model can learn historical hiring patterns that fit past data while reproducing institutional bias.

Credit scoring

Aggregate accuracy may hide higher false-denial rates for particular groups or contexts.

Fraud detection

A rigid model may underfit evolving fraud strategies, while a flexible model may overfit past cases.

Language classification

A model may fit dataset artifacts, formatting cues, or annotation habits rather than linguistic meaning.

Climate and infrastructure modeling

Simplified models may underfit nonlinear dynamics, while over-tuned simulations may not generalize across scenarios.

Education analytics

A student-risk model may overfit past cohorts or underfit contextual factors that affect learning.

Across these examples, the important question is not only whether the model is wrong, but what kind of wrongness it produces.

Mathematics, Computation, and Modeling

For a supervised learning problem, prediction error can be written as a loss between an observed value and a prediction:

\[
L(y, \hat{f}(x))
\]

Interpretation: The loss function measures how costly it is for the model prediction \(\hat{f}(x)\) to differ from the observed or target value \(y\).

Mean squared error can be written as:

\[
\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2
\]

Interpretation: Squared error penalizes larger prediction errors more heavily than smaller ones.
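
A short worked example makes the penalty on large errors concrete. The values below are invented for illustration: both prediction sets miss the observed values by the same total absolute amount (2.0), but one spreads the miss across four cases while the other concentrates it in a single case.

```python
def mean_squared_error(observed, predicted):
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted)) / len(observed)


observed = [1.0, 2.0, 3.0, 4.0]
four_small_errors = [1.5, 2.5, 3.5, 4.5]  # each prediction is off by 0.5
one_large_error = [1.0, 2.0, 3.0, 6.0]    # one prediction is off by 2.0

mse_small = mean_squared_error(observed, four_small_errors)
mse_large = mean_squared_error(observed, one_large_error)

print(mse_small)  # 0.25
print(mse_large)  # 1.0
```

The concentrated error produces an MSE four times larger even though the total absolute error is identical, which is why the choice between squared and absolute loss should reflect how costly large individual mistakes are in the intended use.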

A common decomposition describes expected prediction error as:

\[
E[(Y - \hat{f}(X))^2] = \text{Bias}^2[\hat{f}(X)] + \text{Var}[\hat{f}(X)] + \sigma^2
\]

Interpretation: Expected error can be understood through squared bias, variance, and irreducible noise.

A regularized empirical-risk objective can be written as:

\[
\min_f \left[ \frac{1}{n}\sum_{i=1}^{n} L(y_i, f(x_i)) + \lambda \Omega(f) \right]
\]

Interpretation: The model minimizes training loss while also paying a penalty for complexity, controlled by \(\lambda\).

A generalization gap can be expressed as:

\[
G = \text{Error}_{\text{test}} - \text{Error}_{\text{train}}
\]

Interpretation: A large positive gap suggests that performance on new data is worse than performance on training data.

These formulas show why model error is not one idea. It combines fit, complexity, uncertainty, data quality, and the relationship between development and use.

Python Workflow: Model Error Audit

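The Python workflow below creates a dependency-light model error audit. It generates synthetic data, fits lightly ridge-penalized polynomial models of increasing degree, compares training and validation error, flags likely underfitting and overfitting using fixed thresholds on those errors, and writes reproducible CSV and JSON outputs. It does not compute per-observation residuals; the diagnostics are the training MSE, the validation MSE, and their gap for each degree.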

# overfitting_underfitting_model_error_audit.py
# Dependency-light workflow for model complexity, training error,
# validation error, residual diagnostics, and error governance records.

from __future__ import annotations

from dataclasses import asdict, dataclass
from pathlib import Path
from statistics import mean
import csv
import json
import math
import random
from datetime import datetime, timezone

ARTICLE_ROOT = Path(__file__).resolve().parents[1]
TABLES = ARTICLE_ROOT / "outputs" / "tables"
JSON_DIR = ARTICLE_ROOT / "outputs" / "json"


@dataclass(frozen=True)
class ErrorAuditConfig:
    seed: int = 2026
    n: int = 180
    train_fraction: float = 0.70
    noise_sd: float = 0.18
    max_degree: int = 9


def timestamp_utc() -> str:
    return datetime.now(timezone.utc).isoformat()


def true_function(x: float) -> float:
    return math.sin(2.4 * x) + 0.35 * math.cos(5.0 * x)


def generate_data(config: ErrorAuditConfig) -> list[dict[str, float]]:
    rng = random.Random(config.seed)
    rows = []
    for i in range(config.n):
        x = rng.uniform(-2.0, 2.0)
        y_true = true_function(x)
        y = y_true + rng.gauss(0.0, config.noise_sd)
        rows.append({"unit_id": i + 1, "x": x, "y_true": y_true, "y": y})
    rng.shuffle(rows)
    cutoff = int(config.n * config.train_fraction)
    for index, row in enumerate(rows):
        row["split"] = "train" if index < cutoff else "validation"
    return rows


def polynomial_features(x: float, degree: int) -> list[float]:
    return [x ** power for power in range(degree + 1)]


def solve_linear_system(matrix: list[list[float]], vector: list[float]) -> list[float]:
    n = len(vector)
    augmented = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for column in range(n):
        pivot = max(range(column, n), key=lambda row: abs(augmented[row][column]))
        augmented[column], augmented[pivot] = augmented[pivot], augmented[column]
        divisor = augmented[column][column]
        if abs(divisor) < 1e-12:
            continue
        for j in range(column, n + 1):
            augmented[column][j] /= divisor
        for row in range(n):
            if row == column:
                continue
            factor = augmented[row][column]
            for j in range(column, n + 1):
                augmented[row][j] -= factor * augmented[column][j]
    return [augmented[i][n] for i in range(n)]


def fit_ridge_polynomial(rows: list[dict[str, float]], degree: int, penalty: float = 1e-4) -> list[float]:
    size = degree + 1
    xtx = [[0.0 for _ in range(size)] for _ in range(size)]
    xty = [0.0 for _ in range(size)]
    for row in rows:
        features = polynomial_features(float(row["x"]), degree)
        y = float(row["y"])
        for i in range(size):
            xty[i] += features[i] * y
            for j in range(size):
                xtx[i][j] += features[i] * features[j]
    for i in range(1, size):
        xtx[i][i] += penalty
    return solve_linear_system(xtx, xty)


def predict(coefficients: list[float], x: float) -> float:
    return sum(weight * (x ** power) for power, weight in enumerate(coefficients))


def mse(rows: list[dict[str, float]], coefficients: list[float]) -> float:
    return mean((float(row["y"]) - predict(coefficients, float(row["x"]))) ** 2 for row in rows)


def write_csv(path: Path, rows: list[dict[str, object]]) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    if not rows:
        path.write_text("", encoding="utf-8")
        return
    fieldnames = sorted({key for row in rows for key in row.keys()})
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(rows)


def write_json(path: Path, payload: object) -> None:
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(payload, indent=2, sort_keys=True), encoding="utf-8")


def main() -> None:
    config = ErrorAuditConfig()
    rows = generate_data(config)
    train = [row for row in rows if row["split"] == "train"]
    validation = [row for row in rows if row["split"] == "validation"]
    diagnostics = []
    for degree in range(1, config.max_degree + 1):
        coefficients = fit_ridge_polynomial(train, degree)
        train_mse = mse(train, coefficients)
        validation_mse = mse(validation, coefficients)
        gap = validation_mse - train_mse
        if train_mse > 0.15 and validation_mse > 0.15:
            status = "possible_underfitting"
        elif gap > 0.12:
            status = "possible_overfitting"
        else:
            status = "candidate_fit"
        diagnostics.append({
            "degree": degree,
            "training_mse": round(train_mse, 6),
            "validation_mse": round(validation_mse, 6),
            "generalization_gap": round(gap, 6),
            "status": status,
            "interpretation": "Compare training and validation error before trusting apparent fit."
        })
    best = min(diagnostics, key=lambda row: row["validation_mse"])
    audit_summary = {
        "article": "overfitting_underfitting_and_model_error",
        "timestamp_utc": timestamp_utc(),
        "n": config.n,
        "train_rows": len(train),
        "validation_rows": len(validation),
        "best_degree_by_validation_mse": best["degree"],
        "best_validation_mse": best["validation_mse"],
        "largest_generalization_gap": max(row["generalization_gap"] for row in diagnostics),
        "interpretation": "Model error review should compare underfitting, overfitting, validation performance, residual structure, and use-boundary risk."
    }
    write_csv(TABLES / "model_error_synthetic_observations.csv", rows)
    write_csv(TABLES / "model_complexity_diagnostics.csv", diagnostics)
    write_csv(TABLES / "model_error_audit_summary.csv", [audit_summary])
    write_json(JSON_DIR / "model_error_audit_config.json", asdict(config))
    write_json(JSON_DIR / "model_complexity_diagnostics.json", diagnostics)
    write_json(JSON_DIR / "model_error_audit_summary.json", audit_summary)
    print("Model error audit complete.")
    print(TABLES / "model_error_audit_summary.csv")


if __name__ == "__main__":
    main()

The workflow is intentionally small enough to inspect. Its purpose is not to replace production modeling libraries, but to make the logic of underfitting, overfitting, and validation error visible.

R Workflow: Error Summary and Diagnostics

The R workflow reads the Python-generated diagnostics and produces simple visual summaries of training error, validation error, and generalization gaps across model complexity.

# overfitting_underfitting_model_error_summary.R
args <- commandArgs(trailingOnly = FALSE)
file_arg <- grep("^--file=", args, value = TRUE)
if (length(file_arg) > 0) {
  script_path <- normalizePath(sub("^--file=", "", file_arg[1]), mustWork = TRUE)
  article_root <- normalizePath(file.path(dirname(script_path), ".."), mustWork = TRUE)
} else {
  article_root <- getwd()
}
setwd(article_root)

tables_dir <- file.path(article_root, "outputs", "tables")
figures_dir <- file.path(article_root, "outputs", "figures")
dir.create(figures_dir, recursive = TRUE, showWarnings = FALSE)

diagnostics_path <- file.path(tables_dir, "model_complexity_diagnostics.csv")
if (!file.exists(diagnostics_path)) {
  stop(paste0("Missing ", diagnostics_path, ". Run the Python workflow first."))
}

diagnostics <- read.csv(diagnostics_path, stringsAsFactors = FALSE)

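# Training error (solid line) and validation error (dashed line) by polynomial degree.
png(file.path(figures_dir, "training_vs_validation_error.png"), width = 1300, height = 850)
plot(diagnostics$degree, diagnostics$training_mse, type = "b",
     ylim = range(c(diagnostics$training_mse, diagnostics$validation_mse)),
     xlab = "Polynomial degree", ylab = "Mean squared error",
     main = "Training and Validation Error by Model Complexity")
lines(diagnostics$degree, diagnostics$validation_mse, type = "b", lty = 2)
# Without a legend the two curves cannot be told apart in the saved figure.
legend("top", legend = c("Training MSE", "Validation MSE"), lty = c(1, 2), pch = 1)
grid()
dev.off()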

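# Generalization gap (validation MSE minus training MSE) for each polynomial degree.
png(file.path(figures_dir, "generalization_gap.png"), width = 1300, height = 850)
barplot(diagnostics$generalization_gap, names.arg = diagnostics$degree,
        xlab = "Polynomial degree", ylab = "Validation MSE - Training MSE",
        main = "Generalization Gap by Model Complexity")
abline(h = 0, lty = 2)
# Horizontal grid lines only: vertical grid lines do not align with bar midpoints.
grid(nx = NA, ny = NULL)
dev.off()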

summary_table <- data.frame(
  candidate_models = nrow(diagnostics),
  best_degree = diagnostics$degree[which.min(diagnostics$validation_mse)],
  best_validation_mse = min(diagnostics$validation_mse),
  largest_generalization_gap = max(diagnostics$generalization_gap),
  possible_overfitting_count = sum(diagnostics$status == "possible_overfitting"),
  possible_underfitting_count = sum(diagnostics$status == "possible_underfitting")
)

write.csv(summary_table, file.path(tables_dir, "r_model_error_summary.csv"), row.names = FALSE)
print(summary_table)

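The R layer gives the repository a second summary path for reviewing fit, complexity, and generalization gaps. Because it reads the diagnostics written by the Python workflow, it cross-checks how those results are summarized and plotted, not how they were computed.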

GitHub Repository

The companion repository contains reproducible workflows, synthetic data, audit outputs, calculators, documentation, and multilingual examples for this article.

A Practical Method for Reviewing Model Error

A practical model-error review should begin before the final score is reported. It should examine the task definition, data quality, split design, model capacity, tuning process, validation signals, error distribution, deployment context, and governance boundary.

| Step | Review action | Output |
|---|---|---|
| 1 | Define the intended prediction or classification task. | Task and target statement. |
| 2 | Audit features, labels, leakage, and measurement error. | Data-quality and construct-validity record. |
| 3 | Compare simple, moderate, and flexible model classes. | Complexity comparison table. |
| 4 | Track training, validation, and test error separately. | Generalization-gap report. |
| 5 | Analyze residuals, false positives, false negatives, and calibration. | Error-analysis report. |
| 6 | Evaluate performance across groups, sites, time periods, and edge cases. | Disaggregated validation report. |
| 7 | Define use boundaries, monitoring triggers, and retirement criteria. | Governance and lifecycle plan. |

The goal is not to eliminate all error. The goal is to know what the error means before the model is allowed to shape action.
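
Steps 3 and 4 can be illustrated with a minimal sketch. It is not the article's polynomial workflow: for brevity it swaps in a k-nearest-neighbour averager, where a small k is the flexible model and a large k is the rigid one. The data generator, the function names, and the 140/60 split are all assumptions made for illustration.

```python
import math
import random
from statistics import fmean


def make_data(n, seed):
    """Synthetic 1-D regression data: a sine signal plus Gaussian noise (assumed setup)."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 6.0) for _ in range(n)]
    return [(x, math.sin(x) + rng.gauss(0.0, 0.3)) for x in xs]


def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return fmean(y for _, y in nearest)


def mse(train, rows, k):
    """Mean squared error of the k-NN predictor, fitted on train, evaluated on rows."""
    return fmean((y - knn_predict(train, x, k)) ** 2 for x, y in rows)


def complexity_report(train, validation, ks):
    """One record per candidate k, keeping training and validation error separate."""
    report = []
    for k in ks:
        train_mse = mse(train, train, k)
        validation_mse = mse(train, validation, k)
        report.append({
            "k": k,
            "training_mse": train_mse,
            "validation_mse": validation_mse,
            "generalization_gap": validation_mse - train_mse,
        })
    return report


data = make_data(200, seed=7)
train, validation = data[:140], data[140:]
# k=1 memorizes the training set; k=140 predicts the training mean everywhere.
report = complexity_report(train, validation, ks=[1, 5, 15, 140])
for row in report:
    print(row)
```

With k=1 each training point is its own nearest neighbour, so training error is exactly zero while validation error is not: the signature of memorization. With k=140 the model ignores x entirely, and both errors stay high: the signature of underfitting. Neither extreme can be diagnosed from training error alone.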

Common Pitfalls

Overfitting, underfitting, and model error are often misunderstood because evaluation is reduced to a single metric. A serious review must avoid common shortcuts.

| Pitfall | Why it matters | Correction |
|---|---|---|
| Trusting training performance | Training fit can reflect memorization. | Use protected validation and test data. |
| Calling complexity the problem | Simple models can also be wrong. | Diagnose whether error comes from bias, variance, data, or deployment. |
| Ignoring label noise | The model may learn inconsistent or unjust targets. | Audit label reliability and target meaning. |
| Reusing the test set | Repeated testing turns the test set into tuning data. | Protect final evaluation or use external validation. |
| Reporting only averages | Average performance hides concentrated harm. | Report disaggregated and edge-case errors. |
| Assuming validation is permanent | Worlds, data, and institutions change. | Monitor drift and revalidate over time. |

The most dangerous model errors are often the ones that the evaluation design failed to look for.
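
The "reporting only averages" pitfall is easy to show with a small sketch. The records below are hypothetical, hand-made predictions from two sites, and the function name is an assumption for illustration; the point is only that an acceptable overall error rate can coexist with a very poor one in a small group.

```python
from collections import defaultdict


def disaggregated_error_rates(records):
    """Return (overall, per-group) error rates from (group, y_true, y_pred) records."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, y_true, y_pred in records:
        totals[group] += 1
        errors[group] += int(y_true != y_pred)
    overall = sum(errors.values()) / sum(totals.values())
    by_group = {group: errors[group] / totals[group] for group in totals}
    return overall, by_group


# Hypothetical data: 90 records from site "A" (2 errors), 10 from site "B" (5 errors).
records = (
    [("A", 1, 1)] * 88 + [("A", 1, 0)] * 2
    + [("B", 1, 1)] * 5 + [("B", 1, 0)] * 5
)
overall, by_group = disaggregated_error_rates(records)
print(f"overall error rate: {overall:.3f}")
for group, rate in sorted(by_group.items()):
    print(f"site {group}: {rate:.3f}")
```

Here the overall error rate is 7 in 100, yet half of the predictions for site B are wrong. A report that shows only the average would never surface that concentration.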

Why Model Error Is Computational Reasoning

Overfitting, underfitting, and model error are central to computational reasoning because they force analysts to distinguish appearance from reliability. A model that fits training data is not necessarily learning. A model that is simple is not necessarily trustworthy. A model with a strong average score is not necessarily safe to deploy.

Computational reasoning requires asking what the model learned, what it missed, what it memorized, where it fails, which errors matter, and whether the evidence supports the intended use. Error is not only a technical residue. It is a way of seeing the limits of the model.

Responsible machine learning therefore depends on error literacy: the ability to interpret fit, complexity, uncertainty, and failure before algorithmic outputs are treated as grounds for action.
