Supervised, Unsupervised, and Reinforcement Learning - Sustainable Catalyst | Open Knowledge Lab for Ethical Strategy and Systems Intelligence

Last Updated May 10, 2026

Supervised, unsupervised, and reinforcement learning are not merely categories of machine learning. They are fundamentally different formulations of how artificial systems acquire structure from experience, defined by the kind of information available, the objective being optimized, the timing of feedback, and the relationship between observation, inference, and action. Supervised learning estimates relationships from labeled examples. Unsupervised learning discovers structure in unlabeled data. Reinforcement learning optimizes behavior through sequential interaction with an environment. Each paradigm therefore represents a distinct theory of learning under uncertainty.

The central argument of this article is that learning paradigms should be understood as governed information structures. A learning system is not defined only by its architecture or algorithm. It is defined by what signal it receives, what objective it optimizes, how feedback arrives, how uncertainty is handled, how evaluation is performed, and how its outputs or actions affect the world. A supervised model can inherit flawed labels. An unsupervised system can impose misleading structure. A reinforcement learning agent can optimize a reward in ways that violate human intention. The paradigm shapes the risk.

Series context: This article is part of the Artificial Intelligence Systems knowledge series, which examines machine learning, foundation models, data systems, automation, governance, accountability, human oversight, risk, infrastructure, and the social consequences of intelligent systems.

Figure: Machine learning paradigms system diagram (supervised learning from labeled examples, unsupervised learning from latent patterns and clusters, reinforcement learning through agent-environment reward loops, hybrid representation learning, evaluation diagnostics, human oversight, and audit controls). Caption: Supervised, unsupervised, and reinforcement learning represent distinct machine learning paradigms defined by different information structures: labeled examples, latent structure in unlabeled data, and reward-driven interaction with an environment.

Understanding these paradigms requires moving beyond surface definitions. The central difference is not simply whether labels are present. It is the structure of the learning signal. In supervised learning, feedback is direct: the system observes inputs and target outputs. In unsupervised learning, feedback is implicit: the system must infer structure from the distribution of observations. In reinforcement learning, feedback is delayed and action-dependent: the system learns through rewards generated by interaction over time. These differences shape what models can learn, how they generalize, how they fail, and how they should be evaluated and governed.

This article is an advanced entry on supervised, unsupervised, and reinforcement learning within the Artificial Intelligence Systems knowledge series. It explains learning paradigms as information structures; supervised learning as conditional estimation; unsupervised learning as distribution modeling and latent-structure discovery; reinforcement learning as sequential decision-making; and modern hybrid systems such as self-supervised learning, semi-supervised learning, representation learning, transfer learning, reward modeling, and reinforcement learning from human feedback. Selected Python and R examples appear here, while the full GitHub repository contains expanded computational scaffolding for supervised classification, clustering, dimensionality reduction, Q-learning intuition, reward modeling, grouped diagnostics, SQL metadata, governance notes, and advanced Jupyter notebooks.

Why Learning Paradigms Matter

Learning paradigms matter because they define the structure of evidence available to an artificial intelligence system. A supervised model learns from examples where the desired answer is supplied. An unsupervised model must infer structure without explicit answers. A reinforcement learning agent learns through consequences that may appear only after a sequence of actions. These differences shape data requirements, objective design, evaluation strategy, uncertainty, failure modes, governance, and the kinds of claims that can responsibly be made about model behavior.

A supervised classifier trained on labeled images, an unsupervised clustering model organizing customer segments, and a reinforcement learning agent controlling a robot are not just different applications. They embody different relationships between information and action. In one case, the system estimates from known targets. In another, it discovers structure that may or may not correspond to meaningful categories. In the third, it acts in the world, receives feedback, and changes future behavior based on reward.

This distinction is essential for trustworthy AI. A supervised model may inherit bias from labels. An unsupervised model may impose misleading structure on ambiguous data. A reinforcement learning system may optimize a reward function in ways its designers did not intend. Learning paradigms are therefore not neutral technical categories. They shape how systems know, act, fail, and produce consequences.

\[
Learning\ Paradigm = Signal + Objective + Feedback
\]

Interpretation: A learning paradigm is defined by the information available to the system, the objective it optimizes, and the feedback structure that shapes its behavior.

Why Learning Paradigms Matter
Paradigm	Learning Signal	Main Capability	Governance Risk
Supervised learning	Labeled input-output examples.	Predicts labels, values, rankings, or structured outputs.	Can scale biased, noisy, incomplete, or historically contingent labels.
Unsupervised learning	Unlabeled observations.	Discovers clusters, latent structure, embeddings, anomalies, or distributions.	Can impose categories that are mathematically coherent but socially or scientifically misleading.
Reinforcement learning	States, actions, rewards, and transitions.	Learns policies for sequential decision-making.	Can optimize reward in ways that violate intended goals or safety constraints.
Self-supervised learning	Training signals generated from data structure itself.	Learns representations at large scale.	Can inherit hidden bias, data provenance problems, and weak grounding from unlabeled corpora.
Hybrid learning systems	Labels, unlabeled data, preferences, rewards, retrieval, and interaction.	Combines multiple learning signals.	Harder to audit because behavior reflects many training stages.

Note: Learning paradigms define the evidence structure of AI systems. Evaluation and governance should be matched to that structure.

Learning as Information Structure: Signal, Objective, and Feedback

All machine learning systems can be analyzed through three core components: signal, objective, and feedback. Signal is the information available to the system. Objective is the function or criterion the system is trying to optimize. Feedback is how the system receives information about its performance.

The three major learning paradigms differ because these components are configured differently:

Supervised learning: full input-output pairs are observed.
Unsupervised learning: only inputs are observed, and structure must be inferred.
Reinforcement learning: states, actions, and rewards are observed, and feedback is action-dependent and often sparse and delayed.

This framing reveals a deeper unity. All learning can be understood as inference under constraints imposed by available information. When feedback is rich and direct, learning can be comparatively well specified. When feedback is sparse, indirect, delayed, or ambiguous, the learning problem becomes more difficult and the risk of misinterpretation increases.

A learning system is therefore not merely a model architecture. It is an information relationship. The structure of the available signal determines what can be learned, how confidently it can be evaluated, and what kinds of governance controls are necessary.

Learning as Information Structure
Component	Definition	Example	Governance Question
Signal	Information available during learning.	Labels, unlabeled observations, rewards, preferences, demonstrations.	What does the system actually observe?
Objective	Criterion being optimized.	Prediction loss, reconstruction error, contrastive loss, reward.	Does the objective match the intended purpose?
Feedback timing	When learning receives correction or consequence.	Immediate labels, implicit structure, delayed rewards.	Can errors be attributed to the right causes?
Evaluation evidence	How performance is assessed.	Held-out accuracy, cluster stability, return, safety tests.	Does evaluation match deployment risk?
Deployment effect	How model outputs influence future data or environments.	Predictions, categories, recommendations, actions, policies.	Does the system change the world it learns from?

Note: Learning systems should be governed according to the type of signal they use and the consequences of acting on that signal.

\[
Available\ Signal \Rightarrow Learnable\ Structure
\]

Interpretation: A model can only learn patterns made available through its data, objective, feedback, and interaction with the environment.

Supervised Learning: Conditional Estimation from Labeled Data

Supervised learning is the most structured machine learning paradigm. The system observes examples of the form \((x_i,y_i)\), where \(x_i\) is an input and \(y_i\) is the corresponding target. The goal is to learn a function that maps inputs to outputs.

A supervised learning problem can be written as:

\[
D=\{(x_i,y_i)\}_{i=1}^{n}
\]

Interpretation: A supervised dataset contains labeled input-output pairs used to train a model.

The system may estimate a conditional distribution:

\[
P(y\mid x)
\]

Interpretation: Supervised learning estimates how likely an output \(y\) is given input \(x\).

Or it may learn a deterministic mapping:

\[
\hat{y}=f_\theta(x)
\]

Interpretation: The model \(f_\theta\) maps input \(x\) to predicted output \(\hat{y}\).

Training is often formulated as minimizing empirical loss:

\[
\theta^*
=
\arg\min_{\theta}
\frac{1}{n}
\sum_{i=1}^{n}
\ell(y_i,f_\theta(x_i))
\]

Interpretation: The model selects parameters that reduce prediction error on labeled examples.
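The empirical loss minimization above can be illustrated with a minimal sketch. It assumes a one-dimensional linear model \(f_\theta(x)=wx+b\), squared-error loss, plain gradient descent, and synthetic noise-free data; the function names, learning rate, and step count are illustrative choices, not part of the article's repository.

```python
# Minimal sketch of empirical risk minimization (assumed setup: 1-D linear
# model f_theta(x) = w * x + b with squared-error loss; data is synthetic).

def empirical_risk(w, b, xs, ys):
    """Mean squared error: (1/n) * sum of loss(y_i, f_theta(x_i))."""
    n = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def fit_linear(xs, ys, lr=0.05, steps=2000):
    """Plain gradient descent on the empirical risk."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the mean squared error with respect to w and b.
        grad_w = sum(-2 * x * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        grad_b = sum(-2 * (y - (w * x + b)) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Synthetic labeled pairs generated from y = 2x + 1 (no noise, for clarity).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2 * x + 1 for x in xs]

w, b = fit_linear(xs, ys)
print(round(w, 3), round(b, 3), round(empirical_risk(w, b, xs, ys), 6))
```

Because the labels here are generated by the same rule the model can represent, the fitted parameters approach the generating values. With real labels, the minimizer reproduces whatever regularities the label system contains, including its errors.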

This framework supports classification, regression, ranking, forecasting, structured prediction, object detection, segmentation, anomaly scoring with labeled data, and many domain-specific decision-support tasks. Because the correct output is provided during training, the learning signal is strong and direct.

But this strength is also a limitation. Supervised learning depends on labels. Labels may be expensive, incomplete, noisy, biased, historically contingent, or misaligned with the real objective. In many institutional settings, the target variable is not truth but a proxy: diagnosis codes, arrest records, clicks, purchases, grades, repayment behavior, incident reports, or human annotations. The model learns from the label system it is given, not from reality itself.

Supervised Learning as Conditional Estimation
Element	Meaning	Strength	Risk
Labeled examples	Inputs paired with target outputs.	Provides direct learning signal.	Labels can encode bias, noise, or flawed institutional history.
Loss function	Penalty for prediction error.	Gives learning an operational objective.	May optimize a proxy rather than the real goal.
Generalization	Performance on unseen cases.	Supports prediction beyond training data.	May fail under shift, rare cases, or subgroup differences.
Calibration	Match between predicted probability and observed frequency.	Supports risk-based decisions.	Accurate classifiers can still be poorly calibrated.
Thresholding	Turning scores into decisions.	Supports operational use.	Threshold choices can create unequal or hidden tradeoffs.

Note: Supervised learning is powerful because labels provide direct feedback, but those labels must be examined as social, institutional, and measurement artifacts.
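The calibration row in the table above can be made concrete with a small sketch. It assumes an equal-width binned estimate of calibration error (often called expected calibration error) and two hypothetical score vectors on hand-made labels; the function names and data are illustrative only.

```python
# Sketch: two hypothetical classifiers with equal accuracy but different
# calibration. Labels and scores are made up for illustration.

def accuracy(probs, labels, threshold=0.5):
    """Fraction of correct decisions after thresholding the scores."""
    preds = [1 if p >= threshold else 0 for p in probs]
    return sum(int(a == b) for a, b in zip(preds, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=5):
    """Weighted average gap between mean predicted probability and the
    observed positive rate within equal-width probability bins."""
    n = len(probs)
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        pos_rate = sum(y for _, y in members) / len(members)
        ece += (len(members) / n) * abs(mean_p - pos_rate)
    return ece


labels = [1, 1, 1, 0, 0, 0, 0, 1]
overconfident = [0.99, 0.99, 0.99, 0.99, 0.01, 0.01, 0.01, 0.01]
moderate = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

acc_over = accuracy(overconfident, labels)
acc_mod = accuracy(moderate, labels)
ece_over = expected_calibration_error(overconfident, labels)
ece_mod = expected_calibration_error(moderate, labels)
print(acc_over, acc_mod, round(ece_over, 3), round(ece_mod, 3))
```

Both score vectors produce the same thresholded decisions, so accuracy cannot distinguish them. Only the calibration estimate shows that one set of probabilities overstates its certainty.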

\[
Label \neq Truth
\]

Interpretation: A supervised model learns from recorded labels, which may reflect measurement limits, human judgment, institutional practice, or historical bias.

Unsupervised Learning: Structure, Compression, and Distribution Modeling

Unsupervised learning removes explicit targets. The system observes inputs \(x\) and must infer structure from the data itself. Instead of predicting known labels, the system may discover clusters, estimate density, compress information, identify latent variables, learn embeddings, reduce dimensionality, detect anomalies, or represent the data more efficiently.

A basic unsupervised problem can be written as modeling the input distribution:

\[
P(x)
\]

Interpretation: Unsupervised learning seeks structure in the distribution of observed inputs.

Latent-variable modeling introduces hidden structure:

\[
P(x,z)=P(x\mid z)P(z)
\]

Interpretation: A latent variable \(z\) represents hidden structure that helps explain observed data \(x\).

Clustering can be represented as assigning each observation to a latent group:

\[
c_i=\arg\min_{k}\|x_i-\mu_k\|^2
\]

Interpretation: In a k-means-style clustering problem, each point is assigned to the nearest cluster center \(\mu_k\).
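The assignment rule can be sketched in a few lines. The example below alternates the nearest-center assignment with a mean update, which is the usual k-means-style iteration; the points and initial centers are hypothetical, and the helper names are illustrative.

```python
# Sketch of k-means-style clustering on hypothetical 2-D points.

def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||^2."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(points, centers):
    """c_i = argmin_k ||x_i - mu_k||^2 for every point."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(x, centers[k]))
        for x in points
    ]


def update_centers(points, assignments, centers):
    """Move each center to the mean of its assigned points (keep it if empty)."""
    new_centers = []
    for k, old in enumerate(centers):
        members = [x for x, c in zip(points, assignments) if c == k]
        if not members:
            new_centers.append(old)
            continue
        new_centers.append(
            tuple(sum(m[d] for m in members) / len(members) for d in range(len(old)))
        )
    return new_centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0), (9.0, 10.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]  # hypothetical initial centers

for _ in range(5):
    assignments = assign_clusters(points, centers)
    centers = update_centers(points, assignments, centers)

print(assignments, centers)
```

The procedure returns a partition for any input and any number of centers. Whether the resulting groups mean anything is a separate question that the algorithm itself cannot answer.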

Unsupervised learning is critical because most real-world data is unlabeled. Sensor streams, documents, images, logs, transactions, biological data, climate records, infrastructure records, and user behavior often exist before labels or categories are defined. Representation learning, dimensionality reduction, anomaly detection, generative modeling, and self-supervised pretraining all depend on extracting structure from these data sources.

Unsupervised learning also has a deeper epistemic problem: without labels, it is not always clear what counts as a correct representation. Multiple structures may be consistent with the same data. A clustering model may find mathematically coherent groups that lack domain meaning. A dimensionality reduction method may reveal visual patterns that are artifacts of scale, preprocessing, or noise. Evaluation therefore requires domain interpretation, stability testing, downstream validation, and caution against treating discovered structure as natural fact.

Unsupervised Learning as Structure Discovery
Method	What It Finds	Potential Value	Interpretive Risk
Clustering	Groups of similar observations.	Segmentation, exploration, taxonomy, anomaly review.	Clusters may not correspond to meaningful categories.
Dimensionality reduction	Lower-dimensional representation of high-dimensional data.	Visualization, compression, feature learning.	Visual separation can reflect preprocessing or scale artifacts.
Density estimation	Probability structure of observed data.	Anomaly detection, generative modeling, simulation.	Low-density cases may be rare, novel, marginalized, or important.
Latent variable modeling	Hidden factors explaining observations.	Interpretable structure and generative representation.	Latent variables may not correspond to real-world causes.
Embedding learning	Vector representations of data.	Retrieval, clustering, transfer, downstream prediction.	Embedding similarity can encode bias or shallow association.

Note: Unsupervised structure should be treated as a hypothesis for interpretation, not as automatic discovery of natural categories.

\[
Discovered\ Pattern \neq Meaningful\ Category
\]

Interpretation: Unsupervised learning can reveal mathematical structure, but domain review is needed before that structure is treated as meaningful, fair, or actionable.

Reinforcement Learning: Sequential Decision-Making Under Reward

Reinforcement learning addresses a fundamentally different problem: learning through action. Instead of receiving labeled examples, an agent interacts with an environment over time. It observes a state, chooses an action, receives a reward, and transitions to a new state. The goal is to learn a policy that maximizes cumulative reward.

A reinforcement learning environment is commonly formalized as a Markov Decision Process:

\[
\mathcal{M}=(\mathcal{S},\mathcal{A},P,R,\gamma)
\]

Interpretation: An MDP consists of states \(\mathcal{S}\), actions \(\mathcal{A}\), transition dynamics \(P\), reward function \(R\), and discount factor \(\gamma\).

A policy defines action choice:

\[
\pi(a\mid s)
\]

Interpretation: A policy gives the probability of taking action \(a\) in state \(s\).

The objective is to maximize expected discounted return:

\[
J(\pi)
=
\mathbb{E}_{\pi}
\left[
\sum_{t=0}^{\infty}
\gamma^t r_t
\right]
\]

Interpretation: Reinforcement learning optimizes the expected sum of future rewards, discounted by \(\gamma\).
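For a single finite trajectory, the discounted sum inside the expectation can be computed directly. The sketch below assumes a finite reward list and hypothetical reward values; it shows how the same reward is worth less the later it arrives.

```python
# Sketch: discounted return of one finite reward sequence.

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backward from the last reward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


immediate = discounted_return([10.0, 0.0, 0.0, 0.0], gamma=0.9)
delayed = discounted_return([0.0, 0.0, 0.0, 10.0], gamma=0.9)
print(immediate, round(delayed, 4))
```

The objective \(J(\pi)\) averages this quantity over the trajectories a policy generates, so the choice of \(\gamma\) controls how strongly the agent favors early reward over late reward.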

Reinforcement learning introduces challenges that do not appear in the same way in supervised or unsupervised learning. Rewards may be delayed. The system must balance exploration and exploitation. Actions influence future observations. Credit assignment is difficult because long-term outcomes may depend on earlier decisions. The environment may be stochastic, partially observed, nonstationary, or affected by other agents.

These properties make reinforcement learning powerful for control, robotics, games, resource allocation, scheduling, autonomous systems, simulation, recommender systems, and dynamic decision environments. They also make reinforcement learning governance especially difficult. A poorly specified reward function can produce unintended behavior. A policy may exploit loopholes in the environment. An agent may learn strategies that are efficient but unsafe, unfair, brittle, or misaligned with human intent.
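The interaction loop, delayed reward, and exploration-exploitation balance described above can be sketched with tabular Q-learning, a standard algorithm that this section does not define formally. The five-state chain environment, the hyperparameters, and all names below are assumptions made for illustration.

```python
import random

# Sketch: tabular Q-learning on a hypothetical 5-state chain.
N_STATES = 5      # states 0..4; reaching state 4 ends the episode with reward 1
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state, action):
    """Deterministic chain dynamics: returns (next_state, reward, done)."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q_row, epsilon, rng):
    """Epsilon-greedy choice; ties are broken at random."""
    if rng.random() < epsilon or q_row[0] == q_row[1]:
        return 0 if rng.random() < 0.5 else 1
    return 0 if q_row[0] > q_row[1] else 1


def q_learning(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(100):  # cap on episode length
            action = choose_action(q[state], epsilon, rng)
            next_state, reward, done = step(state, action)
            # A terminal transition has no future value to bootstrap from.
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
            if done:
                break
    return q


q_table = q_learning()
# Greedy action in each non-terminal state (1 means "move right").
greedy_policy = [0 if row[0] > row[1] else 1 for row in q_table[:-1]]
print(greedy_policy)
```

The only reward sits at the end of the chain, so early states learn their value indirectly, through repeated bootstrapping from later states. This is credit assignment in its simplest form, and it also shows why the agent pursues whatever the reward function pays for, regardless of what the designer meant.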

Reinforcement Learning as Sequential Decision-Making
RL Element	Meaning	System Function	Governance Risk
State	Representation of the current situation.	Defines what the agent observes.	Missing or distorted state information can produce unsafe actions.
Action	Choice available to the agent.	Changes the environment or system trajectory.	Action space may include unsafe or poorly constrained options.
Reward	Feedback signal used for optimization.	Defines what behavior is reinforced.	Reward misspecification can create harmful optimization.
Policy	Rule for choosing actions.	Operational behavior of the agent.	Policy may exploit unintended loopholes.
Environment	System with which the agent interacts.	Generates transitions and rewards.	Simulation may fail to represent real-world complexity.

Note: Reinforcement learning is powerful because it learns from consequences, but consequence-driven optimization requires careful reward design, constraints, monitoring, and safety review.

\[
Reward \neq Human\ Intent
\]

Interpretation: A reward function is a formal proxy for desired behavior. If the proxy is incomplete, the agent may optimize behavior that satisfies the reward while violating the purpose.

Comparing the Three Paradigms

The three paradigms can be compared by the type of information available during learning. A supervised model asks: “What output should correspond to this input?” An unsupervised model asks: “What structure is present in these observations?” A reinforcement learning agent asks: “What action should I take now to improve future reward?” These questions are related, but they are not interchangeable.

Comparison of Major Machine Learning Paradigms
Paradigm	Observed Signal	Typical Objective	Feedback Structure	Core Risk
Supervised learning	Labeled pairs \((x,y)\).	Predict \(y\) from \(x\).	Direct and immediate.	Learning biased or noisy labels.
Unsupervised learning	Inputs \(x\).	Discover structure in \(P(x)\).	Implicit and ambiguous.	Finding structure without meaning.
Reinforcement learning	States, actions, rewards.	Maximize cumulative reward.	Delayed and action-dependent.	Optimizing the wrong reward.
Self-supervised learning	Inputs transformed into prediction tasks.	Learn representations from data structure.	Automatically generated from data.	Scaling hidden data problems into foundation models.
Hybrid systems	Labels, unlabeled data, rewards, preferences, retrieval, interaction.	Combine prediction, representation, alignment, and action.	Multi-stage and layered.	Opaque provenance across training stages.

Note: These paradigms are not mutually exclusive in modern systems. Many advanced AI pipelines combine supervised fine-tuning, unsupervised or self-supervised representation learning, and reinforcement-style optimization.

The comparison matters because evaluation must fit the learning structure. A supervised model can be tested against held-out labels. An unsupervised model may require cluster stability, reconstruction quality, downstream task performance, or domain interpretation. A reinforcement learning policy must be evaluated under sequential consequences, safety constraints, environment shift, exploration risk, and long-run reward behavior.

\[
Different\ Signal \Rightarrow Different\ Evidence
\]

Interpretation: Supervised, unsupervised, and reinforcement systems require different forms of evaluation because they learn from different feedback structures.

Generalization Across Learning Regimes

Generalization is central to all learning paradigms, but it appears differently in each.

In supervised learning, generalization means performing well on unseen examples drawn from the same or a related distribution. The primary concern is whether the model has learned durable relationships rather than memorizing training examples or exploiting spurious correlations.

In unsupervised learning, generalization is more subtle. A learned representation may generalize if it captures meaningful latent structure, supports downstream tasks, remains stable across samples, or reveals interpretable patterns. But without labels, evaluation often requires indirect evidence.

In reinforcement learning, generalization means that a learned policy performs well in new states, changed environments, altered dynamics, or previously unseen scenarios. This is especially difficult because the agent’s actions influence future experience.

A supervised generalization gap can be written as:

\[
\mathrm{Gap}
=
R_{\mathrm{test}}(\theta)-R_{\mathrm{train}}(\theta)
\]

Interpretation: In supervised settings, the generalization gap compares test risk with training risk.
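A small sketch can make the gap tangible. It assumes a 1-nearest-neighbor classifier, chosen only because it memorizes its training set, and hypothetical one-dimensional data in which two training labels are deliberately noisy; risk is measured as the 0-1 error rate.

```python
# Sketch: generalization gap of a memorizing 1-nearest-neighbor classifier.
# Underlying rule (assumed): label is 1 when x > 0.5. Two training labels
# (at x = 0.35 and x = 0.8) are deliberately flipped to simulate label noise.

train = [(0.1, 0), (0.2, 0), (0.35, 1), (0.4, 0),
         (0.6, 1), (0.7, 1), (0.8, 0), (0.9, 1)]
test = [(0.12, 0), (0.3, 0), (0.45, 0), (0.55, 1), (0.78, 1), (0.95, 1)]


def predict_1nn(x, data):
    """Return the label of the training point closest to x."""
    _, label = min(data, key=lambda pair: abs(pair[0] - x))
    return label


def risk(examples, data):
    """Average 0-1 loss of the 1-NN rule built on `data`."""
    errors = sum(int(predict_1nn(x, data) != y) for x, y in examples)
    return errors / len(examples)


train_risk = risk(train, train)
test_risk = risk(test, train)
gap = test_risk - train_risk
print(train_risk, round(test_risk, 3), round(gap, 3))
```

Training risk is zero because every training point is its own nearest neighbor, so the memorized noisy labels show up only as test error. A low training risk on its own says nothing about whether the model learned the rule or the noise.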

For reinforcement learning, the analogous problem is policy robustness:

\[
J_{\mathrm{deploy}}(\pi)-J_{\mathrm{train}}(\pi)
\]

Interpretation: A reinforcement policy may perform differently when deployed in an environment that differs from training or simulation.

Across all paradigms, the central question is the same: does learning transfer beyond the conditions under which it occurred?

Generalization Across Learning Regimes
Paradigm	Generalization Question	Evaluation Evidence	Common Failure
Supervised learning	Does prediction work on unseen examples?	Held-out test sets, cross-validation, calibration, subgroup diagnostics.	Overfitting, shortcut learning, label bias, distribution shift.
Unsupervised learning	Does discovered structure remain meaningful and stable?	Stability tests, reconstruction, downstream performance, domain review.	Misleading clusters, unstable embeddings, artifact-driven structure.
Reinforcement learning	Does the policy work under new states and dynamics?	Simulation stress tests, off-policy evaluation, safety constraints, deployment monitoring.	Reward hacking, simulation gap, unsafe exploration, brittle policy behavior.
Self-supervised learning	Do learned representations transfer to useful tasks?	Downstream benchmarks, representation audits, probing, robustness tests.	Representations inherit hidden corpus bias or spurious association.

Note: Generalization is always a claim about transfer beyond the training conditions. The evidence required depends on the paradigm.

Self-Supervised, Semi-Supervised, and Hybrid Learning

Modern AI systems increasingly blur the boundaries among learning paradigms. Self-supervised learning creates training signals from the structure of the data itself. A model may predict masked tokens, reconstruct missing image patches, contrast related and unrelated representations, predict future frames, or align paired modalities. This allows large-scale learning from unlabeled data while still using supervised-style objectives.

A contrastive objective can be written as:

\[
\mathcal{L}_{\mathrm{contrast}}
=
-\log
\frac{
\exp(\mathrm{sim}(z_i,z_i^+)/\tau)
}{
\sum_j \exp(\mathrm{sim}(z_i,z_j)/\tau)
}
\]

Interpretation: Contrastive learning pulls related representations together and pushes unrelated representations apart.
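The contrastive objective can be evaluated by hand on a few vectors. The sketch below assumes cosine similarity for \(\mathrm{sim}\), a temperature of 0.5, and a denominator that sums over the positive plus the negatives (formulations differ on which candidates are included); the embeddings are hypothetical.

```python
import math

# Sketch: contrastive loss for one anchor, one positive, and a few negatives.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contrastive_loss(anchor, positive, negatives, tau=0.5):
    """-log( exp(sim(z_i, z_i+)/tau) / sum_j exp(sim(z_i, z_j)/tau) ).

    Assumed convention: the sum over j covers the positive and all negatives.
    """
    candidates = [positive] + list(negatives)
    logits = [cosine_similarity(anchor, z) / tau for z in candidates]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_denominator - logits[0]


anchor = (1.0, 0.0)
# Case 1: the positive lies close to the anchor, negatives point elsewhere.
aligned = contrastive_loss(anchor, (0.9, 0.1), [(0.0, 1.0), (-1.0, 0.0)])
# Case 2: the positive is orthogonal to the anchor and a negative is close.
misaligned = contrastive_loss(anchor, (0.0, 1.0), [(0.9, 0.1), (-1.0, 0.0)])
print(round(aligned, 4), round(misaligned, 4))
```

The loss is small when the positive is the most similar candidate and grows when a negative is closer to the anchor. What counts as a positive pair is a design decision, and it determines which similarities the representation ends up encoding.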

Semi-supervised learning combines a small amount of labeled data with a larger amount of unlabeled data. Transfer learning adapts representations learned in one domain to another. Reinforcement learning from human feedback typically combines a pretrained model, supervised fine-tuning, human preference data, reward modeling, and policy optimization.

These hybrid approaches reflect a major shift in AI systems. Rather than treating paradigms as isolated boxes, modern systems combine signals: labels, unlabeled structure, human feedback, retrieval evidence, interaction data, simulation, and rewards. This makes systems more powerful, but also more difficult to audit. A model’s behavior may reflect multiple training stages, each with different data sources, objectives, and alignment risks.

Hybrid Learning Systems
Approach	Learning Signal	Value	Governance Concern
Self-supervised learning	Prediction tasks generated from unlabeled data.	Scales representation learning across massive corpora.	Hidden data bias and weak provenance can scale with the model.
Semi-supervised learning	Small labeled set plus larger unlabeled set.	Reduces label burden and improves coverage.	Unlabeled structure may reinforce label bias or class assumptions.
Transfer learning	Reuse of representations trained elsewhere.	Improves efficiency and performance in new domains.	Source-domain assumptions may not match target-domain needs.
Human preference learning	Rankings or preferences from evaluators.	Aligns outputs with human judgments.	Preferences reflect evaluator norms, incentives, and blind spots.
Reinforcement learning from human feedback	Supervised fine-tuning of a pretrained model, preference-based reward modeling, and policy optimization.	Improves interactive system behavior.	Multi-stage provenance and reward proxy risks become harder to audit.

Note: Hybrid systems should document each training stage, data source, objective, evaluator population, reward model, and evaluation suite.

\[
Hybrid\ Capability \Rightarrow Hybrid\ Accountability
\]

Interpretation: When systems combine learning signals, governance must trace how each signal shaped model behavior.

Learning Paradigms in Complex Systems

In real-world systems, learning paradigms are embedded within broader environments. Data is generated by human behavior, institutional processes, platforms, sensors, markets, policies, ecosystems, and infrastructures. Learning systems then act on that data and may change the environment that produces future data.

A supervised recommendation model may alter user behavior, which changes future labels. An unsupervised clustering system may reshape institutional categories, which changes how people are treated or measured. A reinforcement learning agent may change system dynamics through action, making future states depend on past model behavior. These feedback loops complicate evaluation because the learning system is not merely observing the world. It is participating in it.

This is why learning paradigms should be understood as components of adaptive systems. The technical learning objective is only one layer. The broader system includes data generation, representation, training, inference, deployment, monitoring, user response, institutional incentive, and governance. The same model can behave differently when embedded in different social or operational environments.

Learning Paradigms Inside Complex Systems
System Layer	Function	Paradigm Relevance	Failure Mode
Data generation	Creates observations, labels, interactions, or rewards.	Defines what the system can learn from.	Data reflects measurement gaps, bias, or historical policy.
Representation	Transforms raw data into features or embeddings.	Shapes supervised, unsupervised, and RL state spaces.	Important context is lost or distorted.
Objective design	Defines loss, structure, reward, or preference target.	Directs learning behavior.	Proxy objective diverges from real purpose.
Deployment	Places model outputs or actions into workflows.	Connects learning to consequence.	Predictions or actions alter future data.
Monitoring	Tracks drift, failure, reward behavior, or feedback loops.	Maintains evidence after release.	System degradation or gaming remains invisible.
Governance	Defines review, accountability, limits, and correction.	Matches oversight to learning risk.	Responsibility diffuses behind technical categories.

Note: Learning paradigms become consequential when they are embedded in systems that classify, rank, allocate, recommend, control, or act.

\[
Model\ Output \rightarrow Environment\ Change \rightarrow Future\ Data
\]

Interpretation: Deployed learning systems can reshape the data-generating process, creating feedback loops that require monitoring and governance.

Governance, Safety, and Decision-Making Implications

Each learning paradigm creates distinct governance challenges.

Supervised systems require attention to labels, target validity, data provenance, subgroup performance, and historical bias. If the labels encode flawed institutional decisions, the model may reproduce those decisions at scale.

Unsupervised systems require attention to interpretation. Clusters, latent dimensions, anomaly scores, and embeddings may appear objective, but their meaning depends on method, preprocessing, distance metrics, scale, and domain context. Governance must prevent ambiguous structure from being treated as natural fact.

Reinforcement learning systems require careful reward design, simulation validation, safety constraints, exploration limits, and monitoring. Because reinforcement learning agents optimize behavior over time, they may discover strategies that technically maximize reward while violating the intended purpose of the system.

A responsible AI governance process must therefore ask: What signal was used? What objective was optimized? What feedback shaped the system? What does the model know directly, infer indirectly, or learn through action? What failure modes are specific to the learning paradigm? What human oversight is required?

Governance Questions by Learning Paradigm
Paradigm	Governance Question	Evidence Needed	Risk if Ignored
Supervised learning	Are labels valid, fair, and fit for purpose?	Label documentation, target rationale, subgroup diagnostics, leakage review.	Flawed labels become automated predictions.
Unsupervised learning	What does discovered structure mean?	Stability tests, domain interpretation, preprocessing records, downstream validation.	Statistical artifacts become institutional categories.
Reinforcement learning	Does reward optimization produce intended behavior?	Reward specification, simulation tests, safety constraints, policy audits.	Agent exploits reward loopholes or unsafe strategies.
Self-supervised learning	What data shaped the learned representation?	Corpus provenance, filtering records, representation audits, bias tests.	Hidden data patterns become foundation behavior.
Hybrid systems	How did multiple training stages interact?	Training-stage documentation, preference data, reward model records, evaluation logs.	System behavior cannot be traced to its learning history.

Note: Governance should not treat all AI systems as if they learn the same way. Oversight should follow the learning signal, objective, feedback path, and deployment consequence.

\[
\mathrm{Learning\ Method} \Rightarrow \mathrm{Failure\ Mode} \Rightarrow \mathrm{Governance\ Requirement}
\]

Interpretation: Different paradigms produce different risks, so they require different evaluation and accountability structures.

Mathematical Lens: Labels, Latent Structure, Reward, and Policy

A mathematics-first view begins with supervised learning:

\[
D_{\mathrm{sup}}=\{(x_i,y_i)\}_{i=1}^{n}
\]

Interpretation: Supervised learning uses labeled examples.

The supervised objective minimizes prediction loss:

\[
\theta^*
=
\arg\min_{\theta}
\frac{1}{n}
\sum_{i=1}^{n}
\ell(y_i,f_\theta(x_i))
\]

Interpretation: Parameters are chosen to reduce error between predictions and labels.
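As a minimal sketch of this objective, the following Python snippet computes the average loss for a one-parameter model and selects the best candidate from a grid. The linear model, the squared loss, the toy data, and the grid search are illustrative assumptions rather than part of the formulation above; practical systems typically rely on gradient-based optimizers.

```python
def predict(theta, x):
    """One-parameter linear model f_theta(x) = theta * x (illustrative choice)."""
    return theta * x


def squared_loss(y, y_hat):
    """Loss between a label and a prediction."""
    return (y - y_hat) ** 2


def empirical_risk(theta, data):
    """Average loss over the labeled pairs (x_i, y_i)."""
    return sum(squared_loss(y, predict(theta, x)) for x, y in data) / len(data)


# Toy labeled dataset generated by y = 2x.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

# Crude arg-min over a grid of candidate parameters: 0.0, 0.1, ..., 4.0.
candidates = [i / 10 for i in range(0, 41)]
theta_star = min(candidates, key=lambda theta: empirical_risk(theta, data))

print(theta_star, empirical_risk(theta_star, data))
```

Because the labels were generated with slope 2, the grid search recovers that value. If the labels were flawed, the same procedure would fit the flaw just as faithfully.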

Unsupervised learning models the distribution of inputs:

\[
x_i\sim P(X)
\]

Interpretation: Unsupervised learning observes inputs without target labels.

Latent-variable modeling introduces hidden structure:

\[
P(x)=\int P(x\mid z)P(z)\,dz
\]

Interpretation: Observed data can be explained by integrating over latent variables \(z\).
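When the latent variable is discrete, the integral becomes a sum over its possible values. The following Python sketch evaluates that sum for a two-component Gaussian mixture. The mixture weights, means, and standard deviations are hypothetical values chosen only for illustration.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution."""
    coefficient = 1.0 / (std * math.sqrt(2.0 * math.pi))
    return coefficient * math.exp(-0.5 * ((x - mean) / std) ** 2)


def marginal_density(x, weights, means, stds):
    """P(x) = sum_k P(z=k) * P(x | z=k) for a discrete latent variable z."""
    return sum(
        weight * gaussian_pdf(x, mean, std)
        for weight, mean, std in zip(weights, means, stds)
    )


# Hypothetical two-component mixture: P(z=0) = P(z=1) = 0.5.
weights = [0.5, 0.5]
means = [-1.0, 1.0]
stds = [1.0, 1.0]

density_at_zero = marginal_density(0.0, weights, means, stds)
print(density_at_zero)
```

The observed density at any point blends the component densities, so the same observation can be explained by more than one latent value. That is why latent structure needs interpretation and cannot simply be read off the data.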

Clustering assigns observations to groups:

\[
c_i=\arg\min_k \|x_i-\mu_k\|^2
\]

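Interpretation: Each observation is assigned to the nearest cluster center. The formula above uses squared Euclidean distance; other distance metrics can be substituted, and the choice affects which groups are found.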
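The assignment rule can be sketched in a few lines of Python. The snippet below also includes the center-update step that k-means alternates with assignment; that step is standard for k-means but is not stated in the formula above. The points and starting centers are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def assign_clusters(points, centers):
    """c_i = argmin_k ||x_i - mu_k||^2 for every observation."""
    return [
        min(range(len(centers)), key=lambda k: squared_distance(point, centers[k]))
        for point in points
    ]


def update_centers(points, assignments, centers):
    """Move each center to the mean of its assigned points (k-means update step)."""
    new_centers = []
    for k, old_center in enumerate(centers):
        members = [p for p, c in zip(points, assignments) if c == k]
        if not members:
            # Keep the previous center when no point was assigned to it.
            new_centers.append(list(old_center))
            continue
        new_centers.append(
            [sum(p[d] for p in members) / len(members) for d in range(len(old_center))]
        )
    return new_centers


# Hypothetical observations and starting centers.
points = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]]
centers = [[1.0, 1.0], [4.0, 4.0]]

assignments = assign_clusters(points, centers)
centers = update_centers(points, assignments, centers)
print(assignments, centers)
```

Changing the starting centers, the feature scaling, or the distance function can change the resulting groups, which is one reason stability testing appears in the governance tables.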

Reinforcement learning models sequential interaction:

\[
s_t,a_t,r_t,s_{t+1}
\]

Interpretation: An agent observes state \(s_t\), takes action \(a_t\), receives reward \(r_t\), and transitions to \(s_{t+1}\).

The reinforcement objective maximizes expected return:

\[
J(\pi)
=
\mathbb{E}_{\pi}
\left[
\sum_{t=0}^{\infty}
\gamma^t r_t
\right]
\]

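Interpretation: The policy \(\pi\) is optimized for long-run discounted reward. The discount factor \(\gamma\in[0,1)\) weights a reward received \(t\) steps ahead by \(\gamma^t\), which keeps the infinite sum finite when rewards are bounded.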
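A short Python sketch shows how the discount factor changes the value of a delayed reward. It truncates the infinite sum to a finite reward sequence, which is a simplifying assumption, and the reward values are hypothetical.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite reward sequence."""
    return sum((gamma ** t) * reward for t, reward in enumerate(rewards))


# Hypothetical episode: the only reward arrives two steps in the future.
rewards = [0.0, 0.0, 1.0]

patient = discounted_return(rewards, gamma=0.9)
myopic = discounted_return(rewards, gamma=0.1)

print(patient, myopic)
```

With a discount factor of 0.9 the delayed reward is worth about 0.81 today; with 0.1 it is worth about 0.01. The choice of discount factor is therefore part of the objective specification and belongs in reward documentation.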

The value function estimates future return from a state:

\[
V^\pi(s)
=
\mathbb{E}_{\pi}
\left[
\sum_{t=0}^{\infty}
\gamma^t r_t
\mid s_0=s
\right]
\]

Interpretation: \(V^\pi(s)\) is the expected discounted return when the agent starts in state \(s\) and follows policy \(\pi\) thereafter. Learning algorithms estimate this quantity from experience.

The action-value function estimates return from a state-action pair:

\[
Q^\pi(s,a)
=
\mathbb{E}_{\pi}
\left[
\sum_{t=0}^{\infty}
\gamma^t r_t
\mid s_0=s,a_0=a
\right]
\]

Interpretation: \(Q^\pi(s,a)\) is the expected discounted return from taking action \(a\) in state \(s\) and then following policy \(\pi\).
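The two functions are linked by standard identities, stated here for reference with the same reward indexing as above and assuming a discrete action set:

\[
V^\pi(s)=\sum_{a}\pi(a\mid s)\,Q^\pi(s,a)
\]

\[
Q^\pi(s,a)=\mathbb{E}\left[r_0+\gamma V^\pi(s_1)\mid s_0=s,a_0=a\right]
\]

Interpretation: The state value averages action values under the policy, and each action value equals the immediate reward plus the discounted value of the next state.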

A Q-learning update can be written as:

\[
Q(s_t,a_t)
\leftarrow
Q(s_t,a_t)
+
\alpha
\left[
r_t+\gamma\max_a Q(s_{t+1},a)-Q(s_t,a_t)
\right]
\]

Interpretation: Q-learning moves each action value toward a target built from the observed reward and the best estimated value of the next state. The learning rate \(\alpha\) sets the size of that step.
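As a minimal worked sketch, the update can be written as a single Python function and applied twice to the same transition. The numerical values (step size 0.3, discount 0.9, reward 1.0, and next-state values of zero) are hypothetical.

```python
def q_update(q_sa, reward, next_q_values, alpha, gamma):
    """Return the updated Q(s_t, a_t) after one observed transition."""
    target = reward + gamma * max(next_q_values)
    return q_sa + alpha * (target - q_sa)


# First visit: current estimate 0.0, reward 1.0, next-state values all 0.0.
first = q_update(0.0, reward=1.0, next_q_values=[0.0, 0.0], alpha=0.3, gamma=0.9)

# Second visit to the same transition: the estimate moves further toward the target.
second = q_update(first, reward=1.0, next_q_values=[0.0, 0.0], alpha=0.3, gamma=0.9)

print(first, second)
```

The estimate rises from 0.0 to about 0.3 and then to about 0.51, approaching the target of 1.0 without reaching it in a single step. Repeated experience, not a single observation, shapes the learned values.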

A governance-aware learning-risk score can combine signal quality, objective alignment, feedback delay, and downstream consequence:

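\[
\mathrm{Risk}_i =
w_S(1-S_i)
+
w_O(1-O_i)
+
w_F F_i
+
w_C C_i
\]

Interpretation: Learning risk for system \(i\) increases when signal quality \(S_i\) is weak, objective alignment \(O_i\) is poor, feedback delay \(F_i\) is high, or downstream consequence \(C_i\) is severe. The nonnegative weights \(w_S\), \(w_O\), \(w_F\), and \(w_C\) are governance choices that set the relative importance of each term; they are unrelated to the learning rate \(\alpha\) and discount factor \(\gamma\) in the reinforcement learning equations. The form \(1-S_i\) presumes that signal quality and objective alignment are scaled to \([0,1]\).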
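The score can be sketched as a small Python function. The equal weights, the requirement that every input lie in \([0,1]\), and the two example systems are assumptions made for illustration; the article does not prescribe particular weights or scales.

```python
def learning_risk(
    signal_quality,
    objective_alignment,
    feedback_delay,
    consequence,
    weights=(0.25, 0.25, 0.25, 0.25),
):
    """Weighted sum of the four risk terms; all inputs assumed scaled to [0, 1]."""
    inputs = (signal_quality, objective_alignment, feedback_delay, consequence)
    if any(not 0.0 <= value <= 1.0 for value in inputs):
        raise ValueError("all inputs are assumed to lie in [0, 1]")
    w_signal, w_objective, w_delay, w_consequence = weights
    return (
        w_signal * (1.0 - signal_quality)
        + w_objective * (1.0 - objective_alignment)
        + w_delay * feedback_delay
        + w_consequence * consequence
    )


# Hypothetical systems: strong signal and prompt feedback versus the opposite.
well_governed = learning_risk(0.9, 0.8, 0.2, 0.5)
poorly_governed = learning_risk(0.4, 0.5, 0.8, 0.9)

print(well_governed, poorly_governed)
```

A score like this is only a triage aid. The weights encode a judgment about which failure sources matter most, so they should be documented alongside the result.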

This mathematical lens shows that learning paradigms are not just names. They are formal structures defining what information is available, what objective is optimized, and how feedback enters the system.

Variables and System Interpretation

Key Symbols for Supervised, Unsupervised, and Reinforcement Learning
Symbol or Term	Meaning	Typical Paradigm	System Interpretation
\(x\)	Input or observation	All paradigms	Information available to the model or agent.
\(y\)	Target label or output	Supervised learning	Observed answer used for training.
\(f_\theta\)	Parameterized model	Supervised / unsupervised	Maps inputs into predictions, embeddings, or representations.
\(\ell\)	Loss function	Supervised / self-supervised	Defines prediction, reconstruction, or representation error.
\(z\)	Latent variable	Unsupervised learning	Hidden structure inferred from data.
\(\mu_k\)	Cluster center	Unsupervised learning	Representative point for cluster \(k\).
\(c_i\)	Cluster assignment	Unsupervised learning	Latent group assigned to observation \(i\).
\(s\)	State	Reinforcement learning	Current situation observed by the agent.
\(a\)	Action	Reinforcement learning	Decision taken by the agent.
\(r\)	Reward	Reinforcement learning	Feedback signal produced by the environment.
\(\pi(a\mid s)\)	Policy	Reinforcement learning	Action-selection rule.
\(V^\pi(s)\)	Value function	Reinforcement learning	Expected return from a state under policy \(\pi\).
\(Q^\pi(s,a)\)	Action-value function	Reinforcement learning	Expected return from a state-action pair.
\(\gamma\)	Discount factor	Reinforcement learning	Controls how much future rewards matter.
\(\alpha\)	Learning rate	Reinforcement learning	Step size of each Q-learning update.

Note: The same AI system may combine multiple learning paradigms. For example, a modern language model may use self-supervised pretraining, supervised fine-tuning, reward modeling, and reinforcement-style optimization.

Worked Example: Three Forms of Learning Signal

Consider three systems trained on related data.

A supervised classifier receives examples with labels:

\[
(x_i,y_i)
\]

Interpretation: The system learns from known input-output pairs.

An unsupervised model receives only observations:

\[
x_i
\]

Interpretation: The system must infer structure without explicit labels.

A reinforcement learning agent receives experience tuples:

\[
(s_t,a_t,r_t,s_{t+1})
\]

Interpretation: The system learns through action, reward, and state transition.

The supervised model asks: “Given this input, what output should I predict?” The unsupervised model asks: “What structure exists in this data?” The reinforcement agent asks: “What action should I take to improve future reward?” These are not merely different techniques. They are different ways of organizing learning itself.
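The three signal forms can also be written as explicit record types, which makes the information available to each system visible in code. The class names, field types, and the `paradigm_of` helper in this Python sketch are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Union


@dataclass(frozen=True)
class LabeledPair:
    """Supervised signal: an input with its target label."""
    x: Sequence[float]
    y: int


@dataclass(frozen=True)
class Observation:
    """Unsupervised signal: an input with no target."""
    x: Sequence[float]


@dataclass(frozen=True)
class Transition:
    """Reinforcement signal: state, action, reward, and next state."""
    state: int
    action: int
    reward: float
    next_state: int


LearningSignal = Union[LabeledPair, Observation, Transition]


def paradigm_of(signal: LearningSignal) -> str:
    """Name the learning paradigm implied by the structure of a signal."""
    if isinstance(signal, LabeledPair):
        return "supervised"
    if isinstance(signal, Observation):
        return "unsupervised"
    if isinstance(signal, Transition):
        return "reinforcement"
    raise TypeError(f"unrecognized signal type: {type(signal).__name__}")


examples = [
    LabeledPair(x=(0.2, 1.4), y=1),
    Observation(x=(0.2, 1.4)),
    Transition(state=0, action=1, reward=0.0, next_state=1),
]
print([paradigm_of(signal) for signal in examples])
```

The same feature vector appears in the first two records; only the presence of a label separates them. The transition record has no label at all, only the consequence of an action.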

Governance-Ready Review of Learning Signals
Learning Signal	What It Provides	Primary Question	Review Requirement
Labeled pair	Input and target output.	Is the label valid and fit for purpose?	Label audit, target rationale, subgroup evaluation.
Unlabeled observation	Data structure without explicit target.	Does discovered structure have domain meaning?	Stability testing, domain interpretation, downstream validation.
Rewarded transition	Action, consequence, and next state.	Does reward optimize intended behavior?	Reward review, simulation validation, safety constraints.
Human preference	Judgment about output quality or preference.	Whose preferences are being encoded?	Evaluator documentation, bias review, disagreement analysis.
Retrieved evidence	External context used to guide output.	Is evidence relevant, current, and accurately represented?	Source provenance, retrieval evaluation, claim-source review.

Note: The form of feedback determines what the system can learn and what kind of audit evidence is needed.

Computational Modeling

Computational modeling makes learning paradigms easier to compare. A supervised workflow can train a classifier and evaluate predictive performance. An unsupervised workflow can cluster points and examine latent structure. A reinforcement learning workflow can simulate a small environment and update action values over time. A grouped diagnostics workflow can compare error rates across signal types, conditions, or feedback structures. A SQL metadata schema can document the learning paradigm, dataset, signal type, objective, evaluation method, and governance review.
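The metadata schema mentioned above can be sketched with Python's built-in `sqlite3` module. The table name, column names, allowed paradigm values, and the example row are hypothetical and do not reproduce any schema from the repository.

```python
import sqlite3

# Hypothetical schema covering the fields named in the text.
SCHEMA = """
CREATE TABLE learning_system (
    system_id          TEXT PRIMARY KEY,
    paradigm           TEXT NOT NULL CHECK (
        paradigm IN ('supervised', 'unsupervised', 'reinforcement',
                     'self-supervised', 'hybrid')
    ),
    dataset            TEXT NOT NULL,
    signal_type        TEXT NOT NULL,
    objective          TEXT NOT NULL,
    evaluation_method  TEXT NOT NULL,
    governance_review  TEXT
)
"""

connection = sqlite3.connect(":memory:")
connection.execute(SCHEMA)

# Record one illustrative system.
connection.execute(
    "INSERT INTO learning_system VALUES (?, ?, ?, ?, ?, ?, ?)",
    (
        "demo-001",
        "supervised",
        "synthetic classification set",
        "labeled input-output pairs",
        "minimize prediction loss",
        "held-out accuracy",
        "label audit pending",
    ),
)

rows = connection.execute(
    "SELECT paradigm, COUNT(*) FROM learning_system GROUP BY paradigm"
).fetchall()
print(rows)
```

The `CHECK` constraint rejects records whose paradigm is not one of the listed values, so a system cannot be registered without declaring how it learned.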

The selected examples below focus on supervised classification, unsupervised clustering, Q-learning intuition, and grouped diagnostics because they are foundational, readable, and directly reusable. The GitHub repository extends the same logic into advanced Jupyter notebooks, self-supervised learning simulations, representation learning examples, reward-modeling notes, policy evaluation scaffolds, SQL metadata, and governance documentation.

Computational Artifacts for Learning-Paradigm Governance
Artifact	Purpose	Governance Value
Supervised classifier report	Measures prediction performance from labeled examples.	Supports label-quality and predictive-validity review.
Clustering output	Identifies latent groups or structure.	Supports interpretation and stability review.
Dimensionality-reduction output	Compresses high-dimensional data into lower-dimensional structure.	Supports representation inspection and artifact detection.
Q-learning table	Shows reward-driven action-value updates.	Supports intuition about delayed feedback and policy learning.
Grouped diagnostics	Compares performance or failure by signal type and condition.	Reveals paradigm-specific risk patterns.
Governance memo	Summarizes signal, objective, feedback, risks, and review needs.	Supports auditability and responsible deployment.

Note: Learning-paradigm workflows should make signal structure visible, not only produce model outputs.

Python Workflow: Supervised Classification, Clustering, and Q-Learning Intuition

Python is useful for demonstrating learning paradigms side by side. The following example creates a synthetic dataset, trains a supervised classifier, runs unsupervised clustering, demonstrates a small Q-learning update table, and writes governance-ready outputs.

"""
Supervised, Unsupervised, and Reinforcement Learning
Python workflow: classification, clustering, and Q-learning intuition.

This educational workflow demonstrates:
1. supervised classification from labeled examples
2. unsupervised clustering from unlabeled structure
3. Q-learning updates from reward-bearing experience
4. governance-ready output records

It does not require private data.
"""

from __future__ import annotations

from pathlib import Path

import numpy as np
import pandas as pd

from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, adjusted_rand_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


RANDOM_SEED = 42
rng = np.random.default_rng(RANDOM_SEED)

OUTPUT_DIR = Path("outputs")
OUTPUT_DIR.mkdir(exist_ok=True)


def create_supervised_dataset() -> tuple[np.ndarray, np.ndarray]:
    """Create a synthetic dataset with labels."""
    x, y = make_classification(
        n_samples=1500,
        n_features=6,
        n_informative=4,
        n_redundant=1,
        n_clusters_per_class=2,
        random_state=RANDOM_SEED,
    )
    return x, y


def run_supervised_learning(x: np.ndarray, y: np.ndarray) -> pd.DataFrame:
    """Train a supervised classifier and evaluate accuracy."""
    x_train, x_test, y_train, y_test = train_test_split(
        x,
        y,
        test_size=0.30,
        stratify=y,
        random_state=RANDOM_SEED,
    )

    supervised_model = Pipeline(
        steps=[
            ("scale", StandardScaler()),
            (
                "classifier",
                LogisticRegression(max_iter=1000, random_state=RANDOM_SEED),
            ),
        ]
    )

    supervised_model.fit(x_train, y_train)
    prediction = supervised_model.predict(x_test)

    return pd.DataFrame(
        [
            {
                "paradigm": "supervised",
                "signal_type": "labeled input-output pairs",
                "objective": "minimize prediction loss",
                "evaluation_measure": "held-out accuracy",
                "score": accuracy_score(y_test, prediction),
                "records_evaluated": len(y_test),
            }
        ]
    )


def run_unsupervised_learning(x: np.ndarray, y: np.ndarray) -> pd.DataFrame:
    """Run clustering and compare clusters to hidden labels for demonstration."""
    cluster_model = KMeans(n_clusters=2, random_state=RANDOM_SEED, n_init="auto")
    cluster_labels = cluster_model.fit_predict(x)

    cluster_alignment = adjusted_rand_score(y, cluster_labels)

    return pd.DataFrame(
        [
            {
                "paradigm": "unsupervised",
                "signal_type": "unlabeled observations",
                "objective": "discover latent cluster structure",
                "evaluation_measure": "adjusted Rand index against hidden labels",
                "score": cluster_alignment,
                "records_evaluated": len(y),
            }
        ]
    )


def run_q_learning_demo() -> tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    """
    Run a tiny tabular Q-learning demonstration.

    This example is intentionally small so the update logic is transparent.
    """
    states = [0, 1, 2]
    actions = [0, 1]

    q = np.zeros((len(states), len(actions)))

    alpha = 0.30
    gamma = 0.90

    experience = [
        (0, 1, 0.0, 1),
        (1, 1, 1.0, 2),
        (2, 0, 0.0, 2),
        (0, 1, 0.0, 1),
        (1, 1, 1.0, 2),
    ]

    updates = []

    for step, (state, action, reward, next_state) in enumerate(experience, start=1):
        old_value = q[state, action]
        target = reward + gamma * np.max(q[next_state])
        q[state, action] += alpha * (target - q[state, action])

        updates.append(
            {
                "step": step,
                "state": state,
                "action": action,
                "reward": reward,
                "next_state": next_state,
                "old_q": old_value,
                "target": target,
                "new_q": q[state, action],
            }
        )

    q_table = pd.DataFrame(q, columns=[f"action_{a}" for a in actions])
    q_table.insert(0, "state", states)

    q_summary = pd.DataFrame(
        [
            {
                "paradigm": "reinforcement",
                "signal_type": "state-action-reward transitions",
                "objective": "maximize discounted return",
                "evaluation_measure": "final Q-value table inspection",
                "score": float(q.max()),
                "records_evaluated": len(experience),
            }
        ]
    )

    return q_summary, pd.DataFrame(updates), q_table


def create_governance_memo(summary: pd.DataFrame) -> str:
    """Create a governance memo comparing learning paradigms."""
    supervised_score = summary.loc[
        summary["paradigm"] == "supervised",
        "score",
    ].iloc[0]

    unsupervised_score = summary.loc[
        summary["paradigm"] == "unsupervised",
        "score",
    ].iloc[0]

    reinforcement_score = summary.loc[
        summary["paradigm"] == "reinforcement",
        "score",
    ].iloc[0]

    return f"""# Learning Paradigm Governance Memo

## Summary

Supervised held-out accuracy: {supervised_score:.3f}
Unsupervised cluster alignment score: {unsupervised_score:.3f}
Reinforcement maximum Q-value: {reinforcement_score:.3f}

## Interpretation

- Supervised learning depends on the validity and fairness of labels.
- Unsupervised learning can reveal structure, but structure requires interpretation.
- Reinforcement learning depends on reward design and environment validity.
- These paradigms require different evaluation evidence.
- Before deployment, document signal type, objective, feedback timing, data provenance,
  evaluation design, failure modes, and human oversight requirements.
"""


def main() -> None:
    """Run supervised, unsupervised, and reinforcement learning demonstrations."""
    x, y = create_supervised_dataset()

    supervised_summary = run_supervised_learning(x, y)
    unsupervised_summary = run_unsupervised_learning(x, y)
    q_summary, q_updates, q_table = run_q_learning_demo()

    summary = pd.concat(
        [supervised_summary, unsupervised_summary, q_summary],
        ignore_index=True,
    )

    memo = create_governance_memo(summary)

    summary.to_csv(OUTPUT_DIR / "python_learning_paradigm_summary.csv", index=False)
    q_updates.to_csv(OUTPUT_DIR / "python_q_learning_updates.csv", index=False)
    q_table.to_csv(OUTPUT_DIR / "python_q_learning_table.csv", index=False)
    (OUTPUT_DIR / "python_learning_paradigm_governance_memo.md").write_text(memo)

    print("Learning paradigm summary")
    print(summary)

    print("\nQ-learning updates")
    print(q_updates)

    print("\nQ-table")
    print(q_table)

    print("\nGovernance memo")
    print(memo)


if __name__ == "__main__":
    main()

This workflow is intentionally simple, but it shows the fundamental distinction: supervised learning uses labels, unsupervised learning estimates structure, and reinforcement learning updates behavior from reward-bearing experience.

R Workflow: Learning-Paradigm Diagnostics by Signal Type

R is useful for grouped diagnostics and evaluation summaries. The following workflow simulates performance and risk patterns across learning paradigms and feedback conditions, then writes governance-ready summaries.

# Supervised, Unsupervised, and Reinforcement Learning
# R workflow: learning-paradigm diagnostics by signal type.
#
# This educational workflow simulates diagnostic differences across
# supervised, unsupervised, and reinforcement learning settings.

set.seed(42)

if (!dir.exists("outputs")) {
  dir.create("outputs")
}

n <- 1200

paradigm_eval <- data.frame(
  record_id = paste0("LP", sprintf("%04d", 1:n)),
  paradigm = sample(
    c("supervised", "unsupervised", "reinforcement"),
    n,
    replace = TRUE,
    prob = c(0.45, 0.30, 0.25)
  ),
  signal_quality = sample(
    c("high", "medium", "low"),
    n,
    replace = TRUE,
    prob = c(0.40, 0.40, 0.20)
  )
)

base_risk <- ifelse(
  paradigm_eval$paradigm == "supervised", 0.10,
  ifelse(paradigm_eval$paradigm == "unsupervised", 0.16, 0.22)
)

signal_multiplier <- ifelse(
  paradigm_eval$signal_quality == "high", 0.75,
  ifelse(paradigm_eval$signal_quality == "medium", 1.00, 1.45)
)

paradigm_eval$failure_probability <- pmin(base_risk * signal_multiplier, 0.90)
paradigm_eval$failure_event <- rbinom(
  n,
  size = 1,
  prob = paradigm_eval$failure_probability
)

summary_table <- aggregate(
  failure_event ~ paradigm + signal_quality,
  data = paradigm_eval,
  FUN = mean
)

names(summary_table)[3] <- "simulated_failure_rate"

paradigm_summary <- aggregate(
  failure_event ~ paradigm,
  data = paradigm_eval,
  FUN = mean
)

names(paradigm_summary)[2] <- "mean_failure_rate"

signal_summary <- aggregate(
  failure_event ~ signal_quality,
  data = paradigm_eval,
  FUN = mean
)

names(signal_summary)[2] <- "mean_failure_rate"

overall_summary <- data.frame(
  records_reviewed = nrow(paradigm_eval),
  overall_failure_rate = mean(paradigm_eval$failure_event),
  maximum_group_failure_rate = max(summary_table$simulated_failure_rate),
  minimum_group_failure_rate = min(summary_table$simulated_failure_rate),
  diagnostic_gap = max(summary_table$simulated_failure_rate) -
    min(summary_table$simulated_failure_rate)
)

review_flags <- summary_table[
  summary_table$simulated_failure_rate >
    overall_summary$overall_failure_rate + 0.05,
]

write.csv(paradigm_eval, "outputs/r_learning_paradigm_records.csv", row.names = FALSE)
write.csv(summary_table, "outputs/r_learning_paradigm_diagnostics.csv", row.names = FALSE)
write.csv(paradigm_summary, "outputs/r_learning_paradigm_summary.csv", row.names = FALSE)
write.csv(signal_summary, "outputs/r_signal_quality_summary.csv", row.names = FALSE)
write.csv(overall_summary, "outputs/r_learning_overall_summary.csv", row.names = FALSE)
write.csv(review_flags, "outputs/r_learning_review_flags.csv", row.names = FALSE)

memo <- paste0(
  "# Learning Paradigm Diagnostics Memo\n\n",
  "Records reviewed: ", nrow(paradigm_eval), "\n",
  "Overall failure rate: ", round(mean(paradigm_eval$failure_event), 3), "\n",
  "Maximum group failure rate: ",
  round(max(summary_table$simulated_failure_rate), 3), "\n",
  "Minimum group failure rate: ",
  round(min(summary_table$simulated_failure_rate), 3), "\n",
  "Diagnostic gap: ",
  round(overall_summary$diagnostic_gap, 3), "\n\n",
  "Interpretation:\n",
  "- Learning systems should be evaluated in relation to the feedback structure they depend on.\n",
  "- Supervised systems require label-quality review.\n",
  "- Unsupervised systems require interpretation and stability review.\n",
  "- Reinforcement learning systems require reward, environment, and safety review.\n",
  "- Low-quality signals should trigger additional governance controls before deployment.\n"
)

writeLines(memo, "outputs/r_learning_paradigm_diagnostics_memo.md")

print("Learning paradigm diagnostics")
print(summary_table)

print("Paradigm summary")
print(paradigm_summary)

print("Signal quality summary")
print(signal_summary)

print("Overall summary")
print(overall_summary)

print("Review flags")
print(review_flags)

cat(memo)

This workflow is synthetic, but the diagnostic logic is useful. Learning systems should be evaluated in relation to the feedback structure they depend on. Label quality, latent-structure ambiguity, delayed rewards, exploration risks, and deployment shift create different kinds of failure.
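Because the simulation's failure probabilities are fixed by its parameters, the rates it should approach as the sample grows can be computed directly. The following Python sketch reuses the shares, base risks, and multipliers from the R workflow and applies the same review-flag rule to the expected rates. It is a cross-check under the assumption, true of the R code, that paradigm and signal quality are sampled independently.

```python
# Parameters copied from the R workflow above.
paradigm_share = {"supervised": 0.45, "unsupervised": 0.30, "reinforcement": 0.25}
base_risk = {"supervised": 0.10, "unsupervised": 0.16, "reinforcement": 0.22}
signal_share = {"high": 0.40, "medium": 0.40, "low": 0.20}
signal_multiplier = {"high": 0.75, "medium": 1.00, "low": 1.45}

# Expected failure rate for each (paradigm, signal quality) cell, capped at 0.90.
expected_rate = {
    (paradigm, signal): min(base_risk[paradigm] * signal_multiplier[signal], 0.90)
    for paradigm in paradigm_share
    for signal in signal_share
}

# Expected overall rate: cell rates weighted by how often each cell is sampled.
overall_rate = sum(
    paradigm_share[paradigm] * signal_share[signal] * rate
    for (paradigm, signal), rate in expected_rate.items()
)

# Same rule as review_flags in the R code: more than 0.05 above the overall rate.
flagged = sorted(
    cell for cell, rate in expected_rate.items() if rate > overall_rate + 0.05
)

print(round(overall_rate, 5))
print(flagged)
```

Under these parameters the expected overall failure rate is about 0.147, and three cells exceed the flag threshold in expectation: reinforcement learning with low or medium signal quality, and unsupervised learning with low signal quality. A single simulated run can differ from these values because of sampling noise.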

GitHub Repository

The article body includes selected computational examples so the conceptual and mathematical argument remains readable. The full repository contains expanded computational infrastructure: advanced Jupyter notebooks, supervised classification labs, clustering and dimensionality-reduction examples, Q-learning simulations, self-supervised contrastive-learning demos, grouped diagnostics, SQL metadata schemas, model-card notes, governance documentation, and reproducible outputs.

Complete Code Repository

The full code distribution for this article includes code in Python, R, SQL, Julia, Rust, Go, TypeScript, and C++. It covers supervised classification labs, clustering and dimensionality-reduction workflows, Q-learning simulations, self-supervised contrastive-learning demos, reward-modeling notes, policy evaluation scaffolds, grouped diagnostics, SQL metadata, model-card notes, advanced notebooks, reproducible outputs, and audit scaffolding for studying supervised, unsupervised, and reinforcement learning.


From Learning Paradigms to Auditable AI Systems

Supervised, unsupervised, and reinforcement learning show that artificial intelligence is not one kind of learning. It is a family of learning regimes defined by signal, objective, feedback, and interaction. A model trained from labels, a model that discovers latent structure, and an agent that learns from reward all acquire information differently and therefore require different evaluation and governance strategies.

The central lesson is that learning paradigm determines risk structure. Supervised learning can scale flawed labels. Unsupervised learning can impose misleading categories. Reinforcement learning can optimize poorly specified rewards. Self-supervised and hybrid systems can combine several training stages, making provenance and auditability more complex. The question is not only whether a model performs well, but what kind of evidence shaped its behavior.

The future of trustworthy AI will require clear documentation of how learning occurred: what data was used, what signals were available, what objective was optimized, what feedback shaped the system, what evaluation was performed, and what limitations remain. Signal provenance, objective review, reward documentation, representation audits, grouped diagnostics, simulation validation, and post-deployment monitoring should become normal parts of AI system development.

Within the Artificial Intelligence Systems knowledge series, this article belongs near Machine Learning Foundations: How Systems Learn from Data; Model Training, Optimization, and Evaluation; Deep Learning Systems: Representation, Scale, and Generalization; Neural Networks and Pattern Recognition; Model Validation, Benchmarking, and Generalization Theory; Reinforcement Learning in Dynamic Environments; Data Quality, Bias, and Measurement in Machine Learning; and AI Governance and Regulatory Systems. It provides the conceptual bridge between learning theory, data structure, feedback design, model evaluation, and AI governance.

The final point is practical. Learning paradigms should not be treated as textbook categories alone. They should be treated as audit categories. To govern an AI system responsibly, one must know whether it learned from labels, latent structure, reward, preference, retrieval, simulation, interaction, or some hybrid of these signals. In short, learning paradigms must become visible, documented, and auditable.
