I remember the confusion I had when professor Osamu Watanabe
first gave me the Occam's razor problem.
With that confusion, I started to fall in love with the exciting world of theoretical machine learning. This post is
about the ERM scheme and some proofs around it. It is my understanding of the second chapter
of the book **Understanding Machine Learning: From Theory to Algorithms** by Shai Shalev-Shwartz and
Shai Ben-David.

## Playground

Here are the notation and setting for our playground:

- Data space: $\mathcal{X}$ - where we "draw" data samples.
- Data distribution: $\mathcal{D}$ - how the data in $\mathcal{X}$ is distributed.
- Label space: $\mathcal{Y}$ - where we have the labels for our samples.
- Labeling function: $f: \mathcal{X} \to \mathcal{Y}$ - given a data sample, returns its true label.
- Training set: $S = ((x_1, y_1), \dots, (x_m, y_m))$ of size $m$ - the set of training samples and their labels.
- Hypothesis class: $\mathcal{H}$ - a finite set of hypotheses.
- Hypothesis: $h: \mathcal{X} \to \mathcal{Y}$ - an estimation of $f$, the result of some learning algorithm, with $h \in \mathcal{H}$. Also called a "prediction rule" in this post.

**Our task**: Given a training set $S$, and knowing neither $\mathcal{D}$ nor $f$ for all data points, how do we argue about the *quality* of the learned hypothesis $h$ compared to $f$?
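To make the notation concrete, here is a minimal Python sketch of the playground. The specific data space, distribution, and labeling function below are toy assumptions of mine (not from the book), chosen so the objects are easy to inspect:

```python
import random

# A toy instantiation of the playground (all concrete choices here are
# illustrative): the data space X is the integers 0..99, the label space
# Y is {0, 1}, and the data distribution D is uniform over X.
DATA_SPACE = list(range(100))

def f(x):
    """The "true" labeling function f: label 1 iff x is even."""
    return 1 if x % 2 == 0 else 0

def draw_training_set(m, seed=0):
    """Draw a training set S of size m i.i.d. from the uniform distribution,
    labeled by the true labeling function f."""
    rng = random.Random(seed)
    xs = [rng.choice(DATA_SPACE) for _ in range(m)]
    return [(x, f(x)) for x in xs]

S = draw_training_set(10)
```

In a real learning problem we would only see `S`; the distribution and `f` stay hidden, which is exactly the difficulty the rest of the post deals with.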

## Measuring risk on the training dataset

Suppose you are a hero going on an adventure, then suddenly the Orange Monster
appears, gives you a bunch of… emails and challenges you to mark which emails
are important. The Orange Monster knows the **true** labeling of each
email it gave you; it just wants to test your wisdom. Let's say you invented
some way (let's call that way $h$, which stands for *hypothesis*) to mark the
emails. How do you or the Orange Monster evaluate your solution? Obviously
(or empirically), you would take the number of *incorrect* answers and divide
it by the total number of emails ($m$). And that is how you compute the
Empirical Risk. Machine learning works in a similar way: if you give a machine a
bunch of emails and ask it to classify which emails are spam, the machine
will need some kind of evaluation to measure its own predictions. However,
both we and the machines do not know the true distribution of emails.
Unfortunately, we can only evaluate our predictions on the data given to us.
Our journey begins when we think: "Hmm, so how much can the empirical risk
tell us about the performance of our prediction algorithm on the real data
population?".

Empirical Risk Minimization (ERM) is a class of algorithms that return the hypothesis minimizing the empirical risk $L_S(h)$ given a set of training samples $S$. The empirical risk is defined as:

$$L_S(h) \stackrel{\text{def}}{=} \frac{\left|\{i \in [m] : h(x_i) \neq y_i\}\right|}{m}$$

This formula can be read as: the *empirical risk* is the fraction of the $m$ samples at which our
predictor $h$ doesn't match the true labeling function (the subscript $S$ denotes that the risk is based
on the training set $S$ of size $m$ only). In this scheme, we are given a set of hypotheses
$\mathcal{H}$, a data space $\mathcal{X}$, a distribution $\mathcal{D}$ over the data space,
and a labeling function $f$ that maps a
data sample to its "true" label. **An algorithm is called ERM when it returns the hypothesis
that minimizes the empirical risk**. In other words, the empirical risk can be viewed as the loss
over the training data $S$. The weakness of the *empirical risk*
(or empirical loss) shows when the
training sample set contains only *misleading samples*.
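The definition above translates directly into code. Here is a small sketch of the empirical risk and of ERM over a finite hypothesis class (the function names are my own):

```python
def empirical_risk(h, S):
    """L_S(h): the fraction of training samples on which h disagrees
    with the given label."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def erm(H, S):
    """ERM over a finite hypothesis class: return a hypothesis in H
    that minimizes the empirical risk on S."""
    return min(H, key=lambda h: empirical_risk(h, S))
```

For example, on `S = [(0, 0), (1, 1), (2, 0)]` the constant-zero hypothesis has empirical risk $1/3$ (it misses the sample $(1, 1)$), while the parity hypothesis `lambda x: x % 2` has empirical risk $0$, so `erm` would return the latter.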

Ideally, we want to know the **error of a prediction rule** $h$, written $L_{\mathcal{D},f}(h)$. This term
can be interpreted as the error of the predictor $h$ over (evaluated on) the data distribution $\mathcal{D}$
and the labeling function $f$ (the true loss!). However, the data distribution and the true
labeling function are typically not available in practice. Therefore, we have to rely on the
empirical risk, which is evaluated on the training data $S$. By definition, the (true) **error
of a prediction rule** is:

$$L_{\mathcal{D},f}(h) \stackrel{\text{def}}{=} \mathbb{P}_{x \sim \mathcal{D}}\left[h(x) \neq f(x)\right]$$

The error of a prediction rule is the probability that the prediction rule returns a wrong label for a data sample drawn from $\mathcal{D}$.
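In a simulation where we *do* control $\mathcal{D}$ and $f$, we can approximate this probability by Monte Carlo sampling. The sketch below is my own illustration, not something from the book, and it only works because the simulation knows the distribution:

```python
import random

def estimate_true_error(h, f, draw, n=100_000, seed=1):
    """Monte Carlo estimate of L_{D,f}(h) = P_{x ~ D}[h(x) != f(x)].
    `draw(rng)` samples one point from D. In practice D and f are unknown,
    so this is only possible inside a simulation."""
    rng = random.Random(seed)
    wrong = 0
    for _ in range(n):
        x = draw(rng)
        if h(x) != f(x):
            wrong += 1
    return wrong / n
```

For instance, with $\mathcal{D}$ uniform on $\{0, \dots, 9\}$, $f(x) = \mathbb{1}[x < 5]$ and $h(x) = \mathbb{1}[x < 4]$, the two rules disagree only at $x = 4$, so the true error is $0.1$ and the estimate should land close to it.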

## Overfitting in ERM

Since the risk (loss) is measured on the training data $S$, ERM is prone to overfitting. Simply speaking, the output hypothesis from ERM can just map the training samples to the training labels. If such a hypothesis is chosen, the empirical risk will be 0, but the error of the prediction rule is not guaranteed to be "small". In other words, we do not know whether we have a "good" hypothesis or just a hypothesis that overfits $S$. It is known that a sufficiently rich hypothesis class, such as polynomial threshold functions of unbounded degree, can always fit (and overfit) the training data perfectly.
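The canonical overfitting hypothesis is a pure memorizer. The sketch below (my own toy construction) attains zero empirical risk on any consistent training set while carrying no information about unseen points:

```python
def memorize(S, default=0):
    """The classic overfitting hypothesis: answer from a lookup table of
    the training samples and return an arbitrary default label elsewhere.
    Its empirical risk on S is 0, yet its true error can be large."""
    table = dict(S)
    return lambda x: table.get(x, default)
```

A memorizer built from `S = [(1, 1), (2, 0), (3, 1)]` labels those three points correctly but answers `0` everywhere else, regardless of the true labeling function.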

## Bound for “bad” training samples

To address the case where the empirical risk is 0 but the hypothesis is "bad", let's call the
maximum acceptable *true* risk epsilon ($\epsilon$), and make two
assumptions about the data distribution and the hypothesis class:

- Each data sample in $S$ is drawn from $\mathcal{D}$ independently and identically (i.i.d.).
- There exists $h^\star \in \mathcal{H}$ s.t. $L_{\mathcal{D},f}(h^\star) = 0$ (Realizability Assumption).

We care (or worry) about the case where $L_S(h_S) = 0$ but $L_{\mathcal{D},f}(h_S) > \epsilon$, which means even the predictor (hypothesis) that yields the smallest empirical loss is still not good enough for prediction over the whole data distribution.

Let's call the class of bad hypotheses $\mathcal{H}_B$, which contains the hypotheses
whose true error (the error of the prediction rule, $L_{\mathcal{D},f}(h)$) is larger than $\epsilon$ for some
defined $\epsilon$:

$$\mathcal{H}_B = \{h \in \mathcal{H} : L_{\mathcal{D},f}(h) > \epsilon\}$$

Now, we would like to know **how likely an ERM algorithm is to
choose one of the bad hypotheses**. Formally, we would like to bound the probability
of getting a sample set $S$ that leads to a bad hypothesis
$h_S$:

$$\mathcal{D}^m\left(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\right)$$

Here, we analyze the effect of the sample size $m$ on the probability of
getting a bad hypothesis from the ERM learning scheme. To be clear, in this section, a
**bad hypothesis** means a hypothesis with a high true error rate; a **bad training
dataset** means a training dataset that makes ERM return a bad hypothesis; and a
**misleading training dataset** means a training dataset on which some bad hypothesis has zero empirical
risk. The quick outline of the development of the upper bound is as follows:

- Prove that the set of bad training datasets is a subset of the set of all misleading training datasets.
- Expand the set of all misleading training datasets as a union, over each bad hypothesis, of the training datasets it misleads.
- The probability of getting a bad training dataset (hence a bad hypothesis) is bounded by the probability of getting a misleading training dataset.
- Bound the probability of getting a misleading training dataset by the union bound.
- Fix a bad hypothesis and a distribution; use the i.i.d. assumption to bound the probability of drawing a training set that it misleads. Use $1 - \epsilon \leq e^{-\epsilon}$.
- Finally, show that:

$$\mathcal{D}^m\left(\{S|_x : L_{\mathcal{D},f}(h_S) > \epsilon\}\right) \leq |\mathcal{H}|\, e^{-\epsilon m}$$
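It helps to see how fast this bound shrinks as the sample size grows. The sketch below just evaluates the expression $|\mathcal{H}|\, e^{-\epsilon m}$; the concrete numbers are purely illustrative:

```python
import math

def misleading_bound(H_size, epsilon, m):
    """The upper bound |H| * exp(-epsilon * m) on the probability that an
    i.i.d. sample of size m leads ERM to a hypothesis with true error
    above epsilon (under the realizability assumption)."""
    return H_size * math.exp(-epsilon * m)
```

For example, with $|\mathcal{H}| = 1000$ and $\epsilon = 0.1$, the bound is vacuous at $m = 0$ (it equals $1000$) but drops below $10^{-5}$ by $m = 200$: the failure probability decays exponentially in the sample size.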

We call the set of *misleading* training samples $M$. This set contains the training
datasets on which some bad hypothesis attains zero empirical risk, and which can therefore lead ERM to a bad hypothesis: