I remember the confusion I had when professor Osamu Watanabe first gave me the Occam’s razor problem. With that confusion, I start to love the exciting world of theoretical machine learning. This post is about ERM scheme and some proofs around it. This post is my understanding of the second chapter in the book Under standing machine learning: From theory of algorithm by Shai Shalev-Shwartz and Shai Ben-David.
Here are some notations and scheme for our playground:
- Data space: - Where we “draw” data samples.
- Data distribution: - How the data in is distributed.
- Label space: - Where we have the labels for our samples.
- Labeling function: - Given a data sample, return its true label.
- Training set: of size - The set of training samples and its training labels.
- Hypotheses class: - A finite set of hypotheses .
- Hypothesis: - An estimation of , is the result of some learning algorithm and . Also called “prediction rule” in this post.
- Our task: Given a training set , we do not know or for all data points, how do we argue about the quality of the learned hypothesis compared to ?
Measuring risk on the training dataset
Suppose you are a hero going on an adventure, then suddenly the Orange Monster appears, gives you a bunch of… emails and challenges you to mark which email is an important email. The Orange Monster knows the true labeling of each email it gave you, it just wants to test your wisdom. Let’s say you invented some way (let’s call that way - stands for hypothesis) to mark the emails. How do you or the Orange Monster evaluate your solution? Obviously (or empirically), you would take the number of incorrect solution, divide it to the total number of emails (). And that is how you compute the Empirical Risk. Machine learning works in a similar way, if you give it a bunch of emails and ask it to classify which email is spam email, the machine will need some kind of evaluation to measure its own prediction. However, both we and the machines do not know about the true distribution of emails. Unfortunately we can only evaluate our predictions on the data given to us. Our journey begins when we think: “Hmm, so how much the empirical risk can tell us about the performance of our prediction algorithm on the real data population?”.
Empirical Risk Minimization (ERM) is a class of algorithm that returns the hypothesis that minimizes given a set of training samples . The empirical risk is defined as:
This formula can be read as: The empirical risk is the number of samples () at which our predictor doesn’t match with the true labeling function (subscript S denotes the risk is based on the training set of size only). In this scheme, we are given a set of hypotheses , a data space , a distribution over the data space , and a labeling function that labels a data samples to its “true” label. An algorithm is called ERM when it returns the hypothesis that minimizes the empirical risk. In another word, empirical risk can be viewed as the loss over the training data . It is possible that The weakness of empirical risk (or empirical loss) is when the training samples set contains only misleading samples.
Idealistically, we want to know the error of a prediction rule . This term can be interpreted as the error of predictor over (evaluated on) the data distribution and the labeling function (the true loss!). However, the data distribution and the true labeling function typically are not available in practice. Therefore, we have to rely on the empirical risk, which is evaluated on the training data . By definition, the (true) error of a prediction rule is:
The error of a prediction rule is the probability that the prediction rule returns a wrong label, given a data sample from .
Overfitting in ERM
Since the risk (loss) is measured on the training data , ERM is prone to over fitting. Simply speaking, the output hypothesis from ERM can just map training samples to training labels. If such hypothesis is chosen, the risk empirical will be 0, but the prediction rule error is not guaranteed to be “small”. In other words, we do not know if we have a “good” hypothesis or just an overfitting hypothesis over . It is proven that the infinite polynomial threshold-based hypothesis class can always overfit the training data.
Bound for “bad” training samples
To address the case where the empirical risk is 0 but the hypothesis is “bad”, let’s call the minimum acceptable true risk epsilon (), and have two assumption about the data distribution and the hypothesis class:
- Each data sample in is distributed independently and identically (i.i.d.).
- There exists s.t. (Realizability Assumption).
We care (or worry) about the case where , but , which means even the predictor (hypothesis) that yeilds the smallest empirical loss is still not good enough for the whole data prediciton.
Let’s call the class of bad hypotheses , which contains the hypotheses that has the error prediction rule () larger than for some defined . Now, we would like to know how likely an ERM algorithm will choose one of the bad hypotheses. Formally, we would like to bound the probability of getting the sample set that leads to a bad hypothesis :
In here, we analyze the effect of the sample size to the probability of getting a bad hypothesis from the ERM learning scheme. To be clear, in here, bad hypothesis means the hypothesis that gives high error rate; bad training dataset means the training dataset that makes ERM returns a bad hypothesis, and misleading training dataset means the training dataset that have zero empirical risk. The quick outline of the development for the upper bound is as follow:
- Prove that the sets of bad training datasets is a subset of set of all misleading training datasets.
- Expand set of all misleading training datasets by union of bad training datasets corresponding to each bad hypothesis.
- The probability of getting a bad training dataset (hence bad hypothesis) is bounded by the probability of getting a misleading training dataset.
- Bound the probability of getting a misleading training dataset by union bound.
- Fix a misleading training set and a distribution, use i.i.d. assumption to bound the probability of getting that set. Use .
- Finally show that:
We call the set of missleading examples . This set contains sets of training dataset that can lead to a bad hypothesis: