Note: this post meant to help clarify the tutorial questions (number 1, 3, 4, 5) for COMP 9417 – Week 11, School of Computer Science and Engineering, UNSW (s1 – 2017)
Regardless of machine learning algorithm that we choose, we still may want to know about these stuffs (slide – page 6):
 How many training examples a learner should have before it converges to correct hypothesis? (sample complexity)
 How large is a hypothesis space? How complex is a learner’s hypothesis? (hypothesis complexity)
 How many errors a learner are allowed to misclassify before it finally converges to a successful hypothesis? (mistake bounds)
Let’s take a look at first aspect.
Sample complexity – a simple example
Say we have simple consistent learner (learner that makes no mistake on the training data) to learn disjunction of d features. Target concept c is assumed to be expressed as disjunction of literals, where a literal is defined as some features x_{i} being true._{ }
First of all, we need to think about how to create such a learner. In lecture slide (page 50), here is FINDS algorithm to learn conjunction of literals:

 Initialise h to the most specific hypothesis:
 For each positive training instance x : remove from h any literal that is not satisfied by x
 Output hypothesis h
It is easier to understand this simple algorithm with an example.
Then we could just modify above FINDS algorithm to learn disjunction of literals.
 Initialise h to the most specific hypothesis:
 For each negative training instance x : remove from h any literal that is satisfied by x
 Output hypothesis h
Again, example is the easiest way to understand this concept.
The next step is finding the hypothesis space (). If we assumed that the target concept is the disjunction of d relevant binary features, then the hypothesis space is the amount of possible combination of those literals:
Once we know , we can analyse the sample complexity of our learner by finding m (the amount of training examples needed so the learner can converge to a correct hypothesis):
Since we know that : , then
Not just analyse the sample complexity, we can also evaluate our hypothesis complexity, which is the next topic in this post.
Hypothesis complexity – VC dimension
So far, we only measure hypothesis complexity from its “size” via a hypothesis space. Although it is useful, a hypothesis space could be very big (even infinite). It is good if we have another measurement that informs us a hypothesis’ expressivity or capacity to classsify the examples. Vapnik–Chervonenkis (VC) dimension provides us with this information. Here is definition of VC dimension:
size of largest set of instances that can be shattered by a particular hypothesis language or model class
For example, in 1 dimension world, how many points can be “shattered” into every possible combination? 2 points can be shattered, while 3 points can’t. While for 2 dimension world, the maximum number of points that can be shattered is 3. The learner can’t be shattered 4 points as there will be case of XOR relation between the points. In general, the number of VC dimension is the number of dimension plus 1. See the picture below for the better descriptions.
For 2 dimension case, there is possibility that 3 points are colinear. In this case, it is impossible to shatter those points. However, the shattering theory mentions that the shattering does not have to be happened in all possible configurations (or subsets). As long as there is 1 configuration that can be shattered completely, then the amount of points can be taken as VC dimension.
Next, we will discuss about the mistake bounds.
Mistake bounds
We want to know how many times our learner are allowed to misclassify before it finally converges to a correct hypothesis.
Here we will analyse the mistake bounds for Halving Algorithm for Boolean functions. In general, here is Halving Algorithm mechanism:
 Initialise the set of consistent hypothesis
 Repeat get new instance
 let be subset of that predict 0
 let be subset of that predict 1
 if predict 0 else 1
 if then else
Example of simple Halving algorithm implementation with a learner that has 6 hypotheses. In every iteration, assuming a correct hypothesis is existed, the hypotheses set will be reduced at least a half (if there is a mistake) or up to a half (if the prediction is correct).
From above example, a learner is only making mistake once, which is less than worstcase mistake bound (2.58). The bestcase mistake bound will give no mistake at all.
Next, we will evaluate the mistake bounds of a Winnow 2 algorithm. Winnow 2 is similar with the perceptron algorithm. The main difference is how the weights are updated if misclassifications happen. When correct class is 0 (negative), but the prediction is 1 (positive), the weights are demoted (divided by parameter). Otherwise, when correct class is 1 (positive), but the prediction is 0 (negative), the weights are promoted (multipled by parameter). Check the explanation below.
Apparently, after 5 examples are executed, there are 2 features that has dominant weights: and which actually confirms the target concept (). Are the weights converged after 5 trials? Check the picture below.
We can see that our learner makes 4 mistakes before finally converges to a correct hypothesis. This is still below worstcase mistake bound (4.6).
Sources:

 Lecture slide Algorithm Independent Aspects of Machine Learning (15 May 2017), Mike Bain, CSE – UNSW
 Tutorial questions and solutions of Algorithm Independent Aspects of Machine Learning, Mike Bain, CSE – UNSW

Flach, P. (2012). Machine learning: the art and science of algorithms that make sense of data. Cambridge University Press. –> pages 124 – 126

https://www.quora.com/ExplainVCdimensionandshatteringinlucidWay –> easy to understand explanation about VC dimension
Hi,
You mention shattering doesn’t have to happen for all possible configurations/subsets as long as there is 1 configuration/subset that can be shattered. So why can’t 4 points be shattered in 2 dimensions? There is the case where 2 points on the left and 2 points on the right and separated in the middle. So wouldn’t these 4 points be shattered?