Vapnik–Chervonenkis theory


Vapnik–Chervonenkis theory was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.
VC theory is related to statistical learning theory and to empirical processes. Richard M. Dudley and Vladimir Vapnik, among others, have applied VC-theory to empirical processes.

Introduction

VC theory covers at least four parts :
VC Theory is a major subbranch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization.
In addition, VC theory and VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of the VC theory, and are employed in proving generalization. Several techniques will be introduced that are widely used in the empirical process and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics.

Overview of VC theory in Empirical Processes

Background on Empirical Processes

Let be random elements defined on a measurable space. For any measure on, and any measurable functions, define
Measurability issues will be ignored here, for more technical detail see. Let be a class of measurable functions and define:
Define the empirical measure
where here stands for the Dirac measure. The empirical measure induces a map given by:
Now suppose is the underlying true distribution of the data, which is unknown. Empirical Processes theory aims at identifying classes for which statements such as the following hold:
In the former case is called Glivenko-Cantelli class, and in the latter case the class is called Donsker or -Donsker. A Donsker class is Glivenko-Cantelli in probability by an application of Slutsky's theorem.
These statements are true for a single, by standard LLN, CLT arguments under regularity conditions, and the difficulty in the Empirical Processes comes in because joint statements are being made for all. Intuitively then, the set cannot be too large, and as it turns out that the geometry of plays a very important role.
One way of measuring how big the function set is to use the so-called covering numbers. The covering number
is the minimal number of balls needed to cover the set . The entropy is the logarithm of the covering number.
Two sufficient conditions are provided below, under which it can be proved that the set is Glivenko-Cantelli or Donsker.
A class is -Glivenko-Cantelli if it is -measurable with envelope such that and satisfies:
The next condition is a version of the celebrated Dudley's theorem. If is a class of functions such that
then is -Donsker for every probability measure such that. In the last integral, the notation means

Symmetrization

The majority of the arguments of how to bound the empirical process, rely on symmetrization, maximal and concentration inequalities and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions it is presented here.
Consider the empirical process:
Turns out that there is a connection between the empirical and the following symmetrized process:
The symmetrized process is a Rademacher process, conditionally on the data. Therefore, it is a sub-Gaussian process by Hoeffding's inequality.
Lemma. For every nondecreasing, convex and class of measurable functions,
The proof of the Symmetrization lemma relies on introducing independent copies of the original variables and replacing the inner expectation of the LHS by these copies. After an application of Jensen's inequality different signs could be introduced without changing the expectation. The proof can be found below because of its instructive nature.



Introduce the "ghost sample" to be independent copies of. For fixed values of one has:
Therefore, by Jensen's inequality:
Taking expectation with respect to gives:
Note that adding a minus sign in front of a term doesn't change the RHS, because it's a symmetric function of and. Therefore, the RHS remains the same under "sign perturbation":
for any. Therefore:
Finally using first triangle inequality and then convexity of gives:
Where the last two expressions on the RHS are the same, which concludes the proof.


A typical way of proving empirical CLTs, first uses symmetrization to pass the empirical process to and then argue conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.

VC Connection

It turns out that there is a fascinating connection between certain combinatorial properties of the set and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik-Chervonenkis classes of sets - or shortly VC sets.
Consider a collection of subsets of the sample space. is said to pick out a certain subset of the finite set if for some. is said to shatter if it picks out each of its subsets. The VC-index of is the smallest for which no set of size is shattered by.
Sauer's lemma then states that the number of subsets picked out by a VC-class satisfies:
Which is a polynomial number of subsets rather than an exponential number. Intuitively this means that a finite VC-index implies that has an apparent simplistic structure.
A similar bound can be shown for the so-called VC subgraph classes. For a function the subgraph is a subset of such that:. A collection of is called a VC subgraph class if all subgraphs form a VC-class.
Consider a set of indicator functions in for discrete empirical type of measure . It can then be shown that quite remarkably, for :
Further consider the symmetric convex hull of a set : being the collection of functions of the form with. Then if
the following is valid for the convex hull of :
The important consequence of this fact is that
which is just enough so that the entropy integral is going to converge, and therefore the class is going to be -Donsker.
Finally an example of a VC-subgraph class is considered. Any finite-dimensional vector space of measurable functions is VC-subgraph of index smaller than or equal to.



Take points. The vectors:
are in a dimensional subspace of. Take, a vector that is orthogonal to this subspace. Therefore:
Consider the set. This set cannot be picked out since if there is some such that that would imply that the LHS is strictly positive but the RHS is non-positive.


There are generalizations of the notion VC subgraph class, e.g. there is the notion of pseudo-dimension. The interested reader can look into.

VC Inequality

A similar setting is considered, which is more common to machine learning. Let is a feature space and. A function is called a classifier. Let be a set of classifiers. Similarly to the previous section, define the shattering coefficient :
Note here that there is a 1:1 go between each of the functions in and the set on which the function is 1. We can thus define to be the collection of subsets obtained from the above mapping for every. Therefore, in terms of the previous section the shattering coefficient is precisely
This equivalence together with Sauer's Lemma implies that is going to be polynomial in, for sufficiently large provided that the collection has a finite VC-index.
Let is an observed dataset. Assume that the data is generated by an unknown probability distribution. Define to be the expected 0/1 loss. Of course since is unknown in general, one has no access to. However the empirical risk, given by:
can certainly be evaluated. Then one has the following Theorem:

Theorem (VC Inequality)

For binary classification and the 0/1 loss function we have the following generalization bounds:
In words the VC inequality is saying that as the sample increases, provided that has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expected 0/1 risk. Note that both RHS of the two inequalities will converge to 0, provided that grows polynomially in.
The connection between this framework and the Empirical Process framework is evident. Here one is dealing with a modified empirical process
but not surprisingly the ideas are the same. The proof of the VC inequality, relies on symmetrization, and then argue conditionally on the data using concentration inequalities. The interested reader can check the book Theorems 12.4 and 12.5.