## Monday, January 3, 2011

### stanford machine learning course - lecture#10

My takeaways from 10th lecture of stanford machine learning course.

This lecture continues the discussion of previous lecture for the case when set of all the hypothesis is infinite. It proves, by informal argument, that sample complexity grows linearly with the "number of parameters" of the hypothesis class to hold the bounds on generalized error with certain probability. Then it explains shattering and VC dimension. And then, states the theorem regarding bounds on generalized error. It is also noted that all the theory developed so far (including that in the prev lecture) is based on the learning algorithms that work by ERM(empirical risk minimization).

Then, we go into the very practical side of applying learning algorithms and try to answer the questions regarding model selection and feature selection.

Model Selection --
Model selection means selecting a particular model for a given algorithm. For example selecting the value of bandwidth parameter for locally weighted regression, selecting the value of degree of the polynomial in polynomial regression model. There are 3 main ways to do it.

simple cross validation(or hold out cross validation) : when there is plenty of training set, then we divide it into training set(70%) and cross validation set(30%). For each model, we train it using the training set and find generalized error of the classifier for the cross-validation set. Finally, we select the model with least generalized error on the training set.

k-fold cross validation : when data is scarce, then this technique is used where whole dataset is divided into k sets containing m/k elements each(m is the size of total dataset available). For each small set, we train the classifier on union of remaining k-1 small sets.. find generalized error on one hold out set. We do same for each small set and average the generalized error and call it the generalized error of the particular model. In the end we select the model with least generalized error. Typicall k=10 works.

leave-one-out cross validation: this is special case of k-fold cross validation where k = m. It is used when available dataset is too small.

Feature Selection --
If we have n features, and n is very large. The problem of feature selection means to select appropriate subset of n features such that classifier's generalized error remains roughly the same. The lecture describes 2 ways.

Forward Search : we start with empty set and add one feature in each iteration found most important to minimize training error. we stop when we reach a threshold k, the number of features we want.

Backward Search : we start with set containing all the n features and on each iteration remove one least important feature till we reach our threshold.

both of the above are examples of wrapper model feature selection, since it is a procedure that "wraps" around your learning algorithm and repeatedly makes calls to it to evaluate how well it does using different feature subsets. Computationally this is very expensive.

Filter feature selection methods of feature selection are not as effective as wrapper model but are computationally cheap. They work simply by computing the correlation/mutual-information of pairs of (feature-xi, output-y). we leave out the features i which have smaller scores.