## Monday, December 27, 2010

### stanford machine learning course - lecture#8

My takeaways from 8th lecture of stanford machine learning course.

Previous lecture concluded with deriving a dual problem(maximize W($\alpha$) w.r.t $\alpha$ within 2 constraints) for optimal margin classifier and noting that using SVM prediction depends only upon the inner product of input variable with those in the training sets.

This lecture introduces Kernel function, an inner product of feature mappings, i.e.
K(x,z) = $\phi(x)^T \phi(z)$ = $<\phi(x), \phi(z)> where$\phi$is called the feature mapping. Interestingly, K(x,z) can be very inexpensive to compute even when$\phi(x)\$ is very expensive to compute(lecture presents some examples including Gaussian Kernel). For this reason, kernel trick that is if you have any algorithm that you can write only in terms of inner products between input attribute vectors, then by replacing this with K(x,z) you can "magically" allow your algorithm to work efficiently in high dimensional feature space. And, this is the trick we apply to above optimal margin classifier and it becomes SVM. Also, the lecture proves one part of Mercer's theorem that says K is a valid kernel if and only if corresponding kernel matrix is positive semidefinite. Using this theorem it can be easily determined if a function is valid kernel.

So far we assumed that our training data was linearly separable. To relax this constraint, to make the algorithm work for non-linearly separable datasets as well, we can reformulate(called l1 regularization) the dual problem for optimal margin classifier.

Then, another optimization algorithm(we've studied gradient ascent/descent and newton's method already) called coordinate ascent is introduced that can be used to maximize any function w.r.t to some input parameters without any constraints. Now, we use a slightly modified version of coordinate ascent to solve dual problem derived above. And, this new way is called the SMO(sequential minimal optimization) algorithm