Sunday, December 5, 2010

stanford machine learning course - lecture#3

My takeaways from 3rd lecture of stanford machine learning course.

He starts with explaining probabilistic interpretation with some probability assumptions why least square cost function is a good choice for linear regression problem. Then, at a high level, concept of overfitting and underfitting is described. Then he makes a switch to classification problem and describes why linear regression algorithms discussed in the previous lecture are not good choice for classification problems. After setting up the background, he describes logistic regression algorithm for classification problem. Here, the hypothesis function becomes sigmoid/logistic function instead of linear. And, using the probabilistic interpretation we maximize log likelihood function using gradient ascent(pretty similar to gradient descent) and derive the update rule that looks identical to LMS(least mean square) rule derived in previous lecture. But, its not same because hypothesis is no longer linear but logistic function and take a note that its not a coincidence but there is a deeper reason behind this which will be explained in later lectures when he teaches GLM(Generalized Linear Models).
Then, he takes a digression and explains perceptron learning algorithm(for historical reasons) where hypothesis function is threshold function. This is not considered a good choice because this choice can not be supported with probabilistic interpretations like that in case of previous algorithms.

[This is covered in next lecture actually] Next, Newton's method is explained and then log likelihood is maximized by adapting newton's method by finding parameter $\theta$ where derivative of log likelihood is zero(actually it'll just find the optimal solution as we're not checking the sign of second derivative but in practice it usually maximizes the log likelihood). Usually newton's method converges more quickly than gradient ascent, but involves calculating inverse of hessian matrix of log likelihood function(expensive huh?) and hence good when number of features is small. The generalized version of newton's method in multidimensional setting is called Newton-Raphson method and also Fisher scoring.