My takeaways from 9th lecture of stanford machine learning course.
Up until now, all the lecture focused on supervised learning algorithms and that covers most of them. This lecture talks "learning theory" i.e. how to analyze when to use or not to use a particular algorithm and this is what separates an expert from an average. In particular, this lecture introduces high bias(underfitting) and high variance(overfitting). It talks about empirical error and generalized error. It, for the case of finite sized set of all the hypothesis, proves the uniform convergence result. Based on it, an equation is derived for generalized error of best possible hypothesis(obtained by empirical risk minimization) that shows a bias-variance trade-off in model selection.
Next lecture will cover the case when set of all hypothesis is not finite.
Thursday, December 30, 2010
Monday, December 27, 2010
stanford machine learning course - lecture#8
My takeaways from 8th lecture of stanford machine learning course.
Previous lecture concluded with deriving a dual problem(maximize W($\alpha$) w.r.t $\alpha$ within 2 constraints) for optimal margin classifier and noting that using SVM prediction depends only upon the inner product of input variable with those in the training sets.
This lecture introduces Kernel function, an inner product of feature mappings, i.e.
K(x,z) = $\phi(x)^T \phi(z)$ = $<\phi(x), \phi(z)>
where $\phi$ is called the feature mapping. Interestingly, K(x,z) can be very inexpensive to compute even when $\phi(x)$ is very expensive to compute(lecture presents some examples including Gaussian Kernel). For this reason, kernel trick that is if you have any algorithm that you can write only in terms of inner products between input attribute vectors, then by replacing this with K(x,z) you can "magically" allow your algorithm to work efficiently in high dimensional feature space. And, this is the trick we apply to above optimal margin classifier and it becomes SVM. Also, the lecture proves one part of Mercer's theorem that says K is a valid kernel if and only if corresponding kernel matrix is positive semidefinite. Using this theorem it can be easily determined if a function is valid kernel.
So far we assumed that our training data was linearly separable. To relax this constraint, to make the algorithm work for non-linearly separable datasets as well, we can reformulate(called l1 regularization) the dual problem for optimal margin classifier.
Then, another optimization algorithm(we've studied gradient ascent/descent and newton's method already) called coordinate ascent is introduced that can be used to maximize any function w.r.t to some input parameters without any constraints. Now, we use a slightly modified version of coordinate ascent to solve dual problem derived above. And, this new way is called the SMO(sequential minimal optimization) algorithm
Previous lecture concluded with deriving a dual problem(maximize W($\alpha$) w.r.t $\alpha$ within 2 constraints) for optimal margin classifier and noting that using SVM prediction depends only upon the inner product of input variable with those in the training sets.
This lecture introduces Kernel function, an inner product of feature mappings, i.e.
K(x,z) = $\phi(x)^T \phi(z)$ = $<\phi(x), \phi(z)>
where $\phi$ is called the feature mapping. Interestingly, K(x,z) can be very inexpensive to compute even when $\phi(x)$ is very expensive to compute(lecture presents some examples including Gaussian Kernel). For this reason, kernel trick that is if you have any algorithm that you can write only in terms of inner products
So far we assumed that our training data was linearly separable. To relax this constraint, to make the algorithm work for non-linearly separable datasets as well, we can reformulate(called l1 regularization) the dual problem for optimal margin classifier.
Then, another optimization algorithm(we've studied gradient ascent/descent and newton's method already) called coordinate ascent is introduced that can be used to maximize any function w.r.t to some input parameters without any constraints. Now, we use a slightly modified version of coordinate ascent to solve dual problem derived above. And, this new way is called the SMO(sequential minimal optimization) algorithm
Thursday, December 23, 2010
stanford machine learning course - lecture#7
My takeaways from 7th lecture of stanford machine learning course.
This lecture starts with defining optimal margin classifier that works by maximizing the (geometric)margin. This basically defines an optimization problem. Then we transform it into a convex optimization problem so that it could be solved using an out of the box quadratic problem(a type of convex optimization problem) solver. [It is good to follow 2 handouts, given as course material, on convex optimization if you have never studied it.]
Then, the lecture introduces essential material from convex optimization theory, mainly the lagrangian and primal, dual problems and KKT conditions when solution to primal and dual problems become same. Usually dual problem is easier to solve, So if a problem satisfies KKT conditions then we can just solve dual problem to find solution to primal problem also.
After that we get back to optimal margin classifier optimization problem and convert it into a dual problem. Solving which we see an interesting property that for any x, output variable y can be evaluated by only calculating inner product of x with a handful of input variables from training set. Because of this property, this algorithm scales really well to solve classification problems. And, this property is what makes support vector machine to efficiently learn in very high dimensional spaces.
This lecture starts with defining optimal margin classifier that works by maximizing the (geometric)margin. This basically defines an optimization problem. Then we transform it into a convex optimization problem so that it could be solved using an out of the box quadratic problem(a type of convex optimization problem) solver. [It is good to follow 2 handouts, given as course material, on convex optimization if you have never studied it.]
Then, the lecture introduces essential material from convex optimization theory, mainly the lagrangian and primal, dual problems and KKT conditions when solution to primal and dual problems become same. Usually dual problem is easier to solve, So if a problem satisfies KKT conditions then we can just solve dual problem to find solution to primal problem also.
After that we get back to optimal margin classifier optimization problem and convert it into a dual problem. Solving which we see an interesting property that for any x, output variable y can be evaluated by only calculating inner product of x with a handful of input variables from training set. Because of this property, this algorithm scales really well to solve classification problems. And, this property is what makes support vector machine to efficiently learn in very high dimensional spaces.
Wednesday, December 22, 2010
reasons I will not use flex for
This is not an anti-flex post but the list of issues I *personally* dealt with and would avoid using flex for. Of course there are work arounds for all of these but I would expect them to be available out of the box.
1. Flex does not send "cookies" header while "file upload" using FileReference.upload() in Firefox.
2. If you make a http request from flex app, you only get success/failure and not HTTP status code.
3. Flex application compiles into "swf" files which are rather large for a web application and for the people using your website with lower bandwidths it becomes terribly slow. (it may sometimes need to download the flex framework files also if the person has never visited a site having flex content before)
some more demotivator types..
4. It simply can not be debugged on the fly in the browser. For javascript, now a days you can do all the debugging on the browser itself and it becomes so easy.
5. FlexBuilder commertial license costs money :).
6. Personally, as a developer, I would rather want to learn/master javascript than actionscript.
7. World is pushing HTML[5]/Javascript/Css[3] hard. Javascript browser implementations, UI libraries/widgets are only going to get better and faster.
We chose flex over html/javascript because it seemed that it would be easier/quicker to create cool UI(RIA) with flex than html/javascript. But, now we believe same and more cool UI are possible and easily doable(with so many options, jqueryUI, GWT etc) with html/js also.
1. Flex does not send "cookies" header while "file upload" using FileReference.upload() in Firefox.
2. If you make a http request from flex app, you only get success/failure and not HTTP status code.
3. Flex application compiles into "swf" files which are rather large for a web application and for the people using your website with lower bandwidths it becomes terribly slow. (it may sometimes need to download the flex framework files also if the person has never visited a site having flex content before)
some more demotivator types..
4. It simply can not be debugged on the fly in the browser. For javascript, now a days you can do all the debugging on the browser itself and it becomes so easy.
5. FlexBuilder commertial license costs money :).
6. Personally, as a developer, I would rather want to learn/master javascript than actionscript.
7. World is pushing HTML[5]/Javascript/Css[3] hard. Javascript browser implementations, UI libraries/widgets are only going to get better and faster.
We chose flex over html/javascript because it seemed that it would be easier/quicker to create cool UI(RIA) with flex than html/javascript. But, now we believe same and more cool UI are possible and easily doable(with so many options, jqueryUI, GWT etc) with html/js also.
Saturday, December 18, 2010
stanford machine learning course - lecture#6
My takeaways from 6th lecture of stanford machine learning course.
In this lecture naive bayes classifier algorithm, introduced in previous lecture, is extended to the case where each feature($x_i$) can be any positive integer as opposed to just being 0 or 1. For this reason, in previous case the model is called multi-variate Bernoulli event model and in this case its called multinomial event model. This model is commonly used for text classification problems and hence also known as event model for text classification.
Then, briefly, neural network is mentioned with 2 videos showing early on milestone projects. One was about recognizing a digit from a picture and the other one about text to speech algorithm. Both of them used neural network. It was to show that neural network was considered a very good solution for a wide range of classification problems untill support vector machine, a supervised learning algorithm, was invented. It is considered among the best off the shelf supervised learning algorithm now a days.
Remaining lecture sets the foundations for understanding support vector machine to be explained in later lectures. Now a new notation , concept of functional and geometrical margins are explained. And, maximizing margin classifier is briefly mentioned. It is told that this classifier can perform almost equivalent to logistic regression but more importantly this happens to be the base for support vector machine that can handle features in infinite dimension.
In this lecture naive bayes classifier algorithm, introduced in previous lecture, is extended to the case where each feature($x_i$) can be any positive integer as opposed to just being 0 or 1. For this reason, in previous case the model is called multi-variate Bernoulli event model and in this case its called multinomial event model. This model is commonly used for text classification problems and hence also known as event model for text classification.
Then, briefly, neural network is mentioned with 2 videos showing early on milestone projects. One was about recognizing a digit from a picture and the other one about text to speech algorithm. Both of them used neural network. It was to show that neural network was considered a very good solution for a wide range of classification problems untill support vector machine, a supervised learning algorithm, was invented. It is considered among the best off the shelf supervised learning algorithm now a days.
Remaining lecture sets the foundations for understanding support vector machine to be explained in later lectures. Now a new notation , concept of functional and geometrical margins are explained. And, maximizing margin classifier is briefly mentioned. It is told that this classifier can perform almost equivalent to logistic regression but more importantly this happens to be the base for support vector machine that can handle features in infinite dimension.
Wednesday, December 15, 2010
stanford machine learning course - lecture#5
My takeaways from 5th lecture of stanford machine learning course.
The algorithms, we studied in the lectures so far, are all discriminative learning algorithms. Formally, algorithms that try to learn p(y|x) directly are called discriminative learning algorithms. This lecture is about generative learning algorithms where we try to model p(x|y) , p(y) and using these we obtain p(y|x) using the bayes theorem.
Next, multivariate normal distribution (or multivariate gaussian distribution) is explained and how its shape/size changes when you change covariance matrix.
Then, Gaussian Discriminant Analysis is explained that is used to solve classification problems(where y can have multiple values and not just 2 like that in case of logistic regression) where features x are continuous-valued random variables. It assumes that p(x|y) has multivariate normal distribution. It also explains that if p(x|y) is multivariate normal distribution then p(y|x) turns out to be logistic regression(viceversa is not true) and shows that if p(x|y) is approximately normal then one should be using gaussian discriminant model as it gives better results but if p(x|y) is completely unknown then using logistic regression might give better results as p(y|x) turns out to be logistic in many more cases, for example when p(x|y) is poisson.
Next, Naive Bayes classifier is explained that again is used for classification problems and makes naive bayes assumption for the features, assumes that x is discrete random variable. Then, Laplace smoothing is introduced for the cases that are absent in training data for naive bayes classifier.
The algorithms, we studied in the lectures so far, are all discriminative learning algorithms. Formally, algorithms that try to learn p(y|x) directly are called discriminative learning algorithms. This lecture is about generative learning algorithms where we try to model p(x|y) , p(y) and using these we obtain p(y|x) using the bayes theorem.
Next, multivariate normal distribution (or multivariate gaussian distribution) is explained and how its shape/size changes when you change covariance matrix.
Then, Gaussian Discriminant Analysis is explained that is used to solve classification problems(where y can have multiple values and not just 2 like that in case of logistic regression) where features x are continuous-valued random variables. It assumes that p(x|y) has multivariate normal distribution. It also explains that if p(x|y) is multivariate normal distribution then p(y|x) turns out to be logistic regression(viceversa is not true) and shows that if p(x|y) is approximately normal then one should be using gaussian discriminant model as it gives better results but if p(x|y) is completely unknown then using logistic regression might give better results as p(y|x) turns out to be logistic in many more cases, for example when p(x|y) is poisson.
Next, Naive Bayes classifier is explained that again is used for classification problems and makes naive bayes assumption for the features, assumes that x is discrete random variable. Then, Laplace smoothing is introduced for the cases that are absent in training data for naive bayes classifier.
Wednesday, December 8, 2010
stanford machine learning course - lecture#4
My takeaways from 4th lecture of stanford machine learning course.
This lecture starts with explaining Newton's method as noted in the last post. Then introduces exponential family distributions and shows that bernoulli and gaussian belong to exponential family distributions. In fact many more like poisson, gamma, beta, multinomial distribution etc, all of them belong to exponential family.
Then GLM(Generalized Linear Model) is introduced which can be use to solve any regression or classification problem as long as P(y|x;$\theta$) has a distribution that belongs to exponential family and certain conditions are fulfilled.
Then the lecturer shows that linear and logistic regression are just 2 special cases of GLM. He derives the hypothesis functions for linear and logistic regression using GLM as for linear regression P(y|x;$\theta$) is gaussian and for logistic regression it is bernoulli... both of them are exponential family distributions.
Then Softmax regression is explained and worked out which is yet another special case of GLM where response variable has multinomial distribution. Softmax regression can be used for classification problems where output/response variable can take more than 2 values and logistic regression is not appropriate.
This lecture starts with explaining Newton's method as noted in the last post. Then introduces exponential family distributions and shows that bernoulli and gaussian belong to exponential family distributions. In fact many more like poisson, gamma, beta, multinomial distribution etc, all of them belong to exponential family.
Then GLM(Generalized Linear Model) is introduced which can be use to solve any regression or classification problem as long as P(y|x;$\theta$) has a distribution that belongs to exponential family and certain conditions are fulfilled.
Then the lecturer shows that linear and logistic regression are just 2 special cases of GLM. He derives the hypothesis functions for linear and logistic regression using GLM as for linear regression P(y|x;$\theta$) is gaussian and for logistic regression it is bernoulli... both of them are exponential family distributions.
Then Softmax regression is explained and worked out which is yet another special case of GLM where response variable has multinomial distribution. Softmax regression can be used for classification problems where output/response variable can take more than 2 values and logistic regression is not appropriate.
Sunday, December 5, 2010
stanford machine learning course - lecture#3
My takeaways from 3rd lecture of stanford machine learning course.
He starts with explaining probabilistic interpretation with some probability assumptions why least square cost function is a good choice for linear regression problem. Then, at a high level, concept of overfitting and underfitting is described. Then he makes a switch to classification problem and describes why linear regression algorithms discussed in the previous lecture are not good choice for classification problems. After setting up the background, he describes logistic regression algorithm for classification problem. Here, the hypothesis function becomes sigmoid/logistic function instead of linear. And, using the probabilistic interpretation we maximize log likelihood function using gradient ascent(pretty similar to gradient descent) and derive the update rule that looks identical to LMS(least mean square) rule derived in previous lecture. But, its not same because hypothesis is no longer linear but logistic function and take a note that its not a coincidence but there is a deeper reason behind this which will be explained in later lectures when he teaches GLM(Generalized Linear Models).
Then, he takes a digression and explains perceptron learning algorithm(for historical reasons) where hypothesis function is threshold function. This is not considered a good choice because this choice can not be supported with probabilistic interpretations like that in case of previous algorithms.
[This is covered in next lecture actually] Next, Newton's method is explained and then log likelihood is maximized by adapting newton's method by finding parameter $\theta$ where derivative of log likelihood is zero(actually it'll just find the optimal solution as we're not checking the sign of second derivative but in practice it usually maximizes the log likelihood). Usually newton's method converges more quickly than gradient ascent, but involves calculating inverse of hessian matrix of log likelihood function(expensive huh?) and hence good when number of features is small. The generalized version of newton's method in multidimensional setting is called Newton-Raphson method and also Fisher scoring.
He starts with explaining probabilistic interpretation with some probability assumptions why least square cost function is a good choice for linear regression problem. Then, at a high level, concept of overfitting and underfitting is described. Then he makes a switch to classification problem and describes why linear regression algorithms discussed in the previous lecture are not good choice for classification problems. After setting up the background, he describes logistic regression algorithm for classification problem. Here, the hypothesis function becomes sigmoid/logistic function instead of linear. And, using the probabilistic interpretation we maximize log likelihood function using gradient ascent(pretty similar to gradient descent) and derive the update rule that looks identical to LMS(least mean square) rule derived in previous lecture. But, its not same because hypothesis is no longer linear but logistic function and take a note that its not a coincidence but there is a deeper reason behind this which will be explained in later lectures when he teaches GLM(Generalized Linear Models).
Then, he takes a digression and explains perceptron learning algorithm(for historical reasons) where hypothesis function is threshold function. This is not considered a good choice because this choice can not be supported with probabilistic interpretations like that in case of previous algorithms.
[This is covered in next lecture actually] Next, Newton's method is explained and then log likelihood is maximized by adapting newton's method by finding parameter $\theta$ where derivative of log likelihood is zero(actually it'll just find the optimal solution as we're not checking the sign of second derivative but in practice it usually maximizes the log likelihood). Usually newton's method converges more quickly than gradient ascent, but involves calculating inverse of hessian matrix of log likelihood function(expensive huh?) and hence good when number of features is small. The generalized version of newton's method in multidimensional setting is called Newton-Raphson method and also Fisher scoring.
Friday, December 3, 2010
stanford machine learning course - lecture#2
I watched the 2nd lecture video and followed lecture notes from stanford machine learning course. At a very high level, Following is what I took away from it.
It formally explains what Supervised Learning is and explains what is the difference between a Regression problem and Classification problem. Then it defines the (least squares)cost function, called J($\theta$) and talks about algorithms that find $\theta$ that minimize the cost function J($\theta$).
First It describes Gradient Descent(Batch/Stochastic) algorithm that leads to Least mean square or Widrow-Hoff learning rule.
Along with some notation for derivatives of Matrix, It derives Normal equation, that can directly determine $\theta$ to minimize cost function, by solving the equation derivative-of(J($\theta$)) = 0
It formally explains what Supervised Learning is and explains what is the difference between a Regression problem and Classification problem. Then it defines the (least squares)cost function, called J($\theta$) and talks about algorithms that find $\theta$ that minimize the cost function J($\theta$).
First It describes Gradient Descent(Batch/Stochastic) algorithm that leads to Least mean square or Widrow-Hoff learning rule.
Along with some notation for derivatives of Matrix, It derives Normal equation, that can directly determine $\theta$ to minimize cost function, by solving the equation derivative-of(J($\theta$)) = 0
Subscribe to:
Posts (Atom)