My takeaways from the 11th lecture of the Stanford machine learning course.

The lecture first covers Bayesian statistics.

Earlier lectures in this course fit the parameter $\theta $ by maximizing the likelihood function. There we treated $\theta $ as a constant whose value is unknown and estimated it with a statistical procedure such as maximum likelihood (ML). This approach is called frequentist statistics. In Bayesian statistics, we instead think of $\theta $ as a *random variable* whose value is unknown. We specify a prior distribution (a probability mass/density function) on $\theta $ that expresses our "prior beliefs" about it. Given a training set, we then use this prior to derive a formula for predicting the output variable for a new input.
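For my own reference, the Bayesian quantities described above can be written out as follows. This is a sketch in the standard notation, assuming a training set $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$ and a model $p(y \mid x, \theta)$; the symbols $S$, $m$, $x$ and $y$ are my assumptions and are not defined in the notes above. The posterior over $\theta $ follows from Bayes' rule:

$$p(\theta \mid S) = \frac{p(S \mid \theta)\, p(\theta)}{p(S)} = \frac{\left(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\right) p(\theta)}{\int_{\theta} \left(\prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)}, \theta)\right) p(\theta)\, d\theta}$$

A prediction for a new input $x$ then averages over that posterior instead of plugging in a single estimate of $\theta $:

$$p(y \mid x, S) = \int_{\theta} p(y \mid x, \theta)\, p(\theta \mid S)\, d\theta$$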

Then online learning is mentioned briefly. Here the algorithm has to keep making predictions while it is still learning, whereas the algorithms covered earlier were trained once up front and only then used for prediction.
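To make the predict-while-learning loop concrete, here is a minimal sketch. The perceptron update is my own choice for illustration (the notes above do not say which algorithm is used), and `online_perceptron` and `stream` are hypothetical names. Labels are assumed to be -1 or +1.

```python
def online_perceptron(stream, n_features):
    """Predict each example as it arrives, then learn from the true label.

    `stream` yields (x, y) pairs with x a list of floats and y in {-1, +1}.
    Returns the final weights and the number of mistakes made along the way.
    """
    theta = [0.0] * n_features
    mistakes = 0
    for x, y in stream:
        # 1. Predict first, using whatever has been learned so far.
        score = sum(t * xi for t, xi in zip(theta, x))
        prediction = 1 if score >= 0 else -1
        # 2. Only then see the true label, and update on a mistake.
        if prediction != y:
            mistakes += 1
            theta = [t + y * xi for t, xi in zip(theta, x)]
    return theta, mistakes


stream = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([1.0, 0.5], 1), ([-0.5, -1.0], -1)]
theta, mistakes = online_perceptron(stream, n_features=2)
```

The point is only the ordering inside the loop: every example is first a test case and then a training case, so the total number of mistakes is the natural measure of performance.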

Next comes the most important and practical part of this lecture (in fact, of the course): guidelines on how to apply machine learning algorithms. I should probably watch this part again and again. The (very informative) slides shown are also available here[warning: pdf]. In summary, it teaches three things.

Diagnostics: How to diagnose what is wrong when you are not getting the expected results, so that you can pinpoint the actual problem and fix it.

Error Analysis, Ablative Analysis: Error analysis tells you which parts of the system will give the largest accuracy boost if improved. Ablative analysis tells you which features matter, i.e. which ones contribute most to the accuracy.

How to start with a learning problem: This is like software. One approach is to spend time upfront, design things carefully and come up with beautiful algorithms that just work. The other is to quickly hack together something that works and then improve it continuously.
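As a rough illustration of ablative analysis, the sketch below removes components from a full system one at a time and records the accuracy after each removal. Everything here is hypothetical: `ablative_analysis`, the component names, and the toy `evaluate` function with its made-up accuracy gains stand in for retraining and scoring a real system.

```python
def ablative_analysis(components, evaluate):
    """Remove components cumulatively and record accuracy after each removal."""
    active = list(components)
    report = [("full system", evaluate(active))]
    for name in components:
        active.remove(name)
        report.append(("after removing " + name, evaluate(active)))
    return report


# Toy stand-in for "retrain and measure accuracy"; the numbers are invented.
GAINS = {"spelling": 0.01, "sender": 0.02, "parser": 0.15}

def evaluate(active):
    return 0.70 + sum(GAINS[name] for name in active)


report = ablative_analysis(["spelling", "sender", "parser"], evaluate)
# Accuracy lost at each step; the largest drop marks the most valuable component.
drops = [(report[i + 1][0], report[i][1] - report[i + 1][1])
         for i in range(len(report) - 1)]
most_valuable = max(drops, key=lambda pair: pair[1])[0]
```

In a real project `evaluate` would be the expensive part, so the order in which components are removed is worth thinking about before running it.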

## Wednesday, January 5, 2011

## Monday, January 3, 2011

### stanford machine learning course - lecture#10

My takeaways from the 10th lecture of the Stanford machine learning course.

This lecture continues the discussion from the previous lecture for the case where the set of all hypotheses is infinite. It shows, by an informal argument, that the sample complexity needed for the bounds on generalization error to hold with a given probability grows linearly with the "number of parameters" of the hypothesis class. It then explains shattering and VC dimension, and states the theorem that bounds the generalization error. It is also noted that all the theory developed so far (including that in the previous lecture) applies to learning algorithms that work by empirical risk minimization (ERM).

Then we move to the very practical side of applying learning algorithms and try to answer questions about model selection and feature selection.

Model Selection --

Model selection means choosing a particular model for a given algorithm, for example the value of the bandwidth parameter for locally weighted regression, or the degree of the polynomial in a polynomial regression model. There are 3 main ways to do it.

Simple cross validation (or hold-out cross validation): when there is plenty of data, we divide it into a training set (70%) and a cross-validation set (30%). We train each model on the training set and estimate its generalization error on the cross-validation set. Finally, we select the model with the lowest error on the cross-validation set.

k-fold cross validation: when data is scarce, the whole dataset is divided into k subsets of m/k elements each (m is the size of the total dataset available). For each subset, we train the classifier on the union of the remaining k-1 subsets and measure its error on the one held-out subset. We average these k errors and call the result the estimated generalization error of that model. In the end we select the model with the lowest estimated generalization error. Typically k=10 works.

Leave-one-out cross validation: this is the special case of k-fold cross validation where k = m. It is used when the available dataset is very small.
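The three procedures above can be sketched in a few lines of plain Python. To keep it self-contained I use a kernel-weighted average as the model, which is a simpler stand-in for locally weighted regression and not the lecture's algorithm; the bandwidth `tau` is the parameter being selected. All function names, the toy sine data and the candidate bandwidths are my assumptions.

```python
import math
import random


def predict(train, x, tau):
    """Kernel-weighted average of the training targets around x."""
    weights = [math.exp(-((xi - x) ** 2) / (2 * tau ** 2)) for xi, _ in train]
    total = sum(weights)
    if total == 0.0:  # every weight underflowed: fall back to the plain mean
        return sum(yi for _, yi in train) / len(train)
    return sum(w * yi for w, (_, yi) in zip(weights, train)) / total


def mean_squared_error(train, held_out, tau):
    return sum((predict(train, x, tau) - y) ** 2 for x, y in held_out) / len(held_out)


def hold_out_error(data, tau, train_fraction=0.7):
    """Simple cross validation: train on the first 70%, score on the rest."""
    split = int(len(data) * train_fraction)
    return mean_squared_error(data[:split], data[split:], tau)


def k_fold_error(data, tau, k=10):
    """Average held-out error over k folds; k = len(data) gives leave-one-out."""
    folds = [data[i::k] for i in range(k)]
    errors = []
    for i, held_out in enumerate(folds):
        train = [p for j, fold in enumerate(folds) if j != i for p in fold]
        errors.append(mean_squared_error(train, held_out, tau))
    return sum(errors) / k


def select_model(data, taus, error_fn):
    """Pick the candidate with the lowest estimated generalization error."""
    return min(taus, key=lambda tau: error_fn(data, tau))


# Toy data: a noisy sine curve, shuffled so that the splits are random.
rng = random.Random(0)
data = [(x, math.sin(x) + rng.gauss(0, 0.3)) for x in (6 * i / 99 for i in range(100))]
rng.shuffle(data)

taus = [0.01, 0.3, 10.0]
best_hold_out = select_model(data, taus, hold_out_error)
best_k_fold = select_model(data, taus, k_fold_error)
best_leave_one_out = select_model(data, taus, lambda d, t: k_fold_error(d, t, k=len(d)))
```

Note that the error used for selection is always measured on data the model was not trained on. Leave-one-out trains m models per candidate, which is why it only makes sense for very small datasets.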

Feature Selection --

Suppose we have n features and n is very large. Feature selection means choosing an appropriate subset of the n features such that the classifier's generalization error remains roughly the same. The lecture describes 2 ways.

Forward Search: we start with an empty set and, in each iteration, add the one feature that most reduces the error estimated by cross validation. We stop when we reach a threshold k, the number of features we want.

Backward Search: we start with the set of all n features and, in each iteration, remove the least important feature until we reach our threshold.

Both of the above are examples of wrapper model feature selection: a procedure that "wraps" around your learning algorithm and repeatedly calls it to evaluate how well it does with different feature subsets. Computationally this is very expensive.
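Forward search can be sketched as a greedy loop around an error estimate. Here `evaluate` is a hypothetical callback that would train the learner on a given feature subset and return its estimated error; the toy version below just uses made-up numbers so the example runs on its own.

```python
def forward_search(n_features, k, evaluate):
    """Greedily grow a feature subset, adding the best single feature each round."""
    selected = []
    k = min(k, n_features)  # cannot select more features than exist
    while len(selected) < k:
        candidates = [f for f in range(n_features) if f not in selected]
        # One call to the learning algorithm per candidate: this is the "wrapper" cost.
        best = min(candidates, key=lambda f: evaluate(selected + [f]))
        selected.append(best)
    return selected


# Toy stand-in for "train on this subset and estimate the error"; numbers are invented.
USEFUL = {2: 0.5, 0: 0.3}

def evaluate(features):
    return 1.0 - sum(USEFUL.get(f, 0.0) for f in set(features))


chosen = forward_search(n_features=5, k=2, evaluate=evaluate)
```

With n features and a target of k, this makes on the order of n*k calls to `evaluate`, each of which is a full training run, which is where the expense comes from. Backward search is the same loop run in reverse, starting from all features and dropping one per round.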

Filter feature selection methods are not as effective as the wrapper model but are computationally cheap. They simply compute a score, such as the correlation or mutual information, for each pair (feature x_i, output y), and we leave out the features i with the smallest scores.
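A filter method needs no training at all, only a score per feature. The sketch below uses the absolute Pearson correlation as the score, which is one of the options mentioned above; `pearson`, `filter_select` and the tiny dataset are hypothetical names and values of mine.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0.0 or spread_y == 0.0:
        return 0.0
    return cov / (spread_x * spread_y)


def filter_select(X, y, k):
    """Score every feature column against y and keep the k highest-scoring ones."""
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


X = [[1, 5, 0], [2, 3, 0], [3, 6, 0], [4, 2, 0]]
y = [2, 4, 6, 8]
kept, scores = filter_select(X, y, k=2)
```

Each feature is scored on its own, so the cost is a single pass over the data per feature. The price is that a feature which is only useful in combination with others can get a low score and be dropped.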

## Saturday, January 1, 2011

### reliability attributes

Often, in discussions of distributed systems and scale-out architecture, we demand that the system be reliable. For one of the applications I worked on, I was part of designing and developing the scale-out architecture, primarily to fulfill the business demand that we should be able to add more hardware to support more load. Recently I watched this video from Cloudera's free basic Hadoop training, which talks about "reliability attributes". I realized that we need the same reliability attributes in a scale-out architecture and that my team often talks about them (implicitly). This list makes them explicit (coming straight from the linked video)...

Partial Failure: The system should be able to support partial failures. That is, if x out of n nodes (where x < n) go down, the system's performance (e.g. throughput) should degrade gracefully in proportion to x/n, instead of the system going down completely and not doing any work (or not serving any requests).

Fault Tolerance: This is more related to background jobs, map-reduce in particular. If one of the nodes goes down, its work must be picked up by some other functional unit. This is also sometimes solved by having redundancy in the system. For example, within one cluster, we have multiple app servers serving the same set of requests and a load balancer to manage them. If one of the servers goes down, the load balancer detects it and stops sending requests to it.

Individual Recoverability: Nodes that fail and restart should be able to rejoin the group (or cluster) without needing a full restart.

Consistency: Internal failures should not cause externally visible issues.

Scalability: If we add more nodes, we should be able to handle more load in proportion to the number of nodes added.
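The load balancer example from the fault tolerance item, together with individual recoverability, can be sketched as a toy round-robin router. This is only an illustration under my own assumptions: `LoadBalancer`, `mark_down` and `mark_up` are hypothetical names, and a real balancer would detect failures through health checks instead of being told.

```python
class LoadBalancer:
    """Round-robin over a fixed set of servers, skipping the ones marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0

    def mark_down(self, server):
        # Fault tolerance: stop sending requests to a failed server.
        self.healthy.discard(server)

    def mark_up(self, server):
        # Individual recoverability: a restarted server rejoins without a full restart.
        if server in self.servers:
            self.healthy.add(server)

    def route(self):
        """Return the next healthy server, or raise if none is left."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers")


lb = LoadBalancer(["a", "b", "c"])
lb.mark_down("b")
during_failure = [lb.route() for _ in range(4)]
lb.mark_up("b")
after_recovery = [lb.route() for _ in range(3)]
```

While "b" is down the remaining two servers share all requests, so capacity drops by roughly a third instead of to zero, which is the partial failure property in miniature.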
