Thursday, December 30, 2010

stanford machine learning course - lecture#9

My takeaways from 9th lecture of stanford machine learning course.

Up until now, all the lecture focused on supervised learning algorithms and that covers most of them. This lecture talks "learning theory" i.e. how to analyze when to use or not to use a particular algorithm and this is what separates an expert from an average. In particular, this lecture introduces high bias(underfitting) and high variance(overfitting). It talks about empirical error and generalized error. It, for the case of finite sized set of all the hypothesis, proves the uniform convergence result. Based on it, an equation is derived for generalized error of best possible hypothesis(obtained by empirical risk minimization) that shows a bias-variance trade-off in model selection.
Next lecture will cover the case when set of all hypothesis is not finite.

Monday, December 27, 2010

stanford machine learning course - lecture#8

My takeaways from 8th lecture of stanford machine learning course.

Previous lecture concluded with deriving a dual problem(maximize W($\alpha$) w.r.t $\alpha$ within 2 constraints) for optimal margin classifier and noting that using SVM prediction depends only upon the inner product of input variable with those in the training sets.

This lecture introduces Kernel function, an inner product of feature mappings, i.e.
K(x,z) = $\phi(x)^T \phi(z)$ = $<\phi(x), \phi(z)>
where $\phi$ is called the feature mapping. Interestingly, K(x,z) can be very inexpensive to compute even when $\phi(x)$ is very expensive to compute(lecture presents some examples including Gaussian Kernel). For this reason, kernel trick that is if you have any algorithm that you can write only in terms of inner products between input attribute vectors, then by replacing this with K(x,z) you can "magically" allow your algorithm to work efficiently in high dimensional feature space. And, this is the trick we apply to above optimal margin classifier and it becomes SVM. Also, the lecture proves one part of Mercer's theorem that says K is a valid kernel if and only if corresponding kernel matrix is positive semidefinite. Using this theorem it can be easily determined if a function is valid kernel.

So far we assumed that our training data was linearly separable. To relax this constraint, to make the algorithm work for non-linearly separable datasets as well, we can reformulate(called l1 regularization) the dual problem for optimal margin classifier.

Then, another optimization algorithm(we've studied gradient ascent/descent and newton's method already) called coordinate ascent is introduced that can be used to maximize any function w.r.t to some input parameters without any constraints. Now, we use a slightly modified version of coordinate ascent to solve dual problem derived above. And, this new way is called the SMO(sequential minimal optimization) algorithm

Thursday, December 23, 2010

stanford machine learning course - lecture#7

My takeaways from 7th lecture of stanford machine learning course.

This lecture starts with defining optimal margin classifier that works by maximizing the (geometric)margin. This basically defines an optimization problem. Then we transform it into a convex optimization problem so that it could be solved using an out of the box quadratic problem(a type of convex optimization problem) solver. [It is good to follow 2 handouts, given as course material, on convex optimization if you have never studied it.]
Then, the lecture introduces essential material from convex optimization theory, mainly the lagrangian and primal, dual problems and KKT conditions when solution to primal and dual problems become same. Usually dual problem is easier to solve, So if a problem satisfies KKT conditions then we can just solve dual problem to find solution to primal problem also.
After that we get back to optimal margin classifier optimization problem and convert it into a dual problem. Solving which we see an interesting property that for any x, output variable y can be evaluated by only calculating inner product of x with a handful of input variables from training set. Because of this property, this algorithm scales really well to solve classification problems. And, this property is what makes support vector machine to efficiently learn in very high dimensional spaces.

Wednesday, December 22, 2010

reasons I will not use flex for

This is not an anti-flex post but the list of issues I *personally* dealt with and would avoid using flex for. Of course there are work arounds for all of these but I would expect them to be available out of the box.

1. Flex does not send "cookies" header while "file upload" using FileReference.upload() in Firefox.

2. If you make a http request from flex app, you only get success/failure and not HTTP status code.

3. Flex application compiles into "swf" files which are rather large for a web application and for the people using your website with lower bandwidths it becomes terribly slow. (it may sometimes need to download the flex framework files also if the person has never visited a site having flex content before)

some more demotivator types..

4. It simply can not be debugged on the fly in the browser. For javascript, now a days you can do all the debugging on the browser itself and it becomes so easy.

5. FlexBuilder commertial license costs money :).

6. Personally, as a developer, I would rather want to learn/master javascript than actionscript.

7. World is pushing HTML[5]/Javascript/Css[3] hard. Javascript browser implementations, UI libraries/widgets are only going to get better and faster.

We chose flex over html/javascript because it seemed that it would be easier/quicker to create cool UI(RIA) with flex than html/javascript. But, now we believe same and more cool UI are possible and easily doable(with so many options, jqueryUI, GWT etc) with html/js also.

Saturday, December 18, 2010

stanford machine learning course - lecture#6

My takeaways from 6th lecture of stanford machine learning course.

In this lecture naive bayes classifier algorithm, introduced in previous lecture, is extended to the case where each feature($x_i$) can be any positive integer as opposed to just being 0 or 1. For this reason, in previous case the model is called multi-variate Bernoulli event model and in this case its called multinomial event model. This model is commonly used for text classification problems and hence also known as event model for text classification.
Then, briefly, neural network is mentioned with 2 videos showing early on milestone projects. One was about recognizing a digit from a picture and the other one about text to speech algorithm. Both of them used neural network. It was to show that neural network was considered a very good solution for a wide range of classification problems untill support vector machine, a supervised learning algorithm, was invented. It is considered among the best off the shelf supervised learning algorithm now a days.
Remaining lecture sets the foundations for understanding support vector machine to be explained in later lectures. Now a new notation , concept of functional and geometrical margins are explained. And, maximizing margin classifier is briefly mentioned. It is told that this classifier can perform almost equivalent to logistic regression but more importantly this happens to be the base for support vector machine that can handle features in infinite dimension.

Wednesday, December 15, 2010

stanford machine learning course - lecture#5

My takeaways from 5th lecture of stanford machine learning course.

The algorithms, we studied in the lectures so far, are all discriminative learning algorithms. Formally, algorithms that try to learn p(y|x) directly are called discriminative learning algorithms. This lecture is about generative learning algorithms where we try to model p(x|y) , p(y) and using these we obtain p(y|x) using the bayes theorem.
Next, multivariate normal distribution (or multivariate gaussian distribution) is explained and how its shape/size changes when you change covariance matrix.
Then, Gaussian Discriminant Analysis is explained that is used to solve classification problems(where y can have multiple values and not just 2 like that in case of logistic regression) where features x are continuous-valued random variables. It assumes that p(x|y) has multivariate normal distribution. It also explains that if p(x|y) is multivariate normal distribution then p(y|x) turns out to be logistic regression(viceversa is not true) and shows that if p(x|y) is approximately normal then one should be using gaussian discriminant model as it gives better results but if p(x|y) is completely unknown then using logistic regression might give better results as p(y|x) turns out to be logistic in many more cases, for example when p(x|y) is poisson.
Next, Naive Bayes classifier is explained that again is used for classification problems and makes naive bayes assumption for the features, assumes that x is discrete random variable. Then, Laplace smoothing is introduced for the cases that are absent in training data for naive bayes classifier.

Wednesday, December 8, 2010

stanford machine learning course - lecture#4

My takeaways from 4th lecture of stanford machine learning course.

This lecture starts with explaining Newton's method as noted in the last post. Then introduces exponential family distributions and shows that bernoulli and gaussian belong to exponential family distributions. In fact many more like poisson, gamma, beta, multinomial distribution etc, all of them belong to exponential family.
Then GLM(Generalized Linear Model) is introduced which can be use to solve any regression or classification problem as long as P(y|x;$\theta$) has a distribution that belongs to exponential family and certain conditions are fulfilled.

Then the lecturer shows that linear and logistic regression are just 2 special cases of GLM. He derives the hypothesis functions for linear and logistic regression using GLM as for linear regression P(y|x;$\theta$) is gaussian and for logistic regression it is bernoulli... both of them are exponential family distributions.

Then Softmax regression is explained and worked out which is yet another special case of GLM where response variable has multinomial distribution. Softmax regression can be used for classification problems where output/response variable can take more than 2 values and logistic regression is not appropriate.

Sunday, December 5, 2010

stanford machine learning course - lecture#3

My takeaways from 3rd lecture of stanford machine learning course.

He starts with explaining probabilistic interpretation with some probability assumptions why least square cost function is a good choice for linear regression problem. Then, at a high level, concept of overfitting and underfitting is described. Then he makes a switch to classification problem and describes why linear regression algorithms discussed in the previous lecture are not good choice for classification problems. After setting up the background, he describes logistic regression algorithm for classification problem. Here, the hypothesis function becomes sigmoid/logistic function instead of linear. And, using the probabilistic interpretation we maximize log likelihood function using gradient ascent(pretty similar to gradient descent) and derive the update rule that looks identical to LMS(least mean square) rule derived in previous lecture. But, its not same because hypothesis is no longer linear but logistic function and take a note that its not a coincidence but there is a deeper reason behind this which will be explained in later lectures when he teaches GLM(Generalized Linear Models).
Then, he takes a digression and explains perceptron learning algorithm(for historical reasons) where hypothesis function is threshold function. This is not considered a good choice because this choice can not be supported with probabilistic interpretations like that in case of previous algorithms.

[This is covered in next lecture actually] Next, Newton's method is explained and then log likelihood is maximized by adapting newton's method by finding parameter $\theta$ where derivative of log likelihood is zero(actually it'll just find the optimal solution as we're not checking the sign of second derivative but in practice it usually maximizes the log likelihood). Usually newton's method converges more quickly than gradient ascent, but involves calculating inverse of hessian matrix of log likelihood function(expensive huh?) and hence good when number of features is small. The generalized version of newton's method in multidimensional setting is called Newton-Raphson method and also Fisher scoring.

Friday, December 3, 2010

stanford machine learning course - lecture#2

I watched the 2nd lecture video and followed lecture notes from stanford machine learning course. At a very high level, Following is what I took away from it.

It formally explains what Supervised Learning is and explains what is the difference between a Regression problem and Classification problem. Then it defines the (least squares)cost function, called J($\theta$) and talks about algorithms that find $\theta$ that minimize the cost function J($\theta$).
First It describes Gradient Descent(Batch/Stochastic) algorithm that leads to Least mean square or Widrow-Hoff learning rule.
Along with some notation for derivatives of Matrix, It derives Normal equation, that can directly determine $\theta$ to minimize cost function, by solving the equation derivative-of(J($\theta$)) = 0

Wednesday, November 24, 2010

sql "IN" clause gotcha

Though standard sql puts no limit on the number of entries you can put inside an IN clause, but various vendors typically put a limit. Oracle puts a limit of 1000 to it. This caused a bug in our app as we dynamically generate a sql query where number of entries went beyond 1000 today.

You can put any one of the resolution stated below(in decreasing order of preference)

1. Change your logic so that number of entries never go beyond 1000(it may be a good practice in general to follow).
2. See if you can use "BETWEEN" clause for your particular case.

Instead of using "Select blah from tablexyz WHERE col IN (c1,c2,....c50000)"
3. Use "Select blah from tablexyz WHERE col IN (select id from ctable)" and the subquery "select id from ctable" returns resultset containing c1,c2,...c50000.
4. Use "Select blah from tablexyz WHERE col IN (c1,c2,...c1000) OR col IN (c1001,c1002,...,c2000) OR ... OR col IN (c49001,c49002,...,c50000)".
5. Use multiple sql queries each with one IN clause containing 1000 entries.

Much of above came from this thread.

Tuesday, November 23, 2010

Wrapup - Introduction to probability

On Sep 05' 10 I wrote a plan to work through Introduction to Probability and to finish all the assignments given in probability course in MIT.

Today I'm glad to announce that it is finally finished and all the notes and solutions can be found in these posts.

The reason I studied it is that probability comes up as a basic requirement when you follow any math oriented text on things like Randomized Algorithms, AI, NLP, Machine Learning etc. So, I wanted to get enough grounding in probability basics and this book, as the title suggests, does exactly that. Another plus point is that, its used in the above mentioned MIT course so you have well defined assignments for practice and book also has a lot of exercises with solutions.

And, Thanks(again!) to Ravi for lending the book.

Note: If you need solutions to assignments, they are also available on MIT site. I did them mostly to force myself into doing those problems to make sure I'm not doing the passive reading of the book and it helps you think what you read and in really "getting it". Many times, I could not come up with the solution and I did peek in official solutions :).

probability: problem-set#12

My solutions for problem-set#12 from the MIT introductory probability course.



probability: problem-set#11

My solutions for problem-set#11 from the MIT introductory probability course

probability: problem-set#10

My solutions for problem-set#10 from the MIT introductory probability course.



Tuesday, November 2, 2010

bypassing group policy on windows

Group Policy is a feature of windows operating systems to give centralized control of all the computers on a domain to the system administrators. They can do various things like disabling various actions(such as blocking access to Task Manager, disabling access to IE advanced settings etc).

This post describes, how to bypass the group policy.

This is a two step process.

1. All the settings are stored in the registry, so first step is to find the location of the registry that controls the action you're looking for. A little googling can help you find that or use http://gps.cloudapp.net/

2. Start -> Run -> regedit, and set the registry value to whatever you want.

And you are done. But, remember it may go away after some time or when you restart the computer as group policy will update/reset itself periodically and on restarts. Until that time you are all set.

For example, let say access to "Empty temporary internet files folder when browser is closed" option in advanced settings of internet explorer options is disabled and you want to change its value.

1. Find the location of registry. Searching for "empty temporary internet files" in http://gps.cloudapp.net/ reveals following registry values associated with it.
HKCU\Software\Policies\Microsoft\Windows\CurrentVersion\internet Settings\Cache
HKLM\Software\Policies\Microsoft\Windows\CurrentVersion\internet Settings\Cache

2. Open regedit and set the value to above registries to 1(if you wanted to uncheck the option) or whatever you need.

Restart IE and you'll see the change.

Caution: Doing this might be a cause of concern to your sys admins or against your corporate policies, so do this only if you absolutely have to. And, corruption in windows registry might make your system unusable.

Disclaimer: I'm no windows administration expert, so some of above information may be partially incorrect or there might be better ways to handle it.

probability: problem-set#9

My solutions for problem-set#9 from the MIT introductory probability course.




probability: problem-set#8

My solutions for problem-set#8 from the MIT introductory probability course.


probability: problem-set#7

My solutions for problem-set#7 from the MIT introductory probability course.








Saturday, October 30, 2010

web application logging dilemma

By exploring languages like lisp, scheme, scala etc I've come to disliking "verbose code" and I try to write less code as much possible(I still follow the KISS principle though, so no clever tricks unless really necessary). But industry(and day to day) reality is Java, and "verbosity" manifests itself in java no matter what you do.
Typically, Writing a debug log in a java application takes *at least* 2 lines..
if(log.isDebugEnabled())
log.debug("your error message" + usually_a_concatenation)

As a result, My less verbose code instinct makes me write fewer and fewer debug logs and lets me write only the necessary error logs because number of lines just explode when you write too many logs and doesn't look good. And also, more logs hamper performance.

I am writing this post because last night I spent 2 hours figuring out an issue in staging servers. It took so much time because there were not enough log statements(so I had to do a lot of looking into the code, guess work and trials to figure out where the problem might be) and in an enterprise setup it becomes a very slow process to try stuff at stage/production servers.

The lesson learned is that, there should be enough(infact alot of) logs for the whole initialization process. So that you don't have to waste time guessing. Since they will get printed only on server startup so they wouldn't hamper the performance of the over all system. And enough debug logs per request, so that issues can be tracked down easily by printing debug logs. Though it will increase the number of lines, will not look good but it is probably worth it :(.

Saturday, October 16, 2010

8-puzzle in javascript

If you're just interested in the game, this is it.

Here is the longer story...

I often heard that javascript is not a toy language anymore, its object oriented in a better way than java, its dynamically typed, supports closures etc and google does wonders with just html and javascript(e.g. the google pacman game).That feature list, and that its one of the mainstream languages in the software industry with those features, was enough for me to get excited about it. So, for a long time, I have wanted to try it out.

Last weekend, I set out to build the 8-puzzle game with html and javascript with a A* search based solver. I started by following the 4 lecture series by douglas crockford(the javascript evangelist in Yahoo). Having had some experience with scheme, it helped me to understand the philosophy behind the way things are done in javascript. Well, those 4 lectures were enough for me to get started and here I present the 8-puzzle game.

Implementation of the A* based solver is fairly well understood thing already. I used Manhattan disance heuristic. After writing the solver I realized that for some randomly generated configurations of the 8-puzzle, it was running endlessly. Thankfully, today there are debuggers, profilers and REPLs available for javascript(Firebug alone gives you all that and much more) so I could analyze my code pretty easily.
And, in the end realized that not all the randomly generated instances of 8-puzzle are solvable(for those configurations my solver was just busy searching the whole state space reachable and hence was so busy). This page describes it well and gives you enough hint on how to write a function to quickly check if an instance of 8-puzzle is solvable or not(in the given link search for "if you take a state, physically lift two consecutive tiles in that state and swap their positions, then the new state will be disconnected from old state.")

Saturday, October 9, 2010

MobiMeter - measure network latency from mobile phone

Last week I was involved in doing performance tests on a mobile application that talks to some servers to pull/push data. I wanted to determine the network latency of those servers from mobile. So, I wrote a small(just 2 screens) j2me application to measure network latency of any public GET URL from your mobile phone.

Code is available on google-code.

setting up freeNX server on Fedora-13

I have two machines, 1 running fedora-13 and other one running windows XP. I was exploring the options so that I can connect to fedora from windows and get a gnome desktop session. First option that came to my mind is setting up a vnc server on fedora and connecting to it from a vnc client running on windows.
While I was googling, I came through another option - NoMachine NX. NoMachine has a proprietary NX protocol that is supposed to be very optimized and (from the reviews I read) provides much better experience than vnc. And, I thought I will try this one.

NoMachine itself offers free nx client for windows(and for other platforms as well) and there is a free nx server implementation available. So, here is how you install both.

  • on windows machine download nx client for windows and install it, its a fairly trivial process. At the time, I got version 3.4.0-7.
  • on fedora-13 machine, do "yum install freenx-server". At the time, I got freenx-server-0.7.3-18.fc13.i686

Its easy to setup the client with information of the machine you want to connect to. BTW as nx uses ssh, make sure ssh is running on the fedora machine and that you can connect to it using a ssh client on windows machine.

Well, then I tried to connect... provided the username/password for an account that is already setup on fedora. But, it did not work. And, here is the troubleshooting process.

on windows, I used putty to connect to the fedora machine just to make sure that pure ssh connection to fedora was working fine. it was fine. Then, I looked up the "details" provided by nx client, and here is what I saw..
NX> 203 NXSSH running with pid: 4864
NX> 285 Enabling check on switch command
NX> 285 Enabling skip of SSH config files
NX> 285 Setting the preferred NX options
NX> 200 Connected to address: 192.168.1.3 on port: 22
NX> 202 Authenticating user: nx
NX> 208 Using auth method: publickey
NX> 204 Authentication failed.
It looked like that my regular user was connecting fine but nx client is not being able to connect user "nx" using public key authentication. I checked on fedora machine and "nx" user was setup properly(used, "grep -i nx /etc/passwd"). Now it was a trivial affair of getting public key authentication working for user "nx". So, here is how you do it.

Create public-private dsa keys(note that current nx client can not support rsa keys), using "ssh-keygen -t dsa" and remember *not to* give a pass-phrase. On fedora machine, copy the contents of public key to ~nx/.ssh/authorized_keys2 and on windows machine, start nx client, click configure, then tab general, then button Key and copy the contents of private key to the opened window. click the save and ok buttons.(BTW, this is a very short version of setting up ssh public key authentication for a user. If you're not familiar with it already then just look for "ssh public key authentication" on google, this is not a nx client specific but regular ssh stuff)

Now, connect and voila :)

[UPDATE: Nov 05' 11] NoMachine has started giving a free edition of its server for linux and I have started using it instead of freenx-server. It can be downloaded here. All the necessary installation details are there on the mentioned page. Same are copy/pasted here.

RPM version

  • Download the RPMs
  • Change your working directory to the location where you saved the package and install it by running from a console:

    # sudo rpm -i nxclient-3.5.0-7.i386.rpm
    # sudo rpm -i nxnode-3.5.0-7.i386.rpm
    # sudo rpm -i nxserver-3.5.0-9.i386.rpm
If you don't have the sudo utility installed, log on as superuser ("root") and run the commands without sudo.
Note: The NX service can be controlled by the command /usr/NX/bin/nxserver --status|--start|--stop|--restart. Additional commands are available to configure the server. Try /usr/NX/bin/nxserver --help for more information.


Once you do the install you will still need to add your dsa public key to the NX_USER_HOME/.ssh/authorized_key2 .
On Fedora, NX_USER_HOME = /usr/NX/home/nx , Btw it can be easily found by running following command
> grep -i nx /etc/passwd
nx:x:488:474::/usr/NX/home/nx:/usr/NX/bin/nxserver

Important Note: If you have freenx-server setup already then you should remove that before installing nx{client,node,server} by running following commands.
>/sbin/service freenx-server stop
>yum remove freenx-server
>userdel -fr nx #this user was created while freenx-server installation, removing this is *critical*

Friday, October 8, 2010

troubleshooting wireless card with Fedora-13

Last week, I bought a new shiny lenovo Z560-59045422 laptop. It has windows-7 home basic pre-installed with 4 partitions. As usual, I set on to installing fedora(latest one available right now is Fedora-13). I shrinked the C: drive and deleted D: drive so as to have most of space available to Fedora. Then I installed Fedora-13 on it using the dvd ISO image.

I rebooted the machine, and everything worked fine except that Network Manager did not show anything like "Wireless Networks". That is, I could not get connected to my wifi network, however wired ethernet was working fine. It took very small time, with some googling, to figure out that fedora iso probably did not have drivers for the wireless card.

With some more googling, I landed on Getting Started with Wireless - Fedora Unity Project and determined, using /sbin/lspci, that my laptop has "Broadcom Corporation BCM4313 802.11b/g LP-PHY (rev 01)" wireless card. And then I landed on Broadcom Linux STA Driver - Fedora Unity Project. As mentioned on the previous link, I did following things...
  • Added RPMFusion repository (both the free and non-free, it is essential to have non-free because only it has what we need, the broadcom-wl)
  • Updated the system, "yum update"
  • Installed broadcom-wl package, "yum install broadcom-wl"
  • rebooted the system
Its working like a charm now, I'm so happy :)

Monday, October 4, 2010

Wednesday, September 29, 2010

Tuesday, September 28, 2010

probability: problem-set#3

My solutions for problem-set#3 from the MIT introductory probability course.





Monday, September 27, 2010

probability: problem-set#2

My solutions for problem-set#2 from the MIT introductory probability course.





Sunday, September 26, 2010

Wednesday, September 22, 2010

mistakes made in DDL script and lessons learned

Last year, I got the opportunity to create a java web application from scratch. Here, I want to talk about what I should have done differently when first DDL script for the app was created.

I'm correcting all these things now, but this is the technical debt that could have been avoided.

note: some of the jargons might be oracle specific

1. Create separate tablespaces for table, index and lobs.
This is a standard practice and helps DBAs in a lot of ways with management of db. The one I created had USERS as its default tablespace and none of CREATE TABLE, INDEX etc statements said anything about tablespace and everything resulted in USERS tablespace.

2. Give upfront thought to version upgradation from db perspective.
Once your application goes to production, and db gets loaded with a lot of data. It becomes rather difficult to upgrade schema from previous version of application to newer one. Its very important to have your migration strategy thought through and in place from the beginning itself.

3. Do not ignore "ON DELETE" clause when creating foreign key constraint.
In Oracle, you got two options here - ON DELETE SET NULL, ON DELETE CASCADE
you should be choosing one over the other based on how you want your data to evolve for the table.

4. Keep two users to manage your db.
One is the schema owner who runs all the DDLs and another one the schema user who has just enough privileges to run the application. Application should use the "schema user" user only to save application cdoe from getting any unwanted access to the db. Typically, schema user will need following privileges.

- CREATE SESSION
- CREATE SYNONYM
- SELECT/UPDATE/INSERT/DELETE privilege on all tables
- SELECT privilege on all SEQUENCEs

Tuesday, September 21, 2010

Similarities Discrete and Continuous Random Variables

From 1st three chapters of Introduction to Probability, it is pretty clear that many properties for both kind of random variables turn out to be strikingly similar. This post is an attempt to consolidate them for easy reference.

Note: In following table, $A_i$s are positive probability events that form a partition of sample space unless otherwise stated.













































DescriptionDiscreteContinuous
Basic Expectation Related
E[X]$\sum_x xp_X(x)$$\int xf_X(x)dx$
E[g(X)]$\sum_x g(x)p_X(x)$$\int g(x)f_X(x)dx$
if Y = aX + bE[Y] = aE[X] + bE[Y] = aE[X] + b
if Y = aX + bvar(Y) = $a^2$var(X)var(Y) = $a^2$var(X)
Joint PMF\PDFs
Marginal PMF\PDFs$p_X(x)$ = $\sum_y p_{X,Y}(x,y)$$f_X(x)$ = $\int f_{X,Y}(x,y)dy$
E[g(X,Y)]$\sum_x \sum_y g(x,y)p_{X,Y}(x,y)$$\int \int g(x,y)f_{X,Y}(x,y)dxdy$
if g(X,Y) = aX + bY + cE[g(X,Y)] = aE[X] + bE[Y] + cE[g(X,Y)] = aE[X] + bE[Y] + c
Conditional PMF\PDF$p_{X,Y}(x,y)$ = $p_Y(y)p_{X|Y}(x|y)$$f_{X,Y}(x,y)$ = $f_Y(y)f_{X|Y}(x|y)$
Conditional Expectation
E[X|A]$\sum_x xp_{X|A}(x)$$\int xf_{X|A}(x)dx$
E[g(X)|A]$\sum_x g(x)p_{X|A}(x)$$\int g(x)f_{X|A}(x)dx$
E[X|Y=y]$\sum_x xp_{X|Y}(x|y)$$\int xf_{X|Y}(x|y)dx$
E[g(X)|Y=y]$\sum_x g(x)p_{X|Y}(x|y)$$\int g(x)f_{X|Y}(x|y)dx$
E[g(X,Y)|Y=y]$\sum_x g(x,y)p_{X|Y}(x|y)$$\int g(x,y)f_{X|Y}(x|y)dx$
Total Probability Related
$p_X(x)$ = $\sum_{i = 1}^n P(A_i)p_{X|A_i}(x)$$f_X(x)$ = $\sum_{i = 1}^n P(A_i)f_{X|A_i}(x)$
P(A)$\sum_x P(A|X=x)p_X(x)$$\int P(A|X=x)f_X(x)dx$
Total Expectation Related
E[X]$\sum_y p_Y(y)E[X|Y=y]$$\int f_Y(y)E[X|Y=y]$
E[g(X)]$\sum_y p_Y(y)E[g(X)|Y=y]$$\int f_Y(y)E[g(X)|Y=y]$
E[g(X,Y)]$\sum_y p_Y(y)E[g(X,Y)|Y=y]$$\int f_Y(y)E[g(X,Y)|Y=y]$
E[X]$\sum_{i=1}^n P(A_i)E[X|A_i]$$\sum_{i=1}^n P(A_i)E[X|A_i]$
E[g(X)]$\sum_{i=1}^n P(A_i)E[g(X)|A_i]$$\sum_{i=1}^n P(A_i)E[g(X)|A_i]$
$A_i$s, partition on BE[X|B] = $\sum_{i=1}^n P(A_i|B)E[X|A_i\cap B]$
X and Y are independent
$p_{X,Y}$ = $p_X(x)p_Y(y)$$f_{X,Y}$ = $f_X(x)f_Y(y)$
E[XY]E[X]E[Y]E[X]E[Y]
E[g(X)h(Y)]E[g(X)]E[h(Y)]E[g(X)]E[h(Y)]
var(X + Y)var(X) + var(Y)var(X) + var(Y)



Some Continuous Random Variables

From Introduction to Probability.

Continuous Uniform Over [a,b]:

$f_X(x)$ = $\frac{1}{b-a}$ , if $a \leq x \leq b$ or 0 otherwise

E[X] = $\frac{a+b}{2}$
var(X) = $\frac{(b-a)^2}{12}$

Exponential with Parameter $\lambda$:

$f_X(x)$ = $\lambda e^{-\lambda x}$, if $x \geq 0$ or 0 otherwise

$F_X(x)$ = $1 - e^{-\lambda x}$, if $x \geq 0$ or 0 otherwise

E[X] = $\frac{1}{\lambda}$
var(x) = $\frac{1}{{\lambda}^2}$

Normal with Parameters $\mu$ and ${\sigma}^2 > 0$:

$f_X(x)$ = $\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{{(x-\mu)}^2}{2{\sigma}^2}}$

E[X] = $\mu$
var(X) = ${\sigma}^2$

Introduction to probability, ch#3(General Random Variables) notes - Part-2

Introduction to probability, ch#3(General Random Variables) notes - Part-2

-- Multiple Continuous Random Variables: --
Let X and Y be jointly continuous random variables with joint PDF $f_{X,Y}$

The joint, marginal and conditional PDFs are related as follow
$f_{X,Y}(x,y)$ = $f_Y(y)f_{X|Y}(x|y)$
$f_X(x)$ = $\int_{-\infty}^{\infty}f_{X,Y}(x,y)dy$ = $\int_{-\infty}^{\infty}f_Y(y)f_{X|Y}(x|y)dy$
conditional PDF $f_{X|Y}(x|y)$ is defined only for those y for which $f_Y(y)$ > 0

These PDFs can be used to calculate probabilities as follow:
$P((X,Y) \in B)$ = $\int \int_{(x,y) \in B} f_{X,Y}(x,y)dxdy$
$P(X \in A)$ = $\int_{x \in A} f_X(x)dx$
$(X \in A | y=y)$ = $\int_{x \in A} f_{X|Y}(x|y)dx$

They can be used to calculate Expectations:
E[g(X)] = $\int g(x)f_X(x)dx$
E[g(X,Y)] = $\int \int g(x,y)f_{X,Y}(x,y)dxdy$
E[g(X)|Y=y] = $\int g(x)f_{X|Y}(x|y)dx$
E[g(X,Y)|Y=y] = $\int g(x,y)f_{X|Y}(x|y)dx$

For any event A, we have following version of total probability theorem:
P(A) = $\int P(A | Y=y)f_Y(y)dy$

And, there are following versions of total expectation theorem:
E[X] = $\int E[X|Y=y]f_Y(y)dy$
E[g(X)] = $\int E[g(X)|Y=y]f_Y(y)dy$
E[g(X,Y)] = $\int E[g(X,Y)|Y=y]f_Y(y)dy$

Independence:
continuous random variables X and Y are independent iff
$f_{X,Y}(x,y)$ = $f_X(x)f_Y(y)$ for all x,y

and, following properties hold for independent X and Y
E[XY] = E[X]E[Y]
E[g(X)h(Y)] = E[g(X)]E[h(Y)]
var(X + Y) = var(X) + var(Y)

-- Bayes' Rule for Continuous Random Variables --
Let Y be a continuous random variable.

And, If X is a continuous random variable then
$f_{X|Y}(x|y)$ = $\frac{f_X(x)f_{Y|X}(y|x)}{f_Y(y)}$ = $\frac{f_X(x)f_{Y|X}(y|x)}{\int f_X(t)f_{Y|X}(y|t)dt}$

If A is an event, we have
P(A|Y=y)$f_Y(y)$ = P(A)$f_{Y|A}(y)$

So, P(A|Y=y) = $\frac{P(A)f_{Y|A}(y)}{P(A)f_{Y|A}(y|A) + P(A^c)f_{Y|A^c}(y)}$

If N is a discrete random variable, then we can choose A to be event N = n, we get
P(N=n|Y=y)$f_Y(y)$ = $p_N(n)f_{Y|N}(y|n)$

So, P(N=n|Y=y) = $\frac{p_N(n)f_{Y|N}(y|n)}{\sum_i p_N(i)f_{Y|N}(y|i)}$


-- Derived Distributions:
This describes how we calculate the PDF of Y, given that Y = g(X) and PDF of X

The approach is to calculate CDF of Y and then differentiate it.

$F_Y(y)$ = $P(g(X) \leq y)$ = $\int_{x|g(x) \leq y} f_X(x) dx$
and, $f_Y(y)$ = $\frac{dF_Y}{dy}(y)$

when Y = aX + b. Then
$f_Y(y)$ = $\frac{1}{|a|}f_X(\frac{y-b}{a})$

when Y = g(X) and g is *strictly* monotonic, then there always exists a function h s.t.
y = g(x) iff x = h(y)

Assuming h is differentiable, we have
$f_Y(y)$ = $f_X(h(y))|\frac{dh}{dy}(y)|$

Monday, September 20, 2010

Introduction to probability, ch#3(General Random Variables) notes - Part-1

Introduction to probability, ch#3(General Random Variables) notes - Part-1

A "continuous" random variable can take any value on the real number line unlike a discrete random variable.

PDF(Probability Density Function), counterpart of PMF of discrete random variables, of a continuous random variable X is a nonnegative function $f_X$ s.t.
P($X \in B$) = $\int_B f_X(x)dx$
for every subset of the real line. In particular, the probability that the value of X falls within an interval is
P($a \leq X \leq b$) = $\int_a^b f_X(x)dx$.

Note that, for a single value a, P(X = a) = $\int_a^a f_X(x)dx$ = 0. Hence, excluding or including the endpoints of any interval has no effect on its probability. That is
P($a \leq X \leq b$) = P($a < X < b$) = P($a \leq X < b$) = P($a < X \leq b$)

PDF also has to satisfy the normalization property, $\int_{-\infty}^{\infty} f_X(x)dx$ = 1

Also, note that PDF is not the probability of any particular event and hence not necessarily restricted to less than or equal to 1. It just has to be nonnegative and satisfy normalization property stated above.

-- Expectation and Variance: --
Let X be a continuous random variable with PDF $f_X(x)$

E[X] = $\int_{-\infty}^{\infty} xf_X(x)dx$
E[g(X)] = $\int_{-\infty}^{\infty} g(x)f_X(x)dx$

var(X) = E[(X - E[X])^2] = $\int_{-\infty}^{\infty} (x - E[X])^2f_X(x)dx$ = $E[X^2] - (E[X])^2$

If, Y = aX + b then
E[Y] = aE[X] + b
var(Y) = $a^2$var(X)


-- Cumulative Distribution Function --
CDF is a unified concept to deal with discrete as well as continuous random variables. CDF of X is denoted by $F_X$.
$F_X(x)$ = P($X \leq x$) = $\sum_{k \leq x} p_X(k)$ , if X is discrete
$F_X(x)$ = P($X \leq x$) = $\int_{-\infty}^{x} f_X(t)dt$, if X is continuous

Conceptually, $F_X(x)$ "accumulates" probability "upto" the value x.

If X is discrete, then we can obtain PMF of X from CDF of X as follow
$p_X(k)$ = $P(X \leq k)$ - $P(X \leq k-1)$ = $F_X(k)$ - $F_X(k-1)$, for all integers k

If X is continuous, then the PDF can be obtained by differentiating the CDF(assuming it has a derivative)
$f_X(x)$ = $\frac{dF_X}{dx}(x)$

-- Conditioning on an event --
Let A be some event, then conditional PDF of X, $f_{X|A}$ is defined as follow
$P(X \in B | A)$ = $\int_B f_{X|A}(x)dx$

in the special case when conditioning event is of the form $X \in A$, we have
$P(X \in B | X \in A)$ = $\frac{P(X \in B, X \in A)}{P(X \in A)}$ = $\frac{\int_{A \cap B}f_X(x)dx}{P(X \in A)}$

so, $\int_B f_{X|A}dx$ = $\frac{\int_{A \cap B}f_X(x)dx}{P(X \in A)}$

Hence, for events of the type $X \in A$
$f_{X|A}$ = $\frac{f_X(x)}{P(X \in A)}$, if $x \in A$ and 0 otherwise

A version of total probability theorem:
Let $A_1, A_2, ..., A_n$ form partition of the sample space with $P(A_i)$ for each i, then
$f_X(x)$ = $\sum_{i=1}^{n} P(A_i)f_{X|A_i}(x)$

Conditional Expectation
E[X|A] = $\int_{-\infty}^{\infty} xf_{X|A}(x)dx$
E[g(X)|A] = $\int_{-\infty}^{\infty} g(x)f_{X|A}(x)dx$

total expectation theorem:
E[X] = $\sum_{i=1}^{n} P(A_i)E[X|A_i]$
E[g(X)] = $\sum_{i=1}^{n} P(A_i)E[g(X)|A_i]$

Sunday, September 19, 2010

Some Discrete Random Variables

From Introduction to Probability.


Discrete Uniform over [a,b]
When X can take any value from a to b(both inclusive) with equal probability.

$p_X(k)$ = $\frac{1}{b-a+1}$ if k = a, a+1, a+2, ..., b-1, b
$p_X(k)$ = 0 otherwise

E[X] = $\frac{a+b}{2}$

var(X) = $\frac{(b-a)(b-a+2)}{12}


Bernoulli with Parameter p
It describes the success or failure in a single trial. For example, we can say that X = 1 if a toss results in HEAD or else X = 0 if the toss results in TAIL and probability of HEAD is p.

$p_X(k)$ = p if k = 1
$p_X(k)$ = 1-p if k = 0

E[X] = p

var(X) = p(1-p)


Binomial with Parameters p and n
It describes the number of successes in n independent bernoulli trials. For example, total number of HEADs obtained in n tosses.

$p_X(k)$ = ${n \choose k}p^k(1-p)^{n-k}$, if k = 0,1,2,...n

E[X] = np

var(X) = np(1-p)

Also, we can use bernoulli random variables to model binomial random variable as follow.
Let $X_1, X_2, ....X_n$ are n independent Bernoulli Random variables with parameter p. Then
Binomial Random Variable, X = $X_1 + X_2 + ... + X_n$


Geometric with Parameter p
It describes the number of trials until the first success, in a sequence of independent Bernoulli trials. For example, the number of tosses required to get first HEAD.

$p_X(k)$ = $(1-p)^{k-1}p$, if k = 1,2,....

E[X]] = $\frac{1}{p}$

var(X) = $\frac{1-p}{p^2}


Poisson with Parameter $\lambda$
It approximates the binomial PMF when n is large and p is small and $\lambda$ = np

$p_X(k)$ = $e^{-\lambda}\frac{\lambda^k}{k!}$, k = 0,1,2...

E[X] = $\lambda$ = var(X)

Introduction to probability, ch#2(Discrete Random Variables) notes

Introduction to probability, ch#2(Discrete Random Variables) notes

A Random variable is a real-valued function of the outcome of the experiment. Its called discrete if range of that function is finite or countably infinite.
A discrete rancom variable has an associated PMF(probability mass function), which gives the probability of each numerical value that the random variable can take. For example, if x is any possible value of a random variable X, the probability mass of X, denoted by $p_X(x)$, is the probability of the event {X = x} consisting of all the outcomes that give rise to a value of X = x, so
$p_X(x)$ = P({X = x})
Note that, as per the normalization probability axiom, $\sum_{x}p_X(x)$ = 1. Also for any given S, we have P($X \in S$) = $\sum_{x \in S}p_X(x)$

-- Expectation/Mean --
E[X] = $\sum_{x}xp_X(x)$
and for any function g(X) of X, E[g(X)] = $\sum_{x}g(x)p_X(x)$

-- Variance --
var(X) = E[$(X-E[X])^2$] = E[$X^2$] - ${(E[X])}^2$
Standard Deviation, $\sigma_X$ = $\sqrt{var(X)}$

-- nth moment of X --
nth moment of X = E[$X^n$]

Mean and Variance of a Linear Function of a Random Variable:
Let X and Y be two random variables s.t. Y = aX + b, where a and b are given scalars. Then
E[Y] = aE[X] + b
var(Y) = $a^2$var(X)

-- Joint PMFs: --
Let X and Y be random variables associated with the same experiment.

The joing PMF $p_{X,Y}$ of X and Y is defined by
$p_{X,Y}(x,y)$ = P(X = x, Y = y)

The marginal PMFs of X and Y respectively are
$p_X(x)$ = $\sum_{y}p_{X,Y}(x,y)$
$p_Y(y)$ = $\sum_{x}p_{X,Y}(x,y)$

E[g(X,Y)] = $\sum_{x}\sum_{y}g(x,y)p_{X,Y}(x,y)$
In case, g(X,Y) is linear. Then E[aX + bY + c] = aE[X] + bE[Y] + c

-- Conditional PMFs: --
The conditional PMF of X given an even A with P(A) > 0, is defined by
$p_{X|A}$(x) = P(X=x|A)
It follows all the probability axioms including normalization, that is $\sum_{x}p_{X|A}(x)$ = 1

The conditional PMF of X given another Random Variable Y = y is related to the joint PMF by
$p_{X,Y}(x,y)$ = $p_Y(y)p_{X|Y}(x|y)$

and this can be used to calculate marginal PMF of X(similar total probability theorem)
$p_X(x)$ = $\sum_{y}p_Y(y)p_{X|Y}(x|y)$

-- Conditional Expectation: --
Conditional expectation of X given an event A with positive probability is defined by
E[X|A] = $\sum_{x}xp_{X|A}(x)$
and for a function, g(X) we have
E[g(X)|A] = $\sum_{x}g(x)p_{X|A}(x)$

The conditional expectation of X given a value y of Y is defined by
E[X|Y=y] = $\sum_{x}xp_{X|Y=y}(x)$

And this leads to the total expectation theorem
E[X] = $\sum_{y}p_Y(y)E[X|y=y]$

Let $A_1, A_2,... ,A_n$ are positive probability evens such that they form a partition of the sample space. Then
E[X] = $\sum_{i=1}^{n}P(A_i)E[X|A_i]$

Let $A_1, A_2,... ,A_n$ be partition of an even B and $P(A_i \cap B)$ > 0 for all i. Then
E[X|B] = $\sum_{i=1}^{n}P(A_i|B)E[X|A_i \cap B]$


-- Independent Random Variables: --
Let A be an event, with P(A) > 0, and let X and Y be random variables associated with the same experiment. Then

X is independent of event A if, $p_{X|A}(x)$ = $p_X(x)$ for all x.

X and Y are independent. Then
$p_{X,Y}(x,y)$ = $p_X(x)p_Y(y)$ for all x,y
E[XY] = E[X]E[Y]
E[g(X)h(Y)] = E[g(X)]E[h(Y)]
var(X+Y) = var(X) + var(Y)

Thursday, September 16, 2010

Introduction to Probability - Ch#1 notes

Introduction to Probability, Ch#1 notes

A probabilistic model is a mathematical description of an uncertain situation. Every probabilistic model involves an underlying process, called the experiment, that will produce exactly one out of several possible (mutually exclusive)outcomes. The set of *all* possible outcomes is called the sample space of the experiment, and is denoted by $\Omega$. A subset of the sample space, a collection of possible outcomes, is called an event.
The probability law assigns to any event A, a nonnegative P(A)(called the probability of A) that encodes our knowledge or belief about the collective "likelihood" of the elements of A. The probability law *must* satisfy following probability axioms

Nonnegativity: P(A) $\geq$ 0, for every event A.
Additivity: If A and B are two disjoint events, then the probability of their union satisfies, $P(A \cup B)$ = P(A) + P(B) . It can be generalized to any number of disjoint events.
Normalization: P($\Omega$) = 1

-- Conditional Probability --
The conditional probability of an even A, given an event B with P(B) > 0, is defined by P(A|B) = $\frac{P(A \cap B)}{P(B)}$ and specifies a new (conditional) probability law on the same sample space $\Omega$. And, All the probability axioms remain valid on the conditional probability law.

Multiplication Rule:
Assuming that all of the conditioning events have positive probability, we have
P($\cap_{i = 1}^{n} A_i$) = $P(A_1)P(A_2|A_1)P(A_3|A_1 \cap A_2)...P(A_n|\cap_{i = 1}^{n-1} A_i)$

Total Probability Theorem:
Let $A_1, A_2, ..., A_n$ be disjoint events that form a partition of the sample space and assume that probability of all these events is greater than 0. Then, for any event B, we have
P(B) = $P(A_1)P(B|A_1)$ + $P(A_2)P(B|A_2)$ + ... + $P(A_n)P(B|A_n)$

Bayes' Rule:
Let $A_1, A_2, ..., A_n$ be disjoint events that form a partition of the sample space and assume that probability of all these events is greater than 0. Then, for any even B such that P(B) > 0, we have
$P(A_i|B)$ = $\frac{P(A_i)P(B|A_i)}{\sum_{i = 1}^{n}P(A_i)P(B|A_i)}$

-- Independence: --
Two evens A and B are independent if and only if, $P(A \cap B)$ = P(A)P(B). This can be generalized to any number of elements.
Also, as a consequence P(A|B) = P(A), provided P(B) > 0

Conditional Independence:
Two events A and B are said to be conditionally independent given another event C with P(C) > 0, if
$P(A \cap B| C)$ = P(A|C)P(B|C).
And, if in addition $P(B \cap C)$ > 0, then conditional independence is equivalent to the condition
P(A |$B \cap C$) = P(A|C)

Note that, independence does not imply conditional independence and viceversa

Saturday, September 11, 2010

Facebook Engineering Puzzles

A colleague of mine pointed me to the facebook engineering puzzles page yesterday. And I tried to give a shot to first one (the liar-liar puzzle). It took a good number of my hours today but it was fun.

And, I succeeded in first submission itself :). Here is the mail from facebook that I received:

from Facebook PuzzleBot
to g.himanshu@gmail.com
date Sat, Sep 11, 2010 at 4:45 PM
subject liarliar evaluation results
mailed-by facebook.com


Dear submitter,

Thank you for your submission of a puzzle solution to Facebook! After running your solution to liarliar (received on September 11, 2010, 12:37 am), I have determined it to be correct. Your solution ran for 2508.850 ms on its longest test case. If you have not already, try installing the official Facebook Puzzles application from http://apps.facebook.com/facebookpuzzles/ and publish a story about your solution! To publish, just go to http://apps.facebook.com/facebookpuzzles/mypuzzles.php and click on the publish link.

If you are applying for a position within Facebook, the puzzle difficulty solved will be taken into account with regards to how much time you had available to solve it (remember that Hors D'oeuvres are only tests for your benefit). I have taken the liberty of alerting our recruiting team about your success. Best of luck!

Sincerely,
-The puzzle robot


sorry, but no open sourced code this time for obvious reason. :P

[Update#1] I got the 2nd one(breathalyzer) accepted also. It took multiple submissions though :( and hence more time.

[Update(May 02 - 2011)] After a long time, yesterday I worked on Gattaca puzzle and nailed it. Regarding the timing, bot says, it took 4132.647 ms on the longest test case. My first solution was O(2^n) as I thought this was NP-hard problem which got rejected. I then followed the discussions on it and realized that this was *not* NP-hard and managed to work out the code that runs in O(n^2) in the worst case.

[Update(May 08 - 2011)] dancebattle is finished too. My first submission was accepted but longest test case time was 1108.22 ms. Then I optimized it a little bit, resubmitted and then bot says, it took 263.009 ms on it's longest test case :).

[Update(May 09 - 2011)] smallworld is done. Note that trivial O(n^2) solution does not succeed. Bot says, my solution took 9928.117 ms on its longest test case.

[Update(May 16 - 2011)] Today I received acceptance email for my solution to fridgemadness (Refrigerator Madness). Hardest part in this puzzle is to really understand the problem :). Regarding the timing, puzzle bot says, it took 362.562 ms on its longest test case.

Thursday, September 9, 2010

Habits that make you a better programmer in the long run

I read a thread on hacker news on "what habits made you a better programmer?". Here are some of the things(in no particular order) that resonated with me personally.

  • Learning new things. Every year or so try to learn a new major skill like functional languages, AI, Compilers, OS etc.
  • Think first, write later.
  • Read great code written by great people and read your own code also.
  • Surround yourself with real smart people.
  • Spend an extra hour or so after you're *done* with something to give it finishing touches, do the edge cases.
  • Work on what is exciting to you, then switch when the excitement wears off and switch back again when excitement for it comes back. Though, it needs some level of discipline to come back.
  • Have discipline to go 100%, keep the notes and write blog.
  • Learning to love new languages that make you think differently.
  • Learn one editor really well.
  • Write small utility project/programs for the tool/technology you're curious about rather than waiting for THE master project.
  • Find someone to pair program with who is a great programmer.
  • Persistence and Patience are a virtue.
  • Keeping healthy body and mind is very important. Practice yoga or something regularly.
  • Socialize, be part of hack/code fests and such forums and groups.

Wednesday, September 8, 2010

Performance Tuning

I'm scribbling here my recent(and ongoing) experience with performance tuning a java web application. Like any other article on optimization, I'll start with the quote, "Premature optimization is the root of all evil.". My take is that don't *optimize* prematurely, but do not make it an excuse for *not measuring* the performance of your system. It is good to know how your system performs at all times.

Anyway, here is the common way(was reinforced after attending a talk on performance tuning in devcamp-2010) that everyone knows about but I'm just putting here anyway...

1. Pick conservative hardware and install your application on it.

2. Create a large sample(but realistic) dataset. It might be good to just take a dump from production and load your system with it.

3. Put your system to extreme load. In general the load will be very general doing all the functions on your system. But, in the special case where you're just exploring one feature's performance/GC-activity then put the load on that feature only. Use any of the load generation tools... Apache Bench(non-gui) and Apache Jmeter(gui as well as non-gui modes) are two free ones.

4. Measure, Measure and Measure.
This gets tricky, I've tried to use hprof for memory dumps, Runtime.freeMemory and Runtime.maxMemory methods for memory consumption analysis but ended up using a profiler(a commertial one that is standard in our org) and it really helps.

5. Make your change if any and then measure again and keep repeating.

6. Automate, as much possible, the whole process for faster feedback loops.

So far we've found following major boosters..

1. Putting indexes, we always were very careful while putting indexes as they become overhead for write/update heavy tables, but later found there are many tables where indexes are helping alot more than we imagined.

2. We noticed that alot of garbage was generated and most of it was char[] data resulting from a lot of Strings. And finally figured out that the culprit was heavy amount of logs *printed*. However, Note that log statements in the code are not that expensive but printing them is. So, for prod/perf environment set print log level to ERROR and be very careful while putting log statements in app(I mean being judicious about what level you print the log in.. DEBUG/INFO/WARN/ERROR/FATAL), anything in log.error should really be an error.

3. Guarding log.debug statements with log.isDebugEnabled.

4. Caching with LRU(policy may very well depend on particular use case)

[Update:] We realized that our background jobs are very memory/cpu hungry, so we created a separate job node to run just the background jobs. It is improving the performance of serving synchronous client requests tremendously. However, we are still working on to make background jobs more efficient in terms of memory/cpu usage.

[Update:10-12-10] We realized that we usually used bind variables in sql/hql queries when wrote them by hand but when generated dynamically, we forgot(its easy) to use bind variables. Also, we never used bind variables wherever we had IN clauses. Since we all know that it is more efficient to use bind variables to reduce parsing times in db(they get more cache hits on already parsed sqls), So we are paying this technical debt and correcting all of this now. While we're doing it, it seems it would have been a simpler exercise if all the hql and sql statements were written in one place then it would have been easier to go through all of them as right now they are all scattered in different places in the big project code-base

[Update:14-12-10] We are resorting to use bulk manipulation queries and they are giving significant improvements. To take an extreme example, it is much efficient to directly execute query, "UPDATE TABLE t SET c = 'blah'", than iterating over every data object, updating it and saving it(unfortunately, ORMs give us the habit of thinking always in terms of objects and we forget to use these simple optimizations).
Also, I find the contents on this stackoverflow thread very informative and It seems I've also made some of the mistakes mentioned there :).

[Update:10-03-11] Never leave memory requirements of an operation unbounded. If your memory requirement increases in proportion to some dataset size and you believe that dataset size is upper bounded then place some mechanism so that if the dataset size goes beyond your assumption then somehow you get the notification. In my experience, usually my assumptions were either wrong or with time the load grew and assumption became wrong.. so it is much better if you plan your code so that dataset size is also bounded by design rather than by belief :). For example if you're fetching a list of records from database, probably there will be good enough reasons to use pagination parameters.