Wednesday, September 19, 2012


Given sample data of two random variables X and Y, when is the correlation analysis between them valid?
1. X and Y should both be (approximately) normally distributed, and the sample should contain data points from the whole spectrum of possible values. For example, if all the X values in the sample are close to a single value, say 0 (whereas X can range from, say, -500 to +500), then no matter how big the sample is, you are not going to get any meaningful correlation analysis results.
How to see: Draw a histogram of each sample and check whether it is close to a bell curve. Also, look at the summary statistics for the samples.

2. X and Y have a linear relationship.
How to see: Examine the scatterplot of Y vs X, and plot the histogram of the residuals (the prediction errors from the linear model).

3. Homoskedasticity (the spread of the residuals is the same across all X values)
How to see: Plot the residuals vs the X values. There should be no relationship; the residuals should look random.

Reliability of the correlation coefficient?
One approach is to use null hypothesis significance testing. For example, H0 is that there is no correlation. Find the p-value and reject (or fail to reject) the null hypothesis. You can also look at the confidence interval for the correlation coefficient.
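To make the correlation coefficient and its significance test a bit more concrete, here is a minimal Java sketch. It computes Pearson's r and the usual t-statistic for H0 "no correlation", t = r * sqrt((n - 2) / (1 - r^2)), which under H0 and the assumptions listed above follows a t distribution with n - 2 degrees of freedom. The class name, method names and sample data are made up for illustration.

```java
// Minimal sketch: Pearson correlation and the t-statistic for H0 "no correlation".
// Class name, method names and data are illustrative assumptions.
class CorrelationSketch {

    // Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)), with dx, dy deviations from the means.
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length || x.length < 3) {
            throw new IllegalArgumentException("need two samples of equal size, n >= 3");
        }
        int n = x.length;
        double meanX = 0, meanY = 0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    // Test statistic for H0: true correlation is 0 (compare against a t distribution, n - 2 df).
    static double tStatistic(double r, int n) {
        return r * Math.sqrt((n - 2) / (1.0 - r * r));
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {2.1, 3.9, 6.2, 7.8, 10.1, 12.2};
        double r = pearson(x, y);
        System.out.println("r = " + r + ", t = " + tStatistic(r, x.length));
    }
}
```

The p-value then comes from looking up the t-statistic in a t table (or a statistics library) with n - 2 degrees of freedom.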

Notes on Sampling
In most statistical experiments, it is not possible to gather data from the whole population, so you gather data from a "random representative sample" which is "large enough" to give a decent estimate of the parameters.
So, there will be a difference between the actual population parameter (say, the mean) and the same quantity measured from the sample. This difference is called the sampling error. Now, mostly, we don't know the actual population parameter value (in this case the true mean of the population), so how do we estimate how large our sampling error is (btw, the estimate of the typical sampling error is called the standard error, SE)?
SE of estimated mean = SD/sqrt(N)
where SD is the standard deviation of the sample and N is the sample size.
But how is the above formula obtained?

Let's say we take many, many samples from the population and find the mean value of each sample. These estimated means will generally be different for different samples. Essentially, the estimated means coming from different samples have a probability distribution (you can get some idea of this distribution by plotting the histogram, called the probability histogram, of all the different sample means). This is called the distribution of the sample mean, and the mean of this distribution is equal to the true mean of the population. The SE is the standard deviation of this probability histogram.
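For completeness, here is a short sketch of where SD/sqrt(N) comes from. It assumes the N observations X_1, ..., X_N are independent draws from the same population with variance sigma^2 (these symbols are introduced here only for the derivation).

```latex
\begin{align*}
\operatorname{Var}(\bar{X})
  &= \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)
   = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i)
   = \frac{N\sigma^2}{N^2}
   = \frac{\sigma^2}{N}, \\
\operatorname{SD}(\bar{X}) &= \frac{\sigma}{\sqrt{N}}
  \;\approx\; \frac{\mathrm{SD}}{\sqrt{N}} = \mathrm{SE}
  \quad\text{(replacing the unknown } \sigma \text{ by the sample SD).}
\end{align*}
```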

By the "central limit theorem", the distribution of the sample mean is approximately normal as long as the sample size is large enough (N > 30 is a common rule of thumb), and it is exactly normal if the random variable in question is itself normally distributed.
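The "many, many samples" idea is easy to try out with a small simulation. The sketch below is only an illustration under assumed settings: the population is Uniform(0, 1) (true mean 0.5, true SD sqrt(1/12)), and the class and method names are made up. It draws many samples of size n, and compares the standard deviation of the sample means with SD/sqrt(n).

```java
import java.util.Random;

// Minimal simulation sketch: the SD of many sample means should be close to sigma / sqrt(n).
// The Uniform(0, 1) population and all names here are illustrative assumptions.
class StandardErrorSketch {

    // Returns { mean of the sample means, standard deviation of the sample means }.
    static double[] simulate(int n, int samples, long seed) {
        Random rng = new Random(seed);
        double sum = 0, sumSq = 0;
        for (int s = 0; s < samples; s++) {
            double total = 0;
            for (int i = 0; i < n; i++) {
                total += rng.nextDouble(); // one draw from Uniform(0, 1)
            }
            double mean = total / n;       // mean of this one sample
            sum += mean;
            sumSq += mean * mean;
        }
        double meanOfMeans = sum / samples;
        double variance = sumSq / samples - meanOfMeans * meanOfMeans;
        return new double[] { meanOfMeans, Math.sqrt(variance) };
    }

    public static void main(String[] args) {
        int n = 50;
        double[] result = simulate(n, 20000, 42L);
        double theoreticalSe = Math.sqrt(1.0 / 12.0) / Math.sqrt(n);
        System.out.println("mean of sample means = " + result[0] + " (true mean 0.5)");
        System.out.println("SD of sample means   = " + result[1]
                + " (sigma/sqrt(n) = " + theoreticalSe + ")");
    }
}
```

Plotting a histogram of the simulated sample means would also show the bell shape that the central limit theorem predicts, even though the underlying population is flat.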

Notes on Hypothesis testing (taking the sample mean as an example):
p-value = P(observed sample data, or more extreme | H0 is true)
In simple terms, this is the probability of observing the given sample, or one more extreme than that, given that the null hypothesis is true. Hence, if the p-value is low (e.g. less than a chosen threshold), we can reject the null hypothesis.

Usually, the approximate probability distribution of the observed sample statistic given H0 is known (the sampling distribution), and we use that probability distribution to calculate the p-value.

For example, the sampling distribution of the mean is normal. That is,
(X_bar - mu)/(sigma/sqrt(n)) ~ standard normal distribution (Z)

X_bar : sample mean
mu : true population mean
sigma : true population standard deviation
n : sample size

Let's say H0 is that the true population mean (mu) is 0. Then
z-score = (X_bar - 0)/(sigma/sqrt(n))

which is basically how many standard errors the sample mean is away from the mean assumed under H0. If the absolute value of the z-score is too large, the observed sample is very unlikely under H0, and H0 should be rejected.

Now, let us try to understand how the z-score (in general, the test statistic) and the p-value are related. To understand that, let us look at the standard normal distribution.

About 95.4% of the area under the standard normal curve lies between -2 and +2. So if H0 is true, then with 95.4% chance we should get an absolute z-score less than 2 (in fact, [X_bar - 2*sigma/sqrt(n), X_bar + 2*sigma/sqrt(n)] is called the 95.4% confidence interval for the true population mean).
And with 4.6% (= 100 - 95.4) chance we observe a sample whose z-score is not in [-2,+2]. So, if the observed z-score is exactly 2 (or -2), the probability of observing this sample data or more extreme is 0.046, and that is our (two-sided) p-value.
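The z-score to p-value step can be sketched in a few lines of Java. Java's standard library has no normal CDF, so this sketch assumes a well-known polynomial approximation of the error function (Abramowitz and Stegun 7.1.26, accurate to roughly 1.5e-7); the class and method names are made up for illustration.

```java
// Minimal sketch: z-score for H0 "mu = mu0" and the matching two-sided p-value.
// Names are illustrative; erf() is an approximation, not an exact implementation.
class ZTestSketch {

    // Abramowitz-Stegun 7.1.26 approximation of the error function.
    static double erf(double x) {
        double sign = x < 0 ? -1.0 : 1.0;
        double ax = Math.abs(x);
        double t = 1.0 / (1.0 + 0.3275911 * ax);
        double poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
                + t * (-1.453152027 + t * 1.061405429))));
        return sign * (1.0 - poly * Math.exp(-ax * ax));
    }

    // Standard normal CDF: P(Z <= z).
    static double normalCdf(double z) {
        return 0.5 * (1.0 + erf(z / Math.sqrt(2.0)));
    }

    // How many standard errors the sample mean is away from the mean assumed under H0.
    static double zScore(double sampleMean, double mu0, double sigma, int n) {
        return (sampleMean - mu0) / (sigma / Math.sqrt(n));
    }

    // Probability, under H0, of a z-score at least as far from 0 as the observed one.
    static double twoSidedPValue(double z) {
        return 2.0 * (1.0 - normalCdf(Math.abs(z)));
    }

    public static void main(String[] args) {
        // Hypothetical numbers: sample mean 0.5, H0 mean 0, known sigma 2, n = 64, so z = 2.
        double z = zScore(0.5, 0.0, 2.0, 64);
        System.out.println("z = " + z + ", p-value = " + twoSidedPValue(z)); // p is about 0.0455
    }
}
```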

Since we are talking about the sample mean, one more thing worth mentioning is that sigma is usually not known, and we approximate it by the sample standard deviation (the square root of MSS, the mean sum of squared deviations). The distribution will still be approximately normal as long as the sample size is large. Otherwise, the statistic will be closer to a t distribution with n-1 degrees of freedom; we then use a t-score instead of a z-score, and the p-value, confidence interval etc. are calculated from the t distribution.

Monday, September 3, 2012

ad hoc jvm monitoring and debugging

Many times we notice some issue in a production environment and quickly need to investigate the details with just an ssh connection to a terminal. The jdk comes packaged with some very useful tools to monitor/debug a jvm, which can come in handy.

jps - lists the vm-id's of instrumented jvm processes running on the machine

jinfo - gives you a lot of detail about the configuration a jvm process was started with (which, otherwise, is scattered around countless configuration files and environment variables)

jstat - a jvm monitoring tool, very quick to set up. it can basically monitor the heap, gc frequencies and time elapsed doing gc. It can be set up with one command e.g. "jstat -gcold -t vm-id 10s"

jstack - this lets you print the stack traces of all the threads. it can automatically detect deadlocks. Also, you can find high contention locks by taking multiple thread dumps over a small period of time and seeing which locks are being waited upon most frequently. Use "jstack -F -m vm-id". Use the additional "-l" option to print lock information (it will be slow though).

jmap - basically the heap analyzer. among other things, you can use it to dump the contents of heap to a file(e.g. jmap -dump:format=b,file=mydump.hprof vm-id). you can use jhat to explore this file using a browser or use eclipse-mat that gives better ui and functionality.

hprof -  standard java cpu/memory profiler bundled with jdk. this is not really ad-hoc as you would have to add it to jvm options at start up instead of just attaching at will. output can be processed via jhat and other fancy tools such as yourkit.
java -agentlib:hprof=help
java -agentlib:hprof=heap=sites,cpu=samples,depth=10,monitor=y,thread=y,doe=y,format=b,file=/home/himanshu/profile_dump.hprof

Note that, when the jvm process is started by a different user than the one you are logged in with, your logged-in user might not have permissions to attach to the jvm process and you may need to use sudo with all of the above commands.

Btw, these tools are not limited to jvm processes running locally but can be used with remote jvm processes as well, using rmi. In this case you could also use the graphical clients JConsole and JVisualVM.
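Some of the same information can also be read from inside the process through the standard java.lang.management API, which is handy when attaching an external tool is not possible. The sketch below is only an illustration (the class and method names are made up): it lists the jvm startup arguments, prints a thread dump, and checks for deadlocked threads, roughly in the spirit of jinfo and jstack.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.List;

// Minimal in-process sketch using the standard management beans.
// Class and method names are illustrative assumptions.
class InProcessJvmInfo {

    // Arguments the jvm was started with (similar in spirit to part of jinfo's output).
    static List<String> jvmArguments() {
        return ManagementFactory.getRuntimeMXBean().getInputArguments();
    }

    // Text dump of all live threads, including held monitors and synchronizers.
    // Note: ThreadInfo.toString() only prints the top few stack frames.
    static String threadDump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder sb = new StringBuilder();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            sb.append(info.toString());
        }
        return sb.toString();
    }

    // Ids of threads that are deadlocked, or an empty array if there are none.
    static long[] deadlockedThreadIds() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? new long[0] : ids;
    }

    public static void main(String[] args) {
        System.out.println("jvm arguments: " + jvmArguments());
        System.out.println("deadlocked threads: " + deadlockedThreadIds().length);
        System.out.println(threadDump());
    }
}
```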

A bit orthogonal to jvm monitoring, but the following are some notable jvm startup options that are very helpful when things go wrong.

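If a fatal error occurs, save the error data to the given file (on the HotSpot jvm this is the ErrorFile option).
-XX:ErrorFile=<file>

Dump the heap to the given file in case of an out of memory error (on HotSpot, the heap dump flag plus the dump path).
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<file>

Prints useful information about gc in the given file (on HotSpot of this era, the gc log option). you can use gcviewer to analyze this file.
-Xloggc:<file>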


Sunday, September 2, 2012

Dynamic Proxies in Java

The Java Proxy class in the reflection package (java.lang.reflect) lets you dynamically create a proxy class (and its instances) that implements interfaces of your choice. A proxy instance contains an InvocationHandler supplied by you. Any method call on the proxy instance calls the invoke method of that invocation handler, where you can determine the behavior that you want for the call.
So, for example, you can very easily wrap an implementation of an interface and do some magic before/after the method invocation (and much more of course). Using this you can get lightweight "aop"-like functionality for method interception.
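Here is a minimal sketch of the before/after wrapping idea. The Greeter interface, its implementation and the logging handler are all made up for illustration; only Proxy, InvocationHandler and Method come from the jdk.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.ArrayList;
import java.util.List;

// Minimal dynamic proxy sketch; all type names below are illustrative assumptions.
class DynamicProxySketch {

    // Hypothetical interface we want to intercept calls on.
    interface Greeter {
        String greet(String name);
    }

    static class SimpleGreeter implements Greeter {
        public String greet(String name) {
            return "Hello, " + name;
        }
    }

    // Records a line before and after every call, then delegates to the real object.
    static class LoggingHandler implements InvocationHandler {
        private final Object target;
        final List<String> log = new ArrayList<String>();

        LoggingHandler(Object target) {
            this.target = target;
        }

        @Override
        public Object invoke(Object proxy, Method method, Object[] args) throws Throwable {
            log.add("before " + method.getName());
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // rethrow the target's own exception, not the wrapper
            } finally {
                log.add("after " + method.getName());
            }
        }
    }

    // Creates a proxy that implements Greeter and routes every call through the handler.
    static Greeter proxyFor(LoggingHandler handler) {
        return (Greeter) Proxy.newProxyInstance(
                Greeter.class.getClassLoader(),
                new Class<?>[] { Greeter.class },
                handler);
    }

    public static void main(String[] args) {
        LoggingHandler handler = new LoggingHandler(new SimpleGreeter());
        Greeter greeter = proxyFor(handler);
        System.out.println(greeter.greet("World"));
        System.out.println(handler.log);
    }
}
```

Note that calls to toString, hashCode and equals on the proxy also go through the handler, so a real handler may want to treat those separately.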

A much more informative article on the same topic is

Saturday, September 1, 2012

Java SE 7 new features

JDK7 has many new features, and folks have done a great job of writing good articles about all of them, so I won't repeat any of that.
For my own quick access, I will keep references to such articles and a quick list of interesting features.

- Type inference for generic instance creation (the diamond operator)
- Strings in switch statements
- try-with-resources and using exception suppression instead of masking it
- Catching multiple exceptions in single catch block
- NIO 2.0 changes
- Fork-Join paradigm
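As a quick reminder of what several of these look like in code, here is a small sketch. All class and method names are made up for illustration, and it only touches the diamond operator, strings in switch, try-with-resources with a suppressed exception, multi-catch and a basic fork-join task (NIO 2.0 is left out).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Minimal sketch of a few Java SE 7 features; all names are illustrative assumptions.
class Java7FeaturesSketch {

    // Strings in switch statements.
    static int daysIn(String month) {
        switch (month) {
            case "feb": return 28;
            case "apr": case "jun": case "sep": case "nov": return 30;
            default: return 31;
        }
    }

    // Hypothetical resource whose close() always fails.
    static class FailingResource implements AutoCloseable {
        @Override
        public void close() {
            throw new IllegalStateException("close failed");
        }
    }

    // try-with-resources: the exception from the body is the primary one, and the
    // exception from close() is attached to it as suppressed instead of masking it.
    static Exception primaryWithSuppressed() {
        try (FailingResource resource = new FailingResource()) {
            throw new IllegalArgumentException("body failed");
        } catch (IllegalArgumentException | IllegalStateException e) { // multi-catch
            return e;
        }
    }

    // Fork-join: sum an array by splitting it recursively into halves.
    static class SumTask extends RecursiveTask<Long> {
        private final int[] data;
        private final int from, to;

        SumTask(int[] data, int from, int to) {
            this.data = data;
            this.from = from;
            this.to = to;
        }

        @Override
        protected Long compute() {
            if (to - from <= 1000) { // small enough: sum directly
                long sum = 0;
                for (int i = from; i < to; i++) sum += data[i];
                return sum;
            }
            int mid = (from + to) >>> 1;
            SumTask left = new SumTask(data, from, mid);
            left.fork();                                     // run left half asynchronously
            long right = new SumTask(data, mid, to).compute(); // compute right half here
            return right + left.join();
        }
    }

    static long parallelSum(int[] data) {
        ForkJoinPool pool = new ForkJoinPool();
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> months = new ArrayList<>(); // diamond: type arguments are inferred
        months.add("feb");
        System.out.println(daysIn(months.get(0)));
        Exception e = primaryWithSuppressed();
        System.out.println(e.getMessage() + " / suppressed: " + e.getSuppressed()[0].getMessage());
        System.out.println(parallelSum(new int[] {1, 2, 3, 4, 5}));
    }
}
```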

Good article on try-with-resources and exception suppression

Fork-Join article