Thursday, January 24, 2013

Good Unit Tests

Found a good thread on stackoverflow regarding good unit tests: http://stackoverflow.com/questions/61400/what-makes-a-good-unit-test

Pasting the most important answer from the above thread here..
--------------------------------------------------
Good Tests should be A TRIP (The acronym isn't sticky enough - I have a printout of the cheatsheet in the book that I had to pull out to make sure I got this right..)
  • Automatic: Invoking the tests as well as checking results for PASS/FAIL should be automatic
  • Thorough: Coverage; Although bugs tend to cluster around certain regions in the code, ensure that you test all key paths and scenarios.. Use tools if you must to know untested regions
  • Repeatable: Tests should produce the same results each time.. every time. Tests should not rely on uncontrolled parameters.
  • Independent: Very important.
    • Tests should test only one thing at a time. Multiple assertions are okay as long as they are all testing one feature/behavior. When a test fails, it should pinpoint the location of the problem.
    • Tests should not rely on each other - Isolated. No assumptions about order of test execution. Ensure 'clean slate' before each test by using setup/teardown appropriately
  • Professional: In the long run you'll have as much test code as production (if not more), therefore follow the same standard of good-design for your test code. Well factored methods-classes with intention-revealing names, No duplication, tests with good names, etc.
  • Good tests also run Fast. Any test that takes over half a second to run needs to be worked on. The longer the test suite takes to run, the less frequently it will be run, and the more changes the dev will try to sneak in between runs.. if anything breaks, it will take longer to figure out which change was the culprit.
  • Readable : This can be considered part of Professional - however it can't be stressed enough. An acid test would be to find someone who isn't part of your team and asking him/her to figure out the behavior under test within a couple of minutes. Tests need to be maintained just like production code - so make it easy to read even if it takes more effort. Tests should be symmetric (follow a pattern) and concise (test one behavior at a time). Use a consistent naming convention (e.g. the TestDox style). Avoid cluttering the test with "incidental details".. become a minimalist.
Apart from these, most of the others are guidelines that cut down on low-benefit work: e.g. 'Don't test code that you don't own' (e.g. third-party DLLs). Don't go about testing getters and setters. Keep an eye on cost-to-benefit ratio or defect probability.
-----------------------------------------------
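A toy sketch of my own (not from the thread, and the Stack class is just a made-up example) showing a few of the A TRIP points in Python's unittest: a fresh fixture per test via setUp (Independent), one behavior per test with an intention-revealing name (Readable), and no dependence on execution order or external state (Repeatable):

```python
import unittest

class Stack:
    """Tiny class under test (illustrative only)."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        if not self._items:
            raise IndexError("pop from empty stack")
        return self._items.pop()

class StackTest(unittest.TestCase):
    def setUp(self):
        # Fresh fixture for every test: no shared state, no ordering assumptions.
        self.stack = Stack()

    def test_pop_returns_last_pushed_item(self):
        self.stack.push(1)
        self.stack.push(2)
        self.assertEqual(self.stack.pop(), 2)

    def test_pop_on_empty_stack_raises(self):
        with self.assertRaises(IndexError):
            self.stack.pop()

if __name__ == "__main__":
    unittest.main(exit=False, argv=["stack_test"])
```

When a test here fails, its name alone pinpoints which behavior broke - which is the "pinpoint the location of the problem" point above.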

Some of my own, from experience (not strict rules that I follow, just guidelines)..

- Whenever writing tests, keep in mind how much the test code changes if you change the implementation (keeping the interface the same) of the method under test. In short, try to test the specification as much as possible.

- Recently, to abstract away a key-value store, I created an interface, KeyValueDb. In addition to the implementation based on the real key-value store, I wrote another one backed by memory. For all the code that uses KeyValueDb, I used the memory-backed implementation in the unit tests, and surprisingly the tests look much cleaner than with mock(KeyValueDb). However, mocks are still required in some places to create scenarios that are hard to create otherwise (e.g. a socket timeout).
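The same idea, sketched in Python (my original code was not Python, and everything except the KeyValueDb name is illustrative): a memory-backed fake implementing the abstraction, used directly by the code under test with no mock setup/verification noise.

```python
from abc import ABC, abstractmethod

class KeyValueDb(ABC):
    """Abstraction over the key-value store."""
    @abstractmethod
    def get(self, key): ...
    @abstractmethod
    def put(self, key, value): ...

class InMemoryKeyValueDb(KeyValueDb):
    """Memory-backed fake for unit tests: fast, deterministic, no I/O."""
    def __init__(self):
        self._data = {}
    def get(self, key):
        return self._data.get(key)
    def put(self, key, value):
        self._data[key] = value

def count_visits(db: KeyValueDb, user: str) -> int:
    """Hypothetical code-under-test that depends only on the abstraction."""
    visits = (db.get(user) or 0) + 1
    db.put(user, visits)
    return visits

# In a test, just pass the fake:
db = InMemoryKeyValueDb()
assert count_visits(db, "alice") == 1
assert count_visits(db, "alice") == 2
```

Because the test exercises the abstraction through real (if fake) state, it keeps working unchanged if count_visits is refactored internally - which is exactly the "test the specification, not the implementation" point above.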

Also, though I don't do T.D.D. [neither Design nor Development], writing unit tests while/after coding something gives me confidence that it works. It certainly helps catch bugs and sometimes results in good method-level refactorings.

Not tired of reading yet? Here is another interesting post... http://mike-bland.com/2012/07/10/test-mercenaries.html


Sunday, November 11, 2012

linux cmd cheats


#to look at traffic on port 80, in ascii format (-A), with unlimited snaplen (-s 0)
tcpdump -i any -s 0 -A tcp port 80

#randomly select N lines from input
shuf -n N input > output

#removing/selecting non-ascii chars
sed 's/[\d128-\d255]//g' input_file
grep -P '[\x80-\xFF]' input_file

# Printing a control character in the shell (for example, to use as a delimiter in the cut command)
Typing Ctrl-v and then Ctrl-a prints ^A

# cat -n prints line numbers as well

# Note that the cut command does not reorder fields even if you say -f2,1; they are still printed in the order they appear in the input file. You can use awk for that.
# cat data
1,2
3,4

# cut -d ',' -f2,1 data
1,2
3,4

# awk 'BEGIN { FS = "," } { print $2,$1}' data
2 1
4 3
 

#pretty-print json
echo '{"a":1,"b":2}' | python -mjson.tool

# Redirect stderr to stdout and stdout to a file
foo > log_file 2>&1


#convert seconds since epoch to date string
date -d @1396324799



Monday, October 29, 2012

confidence interval - an interpretation

Given a population sample..

(X_bar - mu)/(S/sqrt(n)) ~ t-distribution with n-1 degrees of freedom

This fact is used to derive the (1-alpha)*100% confidence interval for the population mean (mu), which is..

X_bar +/- t_{n-1, 1-alpha/2} * (S/sqrt(n))

where t_{n-1, 1-alpha/2} is the (1-alpha/2)th quantile of the t-distribution with n-1 degrees of freedom.

Note that this interval is random: if you take multiple samples of the same size, the interval will most likely come out different for different samples from the same population. The true population mean (mu), however, is a fixed value.

One interpretation of the confidence interval is that if you take many samples and compute, say, a 95% confidence interval from each of them, then about 95% of those intervals will contain mu.
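This interpretation is easy to check with a quick simulation (my own sketch, all numbers illustrative): draw many samples from a known normal population, build a 95% interval from each using the normal approximation (z = 1.96, reasonable at n = 50), and count how often the true mean is covered.

```python
import random

random.seed(42)
mu, sigma, n, trials = 10.0, 3.0, 50, 2000
covered = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = sum(sample) / n
    # sample standard deviation S (n-1 in the denominator)
    s = (sum((x - x_bar) ** 2 for x in sample) / (n - 1)) ** 0.5
    half_width = 1.96 * s / n ** 0.5   # z-based; the t_{49} interval would be slightly wider
    if x_bar - half_width <= mu <= x_bar + half_width:
        covered += 1
coverage = covered / trials
print(f"empirical coverage: {coverage:.3f}")   # should come out close to 0.95
```

Each trial produces a different random interval, yet the fixed mu lands inside roughly 95% of them - exactly the interpretation above.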

Wednesday, September 19, 2012

stats1-coursera-week2-notes

Given sample data on two random variables X and Y, when is correlation analysis between them valid?
1. X and Y both have normal distributions, because to do correlation analysis you need data points from the whole spectrum of possible values. For example, if all your X values in the sample are close to a single value, say 0 (whereas possible values of X range from, say, -500 to +500), then no matter how big the sample is, you are not going to get meaningful correlation analysis results.
How to see: just draw a histogram of the sample and check whether it is close to a bell curve. Also look at the summary statistics for the samples.

2. X and Y have linear relationship.
How to see: examine the scatterplot of Y vs X; plot the histogram of the residuals (prediction errors from the linear model).

3. Homoskedasticity
How to see: plot the residuals vs the X values. There should be no relationship; the plot should look random.

Reliability of the correlation coefficient?
One approach is to use null hypothesis significance testing. For example, H0 is that there is no correlation; find the p-value and accept/reject the null hypothesis. You can also look at the confidence interval for the correlation coefficient.
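For reference, the correlation coefficient and the test statistic used for the H0-of-no-correlation test can be computed as below (my own sketch; the data is made up, roughly y = 2x):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of paired samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def t_statistic(r, n):
    # Under H0: rho = 0, t = r*sqrt(n-2)/sqrt(1-r^2) follows a
    # t-distribution with n-2 degrees of freedom.
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
r = pearson_r(xs, ys)
print(r, t_statistic(r, len(xs)))   # r close to 1 -> large t -> tiny p-value
```

A large |t| (compared against the t-distribution with n-2 degrees of freedom) gives a small p-value, i.e. reject "no correlation".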

 Notes on Sampling
In most statistical experiments it is not possible to gather data from the whole population, so you gather data from a "random representative sample" that is "large enough" to give a decent estimate of the parameters.
So there will be a difference between the actual population parameter (say, the mean) and the same quantity measured from the sample; the difference is called the sampling error. Now, mostly, we don't know the actual population parameter value (in this case the true mean of the population).. so how do we estimate what our sampling error is? (Btw, the estimate of the sampling error is called the standard error, SE.)
SE of estimated mean =  SD/sqrt(N)
where SD is standard deviation of the sample.
But how is the above formula obtained?

Say we take many, many samples from the population and find the mean of each. These estimated means will generally be different for different samples. Essentially, the means coming from different samples have a probability distribution (you can get some idea of it by plotting the histogram, called the probability histogram, of all the different sample means). This is called the distribution of the sample mean, and the mean of this distribution equals the true mean of the population. Also, SE is the standard deviation of the probability histogram.

By the "central limit theorem", the distribution of the sample mean is approximately normal as long as the sample size is large enough (N > 30), or if the random variable in question is itself normally distributed.
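A quick simulation (my own sketch) showing that the standard deviation of many sample means matches SD/sqrt(N), even for a non-normal population - here Uniform(0, 1), whose SD is 1/sqrt(12):

```python
import random
import statistics

random.seed(1)
N, num_samples = 100, 3000
means = []
for _ in range(num_samples):
    sample = [random.random() for _ in range(N)]   # Uniform(0, 1) population
    means.append(sum(sample) / N)

empirical_se = statistics.stdev(means)       # SD of the probability histogram of means
predicted_se = (1 / 12) ** 0.5 / N ** 0.5    # population SD / sqrt(N)
print(empirical_se, predicted_se)            # the two should be very close
```

Plotting a histogram of `means` would also show the bell shape the central limit theorem promises, even though the underlying population is flat.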

Notes on Hypothesis testing (taking the sample mean as an example):
p-value = P(observing data at least as extreme as the sample | H0 is true)
which in simple terms is the probability of observing the given sample, or one more extreme, given that the null hypothesis is true. Hence, if the p-value is low (e.g. less than a threshold), we can reject the null hypothesis.

Usually, the approximate probability distribution of the observed sample data given H0 is known (the sampling distribution), and using that probability distribution we calculate the p-value.

For example, the sampling distribution of the mean is normal. That says..
(X_bar - mu)/(sigma/sqrt(n)) ~ standard normal distribution(Z)

where
X_bar : sample mean
mu : true population mean
sigma : true population standard deviation
n : sample size

Say H0 is that the true population mean (mu) is 0; then
z-score = (X_bar - 0)/(sigma/sqrt(n))

which is basically how many standard deviations your sample mean is from the hypothesized mean. If the absolute value of the z-score is too large, the assumed H0 is probably not true and should be rejected.

Now let us try to understand how the z-score (in general, the test statistic) and the p-value are related. To understand that, let us look at the standard normal distribution.

[Figure: standard normal density, with the central region within 2 standard deviations of the mean (95.4% of the probability mass) marked.]
It is clear from the picture above that if H0 is true then with 95.4% probability we get an absolute z-score less than 2 (in fact [X_bar - 2*sigma/sqrt(n), X_bar + 2*sigma/sqrt(n)] is called the 95.4% confidence interval for the true population mean).
And with 4.6% (= 100 - 95.4) probability we observe a sample whose z-score is not in [-2, +2]. So if our observed |z-score| is 2, the probability of observing this sample data or more extreme is 0.046, and that is our p-value.

Since we are talking about the sample mean, one more thing worth mentioning: usually sigma is not known, and we approximate it by the sample standard deviation (the square root of the mean squared deviation). The distribution of the sample mean will still be approximately normal as long as the sample size is large; otherwise it is closer to a t-distribution with n-1 degrees of freedom, and we use a t-score instead of a z-score, with the p-value/confidence interval calculated from the t-distribution.
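The z-score-to-p-value step above can be computed directly (my own sketch; the numbers are made up, and the standard normal CDF is built from math.erf):

```python
import math

def normal_cdf(x):
    # Phi(x), the standard normal CDF, via the error function
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def z_test(x_bar, mu0, sigma, n):
    """Two-sided z-test of H0: mu = mu0, with sigma assumed known."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))
    p_value = 2.0 * (1.0 - normal_cdf(abs(z)))
    return z, p_value

# Hypothetical sample: mean 0.5, H0: mu = 0, sigma = 2, n = 100
z, p = z_test(x_bar=0.5, mu0=0.0, sigma=2.0, n=100)
print(z, p)   # z = 2.5, p ~ 0.012 -> reject H0 at the 5% level
```

As a sanity check, |z| = 2 gives p ~ 0.046, matching the 4.6% figure in the post.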







Monday, September 3, 2012

ad hoc jvm monitoring and debugging

Many times we notice some issue in a production environment and quickly need to investigate the details, armed only with an ssh connection to a terminal. The jdk comes packaged with some very useful tools to monitor/debug the jvm which can come in handy..

jps - lists the vm-ids of instrumented jvm processes running on the machine

jinfo - gives you a lot of detail about the configuration a jvm process was started with (which is otherwise scattered across innumerable configuration files and environment variables)

jstat - a jvm monitoring tool that is very quick to set up. It can monitor the heap, gc frequency and time elapsed doing gc. It can be started with one command, e.g. "jstat -gcold -t vm-id 10s"

jstack - this lets you print the stack traces of all the threads, and it can automatically detect deadlocks. You can also find highly contended locks by taking multiple thread dumps over a short period of time and seeing which locks are waited upon most frequently. Use "jstack -F -m vm-id"; add the "-l" option to print lock information (it will be slower though).

jmap - basically the heap analyzer. Among other things, you can use it to dump the contents of the heap to a file (e.g. jmap -dump:format=b,file=mydump.hprof vm-id). You can explore this file in a browser using jhat, or use eclipse-mat, which has a better ui and more functionality.

hprof - the standard java cpu/memory profiler bundled with the jdk. This one is not really ad hoc, as you have to add it to the jvm options at startup instead of just attaching at will. Output can be processed via jhat and fancier tools such as yourkit.
java -agentlib:hprof=help
java -agentlib:hprof=heap=sites,cpu=samples,depth=10,monitor=y,thread=y,doe=y,format=b,file=/home/himanshu/profile_dump.hprof

http://www.javaworld.com/article/2075884/core-java/diagnose-common-runtime-problems-with-hprof.html
http://docs.oracle.com/javase/8/docs/technotes/samples/hprof.html

Note that when the jvm process is started by a different user than the one you are logged in as, your user might not have permission to attach to the jvm process, and you may need to use sudo with all of the above commands.

Btw, these tools are not limited to jvm processes running locally; they can be used with remote jvm processes as well, via rmi. In that case you can also use the graphical clients JConsole and JVisualVM.

A bit orthogonal to jvm monitoring, but following are some notable jvm startup options that are very helpful when things go wrong.

-XX:ErrorFile=/path/to/hs_err_pid.log
If an error occurs, save the error data to given file.

-XX:+HeapDumpOnOutOfMemoryError
-XX:-HeapDumpPath=/path/to/java_pid.hprof
Dump the heap to the given file in case of an out-of-memory error (note the '+'; '-XX:-HeapDumpOnOutOfMemoryError' would disable the dump).

-XX:+PrintGCDetails
-Xloggc:/path/to/gclog.log
Prints useful information about gc to the given file. You can use gcviewer to analyze this file.

References:
http://docs.oracle.com/javase/8/docs/technotes/tools/
http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html
http://docs.oracle.com/javase/8/docs/technotes/tools/windows/java.html







Sunday, September 2, 2012

Dynamic Proxies in Java

The Java Proxy class in the reflection package lets you dynamically create a proxy class (and its instances) that implements interfaces of your choice. A proxy instance is backed by an InvocationHandler supplied by you; any method call on the proxy instance is dispatched to the invoke method of that handler, where you can determine the behavior you want for the call.
So, for example, you can very easily wrap an implementation of an interface and do some magic before/after the method invocation (and much more, of course). Using this you can get lightweight "aop"-like functionality for method interception.
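The interception idea, sketched as a Python analogy (my own sketch - in Java you would implement InvocationHandler and call Proxy.newProxyInstance instead): every method call funnels through one interception point where you run logic before/after delegating to the real target.

```python
class LoggingProxy:
    """Python analogue of an InvocationHandler: intercepts every method
    call, records it, then delegates to the wrapped target object."""
    def __init__(self, target):
        self._target = target
        self.calls = []                      # record of intercepted method names

    def __getattr__(self, name):
        # Called only for attributes not found on the proxy itself,
        # i.e. everything meant for the target.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr
        def intercepted(*args, **kwargs):
            self.calls.append(name)          # "before" advice
            result = attr(*args, **kwargs)   # delegate to the real implementation
            return result                    # "after" advice could go here
        return intercepted

class Greeter:
    def greet(self, who):
        return f"hello, {who}"

proxy = LoggingProxy(Greeter())
print(proxy.greet("world"))   # behaves like the target...
print(proxy.calls)            # ...but every call was intercepted
```

As with Java dynamic proxies, the wrapper knows nothing about Greeter specifically; one interception point covers every method of the wrapped object.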

A much more informative article on the same topic is http://www.ibm.com/developerworks/java/library/j-jtp08305/index.html

Saturday, September 1, 2012

Java SE 7 new features

JDK7 has got many new features, and folks have done a great job of writing good articles about all of them, so I won't repeat any of that.
For my own quick access, I will keep references to such articles and a quick list of interesting features.

- Type inferencing for generic instance creation
- Strings in switch statements
- try-with-resources and using exception suppression instead of masking it
- Catching multiple exceptions in single catch block
- NIO 2.0 changes
- Fork-Join paradigm

References:
Good article on try-with-resources and exception suppression
http://www.oracle.com/technetwork/articles/java/trywithresources-401775.html

Fork-Join article
http://www.cs.washington.edu/homes/djg/teachingMaterials/spac/grossmanSPAC_forkJoinFramework.html
 http://stackoverflow.com/questions/7926864/how-is-the-fork-join-framework-better-than-a-thread-pool

Misc
http://radar.oreilly.com/2011/09/java7-features.html