Sunday, August 10, 2014

Java: Generating Sequence Diagrams using jcalltracer

Github Repository : https://github.com/tuxdna/jcalltracer

This post is about getting jcalltracer to work (on Ubuntu 12.04, though everything here is pretty general).

I chose it because it works by storing information on disk instead of trying to do everything in memory, so it probably scales a little better. Also, there wasn't a lot of competition; I couldn't find many free alternatives. It was simple to use and worked for me.

Clone the repo. Make sure you have all the prerequisites defined on the repo page.

On calling "make" I got an error, "/usr/bin/ld: cannot find -ldb" .
I had to install libdb-dev to get rid of that error.
sudo apt-get install libdb-dev
See bin/test.sh for instructions on how to use it. It is self-explanatory except for the following..
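For reference, a tiny program to trace might look like the following (this is my own example, not part of the repo; any class under the filtered package works). Compile it with something like "javac -d test-src HelloWorld.java" so the class ends up under test-src.

package com.test;

public class HelloWorld {

    // a nested call, so there is something to show in the sequence diagram
    static String greeting(String name) {
        return "Hello, " + name;
    }

    public static void main(String[] args) {
        System.out.println(greeting("world"));
    }
}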

# Run your java program; change filterList to the package name of interest. Once the JVM exits, this will produce all the necessary raw information.
java -agentpath:./libjct.so=filterList=com.test,traceFile=calltrace.log -cp test-src com.test.HelloWorld
# generate sequence diagram "tree" in xml files, one per thread
bin/tree.rb
# Convert each sequence diagram tree to an image. Instead of using the for loop, you can grep the xml files for threads
# of interest and generate images only for those. You might have to give the JVM more memory using -Xmx for bigger trees.
for f in thread-*.xml
do
    java -jar java/dist/calltrace2seq.jar -i $f  -o output/ -f $f
done 
All the sequence diagrams will now be in the output/ directory.
I found it very hard to read information from the images for big (really big) call chains. The raw tree xml file is more useful when opened in a viewer that can display xml as a tree and lets you expand/collapse elements; I simply used Eclipse to view the xml.

Friday, May 23, 2014

Java SE 8 new features

Just collecting various resources containing updates for Java 8. It is a major overhaul and I am excited about all the new features that are coming, especially lambdas, streams and default methods. They will help write much better and more concise code.

Here is a list from Adam L Davis's book referenced below (a small example touching a few of these follows the list)..

- Lambda expressions and Method references

- Default Methods (Defender methods) in interfaces
With this it may seem that abstract classes are unnecessary, but note that default methods were introduced for interface evolution, and interfaces are still more restricted than abstract classes for the following reasons.
-- Abstract classes can define constructors while interfaces can't.
-- Fields inside an interface are implicitly public, static and final, so interfaces can't have mutable state.


- Static methods in interfaces

- A new Stream API.

- Optional

- A new Date/Time API.

- Nashorn, the new JavaScript engine

- Removal of the Permanent Generation
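
To make a few of these concrete, here is a small self-contained sketch (my own example, not from the book) showing lambdas, method references, streams, Optional, and default/static methods in interfaces:

import java.util.Arrays;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

public class Java8Demo {

    interface Greeter {
        String name();

        // default method: implementing classes get this for free
        default String greet() {
            return "Hello, " + name();
        }

        // static method on the interface itself
        static Greeter of(String name) {
            return () -> name; // lambda implementing the single abstract method
        }
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("alice", "bob", "carol");

        // Stream API + lambda + method reference
        List<String> upper = names.stream()
                .filter(n -> n.length() > 3)       // lambda expression
                .map(String::toUpperCase)          // method reference
                .collect(Collectors.toList());
        System.out.println(upper);                 // [ALICE, CAROL]

        // Optional instead of null checks
        Optional<String> first = upper.stream().findFirst();
        System.out.println(first.orElse("nothing found"));

        // default + static interface methods in action
        System.out.println(Greeter.of("world").greet()); // Hello, world
    }
}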

References:

What's New in Java 8 by Adam L Davis

Oracle's what's new in Java 8 page

Java 8 Docs

Sunday, June 9, 2013

Chi-Square test of independence

Chi-sq test is used to determine whether two discrete random variables X and Y are independent or not.

I will do it here for binary random variables, but the same calculations can be used to figure out independence of general discrete random variables.

Let us define our random variables with an example case for better intuition.
Assume we have N emails, some of which are spam and some are non-spam. Some emails contain the word "millionaire" and some don't. Our task is to find out whether the occurrence of the word "millionaire" in an email and that email being spam are independent or not.
So here is how we define our random variables.

X = 1 when email contains word "millionaire"
X = 0 when email does not contain word "millionaire"

Y = 1 when email is spam
Y = 0 when email is non-spam

Our task stated earlier reduces to determining whether X and Y are independent or not.

Let us look at the contingency table:

           Y=1        Y=0
X=1      N(1,1)     N(1,0)
X=0      N(0,1)     N(0,0)

Let us define some more notation..
N(x,y) = # of emails for which X = x and Y = y in the given N emails
for example N(1,1) is the number of emails that contain the word "millionaire" *and* are spam.

Also, E(x,y) = *expected* # of emails for which X = x and Y = y
clearly, E(x,y) = P(X=x,Y=y).N

Let's say our hypothesis is that X and Y are indeed independent. Formally, in hypothesis-testing lingo, we state the null hypothesis, H0: X and Y are independent.
So, assuming that the null hypothesis is true,

E(x,y) = P(X=x,Y=y).N = P(X=x).P(Y=y).N

We estimate the marginals from the counts:

P(X=x) = [N(X=x,Y=1) + N(X=x,Y=0)]/N

P(Y=y) = [N(X=1,Y=y) + N(X=0,Y=y)]/N

So E(x,y) =  ([N(X=x,Y=1) + N(X=x,Y=0)] * [N(X=1,Y=y) + N(X=0,Y=y)])/N

Now the magic formula:

chi-sq-value = sum over all (x,y) of [N(x,y) - E(x,y)]^2 / E(x,y)
The magical thing about the above formula is that it follows the Chi-sq distribution with 1 degree of freedom [in general, if X could take I different values and Y could take J different values, then it would be a Chi-sq distribution with (I-1).(J-1) degrees of freedom]. I am not proving this fact in this post, and hence call it magical.
However, magic aside, intuitively speaking, it is a measure of discrepancy between the expected counts in the contingency table and the actually observed counts.

Now, using the chi-sq distribution table, we compute the p-value, which is the probability of observing the given chi-sq-value or greater in a random sample of N emails, given that the null hypothesis is true. There are 2 possibilities.

If the p-value is too low (say less than 0.001), the null hypothesis should be rejected, that is, X and Y are dependent.

Otherwise, the null hypothesis can be considered true, that is, X and Y are independent.
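
To make the calculation concrete, here is a small Java sketch with made-up counts (the class name and the numbers are mine, just for illustration):

public class ChiSquareDemo {
    public static void main(String[] args) {
        // N(x,y): rows are X=1, X=0; columns are Y=1, Y=0 (made-up numbers)
        double[][] n = { {40, 10}, {60, 90} };
        double total = 0;
        for (double[] row : n) for (double v : row) total += v;

        double chiSq = 0;
        for (int x = 0; x < 2; x++) {
            for (int y = 0; y < 2; y++) {
                double rowSum = n[x][0] + n[x][1];           // N(X=x,Y=1) + N(X=x,Y=0)
                double colSum = n[0][y] + n[1][y];           // N(X=1,Y=y) + N(X=0,Y=y)
                double expected = rowSum * colSum / total;   // E(x,y) under independence
                chiSq += Math.pow(n[x][y] - expected, 2) / expected;
            }
        }
        System.out.println("chi-sq-value = " + chiSq);
        // Compare against the chi-sq distribution with 1 degree of freedom
        // (e.g. 10.83 corresponds to p = 0.001) to decide whether to reject H0.
    }
}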

Essential information theory [For ML]

These are mostly my notes from reading the information theory section of Machine Learning: A Probabilistic Perspective.

Entropy: Entropy is a measure of the "uncertainty" in a random variable X, and is given by

H(X) = - sum over x of P(X=x)Log(P(X=x))
To gain some intuition, let us consider the Bernoulli random variable.
P(X = 1) = p
P(X = 0) = 1-p
H(X) = - P(1)Log(P(1)) - P(0)Log(P(0))
H(X) = - pLog(p) - (1-p)Log(1-p)  [also called binary entropy function]

Here is what its graph looks like (taken from the book)..


It is clear that entropy, aka uncertainty, is maximum when p = 0.5 (or, in general, when the distribution is uniform), which agrees with the intuition that you are in no position to guess an outcome when every outcome under consideration is equally likely.

Also, the negative of entropy is called "information", which is in some sense the opposite of "uncertainty".

KL Divergence: It is a measure of divergence or "dissimilarity" between two probability distributions, say p and q, and is given by

KL(p || q) = sum over x of p(x)Log(p(x)/q(x))
It can be proved that KL is always greater than or equal to 0, and KL = 0 iff p = q.
Clearly, a higher KL divergence implies higher dissimilarity between the two probability distributions.

Mutual Information(MI): Mutual information is a measure of degree of "correlation" between two random variables X and Y. It is obtained by finding the dissimilarity between P(X,Y) and P(X)P(Y), which is KL divergence.
So, MI, I(X,Y) = KL(P(X,Y) || P(X)P(Y)), that is,

I(X,Y) = sum over x,y of P(X=x,Y=y)Log( P(X=x,Y=y) / (P(X=x)P(Y=y)) )

I(X,Y) = 0 means P(X,Y) = P(X)P(Y), i.e. X and Y are independent; higher MI means higher dissimilarity between P(X,Y) and P(X)P(Y), and hence stronger dependence (correlation) between X and Y.

It can be easily derived that
I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
where H(X|Y) is the "conditional entropy" of X given Y, and is given by

H(X|Y) = sum over y of P(Y=y)H(X|Y=y) = - sum over x,y of P(X=x,Y=y)Log(P(X=x|Y=y))
H(X|Y) can be interpreted as the uncertainty in X that remains after observing Y. So
I(X,Y) = H(X) - H(X|Y) gives us another interpretation of MI, that is, the reduction in uncertainty about X after observing Y.
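
To make these quantities concrete, here is a small Java sketch (my own, using a made-up 2x2 joint distribution) that computes entropy, KL divergence and MI numerically:

public class InfoTheoryDemo {

    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // H(X) for a distribution given as an array of probabilities
    static double entropy(double[] p) {
        double h = 0;
        for (double pi : p) if (pi > 0) h -= pi * log2(pi);
        return h;
    }

    // KL(p || q) for two distributions over the same support
    static double kl(double[] p, double[] q) {
        double d = 0;
        for (int i = 0; i < p.length; i++) if (p[i] > 0) d += p[i] * log2(p[i] / q[i]);
        return d;
    }

    public static void main(String[] args) {
        // joint distribution P(X,Y): rows X=0/1, columns Y=0/1 (made-up numbers)
        double[][] joint = { {0.4, 0.1}, {0.1, 0.4} };

        // marginals P(X) and P(Y)
        double[] px = { joint[0][0] + joint[0][1], joint[1][0] + joint[1][1] };
        double[] py = { joint[0][0] + joint[1][0], joint[0][1] + joint[1][1] };

        // flatten P(X,Y) and P(X)P(Y) so MI is just a KL divergence
        double[] pxy  = { joint[0][0], joint[0][1], joint[1][0], joint[1][1] };
        double[] pxpy = { px[0]*py[0], px[0]*py[1], px[1]*py[0], px[1]*py[1] };

        System.out.println("H(X)   = " + entropy(px));   // binary entropy of P(X)
        System.out.println("I(X,Y) = " + kl(pxy, pxpy)); // MI as KL(P(X,Y) || P(X)P(Y))
    }
}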

Another question is why Pearson's correlation coefficient is not a good enough measure of correlation. It turns out that the correlation coefficient fails terribly at capturing many kinds of relationships, as is evident in the following figure (taken from Wikipedia).


Fig: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).





Thursday, January 24, 2013

Good Unit Tests

Found a good thread on Stack Overflow regarding good unit tests: http://stackoverflow.com/questions/61400/what-makes-a-good-unit-test

Pasting the most important answer from the above thread here..
--------------------------------------------------
Good Tests should be A TRIP (The acronymn isn't sticky enough - I have a printout of the cheatsheet in the book that I had to pull out to make sure I got this right..)
  • Automatic : Invoking of tests as well as checking results for PASS/FAIL should be automatic
  • Thorough: Coverage; Although bugs tend to cluster around certain regions in the code, ensure that you test all key paths and scenarios.. Use tools if you must to know untested regions
  • Repeatable: Tests should produce the same results each time.. every time. Tests should not rely on uncontrollable params.
  • Independent: Very important.
    • Tests should test only one thing at a time. Multiple assertions are okay as long as they are all testing one feature/behavior. When a test fails, it should pinpoint the location of the problem.
    • Tests should not rely on each other - Isolated. No assumptions about order of test execution. Ensure 'clean slate' before each test by using setup/teardown appropriately
  • Professional: In the long run you'll have as much test code as production (if not more), therefore follow the same standard of good-design for your test code. Well factored methods-classes with intention-revealing names, No duplication, tests with good names, etc.
  • Good tests also run Fast. any test that takes over half a second to run.. needs to be worked upon. The longer the test suite takes for a run.. the less frequently it will be run. The more changes the dev will try to sneak between runs.. if anything breaks.. it will take longer to figure out which change was the culprit.
  • Readable : This can be considered part of Professional - however it can't be stressed enough. An acid test would be to find someone who isn't part of your team and asking him/her to figure out the behavior under test within a couple of minutes. Tests need to be maintained just like production code - so make it easy to read even if it takes more effort. Tests should be symmetric (follow a pattern) and concise (test one behavior at a time). Use a consistent naming convention (e.g. the TestDox style). Avoid cluttering the test with "incidental details".. become a minimalist.
Apart from these, most of the others are guidelines that cut down on low-benefit work: e.g. 'Don't test code that you don't own' (e.g. third-party DLLs). Don't go about testing getters and setters. Keep an eye on cost-to-benefit ratio or defect probability.
-----------------------------------------------

Some of my own from experience (not strict rules that I follow, just guidelines)..

- Whenever writing tests, keep in mind how much the test code would change if you changed the implementation (keeping the interface the same) of the method under test. In short, try to test the specification as much as possible.

- Recently, to abstract away a key-value store, I created an interface, KeyValueDb. In addition to the real key-value-store-based implementation, I wrote another one backed by memory. In all the code that uses KeyValueDb, I used the memory-backed implementation in the unit tests, and surprisingly the tests look much cleaner than with mock(KeyValueDb). However, mocks are still required in some places to create scenarios that are hard to create otherwise (e.g. a socket timeout).
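
A rough sketch of the idea (the real KeyValueDb has more operations; the method names here are just illustrative):

import java.util.HashMap;
import java.util.Map;

interface KeyValueDb {
    void put(String key, String value);
    String get(String key);
}

// memory-backed fake used in unit tests instead of a mock
class InMemoryKeyValueDb implements KeyValueDb {
    private final Map<String, String> store = new HashMap<>();

    @Override
    public void put(String key, String value) { store.put(key, value); }

    @Override
    public String get(String key) { return store.get(key); }
}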

Also, though I don't do T.D.D. [neither Design nor Development], writing unit tests while/after I am done coding something gives me confidence that it works. It sure helps catch bugs and sometimes results in good method-level refactorings.

Not tired of reading yet? Here is another interesting post... http://mike-bland.com/2012/07/10/test-mercenaries.html


Sunday, November 11, 2012

linux cmd cheats


# look at traffic on port 80, in ascii format (-A), with unlimited snaplen (-s 0)
tcpdump -i any -s  0 -A tcp port 80

# randomly select N lines from input
shuf -n N input > output

# remove (sed) / select lines containing (grep) non-ascii chars
sed 's/[\d128-\d255]//g' input_file
grep -P '[\x80-\xFF]' input_file

# printing a control character in the shell (for example, to use as a delimiter in the cut command)
Typing Ctrl-v and then Ctrl-a prints ^A

# cat -n prints line numbers as well

# Note that the cut command does not reorder fields even if you say -f2,1; they will still be printed in the order they appear in the input file. You can use awk for that.
# cat data
1,2
3,4

# cut -d ',' -f2,1 data
1,2
3,4

# awk 'BEGIN { FS = "," } { print $2, $1 }' data
2 1
4 3
 

# pretty-print json
echo '{"a":1,"b":2}' | python -mjson.tool

# redirect stderr to stdout and stdout to a file
foo > log_file 2>&1


#convert seconds since epoch to date string
date -d @1396324799



Monday, October 29, 2012

confidence interval - an interpretation

Given a random sample of size n (from an approximately normal population) with sample mean X_Bar and sample standard deviation S..

(X_Bar - mu)/(S/sqrt(n)) ~ t distribution with n-1 degrees of freedom

this fact is used to derive the (1-alpha)100% confidence interval for the population mean (mu), which is..

X_Bar +/- (S/sqrt(n))t_n-1_1-alpha/2

where t_n-1_1-alpha/2 is the (1-alpha/2)th quantile of the t-distribution with n-1 degrees of freedom.
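
As a small numeric sketch of the interval above (assuming Apache Commons Math is on the classpath; the sample values are made up):

import org.apache.commons.math3.distribution.TDistribution;

public class ConfidenceIntervalDemo {
    public static void main(String[] args) {
        double[] sample = { 4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.7, 5.3 };
        int n = sample.length;

        // sample mean X_Bar
        double mean = 0;
        for (double x : sample) mean += x;
        mean /= n;

        // sample standard deviation S (divide by n-1)
        double ss = 0;
        for (double x : sample) ss += (x - mean) * (x - mean);
        double s = Math.sqrt(ss / (n - 1));

        // t_n-1_1-alpha/2 for a 95% interval (alpha = 0.05)
        double alpha = 0.05;
        double t = new TDistribution(n - 1).inverseCumulativeProbability(1 - alpha / 2);

        double halfWidth = t * s / Math.sqrt(n);
        System.out.println("95% CI for mu: [" + (mean - halfWidth) + ", " + (mean + halfWidth) + "]");
    }
}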

Note that this interval is random: if you take multiple samples of the same size, the value of this interval will most likely be different for different samples from the same population. However, the true population mean (mu) has a fixed value.

One interpretation of a confidence interval is that if you take many samples and compute, say, a 95% confidence interval from each one, then about 95% of those intervals will contain mu.