Sunday, June 9, 2013

Chi-Square test of independence

Chi-sq test is used to determine whether two discrete random variables X and Y are independent or not.

I will do it here for binary random variables, but same calculations can be used to figure out independence of general discrete random variables.

Let us define our random variables with an example case for better intuition.
Assume, we have N emails, some of which are spam and some are non-spam. Some emails contain the word "millionaire" and some don't. Our task is to find out whether occurrence of word "millionaire" in an email and that email being a spam are independent of not.
So here is how, we define our random variables.

X = 1 when email contains word "millionaire"
X = 0 when email does not contain word "millionaire"

Y = 1 when email is spam
Y = 0 when email is non-spam

Our task stated earlier reduces to determining whether X and Y are independent or not.

Let us look at the contingency table:


Let us define some more notation..
N(x,y) = # of emails for which X = x and Y = y in the given N emails
for example N(1,1) is the number of emails that contain the word "millionaire" *and* are spam.

Also, E(x,y) = *expected* # of emails for which X = x and Y = y
clearly, E(x,y) = P(X=x,Y=y).N

Let say, our hypothesis is that X and Y are indeed independent. Formally, in hypothesis testing lingo, we say null hypothesis, H0: X and Y are independent.
So, assuming that null hypothesis is true

E(x,y) = P(X=x,Y=y).N = P(X=x).P(Y=y).N

estimate P(X=x) = [N(X=x,Y=1) + N(X=x,Y=0)]/N

P(Y=y) = [N(X=1,Y=y) + N(X=0,Y=y)]/N

So E(x,y) =  ([N(X=x,Y=1) + N(X=x,Y=0)] * [N(X=1,Y=y) + N(X=0,Y=y)])/N

Now the magic formula, chi-sq-value =

Magical thing about above formula is that it follows Chi-sq distribution with 1-degrees of freedom [In general if X could take I different values and Y could take J different values then it would be Chi-sq distribution with (I-1).(J-1) degrees of freedom]. I am not proving this fact in this post, and hence called it magical.
However, magic aside, intuitively speaking, it is a measure of discrepancy between expected values in the contingency table and actually observed values.

Now, using chi-sq distribution table, we compute the p-value which is probability of observing given chi-sq-value or greater in a random sample of N emails given that null hypothesis is true. There are 2 possibilities.

p-value is too low (say less than 0.001), it means null hypothesis should be rejected, that is, X and Y are dependent.

Or else, null hypothesis can be considered true, that is, X and Y are independent.

Essential information theory [For ML]

These are mostly my notes from reading of information theory section from Machine Learning : A Probabilistic Perspective.

Entropy: Entropy is the measurement of "uncertainity" in a random variable X. And, is given by

To gain some intuition, let us consider the Bernoulli random variable.
P(X = 1) = p
P(X = 0) = 1-p
H(X) = - P(1)Log(P(1)) - P(0)Log(P(0))
H(X) = - pLog(p) - (1-p)Log(1-p)  [also called binary entropy function]

Here is what it's graph looks like(taken from book)..

It is clear that entropy aka uncertainty is maximum when p = 0.5 (or the distribution is uniform in general), which is in agreement with intuition that you are in no position to guess an event when the probability of it happening is exactly same as each other event in consideration.

Also, negative of entropy is called "Information" which is in some sense an opposite of "uncertainty".

KL Divergence: It is a measure of divergence or "dissimilarity" between two probability distributions, say p and q. And, is given by

It can be proved that KL is always greater than or equal to 0. And, KL = 0 iff p = q.
Clearly, higher the KL divergence implies higher dissimilarity between two probability distributions.

Mutual Information(MI): Mutual information is a measure of degree of "correlation" between two random variables X and Y. It is obtained by finding the dissimilarity between P(X,Y) and P(X)P(Y), which is KL divergence.
So, MI, I(X,Y) = KL(P(X,Y) || P(X)P(Y))

I(X,Y) = 0 means X and Y are essentially the same and higher MI means higher dissimilarity and hence higher correlation.

It can be easily derived that
I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
where H(X|Y) is "conditional entropy" of X given Y, and is given by

H(X|Y) can be interpreted as uncertainty in X after observing Y. So
I(X,Y) = H(X) - H(X|Y) gives us another interpretation of MI, that is, reduction in uncertainty of X after observing Y.

Another question is that why Pearson's correlation coefficient is not a good enough measurement of correlation. It turns out that correlation coefficient fails terribly on capturing many kind of relationships as is evident in following figure (taken from wikipedia)

Fig: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the non-linearity and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).

Thursday, January 24, 2013

Good Unit Tests

Found a good thread on stackoverflow regarding the good unit tests :

 Pasting the most important ans from above thread here..
Good Tests should be A TRIP (The acronymn isn't sticky enough - I have a printout of the cheatsheet in the book that I had to pull out to make sure I got this right..)
  • Automatic : Invoking of tests as well as checking results for PASS/FAIL should be automatic
  • Thorough: Coverage; Although bugs tend to cluster around certain regions in the code, ensure that you test all key paths and scenarios.. Use tools if you must to know untested regions
  • Repeatable: Tests should produce the same results each time.. every time. Tests should not rely on uncontrollable params.
  • Independent: Very important.
    • Tests should test only one thing at a time. Multiple assertions are okay as long as they are all testing one feature/behavior. When a test fails, it should pinpoint the location of the problem.
    • Tests should not rely on each other - Isolated. No assumptions about order of test execution. Ensure 'clean slate' before each test by using setup/teardown appropriately
  • Professional: In the long run you'll have as much test code as production (if not more), therefore follow the same standard of good-design for your test code. Well factored methods-classes with intention-revealing names, No duplication, tests with good names, etc.
  • Good tests also run Fast. any test that takes over half a second to run.. needs to be worked upon. The longer the test suite takes for a run.. the less frequently it will be run. The more changes the dev will try to sneak between runs.. if anything breaks.. it will take longer to figure out which change was the culprit.
  • Readable : This can be considered part of Professional - however it can't be stressed enough. An acid test would be to find someone who isn't part of your team and asking him/her to figure out the behavior under test within a couple of minutes. Tests need to be maintained just like production code - so make it easy to read even if it takes more effort. Tests should be symmetric (follow a pattern) and concise (test one behavior at a time). Use a consistent naming convention (e.g. the TestDox style). Avoid cluttering the test with "incidental details".. become a minimalist.
Apart from these, most of the others are guidelines that cut down on low-benefit work: e.g. 'Don't test code that you don't own' (e.g. third-party DLLs). Don't go about testing getters and setters. Keep an eye on cost-to-benefit ratio or defect probability.

Some of my own from experience(not strict rules that I follow but just guidelines)..

- Whenever writing tests, keep in mind, how much the test code changes if you change the implementation(keeping the interface same) of your method under test. In short, try to test specification as much possible.

- Recently, to abstract away a key-value store, I created an interface, KeyValueDb. In addition to the real key-value store based implementation I wrote another one which is backed by memory. All the code that uses KeyValueDb, I used memory-backed implementation in the unit tests and surprisingly tests look much cleaner than using mock(KeyValueDb). However, mocks are required in some places to create scenarios hard to create otherwise(e.g. socket timeout).

Also, though I don't do T.D.D.[neither Design nor Development], but writing unit tests while/after I am done coding something gives me confidence that it works. It sure helps catch bugs and sometimes results in good method level refactorings .

Not tired of reading yet? Here is another intersting post... .