Sunday, June 9, 2013

Chi-Square test of independence

The chi-sq test is used to determine whether two discrete random variables X and Y are independent.

I will do it here for binary random variables, but the same calculations can be used to test independence of general discrete random variables.

Let us define our random variables with an example case for better intuition.
Assume we have N emails, some of which are spam and some of which are non-spam. Some emails contain the word "millionaire" and some don't. Our task is to find out whether the occurrence of the word "millionaire" in an email and that email being spam are independent or not.
So here is how we define our random variables.

X = 1 when email contains word "millionaire"
X = 0 when email does not contain word "millionaire"

Y = 1 when email is spam
Y = 0 when email is non-spam

Our task stated earlier reduces to determining whether X and Y are independent or not.

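Let us look at the contingency table, where each cell holds the number of emails with that combination of X and Y (the N(x,y) notation is defined just below):

                             Y = 1 (spam)    Y = 0 (non-spam)
X = 1 (has "millionaire")    N(1,1)          N(1,0)
X = 0 (no "millionaire")     N(0,1)          N(0,0)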

Let us define some more notation.
N(x,y) = # of emails for which X = x and Y = y in the given N emails
For example, N(1,1) is the number of emails that contain the word "millionaire" *and* are spam.
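
To make the notation concrete, here is a minimal sketch of how the counts N(x,y) could be tallied. It assumes each email has already been reduced to an (x, y) pair; the `emails` list and the `count_table` helper are made up for illustration and are not part of the original example.

```python
from collections import Counter

# Hypothetical toy data: one (x, y) pair per email, where
# x = 1 if the email contains "millionaire" and y = 1 if it is spam.
emails = [(1, 1), (1, 1), (1, 0), (0, 1), (0, 0), (0, 0)]


def count_table(pairs):
    """Return a dict mapping (x, y) to N(x, y), with 0 for empty cells."""
    tally = Counter(pairs)
    return {(x, y): tally[(x, y)] for x in (0, 1) for y in (0, 1)}


counts = count_table(emails)
print("N(1,1) =", counts[(1, 1)])
print("N =", sum(counts.values()))
```

The four values in `counts` are exactly the four cells of the 2x2 contingency table, and they add up to N.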

Also, E(x,y) = *expected* # of emails for which X = x and Y = y.
Clearly, E(x,y) = P(X=x,Y=y).N

Let us say our hypothesis is that X and Y are indeed independent. Formally, in hypothesis testing lingo, we state the null hypothesis, H0: X and Y are independent.
So, assuming that the null hypothesis is true,

E(x,y) = P(X=x,Y=y).N = P(X=x).P(Y=y).N

We estimate the marginal probabilities from the observed counts:

P(X=x) = [N(x,1) + N(x,0)]/N

P(Y=y) = [N(1,y) + N(0,y)]/N

So E(x,y) = ([N(x,1) + N(x,0)] * [N(1,y) + N(0,y)])/N, that is, (row total * column total)/N.
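
As a quick check of the formula for E(x,y), here is a small sketch that computes the expected counts from a 2x2 table. The numbers in `observed` are hypothetical, chosen only so that the arithmetic is easy to follow, and `expected_counts` is a name I made up for this post.

```python
# Hypothetical observed counts N(x, y), keyed by (x, y). Here N = 100.
observed = {(1, 1): 30, (1, 0): 10, (0, 1): 20, (0, 0): 40}


def expected_counts(obs):
    """Expected counts under independence: (row total * column total) / N."""
    n = sum(obs.values())
    row = {x: obs[(x, 1)] + obs[(x, 0)] for x in (0, 1)}  # N * P(X=x)
    col = {y: obs[(1, y)] + obs[(0, y)] for y in (0, 1)}  # N * P(Y=y)
    return {(x, y): row[x] * col[y] / n for x in (0, 1) for y in (0, 1)}


expected = expected_counts(observed)
for cell in sorted(expected):
    print(cell, expected[cell])
```

With these counts the row totals are 40 and 60 and the column totals are 50 and 50, so for instance E(1,1) = 40 * 50 / 100 = 20. Note that the expected counts always add up to N, just like the observed ones.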

Now the magic formula:

chi-sq-value = sum over all four cells (x,y) of [N(x,y) - E(x,y)]^2 / E(x,y)

The magical thing about the above quantity is that, when the null hypothesis is true and N is reasonably large, it approximately follows a chi-sq distribution with 1 degree of freedom [in general, if X could take I different values and Y could take J different values, then it would be a chi-sq distribution with (I-1).(J-1) degrees of freedom]. I am not proving this fact in this post, hence I call it magical.
However, magic aside, intuitively speaking, it is a measure of discrepancy between expected values in the contingency table and actually observed values.
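
The discrepancy measure can be sketched in a few lines. This uses the standard Pearson form, the sum over cells of (observed - expected)^2 / expected, with the expected counts computed as (row total * column total)/N. The `observed` counts are hypothetical and `chi_sq_value` is just an illustrative name; the sketch also assumes no row or column total is zero.

```python
# Hypothetical observed counts N(x, y), keyed by (x, y).
observed = {(1, 1): 30, (1, 0): 10, (0, 1): 20, (0, 0): 40}


def chi_sq_value(obs):
    """Sum of (observed - expected)^2 / expected over the four cells."""
    n = sum(obs.values())
    total = 0.0
    for x in (0, 1):
        for y in (0, 1):
            row_total = obs[(x, 1)] + obs[(x, 0)]
            col_total = obs[(1, y)] + obs[(0, y)]
            e = row_total * col_total / n  # E(x, y) under independence
            total += (obs[(x, y)] - e) ** 2 / e
    return total


print(chi_sq_value(observed))
```

A table whose observed counts match the expected counts exactly gives a value of 0, and the value grows as the observed counts move away from what independence would predict.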

Now, using a chi-sq distribution table, we compute the p-value, which is the probability of observing the given chi-sq-value or a greater one in a random sample of N emails, given that the null hypothesis is true. There are 2 possibilities.

The p-value is very low (say less than 0.001): the null hypothesis should be rejected, that is, we conclude that X and Y are dependent.

Or else, we fail to reject the null hypothesis: the data do not give enough evidence against independence. Strictly speaking, this does not prove that X and Y are independent; it only means that the observed counts are consistent with independence.
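
Instead of a printed table, the p-value for the binary case can be computed directly. With 1 degree of freedom, a chi-sq variable is the square of a standard normal variable, so P(chi-sq >= v) = erfc(sqrt(v/2)). The sketch below relies on that identity and therefore works only for the 2x2 case; the function names, the example value and the 0.001 cutoff (the one used above) are illustrative assumptions.

```python
import math


def p_value_1dof(chi_sq):
    """Upper-tail probability of a chi-sq distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi_sq / 2.0))


def decide(chi_sq, alpha=0.001):
    """Return a verdict string for the independence test at level alpha."""
    if p_value_1dof(chi_sq) < alpha:
        return "reject H0: X and Y look dependent"
    return "fail to reject H0: no evidence against independence"


# Hypothetical chi-sq-value, e.g. obtained from a 2x2 table of email counts.
value = 50.0 / 3.0
print(p_value_1dof(value))
print(decide(value))
```

For tables larger than 2x2 this shortcut no longer applies, and one would use a chi-sq table or a statistics library with (I-1).(J-1) degrees of freedom.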
