Sunday, June 9, 2013

Chi-Square test of independence

The chi-square test is used to determine whether two discrete random variables X and Y are independent or not.

I will do it here for binary random variables, but the same calculations can be used to figure out independence of general discrete random variables.

Let us define our random variables with an example case for better intuition.
Assume we have N emails, some of which are spam and some are not. Some emails contain the word "millionaire" and some don't. Our task is to find out whether the occurrence of the word "millionaire" in an email and that email being spam are independent or not.
So here is how we define our random variables.

X = 1 when email contains word "millionaire"
X = 0 when email does not contain word "millionaire"

Y = 1 when email is spam
Y = 0 when email is non-spam

Our task stated earlier reduces to determining whether X and Y are independent or not.

Let us look at the contingency table:

        Y=1       Y=0
X=1     N(1,1)    N(1,0)
X=0     N(0,1)    N(0,0)

Let us define some more notation.
N(x,y) = # of emails for which X = x and Y = y among the given N emails
For example, N(1,1) is the number of emails that contain the word "millionaire" *and* are spam.

Also, E(x,y) = *expected* # of emails for which X = x and Y = y
Clearly, E(x,y) = P(X=x,Y=y).N

Let's say our hypothesis is that X and Y are indeed independent. Formally, in hypothesis-testing lingo, we state the null hypothesis, H0: X and Y are independent.
So, assuming that the null hypothesis is true,

E(x,y) = P(X=x,Y=y).N = P(X=x).P(Y=y).N

We estimate the marginals from the data:
P(X=x) = [N(X=x,Y=1) + N(X=x,Y=0)]/N
P(Y=y) = [N(X=1,Y=y) + N(X=0,Y=y)]/N

So E(x,y) =  ([N(X=x,Y=1) + N(X=x,Y=0)] * [N(X=1,Y=y) + N(X=0,Y=y)])/N
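For a quick made-up example: if N = 100, N(1,1) = 30, N(1,0) = 10, N(0,1) = 20 and N(0,0) = 40, then E(1,1) = ([30 + 10] * [30 + 20])/100 = (40 * 50)/100 = 20, even though the observed count N(1,1) is 30.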

Now the magic formula:

chi-sq-value = Σ over all cells (x,y) of [N(x,y) - E(x,y)]^2 / E(x,y)
The magical thing about the above formula is that it follows the chi-sq distribution with 1 degree of freedom [in general, if X could take I different values and Y could take J different values, then it would be a chi-sq distribution with (I-1).(J-1) degrees of freedom]. I am not proving this fact in this post, and hence called it magical.
However, magic aside, intuitively speaking, it is a measure of the discrepancy between the expected counts in the contingency table and the actually observed counts.

Now, using the chi-sq distribution table, we compute the p-value, which is the probability of observing the given chi-sq-value or a greater one in a random sample of N emails, given that the null hypothesis is true. There are 2 possibilities.

If the p-value is too low (say less than 0.001), the null hypothesis should be rejected, that is, X and Y are dependent.

Or else, the null hypothesis cannot be rejected, that is, X and Y can be considered independent.
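To make the procedure concrete, here is a minimal Python sketch (not part of the original post) that runs the whole test on the made-up counts used above; scipy's chi2 distribution is used only to turn the statistic into a p-value.

```python
import numpy as np
from scipy.stats import chi2

# Made-up counts: rows are X (word "millionaire" present / absent),
# columns are Y (spam / non-spam)
observed = np.array([[30, 10],   # N(1,1), N(1,0)
                     [20, 40]])  # N(0,1), N(0,0)
N = observed.sum()

# Expected counts under H0 (independence): row total * column total / N
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / N

# chi-sq-value = sum of (observed - expected)^2 / expected over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()

# For a 2x2 table, degrees of freedom = (2-1)*(2-1) = 1
p_value = chi2.sf(chi_sq, df=1)

print(chi_sq, p_value)  # reject H0 (independence) if p_value is very small
```

As a sanity check, scipy.stats.chi2_contingency(observed, correction=False) should report the same statistic and p-value.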

Essential information theory [For ML]

These are mostly my notes from reading the information theory section of Machine Learning: A Probabilistic Perspective.

Entropy: Entropy is the measurement of "uncertainty" in a random variable X, and is given by

H(X) = - Σ over x of P(X=x) Log(P(X=x))
To gain some intuition, let us consider the Bernoulli random variable.
P(X = 1) = p
P(X = 0) = 1-p
H(X) = - P(1)Log(P(1)) - P(0)Log(P(0))
H(X) = - pLog(p) - (1-p)Log(1-p)  [also called binary entropy function]

Here is what its graph looks like (taken from the book):


It is clear that entropy, aka uncertainty, is maximum when p = 0.5 (or, in general, when the distribution is uniform), which agrees with the intuition that you are in no position to guess an outcome when every outcome is equally likely.
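As a quick numerical check (my own sketch, not from the book), the following Python snippet computes the binary entropy for a few values of p and confirms it peaks at p = 0.5.

```python
import numpy as np

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) random variable."""
    if p in (0.0, 1.0):        # 0*Log(0) is taken as 0 by convention
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for p in [0.1, 0.3, 0.5, 0.7, 0.9]:
    print(p, round(binary_entropy(p), 3))
# The printed values peak at p = 0.5, where H = 1 bit.
```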

Also, the negative of entropy is sometimes called "information", which is in some sense the opposite of "uncertainty".

KL Divergence: It is a measure of divergence or "dissimilarity" between two probability distributions, say p and q, and is given by

KL(p || q) = Σ over x of p(x) Log( p(x) / q(x) )
It can be proved that KL is always greater than or equal to 0, and KL = 0 iff p = q.
Clearly, a higher KL divergence implies a higher dissimilarity between the two probability distributions.
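Here is a minimal sketch of that computation, with two made-up discrete distributions:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) in bits for two discrete distributions on the same support."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                      # terms with p(x) = 0 contribute 0
    return np.sum(p[mask] * np.log2(p[mask] / q[mask]))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))   # > 0, since p and q differ
print(kl_divergence(p, p))   # 0.0, since KL(p || p) = 0
```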

Mutual Information (MI): Mutual information is a measure of the degree of "correlation" between two random variables X and Y. It is obtained by measuring the dissimilarity between P(X,Y) and P(X)P(Y) using KL divergence.
So, MI, I(X,Y) = KL(P(X,Y) || P(X)P(Y))
= Σ over x,y of P(X=x,Y=y) Log[ P(X=x,Y=y) / (P(X=x).P(Y=y)) ]
I(X,Y) = 0 means P(X,Y) = P(X)P(Y), that is, X and Y are independent; a higher MI means the joint distribution is more dissimilar from the product of the marginals, and hence X and Y are more strongly dependent.

It can be easily derived that
I(X,Y) = H(X) - H(X|Y) = H(Y) - H(Y|X)
where H(X|Y) is the "conditional entropy" of X given Y, and is given by

H(X|Y) = Σ over y of P(Y=y) H(X|Y=y) = - Σ over x,y of P(X=x,Y=y) Log(P(X=x|Y=y))
H(X|Y) can be interpreted as the uncertainty left in X after observing Y. So
I(X,Y) = H(X) - H(X|Y) gives us another interpretation of MI, that is, the reduction in uncertainty about X after observing Y.
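Here is a small sketch (with a made-up joint distribution) that computes I(X,Y) directly from the definition and checks it against H(X) - H(X|Y).

```python
import numpy as np

# Made-up joint distribution P(X=x, Y=y) for binary X (rows) and Y (columns)
joint = np.array([[0.30, 0.10],
                  [0.20, 0.40]])
px = joint.sum(axis=1)   # marginal P(X)
py = joint.sum(axis=0)   # marginal P(Y)

# Mutual information: KL(P(X,Y) || P(X)P(Y))
mi = np.sum(joint * np.log2(joint / np.outer(px, py)))

# Cross-check: I(X,Y) = H(X) - H(X|Y)
h_x = -np.sum(px * np.log2(px))
h_x_given_y = -np.sum(joint * np.log2(joint / py))   # joint / py = P(X|Y), column-wise
print(mi, h_x - h_x_given_y)   # the two numbers agree
```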

Another question is why Pearson's correlation coefficient is not a good enough measure of correlation. It turns out that the correlation coefficient fails terribly at capturing many kinds of relationships, as is evident in the following figure (taken from Wikipedia).


Fig: Several sets of (x, y) points, with the correlation coefficient of x and y for each set. Note that the correlation reflects the noisiness and direction of a linear relationship (top row), but not the slope of that relationship (middle), nor many aspects of nonlinear relationships (bottom).
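As a quick illustration of the same point (a made-up sketch, not part of the figure): for y = x^2 on a symmetric range, the Pearson correlation is essentially zero even though y is completely determined by x, while a crude histogram-based mutual information estimate is clearly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
y = x ** 2                         # perfect, but non-linear, dependence

print(np.corrcoef(x, y)[0, 1])     # Pearson correlation, close to 0

# Crude MI estimate: discretize into bins and apply the definition
joint, _, _ = np.histogram2d(x, y, bins=10)
joint /= joint.sum()
px, py = joint.sum(axis=1), joint.sum(axis=0)
nz = joint > 0                     # skip empty cells (they contribute 0)
mi = np.sum(joint[nz] * np.log2(joint[nz] / np.outer(px, py)[nz]))
print(mi)                          # clearly greater than 0
```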