Given sample data of two random variables X and Y, when is correlation analysis between them valid?
1. X and Y both have a normal distribution. To do correlation analysis you need data points from the whole spectrum of possible values. For example, if all the X values in the sample are close to a single value, say 0 (whereas the possible values of X range from, say, -500 to +500), then no matter how big the sample is, you are not going to get meaningful correlation analysis results.
How to see: Draw a histogram of each sample and check whether it is close to a bell curve. Also look at the summary statistics for the samples.
2. X and Y have a linear relationship.
How to see: Examine the scatterplot of Y vs X, and plot the histogram of the residuals (prediction errors from a linear model).
3. Homoskedasticity (the spread of the residuals is the same across all values of X)
How to see: Plot residuals vs X values. There should be no relationship; the residuals should look random.
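Checks 2 and 3 can also be done numerically when plotting is inconvenient. The following is a minimal sketch on synthetic data using only the Python standard library. The data-generating line y = 2x + 1 + noise, the helper name fit_line, and the idea of comparing residual spread between the lower and upper half of X are illustrative assumptions, not a standard named test.

```python
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Synthetic example data (an assumption): X ~ N(0, 1), Y linear in X plus noise.
xs = [rng.gauss(0, 1) for _ in range(500)]
ys = [2 * x + 1 + rng.gauss(0, 0.5) for x in xs]

slope, intercept = fit_line(xs, ys)
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_mean = statistics.fmean(residuals)

# Homoskedasticity check: residual spread should be similar for low and high X.
pairs = sorted(zip(xs, residuals))
half = len(pairs) // 2
sd_low = statistics.stdev([r for _, r in pairs[:half]])
sd_high = statistics.stdev([r for _, r in pairs[half:]])
spread_ratio = sd_high / sd_low

print(f"slope={slope:.2f}, intercept={intercept:.2f}")
print(f"residual mean={residual_mean:.3f}")
print(f"residual SD ratio (high X / low X)={spread_ratio:.2f}")
```

A spread ratio close to 1 is consistent with homoskedasticity; a ratio far from 1 suggests the residual variance changes with X and is worth a closer look on the residual plot.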
Reliability of the correlation coefficient?
One approach is to use null hypothesis significance testing. For example, H0 is that there is no correlation. Find the p-value and either reject or fail to reject the null hypothesis. You can also look at the confidence interval for the correlation coefficient.
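As a rough illustration of both ideas, the sketch below computes an approximate p-value and confidence interval for a correlation coefficient. It uses the Fisher z-transformation (a normal approximation, chosen here because it needs only the standard library), not the exact t-based test. The function names pearson_r and fisher_test and the synthetic data are assumptions made for this example.

```python
import math
import random
from statistics import NormalDist, fmean

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def fisher_test(r, n, confidence=0.95):
    """Approximate two-sided p-value for H0: rho = 0, and a CI for rho."""
    z = math.atanh(r)            # Fisher transformation of r
    se = 1 / math.sqrt(n - 3)    # approximate standard error of z
    std_normal = NormalDist()
    p_value = 2 * (1 - std_normal.cdf(abs(z) / se))
    z_crit = std_normal.inv_cdf(0.5 + confidence / 2)
    ci = (math.tanh(z - z_crit * se), math.tanh(z + z_crit * se))
    return p_value, ci

rng = random.Random(1)
# Synthetic data (an assumption): Y depends weakly on X.
xs = [rng.gauss(0, 1) for _ in range(200)]
ys = [0.5 * x + rng.gauss(0, 1) for x in xs]

r = pearson_r(xs, ys)
p_value, ci = fisher_test(r, len(xs))
print(f"r={r:.3f}, p-value={p_value:.2g}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```

If the confidence interval excludes 0, the same data would also reject H0 at the matching significance level, so the two views agree.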
Notes on Sampling
In most statistical experiments, it is not possible to gather data from the whole population, so you gather data from a "random representative sample" which is "large enough" to give a decent estimate of the parameters.
So there will be a difference between the actual population parameter (say, the mean) and the same quantity measured from the sample. This difference is called the sampling error. Usually we don't know the actual population parameter value (in this case, the true mean of the population), so how do we estimate how large our sampling error is? (The estimate of the typical size of the sampling error is called the standard error, SE.)
SE of estimated mean = SD/sqrt(N)
where SD is the standard deviation of the sample and N is the sample size.
But how is the above formula obtained?
Let's say we take many, many samples from the population and find the mean of each sample. These estimated means will generally differ from sample to sample. In other words, the estimated means coming from different samples have a probability distribution, called the distribution of the sample mean (you can get some idea of it by plotting the histogram, called the probability histogram, of all the different sample means). The mean of the distribution of sample means is equal to the true mean of the population, and the SE is the standard deviation of this probability histogram.
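The formula itself follows in one step from the rules for the variance of a sum. A short derivation, assuming the N observations X_1, ..., X_N are independent draws from a population with variance sigma^2:

```latex
\operatorname{Var}(\bar{X})
  = \operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N} X_i\right)
  = \frac{1}{N^2}\sum_{i=1}^{N}\operatorname{Var}(X_i)
  = \frac{\sigma^2}{N}
\quad\Longrightarrow\quad
\operatorname{SD}(\bar{X}) = \frac{\sigma}{\sqrt{N}} \approx \frac{SD}{\sqrt{N}}
```

The last step replaces the unknown sigma with the sample standard deviation SD, which gives the SE formula.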
By the "central limit theorem", the distribution of the sample mean is approximately normal as long as the sample size is large enough (a common rule of thumb is N > 30), and it is exactly normal if the random variable in question is itself normally distributed.
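The claims about the distribution of the sample mean can be checked with a small simulation. This is a sketch under stated assumptions: the population is taken to be exponential with rate 1 (so the true mean and true SD are both 1, and the population is clearly not normal), and the sample size and number of repetitions are arbitrary choices.

```python
import math
import random
import statistics

rng = random.Random(42)
N = 50              # size of each sample
NUM_SAMPLES = 5000  # number of repeated samples

# Assumed population: exponential with rate 1, so true mean = 1 and true SD = 1.
sample_means = []
for _ in range(NUM_SAMPLES):
    sample = [rng.expovariate(1.0) for _ in range(N)]
    sample_means.append(statistics.fmean(sample))

mean_of_means = statistics.fmean(sample_means)  # should be close to 1
sd_of_means = statistics.stdev(sample_means)    # should be close to the SE
theoretical_se = 1.0 / math.sqrt(N)             # sigma / sqrt(N)

# Share of sample means within 2 SE of the true mean; about 0.954 if normal.
within_2se = sum(
    abs(m - 1.0) <= 2 * theoretical_se for m in sample_means
) / NUM_SAMPLES

print(f"mean of sample means = {mean_of_means:.3f}")
print(f"SD of sample means   = {sd_of_means:.4f} (sigma/sqrt(N) = {theoretical_se:.4f})")
print(f"share within 2 SE    = {within_2se:.3f}")
```

Even though the population is skewed, the sample means cluster around the true mean with a spread close to sigma/sqrt(N), which is what the central limit theorem predicts for a sample size of this order.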
Notes on hypothesis testing (taking the sample mean as an example):
p-value = P(sample data at least as extreme as observed | H0 is true)
In simple terms, it is the probability of observing the given sample, or one more extreme, given that the null hypothesis is true. Hence, if the p-value is low (e.g. less than a chosen threshold such as 0.05), we can reject the null hypothesis.
Usually, the approximate probability distribution of the sample statistic given H0 is known (the sampling distribution), and we use that distribution to calculate the p-value.
For example, the sampling distribution of the mean is normal. That is,
(X_bar - mu)/(sigma/sqrt(n)) ~ standard normal distribution(Z)
where
X_bar : sample mean
mu : true population mean
sigma : true population standard deviation
n : sample size
Let's say H0 is that the true population mean (mu) is 0. Then
z-score = (X_bar - 0)/(sigma/sqrt(n))
which is basically how many standard errors the sample mean is away from the hypothesized true mean. If the absolute value of the z-score is large, the observed sample is unlikely under H0, which suggests that H0 is not true and should be rejected.
Now, let us try to understand how the z-score (in general, the test statistic) and the p-value are related. To understand that, let us look at the standard normal distribution.
It is clear from the picture above that if H0 is true, then with a 95.4% chance we should get an absolute z-score less than 2 (in fact, [X_bar - 2*sigma/sqrt(n), X_bar + 2*sigma/sqrt(n)] is called the 95.4% confidence interval for the true population mean).
With the remaining 4.6% (= 100 - 95.4) chance we observe a sample whose z-score is not in [-2, +2]. So if the observed z-score is exactly 2 (or -2), the probability of observing this sample data or something more extreme is 0.046, and that is our (two-sided) p-value.
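The link between the z-score and the p-value can be made concrete with the standard normal CDF. A minimal sketch, assuming sigma is known; the function name z_test and the numbers passed to it are hypothetical, picked so that the z-score comes out as 2.

```python
import math
from statistics import NormalDist

def z_test(sample_mean, mu0, sigma, n):
    """Two-sided z-test for H0: population mean == mu0, with known sigma."""
    se = sigma / math.sqrt(n)                     # standard error of the mean
    z = (sample_mean - mu0) / se                  # standard errors away from mu0
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # area in both tails beyond |z|
    return z, p_value

# Hypothetical numbers chosen so that the z-score is 2.
z, p = z_test(sample_mean=0.4, mu0=0.0, sigma=2.0, n=100)
print(f"z = {z:.2f}, p-value = {p:.4f}")

# Probability mass inside [-2, +2] under the standard normal distribution.
inside = NormalDist().cdf(2) - NormalDist().cdf(-2)
print(f"P(-2 <= Z <= 2) = {inside:.4f}")
```

The two printed probabilities add up to 1: the p-value for |z| = 2 is exactly the tail area left over outside [-2, +2].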
Since we are talking about the sample mean, one more thing is worth mentioning: usually sigma is not known, and we approximate it by the sample standard deviation, i.e. the square root of the MSS (mean sum of squared deviations, with n-1 in the denominator). The resulting statistic is still approximately standard normal as long as the sample size is large. Otherwise (assuming the data are roughly normal) it follows a t distribution with n-1 degrees of freedom, so we use a t-score instead of a z-score, and the p-value, confidence interval, etc. are calculated from the t distribution.
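To see the t-score version in action without extra dependencies, the sketch below computes the t statistic and gets the p-value by numerically integrating the t density with Simpson's rule. In practice a library routine (for example scipy.stats.ttest_1samp) would normally be used instead. The function names, the sample values, and the integration step count are assumptions for illustration, and the math.gamma call limits this sketch to moderate degrees of freedom.

```python
import math
import statistics

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3      # P(0 <= T <= |t|)
    return 2 * (0.5 - area)   # both tails beyond |t|

def one_sample_t_test(sample, mu0):
    """t-score and two-sided p-value for H0: population mean == mu0."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample SD with the n - 1 denominator
    t = (statistics.fmean(sample) - mu0) / (s / math.sqrt(n))
    return t, t_two_sided_p(t, n - 1)

# Hypothetical small sample, used only to exercise the functions.
sample = [0.8, -0.3, 1.2, 0.5, 0.9, -0.1, 1.4, 0.7, 0.2, 1.1]
t, p = one_sample_t_test(sample, mu0=0.0)
print(f"t = {t:.3f}, df = {len(sample) - 1}, p-value = {p:.4f}")
```

With only 10 observations the t distribution has noticeably heavier tails than the normal, so the same score gives a larger p-value than the z-based calculation would.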