## Saturday, February 26, 2011

### OCW-18.443: problem-set#3

My solutions to 3rd problem set in MIT-Statistics for Applications course.

Most of the solutions are done using R(version 2.12.1).

Note: In this problem set, if you don't understand the meaning of y-on-x, x-on-y and bfsd regression, then you should read the handouts given for this assignment.

1(a)
y - distance
x - year
years are constant and also not measured using some instrument(and hence no chances of error), so clearly year is not a random variable. Hence the most suitable regression here is y-on-x.

1(b) Here is the R-terminal interaction.
>
> year = c(1952,1956,1960,1964,1968,1972,1976,1980,1984)
> distance = c(758.3,784.4,813.6,808.5,891.9,825.7,835.9,855.6,855.6)
>
> plot(year,distance) # draw distance(y-axis) vs year(x-axis) plot
>

>
> #clearly (1968,891.9) is an outlier, otherwise it looks close to linear
>
> linmodel = lm(distance ~ year) #create the linear model
> residuals = resid(linmodel); residuals #find and print the residuals
1 2 3 4 5 6 7
-22.893333 -7.870000 10.253333 -5.923333 66.400000 -10.876667 -11.753333
8 9
-3.130000 -14.206667
>
> plot(year, residuals) #plot residuals vs year
>

1(c) I notice that 7 out of 9 residuals are negative. And, actually there seems to be a pattern.

2(a)
y - power output
x - wind speed

If we do y-on-x regression, prediction for power output at sufficiently small speeds comes negative which is not possible. Also, no one will be interested in predicting wind speed from power output(as naturally power is being produced by wind and not viceversa) :).

2(b)
First, let us find the correlation coefficients for all 10 values using R. Here it is
>
> output = c(0.123,0.558,1.194,1.582,1.822,1.800,2.166,2.303,2.386,2.236)
>
> speed = c(2.45,3.05,4.10,5.00,6.00,7.00,8.15,9.10,9.70,10.00)
>
> cor(output,speed)
[1] 0.9419852
>

Now, let us find the correlation for 7 values where windspeeds vary from 3 to 9.1 mph.
>
> cor(output[2:8],speed[2:8])
[1] 0.951478
>

Now, let us do the y-on-x regression for the 7 values.
>
> speed = speed[2:8]
> output = output[2:8]
>
> linmodel = lm(output ~ speed) #build the model
> coeffs = coefficients(linmodel)
>
> #Plot the output vs speed along with fitted line
> plot(speed,output,xlab="Wind Speed",ylab="Power Output")
> abline(coeffs[1],coeffs[2])
>

3.
Again, I solved it using R. Here is R-terminal interaction.

> height = c(5000,10000,15000,20000,25000,30000,35000,40000,45000,50000,
55000,60000,70000,80000,90000,100000)
>
> pressure = c(632.50,522.70,429.00,349.50,282.40,226.10,179.30,141.20,
111.10,87.80,68.90,54.20,33.70,21.00,13.20,8.36)
>
> plot(height,pressure)

> #clearly it does not look linear so we can expect some pattern in residuals
>
>
> linmodel = lm(pressure ~ height)
>
> coeffs = coefficients(linmodel)
> coeffs[1] #Intercept
(Intercept)
471.6105
> coeffs[2] #slope
height
-0.006006586
>
> cor(height,pressure) #correlation
[1] -0.885868
>
> residuals = resid(linmodel)
>
> residuals
1 2 3 4 5 6 7
190.922430 111.155362 47.488294 -1.978775 -39.045843 -65.312911 -82.079980
8 9 10 11 12 13 14
-90.147048 -90.214116 -83.481185 -72.348253 -57.015322 -17.449458 29.916405
15 16
82.182268 137.408132
>
> plot(height,residuals)

Clearly residuals vs height plot shows a pattern and as expected data does not fit linear model.

4.
We are using height, pressure from problem#3
>
> squared_height = height * height
> squared_height
[1] 2.500e+07 1.000e+08 2.250e+08 4.000e+08 6.250e+08 9.000e+08 1.225e+09
[8] 1.600e+09 2.025e+09 2.500e+09 3.025e+09 3.600e+09 4.900e+09 6.400e+09
[15] 8.100e+09 1.000e+10
>
> plot(squared_height,pressure)

> #clearly this does not look linear either, so we can expect pattern in residuals
>
> linmodel = lm(pressure ~ squared_height)
>
> coeffs = coefficients(linmodel)
> coeffs[1] #Intercept
(Intercept)
332.7193
> coeffs[2] #slope
squared_height
-4.737236e-08
>
> cor(squared_height,pressure) #correlation
[1] -0.7394696
>
> residuals = resid(linmodel)
>
> residuals
1 2 3 4 5 6
300.965031 194.717958 106.939504 35.729667 -20.711551 -63.984150
7 8 9 10 11 12
-95.388132 -115.723495 -125.690240 -126.488366 -120.517875 -107.978765
13 14 15 16
-66.894691 -8.536143 64.196877 149.364370
>
> plot(squared_height,residuals)

Clearly residuals vs height plot shows a pattern.

5.
Again, the R-terminal interaction is here...
> ln_pressure = log(pressure)
> ln_pressure
[1] 6.449680 6.259008 6.061457 5.856504 5.643325 5.420977 5.189060 4.950177
[9] 4.710431 4.475062 4.232656 3.992681 3.517498 3.044522 2.580217 2.123458
>
> plot(height,ln_pressure)

> #clearly its linear, so there shouldn't be any pattern in residuals
>
> linmodel = lm(ln_pressure ~ height)
>
> coeffs = coefficients(linmodel)
> coeffs[1] #Intercept
(Intercept)
6.76613
> coeffs[2] #slope
height
-4.623476e-05
>
> cor(ln_pressure,height) #correlation
[1] -0.9996577
>
> residuals = resid(linmodel)
> residuals
1 2 3 4 5
-0.0852764784 -0.0447752103 -0.0111521741 0.0150682717 0.0330630186
6 7 8 9 10
0.0418896953 0.0411464997 0.0334372471 0.0248644217 0.0206690287
11 12 13 14 15
0.0094375091 0.0006360425 -0.0121994225 -0.0228272162 -0.0247852183
16
-0.0191960148
>
> plot(height,residuals)

Though plot shows some pattern but residual absolute values are tiny compared to those in problem#3 and problem#4. However to be sure that our model is correct. Let us plot the data points and fitted line together and see that it fits well
>
> plot(height,ln_pressure)
> abline(6.76613,-0.00004623476)