Day 20 Homework Solutions

1. In the movie we seem to be fitting a regression line to part of the data, data in a moving window. As the window moves to the right, the regression line changes and a curve is drawn so that the curve is tangent to the moving regression lines.

3.31 (a)

setwd("/Users/traves/Dropbox/SM339/day20")
health = read.csv("MetroHealth83.csv")
attach(health)
health3 = data.frame(SqrtMDs, NumHospitals, NumBeds)
cor(health3)
##              SqrtMDs NumHospitals NumBeds
## SqrtMDs       1.0000       0.9041  0.9461
## NumHospitals  0.9041       1.0000  0.9424
## NumBeds       0.9461       0.9424  1.0000

SqrtMDs and NumBeds have the highest correlation. NumBeds would be the best predictor of SqrtMDs.

(b) The variability in SqrtMDs explained by the variables alone is the square of the correlation coefficient. So NumHospitals explains almost 82% of the variation in SqrtMDs and NumBeds exlains about 89.5% of the variation in SqrtMDs.

cor(SqrtMDs, NumHospitals)^2
## [1] 0.8173
cor(SqrtMDs, NumBeds)^2
## [1] 0.895

(c) Together the two variables explain 89.6% of the variability in SqrtMDs.

together = lm(SqrtMDs ~ NumHospitals + NumBeds)
summary(together)
## 
## Call:
## lm(formula = SqrtMDs ~ NumHospitals + NumBeds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16.04  -4.58  -1.38   3.89  17.41 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.86649    1.11774   13.30  < 2e-16 ***
## NumHospitals  0.35894    0.34692    1.03      0.3    
## NumBeds       0.01157    0.00148    7.82  1.8e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.71 on 80 degrees of freedom
## Multiple R-squared:  0.896,  Adjusted R-squared:  0.894 
## F-statistic:  346 on 2 and 80 DF,  p-value: <2e-16

(d) Based on the correlations, each of the variables NumBeds and NumHospitals have significant relationships with SqrtMDs.

(e) Looking at the summary, we see that NumBeds is very important in the model (ceofficient is nonzero even in the presence of the other variable; p = 1.84e-11), whereas the coefficient of NumHospitals cannot be distinguished from zero (p = 0.304) in the presence of the other variable.

(f) The answers to (d) and (e) look inconsistent but in fact they are not. The explanation lies in the fact that NumHospitals and NumBeds are highly correlated (r = 0.942). Once NumBeds is added to the model, NumHospitals isn't that important because most of its effect is already taken care of by the NumBeds variable. Apparently the same can't be said in the other direction: NumBeds is still significantly related to the residuals of the model using NumHospitals alone.

3.35 The scatterplot suggests that a quadratic model might be appropriate.

setwd("/Users/traves/Dropbox/SM339/day20")
co2 = read.csv("CO2.csv")
attach(co2)
## The following object is masked from package:datasets:
## 
##     CO2
plot(CO2 ~ Day)

plot of chunk unnamed-chunk-4

Day2 = Day^2
model = lm(CO2 ~ 1 + Day + Day2)
summary(model)
## 
## Call:
## lm(formula = CO2 ~ 1 + Day + Day2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.248  -3.080  -0.252   2.843  20.153 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.15e+02   2.86e+00   145.3   <2e-16 ***
## Day         -4.76e-01   2.87e-02   -16.6   <2e-16 ***
## Day2         1.16e-03   6.68e-05    17.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.62 on 234 degrees of freedom
## Multiple R-squared:  0.573,  Adjusted R-squared:  0.57 
## F-statistic:  157 on 2 and 234 DF,  p-value: <2e-16

The fitted model is CO2 = 415 - 0.476Day + 0.00158Day\( ^2 \). Let's find where CO2 is minimum (according to the model) by taking the derivative and setting it to zero: -0.476 + 0.00316Day = 0, or Day = 150.63. The model predicts that the minimum CO2 levels ought to occur around Day 150 to Day 151.

Let's check the residuals from our model:

plot(rstandard(model) ~ model$fitted)
abline(0, 0, col = "blue")
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
## Loading required package: lattice

plot of chunk unnamed-chunk-5

densityplot(rstandard(model))

plot of chunk unnamed-chunk-5

co2[which(rstandard(model) > 4), ]
##      CO2 Day
## 44 391.8 136

The constant variance assumption seems to hold but there is one outlier (occuring on Day 136) that ought to be investigated further.