Solutions to Day 11 Homework

Exercise 1.16 in STAT2 book:

Load data:

setwd("/Users/traves/Dropbox/sm339/day11")
USS <- read.csv("USStamps.csv")
summary(USS)
##       Year          Price     
##  Min.   :1885   Min.   : 2.0  
##  1st Qu.:1968   1st Qu.: 6.0  
##  Median :1981   Median :20.0  
##  Mean   :1976   Mean   :21.3  
##  3rd Qu.:2001   3rd Qu.:34.0  
##  Max.   :2012   Max.   :45.0
attach(USS)
plot(Price ~ Year, main = "Price of US Stamps versus Year", ylab = "Price in cents")

plot of chunk unnamed-chunk-1

a. The price stays almost constant for about 60 years but starts increasing in about 1958. Once it starts increasing the trend looks roughly linear.

b.

USS  # Note: linearly ordered by Year
##    Year Price
## 1  1885     2
## 2  1917     3
## 3  1919     2
## 4  1932     3
## 5  1958     4
## 6  1963     5
## 7  1968     6
## 8  1971     8
## 9  1974    10
## 10 1975    13
## 11 1978    15
## 12 1981    18
## 13 1981    20
## 14 1985    22
## 15 1988    25
## 16 1991    29
## 17 1995    32
## 18 1999    33
## 19 2001    34
## 20 2002    37
## 21 2006    39
## 22 2007    41
## 23 2008    42
## 24 2009    44
## 25 2012    45
USSM = USS[-seq(1, 4), ]
line = lm(USSM$Price ~ USSM$Year)
summary(line)
## 
## Call:
## lm(formula = USSM$Price ~ USSM$Year)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.923 -0.948  0.119  1.190  4.532 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.65e+03   4.69e+01   -35.1   <2e-16 ***
## USSM$Year    8.41e-01   2.36e-02    35.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 1.74 on 19 degrees of freedom
## Multiple R-squared: 0.985,   Adjusted R-squared: 0.985 
## F-statistic: 1.27e+03 on 1 and 19 DF,  p-value: <2e-16

The regression line is Price = -0.002 + 0.841*Year.

c. Plot the regression line for 1958 onward:

plot(Price ~ Year, data = USSM, main = "Price of US Stamps versus Year", ylab = "Price in cents")
abline(line, col = "blue", lwd = 2)

plot of chunk unnamed-chunk-3

The regression line seems like an excellent fit to the data after year 1958.

d. Examine residuals:

require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
densityplot(line$residuals)

plot of chunk unnamed-chunk-4

qqnorm(line$residuals)
qqline(line$residuals)

plot of chunk unnamed-chunk-4

plot(line$residuals ~ line$fitted)
abline(0, 0, col = "blue", lwd = 2)

plot of chunk unnamed-chunk-4

The residuals look roughly normally distributed (though the single large residual is making the top tail quite heavy) and they seem to exhibit constant variance. It seems as if the conditiosn for the regrssion model are met, at least from the plots. On the other hand, it may be that US Stamp price increases are not independent form year to year. This cannot be determined from the plots of the residuals, and given the politics in congress, it seems like an assumption that may not hold.

e. Identifying outliers:

boxplot(rstudent(line))

plot of chunk unnamed-chunk-5

which(rstudent(line) > 3)
## 1 
## 1
USSM[1, ]
##   Year Price
## 5 1958     4

It seems as if the 1958 observation is an outlier. This is the first year that prices started to rise regularly and it makes sense that this year might be unusual as it is a boundardy point.

Exercise 1.17 in STAT2 book:

a. Examine trends over time:

setwd("/Users/traves/Dropbox/sm339/day11")
ME <- read.csv("MathEnrollment.csv")
summary(ME)
##      Ayear           Fall         Spring   
##  Min.   :2001   Min.   :248   Min.   :206  
##  1st Qu.:2004   1st Qu.:266   1st Qu.:238  
##  Median :2006   Median :286   Median :254  
##  Mean   :2006   Mean   :286   Mean   :258  
##  3rd Qu.:2008   3rd Qu.:302   3rd Qu.:286  
##  Max.   :2011   Max.   :343   Max.   :308
attach(ME)
def.par <- par(no.readonly = TRUE)  # set def.par to current layout parameters
layout(matrix(c(1, 2), 2, 1, byrow = TRUE))  # plot graphs in two rows
plot(Fall ~ Ayear, col = "red", pch = 19, main = "Math enrollment in Fall over time", 
    xlab = "Academic year", ylab = "Fall enrollment")
plot(Spring ~ Ayear, col = "red", pch = 19, main = "Math enrollment in Spring over time", 
    xlab = "Academic year", ylab = "Spring enrollment")

plot of chunk unnamed-chunk-6

par(def.par)  # reset graphing to 1 plot at a time

It is hard to tell but it does not seem that the enrollment patterns are the same over time. In particular, the trend in Fall enrollments from 2004-07 is down but the trend in Spring enrollments over the same period is up.

b. Spring vs. Fall:

plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments", 
    xlab = "Fall Enrollment", ylab = "Spring enrollment")

plot of chunk unnamed-chunk-7

I do not agree with the faculty member's claim: the plot of Spring vs Fall enrollments does not seem to indicate that the Fall enrollment is a good predictor of Spring enrollment.

c. Regression line:

SF = lm(Spring ~ Fall)
summary(SF)  # Note that the slope cannot be reliably distiguished from 0 (p-value = 0.40)
## 
## Call:
## lm(formula = Spring ~ Fall)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46.74 -24.05   1.91  20.67  48.98 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  351.059    106.471    3.30   0.0093 **
## Fall          -0.327      0.371   -0.88   0.4019   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 33.1 on 9 degrees of freedom
## Multiple R-squared: 0.0792,  Adjusted R-squared: -0.0232 
## F-statistic: 0.774 on 1 and 9 DF,  p-value: 0.402
plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments", 
    xlab = "Fall Enrollment", ylab = "Spring enrollment")
abline(SF, col = "blue", lwd = 2)

plot of chunk unnamed-chunk-8

densityplot(rstandard(SF))

plot of chunk unnamed-chunk-8

plot(rstandard(SF) ~ SF$fitted, col = "red", pch = 19, main = "Standardized Residuals vs Fitted in Linear Model")

plot of chunk unnamed-chunk-8

which(rstandard(SF) > 2)
## 3 
## 3
ME[3, ]
##   Ayear Fall Spring
## 3  2003  343    288

There does seem to be an outlier in the data from Academic year 2003.

d. Checking influence:

MME = ME[-c(3), ]
SFM = lm(Spring ~ Fall, data = MME)
summary(SFM)
## 
## Call:
## lm(formula = Spring ~ Fall, data = MME)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -30.50 -17.35  -6.06  22.71  29.42 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  548.009    106.724    5.13  0.00089 ***
## Fall          -1.048      0.381   -2.75  0.02487 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487,   Adjusted R-squared: 0.423 
## F-statistic: 7.59 on 1 and 8 DF,  p-value: 0.0249

Note that after removing 2003 data the model changes and the slope coefficient is now clearly different from 0 (p-value = 0.025).

We plot both regression lines:

plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments", 
    xlab = "Fall Enrollment", ylab = "Spring enrollment")
abline(SF, col = "blue", lwd = 2)
abline(SFM, col = "green", lwd = 2)

plot of chunk unnamed-chunk-10

plot(rstudent(SF) ~ Ayear)

plot of chunk unnamed-chunk-10

rstudent(SF)[3]
##   3 
## 2.8

The 2003 data point has Studentized residual of 2.8 which is quite high. It seems like an influential point, especially in light of the effect on the regression line (e.g. the slope becoming significant).