Exercise 1.16 in STAT2 book:
Load data:
setwd("/Users/traves/Dropbox/sm339/day11")
USS <- read.csv("USStamps.csv")
summary(USS)
## Year Price
## Min. :1885 Min. : 2.0
## 1st Qu.:1968 1st Qu.: 6.0
## Median :1981 Median :20.0
## Mean :1976 Mean :21.3
## 3rd Qu.:2001 3rd Qu.:34.0
## Max. :2012 Max. :45.0
attach(USS)
plot(Price ~ Year, main = "Price of US Stamps versus Year", ylab = "Price in cents")
a. The price stays almost constant for about 60 years but starts increasing in about 1958. Once it starts increasing the trend looks roughly linear.
b.
USS # Note: linearly ordered by Year
## Year Price
## 1 1885 2
## 2 1917 3
## 3 1919 2
## 4 1932 3
## 5 1958 4
## 6 1963 5
## 7 1968 6
## 8 1971 8
## 9 1974 10
## 10 1975 13
## 11 1978 15
## 12 1981 18
## 13 1981 20
## 14 1985 22
## 15 1988 25
## 16 1991 29
## 17 1995 32
## 18 1999 33
## 19 2001 34
## 20 2002 37
## 21 2006 39
## 22 2007 41
## 23 2008 42
## 24 2009 44
## 25 2012 45
USSM = USS[-seq(1, 4), ]
line = lm(USSM$Price ~ USSM$Year)
summary(line)
##
## Call:
## lm(formula = USSM$Price ~ USSM$Year)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.923 -0.948 0.119 1.190 4.532
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.65e+03 4.69e+01 -35.1 <2e-16 ***
## USSM$Year 8.41e-01 2.36e-02 35.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.74 on 19 degrees of freedom
## Multiple R-squared: 0.985, Adjusted R-squared: 0.985
## F-statistic: 1.27e+03 on 1 and 19 DF, p-value: <2e-16
The regression line is Price = -0.002 + 0.841*Year.
c. Plot the regression line for 1958 onward:
plot(Price ~ Year, data = USSM, main = "Price of US Stamps versus Year", ylab = "Price in cents")
abline(line, col = "blue", lwd = 2)
The regression line seems like an excellent fit to the data after year 1958.
d. Examine residuals:
require(lattice)
## Loading required package: lattice
require(latticeExtra)
## Loading required package: latticeExtra
## Loading required package: RColorBrewer
densityplot(line$residuals)
qqnorm(line$residuals)
qqline(line$residuals)
plot(line$residuals ~ line$fitted)
abline(0, 0, col = "blue", lwd = 2)
The residuals look roughly normally distributed (though the single large residual is making the top tail quite heavy) and they seem to exhibit constant variance. It seems as if the conditiosn for the regrssion model are met, at least from the plots. On the other hand, it may be that US Stamp price increases are not independent form year to year. This cannot be determined from the plots of the residuals, and given the politics in congress, it seems like an assumption that may not hold.
e. Identifying outliers:
boxplot(rstudent(line))
which(rstudent(line) > 3)
## 1
## 1
USSM[1, ]
## Year Price
## 5 1958 4
It seems as if the 1958 observation is an outlier. This is the first year that prices started to rise regularly and it makes sense that this year might be unusual as it is a boundardy point.
Exercise 1.17 in STAT2 book:
a. Examine trends over time:
setwd("/Users/traves/Dropbox/sm339/day11")
ME <- read.csv("MathEnrollment.csv")
summary(ME)
## Ayear Fall Spring
## Min. :2001 Min. :248 Min. :206
## 1st Qu.:2004 1st Qu.:266 1st Qu.:238
## Median :2006 Median :286 Median :254
## Mean :2006 Mean :286 Mean :258
## 3rd Qu.:2008 3rd Qu.:302 3rd Qu.:286
## Max. :2011 Max. :343 Max. :308
attach(ME)
def.par <- par(no.readonly = TRUE) # set def.par to current layout parameters
layout(matrix(c(1, 2), 2, 1, byrow = TRUE)) # plot graphs in two rows
plot(Fall ~ Ayear, col = "red", pch = 19, main = "Math enrollment in Fall over time",
xlab = "Academic year", ylab = "Fall enrollment")
plot(Spring ~ Ayear, col = "red", pch = 19, main = "Math enrollment in Spring over time",
xlab = "Academic year", ylab = "Spring enrollment")
par(def.par) # reset graphing to 1 plot at a time
It is hard to tell but it does not seem that the enrollment patterns are the same over time. In particular, the trend in Fall enrollments from 2004-07 is down but the trend in Spring enrollments over the same period is up.
b. Spring vs. Fall:
plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments",
xlab = "Fall Enrollment", ylab = "Spring enrollment")
I do not agree with the faculty member's claim: the plot of Spring vs Fall enrollments does not seem to indicate that the Fall enrollment is a good predictor of Spring enrollment.
c. Regression line:
SF = lm(Spring ~ Fall)
summary(SF) # Note that the slope cannot be reliably distiguished from 0 (p-value = 0.40)
##
## Call:
## lm(formula = Spring ~ Fall)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46.74 -24.05 1.91 20.67 48.98
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 351.059 106.471 3.30 0.0093 **
## Fall -0.327 0.371 -0.88 0.4019
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.1 on 9 degrees of freedom
## Multiple R-squared: 0.0792, Adjusted R-squared: -0.0232
## F-statistic: 0.774 on 1 and 9 DF, p-value: 0.402
plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments",
xlab = "Fall Enrollment", ylab = "Spring enrollment")
abline(SF, col = "blue", lwd = 2)
densityplot(rstandard(SF))
plot(rstandard(SF) ~ SF$fitted, col = "red", pch = 19, main = "Standardized Residuals vs Fitted in Linear Model")
which(rstandard(SF) > 2)
## 3
## 3
ME[3, ]
## Ayear Fall Spring
## 3 2003 343 288
There does seem to be an outlier in the data from Academic year 2003.
d. Checking influence:
MME = ME[-c(3), ]
SFM = lm(Spring ~ Fall, data = MME)
summary(SFM)
##
## Call:
## lm(formula = Spring ~ Fall, data = MME)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.50 -17.35 -6.06 22.71 29.42
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 548.009 106.724 5.13 0.00089 ***
## Fall -1.048 0.381 -2.75 0.02487 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.9 on 8 degrees of freedom
## Multiple R-squared: 0.487, Adjusted R-squared: 0.423
## F-statistic: 7.59 on 1 and 8 DF, p-value: 0.0249
Note that after removing 2003 data the model changes and the slope coefficient is now clearly different from 0 (p-value = 0.025).
We plot both regression lines:
plot(Spring ~ Fall, col = "red", pch = 19, main = "Math enrollment in Spring versus Fall enrollments",
xlab = "Fall Enrollment", ylab = "Spring enrollment")
abline(SF, col = "blue", lwd = 2)
abline(SFM, col = "green", lwd = 2)
plot(rstudent(SF) ~ Ayear)
rstudent(SF)[3]
## 3
## 2.8
The 2003 data point has Studentized residual of 2.8 which is quite high. It seems like an influential point, especially in light of the effect on the regression line (e.g. the slope becoming significant).