Key Concepts overvied in the lecture include:
1. Pearson's Correlation Coefficient
2. Ranked correlation: Mann-Whitney and Spearman's Ranked Correlation Coefficients
3. Assumptions of linear regression
4. Residuals and Least Squares
5. Coefficient of Determination (R2)<br />
6. Significant Testing for Linear Regressions
7. Data Transformations
The inclass exercise is about using general linear models.
Part 1: Which weather station (or sets of weather stations) is the best predictor of runoff?
# install.packages('HH')#only need to do this once per computer...
library(HH) #activate the packages to do Tukey's Ladder of Powers.
## Warning: package 'HH' was built under R version 2.15.3
## Loading required package: lattice
## Loading required package: grid
## Loading required package: multcomp
## Warning: package 'multcomp' was built under R version 2.15.3
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: splines
## Loading required package: leaps
## Loading required package: RColorBrewer
## Loading required package: latticeExtra
## Loading required package: reshape
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
##
## rename, round_any
## Loading required package: colorspace
library(leaps) #activate the package to use regsubset
rain <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/LA_RAIN.csv")
names(rain)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
# Create some linear models to get at best prediction or best model fit
# (R^2)
# There are a few assumptions that we must test for all genearl linear
# models: 1. The plotted data is homoscedastic (variance of data does not
# depend on X and is consistent.) 2. The slope of the regression line is
# significantly different than 0. We can use a t-test on the model
# parameters. The null hypothesis is that the coefficient is 0. 3. The
# variation explained by the model is not due to random chance using an
# F-test. The null hypothesis is that the model is due to random chance.
# 4. the residuals are normally distributed around 0.
# Plot the data to check for homoscedasity.
plot(rain$BSAAM ~ rain$OPSLAKE)
plot(rain$BSAAM ~ rain$OPRC)
plot(rain$BSAAM ~ rain$OPBPC)
plot(rain$BSAAM ~ rain$APSLAKE) #This one doesn't look good...
plot(rain$BSAAM ~ rain$APSAB) #This one doesn't look good...
plot(rain$BSAAM ~ rain$APMAM) #This one doesn't look good...
# Create the models and perform diagnostics....
opslake <- lm(rain$BSAAM ~ rain$OPSLAKE) #Create a linear model.
summary(opslake) #R2 of .8778; the model parameter is significant
##
## Call:
## lm(formula = rain$BSAAM ~ rain$OPSLAKE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17604 -5338 332 3411 20876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27015 3219 8.39 1.9e-10 ***
## rain$OPSLAKE 3752 216 17.39 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.878
## F-statistic: 303 on 1 and 41 DF, p-value: <2e-16
shapiro.test(opslake$residuals) #residuals are good.
##
## Shapiro-Wilk normality test
##
## data: opslake$residuals
## W = 0.9677, p-value = 0.2639
anova(opslake) #model is significant.
## Analysis of Variance Table
##
## Response: rain$BSAAM
## Df Sum Sq Mean Sq F value Pr(>F)
## rain$OPSLAKE 1 2.41e+10 2.41e+10 303 <2e-16 ***
## Residuals 41 3.26e+09 7.96e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
oprc <- lm(rain$BSAAM ~ rain$OPRC)
summary(oprc) #R2 of .842; model parameter is significant
##
## Call:
## lm(formula = rain$BSAAM ~ rain$OPRC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741 4044 5.38 3.3e-06 ***
## rain$OPRC 4667 311 14.99 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10100 on 41 degrees of freedom
## Multiple R-squared: 0.846, Adjusted R-squared: 0.842
## F-statistic: 225 on 1 and 41 DF, p-value: <2e-16
shapiro.test(opslake$residuals) #residuals are good.
##
## Shapiro-Wilk normality test
##
## data: opslake$residuals
## W = 0.9677, p-value = 0.2639
anova(opslake) #model is significant.
## Analysis of Variance Table
##
## Response: rain$BSAAM
## Df Sum Sq Mean Sq F value Pr(>F)
## rain$OPSLAKE 1 2.41e+10 2.41e+10 303 <2e-16 ***
## Residuals 41 3.26e+09 7.96e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
opbpc <- lm(rain$BSAAM ~ rain$OPBPC)
summary(opbpc) #R2 of .7793; model parameter is significant
##
## Call:
## lm(formula = rain$BSAAM ~ rain$OPBPC)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21183 -7298 -819 4731 38430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40017 3589 11.2 5.5e-14 ***
## rain$OPBPC 2940 241 12.2 3.0e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12000 on 41 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.779
## F-statistic: 149 on 1 and 41 DF, p-value: 3e-15
shapiro.test(opslake$residuals) #residuals are good.
##
## Shapiro-Wilk normality test
##
## data: opslake$residuals
## W = 0.9677, p-value = 0.2639
anova(opslake) #model is significant.
## Analysis of Variance Table
##
## Response: rain$BSAAM
## Df Sum Sq Mean Sq F value Pr(>F)
## rain$OPSLAKE 1 2.41e+10 2.41e+10 303 <2e-16 ***
## Residuals 41 3.26e+09 7.96e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# These violate the condition that the data needs to be homoscedastic. We
# can try to transform the data using Tukey's Ladder of Power. Note, this
# takes a while....
ladder(BSAAM ~ APSLAKE, data = rain, main = "Ladder of Powers \n APSLAKE")
# Based on the ladder of powers, a good transformation would be to take
# the natural log of the data.
apslake <- lm(log(rain$BSAAM) ~ log(rain$APSLAKE))
summary(apslake) #R2 of 0.0283; model parameter NOT significant
##
## Call:
## lm(formula = log(rain$BSAAM) ~ log(rain$APSLAKE))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6282 -0.2106 -0.0235 0.2707 0.6769
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.968 0.170 64.46 <2e-16 ***
## log(rain$APSLAKE) 0.162 0.109 1.49 0.14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.309 on 41 degrees of freedom
## Multiple R-squared: 0.0514, Adjusted R-squared: 0.0283
## F-statistic: 2.22 on 1 and 41 DF, p-value: 0.144
ladder(BSAAM ~ APSAB, data = rain, main = "Ladder of Powers \n APSAB")
# Based on the ladder of power, none of the transformations look
# appropriate. Skip this one. It violates the assumpations.
ladder(BSAAM ~ APMAM, data = rain, main = "Ladder of Powers \n APMAM")
# Based on the ladder of power, none of the transformations look
# appropriate. Skip this one. It violates the assumpations.
Based on the models that were created, OPSLAKE, OPRC, and OPBPC were good predictors. The best was OPSLAKE because it had the highest coefficient of determination (or R2 value)
Part 2: If you had to keep 2 weather stations open, which 2 would you keep?
# We can do this using the regsubsets function. This will examine ALL
# possible models and provide the best model.
modelsubsets <- regsubsets(BSAAM ~ OPSLAKE + OPRC + OPBPC, data = rain)
plot(modelsubsets) #shows a graphic of which subsets are the best.
The stations of OPSLAKE and OPRC are the best two variable model.