Lecture 4: Intro to Correlation and Regression

Key Concepts overvied in the lecture include:
1. Pearson's Correlation Coefficient
2. Ranked correlation: Mann-Whitney and Spearman's Ranked Correlation Coefficients
3. Assumptions of linear regression
4. Residuals and Least Squares
5. Coefficient of Determination (R2)<br /> 6. Significant Testing for Linear Regressions
7. Data Transformations


The inclass exercise is about using general linear models.

Part 1: Which weather station (or sets of weather stations) is the best predictor of runoff?

# install.packages('HH')#only need to do this once per computer...
library(HH)  #activate the packages to do Tukey's Ladder of Powers.
## Warning: package 'HH' was built under R version 2.15.3
## Loading required package: lattice
## Loading required package: grid
## Loading required package: multcomp
## Warning: package 'multcomp' was built under R version 2.15.3
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: splines
## Loading required package: leaps
## Loading required package: RColorBrewer
## Loading required package: latticeExtra
## Loading required package: reshape
## Loading required package: plyr
## Attaching package: 'reshape'
## The following object(s) are masked from 'package:plyr':
## 
## rename, round_any
## Loading required package: colorspace
library(leaps)  #activate the package to use regsubset
rain <- read.csv("E:/Quant/inClassExercises/InClassExerciseData/LA_RAIN.csv")
names(rain)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"

# Create some linear models to get at best prediction or best model fit
# (R^2)

# There are a few assumptions that we must test for all genearl linear
# models: 1. The plotted data is homoscedastic (variance of data does not
# depend on X and is consistent.) 2. The slope of the regression line is
# significantly different than 0.  We can use a t-test on the model
# parameters.  The null hypothesis is that the coefficient is 0. 3. The
# variation explained by the model is not due to random chance using an
# F-test.  The null hypothesis is that the model is due to random chance.
# 4. the residuals are normally distributed around 0.



# Plot the data to check for homoscedasity.
plot(rain$BSAAM ~ rain$OPSLAKE)

plot of chunk unnamed-chunk-1

plot(rain$BSAAM ~ rain$OPRC)

plot of chunk unnamed-chunk-1

plot(rain$BSAAM ~ rain$OPBPC)

plot of chunk unnamed-chunk-1

plot(rain$BSAAM ~ rain$APSLAKE)  #This one doesn't look good...

plot of chunk unnamed-chunk-1

plot(rain$BSAAM ~ rain$APSAB)  #This one doesn't look good...

plot of chunk unnamed-chunk-1

plot(rain$BSAAM ~ rain$APMAM)  #This one doesn't look good...

plot of chunk unnamed-chunk-1


# Create the models and perform diagnostics....
opslake <- lm(rain$BSAAM ~ rain$OPSLAKE)  #Create a linear model.
summary(opslake)  #R2 of .8778; the model parameter is significant
## 
## Call:
## lm(formula = rain$BSAAM ~ rain$OPSLAKE)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17604  -5338    332   3411  20876 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     27015       3219    8.39  1.9e-10 ***
## rain$OPSLAKE     3752        216   17.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881,   Adjusted R-squared: 0.878 
## F-statistic:  303 on 1 and 41 DF,  p-value: <2e-16
shapiro.test(opslake$residuals)  #residuals are good.
## 
##  Shapiro-Wilk normality test
## 
## data:  opslake$residuals 
## W = 0.9677, p-value = 0.2639
anova(opslake)  #model is significant.
## Analysis of Variance Table
## 
## Response: rain$BSAAM
##              Df   Sum Sq  Mean Sq F value Pr(>F)    
## rain$OPSLAKE  1 2.41e+10 2.41e+10     303 <2e-16 ***
## Residuals    41 3.26e+09 7.96e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


oprc <- lm(rain$BSAAM ~ rain$OPRC)
summary(oprc)  #R2 of .842; model parameter is significant
## 
## Call:
## lm(formula = rain$BSAAM ~ rain$OPRC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21741       4044    5.38  3.3e-06 ***
## rain$OPRC       4667        311   14.99  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 10100 on 41 degrees of freedom
## Multiple R-squared: 0.846,   Adjusted R-squared: 0.842 
## F-statistic:  225 on 1 and 41 DF,  p-value: <2e-16
shapiro.test(opslake$residuals)  #residuals are good.
## 
##  Shapiro-Wilk normality test
## 
## data:  opslake$residuals 
## W = 0.9677, p-value = 0.2639
anova(opslake)  #model is significant.
## Analysis of Variance Table
## 
## Response: rain$BSAAM
##              Df   Sum Sq  Mean Sq F value Pr(>F)    
## rain$OPSLAKE  1 2.41e+10 2.41e+10     303 <2e-16 ***
## Residuals    41 3.26e+09 7.96e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


opbpc <- lm(rain$BSAAM ~ rain$OPBPC)
summary(opbpc)  #R2 of .7793; model parameter is significant
## 
## Call:
## lm(formula = rain$BSAAM ~ rain$OPBPC)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40017       3589    11.2  5.5e-14 ***
## rain$OPBPC      2940        241    12.2  3.0e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 12000 on 41 degrees of freedom
## Multiple R-squared: 0.785,   Adjusted R-squared: 0.779 
## F-statistic:  149 on 1 and 41 DF,  p-value: 3e-15
shapiro.test(opslake$residuals)  #residuals are good.
## 
##  Shapiro-Wilk normality test
## 
## data:  opslake$residuals 
## W = 0.9677, p-value = 0.2639
anova(opslake)  #model is significant.
## Analysis of Variance Table
## 
## Response: rain$BSAAM
##              Df   Sum Sq  Mean Sq F value Pr(>F)    
## rain$OPSLAKE  1 2.41e+10 2.41e+10     303 <2e-16 ***
## Residuals    41 3.26e+09 7.96e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1



# These violate the condition that the data needs to be homoscedastic.  We
# can try to transform the data using Tukey's Ladder of Power.  Note, this
# takes a while....

ladder(BSAAM ~ APSLAKE, data = rain, main = "Ladder of Powers \n APSLAKE")

plot of chunk unnamed-chunk-1

# Based on the ladder of powers, a good transformation would be to take
# the natural log of the data.
apslake <- lm(log(rain$BSAAM) ~ log(rain$APSLAKE))
summary(apslake)  #R2 of 0.0283; model parameter NOT significant
## 
## Call:
## lm(formula = log(rain$BSAAM) ~ log(rain$APSLAKE))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.6282 -0.2106 -0.0235  0.2707  0.6769 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         10.968      0.170   64.46   <2e-16 ***
## log(rain$APSLAKE)    0.162      0.109    1.49     0.14    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 0.309 on 41 degrees of freedom
## Multiple R-squared: 0.0514,  Adjusted R-squared: 0.0283 
## F-statistic: 2.22 on 1 and 41 DF,  p-value: 0.144


ladder(BSAAM ~ APSAB, data = rain, main = "Ladder of Powers \n APSAB")

plot of chunk unnamed-chunk-1

# Based on the ladder of power, none of the transformations look
# appropriate.  Skip this one.  It violates the assumpations.

ladder(BSAAM ~ APMAM, data = rain, main = "Ladder of Powers \n APMAM")

plot of chunk unnamed-chunk-1

# Based on the ladder of power, none of the transformations look
# appropriate.  Skip this one.  It violates the assumpations.

Based on the models that were created, OPSLAKE, OPRC, and OPBPC were good predictors. The best was OPSLAKE because it had the highest coefficient of determination (or R2 value)

Part 2: If you had to keep 2 weather stations open, which 2 would you keep?

# We can do this using the regsubsets function.  This will examine ALL
# possible models and provide the best model.
modelsubsets <- regsubsets(BSAAM ~ OPSLAKE + OPRC + OPBPC, data = rain)

plot(modelsubsets)  #shows a graphic of which subsets are the best.

plot of chunk unnamed-chunk-2


The stations of OPSLAKE and OPRC are the best two variable model.