Exercise from Lecture 3 - Alex Crawford

Question 1: Which individual station best predicts LA rainfall?

Step 1:

Load Data & Explore

larain <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/LA_RAIN.csv", 
    header = TRUE)
names(larain)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"
head(larain)
##   Year APMAM APSAB APSLAKE OPBPC  OPRC OPSLAKE  BSAAM
## 1 1948  9.13  3.58    3.91  4.10  7.43    6.47  54235
## 2 1949  5.28  4.82    5.20  7.55 11.11   10.26  67567
## 3 1950  4.20  3.77    3.67  9.52 12.20   11.35  66161
## 4 1951  4.60  4.46    3.93 11.14 15.15   11.13  68094
## 5 1952  7.15  4.99    4.88 16.34 20.05   22.81 107080
## 6 1953  9.70  5.65    4.91  8.88  8.15    7.41  67594
plot(larain, main = "LA Rainfall and Station Data")  # Plots ALL of the data. Only do this with small data frames.

plot of chunk unnamed-chunk-2

Step 2:

Perform a simple linear regression model for the three stations that have a visual correlation in the above plots.

opbpc <- lm(BSAAM ~ OPBPC, data = larain)
oprc <- lm(BSAAM ~ OPRC, data = larain)
opslake <- lm(BSAAM ~ OPSLAKE, data = larain)
summary(opbpc)  #adjR2 = 0.7793
## 
## Call:
## lm(formula = BSAAM ~ OPBPC, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -21183  -7298   -819   4731  38430 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    40017       3589    11.2  5.5e-14 ***
## OPBPC           2940        241    12.2  3.0e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 12000 on 41 degrees of freedom
## Multiple R-squared: 0.785,   Adjusted R-squared: 0.779 
## F-statistic:  149 on 1 and 41 DF,  p-value: 3e-15
summary(oprc)  #adjR2 = 0.842
## 
## Call:
## lm(formula = BSAAM ~ OPRC, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -24356  -5514   -522   7448  24854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    21741       4044    5.38  3.3e-06 ***
## OPRC            4667        311   14.99  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 10100 on 41 degrees of freedom
## Multiple R-squared: 0.846,   Adjusted R-squared: 0.842 
## F-statistic:  225 on 1 and 41 DF,  p-value: <2e-16
summary(opslake)  #adjR2 =  0.8778
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17604  -5338    332   3411  20876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    27015       3219    8.39  1.9e-10 ***
## OPSLAKE         3752        216   17.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881,   Adjusted R-squared: 0.878 
## F-statistic:  303 on 1 and 41 DF,  p-value: <2e-16

Step 3:

Use ANOVA to determine if best predictor is significantly better than the rest.

anova(opslake, oprc, opbpc)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ OPRC
## Model 3: BSAAM ~ OPBPC
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1     41 3.26e+09                      
## 2     41 4.22e+09  0 -9.56e+08         
## 3     41 5.89e+09  0 -1.67e+09

Why is it not reporting a test statistic?

Step 4:

Observe residuals

plot(larain$BSAAM ~ larain$OPSLAKE, main = "LA Rainfall Prediction", xlab = "OPSLake Gage", 
    ylab = "LA Rainfall")
abline(coef(opslake))

plot of chunk unnamed-chunk-5

plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals", 
    xlab = "Fitted Values", ylab = "Residuals")

plot of chunk unnamed-chunk-5

Honestly, those don't look all that bad, but there may be a slight arc to them…

Step 5:

Try a quadratic model with OPSLAKE, the best linear variable.

opslake2 <- lm(BSAAM ~ OPSLAKE + I(OPSLAKE^2), data = larain)
summary(opslake2)  # adj. R2 = 0.8755
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + I(OPSLAKE^2), data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -18060  -5452    290   3790  20250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   23800.2     7089.2    3.36   0.0017 ** 
## OPSLAKE        4220.2      942.3    4.48  6.1e-05 ***
## I(OPSLAKE^2)    -14.0       27.4   -0.51   0.6128    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 9000 on 40 degrees of freedom
## Multiple R-squared: 0.881,   Adjusted R-squared: 0.876 
## F-statistic:  149 on 2 and 40 DF,  p-value: <2e-16
anova(opslake, opslake2)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ OPSLAKE + I(OPSLAKE^2)
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)
## 1     41 3.26e+09                         
## 2     40 3.24e+09  1  21097409 0.26   0.61

The adjusted R2 value is about the same, and the ANOVA yields not significant, so let's check the residuals.

par(mfrow = c(1, 2))
plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals", 
    xlab = "Fitted Values", ylab = "Residuals")
plot(opslake2$residuals ~ opslake2$fitted.values, main = "opslake2 Residuals", 
    xlab = "Fitted Values", ylab = "Residuals")

plot of chunk unnamed-chunk-7

ANSWER:

Again, there is barely any difference, so for the sake of parsimony, I'll stick with a simple linear regression of OPSLAKE to predict rainfall.

Question 2: Which TWO stations best predict LA rainfall?

Step 1:

**The three stations used above (op) appear to co-vary, as do the three stations (ap) not chosen for the first question. Covariation of independent variables in a multiple regression is not productive; therefore, I'll select the best station from each set of three. That means step 1 is finding the best ap station.

apmam <- lm(BSAAM ~ APMAM, data = larain)
apsab <- lm(BSAAM ~ APSAB, data = larain)
apslake <- lm(BSAAM ~ APSLAKE, data = larain)
summary(apmam)  #adjR2 = 0.03391
## 
## Call:
## lm(formula = BSAAM ~ APMAM, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -37043 -16339  -5457  17158  72467 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63364       9917    6.39  1.2e-07 ***
## APMAM           1965       1249    1.57     0.12    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 25100 on 41 degrees of freedom
## Multiple R-squared: 0.0569,  Adjusted R-squared: 0.0339 
## F-statistic: 2.47 on 1 and 41 DF,  p-value: 0.123
summary(apsab)  #adjR2 = 0.01003
## 
## Call:
## lm(formula = BSAAM ~ APSAB, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -41314 -16784  -5101  16492  70942 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    67152       9689    6.93  2.1e-08 ***
## APSAB           2279       1909    1.19     0.24    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 25400 on 41 degrees of freedom
## Multiple R-squared: 0.0336,  Adjusted R-squared: 0.01 
## F-statistic: 1.43 on 1 and 41 DF,  p-value: 0.239
summary(apslake)  #adjR2 = 0.0393
## 
## Call:
## lm(formula = BSAAM ~ APSLAKE, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -46438 -16907  -5661  19028  69464 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    63864       9249    6.90  2.2e-08 ***
## APSLAKE         2818       1709    1.65     0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 25000 on 41 degrees of freedom
## Multiple R-squared: 0.0622,  Adjusted R-squared: 0.0393 
## F-statistic: 2.72 on 1 and 41 DF,  p-value: 0.107
anova(apslake, apmam, apsab)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ APSLAKE
## Model 2: BSAAM ~ APMAM
## Model 3: BSAAM ~ APSAB
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1     41 2.57e+10                      
## 2     41 2.58e+10  0 -1.44e+08         
## 3     41 2.64e+10  0 -6.38e+08

Step 2:

Try the adding the two stations (severally) that had adj. R2 over 0.3

opslake.apslake <- lm(BSAAM ~ OPSLAKE + APSLAKE, data = larain)
summary(opslake.apslake)  # AdjR2 = 0.9002
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -13336  -5893   -172   4220  19500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    19145       3812    5.02  1.1e-05 ***
## OPSLAKE         3690        196   18.83  < 2e-16 ***
## APSLAKE         1769        554    3.19   0.0027 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8060 on 40 degrees of freedom
## Multiple R-squared: 0.905,   Adjusted R-squared:  0.9 
## F-statistic:  190 on 2 and 40 DF,  p-value: <2e-16
opslake.apmam <- lm(BSAAM ~ OPSLAKE + APMAM, data = larain)
summary(opslake.apmam)  # AdjR2 = 0.8948
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APMAM, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -16354  -4787   -616   4730  20554 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    19423       4055    4.79  2.3e-05 ***
## OPSLAKE         3693        201   18.35  < 2e-16 ***
## APMAM           1147        415    2.77   0.0085 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8280 on 40 degrees of freedom
## Multiple R-squared:  0.9,    Adjusted R-squared: 0.895 
## F-statistic:  180 on 2 and 40 DF,  p-value: <2e-16
anova(opslake.apslake, opslake.apmam)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE + APSLAKE
## Model 2: BSAAM ~ OPSLAKE + APMAM
##   Res.Df      RSS Df Sum of Sq F Pr(>F)
## 1     40 2.60e+09                      
## 2     40 2.74e+09  0 -1.39e+08
anova(opslake.apslake, opslake)
## Analysis of Variance Table
## 
## Model 1: BSAAM ~ OPSLAKE + APSLAKE
## Model 2: BSAAM ~ OPSLAKE
##   Res.Df      RSS Df Sum of Sq    F Pr(>F)   
## 1     40 2.60e+09                            
## 2     41 3.26e+09 -1 -6.63e+08 10.2 0.0027 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefopsapslake <- coef(opslake.apslake)

The results suggest that the opslake + apslake model is the best, and significantly better than opslake alone.

Step 3:

**Check the Residuals:

par(mfrow = c(1, 2))
plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals", 
    xlab = "Fitted Values", ylab = "Residuals")
plot(opslake.apslake$residuals ~ opslake.apslake$fitted.values, main = "ops/apslake Resids", 
    xlab = "Fitted Values", ylab = "Residuals")

plot of chunk unnamed-chunk-10

From my eye, these look close, maybe slightly better.I also note no curvature, so linear is probably adequate.

Answer:

I would select a simple multiple regression of opslake and apslake.

Alternative Method:

Use regsubsets to select the best regression models (linear or otherwise)

# regsubsets(DEPENDENT~.,data= data.frame) Using the period after the ~
# means that you're ogin to try ALL other variables as independent
# variables.
library(HH)  # Need HH loaded.
## Loading required package: lattice
## Loading required package: grid
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: splines
## Loading required package: leaps
## Loading required package: RColorBrewer
## Loading required package: latticeExtra
subsetsmodel <- regsubsets(BSAAM ~ ., data = larain)
summary(subsetsmodel)
## Subset selection object
## Call: regsubsets.formula(BSAAM ~ ., data = larain)
## 7 Variables  (and intercept)
##         Forced in Forced out
## Year        FALSE      FALSE
## APMAM       FALSE      FALSE
## APSAB       FALSE      FALSE
## APSLAKE     FALSE      FALSE
## OPBPC       FALSE      FALSE
## OPRC        FALSE      FALSE
## OPSLAKE     FALSE      FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
##          Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE
## 1  ( 1 ) " "  " "   " "   " "     " "   " "  "*"    
## 2  ( 1 ) " "  " "   " "   "*"     " "   " "  "*"    
## 3  ( 1 ) " "  " "   " "   "*"     " "   "*"  "*"    
## 4  ( 1 ) "*"  " "   " "   "*"     " "   "*"  "*"    
## 5  ( 1 ) "*"  " "   "*"   "*"     " "   "*"  "*"    
## 6  ( 1 ) "*"  "*"   "*"   "*"     " "   "*"  "*"    
## 7  ( 1 ) "*"  "*"   "*"   "*"     "*"   "*"  "*"
# The asterisks indicate the variables to use.
coef(subsetsmodel, 1:3)  # See the results coefficients.
## [[1]]
## (Intercept)     OPSLAKE 
##       27015        3752 
## 
## [[2]]
## (Intercept)     APSLAKE     OPSLAKE 
##       19145        1769        3690 
## 
## [[3]]
## (Intercept)     APSLAKE        OPRC     OPSLAKE 
##       15425        1712        1797        2390
names(subsetsmodel)  # See the names of variables stored in the regsubsets object.
##  [1] "np"        "nrbar"     "d"         "rbar"      "thetab"   
##  [6] "first"     "last"      "vorder"    "tol"       "rss"      
## [11] "bound"     "nvmax"     "ress"      "ir"        "nbest"    
## [16] "lopt"      "il"        "ier"       "xnames"    "method"   
## [21] "force.in"  "force.out" "sserr"     "intercept" "lindep"   
## [26] "nullrss"   "nn"        "call"