Load Data & Explore
larain <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/LA_RAIN.csv",
header = TRUE)
names(larain)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
head(larain)
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE BSAAM
## 1 1948 9.13 3.58 3.91 4.10 7.43 6.47 54235
## 2 1949 5.28 4.82 5.20 7.55 11.11 10.26 67567
## 3 1950 4.20 3.77 3.67 9.52 12.20 11.35 66161
## 4 1951 4.60 4.46 3.93 11.14 15.15 11.13 68094
## 5 1952 7.15 4.99 4.88 16.34 20.05 22.81 107080
## 6 1953 9.70 5.65 4.91 8.88 8.15 7.41 67594
plot(larain, main = "LA Rainfall and Station Data") # Plots ALL of the data. Only do this with small data frames.
Perform a simple linear regression model for the three stations that have a visual correlation in the above plots.
opbpc <- lm(BSAAM ~ OPBPC, data = larain)
oprc <- lm(BSAAM ~ OPRC, data = larain)
opslake <- lm(BSAAM ~ OPSLAKE, data = larain)
summary(opbpc) #adjR2 = 0.7793
##
## Call:
## lm(formula = BSAAM ~ OPBPC, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -21183 -7298 -819 4731 38430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40017 3589 11.2 5.5e-14 ***
## OPBPC 2940 241 12.2 3.0e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12000 on 41 degrees of freedom
## Multiple R-squared: 0.785, Adjusted R-squared: 0.779
## F-statistic: 149 on 1 and 41 DF, p-value: 3e-15
summary(oprc) #adjR2 = 0.842
##
## Call:
## lm(formula = BSAAM ~ OPRC, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -24356 -5514 -522 7448 24854
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 21741 4044 5.38 3.3e-06 ***
## OPRC 4667 311 14.99 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10100 on 41 degrees of freedom
## Multiple R-squared: 0.846, Adjusted R-squared: 0.842
## F-statistic: 225 on 1 and 41 DF, p-value: <2e-16
summary(opslake) #adjR2 = 0.8778
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17604 -5338 332 3411 20876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27015 3219 8.39 1.9e-10 ***
## OPSLAKE 3752 216 17.39 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.878
## F-statistic: 303 on 1 and 41 DF, p-value: <2e-16
Use ANOVA to determine if best predictor is significantly better than the rest.
anova(opslake, oprc, opbpc)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ OPRC
## Model 3: BSAAM ~ OPBPC
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 3.26e+09
## 2 41 4.22e+09 0 -9.56e+08
## 3 41 5.89e+09 0 -1.67e+09
Why is it not reporting a test statistic?
Observe residuals
plot(larain$BSAAM ~ larain$OPSLAKE, main = "LA Rainfall Prediction", xlab = "OPSLake Gage",
ylab = "LA Rainfall")
abline(coef(opslake))
plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals",
xlab = "Fitted Values", ylab = "Residuals")
Honestly, those don't look all that bad, but there may be a slight arc to them…
Try a quadratic model with OPSLAKE, the best linear variable.
opslake2 <- lm(BSAAM ~ OPSLAKE + I(OPSLAKE^2), data = larain)
summary(opslake2) # adj. R2 = 0.8755
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + I(OPSLAKE^2), data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18060 -5452 290 3790 20250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23800.2 7089.2 3.36 0.0017 **
## OPSLAKE 4220.2 942.3 4.48 6.1e-05 ***
## I(OPSLAKE^2) -14.0 27.4 -0.51 0.6128
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9000 on 40 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.876
## F-statistic: 149 on 2 and 40 DF, p-value: <2e-16
anova(opslake, opslake2)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE
## Model 2: BSAAM ~ OPSLAKE + I(OPSLAKE^2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 3.26e+09
## 2 40 3.24e+09 1 21097409 0.26 0.61
The adjusted R2 value is about the same, and the ANOVA yields not significant, so let's check the residuals.
par(mfrow = c(1, 2))
plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals",
xlab = "Fitted Values", ylab = "Residuals")
plot(opslake2$residuals ~ opslake2$fitted.values, main = "opslake2 Residuals",
xlab = "Fitted Values", ylab = "Residuals")
Again, there is barely any difference, so for the sake of parsimony, I'll stick with a simple linear regression of OPSLAKE to predict rainfall.
**The three stations used above (op) appear to co-vary, as do the three stations (ap) not chosen for the first question. Covariation of independent variables in a multiple regression is not productive; therefore, I'll select the best station from each set of three. That means step 1 is finding the best ap station.
apmam <- lm(BSAAM ~ APMAM, data = larain)
apsab <- lm(BSAAM ~ APSAB, data = larain)
apslake <- lm(BSAAM ~ APSLAKE, data = larain)
summary(apmam) #adjR2 = 0.03391
##
## Call:
## lm(formula = BSAAM ~ APMAM, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -37043 -16339 -5457 17158 72467
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63364 9917 6.39 1.2e-07 ***
## APMAM 1965 1249 1.57 0.12
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25100 on 41 degrees of freedom
## Multiple R-squared: 0.0569, Adjusted R-squared: 0.0339
## F-statistic: 2.47 on 1 and 41 DF, p-value: 0.123
summary(apsab) #adjR2 = 0.01003
##
## Call:
## lm(formula = BSAAM ~ APSAB, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41314 -16784 -5101 16492 70942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67152 9689 6.93 2.1e-08 ***
## APSAB 2279 1909 1.19 0.24
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25400 on 41 degrees of freedom
## Multiple R-squared: 0.0336, Adjusted R-squared: 0.01
## F-statistic: 1.43 on 1 and 41 DF, p-value: 0.239
summary(apslake) #adjR2 = 0.0393
##
## Call:
## lm(formula = BSAAM ~ APSLAKE, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -46438 -16907 -5661 19028 69464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63864 9249 6.90 2.2e-08 ***
## APSLAKE 2818 1709 1.65 0.11
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25000 on 41 degrees of freedom
## Multiple R-squared: 0.0622, Adjusted R-squared: 0.0393
## F-statistic: 2.72 on 1 and 41 DF, p-value: 0.107
anova(apslake, apmam, apsab)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ APSLAKE
## Model 2: BSAAM ~ APMAM
## Model 3: BSAAM ~ APSAB
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 41 2.57e+10
## 2 41 2.58e+10 0 -1.44e+08
## 3 41 2.64e+10 0 -6.38e+08
Try the adding the two stations (severally) that had adj. R2 over 0.3
opslake.apslake <- lm(BSAAM ~ OPSLAKE + APSLAKE, data = larain)
summary(opslake.apslake) # AdjR2 = 0.9002
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APSLAKE, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13336 -5893 -172 4220 19500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19145 3812 5.02 1.1e-05 ***
## OPSLAKE 3690 196 18.83 < 2e-16 ***
## APSLAKE 1769 554 3.19 0.0027 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8060 on 40 degrees of freedom
## Multiple R-squared: 0.905, Adjusted R-squared: 0.9
## F-statistic: 190 on 2 and 40 DF, p-value: <2e-16
opslake.apmam <- lm(BSAAM ~ OPSLAKE + APMAM, data = larain)
summary(opslake.apmam) # AdjR2 = 0.8948
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE + APMAM, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16354 -4787 -616 4730 20554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19423 4055 4.79 2.3e-05 ***
## OPSLAKE 3693 201 18.35 < 2e-16 ***
## APMAM 1147 415 2.77 0.0085 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8280 on 40 degrees of freedom
## Multiple R-squared: 0.9, Adjusted R-squared: 0.895
## F-statistic: 180 on 2 and 40 DF, p-value: <2e-16
anova(opslake.apslake, opslake.apmam)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE + APSLAKE
## Model 2: BSAAM ~ OPSLAKE + APMAM
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 2.60e+09
## 2 40 2.74e+09 0 -1.39e+08
anova(opslake.apslake, opslake)
## Analysis of Variance Table
##
## Model 1: BSAAM ~ OPSLAKE + APSLAKE
## Model 2: BSAAM ~ OPSLAKE
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 40 2.60e+09
## 2 41 3.26e+09 -1 -6.63e+08 10.2 0.0027 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coefopsapslake <- coef(opslake.apslake)
The results suggest that the opslake + apslake model is the best, and significantly better than opslake alone.
**Check the Residuals:
par(mfrow = c(1, 2))
plot(opslake$residuals ~ opslake$fitted.values, main = "opslake Residuals",
xlab = "Fitted Values", ylab = "Residuals")
plot(opslake.apslake$residuals ~ opslake.apslake$fitted.values, main = "ops/apslake Resids",
xlab = "Fitted Values", ylab = "Residuals")
From my eye, these look close, maybe slightly better.I also note no curvature, so linear is probably adequate.
I would select a simple multiple regression of opslake and apslake.
Use regsubsets to select the best regression models (linear or otherwise)
# regsubsets(DEPENDENT~.,data= data.frame) Using the period after the ~
# means that you're ogin to try ALL other variables as independent
# variables.
library(HH) # Need HH loaded.
## Loading required package: lattice
## Loading required package: grid
## Loading required package: multcomp
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: splines
## Loading required package: leaps
## Loading required package: RColorBrewer
## Loading required package: latticeExtra
subsetsmodel <- regsubsets(BSAAM ~ ., data = larain)
summary(subsetsmodel)
## Subset selection object
## Call: regsubsets.formula(BSAAM ~ ., data = larain)
## 7 Variables (and intercept)
## Forced in Forced out
## Year FALSE FALSE
## APMAM FALSE FALSE
## APSAB FALSE FALSE
## APSLAKE FALSE FALSE
## OPBPC FALSE FALSE
## OPRC FALSE FALSE
## OPSLAKE FALSE FALSE
## 1 subsets of each size up to 7
## Selection Algorithm: exhaustive
## Year APMAM APSAB APSLAKE OPBPC OPRC OPSLAKE
## 1 ( 1 ) " " " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " "*" " " " " "*"
## 3 ( 1 ) " " " " " " "*" " " "*" "*"
## 4 ( 1 ) "*" " " " " "*" " " "*" "*"
## 5 ( 1 ) "*" " " "*" "*" " " "*" "*"
## 6 ( 1 ) "*" "*" "*" "*" " " "*" "*"
## 7 ( 1 ) "*" "*" "*" "*" "*" "*" "*"
# The asterisks indicate the variables to use.
coef(subsetsmodel, 1:3) # See the results coefficients.
## [[1]]
## (Intercept) OPSLAKE
## 27015 3752
##
## [[2]]
## (Intercept) APSLAKE OPSLAKE
## 19145 1769 3690
##
## [[3]]
## (Intercept) APSLAKE OPRC OPSLAKE
## 15425 1712 1797 2390
names(subsetsmodel) # See the names of variables stored in the regsubsets object.
## [1] "np" "nrbar" "d" "rbar" "thetab"
## [6] "first" "last" "vorder" "tol" "rss"
## [11] "bound" "nvmax" "ress" "ir" "nbest"
## [16] "lopt" "il" "ier" "xnames" "method"
## [21] "force.in" "force.out" "sserr" "intercept" "lindep"
## [26] "nullrss" "nn" "call"