object <- lm(Y ~ X1+X2+…,data=) creates a linear model. We assign the model to an object so that we can extract information from that object later.
data= is for the data frame you're using. Unnecessary if you're diligent about the $ notation.
# Load Data
larain <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/LA_RAIN.csv",
header = TRUE)
names(larain)
## [1] "Year" "APMAM" "APSAB" "APSLAKE" "OPBPC" "OPRC" "OPSLAKE"
## [8] "BSAAM"
opslake.bsaam <- lm(BSAAM ~ OPSLAKE, data = larain)
After running the model, use extractor function to learn about it.
summary() is the best starting point. It returns most of the data you'd be looking for.
summary(opslake.bsaam)
##
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = larain)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17604 -5338 332 3411 20876
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27015 3219 8.39 1.9e-10 ***
## OPSLAKE 3752 216 17.39 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881, Adjusted R-squared: 0.878
## F-statistic: 303 on 1 and 41 DF, p-value: <2e-16
names(opslake.bsaam) # 12 columns!
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
Reading the summary
1) Call:
lm(formula = BSAAM ~ OPSLAKE, data = larain)
This shows you the formula for the model.
2) Residuals:
Min 1Q Median 3Q Max
-17603.8 -5338.0 332.1 3410.6 20875.6
Median should be near 0 (not great here…) and the 1Q and 3Q and min and max should be roughly equal in magnitude (again, ehh…).
3) Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27014.6 3218.9 8.393 1.93e-10
OPSLAKE 3752.5 215.7 17.394 < 2e-16
The estimate provides the magnitude of the relationship. The standard error is reported (rather than standard deviation) because the coefficient is an estimated value for the real relationship. The t statistic and p-value are the t-test performed to determine if the coefficient (and intercept) are different from 0 (i.e. to determine if the relationship estimated really exists).
4) Residual standard error: 8922 on 41 degrees of freedom
This is the variation of the residuals about the regression line.
5) Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
The R-squared value is percentage of variation in y that can be explained by the model. Adjusted R-squared is more conservative, penalizing the use of more independent variables to increase explanatory power. R2 = 1-(RSS/TSS) = MSS/TSS
6) F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
This tests whether the entire model is greater than 0. Not interesting if there's only one coefficient because then it's just redundant.
plot(Y ~ X) will plot the variables on a scatter plot.
abline(lm(Y ~ X)) will plot a regression line on the scatter plot.
plot(larain$BSAAM ~ larain$OPSLAKE, main = "Predicting Stream Flow based on Precipitation",
xlab = "Precip at OPSLAKE", ylab = "Streamflow at BSAAM")
abline(lm(BSAAM ~ OPSLAKE, data = larain))