Dalgaard Chapter 6 Functions – Alex Crawford

6.1 Simple Linear Regression

Making a linear model

object <- lm(Y ~ X1+X2+…,data=) creates a linear model. We assign the model to an object so that we can extract information from that object later.
data= is for the data frame you're using. Unnecessary if you're diligent about the $ notation.

# Load Data
larain <- read.csv("/Users/telekineticturtle/Desktop/Colorado 13/Quant Methods/Data/LA_RAIN.csv", 
    header = TRUE)
names(larain)
## [1] "Year"    "APMAM"   "APSAB"   "APSLAKE" "OPBPC"   "OPRC"    "OPSLAKE"
## [8] "BSAAM"
opslake.bsaam <- lm(BSAAM ~ OPSLAKE, data = larain)

Exploring the results

After running the model, use extractor function to learn about it.
summary() is the best starting point. It returns most of the data you'd be looking for.

summary(opslake.bsaam)
## 
## Call:
## lm(formula = BSAAM ~ OPSLAKE, data = larain)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -17604  -5338    332   3411  20876 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    27015       3219    8.39  1.9e-10 ***
## OPSLAKE         3752        216   17.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Residual standard error: 8920 on 41 degrees of freedom
## Multiple R-squared: 0.881,   Adjusted R-squared: 0.878 
## F-statistic:  303 on 1 and 41 DF,  p-value: <2e-16
names(opslake.bsaam)  # 12 columns!
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"

Reading the summary
1) Call:
lm(formula = BSAAM ~ OPSLAKE, data = larain)
This shows you the formula for the model.

2) Residuals:
Min 1Q Median 3Q Max
-17603.8 -5338.0 332.1 3410.6 20875.6
Median should be near 0 (not great here…) and the 1Q and 3Q and min and max should be roughly equal in magnitude (again, ehh…).

3) Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27014.6 3218.9 8.393 1.93e-10
OPSLAKE 3752.5 215.7 17.394 < 2e-16
The estimate provides the magnitude of the relationship. The standard error is reported (rather than standard deviation) because the coefficient is an estimated value for the real relationship. The t statistic and p-value are the t-test performed to determine if the coefficient (and intercept) are different from 0 (i.e. to determine if the relationship estimated really exists).

4) Residual standard error: 8922 on 41 degrees of freedom
This is the variation of the residuals about the regression line.

5) Multiple R-squared: 0.8807, Adjusted R-squared: 0.8778
The R-squared value is percentage of variation in y that can be explained by the model. Adjusted R-squared is more conservative, penalizing the use of more independent variables to increase explanatory power. R2 = 1-(RSS/TSS) = MSS/TSS

6) F-statistic: 302.6 on 1 and 41 DF, p-value: < 2.2e-16
This tests whether the entire model is greater than 0. Not interesting if there's only one coefficient because then it's just redundant.

Simple Plotting

plot(Y ~ X) will plot the variables on a scatter plot.
abline(lm(Y ~ X)) will plot a regression line on the scatter plot.

plot(larain$BSAAM ~ larain$OPSLAKE, main = "Predicting Stream Flow based on Precipitation", 
    xlab = "Precip at OPSLAKE", ylab = "Streamflow at BSAAM")
abline(lm(BSAAM ~ OPSLAKE, data = larain))

plot of chunk unnamed-chunk-3

6.2 Residuals and Fitted Values