GEOG 4023/5023 - Quantitative Methods
Spring 2017
Carson J. Q. Farmer
carson.farmer@colorado.edu
@carsonfarmer
carsonfarmer.com
\( Y = b_0 + b_1x + \epsilon \)
Normal distribution describing \( Y \)\[ \hat{Y} = \mathbf{a} + bx \]
\[ \hat{Y} = a + \mathbf{b}x \]
\[ \hat{Y}_i - \hat{Y}_i \]
\[ \min_{a, b} \sum_{i=1}^n (Y_i - \hat{Y})^2 = \min_{a, b} \sum_{i=1}^n (Y_i - a + b x_i)^2 \]
source("http://www.mosaic-web.org/go/datasets/mLineFit.R")
library(mosaicData)
data("KidsFeet")
mLineFit(width ~ length, data=KidsFeet)
install.packages(c("manipulate", "mosaicData"))
Utilities built-in dataset contains data from utility bills at a residencedata.frame containing 117 obervations for several variables
month, day, year, temp, ccf (gas usage), kwh (elec. usage), etc.lm() function to fit a linear modellibrary(dplyr)
data("Utilities") # Load some builtin data
Utilities %>%
select(month, day, year, temp, ccf, kwh) %>%
summary() # Compute summary of subset of columns
month day year temp
Min. : 1.000 Min. :23.00 Min. :1999 Min. : 9.00
1st Qu.: 4.000 1st Qu.:26.00 1st Qu.:2002 1st Qu.:30.00
Median : 7.000 Median :26.00 Median :2005 Median :51.00
Mean : 6.564 Mean :26.59 Mean :2005 Mean :48.66
3rd Qu.: 9.000 3rd Qu.:28.00 3rd Qu.:2007 3rd Qu.:69.00
Max. :12.000 Max. :36.00 Max. :2010 Max. :78.00
ccf kwh
Min. : 0.00 Min. : 160
1st Qu.: 16.00 1st Qu.: 583
Median : 60.00 Median : 790
Mean : 83.44 Mean : 751
3rd Qu.:144.00 3rd Qu.: 893
Max. :242.00 Max. :1213
plot(Utilities) # Mine looks slightly different
m <- lm(ccf ~ temp, data=Utilities) # Fit a simple model
summary(m)
Call:
lm(formula = ccf ~ temp, data = Utilities)
Residuals:
Min 1Q Median 3Q Max
-93.892 -14.892 -2.963 13.075 84.725
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 251.4872 5.2876 47.56 <2e-16 ***
temp -3.4535 0.1002 -34.48 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.19 on 115 degrees of freedom
Multiple R-squared: 0.9118, Adjusted R-squared: 0.911
F-statistic: 1189 on 1 and 115 DF, p-value: < 2.2e-16
# 'Recode' temp variable
Utilities <- mutate(Utilities, temp = temp - mean(temp))
# Refit model and get coefficients summary
summary(lm(ccf ~ temp, data=Utilities))$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.44444 2.0512277 40.68024 4.152433e-70
temp -3.45354 0.1001587 -34.48069 1.792965e-62
Intercept and coefficient on temp
\[ t = \frac{\text{observed diff} - \text{expected diff}}{\text{Std Err}} = \frac{(\hat{\beta} - 0) - 0}{\text{Std Err}} \]
-3.45/0.10 = -34.5# This should look familiar
sigma <- matrix(c(1, 0.1, 0.1, 1), nrow=2)
mu <- c(0, 0)
# Generate multi-variate random normal data
df <- as.data.frame(mvrnorm(10000, mu, sigma))
plot(V2 ~ V1, data=df)
V1 and V2
library(mosaic)
# Run 1000 regressions on samples of 100 each
betas <- do(1000) * coef(lm(V2 ~ V1, data=sample(df, 100)))
V1)hist(betas$V1)
sd(~V1, data=betas) # Print standard deviation
[1] 0.1003523
# Rerun on another sample
mod <- lm(V2 ~ V1, data=sample(df, 100))
summary(mod)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.07512985 0.09210434 0.8157037 0.4166478
V1 0.09385312 0.09076209 1.0340563 0.3036549
sd) is close to estimate of standard error of slope from regression (based on sample)lm does good job of estimating random variation in regression coefficientssummary(lm(ccf ~ temp, data=Utilities))
Coefficients table \( \downarrow \) tell us? Estimate Std. Error t value Pr(>|t|)
(Intercept) 83.44444 2.0512277 40.68024 4.152433e-70
temp -3.45354 0.1001587 -34.48069 1.792965e-62
Multiple R-squared \( \downarrow \) tell us?Residual standard error:22.19 on 115 degrees of freedom
Multiple R-squared: 0.9118, Adjusted R-squared: 0.911
F-statistic: 1189 on 1 and 115 DF, p-value: < 2.2e-16
ccf = 83.444 - 3.454tempccf: cubic centimeters of fueltemp (degrees F) coefficient…
girth of tree may be strongly related to its age
girth (\( Y \)) is partially explained by tree age \( X \)girth is explained by tree age?age (\( X \)) – the explanatory variable…girth (\( Y \))
\[ \color{red}{ \sum_{i=1}^n (Y_i - \bar{Y}_i)^2 } = \color{green}{ \sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2 } + \color{blue}{ \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 } \]
\[ R^2 = \frac{ \color{green}{ \sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2} }{\color{red}{\sum_{i=1}^n (Y_i - \bar{Y}_i)^2} } \]
\[ R^2 = \frac{\text{regression}_{ss}}{\text{total}_{ss}} = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y}_i)^2} \]


wage ~ sex*married: \( R^2 \) = 0.07ccf ~ temp: \( R^2 \) = 0.91R?
summary() on the modelvar(fitted)/var(resid)?var(fitted)/var(response) has nice property of being easy to interpretCan think of a regression model as having two parts:
Expected value of error is zero (\( \text{E}(\epsilon) = 0 \))
# Using WHO dataset again...
url <- "https://www.dropbox.com/s/l0j6wjue5armi4q/WHO.csv?dl=1"
who <- read.csv(url)
# Fitting simple model
mod <- lm(IMR ~ log(GNI), data=who)
confint(mod)
2.5 % 97.5 %
(Intercept) 212.91745 260.11324
log(GNI) -25.65125 -20.24836
new <- data.frame(GNI = c(6.5, 7.5, 8, 9))
predict(mod, newdata = new, interval = 'confidence')
fit lwr upr
1 193.5579 174.9470 212.1687
2 190.2737 172.0426 208.5048
3 188.7926 170.7326 206.8526
4 186.0895 168.3417 203.8373
plot(mod) # Plot standard diagnostic plots