Lecture 5 - Regression Modeling

GEOG 4023/5023 - Quantitative Methods
Spring 2017

What is regression?

\( Y = b_0 + b_1x + \epsilon \)

  • Used for both prediction and inference
    • Probably concerned with inference in 'science'
  • Models the mean of a dependent variable (\( Y \)) based on…
    • Value(s) of 1 or more independent variables (\( \mathbf{X} \))
    • Does \( \mathbf{X} \) affect \( Y \)? If so, in what way? How?
    • What is change in \( Y \) given change in \( \mathbf{X} \)?

Regression models

  • Regression models assume…
    1. There is a distribution of \( Y \) for each level of \( \mathbf{X} \)
    2. The mean of distribution depends on \( \mathbf{X} \)
    3. The Std Dev of distribution does not depend on \( \mathbf{X} \)

Visualizing conditional distributions

Mean of \( Y \), conditional on \( X \)

  • Mean of \( Y \), conditional on \( \mathbf{X} \) \( \uparrow \)
  • Std Dev of \( Y \), not conditional on \( \mathbf{X} \) \( \uparrow \)

Critical assumptions

  • This simple idea…
    • For each level of \( \mathbf{X} \) there is Normal distribution describing \( Y \)
  • Homoskedasticity
    • Std Dev of distribution not conditional on \( \mathbf{X} \)!

Heteroskedasticity

plot of chunk unnamed-chunk-1

Heteroskedasticity

plot of chunk unnamed-chunk-2

Something to remember

  • Regression analysis a simplification of reality…
    • Provides us with simplified view of relationship between 2 or more variables
    • A way of fitting a model to our data
    • Means for evaluating importance of variables to outcome of interest
    • A way to assess fit of model and to compare models

\[ Y = \beta_0 + \mathbf{\beta}_1\mathbf(X) + \epsilon \\ Y_i = \beta_0 + \beta_1X_{i1} + \beta_2X_{i2} + \dots + \epsilon_i \]

“Remember that all models are wrong; the practical question is how wrong they have to be to be not useful.”

  • George Box

plot of chunk unnamed-chunk-3

What would make a good line?

plot of chunk unnamed-chunk-4

Which of these lines is 'best'?

What are the criteria for a good line through the data?

plot of chunk unnamed-chunk-5

Are any of these lines the 'best' possible line?

Least squares

  • A good line minimizes distance between each data point and 'line'
  • Means our model (currently just a line) is as close as possible to our data
  • Referred to as the least squares line
  • To get this line we have to…
    • Estimate some parameters
    • Simplest method is via ordinary least squares (OLS)

All about residuals

  • Good model is one that…
    • Describes data well
  • Can measure how well model describes data by…
    • Looking at residuals
  • Models with small residuals said to fit the data
  • Process of doing a regression somtimes called…
    • Fitting a model to the data

All about residuals

  • Regression model
    • \( Y = \alpha + \beta x + \epsilon \)
  • Estimate of \( Y \) conditional on \( x \)
    • \( \hat{Y} = a + b x \)
  • Difference between observed and estimated is residual
    • \( Y_i - \hat{Y} = \text{resid} \)

Anatomy of a regression line

plot of chunk unnamed-chunk-6

\[ \hat{Y} = \mathbf{a} + bx \]

Anatomy of a regression line

plot of chunk unnamed-chunk-7

\[ \hat{Y} = a + \mathbf{b}x \]

Anatomy of a regression line

plot of chunk unnamed-chunk-8

\[ \hat{Y}_i - \hat{Y}_i \]

Least squares

  • Want to find line that minimizes residuals
    • Difference between regression prediction and observed data
    • Line with least square error

\[ \min_{a, b} \sum_{i=1}^n (Y_i - \hat{Y})^2 = \min_{a, b} \sum_{i=1}^n (Y_i - a + b x_i)^2 \]

Visualizing the squared residuals

  • Let's visualize sum of squared errors to provide some intuition
  • First, we load some useful functions and data
source("http://www.mosaic-web.org/go/datasets/mLineFit.R")
library(mosaicData)
data("KidsFeet")
  • And then we visualize in RStudio
mLineFit(width ~ length, data=KidsFeet)
  • If you can an error here, you might need to…
install.packages(c("manipulate", "mosaicData"))

An example

  • The Utilities built-in dataset contains data from utility bills at a residence
  • Its a data.frame containing 117 obervations for several variables
    • month, day, year, temp, ccf (gas usage), kwh (elec. usage), etc.
  • Let's model relationship between outside temperature and gas usage
    • Use the lm() function to fit a linear model

An example

library(dplyr)
data("Utilities")  # Load some builtin data
Utilities %>%
  select(month, day, year, temp, ccf, kwh) %>%
  summary()  # Compute summary of subset of columns
     month             day             year           temp      
 Min.   : 1.000   Min.   :23.00   Min.   :1999   Min.   : 9.00  
 1st Qu.: 4.000   1st Qu.:26.00   1st Qu.:2002   1st Qu.:30.00  
 Median : 7.000   Median :26.00   Median :2005   Median :51.00  
 Mean   : 6.564   Mean   :26.59   Mean   :2005   Mean   :48.66  
 3rd Qu.: 9.000   3rd Qu.:28.00   3rd Qu.:2007   3rd Qu.:69.00  
 Max.   :12.000   Max.   :36.00   Max.   :2010   Max.   :78.00  
      ccf              kwh      
 Min.   :  0.00   Min.   : 160  
 1st Qu.: 16.00   1st Qu.: 583  
 Median : 60.00   Median : 790  
 Mean   : 83.44   Mean   : 751  
 3rd Qu.:144.00   3rd Qu.: 893  
 Max.   :242.00   Max.   :1213  
plot(Utilities)  # Mine looks slightly different

plot of chunk unnamed-chunk-14

m <- lm(ccf ~ temp, data=Utilities)  # Fit a simple model
summary(m)

Call:
lm(formula = ccf ~ temp, data = Utilities)

Residuals:
    Min      1Q  Median      3Q     Max 
-93.892 -14.892  -2.963  13.075  84.725 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 251.4872     5.2876   47.56   <2e-16 ***
temp         -3.4535     0.1002  -34.48   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 22.19 on 115 degrees of freedom
Multiple R-squared:  0.9118,    Adjusted R-squared:  0.911 
F-statistic:  1189 on 1 and 115 DF,  p-value: < 2.2e-16

Is this \( \uparrow \) a good model?

plot of chunk unnamed-chunk-16

plot of chunk unnamed-chunk-17

Interpreting parameters

  • In a simple regression model we have two parameters
    • Slope - Expected change in \( Y \) for one unit change in \( X \)
    • Intetcept - Expected for of \( Y \) when \( X \) is zero

The intercept

  • Sometimes \( X \) cannot logically equal zero
    • Useful to transform variable so intercept is interpretatble
  • Example
    • Subtracting mean from each observation so that \( X=0 \) is the mean of \( X \)
    • Other transformations are sometimes useful

Changing meaning of intercept

 # 'Recode' temp variable
Utilities <- mutate(Utilities, temp = temp - mean(temp))
# Refit model and get coefficients summary
summary(lm(ccf ~ temp, data=Utilities))$coefficients
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 83.44444  2.0512277  40.68024 4.152433e-70
temp        -3.45354  0.1001587 -34.48069 1.792965e-62

Take a look \( \uparrow \) at the Intercept and coefficient on temp

Hypothesis tests on coefficients

  • Can evaluate models by looking at hypothesis tests included in output
    • Understand t-test on regression coefficients
    • Derive \( R^2 \) value

Hypothesis tests about coefficients

  • Each regression coefficient is t-tested
  • Null hypothesis for test is that coefficient is zero (\( \beta = 0 \))

\[ t = \frac{\text{observed diff} - \text{expected diff}}{\text{Std Err}} = \frac{(\hat{\beta} - 0) - 0}{\text{Std Err}} \]

  • In previous slide
    • -3.45/0.10 = -34.5

Why do t-tests work on coefficients?

  • Regression coefficients have Normal sampling distribution
  • In regression, standard errors of distribution are estimated
    • Estimate is used to construct t-test
  • Can simulate standard error of coefficients under null hypothesis…
    • Simulated results should be similar to estimate

Simulating standard errors

# This should look familiar
sigma <- matrix(c(1, 0.1, 0.1, 1), nrow=2)
mu <- c(0, 0)
# Generate multi-variate random normal data
df <- as.data.frame(mvrnorm(10000, mu, sigma))
  • First, we'll plot the data
plot(V2 ~ V1, data=df)

plot of chunk unnamed-chunk-21

Our null world: No strong relationship betwen V1 and V2

Simulating standard errors

library(mosaic)
# Run 1000 regressions on samples of 100 each
betas <- do(1000) * coef(lm(V2 ~ V1, data=sample(df, 100)))
  • Plot distribution of slope (coefficient on V1)
hist(betas$V1)

plot of chunk unnamed-chunk-24

Our null world: Note how coefficient reflects specified covariance!

Simulating standard errors

sd(~V1, data=betas)  # Print standard deviation
[1] 0.1003523
# Rerun on another sample
mod <- lm(V2 ~ V1, data=sample(df, 100))
summary(mod)$coefficients
              Estimate Std. Error   t value  Pr(>|t|)
(Intercept) 0.07512985 0.09210434 0.8157037 0.4166478
V1          0.09385312 0.09076209 1.0340563 0.3036549
  • Varition in simulated coefficients (sd) is close to estimate of standard error of slope from regression (based on sample)
  • lm does good job of estimating random variation in regression coefficients

An example

summary(lm(ccf ~ temp, data=Utilities))
  • What does Coefficients table \( \downarrow \) tell us?
            Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 83.44444  2.0512277  40.68024 4.152433e-70
temp        -3.45354  0.1001587 -34.48069 1.792965e-62
  • What does Multiple R-squared \( \downarrow \) tell us?
Residual standard error:22.19 on 115 degrees of freedom
Multiple R-squared: 0.9118, Adjusted R-squared: 0.911
F-statistic: 1189 on 1 and 115 DF,  p-value: < 2.2e-16

A note about coefficients...

  • Coefficients have units…
    • ccf = 83.444 - 3.454temp
  • Required to transform to units of response/dependent variable
    • ccf: cubic centimeters of fuel
  • So units on temp (degrees F) coefficient…
    • ccf per degree F
  • An the intercept is simply ccf

How much does a model 'explain'?

  • Remember, regression is a model of mean \( Y \) conditional on \( \mathbf{X} \)'s
  • Basic idea is that mean of \( Y \) is somehow affected by \( \mathbf{X} \)
  • Change in \( \mathbf{X} \) of 1 associated with change of \( \beta \) in \( Y \)
    • Need a way to tell how much of change in \( Y \) is attributable to changes in \( \mathbf{X} \)

Decomposing variability in Y

  • Deviations from mean can be partially explained by covariate \( X \)
    • Some fraction of deviations from mean will be due to \( X \)
    • Some fraction of deviations will be unexplained by \( X \)
  • Example: girth of tree may be strongly related to its age
    • Tree girth (\( Y \)) is partially explained by tree age \( X \)
    • How much variability in tree girth is explained by tree age?

Decomposing variance in Y

  • If we disregard tree age (\( X \)) – the explanatory variable…
  • Could compute the variance in girth (\( Y \))
    • Sum of squared deviations from mean
  • That variance is not just noise…
    • Some trees are thicker than others because of … {insert \( X \) or other explanatory variable}…

Relating total variance

  • If we add up…
    • Squared difference between model predictions (\( \hat{Y} \)) and mean (\( \bar{Y} \))
    • The squared residuals (\( Y_i - \hat{Y} \))
  • Should get the total squared deviations from the mean

Decomposing variability in Y

\[ \color{red}{ \sum_{i=1}^n (Y_i - \bar{Y}_i)^2 } = \color{green}{ \sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2 } + \color{blue}{ \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 } \]

  • Total sum of squares (TSS)
  • Regression (explained) sum of squares (ESS)
  • Residual sum of squares (RSS)

\[ R^2 = \frac{ \color{green}{ \sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2} }{\color{red}{\sum_{i=1}^n (Y_i - \bar{Y}_i)^2} } \]

  • Proportion of total explained variation in \( Y \) is called coefficient of determination

Coefficient of determination

  • Percent of the total sum of squares explained by regression

\[ R^2 = \frac{\text{regression}_{ss}}{\text{total}_{ss}} = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y}_i)^2}{\sum_{i=1}^n (Y_i - \bar{Y}_i)^2} \]

Coefficient of determination

  • \( R^2 \) of 0.63 indicates variable(s) in model explain 63% of variation in response/outcome/dependent variable
  • Large \( R^2 \) does not necessarily imply a 'good' model
  • \( R^2 \) does not…
    • Measure magnitude of slope
    • Measure appropriateness of model
  • As predictors are added to model, \( R^2 \) must increase
    • Adjusted-\( R^2 \) 'controls' for number of predictors in model (so does F-statistic)
    • Generally, you should report Adjusted-\( R^2 \)

Sir Francis Galton -- 1888!

Sir Francis Galton, probably taken in the 1850s or early 1860s

From a lecture presented by Galton in 1877

Compute and compare R-squared

  • Captures over half the variance in response
    • \( R^2 \) = 7.21 in\( ^2 \) / 12.84 in\( ^2 \) = 0.56 (no units)
  • Ratio of variances
    • Can compare any combination of models
    • Fraction of variance in response account for my model…
    • wage ~ sex*married: \( R^2 \) = 0.07
    • ccf ~ temp: \( R^2 \) = 0.91
  • Compute \( R^2 \) in R?
    • summary() on the model

Ratios

  • Why not, say var(fitted)/var(resid)?
  • That we be reasonable too…
    • var(fitted)/var(response) has nice property of being easy to interpret
    • Always between 0 and 1
  • More technical statistic, F-statistic \[ F = \frac{\frac{\text{Var(fitted)}}{m-1}}{\frac{\text{Var(resid)}}{n-m-1}} \]
    • Nice properties in terms of shape of probability distribution
    • Fundamentally no advantage over \( R^2 \)

How big is big enough?

  • Depends on purpose
  • \( R^2 \) = 0.01 on daily stock-market prices…
    • Plenty to become richest person around!
  • General 'consensus'…
    • Bigger than 0.15… probably more like > 0.45
  • Better to consider F-statistic and its p-value

A note about residuals

  • Unless \( R^2 \) is 100%…
    • Will always be some amount of variation in \( Y \) which remains unexplained by \( \mathbf{X} \)
    • Unexplained variation is the error (\( \epsilon \)) component of the regression equation

Signal and noise

  • Can think of a regression model as having two parts:

    1. Predicts average of \( Y \) for given value/combination of \( \mathbf{X} \)
    2. Random error!
  • Expected value of error is zero (\( \text{E}(\epsilon) = 0 \))

    • Assume a Normal distribution with independent errors (iid)

The model is an estimate

  • Don't know the truth…
    • Only estimates based on data
  • Confidence intervals for parameters
  • Since model parameters (slope and intercept) are estimated… *Prediction should account for this uncertainty

Intervals for parameters

# Using WHO dataset again...
url <- "https://www.dropbox.com/s/l0j6wjue5armi4q/WHO.csv?dl=1"
who <- read.csv(url)
# Fitting simple model
mod <- lm(IMR ~ log(GNI), data=who)
confint(mod)
                2.5 %    97.5 %
(Intercept) 212.91745 260.11324
log(GNI)    -25.65125 -20.24836
  • Confidence intervals \( \uparrow \) for the regression coefficients

Prediction

new <- data.frame(GNI = c(6.5, 7.5, 8, 9))
predict(mod, newdata = new, interval = 'confidence')
       fit      lwr      upr
1 193.5579 174.9470 212.1687
2 190.2737 172.0426 208.5048
3 188.7926 170.7326 206.8526
4 186.0895 168.3417 203.8373
plot(mod)  # Plot standard diagnostic plots

plot of chunk unnamed-chunk-32

  • Violating assumptions?
    • Looks like \( \uparrow \) we have some outliers?

plot of chunk unnamed-chunk-33

  • Histograms and residuals…

Next time on, Quantitative Methods...

  • Multiple regression and model diagnostics
  • Generalized linear models (GLMs)
  • Logistic and Poisson regression
  • Modeling spatial data!