The following are notes from readings in ‘Linear Models with R’ by Julian Faraway for the course DATA621, ‘Business Analytics and Data Mining’, part of the Master of Science in Data Science program at CUNY SPS.
R libraries used:
library( faraway )
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( HistData )
library( broom )
library( gclus )
library( asbio )
library( retistruct )
library( Matching )
library( MASS )
library( ellipse )
library( lmtest )
library( splines )
library( Amelia )
library( simex )
library( ggcorrplot )
library( leaps )
library( pls )
library( lars )
Introduction
“The formulation of a problem is often more essential than its solution”
Formulate the Problem Correctly
- Understand the Physical Background
- Understand the Objective
- Make sure you know what the client wants
- Put the Problem into Statistical Terms
- observational or experimental data?
- Is there Nonresponse?
- Are there Missing Values?
- How are the Data Coded?
- What are the Units of Measurement?
- Beware of data entry errors and other forms of data corruption
Initial Data Analysis
It is critical to do an initial exploration to get a feel for the data: summary statistics, basic visualizations, and an inspection of data quality.
look at some practice data:
data( pima )
head( pima )## pregnant glucose diastolic triceps insulin bmi diabetes age test
## 1 6 148 72 35 0 33.6 0.627 50 1
## 2 1 85 66 29 0 26.6 0.351 31 0
## 3 8 183 64 0 0 23.3 0.672 32 1
## 4 1 89 66 23 94 28.1 0.167 21 0
## 5 0 137 40 35 168 43.1 2.288 33 1
## 6 5 116 74 0 0 25.6 0.201 30 0
summary( pima )## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Not all missing values are coded as NA. The minimum for diastolic is \(0\); however, that is not a realistic value for a living person, so it is more likely a missing value. Look closer at diastolic:
sort( pima$diastolic )[1:100]## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 0 0 0 24 30 30 38 40 44 44 44 44 46 46 48 48 48 48
## [51] 48 50 50 50 50 50 50 50 50 50 50 50 50 50 52 52 52 52 52 52 52 52 52 52 52
## [76] 54 54 54 54 54 54 54 54 54 54 54 55 55 56 56 56 56 56 56 56 56 56 56 56 56
clean these values up for diastolic and the other similar numeric variables
pima_NA <- pima %>%
mutate( diastolic = na_if( diastolic, 0 ),
glucose = na_if( glucose, 0 ),
triceps = na_if( triceps, 0 ),
insulin = na_if( insulin, 0 ),
bmi = na_if( bmi, 0 ) )
glimpse( pima_NA )## Rows: 768
## Columns: 9
## $ pregnant <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, …
## $ glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139…
## $ diastolic <int> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, 72, …
## $ triceps <int> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, NA, 23, 19, …
## $ insulin <int> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA, NA, 846, 1…
## $ bmi <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA, 37…
## $ diabetes <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.15…
## $ age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, …
## $ test <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
change the test feature to a categorical with descriptive labels:
pima_NA$test <- factor( pima_NA$test )
levels( pima_NA$test ) <- c( 'negative', 'positive' )
summary( pima_NA$test )## negative positive
## 500 268
Do some basic visualization of some features:
par(mfrow=c(1,3))
pimahist <- hist( pima_NA$diastolic, xlab='Diastolic',main='' )
pimadens <- plot( density( pima_NA$diastolic, na.rm=TRUE ), main="" )
pimasort <- plot( sort( pima_NA$diastolic ), ylab = 'Sorted Diastolic' )
Now try again with ggplot:
pimahist <- ggplot( pima_NA, aes( x = diastolic ) ) +
geom_histogram()
pimadens <- ggplot( pima_NA, aes( x = diastolic ) ) +
geom_density()
sdiastolic <- sort( pima_NA$diastolic )
pima_sort <- data.frame( 'sorted' = sdiastolic )
pimasort <- ggplot( pima_sort, aes( y = sorted, x = 1:length( sdiastolic ) ) ) +
geom_point()
grid.arrange( pimahist, pimadens, pimasort, ncol = 3 )## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 35 rows containing non-finite values (stat_bin).
## Warning: Removed 35 rows containing non-finite values (stat_density).
Visualize some bivariate data
p1 <- ggplot( pima_NA, aes( x = diastolic, y = diabetes ) ) +
geom_point()
p2 <- ggplot( pima_NA, aes( x = test, y = diabetes ) ) +
geom_boxplot()
grid.arrange( p1, p2, ncol = 2 )## Warning: Removed 35 rows containing missing values (geom_point).
ggplot2 is a handy library that is more flexible when visualizing complex dataframes. For instance: controlling color/shape/fill by another variable, or creating faceted plots:
#disambiguating a factor variable:
p1 <- ggplot( pima, aes( x = diastolic, y = diabetes, color = factor( test ) ) ) +
geom_point() +
theme( legend.position = 'top',
legend.direction = 'horizontal' )
p2 <- ggplot( pima, aes( x = diastolic, y = diabetes ) ) +
geom_point( size = 1 ) +
facet_grid( ~ factor( test ) )
grid.arrange(p1, p2, ncol=2,top="Disambiguating a factor variable")
On Visualizations: Good graphics are vital in data analysis. They help you avoid mistakes and suggest the form of the modeling to come. They are also important in communicating your analysis to others. Many in your audience or readership will focus on the graphs. This is your best opportunity to get your message across clearly and without misunderstanding. In some cases, the graphics can be so convincing that the formal analysis becomes just a confirmation of what has already been seen.
When to use Linear Modeling
Linear Modeling is used for explaining or modeling the relationship between a response/outcome/output variable and one or more predictor/input/explanatory variable(s).
Simple Regression: modeling an outcome variable with just 1 explanatory variable.
Multiple/Multivariate Regression: modeling an outcome variable with more than 1 explanatory variable.
Regression Objectives
- Prediction of future or unseen responses given specified values of the predictors
- Assessment of an effect, or relationship between, explanatory variables and the response
History of Regression
describing the libration of the moon:
data( manilius )
glimpse( manilius )## Rows: 27
## Columns: 4
## $ arc <dbl> 13.16667, 13.13333, 13.20000, 14.25000, 14.70000, 13.01667, 14.…
## $ sinang <dbl> 0.8836, 0.9996, 0.9899, 0.2221, 0.0006, 0.9308, 0.0602, -0.1570…
## $ cosang <dbl> -0.4682, -0.0282, 0.1421, 0.9750, 1.0000, -0.3654, 0.9982, 0.98…
## $ group <int> 1, 1, 1, 3, 3, 1, 3, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 2, 3, 2, …
The data are divided into three groups based on similarity. Next, compute the sums of the three variables within each group.
moon3 <- manilius %>%
group_by( group ) %>%
summarise( arc_sum = sum( arc ),
sin_sum = sum( sinang ),
cos_sum = sum( cosang ) )
moon3## # A tibble: 3 x 4
## group arc_sum sin_sum cos_sum
## * <int> <dbl> <dbl> <dbl>
## 1 1 118. 8.50 -0.793
## 2 2 140. -6.14 1.74
## 3 3 128. 2.98 7.96
The result is three linear equations in three unknowns to solve:
solve( cbind( 9, moon3$sin_sum, moon3$cos_sum ), moon3$arc_sum )## [1] 14.5445859 -1.4898221 0.1341264
Observe how similar the results are if we fit a linear regression to the original data:
mod <- lm( arc ~ sinang + cosang, manilius )
summary( mod )##
## Call:
## lm(formula = arc ~ sinang + cosang, data = manilius)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37033 -0.07810 0.03655 0.07960 0.31581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.56162 0.03411 426.846 <2e-16 ***
## sinang -1.50458 0.04013 -37.490 <2e-16 ***
## cosang 0.09137 0.04890 1.868 0.074 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1483 on 24 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9822
## F-statistic: 719 on 2 and 24 DF, p-value: < 2.2e-16
The word regression derives from a term of Sir Francis Galton’s: regression to mediocrity
Important: regression to mediocrity refers to a particular statistical finding and is quite a different concept from that of regression modeling.
Using the heights of parents and their offspring, Sir Francis Galton found that outcomes tend toward the mean of the data. That is, tall parents are more likely to have a child shorter than they are, and short parents are more likely to have a child taller than they are.
Here we explore this data:
data( GaltonFamilies )
ggplot( GaltonFamilies, aes( x = midparentHeight,
y = childHeight ) ) +
geom_point( size = 3 )
find the linear regression line that best fits the distribution:
mod <- lm( childHeight ~ midparentHeight, GaltonFamilies )
coef( mod )## (Intercept) midparentHeight
## 22.6362405 0.6373609
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" )## `geom_smooth()` using formula 'y ~ x'
Let’s add the line that describes a relation where parents have children the same height as they are…
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" ) +
geom_line( aes( y = midparentHeight ), color = 'red' )## `geom_smooth()` using formula 'y ~ x'
Now suppose children's height is fully correlated with their parents' height. We can use the following equation: \[\frac{y-\bar{y}}{SD_y} = r\frac{x-\bar{x}}{SD_x}\] where the correlation \(r = 1\), to find the coefficients of a line describing this:
beta <- with( GaltonFamilies,
sd( childHeight )/ sd( midparentHeight ) )
alpha <- with( GaltonFamilies,
mean( childHeight ) -
beta * mean( midparentHeight ) )
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" ) +
geom_line( aes( y = midparentHeight ), color = 'red' ) +
geom_abline(data=GaltonFamilies, aes(slope=beta, intercept=alpha ), color='green' )## `geom_smooth()` using formula 'y ~ x'
Regression to the Mean: We can see that a child of tall parents is predicted by the least squares line to have a height which is above average but not quite as tall as the parents as the green line would have you believe. Similarly children of below average height parents are predicted to have a height which is still below average but not quite as short as the parents. This is why Galton used the phrase “regression to mediocrity” and the phenomenon is sometimes called the regression effect.
Recommended Exercises:
1.1 The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
The teengamb dataset:
data( teengamb )
glimpse( teengamb )## Rows: 47
## Columns: 5
## $ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ status <int> 51, 28, 37, 28, 65, 61, 28, 27, 43, 18, 18, 43, 30, 28, 38, 38,…
## $ income <dbl> 2.00, 2.50, 2.00, 7.00, 2.00, 3.47, 5.50, 6.42, 2.00, 6.00, 3.0…
## $ verbal <int> 8, 8, 6, 4, 8, 6, 7, 5, 6, 7, 6, 6, 4, 6, 6, 8, 8, 5, 8, 9, 8, …
## $ gamble <dbl> 0.00, 0.00, 0.00, 7.30, 19.60, 0.10, 1.45, 6.60, 1.70, 0.10, 0.…
teengamb_clean <- teengamb %>%
mutate( sex = factor( sex ),
status = factor( status ) )
summary( teengamb_clean )## sex status income verbal gamble
## 0:28 28 : 7 Min. : 0.600 Min. : 1.00 Min. : 0.0
## 1:19 38 : 5 1st Qu.: 2.000 1st Qu.: 6.00 1st Qu.: 1.1
## 51 : 5 Median : 3.250 Median : 7.00 Median : 6.0
## 71 : 5 Mean : 4.642 Mean : 6.66 Mean : 19.3
## 18 : 4 3rd Qu.: 6.210 3rd Qu.: 8.00 3rd Qu.: 19.4
## 43 : 4 Max. :15.000 Max. :10.00 Max. :156.0
## (Other):17
The numerical summary shows the distribution of subjects in each group for the categorical variables sex and status as well as the center/spread of the numerical features: income (pounds per week), verbal, and gamble (gambling expenditure in pounds per year).
and now to visualize the data:
#visualize the numeric features by group
si <- ggplot( teengamb_clean, aes( x = sex, y = income, fill = sex ) ) +
geom_boxplot()
sv <- ggplot( teengamb_clean, aes( x = sex, y = verbal, fill = sex ) ) +
geom_boxplot()
sg <- ggplot( teengamb_clean, aes( x = sex, y = gamble, fill = sex ) ) +
geom_boxplot()
grid.arrange( si, sv, sg, ncol = 3 )
sti <- ggplot( teengamb_clean, aes( x = status, y = income, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
stv <- ggplot( teengamb_clean, aes( x = status, y = verbal, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
stg <- ggplot( teengamb_clean, aes( x = status, y = gamble, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
grid.arrange( sti, stv, stg, ncol = 3 )
For sex, both genders have very similar distributions of income and verbal scores. However, one gender has a very different distribution of annual gambling expenditure.
There appears to be a linear relationship between verbal scores and status although for other features, the relationship is less clear. However, without knowing the meaning behind the status designations, it is difficult to tell.
Look at how some variables covary:
ggplot( teengamb_clean, aes( y = gamble, x = income, color = sex ) ) + geom_point( size = 3 ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = sex))## `geom_smooth()` using formula 'y ~ x'
If we plot the amount gambled against weekly income, we see two very different trends for the two genders. Males (0) increase their gambling as a function of weekly income, whereas females (1) do not tend to gamble and show no trend of increasing gambling with increasing income.
Let’s take a look at the coefficients for these regression lines:
teengamb_clean %>% group_by( sex ) %>%
do( gambling = broom::tidy( lm( gamble ~ income, data = .))) %>%
unnest( gambling )## # A tibble: 4 x 6
## sex term estimate std.error statistic p.value
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 (Intercept) -2.66 8.02 -0.332 0.743
## 2 0 income 6.52 1.25 5.20 0.0000201
## 3 1 (Intercept) 3.14 2.33 1.35 0.195
## 4 1 income 0.175 0.479 0.365 0.719
For every one-GBP increase in weekly income, a male British teen gambles about 6.5 GBP more annually. For females, the increase is a negligible 0.17 GBP.
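An alternative to fitting the two sexes separately is a single model with an income-by-sex interaction. This is a sketch, not part of the text; gamble_int_mod is a name introduced here:
#sketch: one model with an income-by-sex interaction instead of two separate fits
gamble_int_mod <- lm( gamble ~ income * sex, data = teengamb_clean )
#the income coefficient is the slope for the reference level (sex = 0, males);
#the income:sex1 coefficient is the change in slope for females
sumary( gamble_int_mod )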
1.3 The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.
data( prostate, package = 'faraway' )
glimpse( prostate )## Rows: 97
## Columns: 9
## $ lcavol <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
## $ lweight <dbl> 2.7695, 3.3196, 2.6912, 3.2828, 3.4324, 3.2288, 3.4735, 3.5395…
## $ age <int> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
## $ lbph <dbl> -1.386294, -1.386294, -1.386294, -1.386294, -1.386294, -1.3862…
## $ svi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lcp <dbl> -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1…
## $ gleason <int> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
## $ pgg45 <int> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
## $ lpsa <dbl> -0.43078, -0.16252, -0.16252, -0.16252, 0.37156, 0.76547, 0.76…
let’s learn what the variables are before proceeding:
- lcavol -> log(cancer volume)
- lweight -> log(prostate weight)
- age -> age (years)
- lbph -> log(benign prostatic hyperplasia amount)
- svi -> seminal vesicle invasion
- lcp -> log(capsular penetration)
- gleason -> Gleason score
- pgg45 -> percentage Gleason scores 4 or 5
- lpsa -> log(prostate specific antigen)
observe summaries of each feature:
# svi appears to be binary, so make that adjustment:
prostate_clean <- prostate %>%
mutate( svi = factor( svi ) )
summary( prostate_clean )## lcavol lweight age lbph svi
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863 0:76
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863 1:21
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## lcp gleason pgg45 lpsa
## Min. :-1.3863 Min. :6.000 Min. : 0.00 Min. :-0.4308
## 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00 1st Qu.: 1.7317
## Median :-0.7985 Median :7.000 Median : 15.00 Median : 2.5915
## Mean :-0.1794 Mean :6.753 Mean : 24.38 Mean : 2.4784
## 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00 3rd Qu.: 3.0564
## Max. : 2.9042 Max. :9.000 Max. :100.00 Max. : 5.5829
generate a matrix plot ordered by degree of correlation:
dta <- prostate_clean %>%
dplyr::select( -c( svi, gleason ) )
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" ) we observe a number of cross-correlations that might need attention for further analysis. For example,
lbph, or the amount of benign prostatic hyperplasia amount is highly correlated with the weight of the prostate, lweight.
the Gleason Score assesses the severity of prostate cancer: A Gleason score of 6 is low grade, 7 is intermediate grade, and a score of 8 to 10 is high grade cancer.
We might be interested in using these features to find out what is predictive of gleason: which data features correlate with the Gleason score?
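As a quick, rough first look at that question (a sketch, not from the text), we can rank the numeric features by their correlation with gleason:
#correlation of every feature with the Gleason score, sorted from most negative to most positive
round( sort( cor( prostate )[ , 'gleason' ] ), 2 )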
1.4 The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.
data( sat )
glimpse( sat )## Rows: 50
## Columns: 7
## $ expend <dbl> 4.405, 8.963, 4.778, 4.459, 4.992, 5.443, 8.817, 7.030, 5.718, …
## $ ratio <dbl> 17.2, 17.6, 19.3, 17.1, 24.0, 18.4, 14.4, 16.6, 19.1, 16.3, 17.…
## $ salary <dbl> 31.144, 47.951, 32.175, 28.934, 41.078, 34.571, 50.045, 39.076,…
## $ takers <int> 8, 47, 27, 6, 45, 29, 81, 68, 48, 65, 57, 15, 13, 58, 5, 9, 11,…
## $ verbal <int> 491, 445, 448, 482, 417, 462, 431, 429, 420, 406, 407, 468, 488…
## $ math <int> 538, 489, 496, 523, 485, 518, 477, 468, 469, 448, 482, 511, 560…
## $ total <int> 1029, 934, 944, 1005, 902, 980, 908, 897, 889, 854, 889, 979, 1…
observe the general properties of each variable
summary( sat )## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
generate a matrix plot ordered by degree of correlation:
dta <- sat
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )Plot the trends for sat scores as a function of expenditure:
sat_scores <- sat %>%
dplyr::select( expend, verbal, math ) %>%
pivot_longer( cols = c( verbal, math ),
names_to = 'testcat',
values_to = 'score' )
ggplot( sat_scores, aes( x = expend, y = score, color = testcat ) ) +
geom_point( size = 2 ) +
ylab( 'score' )
This plot suggests that there is a negative relationship between the expenditures of a state and the average sat scores for both math and verbal.
sat_scores %>% group_by( testcat ) %>%
do( exscor = broom::tidy( lm( score ~ expend, data = .))) %>%
unnest( exscor )## # A tibble: 4 x 6
## testcat term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 math (Intercept) 570. 24.2 23.6 4.88e-28
## 2 math expend -10.3 3.99 -2.58 1.29e- 2
## 3 verbal (Intercept) 520. 20.6 25.2 2.30e-29
## 4 verbal expend -10.6 3.40 -3.11 3.10e- 3
look at the total (combined) scores:
sat_total_lm <- sat %>%
do( totalsat = broom::tidy( lm( total ~ expend, data = . ) ) ) %>%
unnest( totalsat )
sat_total_lm## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1089. 44.4 24.5 8.17e-29
## 2 expend -20.9 7.33 -2.85 6.41e- 3
Although the negative relationship is significant, this does not take into account an important confounding variable: the percentage of students that take the exam:
ggplot( sat, aes( y = total, x = takers ) ) +
geom_point( size = 2 )
Perform a multiple linear regression that includes the percentage of people taking the sat:
satmod <- lm( total ~ expend + takers, sat )
summary( satmod )##
## Call:
## lm(formula = total ~ expend + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.884 1.968 19.142 68.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
## expend 12.2865 4.2243 2.909 0.00553 **
## takers -2.8509 0.2151 -13.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
There is a dramatic improvement in the model fit when percentage of sat test takers is included. However, there is definitely room for improving on this model. For instance, the ‘U’ shaped envelope of the relationship between total score and percent test takers hints at a nonlinearity. Additionally, there are more data features in the set that could be explored.
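One way to probe the suspected nonlinearity, offered only as a sketch (satmod_quad is a name introduced here, not from the text), is to add a quadratic term for takers:
#sketch: allow a curved relationship between total score and the percentage of test takers
satmod_quad <- lm( total ~ expend + takers + I( takers^2 ), data = sat )
sumary( satmod_quad )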
1.5 The dataset ‘divusa’ contains data on divorces in the United States from 1920 to 1996. Make a numerical and graphical summary of the data as in the first question.
data( divusa )
glimpse( divusa )## Rows: 77
## Columns: 7
## $ year <int> 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929,…
## $ divorce <dbl> 8.0, 7.2, 6.6, 7.1, 7.2, 7.2, 7.5, 7.8, 7.8, 8.0, 7.5, 7.1,…
## $ unemployed <dbl> 5.2, 11.7, 6.7, 2.4, 5.0, 3.2, 1.8, 3.3, 4.2, 3.2, 8.7, 15.…
## $ femlab <dbl> 22.70, 22.79, 22.88, 22.97, 23.06, 23.15, 23.24, 23.33, 23.…
## $ marriage <dbl> 92.0, 83.0, 79.7, 85.2, 80.3, 79.2, 78.7, 77.0, 74.1, 75.5,…
## $ birth <dbl> 117.9, 119.8, 111.2, 110.5, 110.9, 106.6, 102.6, 99.8, 93.8…
## $ military <dbl> 3.2247, 3.5614, 2.4553, 2.2065, 2.2889, 2.1735, 2.1073, 2.0…
inspect the summary stats:
summary( divusa )## year divorce unemployed femlab
## Min. :1920 Min. : 6.10 Min. : 1.200 Min. :22.70
## 1st Qu.:1939 1st Qu.: 8.70 1st Qu.: 4.200 1st Qu.:27.47
## Median :1958 Median :10.60 Median : 5.600 Median :37.10
## Mean :1958 Mean :13.27 Mean : 7.173 Mean :38.58
## 3rd Qu.:1977 3rd Qu.:20.30 3rd Qu.: 7.500 3rd Qu.:47.80
## Max. :1996 Max. :22.80 Max. :24.900 Max. :59.30
## marriage birth military
## Min. : 49.70 Min. : 65.30 Min. : 1.940
## 1st Qu.: 61.90 1st Qu.: 68.90 1st Qu.: 3.469
## Median : 74.10 Median : 85.90 Median : 9.102
## Mean : 72.97 Mean : 88.89 Mean :12.365
## 3rd Qu.: 80.00 3rd Qu.:107.30 3rd Qu.:14.266
## Max. :118.10 Max. :122.90 Max. :86.641
Use facet_wrap() to look at how each of the features varies with time (year):
divusa_long <- divusa %>%
pivot_longer( cols = c( divorce, unemployed, femlab, marriage, birth, military ), names_to = 'cat', values_to = 'val' )
ggplot( aes( x = year, y = val ), data = divusa_long ) +
geom_line() +
#geom_point() +
facet_wrap( ~ cat )
There are a few trends we see that occur over time:
- the increase in percentage of women in the workplace
- spikes in military/femlab/divorce/marriage coinciding with WW2
- a peak in unemployment corresponding with the Great Depression
and now to view pairwise comparisons:
dta <- divusa
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" ) There are several relationships that look interesting to explore further:
- the percentage of women in the workplace has a positive correlation with the divorce rate and a negative relationship with the marriage rate.
- the birth rate has a similar relationship: positively correlated with marriage and negatively correlated to divorce
- Not immediately clear from the previous figure: divorce is positively correlated with time whereas marriage is negatively correlated with time
Estimation
Linear Model
Suppose we want to model a response \(Y\) by three feature variables \(X_1\), \(X_2\), and \(X_3\). A general form to represent this is: \[Y = f(X_1,X_2,X_3) + \epsilon\] Typically, we will never know the true \(f\). For linear models, we make the assumption that the parameters enter linearly in predicting the response: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon\] Linear models seem restrictive, but predictors can be transformed and combined so that linear models can be applied in many situations. ‘Linear model’ implies simplicity; in practice, however, linear models and the data they describe may be quite complex.
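To illustrate that transformed predictors still give a linear model (linear in the \(\beta\)s), here is a minimal sketch with simulated data; the variable names and the true coefficients are made up for the example:
#simulate a response that is curved in x but linear in the parameters
set.seed( 1 )
x <- runif( 100, 1, 10 )
y <- 1 + 2 * log( x ) + 0.5 * x^2 + rnorm( 100 )
#still a linear model: lm() estimates the betas for the transformed terms
curved_mod <- lm( y ~ log( x ) + I( x^2 ) )
coef( curved_mod )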
Where do linear models come from?
- Physical Theory may suggest a model (Hooke’s Law)
- Experience with past data
- No prior data exists (linear model used as a starting point)
A good model is like a map that guides us to our destination
Estimating
We would like to choose \(\beta\) such that the systematic component of our model explains as much of the response as possible, where:
| Data = | Systematic Structure | + | Random Variation |
|---|---|---|---|
| n dimensions = | p dimensions | + | (n-p) dimensions |
The difference between the actual response and the predicted response is the residual. The conceptual purpose of the linear model is to represent, as accurately as possible, complex n-dimensional data with something comparatively simple: a model with just p terms/coefficients.
Least Squares Estimation
The random variation in the residuals lies in the n-p dimensional space; the aim of fitting a model is to minimize the residuals.
n-p = degrees of freedom
RSS = Residual Sum of Squares
estimate of \(\sigma^2\): \(\hat{\sigma}^2 = \frac{RSS}{n-p}\)
An Example from the Galapagos
data( gala )
glimpse( gala )## Rows: 30
## Columns: 7
## $ Species <dbl> 58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40, 347, …
## $ Endemics <dbl> 23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19, 89, 23, …
## $ Area <dbl> 25.09, 1.24, 0.21, 0.10, 0.05, 0.34, 0.08, 2.33, 0.03, 0.18,…
## $ Elevation <dbl> 346, 109, 114, 46, 77, 119, 93, 168, 71, 112, 198, 1494, 49,…
## $ Nearest <dbl> 0.6, 0.6, 2.8, 1.9, 1.9, 8.0, 6.0, 34.1, 0.4, 2.6, 1.1, 4.3,…
## $ Scruz <dbl> 0.6, 26.3, 58.7, 47.4, 1.9, 8.0, 12.0, 290.2, 0.4, 50.2, 88.…
## $ Adjacent <dbl> 1.84, 572.33, 0.78, 0.18, 903.82, 1.84, 0.34, 2.85, 17.95, 0…
fit a linear model using lm()
gala_mod <- lm( Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala )
gala_mod_sum <- summary( gala_mod )
gala_mod_sum ##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
the abridged output alternative from the faraway library:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
Let’s directly calculate some quantities of interest:
#extract the feature matrix and the response variable:
fmat <- model.matrix( ~ Area + Elevation + Nearest + Scruz + Adjacent, gala )
r <- gala$Species
construct \((X^TX)^{-1}\):
xtxi <- solve( t( fmat ) %*% fmat )
get \(\hat{\beta}\) by calculating \((X^TX)^{-1}X^Ty\):
xtxi %*% t( fmat ) %*% r## [,1]
## (Intercept) 7.068220709
## Area -0.023938338
## Elevation 0.319464761
## Nearest 0.009143961
## Scruz -0.240524230
## Adjacent -0.074804832
A more efficient and accurate way to get \(\hat{\beta}\):
solve( crossprod( fmat, fmat ), crossprod( fmat, r ) )## [,1]
## (Intercept) 7.068220709
## Area -0.023938338
## Elevation 0.319464761
## Nearest 0.009143961
## Scruz -0.240524230
## Adjacent -0.074804832
compare with the lm() results:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
We can extract a lot of information from the lm structure:
residuals( gala_mod ) #extract the residuals## Baltra Bartolome Caldwell Champion Coamano Daphne.Major
## -58.725946 38.273154 -26.330659 14.635734 38.383916 -25.087705
## Daphne.Minor Darwin Eden Enderby Espanola Fernandina
## -9.919668 19.018992 -20.314202 -28.785943 49.343513 -3.989598
## Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
## 62.033276 -59.633796 40.497176 -39.403558 -37.694540 -2.037233
## Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
## -111.679486 -42.475375 -23.075807 -5.553122 73.048122 -40.676318
## SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
## 182.583587 -23.376486 89.383371 -5.805095 -36.935732 -5.700573
fitted( gala_mod ) #extract the model predictions## Baltra Bartolome Caldwell Champion Coamano Daphne.Major
## 116.7259460 -7.2731544 29.3306594 10.3642660 -36.3839155 43.0877052
## Daphne.Minor Darwin Eden Enderby Espanola Fernandina
## 33.9196678 -9.0189919 28.3142017 30.7859425 47.6564865 96.9895982
## Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
## -4.0332759 64.6337956 -0.4971756 386.4035578 88.6945404 4.0372328
## Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
## 215.6794862 150.4753750 35.0758066 75.5531221 206.9518779 277.6763183
## SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
## 261.4164131 85.3764857 195.6166286 49.8050946 52.9357316 26.7005735
df.residual( gala_mod ) #extract the degrees of freedom## [1] 24
deviance( gala_mod ) #gives the RSS## [1] 89231.37
coef( gala_mod ) #returns the coefficient estimates## (Intercept) Area Elevation Nearest Scruz Adjacent
## 7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
estimate \(\sigma\):
sqrt( deviance( gala_mod ) / df.residual( gala_mod ) )## [1] 60.97519
gala_mod_sum$sigma## [1] 60.97519
compute the standard error for the coefficients
xtxi <- gala_mod_sum$cov.unscaled
SE <- sqrt( diag( xtxi ) ) * gala_mod_sum$sigma
SE## (Intercept) Area Elevation Nearest Scruz Adjacent
## 19.15419782 0.02242235 0.05366280 1.05413595 0.21540225 0.01770019
or extract the SE
gala_mod_sum$coef[,2]## (Intercept) Area Elevation Nearest Scruz Adjacent
## 19.15419782 0.02242235 0.05366280 1.05413595 0.21540225 0.01770019
There are great reasons to use Least Squares:
- It makes sense geometrically (makes an orthogonal projection onto model space)
- If the errors are independent and identically normally distributed, it is the maximum likelihood estimate
- The Gauss-Markov theorem states that \(\hat{\beta}\) is the best linear unbiased estimate (BLUE)
However, there are some situations where Least Squares might not be the best:
- When errors are correlated or have unequal variance
- When the error distribution is long-tailed
- When the predictors are collinear
Goodness of Fit
a measure of how well the model fits the data.
\(R^2\) or the coefficient of determination or the percentage of variance explained: \[R^2 = 1 - \frac{\mbox{RSS}}{\mbox{Total SS(Corrected for Mean)}} = \mbox{cor}^2 (\hat{y},y)\] It is a mistake to rely on \(R^2\) as the sole measure of goodness of fit.
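As a check on the formula, here is a sketch computing \(R^2\) for the Galapagos model (gala_mod, fit in the example above) both ways; the two values should agree:
#R^2 from the definition: 1 - RSS/TSS, and as the squared correlation of fitted vs. observed values
rss_gala <- deviance( gala_mod )
tss_gala <- sum( ( gala$Species - mean( gala$Species ) )^2 )
1 - rss_gala / tss_gala
cor( fitted( gala_mod ), gala$Species )^2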
op <- par(mfrow=c(1,4),mar=c (0,0,2,3), oma = c(5, 4.2, 0, 0))
with(anscombe, plot(x1, y1, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x1, y1),2)))))); abline(3,0.5)
with(anscombe, plot(x2, y2, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x2, y2),2)))))); abline(3,0.5)
with(anscombe, plot(x3, y3, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x3, y3),2)))))); abline(3,0.5)
with(anscombe, plot(x4, y4, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x4, y4),2)))))); abline(3,0.5)
mtext(expression(italic(y[1])),side=1, outer = TRUE, line = 3)
mtext(expression(italic(y[2])),side=2, outer = TRUE, line = 2.6)
mtext("(a)",side=3, at = -42, line = .5)
mtext("(b)",side=3, at = -26, line = .5)
mtext("(c)",side=3, at = -10.3, line = .5)
mtext("(d)",side=3, at = 5.5, line = .5)par(op)
# }All 4 of these distributions result in the same lm slope with the same \(R^2\) value.
\(\hat{\sigma}\) is another measure returned by the lm() summary that is in the same units as the response variable.
Identifiability
Unidentifiability occurs when \(X\) is not of full rank, e.g., when some columns are linear combinations of one another.
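A minimal sketch of what this looks like in practice (the redundant predictor here is invented for illustration): adding a column that is an exact linear combination of others makes \(X\) rank-deficient, and lm() returns NA for the aliased coefficient.
#I( Area + Elevation ) is an exact linear combination of the other two predictors
gala_redundant <- lm( Species ~ Area + Elevation + I( Area + Elevation ), data = gala )
coef( gala_redundant ) #the aliased term is reported as NA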
Orthogonality
data( odor )
glimpse( odor )## Rows: 15
## Columns: 4
## $ odor <dbl> 66, 39, 43, 49, 58, 17, -5, -40, 65, 7, 43, -22, -31, -35, -26
## $ temp <dbl> -1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0
## $ gas <dbl> -1, -1, 1, 1, 0, 0, 0, 0, -1, 1, -1, 1, 0, 0, 0
## $ pack <dbl> 0, 0, 0, 0, -1, -1, 1, 1, -1, -1, 1, 1, 0, 0, 0
compute the covariance:
cov( odor[,-1] )## temp gas pack
## temp 0.5714286 0.0000000 0.0000000
## gas 0.0000000 0.5714286 0.0000000
## pack 0.0000000 0.0000000 0.5714286
odor_mod <- lm( odor ~ temp + gas + pack, odor )
summary( odor_mod, cor = T )##
## Call:
## lm(formula = odor ~ temp + gas + pack, data = odor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.200 -17.138 1.175 20.300 62.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.200 9.298 1.635 0.130
## temp -12.125 12.732 -0.952 0.361
## gas -17.000 12.732 -1.335 0.209
## pack -21.375 12.732 -1.679 0.121
##
## Residual standard error: 36.01 on 11 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.1519
## F-statistic: 1.836 on 3 and 11 DF, p-value: 0.1989
##
## Correlation of Coefficients:
## (Intercept) temp gas
## temp 0.00
## gas 0.00 0.00
## pack 0.00 0.00 0.00
The correlations between the coefficient estimates are all 0: these feature variables are orthogonal, so each estimate is unaffected by the presence of the others.
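Because the design is orthogonal, dropping one predictor should leave the remaining coefficient estimates unchanged (their standard errors will change, since \(\hat{\sigma}\) and the degrees of freedom change). A quick sketch to verify, using the odor data and sumary() as above:
#refit without temp; the gas and pack estimates should match the full model
odor_mod_nt <- lm( odor ~ gas + pack, data = odor )
sumary( odor_mod_nt )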
Exercises
2.4
The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the \(R^2\) . Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a time. For each model record the residual standard error and the \(R^2\). Plot the trends in these two statistics.
data( prostate, package = 'faraway' )
glimpse( prostate )## Rows: 97
## Columns: 9
## $ lcavol <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
## $ lweight <dbl> 2.7695, 3.3196, 2.6912, 3.2828, 3.4324, 3.2288, 3.4735, 3.5395…
## $ age <int> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
## $ lbph <dbl> -1.386294, -1.386294, -1.386294, -1.386294, -1.386294, -1.3862…
## $ svi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lcp <dbl> -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1…
## $ gleason <int> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
## $ pgg45 <int> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
## $ lpsa <dbl> -0.43078, -0.16252, -0.16252, -0.16252, 0.37156, 0.76547, 0.76…
Fit the models and keep track of the residual standard error and \(R^2\):
formula <- ''
formulas <- c()
res_std_err <- c()
r_sqrd <- c()
feature_order <- c( 'lcavol', 'lweight', 'svi', 'lbph', 'age', 'lcp', 'pgg45', 'gleason' )
for (i in 1:length( feature_order )) {
if (i == 1) {
formula <- paste0('lpsa', " ~ ", feature_order[i] )
}
else {
formula <- paste0(formula, " + ", feature_order[i] )
}
mod_sum <- summary( lm( formula, data = prostate ) )
res_std_err[[i]] <- mod_sum$sigma
r_sqrd[[i]] <- mod_sum$r.squared
#print( formula )
}
mods_df <- data.frame( 'n_terms' = seq( 1,8 ),
'sigma' = unlist(res_std_err, recursive = FALSE),
'r_sqrd' = unlist(r_sqrd, recursive = FALSE) )
glimpse( mods_df )## Rows: 8
## Columns: 3
## $ n_terms <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ sigma <dbl> 0.7874994, 0.7506469, 0.7168094, 0.7108232, 0.7073054, 0.71021…
## $ r_sqrd <dbl> 0.5394319, 0.5859345, 0.6264403, 0.6366035, 0.6441024, 0.64511…
visualize the trends of the summary statistics:
sigma_plot <- ggplot( mods_df, aes( x = n_terms, y = sigma ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_line() +
xlab( 'Number of features added to lm()' ) +
ylab( 'sigma' ) +
theme_classic()
r_sqrd_plot <- ggplot( mods_df, aes( x = n_terms, y = r_sqrd ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_line() +
xlab( 'Number of features added to lm()' ) +
ylab( 'R^2' ) +
theme_classic()
grid.arrange( sigma_plot, r_sqrd_plot, ncol = 2 )
2.5
Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect?
m <- lm(lcavol ~ lpsa, prostate)
m2 <- lm(lpsa ~ lcavol, prostate)
plot1 <- ggplot(prostate, aes(lcavol, lpsa)) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3) +
geom_line(aes(x = predict(m), color = "lcavol ~ lpsa")) +
geom_line(aes(y = predict(m2), color = "lpsa ~ lcavol")) +
theme_classic()
plot1
find the intersection of the two lines:
m_aug <- augment( m )
m2_aug <- augment( m2 )
P3 <- c( m2_aug$lcavol[1], m2_aug$.fitted[1] )
P4 <- c( m2_aug$lcavol[length( m2_aug$lcavol )], m2_aug$.fitted[length( m2_aug$lcavol )] )
P1 <- c( m_aug$.fitted[1], m_aug$lpsa[1] )
P2 <- c( m_aug$.fitted[length( m_aug$lcavol )], m_aug$lpsa[length( m_aug$lcavol )] )
intersect <- line.line.intersection(P1, P2, P3, P4, interior.only = TRUE)
int_df <- data.frame( 'X' = intersect[1], 'Y' = intersect[2] )
int_df## X Y
## 1 1.35001 2.478387
visually confirm the intersection
coords <- paste( '(', round(int_df$X,2), ',', round(int_df$Y,2), ')' )
plot1 + annotate(geom="point", x=int_df$X, y=int_df$Y,
color="red") +
annotate(geom="text", x=2, y=2, label=coords,
color="red") +
ggtitle( 'intersection of lm() model fits' )
Inference
Hypothesis Tests to Compare Models
F-statistic: a statistic that compares the residual sum of squares of a larger model \(\Omega\) with that of a smaller (null) model \(\omega\).
\[F = \frac{ \frac{(RSS_{\omega}-RSS_{\Omega})}{p-q}}{\frac{ RSS_{\Omega}}{n-p}} \sim F_{p-q,n-p}\] We would reject the null hypothesis if \(F\gt F_{p-q,n-p}^{(\alpha)}\), where \(p\) is the number of parameters in \(\Omega\) and \(q\) is the number of parameters in \(\omega\). The degrees of freedom of a model are typically the number of observations minus the number of parameters, so F can also be written as: \[F = \frac{ \frac{(RSS_{\omega}-RSS_{\Omega})}{df_{\omega}-df_{\Omega}}}{\frac{ RSS_{\Omega}}{df_{\Omega}}}\]
Are any of the Feature Variables Useful in Predicting the Response?
We do not know whether all predictors are required to predict a response or just some. The F-statistic can help determine this.
Demonstrate this by revisiting the data from the Galapagos:
gala_mod_sum##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
model the Null Hypothesis and perform an ANOVA:
gala_nullmod <- lm( Species ~ 1, gala )
anova( gala_nullmod, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 381081
## 2 24 89231 5 291850 15.699 6.838e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Directly computing the F-statistic and its p-value:
rss0 <- deviance( gala_nullmod )
rss <- deviance( gala_mod )
df0 <- df.residual( gala_nullmod )
df <- df.residual( gala_mod )
fstat <- ((rss0-rss)/(df0-df)/(rss/df))
fstat## [1] 15.69941
1 - pf( fstat, df0-df, df )## [1] 6.837893e-07
Can 1 particular feature be dropped from a model?
Let \(\Omega\) be the full model.
Let \(\omega\) be the full model minus the suspect feature.
Test whether the feature can be dropped by looking at the F-statistic. The null hypothesis is that \(H_0 : \beta_i = 0\)
gala_mod_noArea <- lm( Species ~ Elevation + Nearest + Scruz + Adjacent, gala )
anova( gala_mod_noArea, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 93469
## 2 24 89231 1 4237.7 1.1398 0.2963
The p-value of 0.296 indicates that the null hypothesis cannot be rejected, so Area can be dropped from the model.
Testing a pair of predictors
Fit a model without the pair & build the F-test:
gala_mod_noAreas <- lm( Species ~ Elevation + Nearest + Scruz, gala )
anova( gala_mod_noAreas, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ Elevation + Nearest + Scruz
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 158292
## 2 24 89231 2 69060 9.2874 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the F-test is statistically significant, suggesting that we can reject the null hypothesis. The rejection indicates a difference in the amount of residual error accounted for by the two models; therefore, removing the pair of feature variables cannot be justified.
Testing a Subspace
- Can we combine two features? Use the I() ('as-is') function to test linear combinations of feature variables.
- Can a feature's coefficient be set to a particular value? Use the offset() function. (Sketches of both ideas follow this list.)
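Hedged sketches of both ideas using the Galapagos model above; combining Area with Adjacent and fixing the Elevation coefficient at 0.5 are illustrative choices, not values from the text:
#test whether Area and Adjacent can be replaced by their sum
gala_mod_comb <- lm( Species ~ I( Area + Adjacent ) + Elevation + Nearest + Scruz, data = gala )
anova( gala_mod_comb, gala_mod )
#test whether the Elevation coefficient can be fixed at 0.5 using offset()
gala_mod_fix <- lm( Species ~ Area + offset( 0.5 * Elevation ) + Nearest + Scruz + Adjacent, data = gala )
anova( gala_mod_fix, gala_mod )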
What can’t we use an F-test to evaluate?
- Non-linear relationships. ex: \(H_0 : \beta_j \beta_k = 1\)
- Comparisons where the null model is not nested in the full model (the null model needs to lie in the feature subspace of the full model)
Permutation Tests: when you don’t want to assume normality
Permutation test: what is the chance of an F-statistic larger than the one we observed? We could compute this exactly by computing the F-statistic for every permutation of the response variable and seeing what proportion exceed the observed value.
gala_mod_2 <- lm( Species ~ Nearest + Scruz, gala )
gala_mod_2_sum <- summary( gala_mod_2 )
gala_mod_2_sum$fstatistic## value numdf dendf
## 0.6019558 2.0000000 27.0000000
1 - pf( gala_mod_2_sum$fstatistic[1],
gala_mod_2_sum$fstatistic[2],
gala_mod_2_sum$fstatistic[3])## value
## 0.5549255
#anova( gala_mod_noAreas, gala_mod )
Use the sample() function to permute the response and repeat this 4000 times:
nreps <- 4000
fstats <- numeric( nreps )
for (i in 1:nreps ) {
lmsum <- summary( lm( sample( Species ) ~ Nearest + Scruz, gala ) )
fstats[i] <- lmsum$fstat[1]
}
mean( fstats > gala_mod_2_sum$fstatistic[1] )## [1] 0.5585
Sampling
We regard our data as a simple random sample drawn from a finite population.
Confidence Intervals for \(\beta\)
consider our galapagos model:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
construct 95% CI for \(\beta_{area}\)
tdist <- qt( 0.975, 30-6 ) #t-dist w/ 30-6 degrees of freedom
c( -0.023938 - tdist * 0.022422, -0.023938 + tdist * 0.022422)## [1] -0.07021473 0.02233873
CIs have a duality with the 2-sided hypothesis tests: if the interval contains zero, this indicates that we cannot reject the null hypothesis. \(\beta_{area}\)’s CI contains 0, therefore we cannot reject the null hypothesis.
c( -0.074805 - tdist * 0.0177, -0.074805 + tdist * 0.0177)## [1] -0.111336 -0.038274
\(\beta_{adjacent}\)’s CI does not, so we can reject the null hypothesis.
We can quickly find the CI for all feature variables in the model:
confint( gala_mod )## 2.5 % 97.5 %
## (Intercept) -32.4641006 46.60054205
## Area -0.0702158 0.02233912
## Elevation 0.2087102 0.43021935
## Nearest -2.1664857 2.18477363
## Scruz -0.6850926 0.20404416
## Adjacent -0.1113362 -0.03827344
Confidence Regions:
plot( ellipse( gala_mod, c( 2,6 ) ), type = 'l', ylim = c(-0.13,0 ) ) #Confidence Ellipse
points( coef( gala_mod )[2], coef( gala_mod )[6], pch=19 ) #least squares estimate
abline( v = confint( gala_mod)[2,], lty = 2 ) #confidence intervals
abline( h = confint( gala_mod)[6,], lty = 2 )
We can reject the null hypothesis because the origin does not lie within the ellipse.
Bootstrapped Confidence Intervals
F & t based CIs are convenient, however, they rely on the assumption of normality. Boostrapped CI’s are good workaround.
Simulation
Sample from a known error distribution and compute the estimate. Repeat these steps many times and estimate the sampling distribution of \(\hat{\beta}\) using the empirical distribution of the generated/simulated \(\hat{\beta}\) (a sketch follows the list):
- Generate \(\epsilon\) from the known error distribution
- Form \(y = X\beta + \epsilon\) from the known \(\beta\) and fixed \(X\).
- Compute \(\hat{\beta}\).
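A minimal sketch of the simulation idea, treating the fitted Galapagos model as if it were the truth and assuming normal errors with standard deviation \(\hat{\sigma}\) (both assumptions are purely for illustration):
#simulate new responses from the fitted model and refit to get simulated coefficient estimates
nsim <- 1000
simbetas <- matrix( NA, nsim, length( coef( gala_mod ) ) )
colnames( simbetas ) <- names( coef( gala_mod ) )
for ( i in 1:nsim ) {
  ysim <- fitted( gala_mod ) + rnorm( nrow( gala ), sd = gala_mod_sum$sigma )
  simbetas[ i, ] <- coef( update( gala_mod, ysim ~ . ) )
}
#empirical 95% intervals for each coefficient
apply( simbetas, 2, quantile, c( 0.025, 0.975 ) )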
Bootstrap
Instead of sampling from the true error distribution (which we do not know), sample with replacement from the observed residuals.
Bootstrapping mirrors simulation, but works with the data on hand without making assumptions about its distribution:
- Generate \(\epsilon^*\) by sampling with replacement from \(\hat{\epsilon_1}, \ldots,\hat{\epsilon_n}\)
- Form \(y^* = X\hat{\beta} + \epsilon^*\)
- Compute \(\hat{\beta}^*\) from \((X,y^*)\)
num_boots <- 4000
coefmat <- matrix( NA, num_boots, 6 )
resids <- residuals( gala_mod )
preds <- fitted( gala_mod )
for ( i in 1:num_boots ) {
booty <- preds + sample( resids, rep = TRUE )
bmod <- update( gala_mod, booty ~ . )
coefmat[ i, ] <- coef( bmod )
}
colnames( coefmat ) <- c( "Intercept", colnames( gala[, 3:7 ] ) )
coefdf <- data.frame( coefmat )
CIs <- apply( coefdf, 2, function( x ) quantile( x, c( 0.025, 0.975 ) ) )
CIs## Intercept Area Elevation Nearest Scruz Adjacent
## 2.5% -24.13428 -0.06362569 0.2303649 -1.705391 -0.6025777 -0.10519819
## 97.5% 41.65439 0.02119342 0.4202839 2.114584 0.1750448 -0.04002093
let’s remind ourselves of the CIs computed with normal distribution theory:
confint( gala_mod )## 2.5 % 97.5 %
## (Intercept) -32.4641006 46.60054205
## Area -0.0702158 0.02233912
## Elevation 0.2087102 0.43021935
## Nearest -2.1664857 2.18477363
## Scruz -0.6850926 0.20404416
## Adjacent -0.1113362 -0.03827344
the CIs are not identical but are comparable.
Now take a look at some of the bootstrapped distributions and the CIs:
CIs## Intercept Area Elevation Nearest Scruz Adjacent
## 2.5% -24.13428 -0.06362569 0.2303649 -1.705391 -0.6025777 -0.10519819
## 97.5% 41.65439 0.02119342 0.4202839 2.114584 0.1750448 -0.04002093
pA <- ggplot( coefdf, aes( x = Area ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,2 ], CIs[ 2,2 ] ), lty = 2 )
pE <- ggplot( coefdf, aes( x = Elevation ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,3 ], CIs[ 2,3 ] ), lty = 2 )
pN <- ggplot( coefdf, aes( x = Nearest ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,4 ], CIs[ 2,4 ] ), lty = 2 )
pS <- ggplot( coefdf, aes( x = Scruz ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,5 ], CIs[ 2,5 ] ), lty = 2 )
pAd <- ggplot( coefdf, aes( x = Adjacent ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,6 ], CIs[ 2,6 ] ), lty = 2 )
grid.arrange( pA, pE, pN, pS, pAd )
The distributions are all roughly normal. However, this is not always the case.
Exercise 3.4
Using sat data:
- Fit a model with total sat score as the response and expend, ratio and salary as predictors. Test the hypothesis that \(\beta_{salary}\) = 0. Test the hypothesis that \(\beta_{salary}\) = \(\beta_{ratio}\) = \(\beta_{expend}\) = 0. Do any of these predictors have an effect on the response?
sat_sub <- sat %>% dplyr::select( c( total, expend, ratio, salary, takers ) )
dta <- sat_sub
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )sat_mod <- lm( total ~ expend + ratio + salary, data = sat )
sumary( sat_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1069.2342 110.9249 9.6393 1.292e-12
## expend 16.4689 22.0499 0.7469 0.45893
## ratio 6.3303 6.5421 0.9676 0.33829
## salary -8.8226 4.6968 -1.8784 0.06667
##
## n = 50, p = 4, Residual SE = 68.65350, R-Squared = 0.21
confint( sat_mod )## 2.5 % 97.5 %
## (Intercept) 845.95385 1292.5144911
## expend -27.91528 60.8530112
## ratio -6.83820 19.4987339
## salary -18.27679 0.6315232
We cannot reject the null hypotheses: these predictors all include 0 within their confidence intervals and do not have a statistically significant effect on the response.
- Now add takers to the model. Test the hypothesis that \(\beta_{takers}\) = 0. Compare this model to the previous one using an F-test. Demonstrate that the F-test and t-test here are equivalent.
sat_mod2 <- lm( total ~ expend + ratio + salary + takers, data = sat )
sum2 <- sumary( sat_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.97154 52.86976 19.7839 < 2.2e-16
## expend 4.46259 10.54653 0.4231 0.6742
## ratio -3.62423 3.21542 -1.1271 0.2657
## salary 1.63792 2.38725 0.6861 0.4962
## takers -2.90448 0.23126 -12.5594 2.607e-16
##
## n = 50, p = 5, Residual SE = 32.70199, R-Squared = 0.82
sum2##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
confint( sat_mod2)## 2.5 % 97.5 %
## (Intercept) 939.486374 1152.456698
## expend -16.779204 25.704393
## ratio -10.100417 2.851952
## salary -3.170247 6.446081
## takers -3.370262 -2.438699
We can reject the null hypothesis that \(\beta_{takers} = 0\), because \(\beta_{takers}\)’s CI does not span 0.
sat_anova <- anova( sat_mod, sat_mod2 )
sat_anova## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 45 48124 1 168688 157.74 2.607e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value indicates that we can reject the null hypothesis and conclude that adding takers to the model leads to a statistically significant improvement in the model's ability to describe the variance in the data. Additionally, we can see that the F-test from the ANOVA is equivalent to the t-test for takers in the summary of the linear model:
sat_anova$`Pr(>F)`[2] == sum2$coefficients[5,4]## [1] TRUE
Prediction
Prediction is one of the main uses for regression models. However, a point estimate is not the end of the story. It is as important to consider the uncertainty of the estimate so that we may understand the range of expected outcomes. Projections are not useful without a realistic estimate of uncertainty because we need to make sensible plans for when events do not turn out as well as predicted.
Confidence Intervals for Prediction
- Prediction of a Mean Response: Confidence Interval
- \(\hat{y}_0 \pm t_{n-p}^{\frac{\alpha}{2}}\hat{\sigma} \sqrt{x_0^T(X^TX)^{-1}x_0}\)
- There is a 95% chance that the mean value falls within this CI
- Prediction of a Future Observations: Prediction Interval
- \(\hat{y}_0 \pm t_{n-p}^{\frac{\alpha}{2}}\hat{\sigma} \sqrt{1 + x_0^T(X^TX)^{-1}x_0}\)
- There is a 95% chance that the future value falls within this PI
- Extrapolation: Predicting an estimate for values that lie outside of the given data range
Example: Body Fat
load body fat data:
data( fat )
#glimpse( fat )
fatmod <- lm( brozek ~ age + weight + height +
neck + chest + abdom + hip + thigh +
knee + ankle + biceps + forearm +
wrist, data = fat )
summary( fatmod )##
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest +
## abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,
## data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.264 -2.572 -0.097 2.898 9.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.29255 16.06992 -0.952 0.34225
## age 0.05679 0.02996 1.895 0.05929 .
## weight -0.08031 0.04958 -1.620 0.10660
## height -0.06460 0.08893 -0.726 0.46830
## neck -0.43754 0.21533 -2.032 0.04327 *
## chest -0.02360 0.09184 -0.257 0.79740
## abdom 0.88543 0.08008 11.057 < 2e-16 ***
## hip -0.19842 0.13516 -1.468 0.14341
## thigh 0.23190 0.13372 1.734 0.08418 .
## knee -0.01168 0.22414 -0.052 0.95850
## ankle 0.16354 0.20514 0.797 0.42614
## biceps 0.15280 0.15851 0.964 0.33605
## forearm 0.43049 0.18445 2.334 0.02044 *
## wrist -1.47654 0.49552 -2.980 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
## F-statistic: 54.63 on 13 and 238 DF, p-value: < 2.2e-16
Consider a ‘typical’ man. Below we compute the median of each feature (x0) and also the mean (meanMan); the predictions that follow use the mean values.
x <- model.matrix( fatmod )
x0 <- apply( x, 2, median )
x0## (Intercept) age weight height neck chest
## 1.00 43.00 176.50 70.00 38.00 99.65
## abdom hip thigh knee ankle biceps
## 90.95 99.30 59.00 38.50 22.80 32.05
## forearm wrist
## 28.70 18.30
fat_select <- fat %>%
dplyr::select( c( age, weight, height, neck, chest, abdom, hip, thigh, knee, ankle, biceps, forearm, wrist ) )
meanMan <- colMeans( fat_select )
meanMan## age weight height neck chest abdom hip thigh
## 44.88492 178.92440 70.14881 37.99206 100.82421 92.55595 99.90476 59.40595
## knee ankle biceps forearm wrist
## 38.59048 23.10238 32.27341 28.66389 18.22976
if we are predicting the expected value and PI for an individual with mean features:
predict( fatmod, newdata = data.frame( t( meanMan ) ), interval = 'prediction' )## fit lwr upr
## 1 18.93849 11.06669 26.8103
if we are predicting the mean body fat for all men that have mean characteristics
predict( fatmod, newdata = data.frame( t( meanMan ) ), interval = 'confidence' )## fit lwr upr
## 1 18.93849 18.4436 19.43339
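As a sanity check, the interval formulas above can be computed by hand; a minimal sketch reusing fatmod and meanMan (the helper objects x0m, xtxi, fit0, se_m and se_p are new names):
x0m  <- c( 1, meanMan )                              # prepend 1 for the intercept
xtxi <- solve( crossprod( model.matrix( fatmod ) ) ) # (X'X)^-1
fit0 <- sum( x0m * coef( fatmod ) )
tval <- qt( 0.975, df.residual( fatmod ) )
sig  <- summary( fatmod )$sigma
se_m <- sig * sqrt( t( x0m ) %*% xtxi %*% x0m )      # SE of the mean response
se_p <- sig * sqrt( 1 + t( x0m ) %*% xtxi %*% x0m )  # SE of a future observation
c( fit0 - tval * se_m, fit0 + tval * se_m )          # should match interval = 'confidence'
c( fit0 - tval * se_p, fit0 + tval * se_p )          # should match interval = 'prediction'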
Extrapolation
the further the predictor values are from the original data, the more uncertainty there is in the estimates.
Compare the intervals calculated below with those for the mean ‘typical’ body metrics:
extremeFat <- apply( x, 2, function( x ) quantile( x, 0.95 ) )
predict( fatmod, newdata = data.frame( t( extremeFat ) ), interval = 'prediction' )## fit lwr upr
## 1 30.01804 21.92407 38.11202
predict( fatmod, newdata = data.frame( t( extremeFat ) ), interval = 'confidence' )## fit lwr upr
## 1 30.01804 28.07072 31.96537
Autoregression
Consider the trends in this data on monthly airline passengers:
data( airpass, package = 'faraway' )
AirlinePlot <- ggplot( airpass, aes( x = year, y = pass ) ) +
geom_line() +
ylab( 'Log(Passengers)' ) +
theme_classic() +
ggtitle( 'Airline Passengers', subtitle = 'with an obviously inappropriate linear regression fit')
AirlinePlot + geom_smooth( method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
The linear fit above captures the increase in passengers over the years, but does nothing to capture the seasonal variation in the data.
Autoregressive model: the response depends on past values of the response, so built into the model are feature variables that describe previous responses. For example, suppose we would like to predict next month's ridership. Autoregressive ‘lagged’ variables can be incorporated into the model to describe the number of passengers in previous months:
\[y_t = \beta_0 + \beta_1y_{t-1} + \beta_{12}y_{t-12} + \beta_{13}y_{t-13} + \epsilon_t \]
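The embed() call below builds the lagged design matrix; a toy illustration (not the airline data) of how it orders the lags:
embed( 1:6, 3 )   # a 4 x 3 matrix: row 1 is c( 3, 2, 1 ), row 2 is c( 4, 3, 2 ), ...
                  # column 1 holds the current value; later columns hold earlier lags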
lagdf <- embed( log( airpass$pass ), 14 )
colnames( lagdf ) <- c( 'y', paste0( 'lag',1:13 ) )
lagdf <- data.frame( lagdf )
#build our linear model:
armod <- lm( y ~ lag1 + lag12 + lag13, lagdf )
summary( armod )##
## Call:
## lm(formula = y ~ lag1 + lag12 + lag13, data = lagdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.111173 -0.024849 -0.002443 0.025373 0.134024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13848 0.05361 2.583 0.0109 *
## lag1 0.69231 0.06186 11.192 < 2e-16 ***
## lag12 0.92152 0.03473 26.532 < 2e-16 ***
## lag13 -0.63214 0.06768 -9.340 4.16e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04164 on 127 degrees of freedom
## Multiple R-squared: 0.9892, Adjusted R-squared: 0.989
## F-statistic: 3892 on 3 and 127 DF, p-value: < 2.2e-16
visualize the autoregressive model fit:
airpass_lag <- airpass[14:144,]
airpass_lag[ 'predict' ] <- exp( predict( armod ) )
AirlinePlot +
geom_line( aes( x = year, y = predict, color = 'red' ), data = airpass_lag, show.legend = FALSE )predicting future values: use the last observation in the data as the lag one value:
#last value of the data
last_row_df <- lagdf[ nrow( lagdf ), ]
# the current response variable ('y') will become lag1 and everything else shifts over 1:
new_data <- data.frame( lag1 = last_row_df$y, lag12 = last_row_df$lag11, lag13 = last_row_df$lag12 )
#prediction:
next_month_res <- predict( armod, newdata = new_data, interval = 'prediction' )
paste( 'The Autoregression Mod predicts', round( exp( next_month_res[1] ),0),
'passengers with a 95% prediction interval from', round( exp( next_month_res[2] ),0),
'to', round( exp( next_month_res[3] ),0))## [1] "The Autoregression Mod predicts 448 passengers with a 95% prediction interval from 412 to 487"
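Forecasts further ahead can be rolled forward by feeding each prediction back in as lag1 and shifting the other lags accordingly (a sketch reusing armod and last_row_df; next_log1 and next_log2 are new names):
next_log1 <- predict( armod, newdata = data.frame( lag1  = last_row_df$y,
                                                   lag12 = last_row_df$lag11,
                                                   lag13 = last_row_df$lag12 ) )
next_log2 <- predict( armod, newdata = data.frame( lag1  = next_log1,
                                                   lag12 = last_row_df$lag10,
                                                   lag13 = last_row_df$lag11 ) )
exp( c( next_log1, next_log2 ) )   # point forecasts one and two months ahead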
What can go wrong with predictions?
- Bad Models: If it doesn’t fit, don’t apply it.
- Quantitative Extrapolation: We try to predict outcomes for cases with predictor values much different from what we saw in the data
- Qualitative Extrapolation: We try to predict outcomes for observations that come from different populations
- Overconfidence due to overfitting: the model is tuned to the training data and is not robust to new data
- Black swans: might not have enough data to properly describe the distribution
Exercise 4.5
#the full model
#fatmod
sumary( fatmod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.292549 16.069921 -0.9516 0.342252
## age 0.056786 0.029965 1.8951 0.059290
## weight -0.080310 0.049581 -1.6198 0.106602
## height -0.064600 0.088930 -0.7264 0.468299
## neck -0.437541 0.215334 -2.0319 0.043273
## chest -0.023603 0.091839 -0.2570 0.797396
## abdom 0.885429 0.080077 11.0572 < 2.2e-16
## hip -0.198419 0.135156 -1.4681 0.143406
## thigh 0.231895 0.133718 1.7342 0.084175
## knee -0.011677 0.224143 -0.0521 0.958496
## ankle 0.163536 0.205143 0.7972 0.426142
## biceps 0.152799 0.158513 0.9640 0.336048
## forearm 0.430489 0.184452 2.3339 0.020436
## wrist -1.476537 0.495519 -2.9798 0.003183
##
## n = 252, p = 14, Residual SE = 3.98797, R-Squared = 0.75
#short model
fatmod_4 <- lm( brozek ~ age + weight + height + abdom, fat )
sumary( fatmod_4 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.7696359 6.5419024 -5.0092 1.042e-06
## age -0.0070513 0.0243416 -0.2897 0.7723
## weight -0.1237218 0.0250455 -4.9399 1.442e-06
## height -0.1166940 0.0827269 -1.4106 0.1596
## abdom 0.8897041 0.0672672 13.2264 < 2.2e-16
##
## n = 252, p = 5, Residual SE = 4.12607, R-Squared = 0.72
- Is it justifiable to use the smaller model?
anova( fatmod, fatmod_4 )## Analysis of Variance Table
##
## Model 1: brozek ~ age + weight + height + neck + chest + abdom + hip +
## thigh + knee + ankle + biceps + forearm + wrist
## Model 2: brozek ~ age + weight + height + abdom
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 238 3785.1
## 2 247 4205.0 -9 -419.9 2.9336 0.002558 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a statistically significant difference between the amount of variance accounted for by the two models. The smaller model does not describe the data as well.
- Compute a 95% prediction interval for median predictor values and compare to the results to the interval for the full model. Do the intervals differ by a practically important amount?
medvals <- model.matrix( fatmod_4 )
medvals0 <- apply( medvals, 2, median )
#95% prediction interval for the median predictor values
predict( fatmod_4, newdata = data.frame( t( medvals0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.84028 9.696631 25.98392
#CI for the full model
predict( fatmod, newdata = data.frame( t( x0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.49322 9.61783 25.36861
No, the intervals don’t seem to differ by an appreciable amount. So, in moving to the smaller, simpler model, we don’t seem to lose a lot of predictive power.
- For the smaller model, examine all the observations from case numbers 25 to 50. Which two observations seem particularly anomalous?
fatmod_4_aug <- augment( fatmod_4 )
fatmod_4_aug_sub <- fatmod_4_aug[25:50,]
#fatmod_4_aug_sub
ggplot( aes( x = .fitted ), data = fatmod_4_aug_sub ) + geom_dotplot()## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
#plot the model predictors for observation 25:50
fat[ 25:50, ] %>%
dplyr::select( c( age, weight, height, abdom ) ) %>%
pivot_longer( cols = everything(), names_to = 'predictor', values_to = 'val' ) %>%
ggplot( aes( x = predictor, y = val ) ) +
geom_boxplot() +
ggtitle( 'Observations 25:50: Predictive Features' )outliers <- function(dataframe){
dataframe %>%
dplyr::select_if(is.numeric) %>%
map(~ boxplot.stats(.x)$out)
}
fatsub <- fat[25:50,]
fat4 <- fatsub %>%
dplyr::select( c( age, weight, height, abdom ) )
fat4out <- outliers( fat4 )
fat4out## $age
## integer(0)
##
## $weight
## [1] 363.15
##
## $height
## [1] 29.5
##
## $abdom
## [1] 148.1
from the output above, find the observation indices. Note that 148.1 is an abdom value belonging to the same record as the 363.15 weight, so only two observations (39 and 42) are anomalous:
#observations 39 and 42 (the abdom outlier 148.1 also belongs to 39)
fat %>% filter( weight == c( 363.15 ) )## brozek siri density age weight height adipos free neck chest abdom hip
## 39 33.8 35.2 1.0202 46 363.15 72.25 48.9 240.5 51.2 136.2 148.1 147.7
## thigh knee ankle biceps forearm wrist
## 39 87.3 49.1 29.6 45 29 21.4
fat %>% filter( height == c( 29.5 ) )## brozek siri density age weight height adipos free neck chest abdom hip
## 42 31.7 32.9 1.025 44 205 29.5 29.9 140.1 36.6 106 104.3 115.5
## thigh knee ankle biceps forearm wrist
## 42 70.6 42.5 23.7 33.6 28.7 17.4
fat %>% filter( weight == c( 148.1 ) )## [1] brozek siri density age weight height adipos free neck
## [10] chest abdom hip thigh knee ankle biceps forearm wrist
## <0 rows> (or 0-length row.names)
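A more direct way to recover the row indices is to match the boxplot outlier values back against the data with which() (a sketch using fat4out from above):
which( fat$weight %in% fat4out$weight )   # should point to observation 39
which( fat$height %in% fat4out$height )   # should point to observation 42
which( fat$abdom  %in% fat4out$abdom )    # 148.1 is an abdom value, again observation 39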
- Recompute the 95% prediction interval for median predictor values after these two anomalous cases have been excluded from the data. Did this make much difference to the outcome?
remove the two offending records: 39 & 42
drop <- c( 39, 42 )
fat_remove2 <- fat[ -drop, ]
fatmod_remove2_4 <- lm( brozek ~ age + weight + height + abdom, fat_remove2 )
medvals2 <- model.matrix( fatmod_remove2_4 )
medvals02 <- apply( medvals2, 2, median )
#95% prediction interval for the median predictor values
predict( fatmod_remove2_4, newdata = data.frame( t( medvals02 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.9033 9.887851 25.91874
predict( fatmod_4, newdata = data.frame( t( medvals0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.84028 9.696631 25.98392
The prediction interval narrows only slightly when the two anomalous observations are removed, so excluding them makes little practical difference to this prediction.
Explanation
Explanation concerns the relationships between variables. It can include causal conclusions, but these require more rigor to establish.
Simple Meaning
A simple interpretation of model coefficients: a unit increase in \(x_1\), with the other predictors held constant, will produce a change of \(\hat{\beta_1}\) in the response variable \(y\). This is conceptually difficult: in reality there is usually no way to hold the other variables constant while changing \(x_1\), and it is unlikely that the model captures all of the variables contributing to the system. Furthermore, this interpretation carries no notion of causation.
Establishing Causality
causality is determined by experimentally manipulating a variable. Consider a drug trial test with a placebo group and a test group. We can describe the causal effect for subject \(i\) as: \[\delta_i = y_i^{\mbox{test}}-y_i^{\mbox{control}}\]
Designed Experiments
Randomization is key to success
- Randomization is the only reliable way to make sure that the two groups are not unbalanced in some way that favors one experimental group.
- Randomization can be used to validate assumptions. (e.g. permutation tests used to test the significance of differences between groups)
On Generalization: The results of randomized experiments apply to the subjects of the experiment unless we can reasonably claim that these subjects are representative of a larger population.
On Blocking: There may be ways that subjects differ from one another that are identifiable and that can be worked into the experiment. (e.g. Gender).
Observational Data
Cannot always collect data in a designed experiment. Consider this dataset from the Democratic primary of 2008:
data( newhamp )
glimpse( newhamp )## Rows: 276
## Columns: 12
## $ votesys <fct> D, D, D, H, D, D, D, D, D, D, D, D, D, D, D, D, H, H, H, D,…
## $ Obama <int> 371, 345, 375, 92, 668, 284, 195, 216, 203, 187, 159, 209, …
## $ Clinton <int> 362, 333, 570, 89, 595, 273, 205, 184, 193, 188, 214, 233, …
## $ dem <int> 979, 913, 1305, 268, 1633, 761, 530, 532, 517, 518, 494, 57…
## $ povrate <dbl> 0.0653, 0.0380, 0.0428, 0.0669, 0.0332, 0.0586, 0.0886, 0.0…
## $ pci <int> 25940, 19773, 19986, 25627, 32667, 23163, 19540, 19540, 195…
## $ Dean <dbl> 0.27820, 0.24398, 0.20096, 0.28495, 0.24937, 0.30503, 0.209…
## $ Kerry <dbl> 0.32030, 0.36747, 0.41627, 0.33333, 0.37781, 0.39341, 0.365…
## $ white <dbl> 0.98312, 0.97349, 0.96739, 0.97892, 0.97986, 0.98301, 0.961…
## $ absentee <dbl> 0.059857, 0.050449, 0.043649, 0.107356, 0.074706, 0.053191,…
## $ population <dbl> 4693.0, 4266.0, 7006.0, 1033.0, 7033.0, 3222.0, 2792.5, 279…
## $ pObama <dbl> 0.3789581, 0.3778751, 0.2873563, 0.3432836, 0.4090631, 0.37…
look at the difference in counts for Digital vs Hand cast ballots for Obama vs Clinton
newhamp %>% group_by( votesys ) %>%
summarise( 'Obama' = sum( Obama ),
'Clinton' = sum( Clinton ) ) %>%
mutate( 'Obama_gt_percent' = (Obama - Clinton)/(Obama + Clinton)*100 )## # A tibble: 2 x 4
## votesys Obama Clinton Obama_gt_percent
## * <fct> <int> <int> <dbl>
## 1 D 86353 96890 -5.75
## 2 H 16926 14471 7.82
We can see that for digital ballots, Clinton received ~6% more votes than Obama whereas for paper ballots Obama received about 8% more of the vote.
Is this difference in votes based on voting system significant?
Here we fit a linear model to proportion of votes for Obama ~ Hand voting.
First we create a dummy variable for whether the record is for hand votes:
newhamp_trt <- newhamp %>%
mutate( treatment = case_when( votesys == 'H' ~ 1,
votesys == 'D' ~ 0 ) )
newhamp_trt %>% do( tidy( lm( pObama ~ treatment, . ) ) )## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.353 0.00517 68.1 7.76e-174
## 2 treatment 0.0425 0.00851 4.99 1.06e- 6
The p-value for the treatment group suggests that there is a significant difference between the proportion of vote for Obama given the voting system used. However, were there other variables involved? Did the voting system have some causal effect on the outcome?
In other words, is there a confounding variable?
glimpse( newhamp )## Rows: 276
## Columns: 12
## $ votesys <fct> D, D, D, H, D, D, D, D, D, D, D, D, D, D, D, D, H, H, H, D,…
## $ Obama <int> 371, 345, 375, 92, 668, 284, 195, 216, 203, 187, 159, 209, …
## $ Clinton <int> 362, 333, 570, 89, 595, 273, 205, 184, 193, 188, 214, 233, …
## $ dem <int> 979, 913, 1305, 268, 1633, 761, 530, 532, 517, 518, 494, 57…
## $ povrate <dbl> 0.0653, 0.0380, 0.0428, 0.0669, 0.0332, 0.0586, 0.0886, 0.0…
## $ pci <int> 25940, 19773, 19986, 25627, 32667, 23163, 19540, 19540, 195…
## $ Dean <dbl> 0.27820, 0.24398, 0.20096, 0.28495, 0.24937, 0.30503, 0.209…
## $ Kerry <dbl> 0.32030, 0.36747, 0.41627, 0.33333, 0.37781, 0.39341, 0.365…
## $ white <dbl> 0.98312, 0.97349, 0.96739, 0.97892, 0.97986, 0.98301, 0.961…
## $ absentee <dbl> 0.059857, 0.050449, 0.043649, 0.107356, 0.074706, 0.053191,…
## $ population <dbl> 4693.0, 4266.0, 7006.0, 1033.0, 7033.0, 3222.0, 2792.5, 279…
## $ pObama <dbl> 0.3789581, 0.3778751, 0.2873563, 0.3432836, 0.4090631, 0.37…
newhamp_trt %>%
do( tidy( lm( pObama ~ treatment + Dean + Kerry + povrate + white, . ) ) )## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.596 0.132 4.53 8.97e- 6
## 2 treatment -0.00683 0.00806 -0.848 3.97e- 1
## 3 Dean 0.392 0.0614 6.38 7.54e-10
## 4 Kerry -0.269 0.0858 -3.14 1.90e- 3
## 5 povrate -0.223 0.0947 -2.35 1.94e- 2
## 6 white -0.231 0.129 -1.80 7.36e- 2
If we include the other variables, we see that our treatment variable is no longer statistically significant for predicting the proportion of votes for Obama.
Matching
Matching constructs pairs whose two members are as alike as possible with respect to the confounders.
we will use the Matching library for this:
set.seed( 123 )
matches <- GenMatch( newhamp_trt$treatment, newhamp_trt$Dean,
ties = FALSE, caliper = 0.05, pop.size = 1000 )## Loading required namespace: rgenoud
##
##
## Fri Mar 26 20:28:54 2021
## Domains:
## 0.000000e+00 <= X1 <= 1.000000e+03
##
## Data Type: Floating Point
## Operators (code number, name, population)
## (1) Cloning........................... 122
## (2) Uniform Mutation.................. 125
## (3) Boundary Mutation................. 125
## (4) Non-Uniform Mutation.............. 125
## (5) Polytope Crossover................ 125
## (6) Simple Crossover.................. 126
## (7) Whole Non-Uniform Mutation........ 125
## (8) Heuristic Crossover............... 126
## (9) Local-Minimum Crossover........... 0
##
## SOFT Maximum Number of Generations: 100
## Maximum Nonchanging Generations: 4
## Population size : 1000
## Convergence Tolerance: 1.000000e-03
##
## Not Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
## Not Checking Gradients before Stopping.
## Using Out of Bounds Individuals.
##
## Maximization Problem.
## GENERATION: 0 (initializing the population)
## Lexical Fit..... 2.127495e-01 9.987660e-01
## #unique......... 1000, #Total UniqueCount: 1000
## var 1:
## best............ 1.000000e+00
## mean............ 5.006683e+02
## variance........ 8.388390e+04
##
## GENERATION: 1
## Lexical Fit..... 6.264505e-01 9.987660e-01
## #unique......... 604, #Total UniqueCount: 1604
## var 1:
## best............ 2.255547e-02
## mean............ 2.280858e+02
## variance........ 7.904885e+04
##
## GENERATION: 2
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 602, #Total UniqueCount: 2206
## var 1:
## best............ 2.702493e-02
## mean............ 7.389477e+01
## variance........ 3.118565e+04
##
## GENERATION: 3
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 604, #Total UniqueCount: 2810
## var 1:
## best............ 2.702493e-02
## mean............ 7.042592e+01
## variance........ 2.858671e+04
##
## GENERATION: 4
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 415, #Total UniqueCount: 3225
## var 1:
## best............ 2.702493e-02
## mean............ 6.569943e+01
## variance........ 2.806356e+04
##
## GENERATION: 5
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 425, #Total UniqueCount: 3650
## var 1:
## best............ 2.702493e-02
## mean............ 8.035088e+01
## variance........ 3.395814e+04
##
## GENERATION: 6
## Lexical Fit..... 9.947238e-01 9.987660e-01
## #unique......... 405, #Total UniqueCount: 4055
## var 1:
## best............ 2.604222e-03
## mean............ 6.399581e+01
## variance........ 2.788729e+04
##
## GENERATION: 7
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 595, #Total UniqueCount: 4650
## var 1:
## best............ 2.408582e-03
## mean............ 7.120129e+01
## variance........ 3.239014e+04
##
## GENERATION: 8
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 596, #Total UniqueCount: 5246
## var 1:
## best............ 2.408582e-03
## mean............ 7.760128e+01
## variance........ 3.490217e+04
##
## GENERATION: 9
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 403, #Total UniqueCount: 5649
## var 1:
## best............ 2.408582e-03
## mean............ 6.762662e+01
## variance........ 2.926614e+04
##
## GENERATION: 10
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 429, #Total UniqueCount: 6078
## var 1:
## best............ 2.408582e-03
## mean............ 7.201190e+01
## variance........ 2.786938e+04
##
## GENERATION: 11
## Lexical Fit..... 9.987660e-01 9.990051e-01
## #unique......... 423, #Total UniqueCount: 6501
## var 1:
## best............ 2.336515e-03
## mean............ 6.935256e+01
## variance........ 3.131574e+04
##
## GENERATION: 12
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 612, #Total UniqueCount: 7113
## var 1:
## best............ 2.364199e-03
## mean............ 6.593385e+01
## variance........ 2.785792e+04
##
## GENERATION: 13
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 592, #Total UniqueCount: 7705
## var 1:
## best............ 2.364199e-03
## mean............ 6.576213e+01
## variance........ 2.685578e+04
##
## GENERATION: 14
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 417, #Total UniqueCount: 8122
## var 1:
## best............ 2.364199e-03
## mean............ 6.042472e+01
## variance........ 2.583306e+04
##
## GENERATION: 15
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 423, #Total UniqueCount: 8545
## var 1:
## best............ 2.364199e-03
## mean............ 6.067741e+01
## variance........ 2.497988e+04
##
## GENERATION: 16
## Lexical Fit..... 9.987660e-01 9.993674e-01
## #unique......... 437, #Total UniqueCount: 8982
## var 1:
## best............ 7.518945e-04
## mean............ 6.819509e+01
## variance........ 2.980902e+04
##
## 'wait.generations' limit reached.
## No significant improvement in 4 generations.
##
## Solution Lexical Fitness Value:
## 9.987660e-01 9.993674e-01
##
## Parameters at the Solution:
##
## X[ 1] : 7.518945e-04
##
## Solution Found Generation 11
## Number of Generations Run 16
##
## Fri Mar 26 20:29:09 2021
## Total run time : 0 hours 0 minutes and 15 seconds
head( matches$matches[ ,1:2 ] )## [,1] [,2]
## [1,] 4 213
## [2,] 17 148
## [3,] 18 6
## [4,] 19 83
## [5,] 21 246
## [6,] 22 185
The output above shows some of the matched pairs: each pairs one ward that voted by hand/paper with one that voted digitally, matched on their proportion of votes for Howard Dean in the previous election.
Show all the matches when plotted pObama ~ Dean
plot( pObama ~ Dean, newhamp_trt, pch = treatment + 1 )
with( newhamp_trt, segments( Dean[ matches$matches[ ,1 ] ],
pObama[ matches$matches[ ,1 ] ],
Dean[ matches$matches[ ,2 ] ],
pObama[ matches$matches[ ,2 ] ] ) ) From the figure above, we can see that the matches are reasonably well made for values of
Dean
Is there any indication of a bias for voting system within the matched pairs?
Here we compute the difference between pairs
pdiff <- newhamp_trt$pObama[ matches$matches[ ,1 ] ] -
newhamp_trt$pObama[ matches$matches[ ,2 ] ]
t.test( pdiff )##
## One Sample t-test
##
## data: pdiff
## t = -1.8798, df = 86, p-value = 0.06352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.0328560272 0.0009183504
## sample estimates:
## mean of x
## -0.01596884
plot( pdiff ~ newhamp_trt$Dean[ matches$matches[ ,1 ] ],
xlab = 'Dean', ylab = 'Hand - Digital' )
abline( h = 0 ) The matched pairs show no clear preference for hand vs digital voting. The observed difference appears to arise because voters inclined to pick Obama are also more likely to live in hand-voting wards.
Covariate Adjustment, or Controlling for the covariate
Compare univariate and covariate outcomes for digital vs hand voting systems
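The intercepts and slopes hard-coded in the abline() calls below presumably come from a covariate-adjusted fit on Dean; a sketch of the model they appear to correspond to (covmod is a new name):
covmod <- lm( pObama ~ Dean + treatment, newhamp_trt )
coef( covmod )   # intercept and Dean slope used below; the treatment coefficient
                 # gives the small vertical shift between the two fitted lines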
plot( pObama ~ Dean, newhamp_trt, pch = treatment + 1 )
abline( h = c( 0.353, 0.353 + 0.042 ), lty = 1:2 )
abline( 0.221, 0.5229 )
abline( 0.221 - 0.005, 0.5229, lty = 2 )
with( newhamp_trt, segments( Dean[ matches$matches[ ,1 ] ],
pObama[ matches$matches[ ,1 ] ],
Dean[ matches$matches[ ,2 ] ],
pObama[ matches$matches[ ,2 ] ] ) )Qualitative Support for Causation
- Strength - large effects
- Consistency - similar effects found in other studies
- Specificity - the association is specific to the treatment and outcome of interest
- Temporality - the cause precedes the effect
- Gradient - a dose-response (roughly linear) relationship
- Plausibility - does it make sense?
- Experiment - supporting experimental evidence, where available
Exercise 5.2
Use the odor dataset with odor as the response and temp as a predictor. Consider all possible models that also include all, some or none of the other two predictors. Report the coefficient for temperature, its standard error, t-statistic and p-value in each case. Discuss what stays the same, what changes and why. Which model is best?
data( odor )
glimpse( odor )## Rows: 15
## Columns: 4
## $ odor <dbl> 66, 39, 43, 49, 58, 17, -5, -40, 65, 7, 43, -22, -31, -35, -26
## $ temp <dbl> -1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0
## $ gas <dbl> -1, -1, 1, 1, 0, 0, 0, 0, -1, 1, -1, 1, 0, 0, 0
## $ pack <dbl> 0, 0, 0, 0, -1, -1, 1, 1, -1, -1, 1, 1, 0, 0, 0
dta <- odor
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )mod_stem <- 'odor ~ temp'
mod_set <- c( '+ gas', '+ pack', '+ gas + pack' )
mod <- c()
mod_tempc <- c()
mod_se <- c()
mod_t <- c()
mod_p <- c()
for ( m in 1:length( mod_set ) ) {
mod_str <- paste( mod_stem, mod_set[ m ] )
mod[ m ] <- mod_str
modfitsum <- summary( lm( mod_str, odor ) )
mod_tempc[ m ] <- modfitsum$coefficients[ 2,1 ]
mod_se[ m ] <- modfitsum$coefficients[ 2,2 ]
mod_t[ m ] <- modfitsum$coefficients[ 2,3 ]
mod_p[ m ] <- modfitsum$coefficients[ 2,4 ]
}
modfitsum##
## Call:
## lm(formula = mod_str, data = odor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.200 -17.138 1.175 20.300 62.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.200 9.298 1.635 0.130
## temp -12.125 12.732 -0.952 0.361
## gas -17.000 12.732 -1.335 0.209
## pack -21.375 12.732 -1.679 0.121
##
## Residual standard error: 36.01 on 11 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.1519
## F-statistic: 1.836 on 3 and 11 DF, p-value: 0.1989
odor_mods_df <- data.frame( 'mod' = mod,
'temp_coef' = mod_tempc,
'temp_se' = mod_se,
'temp_t' = mod_t,
'temp_pval' = mod_p )
glimpse( odor_mods_df )## Rows: 3
## Columns: 5
## $ mod <chr> "odor ~ temp + gas", "odor ~ temp + pack", "odor ~ temp + ga…
## $ temp_coef <dbl> -12.125, -12.125, -12.125
## $ temp_se <dbl> 13.66271, 13.14072, 12.73201
## $ temp_t <dbl> -0.8874523, -0.9227041, -0.9523244
## $ temp_pval <dbl> 0.3922754, 0.3743416, 0.3613900
The temperature coefficient (-12.125) is identical in every fit (and would also be -12.125 in the temp-only model, which the loop omits) because the predictors are orthogonal by design; only its standard error, t-statistic and p-value change. Temperature is not statistically significant in any of the models, and adding gas and/or pack does not change that conclusion. The standard error of the temperature coefficient shrinks slightly as other terms are added (13.66 with gas, 13.14 with pack, 12.73 with both), so the t-statistic grows slightly in magnitude when both are included. Since no term reaches significance, the simplest model, odor ~ temp, is arguably the best choice here.
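A quick check that the three predictors really are pairwise orthogonal (an added sketch, not part of the exercise):
round( cor( odor[ , c( 'temp', 'gas', 'pack' ) ] ), 2 )   # all off-diagonal correlations are 0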
Categorical Predictors
Categorical features / predictors - qualitative as opposed to quantitative values
A 2-level factor
data( sexab )
glimpse( sexab )## Rows: 76
## Columns: 3
## $ cpa <dbl> 2.04786, 0.83895, -0.24139, -1.11461, 2.01468, 6.71131, 1.20814, …
## $ ptsd <dbl> 9.71365, 6.16933, 15.15926, 11.31277, 9.95384, 9.83884, 5.98491, …
## $ csa <fct> Abused, Abused, Abused, Abused, Abused, Abused, Abused, Abused, A…
by( sexab, sexab$csa, summary )## sexab$csa: Abused
## cpa ptsd csa
## Min. :-1.115 Min. : 5.985 Abused :45
## 1st Qu.: 1.415 1st Qu.: 9.374 NotAbused: 0
## Median : 2.627 Median :11.313
## Mean : 3.075 Mean :11.941
## 3rd Qu.: 4.317 3rd Qu.:14.901
## Max. : 8.647 Max. :18.993
## ------------------------------------------------------------
## sexab$csa: NotAbused
## cpa ptsd csa
## Min. :-3.1204 Min. :-3.349 Abused : 0
## 1st Qu.:-0.2299 1st Qu.: 3.544 NotAbused:31
## Median : 1.3216 Median : 5.794
## Mean : 1.3088 Mean : 4.696
## 3rd Qu.: 2.8309 3rd Qu.: 6.838
## Max. : 5.0497 Max. :10.914
boxp <- ggplot( sexab, aes( x = csa, y = ptsd ) ) +
geom_boxplot()
scatp <- ggplot( sexab, aes( x = cpa, y = ptsd, color = csa ) ) +
geom_point( show.legend = FALSE )
grid.arrange( boxp, scatp, ncol = 2 ) Those in the abused group have higher PTSD scores than the not-abused group.
run a t-test to evaluate the statistical significance of this
t.test( ptsd ~ csa, sexab, var.equal = TRUE )##
## Two Sample t-test
##
## data: ptsd by csa
## t = 8.9387, df = 74, p-value = 2.172e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.630165 8.860273
## sample estimates:
## mean in group Abused mean in group NotAbused
## 11.941093 4.695874
use dummy variables to recode the 2-level feature variable
sexab %>% mutate( dummy = case_when( csa == 'Abused' ~ 1,
csa == 'NotAbused' ~ 0 ) ) %>%
do( tidy( lm( ptsd ~ dummy, . ) ) )## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.70 0.624 7.53 1.00e-10
## 2 dummy 7.25 0.811 8.94 2.17e-13
smod <- lm( ptsd ~ csa, sexab )
summary( smod )##
## Call:
## lm(formula = ptsd ~ csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0451 -2.3123 0.0951 2.1645 7.0514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.9411 0.5177 23.067 < 2e-16 ***
## csaNotAbused -7.2452 0.8105 -8.939 2.17e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 74 degrees of freedom
## Multiple R-squared: 0.5192, Adjusted R-squared: 0.5127
## F-statistic: 79.9 on 1 and 74 DF, p-value: 2.172e-13
The intercept (11.9) is the mean response for the reference level (Abused); the csaNotAbused estimate is the difference in means between the two levels.
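A quick check that these coefficients are just the group means (an added sketch):
with( sexab, tapply( ptsd, csa, mean ) )   # Abused ~ 11.94 (the intercept), NotAbused ~ 4.70;
                                           # the difference of about -7.25 is the csaNotAbused estimate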
If it is more intuitive the other way around, we can re-reference (relevel) the factor:
sexab$csa <- relevel( sexab$csa, ref = 'NotAbused' )
lmod <- lm( ptsd ~ csa, sexab )
summary( lmod )##
## Call:
## lm(formula = ptsd ~ csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0451 -2.3123 0.0951 2.1645 7.0514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6959 0.6237 7.529 1.00e-10 ***
## csaAbused 7.2452 0.8105 8.939 2.17e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 74 degrees of freedom
## Multiple R-squared: 0.5192, Adjusted R-squared: 0.5127
## F-statistic: 79.9 on 1 and 74 DF, p-value: 2.172e-13
Factors and Quantitative Predictors
We can build models that express how a factor variable and a quantitative variable are jointly related to a response. Consider extending the sexual abuse model to include the physical abuse term cpa and its interaction with csa:
sexmod1 <- lm( ptsd ~ cpa + csa + cpa:csa, sexab )
summary( sexmod1 )##
## Call:
## lm(formula = ptsd ~ cpa + csa + cpa:csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1999 -2.5313 -0.1807 2.7744 6.9748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6959 0.7107 5.201 1.79e-06 ***
## cpa 0.7640 0.3038 2.515 0.0142 *
## csaAbused 6.8612 1.0747 6.384 1.48e-08 ***
## cpa:csaAbused -0.3140 0.3685 -0.852 0.3970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.279 on 72 degrees of freedom
## Multiple R-squared: 0.5828, Adjusted R-squared: 0.5654
## F-statistic: 33.53 on 3 and 72 DF, p-value: 1.133e-13
we can see that the interaction term is not significant, so we simplify the model
sexmod2 <- lm( ptsd ~ cpa + csa, sexab )
summary( sexmod2 )##
## Call:
## lm(formula = ptsd ~ cpa + csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1567 -2.3643 -0.1533 2.1466 7.1417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9753 0.6293 6.317 1.87e-08 ***
## cpa 0.5506 0.1716 3.209 0.00198 **
## csaAbused 6.2728 0.8219 7.632 6.91e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.273 on 73 degrees of freedom
## Multiple R-squared: 0.5786, Adjusted R-squared: 0.5671
## F-statistic: 50.12 on 2 and 73 DF, p-value: 2.002e-14
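As a check, comparing the two fits with anova() reproduces the interaction test (an added sketch):
anova( sexmod2, sexmod1 )   # F ~ 0.73, the square of the earlier t = -0.85, with p ~ 0.40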
visualize the data with parallel regression lines
group.colors <- c( Abused = 'red', NotAbused = 'blue')
ggplot( aes( x = cpa, y = ptsd, color = csa ), data = sexab ) +
geom_point() +
geom_abline( slope = sexmod2$coefficients[2],
intercept = sexmod2$coefficients[1], color = 'blue' ) +
geom_abline( slope = sexmod2$coefficients[2],
intercept = sexmod2$coefficients[1] +
sexmod2$coefficients[3], color = 'red' ) +
scale_color_manual(values=group.colors) a simple check of the residuals for model diagnostics
sexmod2_aug <- augment( sexmod2 )
ggplot( aes( x = .fitted, y = .resid, color = csa ), data = sexmod2_aug ) +
geom_point() +
scale_color_manual(values=group.colors) We can see that there is no clear trend in the residuals and the variance of the two groups is approximately the same.
Interpretation with Interaction Terms
data( whiteside )
glimpse( whiteside )## Rows: 56
## Columns: 3
## $ Insul <fct> Before, Before, Before, Before, Before, Before, Before, Before, …
## $ Temp <dbl> -0.8, -0.7, 0.4, 2.5, 2.9, 3.2, 3.6, 3.9, 4.2, 4.3, 5.4, 6.0, 6.…
## $ Gas <dbl> 7.2, 6.9, 6.4, 6.0, 5.8, 5.8, 5.6, 4.7, 5.8, 5.2, 4.9, 4.9, 4.3,…
ggplot( aes( x = Temp, y = Gas ), data = whiteside ) +
geom_point() +
facet_grid( ~ Insul ) +
geom_smooth( method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
gas consumption is temperature dependent both before and after installing the insulation. However, less gas is used after installation.
Explore the interaction of temperature:
whitehouse_mod <- lm( Gas ~ Temp*Insul, whiteside )
summary( whitehouse_mod )##
## Call:
## lm(formula = Gas ~ Temp * Insul, data = whiteside)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97802 -0.18011 0.03757 0.20930 0.63803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.85383 0.13596 50.409 < 2e-16 ***
## Temp -0.39324 0.02249 -17.487 < 2e-16 ***
## InsulAfter -2.12998 0.18009 -11.827 2.32e-16 ***
## Temp:InsulAfter 0.11530 0.03211 3.591 0.000731 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.323 on 52 degrees of freedom
## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9235
## F-statistic: 222.3 on 3 and 52 DF, p-value: < 2.2e-16
From this model we can say that the average consumption before insulation at 0°C was 6.85. This makes interpretation awkward because 0°C sits at the extreme low end of the observed temperature range. We can recenter the temperature predictor to address this:
whiteside$ctemp <- whiteside$Temp - mean( whiteside$Temp )
whiteside_mod <- lm( Gas ~ ctemp * Insul, whiteside )
summary( whiteside_mod )##
## Call:
## lm(formula = Gas ~ ctemp * Insul, data = whiteside)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97802 -0.18011 0.03757 0.20930 0.63803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.93679 0.06424 76.848 < 2e-16 ***
## ctemp -0.39324 0.02249 -17.487 < 2e-16 ***
## InsulAfter -1.56787 0.08771 -17.875 < 2e-16 ***
## ctemp:InsulAfter 0.11530 0.03211 3.591 0.000731 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.323 on 52 degrees of freedom
## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9235
## F-statistic: 222.3 on 3 and 52 DF, p-value: < 2.2e-16
Centering on the mean temperature gives a more natural interpretation of the coefficients: at the average temperature, the average consumption was 4.94 before insulation and 4.94 - 1.57 = 3.37 after installing the insulation.
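A quick check of those two numbers with predict() at the mean temperature (ctemp = 0); this check is an addition to the notes:
predict( whiteside_mod,
         newdata = data.frame( ctemp = 0,
                               Insul = factor( c( 'Before', 'After' ),
                                               levels = levels( whiteside$Insul ) ) ) )
# should return roughly 4.94 and 3.37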
Factors with more than 2 levels
Consider an experiment with 5 different treatment groups:
data( fruitfly )
glimpse( fruitfly )## Rows: 124
## Columns: 3
## $ thorax <dbl> 0.68, 0.68, 0.72, 0.72, 0.76, 0.76, 0.76, 0.76, 0.76, 0.80, …
## $ longevity <int> 37, 49, 46, 63, 39, 46, 56, 63, 65, 56, 65, 70, 63, 65, 70, …
## $ activity <fct> many, many, many, many, many, many, many, many, many, many, …
visualize longevity ~ thorax for all 5 groups
ggplot( aes( x = thorax, y = longevity, color = activity ), data = fruitfly ) +
geom_point() plot as a facet-wrap:
ggplot( aes( x = thorax, y = longevity, color = activity ), data = fruitfly ) +
geom_point() +
facet_wrap( ~ activity ) fit a linear model
fruitfly_mod <- lm( longevity ~ thorax * activity, fruitfly )
sumary( fruitfly_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.24197 21.80118 -2.3046 0.0230
## thorax 136.12676 25.95172 5.2454 7.275e-07
## activityone 6.51716 33.87083 0.1924 0.8478
## activitylow -7.75013 33.96901 -0.2282 0.8199
## activitymany -1.13943 32.52975 -0.0350 0.9721
## activityhigh -11.03803 31.28660 -0.3528 0.7249
## thorax:activityone -4.67713 40.65175 -0.1151 0.9086
## thorax:activitylow 0.87431 40.42531 0.0216 0.9828
## thorax:activitymany 6.54779 39.36002 0.1664 0.8682
## thorax:activityhigh -11.12676 38.11997 -0.2919 0.7709
##
## n = 124, p = 10, Residual SE = 10.71275, R-Squared = 0.65
plot( fruitfly_mod )anova( fruitfly_mod )## Analysis of Variance Table
##
## Response: longevity
## Df Sum Sq Mean Sq F value Pr(>F)
## thorax 1 15003.3 15003.3 130.733 < 2.2e-16 ***
## activity 4 9634.6 2408.6 20.988 5.503e-13 ***
## thorax:activity 4 24.3 6.1 0.053 0.9947
## Residuals 114 13083.0 114.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the interaction term thorax:activity is not significant, so we can refit the model without this term:
fruitfly_mod2 <- lm( longevity ~ thorax + activity, fruitfly )But do we need both thorax and activity? We can test whether each remaining term is needed:
drop1( fruitfly_mod2, test='F' )## Single term deletions
##
## Model:
## longevity ~ thorax + activity
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 13107 589.92
## thorax 1 12368.4 25476 670.32 111.348 < 2.2e-16 ***
## activity 4 9634.6 22742 650.25 21.684 1.974e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1() command tests each term relative to the full model.
sumary( fruitfly_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.7494 10.8500 -4.4930 1.649e-05
## thorax 134.3414 12.7312 10.5522 < 2.2e-16
## activityone 2.6372 2.9839 0.8838 0.37861
## activitylow -7.0149 2.9811 -2.3532 0.02027
## activitymany 4.1387 3.0267 1.3674 0.17410
## activityhigh -20.0037 3.0160 -6.6325 1.048e-09
##
## n = 124, p = 6, Residual SE = 10.53939, R-Squared = 0.65
plot( fruitfly_mod2 )From the 'Residuals vs Fitted' plot, we see some heteroscedasticity, which can be addressed with a log transformation of the response:
fruitfly_mod3 <- lm( log( longevity ) ~ thorax + activity, fruitfly )
plot( fruitfly_mod3 )sumary( fruitfly_mod3 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.844210 0.198818 9.2759 1.036e-15
## thorax 2.721458 0.233289 11.6656 < 2.2e-16
## activityone 0.051744 0.054678 0.9463 0.34591
## activitylow -0.123867 0.054626 -2.2676 0.02518
## activitymany 0.087910 0.055462 1.5850 0.11563
## activityhigh -0.419252 0.055266 -7.5860 8.353e-12
##
## n = 124, p = 6, Residual SE = 0.19313, R-Squared = 0.7
exp( coef(fruitfly_mod3))## (Intercept) thorax activityone activitylow activitymany activityhigh
## 6.3231018 15.2024730 1.0531064 0.8834971 1.0918894 0.6575384
The high sexual activity group was found to have a markedly shorter lifespan: exp(coef) for activityhigh is about 0.66, i.e. roughly a 34% reduction in expected longevity at a given thorax length.
Alternative Codings of Qualitative Predictors
Consider a four-level factor coded with three dummy variables.
Treatment Coding
contr.treatment( 4 )## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
Helmert Coding
contr.helmert( 4 )## [,1] [,2] [,3]
## 1 -1 -1 -1
## 2 1 -1 -1
## 3 0 2 -1
## 4 0 0 3
Sum Coding
contr.sum( 4 )## [,1] [,2] [,3]
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 -1 -1 -1
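To fit a model under one of these alternative codings, the contrasts argument of lm() can be used; a sketch reusing the fruitfly model from above (sum_coded is a new name; the fitted values are unchanged, only the coefficient parameterization differs):
sum_coded <- lm( longevity ~ thorax + activity, fruitfly,
                 contrasts = list( activity = 'contr.sum' ) )
coef( sum_coded )   # compare with the treatment-coded fruitfly_mod2 coefficients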
Exercise 14.2
Using the ‘infmort’ data, find a simple model for the infant mortality in terms of the other variables. Be alert for transformations and unusual points. Interpret your model by explaining what the regression parameter estimates mean.
data( infmort )
glimpse( infmort )## Rows: 105
## Columns: 4
## $ region <fct> Asia, Europe, Europe, Americas, Europe, Europe, Europe, Euro…
## $ income <dbl> 3426, 3350, 3346, 4751, 5029, 3312, 3403, 5040, 2009, 2298, …
## $ mortality <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, …
## $ oil <fct> no oil exports, no oil exports, no oil exports, no oil expor…
region <- ggplot( infmort, aes( x = region, y = mortality ) ) +
geom_boxplot( )
oil <- ggplot( infmort, aes( x = oil, y = mortality ) ) +
geom_boxplot( )
income <- ggplot( infmort, aes( x = log(income), y = mortality, color = region ) ) +
geom_point( )
income2 <- ggplot( infmort, aes( x = log(income), y = mortality, color = oil ) ) +
geom_point( )
grid.arrange( region, oil, income, income2, ncol = 2)## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
perform an AIC-based backward selection
#make our full model to work backwards from:
infmort_full <- lm( mortality ~ log( income ) + oil + region, data = infmort )
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.89 -34.33 -3.33 14.68 497.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 276.479 57.696 4.792 6.09e-06 ***
## log(income) -11.234 8.666 -1.296 0.19799
## oilno oil exports -84.558 29.290 -2.887 0.00482 **
## regionEurope -84.334 33.411 -2.524 0.01326 *
## regionAsia -41.095 20.508 -2.004 0.04793 *
## regionAmericas -72.369 24.011 -3.014 0.00331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.89 on 95 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.283
## F-statistic: 8.894 on 5 and 95 DF, p-value: 5.906e-07
We see from the plots that there are several unusually influential observations. Here we identify them via Cook's distance and remove them from the data:
HighLeverage <- cooks.distance(infmort_full) > (4/nrow(infmort))
infmort[ HighLeverage, ]## region income mortality oil
## Algeria Africa 400 86.3 oil exports
## Iran Asia 1280 NA oil exports
## Iraq Asia 560 28.1 oil exports
## Libya Africa 3010 300.0 oil exports
## Nigeria Africa 220 58.0 oil exports
## South_Vietnam Asia 130 100.0 no oil exports
#remove high leverage points
infmort_filt <- infmort[ !HighLeverage, ]
#refit full model:
infmort_full <- lm( mortality ~ log( income ) + oil + region, data = infmort_filt )
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
infmort_backwards <- step( infmort_full, direction = 'backward', trace = 1 )## Start: AIC=826.6
## mortality ~ log(income) + oil + region
##
## Df Sum of Sq RSS AIC
## <none> 464963 826.60
## - log(income) 1 11261 476225 826.89
## - region 3 65586 530549 833.26
## - oil 1 99247 564210 843.17
summary( infmort_backwards )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
I tried this before filtering the influential points, and backward selection dropped income. It appears several influential countries with high income and oil exports also had high infant mortality rates; once those observations are removed, log(income) is retained by the AIC criterion (though its coefficient is still not individually significant). If income is not modeled on the log scale, it is dropped in the AIC-based selection.
Now trying to drop by F-test:
drop1( infmort_full, test = 'F' )## Single term deletions
##
## Model:
## mortality ~ log(income) + oil + region
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 464963 826.60
## log(income) 1 11261 476225 826.89 2.1798 0.143325
## oil 1 99247 564210 843.17 19.2105 3.163e-05 ***
## region 3 65586 530549 833.26 4.2317 0.007589 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
The intercept gives the model's estimated infant mortality rate for an oil-exporting African country (at log(income) = 0). Holding income and region fixed, a non-oil-exporting country is expected to have an infant mortality rate about 165 lower. Controlling for income and oil-export status, countries in Europe, Asia and the Americas are expected to have lower infant mortality than African countries by the estimated amounts (-86, -47 and -83 respectively).
Diagnostics
Regression Diagnostics: there are several assumptions made when applying linear models, and diagnostic methods are used to determine if those assumptions are met before a model is deployed into the wild. Model building is an iterative process, and model diagnostics can often highlight aspects that need adjustment.
potential issues with model assumptions:
- Error - we assume that errors are independent and normally distributed
- Model - we have assumed that structurally the data supports a linear relationship
- Unusual Observations - it only takes a few extreme outliers to throw off a model
Checking Error Assumptions
Constant Variance
We need to check whether the variance of the residuals is related to some feature of the model. The most useful diagnostic is a plot of the residuals against the fitted values \(\hat{y}\): if all is well, you should see constant, symmetric variation. Heteroscedasticity refers to nonconstant variance (homoscedasticity is constant variance).
an example:
data( savings )
savings_mod <- lm( sr ~ pop15 + pop75 + dpi + ddpi, data = savings )
savings_mod_aug <- augment( savings_mod )
#glimpse( savings_mod_aug )
p1 <- ggplot( savings_mod_aug, aes( x = .fitted, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug, aes( x = .fitted, y = sqrt( abs( .resid ) ) ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
ylab( expression( sqrt( hat( epsilon )))) +
theme_classic()
grid.arrange( p1, p2, ncol = 2 ) Both plots look satisfactory. A numerical check can also be applied: regress \(\sqrt{|\hat{\epsilon}|}\) on the fitted values and look for a significant slope:
sumary( lm( sqrt( abs( savings_mod_aug$.resid)) ~ savings_mod_aug$.fitted ) )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.162163 0.347880 6.2153 1.175e-07
## savings_mod_aug$.fitted -0.061366 0.034756 -1.7656 0.08382
##
## n = 50, p = 2, Residual SE = 0.63415, R-Squared = 0.06
The slope in this regression is not significantly different from zero, and the two visualizations look satisfactory, so the assumption of constant residual variance appears reasonable.
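An alternative formal check is the Breusch-Pagan test from the lmtest package (a sketch; not used in the original notes):
bptest( savings_mod )   # H0: constant error variance; a large p-value is consistent with the plots above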
The following plot shows some simulated distributions to give examples of things to look out for in data
par( mfrow = c( 2, 2 ) )
n <- 500
x <- runif( n )
plot( x, rnorm( n ), main = 'Constant Variance' )
plot( x, x*rnorm( n ), main = 'Strong NonConstant Variance' )
plot( x, sqrt( x )*rnorm( n ), main = 'Mild NonConstant Variance' )
plot( x, cos( x*pi/0.5 ) + rnorm( n, sd = 1 ), main = 'NonLinearity' )now to look at the residuals as a function of a few of the predictor variables:
p1 <- ggplot( savings_mod_aug, aes( x = pop15, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop15' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug, aes( x = pop75, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop75' ) +
theme_classic()
p3 <- ggplot( savings_mod_aug, aes( x = dpi, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'dpi' ) +
theme_classic()
p4 <- ggplot( savings_mod_aug, aes( x = ddpi, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'ddpi' ) +
theme_classic()
grid.arrange( p1, p2, p3, p4, ncol = 2 ) The variance appears constant for pop75, but the residuals for pop15 appear to split into two groups.
Transformations can be applied to even out signs of nonconstant variance. For example, refit the model with square root transformations of dpi and ddpi, then look at the residual plots again:
savings_mod2 <- lm( sr ~ pop15 + pop75 + sqrt(dpi) + sqrt(ddpi), data = savings )
savings_mod_aug2 <- augment( savings_mod2 )
glimpse( savings_mod_aug2 )## Rows: 50
## Columns: 11
## $ .rownames <chr> "Australia", "Austria", "Belgium", "Bolivia", "Brazil", "…
## $ sr <dbl> 11.43, 12.07, 13.17, 5.75, 12.88, 8.79, 0.60, 11.90, 4.98…
## $ pop15 <dbl> 29.35, 23.32, 23.80, 41.89, 42.19, 31.72, 39.74, 44.75, 4…
## $ pop75 <dbl> 2.87, 4.41, 4.43, 1.67, 0.83, 2.85, 1.34, 0.67, 1.06, 1.1…
## $ `sqrt(dpi)` <dbl> 48.266759, 38.832847, 45.918079, 13.752454, 26.990183, 54…
## $ `sqrt(ddpi)` <dbl> 1.6941074, 1.9824228, 1.9544820, 0.4690416, 2.1354157, 1.…
## $ .fitted <dbl> 10.637063, 11.537541, 11.105882, 5.419434, 9.575882, 9.25…
## $ .hat <dbl> 0.07519878, 0.11246708, 0.08712671, 0.16599769, 0.0959366…
## $ .sigma <dbl> 3.821655, 3.822727, 3.809781, 3.823287, 3.787618, 3.82291…
## $ .cooksd <dbl> 0.0007734259, 0.0005663132, 0.0062319919, 0.0003648478, 0…
## $ .std.resid <dbl> 0.21807866, 0.14948337, 0.57138393, 0.09573560, 0.9190834…
p1 <- ggplot( savings_mod_aug2, aes( x = pop15, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop15' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug2, aes( x = pop75, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop75' ) +
theme_classic()
p3 <- ggplot( savings_mod_aug2, aes( x = `sqrt(dpi)`, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'dpi' ) +
theme_classic()
p4 <- ggplot( savings_mod_aug2, aes( x = `sqrt(ddpi)`, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'ddpi' ) +
theme_classic()
grid.arrange( p1, p2, p3, p4, ncol = 2 )That looks a little more even, but not satisfactory.
Normality
Q-Q plots: compare the residuals to the ‘ideal’ normal observations
par(mfrow=c(1,2))
qqnorm( savings_mod_aug$.resid, ylab='Residuals',main='')
qqline( savings_mod_aug$.resid )
hist( savings_mod_aug$.resid, xlab = 'Residuals', main='' ) normal residuals should follow the line approximately. Here, the residuals look normal.
The Shapiro-Wilk test is a formal test for normality:
shapiro.test( savings_mod_aug$.resid )##
## Shapiro-Wilk normality test
##
## data: savings_mod_aug$.resid
## W = 0.98698, p-value = 0.8524
the \(H_0\) is that the residuals are normal. Since the p-value is large, we do not reject the null hypothesis
Finding Unusual Observations
Leverage
leverage points are extreme in the predictor space.
A half-normal plot can identify unusually large leverage values:
#we are looking for outliers in third plot:
countries <- row.names( savings )
hatv <- hatvalues( savings_mod )
head( hatv )## Australia Austria Belgium Bolivia Brazil Canada
## 0.06771343 0.12038393 0.08748248 0.08947114 0.06955944 0.15840239
sum( hatv )## [1] 5
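A common rule of thumb flags observations whose leverage exceeds 2p/n, here 2 * 5 / 50 = 0.2 (a sketch added to the notes):
hatv[ hatv > 2 * sum( hatv ) / nrow( savings ) ]   # countries with unusually high leverage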
halfnorm( hatv, labs = countries, ylab = 'Leverages' ) If we standardize the residuals, we can see how many standard deviations each point lies from the center of the distribution:
qqnorm( rstandard( savings_mod ) )
abline( 0,1 )Outliers
A point that does not fit the current model well.
Simulate some data to take a look at how outliers may impact a model fit:
testdata <- data.frame( x = 1:10, y = 1: 10+rnorm( 10 ) )
lmod <- lm( y ~ x, testdata )
#an outlier with a central predictor
p1 <- rbind( testdata, c( 5.5, 12 ) )
lmod1 <- lm( y ~ x, p1 )
plt1 <- ggplot( p1, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 5.5, y = 12 ) ) +
ggtitle( 'Outlier with little leverage' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
plt1#an extra point well outside the range of the data
p2 <- rbind( testdata, c( 15, 15.1 ) )
lmod2 <- lm( y ~ x, p2 )
plt2 <- ggplot( p2, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 15, y = 15.1 ) ) +
ggtitle( 'Large leverage but is not an Outlier' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
#an outlier with leverage
p3 <- rbind( testdata, c( 15, 5.1 ) )
lmod3 <- lm( y ~ x, p3 )
plt3 <- ggplot( p3, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 15, y = 5.1 ) ) +
ggtitle( 'Large leverage and an Outlier' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
grid.arrange( plt1, plt2, plt3, ncol = 2 )Bonferroni Correction
stud <- rstudent( lmod )
stud[ which.max( abs( stud ) ) ]## 1
## -2.699505
#compute the Bonferroni critical value, 0.05/(2n) with n = 50 and df = 44, i.e. for the savings model
#(the studentized residual above came from the small simulated dataset):
qt( 0.05 / ( 50*2 ), 44 )## [1] -3.525801
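For the savings model itself, which those degrees of freedom correspond to, the analogous check would be (a sketch; stud_sav is a new name):
stud_sav <- rstudent( savings_mod )
stud_sav[ which.max( abs( stud_sav ) ) ]   # compare the largest |studentized residual| with the critical value above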
A dataset with multiple outliers
data( star, package = 'faraway' )
plt1 <- ggplot( star, aes( x = temp, y = light ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm' ) +
theme_classic()
plt1## `geom_smooth()` using formula 'y ~ x'
assess the data for outliers:
range( rstudent( lm( light ~ temp, star ) ) )## [1] -2.049393 1.905847
star2 <- star %>% filter( temp >3.6 )
plt1 + geom_smooth( data = star2, method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
No individual point stands out from the studentized residuals, yet when the four data points that represent the giant stars are removed, the linear relationship between light intensity and temperature is described by very different coefficients.
Influential Observations
Influential Data Points: observations whose removal causes a notable shift in the fitted model.
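For reference, Cook's distance combines the size of the residual with the leverage of the point (standard definition, stated here for context): \[D_i = \frac{r_i^2}{p}\,\frac{h_i}{1-h_i}\] where \(r_i\) is the standardized residual, \(h_i\) the leverage, and \(p\) the number of parameters.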
the Cook statistic is a good way to show influential data points and can be visualized as a half-normal plot:
sumary( savings_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56608654 7.35451611 3.8842 0.0003338
## pop15 -0.46119315 0.14464222 -3.1885 0.0026030
## pop75 -1.69149768 1.08359893 -1.5610 0.1255298
## dpi -0.00033690 0.00093111 -0.3618 0.7191732
## ddpi 0.40969493 0.19619713 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
cook <- cooks.distance( savings_mod )
halfnorm( cook, labs = countries, ylab = "Cook's distances" )
Now we evaluate the fit after removing the data point with the largest Cook's distance: Libya
savings_modi <- lm( sr ~ pop15 + pop75 + dpi + ddpi,
savings,
subset = ( cook < max( cook ) ) )
sumary( savings_modi )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.52404598 8.22402631 2.9820 0.004655
## pop15 -0.39144013 0.15790949 -2.4789 0.017084
## pop75 -1.28086692 1.14518206 -1.1185 0.269430
## dpi -0.00031890 0.00092933 -0.3432 0.733119
## ddpi 0.61027903 0.26877842 2.2706 0.028122
##
## n = 49, p = 5, Residual SE = 3.79481, R-Squared = 0.36
plot( savings_modi )
Checking the Structure of the Model:
Partial Regression (Added Variable) Plots: a look at the marginal relationship between the response and a predictor after the effect of the other predictors has been removed. The partial regression plot lets us focus on the relationship between one predictor and the response as in a simple regression. Better for detecting outliers and influential points.
Partial Residual Plots: the partial residuals for a predictor plotted against that predictor. Better for detecting nonlinearity.
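For reference, the partial residual for predictor \(j\) adds that predictor's fitted contribution back onto the overall residual (standard definition): \[\hat{\epsilon} + \hat{\beta}_j x_j = y - \sum_{k \ne j}\hat{\beta}_k x_k\] which is essentially what residuals( savings_mod, type = 'partial' ) returns, column by column.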
d <- residuals( lm( sr ~ pop75 + dpi + ddpi, savings ) )
m <- residuals( lm( pop15 ~ pop75 + dpi + ddpi, savings ) )
p <- predict( savings_mod, type = 'term', term = 'pop15' )
r <- residuals( savings_mod, type = 'partial' )[,'pop15']
dat <- data.frame( 'd' = d, 'm' = m, 'p' = p, 'r' = r, 'pop15' = savings$pop15 )
preg <- ggplot( dat, aes( x = m, y = d ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm', se = FALSE ) +
xlab( 'pop15 residuals' ) +
ylab( 'Savings residuals' ) +
ggtitle( 'Partial Regression / Added Variable Plot' ) +
theme_classic()
pres <- ggplot( dat, aes( x = pop15.1, y = r ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm', se = FALSE ) +
ylab( 'Partial for pop15' ) +
xlab( 'pop15' ) +
ggtitle( 'Partial Residual Plot' ) +
theme_classic()
grid.arrange( preg, pres, ncol = 2 )## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Exercise 6.1
Using the sat dataset, fit a model with the total SAT score as the response and expend, salary, ratio and takers as predictors. Perform regression diagnostics on this model to answer the following questions. Display any plots that are relevant. Do not provide any plots about which you have nothing to say. Suggest possible improvements or corrections to the model where appropriate.
data( sat )
glimpse( sat )## Rows: 50
## Columns: 7
## $ expend <dbl> 4.405, 8.963, 4.778, 4.459, 4.992, 5.443, 8.817, 7.030, 5.718, …
## $ ratio <dbl> 17.2, 17.6, 19.3, 17.1, 24.0, 18.4, 14.4, 16.6, 19.1, 16.3, 17.…
## $ salary <dbl> 31.144, 47.951, 32.175, 28.934, 41.078, 34.571, 50.045, 39.076,…
## $ takers <int> 8, 47, 27, 6, 45, 29, 81, 68, 48, 65, 57, 15, 13, 58, 5, 9, 11,…
## $ verbal <int> 491, 445, 448, 482, 417, 462, 431, 429, 420, 406, 407, 468, 488…
## $ math <int> 538, 489, 496, 523, 485, 518, 477, 468, 469, 448, 482, 511, 560…
## $ total <int> 1029, 934, 944, 1005, 902, 980, 908, 897, 889, 854, 889, 979, 1…
sat_mod <- lm( total ~ expend + salary + ratio + takers, data = sat )
sumary( sat_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.97154 52.86976 19.7839 < 2.2e-16
## expend 4.46259 10.54653 0.4231 0.6742
## salary 1.63792 2.38725 0.6861 0.4962
## ratio -3.62423 3.21542 -1.1271 0.2657
## takers -2.90448 0.23126 -12.5594 2.607e-16
##
## n = 50, p = 5, Residual SE = 32.70199, R-Squared = 0.82
- Check the constant variance assumption for the errors
plot( fitted( sat_mod ), residuals( sat_mod ),
xlab = 'Fitted', ylab = 'Residuals' )
abline( h = 0 )
Some signs of nonlinearity? nonconstant variance?…
plot( fitted( sat_mod ), sqrt( abs( residuals( sat_mod ) ) ),
xlab = 'Fitted', ylab = expression( sqrt( hat( epsilon ))) )
abline( h = 0 )
This looks more satisfactory?
- Check the normality assumption
qqnorm( residuals( sat_mod ) )
qqline( residuals( sat_mod ) )
That looks reasonably normal
- Check for large leverage points
halfnorm( hatvalues( sat_mod ) )
- Check for outliers
qqnorm( rstandard( sat_mod ) )
abline( 0,1 )
- Check for influential points
cook <- cooks.distance( sat_mod )
halfnorm( cook,5 )
Observation 44 appears to be an outlier with large leverage, and hence an influential point
- Check the structure of the relationship between the predictors and the response
termplot( sat_mod, partial.resid = TRUE, terms = 3 )
Transformation
Transformations of the response and/or predictors can improve the fit and correct violations of model assumptions.
When you use a log transformation on the response, the regression coefficients have a particular interpretation: \[\mbox{log}\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_px_p\] \[\hat{y} = e^{\hat{\beta}_0}e^{\hat{\beta}_1x_1} \cdots e^{\hat{\beta}_px_p}\]
The Box-Cox transformation is a popular way to determine a transformation of the response.
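For reference, the Box-Cox family is (standard definition; \(\lambda\) is chosen to maximize the profile log-likelihood): \[g_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \mbox{log}\, y & \lambda = 0 \end{cases}\]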
Transforming the response can make the model harder to interpret, so we do not want to do it unless it is really necessary. Here is an example with the savings data:
data( savings )
glimpse( savings )## Rows: 50
## Columns: 5
## $ sr <dbl> 11.43, 12.07, 13.17, 5.75, 12.88, 8.79, 0.60, 11.90, 4.98, 10.78…
## $ pop15 <dbl> 29.35, 23.32, 23.80, 41.89, 42.19, 31.72, 39.74, 44.75, 46.64, 4…
## $ pop75 <dbl> 2.87, 4.41, 4.43, 1.67, 0.83, 2.85, 1.34, 0.67, 1.06, 1.14, 3.93…
## $ dpi <dbl> 2329.68, 1507.99, 2108.47, 189.13, 728.47, 2982.88, 662.86, 289.…
## $ ddpi <dbl> 2.87, 3.93, 3.82, 0.22, 4.56, 2.43, 2.67, 6.51, 3.08, 2.80, 3.99…
savings_mod <- lm( sr ~ pop15 + pop75 + dpi + ddpi, savings )
boxcox( savings_mod, plotit = T )
boxcox( savings_mod, plotit = T, lambda = seq( 0.5, 1.5, by = 0.1 ) )
Transforming the Predictors
Broken Stick Regression: Sometimes we have reason to believe that different linear regression models apply to different regions of the data. Broken stick regression uses two basis functions, one active on each side of a chosen knot: \[y = \beta_0 + \beta_1\mbox{B}_1(x) + \beta_2\mbox{B}_2(x) + \epsilon\]
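A minimal sketch of the two basis functions, assuming a knot at pop15 = 35 and using the savings data purely for illustration (the knot location and data choice are assumptions, not taken from the text above):
#hockey-stick basis functions about a knot c
lhs <- function( x, c = 35 ) ifelse( x < c, c - x, 0 )   #active below the knot
rhs <- function( x, c = 35 ) ifelse( x < c, 0, x - c )   #active above the knot
broken_mod <- lm( sr ~ lhs( pop15 ) + rhs( pop15 ), data = savings )
sumary( broken_mod )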
Polynomials. response surfaces.
Splines
B-spline basis functions
funky <- function( x ) sin( 2 * pi * x^3 )^3
x = seq( 0,1,by = 0.01 )
y = funky( x ) + 0.1*rnorm( 101 )
matplot( x, cbind( y, funky(x) ), type = "pl", pch = 20, lty = 1, col = 1 )
look at the fits for orthogonal polynomial bases of order 4 and 12:
g4 <- lm( y ~ poly( x, 4 ) )
g12 <- lm( y ~ poly( x, 12 ) )
matplot( x, cbind( y, g4$fit, g12$fit ), type = "pll", pch = 20, lty = c( 1,2 ), col = 1 )
The order-4 polynomial does not fit the data well enough, while the order-12 polynomial appears to overfit.
try with splines:
knots <- c( 0, 0, 0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 1, 1, 1, 1 )
bx <- splineDesign( knots, x )
lmodb <- lm( y ~ bx -1 )
matplot( x, bx, type = 'l', col = 1 )
matplot( x, cbind( y, lmodb$fit ), type = 'pl', pch = 20, lty = 1, col = 1 )
Smoothing spline
ssf <- smooth.spline( x,y )
matplot( x, cbind( y, ssf$y ), type = 'pl', lty = 1, pch = 20, col = 1 )
Missing Data
Types of Missing Data
- Missing Cases: we fail to observe a complete case
- Incomplete Values: not a full series of data
- Missing Values: missed incidences
Types of Missingness:
- Missing Completely at Random (MCAR): the probability of a value being missing does not depend on any data, observed or unobserved
- Missing at Random (MAR): the probability of missingness depends only on other, observed variables
- Missing Not at Random (MNAR): the probability of missingness depends on the missing value itself (or on an unobserved variable)
Deletion
data( chmiss, package = 'faraway')
glimpse( chmiss )## Rows: 47
## Columns: 6
## $ race <dbl> 10.0, 22.2, 19.6, 17.3, 24.5, 54.0, 4.9, 7.1, 5.3, 21.5, 43.1…
## $ fire <dbl> 6.2, 9.5, 10.5, 7.7, 8.6, 34.1, 11.0, 6.9, 7.3, 15.1, 29.1, 2…
## $ theft <dbl> 29, 44, 36, 37, 53, 68, 75, 18, 31, NA, 34, 14, 11, 11, 22, N…
## $ age <dbl> 60.4, 76.5, NA, NA, 81.4, 52.6, 42.6, 78.5, 90.1, 89.8, 82.7,…
## $ involact <dbl> NA, 0.1, 1.2, 0.5, 0.7, 0.3, 0.0, 0.0, NA, 1.1, 1.9, 0.0, 0.0…
## $ income <dbl> 11.744, 9.323, 9.948, 10.656, 9.730, 8.231, 21.480, 11.104, 1…
standard summaries are a handy first peek to notice missing values
summary( chmiss )## race fire theft age
## Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
## 1st Qu.: 3.75 1st Qu.: 5.60 1st Qu.: 22.00 1st Qu.:48.30
## Median :24.50 Median : 9.50 Median : 29.00 Median :64.40
## Mean :35.61 Mean :11.42 Mean : 32.65 Mean :59.97
## 3rd Qu.:57.65 3rd Qu.:15.10 3rd Qu.: 38.00 3rd Qu.:78.25
## Max. :99.70 Max. :36.20 Max. :147.00 Max. :90.10
## NA's :4 NA's :2 NA's :4 NA's :5
## involact income
## Min. :0.0000 Min. : 5.583
## 1st Qu.:0.0000 1st Qu.: 8.564
## Median :0.5000 Median :10.694
## Mean :0.6477 Mean :10.736
## 3rd Qu.:0.9250 3rd Qu.:12.102
## Max. :2.2000 Max. :21.480
## NA's :3 NA's :2
the dataset is small enough that we can also look at the missing values by record (row)
rowSums( is.na( chmiss ) )## 60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 60646
## 1 0 1 1 0 0 0 0 1 1 0 0 1
## 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 60623 60608
## 0 0 1 0 0 0 1 1 0 0 1 0 1
## 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 60652 60620 60619
## 1 0 1 0 0 0 1 0 1 0 0 1 0
## 60649 60617 60655 60643 60628 60627 60633 60645
## 1 1 0 0 1 0 0 1
Look for patterns in the missing data:
image( is.na( chmiss ), axes = FALSE, col = gray(1:0) )
axis( 2, at = 0:5/5, labels = colnames( chmiss ) )
axis( 1, at = 0:46/46, labels = row.names( chmiss ), las = 2 )
The missing values seem to be randomly peppered throughout the data.
One strategy is to drop every case containing a missing value. However, this reduces the number of observations from 47 to 27, and the smaller sample size results in larger standard errors for the regression fit, as demonstrated here:
modmiss <- lm( data = chmiss, involact ~ . )
sumary( modmiss )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1164827 0.6057615 -1.8431 0.0794750
## race 0.0104867 0.0031283 3.3522 0.0030180
## fire 0.0438757 0.0103190 4.2519 0.0003557
## theft -0.0172198 0.0059005 -2.9184 0.0082154
## age 0.0093766 0.0034940 2.6837 0.0139041
## income 0.0687006 0.0421558 1.6297 0.1180775
##
## n = 27, p = 6, Residual SE = 0.33822, R-Squared = 0.79
data( chredlin, package = 'faraway' )
modfull <- lm( involact ~ ., data = chredlin )
sumary( modfull )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6290917 0.5119934 -1.2287 0.226360
## race 0.0088999 0.0026376 3.3743 0.001655
## fire 0.0390701 0.0086378 4.5232 5.334e-05
## theft -0.0102104 0.0029224 -3.4938 0.001178
## age 0.0084194 0.0029192 2.8841 0.006292
## income 0.0246958 0.0320917 0.7695 0.446094
## sides 0.0240311 0.1250544 0.1922 0.848585
##
## n = 47, p = 7, Residual SE = 0.33913, R-Squared = 0.75
Deleting cases is a simple strategy, but we end up throwing away information that might allow for a more precise inference.
Single Imputation
a tempting solution is to impute missing values with the column means
cmeans <- colMeans( chmiss, na.rm = TRUE )
mchm <- chmiss
for( i in c( 1:4,6 ) ) mchm[ is.na( chmiss[,i]),i ] <- cmeans[i]
imod <- lm( involact ~ ., mchm )
sumary( imod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0708021 0.5094531 0.1390 0.890203
## race 0.0071173 0.0027057 2.6305 0.012245
## fire 0.0287418 0.0093855 3.0624 0.004021
## theft -0.0030590 0.0027457 -1.1141 0.272242
## age 0.0060795 0.0032079 1.8952 0.065695
## income -0.0270917 0.0316782 -0.8552 0.397791
##
## n = 44, p = 6, Residual SE = 0.38412, R-Squared = 0.68
Notice how the coefficients are dragged closer to zero. Imputing with the mean flattens the data, which shrinks the apparent importance of the predictors. Mean imputation is not recommended except where the fraction of missing values is small.
Using regression methods to predict values of covariates
lmodr <- lm( logit( race/100 ) ~ fire + theft + age + income, data = chmiss )
ilogit( predict( lmodr, chmiss[ is.na( chmiss$race ),]))*100## 60646 60651 60616 60617
## 0.4190909 14.7320193 84.2653995 21.3121261
compare these predicted values to the real data
chredlin$race[ is.na( chmiss$race ) ]## [1] 1.0 13.4 62.3 36.4
The regression fill-in will also introduce a bias toward zero in the coefficients and reduce the variance, but the effect is not as acute as with imputation by the mean.
Multiple Imputation
The problem with single imputation is that the imputed value, be it a mean or a regression prediction, tends to be less variable than the value we would have observed, because it carries none of the error variation normally present in real data. Multiple imputation puts that variation back in.
Here we use the Amelia package to perform multiple imputation:
set.seed( 123 )
chimp <- amelia( chmiss, m=25 )## -- Imputation 1 --
##
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34
##
## ( EM iteration counters for imputations 2 through 25 omitted )
now fit a linear model to the 25 imputed datasets and store the results:
betas <- NULL
ses <- NULL
for(i in 1:chimp$m ){
lmod <- lm( involact ~ race + fire + theft + age, chimp$imputations[[i]])
betas <- rbind( betas, coef( lmod ) )
ses <- rbind( ses, coef( summary( lmod ) )[,2])
}
use the mi.meld() function to combine the coefficient estimates and standard errors across the imputations:
cr <- mi.meld( q = betas, se = ses )
cr## $q.mi
## (Intercept) race fire theft age
## [1,] -0.2364977 0.007759897 0.03523587 -0.008675917 0.007214505
##
## $se.mi
## (Intercept) race fire theft age
## [1,] 0.163568 0.002042961 0.008428774 0.004698103 0.00251655
Exercises
13.1
13.2
Problems with Predictors
Errors in Predictors
The measurement error issue: what happens when a predictor is observed with error?
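For a simple regression where the true predictor is observed with independent error of variance \(\sigma_\delta^2\), the least squares slope is attenuated toward zero (standard errors-in-variables result, stated here for context): \[E(\hat{\beta}_1) \approx \beta_1\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_\delta^2}\]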
data( cars )
cars_mod <- lm( dist ~ speed, data = cars )
cars_p1 <- ggplot( cars, aes( x = speed, y = dist ) ) +
geom_point() +
geom_smooth( method = 'lm', se = F, color = 'black' ) +
theme_classic()
cars_p1## `geom_smooth()` using formula 'y ~ x'
the effect of adding measurement error to the predictor
cars_mod1 <- lm( dist ~ I( speed + rnorm( dim( cars )[1] )), data = cars )
cars_mod2 <- lm( dist ~ I( speed + 2*rnorm( dim( cars )[1] )), data = cars )
cars_mod5 <- lm( dist ~ I( speed + 5*rnorm( dim( cars )[1] )), data = cars )
cars_p1 +
geom_abline(slope = cars_mod1$coefficients[2], intercept = cars_mod1$coefficients[1], color = 'yellow' ) +
geom_abline(slope = cars_mod2$coefficients[2], intercept = cars_mod2$coefficients[1], color = 'orange' ) +
geom_abline(slope = cars_mod5$coefficients[2], intercept = cars_mod5$coefficients[1], color = 'red' )## `geom_smooth()` using formula 'y ~ x'
The slope gets shallower as the error increases.
Here we simulate the effects of adding normal random error:
vv <- rep( 1:5/10, each=1000 )
slopes <- numeric( 5000 )
for ( i in 1:5000 ) slopes[i] <- lm( dist ~ I( speed + sqrt( vv[i] )*rnorm( dim( cars )[1] )), data = cars )$coef[2]
#plot the mean slopes for each variance
betas <- c( coef( cars_mod )[2], colMeans( matrix( slopes, nrow=1000 ) ) )
variances <- c( 0, 1:5/10 ) + 0.5
plot( variances, betas, xlim = c( 0,1 ), ylim = c( 3.86,4 ) )
#fit a linear model and extrapolate to 0 variance
gv <- lm( betas ~ variances )
coef( gv )## (Intercept) variances
## 3.9949761 -0.1276996
points( 0, gv$coef[1], pch=3 )
The predicted value of \(\hat{\beta}\) at variance equal to zero, that is no measurement error, is 4.0.
Here is a prediction result using the simex package:
set.seed( 123 )
lmod <- lm( dist ~ speed, cars, x = T )
simout <- simex( lmod, "speed", 0.5, B=1000 )
simout##
## Naive model:
## lm(formula = dist ~ speed, data = cars, x = T)
##
## SIMEX-Variables: speed
## Number of Simulations: 1000
##
## Coefficients:
## (Intercept) speed
## -18.01 3.96
Changes in Scale
most methods work more reliably when variables are on roughly similar scales.
sumary( savings_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56608654 7.35451611 3.8842 0.0003338
## pop15 -0.46119315 0.14464222 -3.1885 0.0026030
## pop75 -1.69149768 1.08359893 -1.5610 0.1255298
## dpi -0.00033690 0.00093111 -0.3618 0.7191732
## ddpi 0.40969493 0.19619713 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
rescale to measure income in thousands instead of dollars
savings_mod_scale <- lm( sr ~ pop15 + pop75 + I(dpi/1000) + ddpi, data = savings )
sumary( savings_mod_scale )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56609 7.35452 3.8842 0.0003338
## pop15 -0.46119 0.14464 -3.1885 0.0026030
## pop75 -1.69150 1.08360 -1.5610 0.1255298
## I(dpi/1000) -0.33690 0.93111 -0.3618 0.7191732
## ddpi 0.40969 0.19620 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
another approach is to use the scale() function: the regression coefficients then represent the effect of a one standard unit increase in the predictor on the response, in standard units
savings_scaled <- data.frame( scale( savings ) )
savings_mod_scale2 <- lm( sr ~ ., savings_scaled )
sumary( savings_mod_scale2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8091e-16 1.2003e-01 0.0000 1.000000
## pop15 -9.4204e-01 2.9545e-01 -3.1885 0.002603
## pop75 -4.8731e-01 3.1218e-01 -1.5610 0.125530
## dpi -7.4508e-02 2.0592e-01 -0.3618 0.719173
## ddpi 2.6243e-01 1.2567e-01 2.0882 0.042471
##
## n = 50, p = 5, Residual SE = 0.84873, R-Squared = 0.34
Now, with the predictors on a common scale, it is helpful to plot the estimates with their confidence intervals
edf <- data.frame( coef( savings_mod_scale2 ), confint( savings_mod_scale2 ) )[-1,]
names( edf ) <- c( 'Estimate', 'lb', 'ub' )
p <- ggplot( aes( y = Estimate, ymin = lb, ymax = ub, x = row.names( edf ) ), data = edf ) +
geom_pointrange() +
coord_flip() +
xlab( 'Predictor' ) +
geom_hline( yintercept = 0, col = gray( 0.75 ) ) +
ggtitle( 'standardized coefficients', subtitle = 'confidence intervals shown' )
p
Scaling with binary predictors:
#take a look at the variable pop15 for age
ggplot( aes( x = pop15 ), data = savings ) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data is bimodal. We can create a binary variable for age that divides at pop15 = 35
savings$age <- ifelse( savings$pop15 > 35, 0, 1 )
A binary predictor taking the values 0/1 with equal probability has a standard deviation of one half. This suggests scaling the other continuous predictors by two SDs rather than one:
savings$dpis <- (savings$dpi - mean( savings$dpi ))/(2*sd( savings$dpi ) )
savings$ddpis <- (savings$ddpi - mean( savings$ddpi ))/(2*sd( savings$ddpi ) )
sumary( lm( sr ~ age + dpis + ddpis, savings ) )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8176 1.0106 6.7464 2.19e-08
## age 5.2841 1.5849 3.3341 0.001697
## dpis -1.5642 1.6093 -0.9720 0.336127
## ddpis 2.4681 1.1082 2.2272 0.030866
##
## n = 50, p = 4, Residual SE = 3.79990, R-Squared = 0.32
Collinearity
exact collinearity: some predictors are exact linear combinations of others, so \(X^TX\) is singular
multicollinearity: \(X^TX\) is close to singular, but not exactly singular
detecting collinearity:
- correlation matrix values close to \(\pm 1\) indicate large pairwise collinearities
- a regression of \(x_i\) on all the other predictors gives \(R^2_i\); an \(R^2_i\) close to 1 is bad and means one predictor can be predicted by a linear combination of the others
- examine the eigenvalues of \(X^TX\): eigenvalues close to 0 denote collinearity
data( seatpos, package = 'faraway' )
seatpos_mod <- lm( hipcenter ~ ., data = seatpos )
sumary( seatpos_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.432128 166.571619 2.6201 0.01384
## Age 0.775716 0.570329 1.3601 0.18427
## Weight 0.026313 0.330970 0.0795 0.93718
## HtShoes -2.692408 9.753035 -0.2761 0.78446
## Ht 0.601345 10.129874 0.0594 0.95307
## Seated 0.533752 3.761894 0.1419 0.88815
## Arm -1.328069 3.900197 -0.3405 0.73592
## Thigh -1.143119 2.660024 -0.4297 0.67056
## Leg -6.439046 4.713860 -1.3660 0.18245
##
## n = 38, p = 9, Residual SE = 37.72029, R-Squared = 0.69
none of the individual predictors is significant, yet the \(R^2\) is fairly large.
now to look at the pairwise correlations
seatpos_cor <- round( cor( seatpos[,-9] ), 2 )
seatpos_cor## Age Weight HtShoes Ht Seated Arm Thigh Leg
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00
ok, that’s not very intuitive.
ggcorrplot( seatpos_cor )
There are several large pairwise correlations.
now to check the eigenvalue decomposition
x <- model.matrix( seatpos_mod )[,-1]
e <- eigen( t(x) %*% x )
e$val## [1] 3.653671e+06 2.147948e+04 9.043225e+03 2.989526e+02 1.483948e+02
## [6] 8.117397e+01 5.336194e+01 7.298209e+00
sqrt( e$val[1]/e$val )## [1] 1.00000 13.04226 20.10032 110.55123 156.91171 212.15650 261.66698
## [8] 707.54911
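These ratios are the condition numbers (standard definition, noted here for context): \[\kappa_j = \sqrt{\lambda_1/\lambda_j}\] Values much larger than about 30 are commonly taken to signal troublesome collinearity, so the largest values here are a clear warning.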
compute the variance inflation factors:
vif( x )## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
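For reference, the variance inflation factor for predictor \(j\) is (standard definition): \[VIF_j = \frac{1}{1 - R^2_j}\] where \(R^2_j\) comes from regressing \(x_j\) on the other predictors; the huge values for HtShoes and Ht reflect their near-perfect correlation.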
let’s look at the correlation for all the length measurements:
seatpos_length_cor <-round( cor( x[,3:8] ), 2 )
seatpos_length_cor## HtShoes Ht Seated Arm Thigh Leg
## HtShoes 1.00 1.00 0.93 0.75 0.72 0.91
## Ht 1.00 1.00 0.93 0.75 0.73 0.91
## Seated 0.93 0.93 1.00 0.63 0.61 0.81
## Arm 0.75 0.75 0.63 1.00 0.67 0.75
## Thigh 0.72 0.73 0.61 0.67 1.00 0.65
## Leg 0.91 0.91 0.81 0.75 0.65 1.00
ggcorrplot( seatpos_length_cor )
let's take a look at model performance when we use just one length predictor, height
seatpos_mod2 <- lm( hipcenter ~ Age + Weight + Ht, data = seatpos )
sumary( seatpos_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 528.2977290 135.3129471 3.9043 0.0004257
## Age 0.5195041 0.4080385 1.2732 0.2115928
## Weight 0.0042707 0.3117199 0.0137 0.9891491
## Ht -4.2119048 0.9990562 -4.2159 0.0001737
##
## n = 38, p = 4, Residual SE = 36.48611, R-Squared = 0.66
We get nearly the same explanatory power with a much simpler model.
Model Selection
The problem of selecting the best subset of predictors.
Occam’s Razor: among several plausible explanations for a phenomenon, the simplest is the best.
Hierarchical Models
Do not remove the lower order terms in the presence of higher order terms.
Testing-based procedures
- Backward Elimination - start with the kitchen sink (all predictors). One by one, remove the predictor with the highest p-value that is \(\gt \alpha\)
- Forward Selection - start with nothing. Try a model fit with each remaining variable and add the one that gives the smallest p-value. Repeat this process until no remaining variable would enter with a small enough p-value.
- Stepwise Regression - a combination of forwards and backwards elimination.
data( state )
statedata = data.frame( state.x77, row.names = state.abb )
state_mod <- lm( Life.Exp ~ ., statedata )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0943e+01 1.7480e+00 40.5859 < 2.2e-16
## Population 5.1800e-05 2.9187e-05 1.7748 0.08318
## Income -2.1804e-05 2.4443e-04 -0.0892 0.92934
## Illiteracy 3.3820e-02 3.6628e-01 0.0923 0.92687
## Murder -3.0112e-01 4.6621e-02 -6.4590 8.68e-08
## HS.Grad 4.8929e-02 2.3323e-02 2.0979 0.04197
## Frost -5.7350e-03 3.1432e-03 -1.8246 0.07519
## Area -7.3832e-08 1.6682e-06 -0.0443 0.96491
##
## n = 50, p = 8, Residual SE = 0.74478, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Area )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0989e+01 1.3875e+00 51.1652 < 2.2e-16
## Population 5.1883e-05 2.8788e-05 1.8023 0.07852
## Income -2.4440e-05 2.3429e-04 -0.1043 0.91740
## Illiteracy 2.8459e-02 3.4163e-01 0.0833 0.93400
## Murder -3.0182e-01 4.3344e-02 -6.9634 1.454e-08
## HS.Grad 4.8472e-02 2.0667e-02 2.3454 0.02369
## Frost -5.7758e-03 2.9702e-03 -1.9446 0.05839
##
## n = 50, p = 7, Residual SE = 0.73608, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Illiteracy )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1066e+01 1.0289e+00 69.0669 < 2.2e-16
## Population 5.1149e-05 2.7095e-05 1.8878 0.06566
## Income -2.4771e-05 2.3160e-04 -0.1070 0.91531
## Murder -3.0001e-01 3.7042e-02 -8.0992 2.907e-10
## HS.Grad 4.7758e-02 1.8591e-02 2.5689 0.01367
## Frost -5.9099e-03 2.4678e-03 -2.3948 0.02095
##
## n = 50, p = 6, Residual SE = 0.72773, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Income )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.02712853 0.95285296 74.5415 < 2.2e-16
## Population 0.00005014 0.00002512 1.9960 0.052005
## Murder -0.30014880 0.03660946 -8.1987 1.775e-10
## HS.Grad 0.04658225 0.01482706 3.1417 0.002968
## Frost -0.00594329 0.00242087 -2.4550 0.018018
##
## n = 50, p = 5, Residual SE = 0.71969, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Population )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.0363788 0.9832622 72.2456 < 2.2e-16
## Murder -0.2830652 0.0367313 -7.7064 8.039e-10
## HS.Grad 0.0499487 0.0152011 3.2859 0.001950
## Frost -0.0069117 0.0024475 -2.8240 0.006988
##
## n = 50, p = 4, Residual SE = 0.74267, R-Squared = 0.71
However, it is important to understand that the variables omitted from the model may still be related to the response.
Drawbacks to step-based procedures
- it is possible to miss the optimal model
- the p-vals should not be treated too literally
- variables that get dropped may still be correlated to the response of interest
- step-wise variable selection methods tend to pick models that are smaller than desirable for prediction purposes
Criterion-Based Procedures
Choosing a model that optimizes a criterion.
Kullback-Leibler information (distance): \(I(f,g)=\int f(x)\,\mbox{log}\left(\frac{f(x)}{g(x|\theta)}\right)dx\)
but we don't know \(f\). Instead, use the Akaike Information Criterion: \(AIC = -2L(\hat{\theta}) + 2p\)
Choose the model that minimizes the AIC. AIC naturally balances between fit and simplicity in model selection
Bayesian Information Criterion: \(BIC = -2L(\hat{\theta}) + p\,\mbox{log}(n)\), which penalizes larger models more heavily than AIC.
b <- regsubsets( Life.Exp ~ ., data = statedata )
rs <- summary( b )
rs$which## (Intercept) Population Income Illiteracy Murder HS.Grad Frost Area
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
AIC <- 50*log(rs$rss/50)+(2:8)*2
plot( AIC ~ I(1:7), ylab = 'AIC', xlab="Number of Predictors")
AIC is minimized by a choice of 4 predictors
Using \(R^2_{adj}\)
plot( 2:8, rs$adjr2, xlab = 'No. of parameters', ylab = 'Adjusted R-squared')
which.max( rs$adjr2 )## [1] 4
Mallow’s Cp statistic
plot( 2:8, rs$cp, xlab = 'No. of parameters', ylab = 'Cp statistic')
abline( 0,1 )
step()
state_mod <- lm( Life.Exp ~ ., data = statedata )
step( state_mod )## Start: AIC=-22.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Area 1 0.0011 23.298 -24.182
## - Income 1 0.0044 23.302 -24.175
## - Illiteracy 1 0.0047 23.302 -24.174
## <none> 23.297 -22.185
## - Population 1 1.7472 25.044 -20.569
## - Frost 1 1.8466 25.144 -20.371
## - HS.Grad 1 2.4413 25.738 -19.202
## - Murder 1 23.1411 46.438 10.305
##
## Step: AIC=-24.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost
##
## Df Sum of Sq RSS AIC
## - Illiteracy 1 0.0038 23.302 -26.174
## - Income 1 0.0059 23.304 -26.170
## <none> 23.298 -24.182
## - Population 1 1.7599 25.058 -22.541
## - Frost 1 2.0488 25.347 -21.968
## - HS.Grad 1 2.9804 26.279 -20.163
## - Murder 1 26.2721 49.570 11.569
##
## Step: AIC=-26.17
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## - Income 1 0.006 23.308 -28.161
## <none> 23.302 -26.174
## - Population 1 1.887 25.189 -24.280
## - Frost 1 3.037 26.339 -22.048
## - HS.Grad 1 3.495 26.797 -21.187
## - Murder 1 34.739 58.041 17.456
##
## Step: AIC=-28.16
## Life.Exp ~ Population + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
##
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
## data = statedata)
##
## Coefficients:
## (Intercept) Population Murder HS.Grad Frost
## 7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03
The sequence of variable removal is the same as with backward elimination and the model selected is the same as for AIC.
Check for high leverage:
h <- lm.influence( state_mod )$hat
names(h) <- state.abb
rev( sort( h ) )## AK CA HI NV NM TX NY
## 0.80952226 0.40885689 0.37876168 0.36524576 0.32472157 0.28416353 0.25694978
## WA OR ND LA CT UT RI
## 0.22268196 0.22183076 0.21968606 0.19494860 0.19362508 0.19097575 0.17082010
## MD AL AZ MS FL IL PA
## 0.16407127 0.16110483 0.15890792 0.15571602 0.14856937 0.13743337 0.13209846
## NJ SD ME SC MI WY VT
## 0.12993167 0.12528365 0.12181903 0.11757971 0.11725238 0.11638093 0.11504122
## MA GA MO KY AR CO WV
## 0.11274391 0.11028889 0.11026698 0.10909216 0.10447656 0.10250875 0.09964990
## NC DE NH ID OH MT MN
## 0.09361765 0.09322149 0.08980829 0.08755753 0.08751277 0.08638978 0.07617370
## TN WI VA OK IA NE KS
## 0.07006383 0.06835416 0.06393987 0.06349971 0.06200255 0.05749338 0.05537644
## IN
## 0.05198210
Alaska has high leverage. retry fitting without Alaska:
b <- regsubsets( Life.Exp ~ ., data = statedata, subset = ( state.abb!="AK"))
rs <- summary( b )
rs$which[ which.max( rs$adjr2 ),]## (Intercept) Population Income Illiteracy Murder HS.Grad
## TRUE TRUE FALSE FALSE TRUE TRUE
## Frost Area
## TRUE TRUE
Area is now included in the model
Now observe the data and see if transformations are necessary
stripchart( data.frame( scale( statedata ) ), method = 'jitter', las = 2, vertical = TRUE )
Both Area and Population are skewed; re-evaluate the model with log transformations of these predictors:
b <- regsubsets( Life.Exp ~ log( Population ) + Income + Illiteracy + Murder + HS.Grad + Frost + log( Area ), data = statedata )
rs <- summary( b )
rs$which[which.max( rs$adjr2),]## (Intercept) log(Population) Income Illiteracy Murder
## TRUE TRUE FALSE FALSE TRUE
## HS.Grad Frost log(Area)
## TRUE TRUE FALSE
While selecting models, think about:
- Do the models have similar qualitative consequences?
- Do they make similar predictions?
- What is the cost of measuring the predictors?
- Which has the best diagnostics?
Shrinkage Methods
More predictors in a model mean more information. This chapter looks at how to shrink the additional information into a more useful (interpretable) form.
Principal Components
Principal Component Analysis (PCA): finding low-dimensional linear structure in higher dimensional data.
find the orthogonal directions of the greatest variation in the data
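Concretely (standard formulation), the principal components are the linear combinations \(z_j = Xu_j\), where the \(u_j\) are the eigenvectors of the (scaled) covariance matrix of \(X\), ordered by decreasing eigenvalue, so the first few components capture most of the variance.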
cfat <- fat[,9:18]
prfat <- prcomp( cfat, scale = TRUE )
dim( prfat$rot )## [1] 10 10
dim( prfat$x )## [1] 252 10
summary( prfat )## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6498 0.85301 0.81909 0.70114 0.54708 0.52831 0.45196
## Proportion of Variance 0.7021 0.07276 0.06709 0.04916 0.02993 0.02791 0.02043
## Cumulative Proportion 0.7021 0.77490 0.84199 0.89115 0.92108 0.94899 0.96942
## PC8 PC9 PC10
## Standard deviation 0.40539 0.27827 0.2530
## Proportion of Variance 0.01643 0.00774 0.0064
## Cumulative Proportion 0.98586 0.99360 1.0000
round( prfat$rot[,1],2 )## neck chest abdom hip thigh knee ankle biceps forearm wrist
## 0.33 0.34 0.33 0.35 0.33 0.33 0.25 0.32 0.27 0.30
Mahalanobis Distance: is a measure of the distance of a point from the mean that adjusts for the correlation in the data.
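For reference (standard definition), the Mahalanobis distance of observation \(x_i\) from a center \(\hat{\mu}\) with covariance estimate \(\hat{\Sigma}\) is \[d_i = \sqrt{(x_i - \hat{\mu})^T\hat{\Sigma}^{-1}(x_i - \hat{\mu})}\] Here robust estimates of the center and covariance from cov.rob() are used.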
robfat <- cov.rob( cfat )
md <- mahalanobis( cfat, center = robfat$center, cov=robfat$cov )
n <- nrow( cfat ); p <- ncol( cfat )
plot( qchisq( 1:n/(n+1), p), sort(md),
xlab = expression( paste( chi^2,"quantiles")),
ylab = "Sorted Mahalanobis Distances")
abline( 0,1 )
use the 1st 2 PCs to model the data
lmodpcr <- lm( fat$brozek ~ prfat$x[,1:2] )
sumary( lmodpcr )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.93849 0.32913 57.5416 < 2.2e-16
## prfat$x[, 1:2]PC1 1.84198 0.12446 14.8003 < 2.2e-16
## prfat$x[, 1:2]PC2 -3.55053 0.38661 -9.1837 < 2.2e-16
##
## n = 252, p = 3, Residual SE = 5.22473, R-Squared = 0.55
Take a few representative predictors based on the largest coefficients seen in the PCs: from the first two PCs, go with abdomen and the contrast (ankle - abdomen). Abdomen wasn't the largest coefficient in PC1, but using it keeps the model down to two features, whereas going with the largest (hip) would add a third predictor.
lmodr <- lm( fat$brozek ~ scale( abdom ) + I( scale( ankle ) - scale( abdom ) ), data = cfat )
sumary( lmodr )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.93849 0.27937 67.7891 < 2.2e-16
## scale(abdom) 5.76286 0.32840 17.5485 < 2.2e-16
## I(scale(ankle) - scale(abdom)) -0.99504 0.31403 -3.1686 0.001723
##
## n = 252, p = 3, Residual SE = 4.43492, R-Squared = 0.68
There we go: we have a simple model that fits as well as the 10 predictor model. We can interpret it similarly to the previous model but it is easier to explain to others. Future studies might be done more cheaply because we might only need these two measures.
Build a model to predict the fat content of new samples using the 100 absorbances which can be measured more easily:
data( meatspec, package = 'faraway' )
#glimpse( meatspec )
trainmeat <- meatspec[ 1:172, ]
testmeat <- meatspec[ 173: 215,]
meat_mod <- lm( fat ~ ., trainmeat )
# r^2 for the kitchen sink
summary( meat_mod )$r.squared## [1] 0.9970196
compare the rmse for the train and test sets
rmse <- function( x,y ) sqrt( mean( (x-y )^2 ) )
rmse( fitted( meat_mod ), trainmeat$fat )## [1] 0.6903167
rmse( predict( meat_mod, testmeat ), testmeat$fat )## [1] 3.814
Performance is much worse for the test data.
Okay, so there are 100 features in this dataset. We probably don't need them all. Next try a stepwise feature selection method, sketched below:
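A minimal sketch of one way to do this, assuming backward elimination by AIC via step() (the retained-feature count and RMSE will differ somewhat from the figures quoted below):
meat_step <- step( meat_mod, trace = 0 )              #backward elimination by AIC
length( coef( meat_step ) ) - 1                       #number of retained predictors
rmse( predict( meat_step, testmeat ), testmeat$fat )  #test-set RMSE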
About 28 features were dropped and performance improved a bit.
Now let’s go with PCR
meatpca <- prcomp( trainmeat[, -101 ] )
round( meatpca$sdev, 3 )## [1] 5.055 0.511 0.282 0.168 0.038 0.025 0.014 0.011 0.005 0.003 0.002 0.002
## [13] 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [25] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [37] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [49] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [61] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [73] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [85] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [97] 0.000 0.000 0.000 0.000
matplot( 1:100, meatpca$rot[,1:3], type = 'l', xlab = 'Frequency', ylab = '', col = 1 )
PCR
meatpcr <- pcr( fat ~ ., data = trainmeat, ncomp = 50 )
rmse( predict( meatpcr, ncomp = 4 ), trainmeat$fat )## [1] 4.064745
par(mfrow=c(1,2))
#plot the 100 slope coefs for the kitchen sink
plot( meat_mod$coef[-1], xlab = 'Frequency', ylab = 'Coefficient', type = 'l' )
coefplot( meatpcr, ncomp = 4, xlab = 'Frequency', main = '' )
we used 4 PCs, why? How to select the number: a scree plot (plot of the component SDs)
plot( meatpca$sdev[ 1:10 ], type = 'l', ylab = 'SD of PC', xlab = 'PC number' )
General rule of thumb: use PCs up to the last 'elbow' in the scree plot
How well does 4 PCs work with the test data?
rmse( predict( meatpcr, testmeat, ncomp = 4 ), testmeat$fat )## [1] 4.533982
pcrmse <- RMSEP( meatpcr, newdata = testmeat )
plot( pcrmse, main = '' )
which.min( pcrmse$val )## [1] 28
pcrmse$val[ 28 ]## [1] 1.854858
The best result occurs for 28 PCs.
Cross-validation
set.seed( 123 )
meatpcr <- pcr( fat ~ ., data = trainmeat, validation = 'CV', ncomp = 50 )
meatCV <- RMSEP( meatpcr, estimate = 'CV' )
plot( meatCV, main = '' )
minPCs <- which.min( meatCV$val )
ypred <- predict( meatpcr, testmeat, ncomp = minPCs )
rmse( ypred, testmeat$fat )## [1] 2.28158
This gives a much improved RMSE.
Partial Least Squares
a method for relating a set of input variables to outputs; unlike PCR, the latent components are constructed using the response as well as the predictors
apply PLS using CV to select components:
par(mfrow=c(1,2))
set.seed( 123 )
meatpls <- plsr( fat ~ ., data = meatspec[1:172,], ncomp=50, validation = "CV" )
coefplot( meatpls, ncomp = 4, xlab = "Frequency", main = "" )
plsCV <- RMSEP( meatpls, estimate = 'CV' )
plot( plsCV, main="" )
From the right-hand figure, we see we need around 15 components
see how it performs on the train set:
ypred <- predict( meatpls, ncomp = 15 )
rmse( ypred, trainmeat$fat )## [1] 1.889838
see how it performs on the test set:
ytpred <- predict( meatpls, testmeat, ncomp = 15 )
rmse( ytpred, testmeat$fat )## [1] 1.971828
We did slightly better than PCR.
Ridge Regression
Suppose that the predictors have been centered by their means and scaled by their standard deviations and that the response has been centered.
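Ridge regression shrinks the coefficients by penalizing their squared size (standard formulation, stated here for context): it minimizes \(\sum_i (y_i - x_i^T\beta)^2 + \lambda\sum_j\beta_j^2\), giving \[\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty\] A minimal sketch of how a ridge fit to the meat spectroscopy training data might look using lm.ridge() from MASS; the lambda grid here is purely illustrative:
ridge_mod <- lm.ridge( fat ~ ., trainmeat, lambda = seq( 0, 5e-8, len = 21 ) )
matplot( ridge_mod$lambda, coef( ridge_mod ), type = 'l' )  #coefficient paths as lambda grows
MASS::select( ridge_mod )                                   #HKB, LW and GCV suggestions for lambda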
Lasso Regression
Lasso regression is pretty good when there are lots of predictors. Lasso is particularly good when we believe the effects are sparse; that few of the predictors have an effect.
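The lasso replaces the ridge penalty with an \(\ell_1\) constraint (standard formulation): minimize \(\sum_i (y_i - x_i^T\beta)^2\) subject to \(\sum_j |\beta_j| \le t\). Shrinking \(t\) sets some coefficients exactly to zero, which is why the lasso also performs variable selection.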
trainy <- trainmeat$fat
trainx <- as.matrix( trainmeat[,-101] )
meatlasso <- lars( trainx, trainy )
now the CV
set.seed( 123 )
cvout <- cv.lars( trainx, trainy )
cvout$index[ which.min( cvout$cv ) ]## [1] 0.01010101
testx <- as.matrix( testmeat[,-101] )
predlars <- predict( meatlasso, testx, s=0.0101, mode='fraction')
rmse( testmeat$fat, predlars$fit )## [1] 2.132265
That beats the PCR results, though PLS remains slightly better.
predlars <- predict( meatlasso, s = 0.0101, type = "coef", mode="fraction" )
plot( predlars$coef, type = 'h', ylab="Coefficient" )
sum( predlars$coef != 0 )## [1] 20
Only 20 of the 100 predictors receive nonzero coefficients.