The following are notes from readings in ‘Linear Models with R’ by Julian Faraway for the course DATA621, ‘Business Analytics and Data Mining’, part of the Master of Science in Data Science program at CUNY SPS.
R libraries used:
library( faraway )
library( dplyr )
library( ggplot2 )
library( gridExtra )
library( tidyverse )
library( HistData )
library( broom )
library( gclus )
library( asbio )
library( retistruct )
library( Matching )
library( MASS )
library( ellipse )
library( lmtest )
library( splines )
library( Amelia )
library( simex )
library( ggcorrplot )
library( leaps )
library( pls )
library( lars )
Introduction
“The formulation of a problem is often more essential than its solution”
Formulate the Problem Correctly
- Understand the Physical Background
- Understand the Objective
- Make sure you know what the client wants
- Put the Problem into Statistical Terms
- observational or experimental data?
- Is there Nonresponse?
- Are there Missing Values?
- How are the Data Coded?
- What are the Units of Measurement?
- Beware of data entry errors and other forms of data corruption
Initial Data Analysis
It is critical to do an initial exploration to get a feel for the data: summary statistics, basic visualizations, and an inspection of data quality.
look at some practice data:
data( pima )
head( pima )## pregnant glucose diastolic triceps insulin bmi diabetes age test
## 1 6 148 72 35 0 33.6 0.627 50 1
## 2 1 85 66 29 0 26.6 0.351 31 0
## 3 8 183 64 0 0 23.3 0.672 32 1
## 4 1 89 66 23 94 28.1 0.167 21 0
## 5 0 137 40 35 168 43.1 2.288 33 1
## 6 5 116 74 0 0 25.6 0.201 30 0
summary( pima )## pregnant glucose diastolic triceps
## Min. : 0.000 Min. : 0.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.000 1st Qu.: 99.0 1st Qu.: 62.00 1st Qu.: 0.00
## Median : 3.000 Median :117.0 Median : 72.00 Median :23.00
## Mean : 3.845 Mean :120.9 Mean : 69.11 Mean :20.54
## 3rd Qu.: 6.000 3rd Qu.:140.2 3rd Qu.: 80.00 3rd Qu.:32.00
## Max. :17.000 Max. :199.0 Max. :122.00 Max. :99.00
## insulin bmi diabetes age
## Min. : 0.0 Min. : 0.00 Min. :0.0780 Min. :21.00
## 1st Qu.: 0.0 1st Qu.:27.30 1st Qu.:0.2437 1st Qu.:24.00
## Median : 30.5 Median :32.00 Median :0.3725 Median :29.00
## Mean : 79.8 Mean :31.99 Mean :0.4719 Mean :33.24
## 3rd Qu.:127.2 3rd Qu.:36.60 3rd Qu.:0.6262 3rd Qu.:41.00
## Max. :846.0 Max. :67.10 Max. :2.4200 Max. :81.00
## test
## Min. :0.000
## 1st Qu.:0.000
## Median :0.000
## Mean :0.349
## 3rd Qu.:1.000
## Max. :1.000
Not all missing values are coded as NA. The minimum for diastolic is \(0\); however, that is not a realistic value for a living person, so it is more likely a missing value. Look closer at diastolic:
sort( pima$diastolic )[1:100]## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [26] 0 0 0 0 0 0 0 0 0 0 24 30 30 38 40 44 44 44 44 46 46 48 48 48 48
## [51] 48 50 50 50 50 50 50 50 50 50 50 50 50 50 52 52 52 52 52 52 52 52 52 52 52
## [76] 54 54 54 54 54 54 54 54 54 54 54 55 55 56 56 56 56 56 56 56 56 56 56 56 56
clean these values up for diastolic and the other similar numeric variables
pima_NA <- pima %>%
mutate( diastolic = na_if( diastolic, 0 ),
glucose = na_if( glucose, 0 ),
triceps = na_if( triceps, 0 ),
insulin = na_if( insulin, 0 ),
bmi = na_if( bmi, 0 ) )
glimpse( pima_NA )## Rows: 768
## Columns: 9
## $ pregnant <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, …
## $ glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139…
## $ diastolic <int> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74, 80, 60, 72, …
## $ triceps <int> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA, NA, 23, 19, …
## $ insulin <int> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, NA, NA, 846, 1…
## $ bmi <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, NA, 37…
## $ diabetes <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.15…
## $ age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, …
## $ test <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, …
change the test feature to a categorical with descriptive labels:
pima_NA$test <- factor( pima_NA$test )
levels( pima_NA$test ) <- c( 'negative', 'positive' )
summary( pima_NA$test )## negative positive
## 500 268
Do some basic visualization of some features:
par(mfrow=c(1,3))
pimahist <- hist( pima_NA$diastolic, xlab='Diastolic',main='' )
pimadens <- plot( density( pima_NA$diastolic, na.rm=TRUE ), main="" )
pimasort <- plot( sort( pima_NA$diastolic ), ylab = 'Sorted Diastolic' )
Now try again with ggplot:
pimahist <- ggplot( pima_NA, aes( x = diastolic ) ) +
geom_histogram()
pimadens <- ggplot( pima_NA, aes( x = diastolic ) ) +
geom_density()
sdiastolic <- sort( pima_NA$diastolic )
pima_sort <- data.frame( 'sorted' = sdiastolic )
pimasort <- ggplot( pima_sort, aes( y = sorted, x = 1:length( sdiastolic ) ) ) +
geom_point()
grid.arrange( pimahist, pimadens, pimasort, ncol = 3 )## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 35 rows containing non-finite values (stat_bin).
## Warning: Removed 35 rows containing non-finite values (stat_density).
Visualize some bivariate data
p1 <- ggplot( pima_NA, aes( x = diastolic, y = diabetes ) ) +
geom_point()
p2 <- ggplot( pima_NA, aes( x = test, y = diabetes ) ) +
geom_boxplot()
grid.arrange( p1, p2, ncol = 2 )## Warning: Removed 35 rows containing missing values (geom_point).
ggplot2 is a handy library that is more flexible when visualizing complex dataframes. For instance: controlling color/shape/fill by another variable, or creating faceted plots:
#disambiguating a factor variable:
p1 <- ggplot( pima, aes( x = diastolic, y = diabetes, color = factor( test ) ) ) +
geom_point() +
theme( legend.position = 'top',
legend.direction = 'horizontal' )
p2 <- ggplot( pima, aes( x = diastolic, y = diabetes ) ) +
geom_point( size = 1 ) +
facet_grid( ~ factor( test ) )
grid.arrange(p1, p2, ncol=2,top="Disambiguating a factor variable")
On Visualizations: Good graphics are vital in data analysis. They help you avoid mistakes and suggest the form of the modeling to come. They are also important in communicating your analysis to others. Many in your audience or readership will focus on the graphs. This is your best opportunity to get your message across clearly and without misunderstanding. In some cases, the graphics can be so convincing that the formal analysis becomes just a confirmation of what has already been seen.
When to use Linear Modeling
Linear Modeling is used for explaining or modeling the relationship between a response/outcome/output variable and one or more predictor/input/explanatory variable(s).
Simple Regression: modeling an outcome variable with just 1 explanatory variable.
Multiple/Multivariate Regression: modeling an outcome variable with more than 1 explanatory variable.
Regression Objectives
- Prediction of future or unseen responses given specified values of the predictors
- Assessment of an effect, or relationship between, explanatory variables and the response
History of Regression
describing the libration of the moon:
data( manilius )
glimpse( manilius )## Rows: 27
## Columns: 4
## $ arc <dbl> 13.16667, 13.13333, 13.20000, 14.25000, 14.70000, 13.01667, 14.…
## $ sinang <dbl> 0.8836, 0.9996, 0.9899, 0.2221, 0.0006, 0.9308, 0.0602, -0.1570…
## $ cosang <dbl> -0.4682, -0.0282, 0.1421, 0.9750, 1.0000, -0.3654, 0.9982, 0.98…
## $ group <int> 1, 1, 1, 3, 3, 1, 3, 2, 1, 1, 1, 1, 3, 3, 3, 3, 3, 2, 2, 3, 2, …
The data are divided into three groups based on similarity. Next, compute the sums of the three variables within each group.
moon3 <- manilius %>%
group_by( group ) %>%
summarise( arc_sum = sum( arc ),
sin_sum = sum( sinang ),
cos_sum = sum( cosang ) )
moon3## # A tibble: 3 x 4
## group arc_sum sin_sum cos_sum
## * <int> <dbl> <dbl> <dbl>
## 1 1 118. 8.50 -0.793
## 2 2 140. -6.14 1.74
## 3 3 128. 2.98 7.96
The result is three linear equations in three unknowns to solve:
solve( cbind( 9, moon3$sin_sum, moon3$cos_sum ), moon3$arc_sum )## [1] 14.5445859 -1.4898221 0.1341264
Observe how similar the results are if we fit a linear regression to the original data:
mod <- lm( arc ~ sinang + cosang, manilius )
summary( mod )##
## Call:
## lm(formula = arc ~ sinang + cosang, data = manilius)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.37033 -0.07810 0.03655 0.07960 0.31581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.56162 0.03411 426.846 <2e-16 ***
## sinang -1.50458 0.04013 -37.490 <2e-16 ***
## cosang 0.09137 0.04890 1.868 0.074 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1483 on 24 degrees of freedom
## Multiple R-squared: 0.9836, Adjusted R-squared: 0.9822
## F-statistic: 719 on 2 and 24 DF, p-value: < 2.2e-16
The word regression derives from a term of Sir Francis Galton’s: regression to mediocrity
Important: regression to mediocrity refers to a particular statistical finding and is quite a different concept from that of regression modeling.
Using the heights of parents and their offspring, Sir Francis Galton found that outcomes tend toward the mean of the data. That is, tall parents are more likely to have a child shorter than they are, and short parents are more likely to have a child taller than they are.
Here we explore this data:
data( GaltonFamilies )
ggplot( GaltonFamilies, aes( x = midparentHeight,
y = childHeight ) ) +
geom_point( size = 3 )
find the linear regression line that best fits the distribution:
mod <- lm( childHeight ~ midparentHeight, GaltonFamilies )
coef( mod )## (Intercept) midparentHeight
## 22.6362405 0.6373609
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" )## `geom_smooth()` using formula 'y ~ x'
Let’s add the line that describes a relation where parents have children the same height as they are…
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" ) +
geom_line( aes( y = midparentHeight ), color = 'red' )## `geom_smooth()` using formula 'y ~ x'
Now suppose children's height is fully correlated with their parents' height. We can use the following equation: \[\frac{y-\bar{y}}{SD_y} = r\frac{x-\bar{x}}{SD_x}\] where the correlation \(r = 1\), to find the coefficients of a line describing this:
beta <- with( GaltonFamilies,
sd( childHeight )/ sd( midparentHeight ) )
alpha <- with( GaltonFamilies,
mean( childHeight ) -
beta * mean( midparentHeight ) )
ggplot(GaltonFamilies, aes(x = midparentHeight, y = childHeight)) +
geom_point() +
geom_smooth( method = "lm" ) +
geom_line( aes( y = midparentHeight ), color = 'red' ) +
geom_abline(data=GaltonFamilies, aes(slope=beta, intercept=alpha ), color='green' )## `geom_smooth()` using formula 'y ~ x'
Regression to the Mean: We can see that a child of tall parents is predicted by the least squares line to have a height which is above average but not quite as tall as the parents as the green line would have you believe. Similarly children of below average height parents are predicted to have a height which is still below average but not quite as short as the parents. This is why Galton used the phrase “regression to mediocrity” and the phenomenon is sometimes called the regression effect.
Recommended Exercises:
1.1 The dataset teengamb concerns a study of teenage gambling in Britain. Make a numerical and graphical summary of the data, commenting on any features that you find interesting. Limit the output you present to a quantity that a busy reader would find sufficient to get a basic understanding of the data.
The teengamb dataset:
data( teengamb )
glimpse( teengamb )## Rows: 47
## Columns: 5
## $ sex <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, …
## $ status <int> 51, 28, 37, 28, 65, 61, 28, 27, 43, 18, 18, 43, 30, 28, 38, 38,…
## $ income <dbl> 2.00, 2.50, 2.00, 7.00, 2.00, 3.47, 5.50, 6.42, 2.00, 6.00, 3.0…
## $ verbal <int> 8, 8, 6, 4, 8, 6, 7, 5, 6, 7, 6, 6, 4, 6, 6, 8, 8, 5, 8, 9, 8, …
## $ gamble <dbl> 0.00, 0.00, 0.00, 7.30, 19.60, 0.10, 1.45, 6.60, 1.70, 0.10, 0.…
teengamb_clean <- teengamb %>%
mutate( sex = factor( sex ),
status = factor( status ) )
summary( teengamb_clean )## sex status income verbal gamble
## 0:28 28 : 7 Min. : 0.600 Min. : 1.00 Min. : 0.0
## 1:19 38 : 5 1st Qu.: 2.000 1st Qu.: 6.00 1st Qu.: 1.1
## 51 : 5 Median : 3.250 Median : 7.00 Median : 6.0
## 71 : 5 Mean : 4.642 Mean : 6.66 Mean : 19.3
## 18 : 4 3rd Qu.: 6.210 3rd Qu.: 8.00 3rd Qu.: 19.4
## 43 : 4 Max. :15.000 Max. :10.00 Max. :156.0
## (Other):17
The numerical summary shows the distribution of subjects in each group for the categorical variables sex and status as well as the center/spread of the numerical features: income (pounds per week), verbal, and gamble (gambling expenditure in pounds per year).
and now to visualize the data:
#visualize the numeric features by group
si <- ggplot( teengamb_clean, aes( x = sex, y = income, fill = sex ) ) +
geom_boxplot()
sv <- ggplot( teengamb_clean, aes( x = sex, y = verbal, fill = sex ) ) +
geom_boxplot()
sg <- ggplot( teengamb_clean, aes( x = sex, y = gamble, fill = sex ) ) +
geom_boxplot()
grid.arrange( si, sv, sg, ncol = 3 )
sti <- ggplot( teengamb_clean, aes( x = status, y = income, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
stv <- ggplot( teengamb_clean, aes( x = status, y = verbal, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
stg <- ggplot( teengamb_clean, aes( x = status, y = gamble, fill = status ) ) +
geom_boxplot() + theme(legend.position = "none")
grid.arrange( sti, stv, stg, ncol = 3 )
For sex, both genders have very similar distributions of income and verbal scores. However, one gender has a very different distribution of annual gambling expenditure.
There appears to be a linear relationship between verbal scores and status although for other features, the relationship is less clear. However, without knowing the meaning behind the status designations, it is difficult to tell.
Look at how some variables covary:
ggplot( teengamb_clean, aes( y = gamble, x = income, color = sex ) ) + geom_point( size = 3 ) +
geom_smooth(method = "lm", alpha = .15, aes(fill = sex))## `geom_smooth()` using formula 'y ~ x'
If we plot the amount gambled against weekly income, we see two very different trends for the two genders. Males (0) increase their gambling as a function of weekly income, whereas females (1) do not tend to gamble and show no trend of increasing gambling with increasing income.
Let’s take a look at the coefficients for these regression lines:
teengamb_clean %>% group_by( sex ) %>%
do( gambling = broom::tidy( lm( gamble ~ income, data = .))) %>%
unnest( gambling )## # A tibble: 4 x 6
## sex term estimate std.error statistic p.value
## <fct> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 0 (Intercept) -2.66 8.02 -0.332 0.743
## 2 0 income 6.52 1.25 5.20 0.0000201
## 3 1 (Intercept) 3.14 2.33 1.35 0.195
## 4 1 income 0.175 0.479 0.365 0.719
For every one-GBP increase in weekly income, a male British teen gambles about 6.5 GBP more annually. For females, the increase is a negligible 0.17 GBP.
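An alternative to fitting the two sexes separately is a single model with an income-by-sex interaction. This is a sketch, not part of the text; gamble_int_mod is a name introduced here:
#sketch: one model with an income-by-sex interaction instead of two separate fits
gamble_int_mod <- lm( gamble ~ income * sex, data = teengamb_clean )
#the income coefficient is the slope for the reference level (sex = 0, males);
#the income:sex1 coefficient is the change in slope for females
sumary( gamble_int_mod )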
1.3 The dataset prostate is from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Make a numerical and graphical summary of the data as in the first question.
data( prostate, package = 'faraway' )
glimpse( prostate )## Rows: 97
## Columns: 9
## $ lcavol <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
## $ lweight <dbl> 2.7695, 3.3196, 2.6912, 3.2828, 3.4324, 3.2288, 3.4735, 3.5395…
## $ age <int> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
## $ lbph <dbl> -1.386294, -1.386294, -1.386294, -1.386294, -1.386294, -1.3862…
## $ svi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lcp <dbl> -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1…
## $ gleason <int> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
## $ pgg45 <int> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
## $ lpsa <dbl> -0.43078, -0.16252, -0.16252, -0.16252, 0.37156, 0.76547, 0.76…
let’s learn what the variables are before proceeding:
- lcavol -> log(cancer volume)
- lweight -> log(prostate weight)
- age -> age (years)
- lbph -> log(benign prostatic hyperplasia amount)
- svi -> seminal vesicle invasion
- lcp -> log(capsular penetration)
- gleason -> Gleason score
- pgg45 -> percentage Gleason scores 4 or 5
- lpsa -> log(prostate specific antigen)
observe summaries of each feature:
# svi appears to be binary, so make that adjustment:
prostate_clean <- prostate %>%
mutate( svi = factor( svi ) )
summary( prostate_clean )## lcavol lweight age lbph svi
## Min. :-1.3471 Min. :2.375 Min. :41.00 Min. :-1.3863 0:76
## 1st Qu.: 0.5128 1st Qu.:3.376 1st Qu.:60.00 1st Qu.:-1.3863 1:21
## Median : 1.4469 Median :3.623 Median :65.00 Median : 0.3001
## Mean : 1.3500 Mean :3.653 Mean :63.87 Mean : 0.1004
## 3rd Qu.: 2.1270 3rd Qu.:3.878 3rd Qu.:68.00 3rd Qu.: 1.5581
## Max. : 3.8210 Max. :6.108 Max. :79.00 Max. : 2.3263
## lcp gleason pgg45 lpsa
## Min. :-1.3863 Min. :6.000 Min. : 0.00 Min. :-0.4308
## 1st Qu.:-1.3863 1st Qu.:6.000 1st Qu.: 0.00 1st Qu.: 1.7317
## Median :-0.7985 Median :7.000 Median : 15.00 Median : 2.5915
## Mean :-0.1794 Mean :6.753 Mean : 24.38 Mean : 2.4784
## 3rd Qu.: 1.1786 3rd Qu.:7.000 3rd Qu.: 40.00 3rd Qu.: 3.0564
## Max. : 2.9042 Max. :9.000 Max. :100.00 Max. : 5.5829
generate a matrix plot ordered by degree of correlation:
dta <- prostate_clean %>%
dplyr::select( -c( svi, gleason ) )
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" ) we observe a number of cross-correlations that might need attention for further analysis. For example,
lbph, or the amount of benign prostatic hyperplasia amount is highly correlated with the weight of the prostate, lweight.
the Gleason Score assesses the severity of prostate cancer: A Gleason score of 6 is low grade, 7 is intermediate grade, and a score of 8 to 10 is high grade cancer.
We might be interested in using these features to find out what is predictive of gleason: which data features correlate with the Gleason score?
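As a quick, rough first look at that question (a sketch, not from the text), we can rank the numeric features by their correlation with gleason:
#correlation of every feature with the Gleason score, sorted from most negative to most positive
round( sort( cor( prostate )[ , 'gleason' ] ), 2 )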
1.4 The dataset sat comes from a study entitled “Getting What You Pay For: The Debate Over Equity in Public School Expenditures.” Make a numerical and graphical summary of the data as in the first question.
data( sat )
glimpse( sat )## Rows: 50
## Columns: 7
## $ expend <dbl> 4.405, 8.963, 4.778, 4.459, 4.992, 5.443, 8.817, 7.030, 5.718, …
## $ ratio <dbl> 17.2, 17.6, 19.3, 17.1, 24.0, 18.4, 14.4, 16.6, 19.1, 16.3, 17.…
## $ salary <dbl> 31.144, 47.951, 32.175, 28.934, 41.078, 34.571, 50.045, 39.076,…
## $ takers <int> 8, 47, 27, 6, 45, 29, 81, 68, 48, 65, 57, 15, 13, 58, 5, 9, 11,…
## $ verbal <int> 491, 445, 448, 482, 417, 462, 431, 429, 420, 406, 407, 468, 488…
## $ math <int> 538, 489, 496, 523, 485, 518, 477, 468, 469, 448, 482, 511, 560…
## $ total <int> 1029, 934, 944, 1005, 902, 980, 908, 897, 889, 854, 889, 979, 1…
observe the general properties of each variable
summary( sat )## expend ratio salary takers
## Min. :3.656 Min. :13.80 Min. :25.99 Min. : 4.00
## 1st Qu.:4.882 1st Qu.:15.22 1st Qu.:30.98 1st Qu.: 9.00
## Median :5.768 Median :16.60 Median :33.29 Median :28.00
## Mean :5.905 Mean :16.86 Mean :34.83 Mean :35.24
## 3rd Qu.:6.434 3rd Qu.:17.57 3rd Qu.:38.55 3rd Qu.:63.00
## Max. :9.774 Max. :24.30 Max. :50.05 Max. :81.00
## verbal math total
## Min. :401.0 Min. :443.0 Min. : 844.0
## 1st Qu.:427.2 1st Qu.:474.8 1st Qu.: 897.2
## Median :448.0 Median :497.5 Median : 945.5
## Mean :457.1 Mean :508.8 Mean : 965.9
## 3rd Qu.:490.2 3rd Qu.:539.5 3rd Qu.:1032.0
## Max. :516.0 Max. :592.0 Max. :1107.0
generate a matrix plot ordered by degree of correlation:
dta <- sat
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )Plot the trends for sat scores as a function of expenditure:
sat_scores <- sat %>%
dplyr::select( expend, verbal, math ) %>%
pivot_longer( cols = c( verbal, math ),
names_to = 'testcat',
values_to = 'score' )
ggplot( sat_scores, aes( x = expend, y = score, color = testcat ) ) +
geom_point( size = 2 ) +
ylab( 'score' )
This plot suggests that there is a negative relationship between the expenditures of a state and the average sat scores for both math and verbal.
sat_scores %>% group_by( testcat ) %>%
do( exscor = broom::tidy( lm( score ~ expend, data = .))) %>%
unnest( exscor )## # A tibble: 4 x 6
## testcat term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 math (Intercept) 570. 24.2 23.6 4.88e-28
## 2 math expend -10.3 3.99 -2.58 1.29e- 2
## 3 verbal (Intercept) 520. 20.6 25.2 2.30e-29
## 4 verbal expend -10.6 3.40 -3.11 3.10e- 3
look at the total (combined) scores:
sat_total_lm <- sat %>%
do( totalsat = broom::tidy( lm( total ~ expend, data = . ) ) ) %>%
unnest( totalsat )
sat_total_lm## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 1089. 44.4 24.5 8.17e-29
## 2 expend -20.9 7.33 -2.85 6.41e- 3
Although the negative relationship is significant, this does not take into account an important confounding variable: the percentage of students that take the exam:
ggplot( sat, aes( y = total, x = takers ) ) +
geom_point( size = 2 )
Perform a multiple linear regression that includes the percentage of people taking the sat:
satmod <- lm( total ~ expend + takers, sat )
summary( satmod )##
## Call:
## lm(formula = total ~ expend + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.400 -22.884 1.968 19.142 68.755
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 993.8317 21.8332 45.519 < 2e-16 ***
## expend 12.2865 4.2243 2.909 0.00553 **
## takers -2.8509 0.2151 -13.253 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.46 on 47 degrees of freedom
## Multiple R-squared: 0.8195, Adjusted R-squared: 0.8118
## F-statistic: 106.7 on 2 and 47 DF, p-value: < 2.2e-16
There is a dramatic improvement in the model fit when percentage of sat test takers is included. However, there is definitely room for improving on this model. For instance, the ‘U’ shaped envelope of the relationship between total score and percent test takers hints at a nonlinearity. Additionally, there are more data features in the set that could be explored.
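One way to probe the suspected nonlinearity, offered only as a sketch (satmod_quad is a name introduced here, not from the text), is to add a quadratic term for takers:
#sketch: allow a curved relationship between total score and the percentage of test takers
satmod_quad <- lm( total ~ expend + takers + I( takers^2 ), data = sat )
sumary( satmod_quad )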
1.5 The dataset ‘divusa’ contains data on divorces in the United States from 1920 to 1996. Make a numerical and graphical summary of the data as in the first question.
data( divusa )
glimpse( divusa )## Rows: 77
## Columns: 7
## $ year <int> 1920, 1921, 1922, 1923, 1924, 1925, 1926, 1927, 1928, 1929,…
## $ divorce <dbl> 8.0, 7.2, 6.6, 7.1, 7.2, 7.2, 7.5, 7.8, 7.8, 8.0, 7.5, 7.1,…
## $ unemployed <dbl> 5.2, 11.7, 6.7, 2.4, 5.0, 3.2, 1.8, 3.3, 4.2, 3.2, 8.7, 15.…
## $ femlab <dbl> 22.70, 22.79, 22.88, 22.97, 23.06, 23.15, 23.24, 23.33, 23.…
## $ marriage <dbl> 92.0, 83.0, 79.7, 85.2, 80.3, 79.2, 78.7, 77.0, 74.1, 75.5,…
## $ birth <dbl> 117.9, 119.8, 111.2, 110.5, 110.9, 106.6, 102.6, 99.8, 93.8…
## $ military <dbl> 3.2247, 3.5614, 2.4553, 2.2065, 2.2889, 2.1735, 2.1073, 2.0…
inspect the summary stats:
summary( divusa )## year divorce unemployed femlab
## Min. :1920 Min. : 6.10 Min. : 1.200 Min. :22.70
## 1st Qu.:1939 1st Qu.: 8.70 1st Qu.: 4.200 1st Qu.:27.47
## Median :1958 Median :10.60 Median : 5.600 Median :37.10
## Mean :1958 Mean :13.27 Mean : 7.173 Mean :38.58
## 3rd Qu.:1977 3rd Qu.:20.30 3rd Qu.: 7.500 3rd Qu.:47.80
## Max. :1996 Max. :22.80 Max. :24.900 Max. :59.30
## marriage birth military
## Min. : 49.70 Min. : 65.30 Min. : 1.940
## 1st Qu.: 61.90 1st Qu.: 68.90 1st Qu.: 3.469
## Median : 74.10 Median : 85.90 Median : 9.102
## Mean : 72.97 Mean : 88.89 Mean :12.365
## 3rd Qu.: 80.00 3rd Qu.:107.30 3rd Qu.:14.266
## Max. :118.10 Max. :122.90 Max. :86.641
Use facet_wrap() to look at how each of the features varies with time (year):
divusa_long <- divusa %>%
pivot_longer( cols = c( divorce, unemployed, femlab, marriage, birth, military ), names_to = 'cat', values_to = 'val' )
ggplot( aes( x = year, y = val ), data = divusa_long ) +
geom_line() +
#geom_point() +
facet_wrap( ~ cat )
There are a few trends we see that occur over time:
- the increase in percentage of women in the workplace
- spikes in military/femlab/divorce/marriage coinciding with WW2
- a peak in unemployment corresponding with the Great Depression
and now to view pairwise comparisons:
dta <- divusa
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" ) There are several relationships that look interesting to explore further:
- the percentage of women in the workplace has a positive correlation with the divorce rate and a negative relationship with the marriage rate.
- the birth rate has a similar relationship: positively correlated with marriage and negatively correlated to divorce
- Not immediately clear from the previous figure: divorce is positively correlated with time whereas marriage is negatively correlated with time
Estimation
Linear Model
Suppose we want to model a response \(Y\) by three feature variables \(X_1\), \(X_2\), and \(X_3\). A general form to represent this is: \[Y = f(X_1,X_2,X_3) + \epsilon\] Typically, we will never know the true \(f\). For linear models, we make the assumption that the parameters enter linearly in predicting the response: \[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \epsilon\] Linear models seem restrictive, but predictors can be transformed and combined so that linear models can be applied in many situations. ‘Linear model’ implies simplicity; in practice, however, linear models and the data they describe may be quite complex.
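To illustrate that transformed predictors still give a linear model (linear in the \(\beta\)s), here is a minimal sketch with simulated data; the variable names and the true coefficients are made up for the example:
#simulate a response that is curved in x but linear in the parameters
set.seed( 1 )
x <- runif( 100, 1, 10 )
y <- 1 + 2 * log( x ) + 0.5 * x^2 + rnorm( 100 )
#still a linear model: lm() estimates the betas for the transformed terms
curved_mod <- lm( y ~ log( x ) + I( x^2 ) )
coef( curved_mod )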
Where do linear models come from?
- Physical Theory may suggest a model (Hooke’s Law)
- Experience with past data
- No prior data exists (linear model used as a starting point)
A good model is like a map that guides us to our destination
Estimating
We would like to choose \(\beta\) such that the systematic component of our model explains as much of the response as possible, where:
| Data = | Systematic Structure | + | Random Variation |
|---|---|---|---|
| n dimensions = | p dimensions | + | (n-p) dimensions |
The difference between the actual response and the predicted response is the residual. The conceptual purpose of the linear model is to represent, as accurately as possible, complex n-dimensional data with something comparatively simple: a model with just p terms/coefficients.
Least Squares Estimation
The random variation in the residuals lies in the n-p dimensional space; the aim of fitting a model is to minimize the residuals.
n-p = degrees of freedom
RSS = Residual Sum of Squares
estimate of \(\sigma^2\): \(\hat{\sigma}^2 = \frac{RSS}{n-p}\)
An Example from the Galapagos
data( gala )
glimpse( gala )## Rows: 30
## Columns: 7
## $ Species <dbl> 58, 31, 3, 25, 2, 18, 24, 10, 8, 2, 97, 93, 58, 5, 40, 347, …
## $ Endemics <dbl> 23, 21, 3, 9, 1, 11, 0, 7, 4, 2, 26, 35, 17, 4, 19, 89, 23, …
## $ Area <dbl> 25.09, 1.24, 0.21, 0.10, 0.05, 0.34, 0.08, 2.33, 0.03, 0.18,…
## $ Elevation <dbl> 346, 109, 114, 46, 77, 119, 93, 168, 71, 112, 198, 1494, 49,…
## $ Nearest <dbl> 0.6, 0.6, 2.8, 1.9, 1.9, 8.0, 6.0, 34.1, 0.4, 2.6, 1.1, 4.3,…
## $ Scruz <dbl> 0.6, 26.3, 58.7, 47.4, 1.9, 8.0, 12.0, 290.2, 0.4, 50.2, 88.…
## $ Adjacent <dbl> 1.84, 572.33, 0.78, 0.18, 903.82, 1.84, 0.34, 2.85, 17.95, 0…
fit a linear model using lm()
gala_mod <- lm( Species ~ Area + Elevation + Nearest + Scruz + Adjacent, data = gala )
gala_mod_sum <- summary( gala_mod )
gala_mod_sum ##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
the abridged output alternative from the faraway library:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
Let’s directly calculate some quantities of interest:
#extract the feature matrix and the response variable:
fmat <- model.matrix( ~ Area + Elevation + Nearest + Scruz + Adjacent, gala )
r <- gala$Species
construct \((X^TX)^{-1}\):
xtxi <- solve( t( fmat ) %*% fmat )
get \(\hat{\beta}\) by calculating \((X^TX)^{-1}X^Ty\):
xtxi %*% t( fmat ) %*% r## [,1]
## (Intercept) 7.068220709
## Area -0.023938338
## Elevation 0.319464761
## Nearest 0.009143961
## Scruz -0.240524230
## Adjacent -0.074804832
A more efficient and accurate way to get \(\hat{\beta}\):
solve( crossprod( fmat, fmat ), crossprod( fmat, r ) )## [,1]
## (Intercept) 7.068220709
## Area -0.023938338
## Elevation 0.319464761
## Nearest 0.009143961
## Scruz -0.240524230
## Adjacent -0.074804832
compare with the lm() results:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
We can extract a lot of information from the lm structure:
residuals( gala_mod ) #extract the residuals## Baltra Bartolome Caldwell Champion Coamano Daphne.Major
## -58.725946 38.273154 -26.330659 14.635734 38.383916 -25.087705
## Daphne.Minor Darwin Eden Enderby Espanola Fernandina
## -9.919668 19.018992 -20.314202 -28.785943 49.343513 -3.989598
## Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
## 62.033276 -59.633796 40.497176 -39.403558 -37.694540 -2.037233
## Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
## -111.679486 -42.475375 -23.075807 -5.553122 73.048122 -40.676318
## SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
## 182.583587 -23.376486 89.383371 -5.805095 -36.935732 -5.700573
fitted( gala_mod ) #extract the model predictions## Baltra Bartolome Caldwell Champion Coamano Daphne.Major
## 116.7259460 -7.2731544 29.3306594 10.3642660 -36.3839155 43.0877052
## Daphne.Minor Darwin Eden Enderby Espanola Fernandina
## 33.9196678 -9.0189919 28.3142017 30.7859425 47.6564865 96.9895982
## Gardner1 Gardner2 Genovesa Isabela Marchena Onslow
## -4.0332759 64.6337956 -0.4971756 386.4035578 88.6945404 4.0372328
## Pinta Pinzon Las.Plazas Rabida SanCristobal SanSalvador
## 215.6794862 150.4753750 35.0758066 75.5531221 206.9518779 277.6763183
## SantaCruz SantaFe SantaMaria Seymour Tortuga Wolf
## 261.4164131 85.3764857 195.6166286 49.8050946 52.9357316 26.7005735
df.residual( gala_mod ) #extract the degrees of freedom## [1] 24
deviance( gala_mod ) #gives the RSS## [1] 89231.37
coef( gala_mod ) #returns the coefficient estimates## (Intercept) Area Elevation Nearest Scruz Adjacent
## 7.068220709 -0.023938338 0.319464761 0.009143961 -0.240524230 -0.074804832
estimate \(\sigma\):
sqrt( deviance( gala_mod ) / df.residual( gala_mod ) )## [1] 60.97519
gala_mod_sum$sigma## [1] 60.97519
compute the standard error for the coefficients
xtxi <- gala_mod_sum$cov.unscaled
SE <- sqrt( diag( xtxi ) ) * gala_mod_sum$sigma
SE## (Intercept) Area Elevation Nearest Scruz Adjacent
## 19.15419782 0.02242235 0.05366280 1.05413595 0.21540225 0.01770019
or extract the SE
gala_mod_sum$coef[,2]## (Intercept) Area Elevation Nearest Scruz Adjacent
## 19.15419782 0.02242235 0.05366280 1.05413595 0.21540225 0.01770019
There are great reasons to use Least Squares:
- It makes sense geometrically (makes an orthogonal projection onto model space)
- If the errors are independent and identically normally distributed, it is the maximum likelihood estimate
- The Gauss-Markov theorem states that \(\hat{\beta}\) is the best linear unbiased estimate (BLUE)
However, there are some situations where Least Squares might not be the best:
- When errors are correlated or have unequal variance
- When the error distribution is long-tailed
- When the predictors are collinear
Goodness of Fit
a measure of how well the model fits the data.
\(R^2\) or the coefficient of determination or the percentage of variance explained: \[R^2 = 1 - \frac{\mbox{RSS}}{\mbox{Total SS(Corrected for Mean)}} = \mbox{cor}^2 (\hat{y},y)\] It is a mistake to rely on \(R^2\) as the sole measure of goodness of fit.
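As a check on the formula, here is a sketch computing \(R^2\) for the Galapagos model (gala_mod, fit in the example above) both ways; the two values should agree:
#R^2 from the definition: 1 - RSS/TSS, and as the squared correlation of fitted vs. observed values
rss_gala <- deviance( gala_mod )
tss_gala <- sum( ( gala$Species - mean( gala$Species ) )^2 )
1 - rss_gala / tss_gala
cor( fitted( gala_mod ), gala$Species )^2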
op <- par(mfrow=c(1,4),mar=c (0,0,2,3), oma = c(5, 4.2, 0, 0))
with(anscombe, plot(x1, y1, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x1, y1),2)))))); abline(3,0.5)
with(anscombe, plot(x2, y2, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x2, y2),2)))))); abline(3,0.5)
with(anscombe, plot(x3, y3, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x3, y3),2)))))); abline(3,0.5)
with(anscombe, plot(x4, y4, xlab = "", ylab = "", main = bquote(paste(italic(r),
" = ",.(round(cor(x4, y4),2)))))); abline(3,0.5)
mtext(expression(italic(y[1])),side=1, outer = TRUE, line = 3)
mtext(expression(italic(y[2])),side=2, outer = TRUE, line = 2.6)
mtext("(a)",side=3, at = -42, line = .5)
mtext("(b)",side=3, at = -26, line = .5)
mtext("(c)",side=3, at = -10.3, line = .5)
mtext("(d)",side=3, at = 5.5, line = .5)par(op)
# }All 4 of these distributions result in the same lm slope with the same \(R^2\) value.
\(\hat{\sigma}\) is another measure returned by the lm() summary that is in the same units as the response variable.
Identifiability
Unidentifiability occurs when \(X\) is not of full rank, e.g., when some columns are linear combinations of one another.
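A minimal sketch of what this looks like in practice (the redundant predictor here is invented for illustration): adding a column that is an exact linear combination of others makes \(X\) rank-deficient, and lm() returns NA for the aliased coefficient.
#I( Area + Elevation ) is an exact linear combination of the other two predictors
gala_redundant <- lm( Species ~ Area + Elevation + I( Area + Elevation ), data = gala )
coef( gala_redundant ) #the aliased term is reported as NA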
Orthogonality
data( odor )
glimpse( odor )## Rows: 15
## Columns: 4
## $ odor <dbl> 66, 39, 43, 49, 58, 17, -5, -40, 65, 7, 43, -22, -31, -35, -26
## $ temp <dbl> -1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0
## $ gas <dbl> -1, -1, 1, 1, 0, 0, 0, 0, -1, 1, -1, 1, 0, 0, 0
## $ pack <dbl> 0, 0, 0, 0, -1, -1, 1, 1, -1, -1, 1, 1, 0, 0, 0
compute the covariance:
cov( odor[,-1] )## temp gas pack
## temp 0.5714286 0.0000000 0.0000000
## gas 0.0000000 0.5714286 0.0000000
## pack 0.0000000 0.0000000 0.5714286
odor_mod <- lm( odor ~ temp + gas + pack, odor )
summary( odor_mod, cor = T )##
## Call:
## lm(formula = odor ~ temp + gas + pack, data = odor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.200 -17.138 1.175 20.300 62.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.200 9.298 1.635 0.130
## temp -12.125 12.732 -0.952 0.361
## gas -17.000 12.732 -1.335 0.209
## pack -21.375 12.732 -1.679 0.121
##
## Residual standard error: 36.01 on 11 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.1519
## F-statistic: 1.836 on 3 and 11 DF, p-value: 0.1989
##
## Correlation of Coefficients:
## (Intercept) temp gas
## temp 0.00
## gas 0.00 0.00
## pack 0.00 0.00 0.00
The correlations between the coefficient estimates are all 0: these feature variables are orthogonal, so each estimate is unaffected by the presence of the others.
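Because the design is orthogonal, dropping one predictor should leave the remaining coefficient estimates unchanged (their standard errors will change, since \(\hat{\sigma}\) and the degrees of freedom change). A quick sketch to verify, using the odor data and sumary() as above:
#refit without temp; the gas and pack estimates should match the full model
odor_mod_nt <- lm( odor ~ gas + pack, data = odor )
sumary( odor_mod_nt )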
Exercises
2.4
The dataset prostate comes from a study on 97 men with prostate cancer who were due to receive a radical prostatectomy. Fit a model with lpsa as the response and lcavol as the predictor. Record the residual standard error and the \(R^2\) . Now add lweight, svi, lbph, age, lcp, pgg45 and gleason to the model one at a time. For each model record the residual standard error and the \(R^2\). Plot the trends in these two statistics.
data( prostate, package = 'faraway' )
glimpse( prostate )## Rows: 97
## Columns: 9
## $ lcavol <dbl> -0.5798185, -0.9942523, -0.5108256, -1.2039728, 0.7514161, -1.…
## $ lweight <dbl> 2.7695, 3.3196, 2.6912, 3.2828, 3.4324, 3.2288, 3.4735, 3.5395…
## $ age <int> 50, 58, 74, 58, 62, 50, 64, 58, 47, 63, 65, 63, 63, 67, 57, 66…
## $ lbph <dbl> -1.386294, -1.386294, -1.386294, -1.386294, -1.386294, -1.3862…
## $ svi <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ lcp <dbl> -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1.38629, -1…
## $ gleason <int> 6, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 7, 7, 6, 7, 6, 6, 6, 6,…
## $ pgg45 <int> 0, 0, 20, 0, 0, 0, 0, 0, 0, 0, 0, 0, 30, 5, 5, 0, 30, 0, 0, 0,…
## $ lpsa <dbl> -0.43078, -0.16252, -0.16252, -0.16252, 0.37156, 0.76547, 0.76…
Fit the models and keep track of the residual standard error and \(R^2\):
formula <- ''
formulas <- c()
res_std_err <- c()
r_sqrd <- c()
feature_order <- c( 'lcavol', 'lweight', 'svi', 'lbph', 'age', 'lcp', 'pgg45', 'gleason' )
for (i in 1:length( feature_order )) {
if (i == 1) {
formula <- paste0('lpsa', " ~ ", feature_order[i] )
}
else {
formula <- paste0(formula, " + ", feature_order[i] )
}
mod_sum <- summary( lm( formula, data = prostate ) )
res_std_err[[i]] <- mod_sum$sigma
r_sqrd[[i]] <- mod_sum$r.squared
#print( formula )
}
mods_df <- data.frame( 'n_terms' = seq( 1,8 ),
'sigma' = unlist(res_std_err, recursive = FALSE),
'r_sqrd' = unlist(r_sqrd, recursive = FALSE) )
glimpse( mods_df )## Rows: 8
## Columns: 3
## $ n_terms <int> 1, 2, 3, 4, 5, 6, 7, 8
## $ sigma <dbl> 0.7874994, 0.7506469, 0.7168094, 0.7108232, 0.7073054, 0.71021…
## $ r_sqrd <dbl> 0.5394319, 0.5859345, 0.6264403, 0.6366035, 0.6441024, 0.64511…
visualize the trends of the summary statistics:
sigma_plot <- ggplot( mods_df, aes( x = n_terms, y = sigma ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_line() +
xlab( 'Number of features added to lm()' ) +
ylab( 'sigma' ) +
theme_classic()
r_sqrd_plot <- ggplot( mods_df, aes( x = n_terms, y = r_sqrd ) ) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_line() +
xlab( 'Number of features added to lm()' ) +
ylab( 'R^2' ) +
theme_classic()
grid.arrange( sigma_plot, r_sqrd_plot, ncol = 2 )
2.5
Using the prostate data, plot lpsa against lcavol. Fit the regressions of lpsa on lcavol and lcavol on lpsa. Display both regression lines on the plot. At what point do the two lines intersect?
m <- lm(lcavol ~ lpsa, prostate)
m2 <- lm(lpsa ~ lcavol, prostate)
plot1 <- ggplot(prostate, aes(lcavol, lpsa)) +
geom_point(fill = NA, shape = 21, alpha = 0.5, size = 3) +
geom_line(aes(x = predict(m), color = "lcavol ~ lpsa")) +
geom_line(aes(y = predict(m2), color = "lpsa ~ lcavol")) +
theme_classic()
plot1
find the intersection of the two lines:
m_aug <- augment( m )
m2_aug <- augment( m2 )
P3 <- c( m2_aug$lcavol[1], m2_aug$.fitted[1] )
P4 <- c( m2_aug$lcavol[length( m2_aug$lcavol )], m2_aug$.fitted[length( m2_aug$lcavol )] )
P1 <- c( m_aug$.fitted[1], m_aug$lpsa[1] )
P2 <- c( m_aug$.fitted[length( m_aug$lcavol )], m_aug$lpsa[length( m_aug$lcavol )] )
intersect <- line.line.intersection(P1, P2, P3, P4, interior.only = TRUE)
int_df <- data.frame( 'X' = intersect[1], 'Y' = intersect[2] )
int_df## X Y
## 1 1.35001 2.478387
visually confirm the intersection
coords <- paste( '(', round(int_df$X,2), ',', round(int_df$Y,2), ')' )
plot1 + annotate(geom="point", x=int_df$X, y=int_df$Y,
color="red") +
annotate(geom="text", x=2, y=2, label=coords,
color="red") +
ggtitle( 'intersection of lm() model fits' )
Inference
Hypothesis Tests to Compare Models
F-statistic: a statistic that compares the residual sum of squares of a larger model \(\Omega\) with that of a smaller (null) model \(\omega\).
\[F = \frac{ \frac{(RSS_{\omega}-RSS_{\Omega})}{p-q}}{\frac{ RSS_{\Omega}}{n-p}} \sim F_{p-q,n-p}\] We would reject the null hypothesis if \(F\gt F_{p-q,n-p}^{(\alpha)}\), where \(p\) is the number of parameters in \(\Omega\) and \(q\) is the number of parameters in \(\omega\). The degrees of freedom of a model are typically the number of observations minus the number of parameters, so F can also be written as: \[F = \frac{ \frac{(RSS_{\omega}-RSS_{\Omega})}{df_{\omega}-df_{\Omega}}}{\frac{ RSS_{\Omega}}{df_{\Omega}}}\]
Are any of the Feature Variables Useful in Predicting the Response?
We do not know whether all predictors are required to predict a response or just some. The F-statistic can help determine this.
Demonstrate this by revisiting the data from the Galapagos:
gala_mod_sum##
## Call:
## lm(formula = Species ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala)
##
## Residuals:
## Min 1Q Median 3Q Max
## -111.679 -34.898 -7.862 33.460 182.584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.369 0.715351
## Area -0.023938 0.022422 -1.068 0.296318
## Elevation 0.319465 0.053663 5.953 3.82e-06 ***
## Nearest 0.009144 1.054136 0.009 0.993151
## Scruz -0.240524 0.215402 -1.117 0.275208
## Adjacent -0.074805 0.017700 -4.226 0.000297 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.98 on 24 degrees of freedom
## Multiple R-squared: 0.7658, Adjusted R-squared: 0.7171
## F-statistic: 15.7 on 5 and 24 DF, p-value: 6.838e-07
model the Null Hypothesis and perform an ANOVA:
gala_nullmod <- lm( Species ~ 1, gala )
anova( gala_nullmod, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ 1
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 29 381081
## 2 24 89231 5 291850 15.699 6.838e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Directly computing the F-statistic and its p-value:
rss0 <- deviance( gala_nullmod )
rss <- deviance( gala_mod )
df0 <- df.residual( gala_nullmod )
df <- df.residual( gala_mod )
fstat <- ((rss0-rss)/(df0-df)/(rss/df))
fstat## [1] 15.69941
1 - pf( fstat, df0-df, df )## [1] 6.837893e-07
Can 1 particular feature be dropped from a model?
Let \(\Omega\) be the full model.
Let \(\omega\) be the full model minus the suspect feature.
Test whether the feature can be dropped by looking at the F-statistic. The null hypothesis is that \(H_0 : \beta_i = 0\)
gala_mod_noArea <- lm( Species ~ Elevation + Nearest + Scruz + Adjacent, gala )
anova( gala_mod_noArea, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ Elevation + Nearest + Scruz + Adjacent
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 25 93469
## 2 24 89231 1 4237.7 1.1398 0.2963
The p-value of 0.296 indicates that the null hypothesis cannot be rejected, so Area can be dropped from the model.
Testing a pair of predictors
Fit a model without the pair & build the F-test:
gala_mod_noAreas <- lm( Species ~ Elevation + Nearest + Scruz, gala )
anova( gala_mod_noAreas, gala_mod )## Analysis of Variance Table
##
## Model 1: Species ~ Elevation + Nearest + Scruz
## Model 2: Species ~ Area + Elevation + Nearest + Scruz + Adjacent
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 26 158292
## 2 24 89231 2 69060 9.2874 0.00103 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Here, the F-test is statistically significant, suggesting that we can reject the null hypothesis. The rejection indicates a difference in the amount of residual error accounted for by the two models; therefore, removing the pair of feature variables cannot be justified.
Testing a Subspace
- Can we combine two features? Use the I() ('as-is') function to test linear combinations of feature variables.
- Can a feature's coefficient be set to a particular value? Use the offset() function. (Sketches of both ideas follow this list.)
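Hedged sketches of both ideas using the Galapagos model above; combining Area with Adjacent and fixing the Elevation coefficient at 0.5 are illustrative choices, not values from the text:
#test whether Area and Adjacent can be replaced by their sum
gala_mod_comb <- lm( Species ~ I( Area + Adjacent ) + Elevation + Nearest + Scruz, data = gala )
anova( gala_mod_comb, gala_mod )
#test whether the Elevation coefficient can be fixed at 0.5 using offset()
gala_mod_fix <- lm( Species ~ Area + offset( 0.5 * Elevation ) + Nearest + Scruz + Adjacent, data = gala )
anova( gala_mod_fix, gala_mod )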
What can’t we use an F-test to evaluate?
- Non-linear relationships. ex: \(H_0 : \beta_j \beta_k = 1\)
- Comparisons where the null model is not nested in the full model (the null model needs to lie in the feature subspace of the full model)
Permutation Tests: when you don’t want to assume normality
Permutation test: what is the chance of an F-statistic larger than the one we observed? We could compute this exactly by computing the F-statistic for every permutation of the response variable and seeing what proportion exceed the observed value.
gala_mod_2 <- lm( Species ~ Nearest + Scruz, gala )
gala_mod_2_sum <- summary( gala_mod_2 )
gala_mod_2_sum$fstatistic## value numdf dendf
## 0.6019558 2.0000000 27.0000000
1 - pf( gala_mod_2_sum$fstatistic[1],
gala_mod_2_sum$fstatistic[2],
gala_mod_2_sum$fstatistic[3])## value
## 0.5549255
#anova( gala_mod_noAreas, gala_mod )
Use the sample() function to permute the response and repeat this 4000 times:
nreps <- 4000
fstats <- numeric( nreps )
for (i in 1:nreps ) {
lmsum <- summary( lm( sample( Species ) ~ Nearest + Scruz, gala ) )
fstats[i] <- lmsum$fstat[1]
}
mean( fstats > gala_mod_2_sum$fstatistic[1] )## [1] 0.5585
Sampling
We regard our data as a simple random sample drawn from a finite population.
Confidence Intervals for \(\beta\)
consider our galapagos model:
sumary( gala_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.068221 19.154198 0.3690 0.7153508
## Area -0.023938 0.022422 -1.0676 0.2963180
## Elevation 0.319465 0.053663 5.9532 3.823e-06
## Nearest 0.009144 1.054136 0.0087 0.9931506
## Scruz -0.240524 0.215402 -1.1166 0.2752082
## Adjacent -0.074805 0.017700 -4.2262 0.0002971
##
## n = 30, p = 6, Residual SE = 60.97519, R-Squared = 0.77
construct 95% CI for \(\beta_{area}\)
tdist <- qt( 0.975, 30-6 ) #t-dist w/ 30-6 degrees of freedom
c( -0.023938 - tdist * 0.022422, -0.023938 + tdist * 0.022422)## [1] -0.07021473 0.02233873
CIs have a duality with the 2-sided hypothesis tests: if the interval contains zero, this indicates that we cannot reject the null hypothesis. \(\beta_{area}\)’s CI contains 0, therefore we cannot reject the null hypothesis.
c( -0.074805 - tdist * 0.0177, -0.074805 + tdist * 0.0177)## [1] -0.111336 -0.038274
\(\beta_{adjacent}\)’s CI does not, so we can reject the null hypothesis.
We can quickly find the CI for all feature variables in the model:
confint( gala_mod )## 2.5 % 97.5 %
## (Intercept) -32.4641006 46.60054205
## Area -0.0702158 0.02233912
## Elevation 0.2087102 0.43021935
## Nearest -2.1664857 2.18477363
## Scruz -0.6850926 0.20404416
## Adjacent -0.1113362 -0.03827344
Confidence Regions:
plot( ellipse( gala_mod, c( 2,6 ) ), type = 'l', ylim = c(-0.13,0 ) ) #Confidence Ellipse
points( coef( gala_mod )[2], coef( gala_mod )[6], pch=19 ) #least squares estimate
abline( v = confint( gala_mod)[2,], lty = 2 ) #confidence intervals
abline( h = confint( gala_mod)[6,], lty = 2 )
We can reject the null hypothesis because the origin does not lie within the ellipse.
Bootstrapped Confidence Intervals
F & t based CIs are convenient, however, they rely on the assumption of normality. Boostrapped CI’s are good workaround.
Simulation
Sample from a known error distribution and compute the estimate. Repeat these steps many times and estimate the sampling distribution of \(\hat{\beta}\) using the empirical distribution of the generated/simulated \(\hat{\beta}\) (a sketch follows the list):
- Generate \(\epsilon\) from the known error distribution
- Form \(y = X\beta + \epsilon\) from the known \(\beta\) and fixed \(X\).
- Compute \(\hat{\beta}\).
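A minimal sketch of the simulation idea, treating the fitted Galapagos model as if it were the truth and assuming normal errors with standard deviation \(\hat{\sigma}\) (both assumptions are purely for illustration):
#simulate new responses from the fitted model and refit to get simulated coefficient estimates
nsim <- 1000
simbetas <- matrix( NA, nsim, length( coef( gala_mod ) ) )
colnames( simbetas ) <- names( coef( gala_mod ) )
for ( i in 1:nsim ) {
  ysim <- fitted( gala_mod ) + rnorm( nrow( gala ), sd = gala_mod_sum$sigma )
  simbetas[ i, ] <- coef( update( gala_mod, ysim ~ . ) )
}
#empirical 95% intervals for each coefficient
apply( simbetas, 2, quantile, c( 0.025, 0.975 ) )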
Bootstrap
Instead of sampling from the true error distribution (which we do not know), sample with replacement from the observed residuals.
Bootstrapping mirrors simulation, but works with the data on hand without making assumptions about its distribution:
- Generate \(\epsilon^*\) by sampling with replacement from \(\hat{\epsilon_1}, \ldots,\hat{\epsilon_n}\)
- Form \(y^* = X\hat{\beta} + \epsilon^*\)
- Compute \(\hat{\beta}^*\) from \((X,y^*)\)
num_boots <- 4000
coefmat <- matrix( NA, num_boots, 6 )
resids <- residuals( gala_mod )
preds <- fitted( gala_mod )
for ( i in 1:num_boots ) {
booty <- preds + sample( resids, rep = TRUE )
bmod <- update( gala_mod, booty ~ . )
coefmat[ i, ] <- coef( bmod )
}
colnames( coefmat ) <- c( "Intercept", colnames( gala[, 3:7 ] ) )
coefdf <- data.frame( coefmat )
CIs <- apply( coefdf, 2, function( x ) quantile( x, c( 0.025, 0.975 ) ) )
CIs## Intercept Area Elevation Nearest Scruz Adjacent
## 2.5% -24.13428 -0.06362569 0.2303649 -1.705391 -0.6025777 -0.10519819
## 97.5% 41.65439 0.02119342 0.4202839 2.114584 0.1750448 -0.04002093
let’s remind ourselves of the CIs computed with normal distribution theory:
confint( gala_mod )## 2.5 % 97.5 %
## (Intercept) -32.4641006 46.60054205
## Area -0.0702158 0.02233912
## Elevation 0.2087102 0.43021935
## Nearest -2.1664857 2.18477363
## Scruz -0.6850926 0.20404416
## Adjacent -0.1113362 -0.03827344
the CIs are not identical but are comparable.
Now take a look at some of the bootstrapped distributions and the CIs:
CIs## Intercept Area Elevation Nearest Scruz Adjacent
## 2.5% -24.13428 -0.06362569 0.2303649 -1.705391 -0.6025777 -0.10519819
## 97.5% 41.65439 0.02119342 0.4202839 2.114584 0.1750448 -0.04002093
pA <- ggplot( coefdf, aes( x = Area ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,2 ], CIs[ 2,2 ] ), lty = 2 )
pE <- ggplot( coefdf, aes( x = Elevation ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,3 ], CIs[ 2,3 ] ), lty = 2 )
pN <- ggplot( coefdf, aes( x = Nearest ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,4 ], CIs[ 2,4 ] ), lty = 2 )
pS <- ggplot( coefdf, aes( x = Scruz ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,5 ], CIs[ 2,5 ] ), lty = 2 )
pAd <- ggplot( coefdf, aes( x = Adjacent ) ) + geom_density() +
geom_vline( xintercept = c( CIs[ 1,6 ], CIs[ 2,6 ] ), lty = 2 )
grid.arrange( pA, pE, pN, pS, pAd )
The distributions are all roughly normal. However, this is not always the case.
Exercise 3.4
Using sat data:
- Fit a model with total sat score as the response and expend, ratio and salary as predictors. Test the hypothesis that \(\beta_{salary}\) = 0. Test the hypothesis that \(\beta_{salary}\) = \(\beta_{ratio}\) = \(\beta_{expend}\) = 0. Do any of these predictors have an effect on the response?
sat_sub <- sat %>% dplyr::select( c( total, expend, ratio, salary, takers ) )
dta <- sat_sub
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )sat_mod <- lm( total ~ expend + ratio + salary, data = sat )
sumary( sat_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1069.2342 110.9249 9.6393 1.292e-12
## expend 16.4689 22.0499 0.7469 0.45893
## ratio 6.3303 6.5421 0.9676 0.33829
## salary -8.8226 4.6968 -1.8784 0.06667
##
## n = 50, p = 4, Residual SE = 68.65350, R-Squared = 0.21
confint( sat_mod )## 2.5 % 97.5 %
## (Intercept) 845.95385 1292.5144911
## expend -27.91528 60.8530112
## ratio -6.83820 19.4987339
## salary -18.27679 0.6315232
We cannot reject the null hypotheses: these predictors all include 0 within their confidence intervals and do not have a statistically significant effect on the response.
- Now add takers to the model. Test the hypothesis that \(\beta_{takers}\) = 0. Compare this model to the previous one using an F-test. Demonstrate that the F-test and t-test here are equivalent.
sat_mod2 <- lm( total ~ expend + ratio + salary + takers, data = sat )
sum2 <- sumary( sat_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.97154 52.86976 19.7839 < 2.2e-16
## expend 4.46259 10.54653 0.4231 0.6742
## ratio -3.62423 3.21542 -1.1271 0.2657
## salary 1.63792 2.38725 0.6861 0.4962
## takers -2.90448 0.23126 -12.5594 2.607e-16
##
## n = 50, p = 5, Residual SE = 32.70199, R-Squared = 0.82
sum2##
## Call:
## lm(formula = total ~ expend + ratio + salary + takers, data = sat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -90.531 -20.855 -1.746 15.979 66.571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.9715 52.8698 19.784 < 2e-16 ***
## expend 4.4626 10.5465 0.423 0.674
## ratio -3.6242 3.2154 -1.127 0.266
## salary 1.6379 2.3872 0.686 0.496
## takers -2.9045 0.2313 -12.559 2.61e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 32.7 on 45 degrees of freedom
## Multiple R-squared: 0.8246, Adjusted R-squared: 0.809
## F-statistic: 52.88 on 4 and 45 DF, p-value: < 2.2e-16
confint( sat_mod2)## 2.5 % 97.5 %
## (Intercept) 939.486374 1152.456698
## expend -16.779204 25.704393
## ratio -10.100417 2.851952
## salary -3.170247 6.446081
## takers -3.370262 -2.438699
We can reject the null hypothesis that \(\beta_{takers} = 0\), because \(\beta_{takers}\)’s CI does not span 0.
sat_anova <- anova( sat_mod, sat_mod2 )
sat_anova## Analysis of Variance Table
##
## Model 1: total ~ expend + ratio + salary
## Model 2: total ~ expend + ratio + salary + takers
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 46 216812
## 2 45 48124 1 168688 157.74 2.607e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value indicates that we can reject the null hypothesis and conclude that adding takers to the model leads to a statistically significant improvement in the model's ability to describe the variance in the data. Additionally, we can see that the F-test from the ANOVA is equivalent to the t-test for takers in the summary of the linear model:
sat_anova$`Pr(>F)`[2] == sum2$coefficients[5,4]## [1] TRUE
Prediction
Prediction is one of the main uses for regression models. However, a point estimate is not the end of the story. It is as important to consider the uncertainty of the estimate so that we may understand the range of expected outcomes. Projections are not useful without a realistic estimate of uncertainty because we need to make sensible plans for when events do not turn out as well as predicted.
Confidence Intervals for Prediction
- Prediction of a Mean Response: Confidence Interval
- \(\hat{y}_0 \pm t_{n-p}^{\frac{\alpha}{2}}\hat{\sigma} \sqrt{x_0^T(X^TX)^{-1}x_0}\)
- There is a 95% chance that the mean value falls within this CI
- Prediction of a Future Observations: Prediction Interval
- \(\hat{y}_0 \pm t_{n-p}^{\frac{\alpha}{2}}\hat{\sigma} \sqrt{1 + x_0^T(X^TX)^{-1}x_0}\)
- There is a 95% chance that the future value falls within this PI
- Extrapolation: Predicting an estimate for values that lie outside of the given data range
Example: Body Fat
load body fat data:
data( fat )
#glimpse( fat )
fatmod <- lm( brozek ~ age + weight + height +
neck + chest + abdom + hip + thigh +
knee + ankle + biceps + forearm +
wrist, data = fat )
summary( fatmod )##
## Call:
## lm(formula = brozek ~ age + weight + height + neck + chest +
## abdom + hip + thigh + knee + ankle + biceps + forearm + wrist,
## data = fat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.264 -2.572 -0.097 2.898 9.327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.29255 16.06992 -0.952 0.34225
## age 0.05679 0.02996 1.895 0.05929 .
## weight -0.08031 0.04958 -1.620 0.10660
## height -0.06460 0.08893 -0.726 0.46830
## neck -0.43754 0.21533 -2.032 0.04327 *
## chest -0.02360 0.09184 -0.257 0.79740
## abdom 0.88543 0.08008 11.057 < 2e-16 ***
## hip -0.19842 0.13516 -1.468 0.14341
## thigh 0.23190 0.13372 1.734 0.08418 .
## knee -0.01168 0.22414 -0.052 0.95850
## ankle 0.16354 0.20514 0.797 0.42614
## biceps 0.15280 0.15851 0.964 0.33605
## forearm 0.43049 0.18445 2.334 0.02044 *
## wrist -1.47654 0.49552 -2.980 0.00318 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.988 on 238 degrees of freedom
## Multiple R-squared: 0.749, Adjusted R-squared: 0.7353
## F-statistic: 54.63 on 13 and 238 DF, p-value: < 2.2e-16
Consider a ‘typical’ man. Below we compute the median of each feature (x0) and also the mean (meanMan); the predictions that follow use the mean values.
x <- model.matrix( fatmod )
x0 <- apply( x, 2, median )
x0## (Intercept) age weight height neck chest
## 1.00 43.00 176.50 70.00 38.00 99.65
## abdom hip thigh knee ankle biceps
## 90.95 99.30 59.00 38.50 22.80 32.05
## forearm wrist
## 28.70 18.30
fat_select <- fat %>%
dplyr::select( c( age, weight, height, neck, chest, abdom, hip, thigh, knee, ankle, biceps, forearm, wrist ) )
meanMan <- colMeans( fat_select )
meanMan## age weight height neck chest abdom hip thigh
## 44.88492 178.92440 70.14881 37.99206 100.82421 92.55595 99.90476 59.40595
## knee ankle biceps forearm wrist
## 38.59048 23.10238 32.27341 28.66389 18.22976
if we are predicting the expected value and PI for an individual with mean features:
predict( fatmod, newdata = data.frame( t( meanMan ) ), interval = 'prediction' )## fit lwr upr
## 1 18.93849 11.06669 26.8103
if we are predicting the mean body fat for all men that have mean characteristics
predict( fatmod, newdata = data.frame( t( meanMan ) ), interval = 'confidence' )## fit lwr upr
## 1 18.93849 18.4436 19.43339
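As a sanity check, the interval formulas above can be computed by hand; a minimal sketch reusing fatmod and meanMan (the helper objects x0m, xtxi, fit0, se_m and se_p are new names):
x0m  <- c( 1, meanMan )                              # prepend 1 for the intercept
xtxi <- solve( crossprod( model.matrix( fatmod ) ) ) # (X'X)^-1
fit0 <- sum( x0m * coef( fatmod ) )
tval <- qt( 0.975, df.residual( fatmod ) )
sig  <- summary( fatmod )$sigma
se_m <- sig * sqrt( t( x0m ) %*% xtxi %*% x0m )      # SE of the mean response
se_p <- sig * sqrt( 1 + t( x0m ) %*% xtxi %*% x0m )  # SE of a future observation
c( fit0 - tval * se_m, fit0 + tval * se_m )          # should match interval = 'confidence'
c( fit0 - tval * se_p, fit0 + tval * se_p )          # should match interval = 'prediction'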
Extrapolation
the further the predictor values are from the original data, the more uncertainty there is in the estimates.
Compare the intervals calculated below with those for the mean ‘typical’ body metrics:
extremeFat <- apply( x, 2, function( x ) quantile( x, 0.95 ) )
predict( fatmod, newdata = data.frame( t( extremeFat ) ), interval = 'prediction' )## fit lwr upr
## 1 30.01804 21.92407 38.11202
predict( fatmod, newdata = data.frame( t( extremeFat ) ), interval = 'confidence' )## fit lwr upr
## 1 30.01804 28.07072 31.96537
Autoregression
Consider the trends in this data on monthly airline passengers:
data( airpass, package = 'faraway' )
AirlinePlot <- ggplot( airpass, aes( x = year, y = pass ) ) +
geom_line() +
ylab( 'Log(Passengers)' ) +
theme_classic() +
ggtitle( 'Airline Passengers', subtitle = 'with an obviously inappropriate linear regression fit')
AirlinePlot + geom_smooth( method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
The linear fit above captures the increase in passengers over the years, but does nothing to capture the seasonal variation in the data.
Autoregressive model: the response depends on past values of the response, so built into the model are feature variables that describe previous responses. For example, suppose we would like to predict next month's ridership. Autoregressive ‘lagged’ variables can be incorporated into the model to describe the number of passengers in previous months:
\[y_t = \beta_0 + \beta_1y_{t-1} + \beta_{12}y_{t-12} + \beta_{13}y_{t-13} + \epsilon_t \]
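The embed() call below builds the lagged design matrix; a toy illustration (not the airline data) of how it orders the lags:
embed( 1:6, 3 )   # a 4 x 3 matrix: row 1 is c( 3, 2, 1 ), row 2 is c( 4, 3, 2 ), ...
                  # column 1 holds the current value; later columns hold earlier lags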
lagdf <- embed( log( airpass$pass ), 14 )
colnames( lagdf ) <- c( 'y', paste0( 'lag',1:13 ) )
lagdf <- data.frame( lagdf )
#build our linear model:
armod <- lm( y ~ lag1 + lag12 + lag13, lagdf )
summary( armod )##
## Call:
## lm(formula = y ~ lag1 + lag12 + lag13, data = lagdf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.111173 -0.024849 -0.002443 0.025373 0.134024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.13848 0.05361 2.583 0.0109 *
## lag1 0.69231 0.06186 11.192 < 2e-16 ***
## lag12 0.92152 0.03473 26.532 < 2e-16 ***
## lag13 -0.63214 0.06768 -9.340 4.16e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04164 on 127 degrees of freedom
## Multiple R-squared: 0.9892, Adjusted R-squared: 0.989
## F-statistic: 3892 on 3 and 127 DF, p-value: < 2.2e-16
visualize the autoregressive model fit:
airpass_lag <- airpass[14:144,]
airpass_lag[ 'predict' ] <- exp( predict( armod ) )
AirlinePlot +
geom_line( aes( x = year, y = predict, color = 'red' ), data = airpass_lag, show.legend = FALSE )predicting future values: use the last observation in the data as the lag one value:
#last value of the data
last_row_df <- lagdf[ nrow( lagdf ), ]
# the current response variable ('y') will become lag1 and everything else shifts over 1:
new_data <- data.frame( lag1 = last_row_df$y, lag12 = last_row_df$lag11, lag13 = last_row_df$lag12 )
#prediction:
next_month_res <- predict( armod, newdata = new_data, interval = 'prediction' )
paste( 'The Autoregression Mod predicts', round( exp( next_month_res[1] ),0),
'passengers with a 95% prediction interval from', round( exp( next_month_res[2] ),0),
'to', round( exp( next_month_res[3] ),0))## [1] "The Autoregression Mod predicts 448 passengers with a 95% prediction interval from 412 to 487"
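Forecasts further ahead can be rolled forward by feeding each prediction back in as lag1 and shifting the other lags accordingly (a sketch reusing armod and last_row_df; next_log1 and next_log2 are new names):
next_log1 <- predict( armod, newdata = data.frame( lag1  = last_row_df$y,
                                                   lag12 = last_row_df$lag11,
                                                   lag13 = last_row_df$lag12 ) )
next_log2 <- predict( armod, newdata = data.frame( lag1  = next_log1,
                                                   lag12 = last_row_df$lag10,
                                                   lag13 = last_row_df$lag11 ) )
exp( c( next_log1, next_log2 ) )   # point forecasts one and two months ahead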
What can go wrong with predictions?
- Bad Models: If it doesn’t fit, don’t apply it.
- Quantitative Extrapolation: We try to predict outcomes for cases with predictor values much different from what we saw in the data
- Qualitative Extrapolation: We try to predict outcomes for observations that come from different populations
- Overconfidence due to overfitting: the model is tuned to the training data and is not robust to new data
- Black swans: might not have enough data to properly describe the distribution
Exercise 4.5
#the full model
#fatmod
sumary( fatmod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -15.292549 16.069921 -0.9516 0.342252
## age 0.056786 0.029965 1.8951 0.059290
## weight -0.080310 0.049581 -1.6198 0.106602
## height -0.064600 0.088930 -0.7264 0.468299
## neck -0.437541 0.215334 -2.0319 0.043273
## chest -0.023603 0.091839 -0.2570 0.797396
## abdom 0.885429 0.080077 11.0572 < 2.2e-16
## hip -0.198419 0.135156 -1.4681 0.143406
## thigh 0.231895 0.133718 1.7342 0.084175
## knee -0.011677 0.224143 -0.0521 0.958496
## ankle 0.163536 0.205143 0.7972 0.426142
## biceps 0.152799 0.158513 0.9640 0.336048
## forearm 0.430489 0.184452 2.3339 0.020436
## wrist -1.476537 0.495519 -2.9798 0.003183
##
## n = 252, p = 14, Residual SE = 3.98797, R-Squared = 0.75
#short model
fatmod_4 <- lm( brozek ~ age + weight + height + abdom, fat )
sumary( fatmod_4 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -32.7696359 6.5419024 -5.0092 1.042e-06
## age -0.0070513 0.0243416 -0.2897 0.7723
## weight -0.1237218 0.0250455 -4.9399 1.442e-06
## height -0.1166940 0.0827269 -1.4106 0.1596
## abdom 0.8897041 0.0672672 13.2264 < 2.2e-16
##
## n = 252, p = 5, Residual SE = 4.12607, R-Squared = 0.72
- Is it justifiable to use the smaller model?
anova( fatmod, fatmod_4 )## Analysis of Variance Table
##
## Model 1: brozek ~ age + weight + height + neck + chest + abdom + hip +
## thigh + knee + ankle + biceps + forearm + wrist
## Model 2: brozek ~ age + weight + height + abdom
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 238 3785.1
## 2 247 4205.0 -9 -419.9 2.9336 0.002558 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
There is a statistically significant difference between the amount of variance accounted for by the two models. The smaller model does not describe the data as well.
- Compute a 95% prediction interval for median predictor values and compare to the results to the interval for the full model. Do the intervals differ by a practically important amount?
medvals <- model.matrix( fatmod_4 )
medvals0 <- apply( medvals, 2, median )
#95% prediction interval for the median predictor values
predict( fatmod_4, newdata = data.frame( t( medvals0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.84028 9.696631 25.98392
#CI for the full model
predict( fatmod, newdata = data.frame( t( x0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.49322 9.61783 25.36861
No, the intervals don’t seem to differ by an appreciable amount. So, in moving to the smaller, simpler model, we don’t seem to lose a lot of predictive power.
- For the smaller model, examine all the observations from case numbers 25 to 50. Which two observations seem particularly anomalous?
fatmod_4_aug <- augment( fatmod_4 )
fatmod_4_aug_sub <- fatmod_4_aug[25:50,]
#fatmod_4_aug_sub
ggplot( aes( x = .fitted ), data = fatmod_4_aug_sub ) + geom_dotplot()## `stat_bindot()` using `bins = 30`. Pick better value with `binwidth`.
#plot the model predictors for observation 25:50
fat[ 25:50, ] %>%
dplyr::select( c( age, weight, height, abdom ) ) %>%
pivot_longer( cols = everything(), names_to = 'predictor', values_to = 'val' ) %>%
ggplot( aes( x = predictor, y = val ) ) +
geom_boxplot() +
ggtitle( 'Observations 25:50: Predictive Features' )outliers <- function(dataframe){
dataframe %>%
dplyr::select_if(is.numeric) %>%
map(~ boxplot.stats(.x)$out)
}
fatsub <- fat[25:50,]
fat4 <- fatsub %>%
dplyr::select( c( age, weight, height, abdom ) )
fat4out <- outliers( fat4 )
fat4out## $age
## integer(0)
##
## $weight
## [1] 363.15
##
## $height
## [1] 29.5
##
## $abdom
## [1] 148.1
from the output above, find the observation indices. Note that 148.1 is an abdom value belonging to the same record as the 363.15 weight, so only two observations (39 and 42) are anomalous:
#observations 39 and 42 (the abdom outlier 148.1 also belongs to 39)
fat %>% filter( weight == c( 363.15 ) )## brozek siri density age weight height adipos free neck chest abdom hip
## 39 33.8 35.2 1.0202 46 363.15 72.25 48.9 240.5 51.2 136.2 148.1 147.7
## thigh knee ankle biceps forearm wrist
## 39 87.3 49.1 29.6 45 29 21.4
fat %>% filter( height == c( 29.5 ) )## brozek siri density age weight height adipos free neck chest abdom hip
## 42 31.7 32.9 1.025 44 205 29.5 29.9 140.1 36.6 106 104.3 115.5
## thigh knee ankle biceps forearm wrist
## 42 70.6 42.5 23.7 33.6 28.7 17.4
fat %>% filter( weight == c( 148.1 ) )## [1] brozek siri density age weight height adipos free neck
## [10] chest abdom hip thigh knee ankle biceps forearm wrist
## <0 rows> (or 0-length row.names)
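A more direct way to recover the row indices is to match the boxplot outlier values back against the data with which() (a sketch using fat4out from above):
which( fat$weight %in% fat4out$weight )   # should point to observation 39
which( fat$height %in% fat4out$height )   # should point to observation 42
which( fat$abdom  %in% fat4out$abdom )    # 148.1 is an abdom value, again observation 39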
- Recompute the 95% prediction interval for median predictor values after these two anomalous cases have been excluded from the data. Did this make much difference to the outcome?
remove the two offending records: 39 & 42
drop <- c( 39, 42 )
fat_remove2 <- fat[ -drop, ]
fatmod_remove2_4 <- lm( brozek ~ age + weight + height + abdom, fat_remove2 )
medvals2 <- model.matrix( fatmod_remove2_4 )
medvals02 <- apply( medvals2, 2, median )
#95% prediction interval for the median predictor values
predict( fatmod_remove2_4, newdata = data.frame( t( medvals02 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.9033 9.887851 25.91874
predict( fatmod_4, newdata = data.frame( t( medvals0 ) ), interval = 'prediction' )## fit lwr upr
## 1 17.84028 9.696631 25.98392
The prediction interval narrows only slightly when the two anomalous observations are removed, so excluding them makes little practical difference to this prediction.
Explanation
Explanation concerns the relationships between variables. It can include causal conclusions, but these require more rigor to establish.
Simple Meaning
A simple interpretation of model coefficients: a unit increase in \(x_1\), with the other predictors held constant, will produce a change of \(\hat{\beta_1}\) in the response variable \(y\). This is conceptually difficult: in reality there is usually no way to hold the other variables constant while changing \(x_1\), and it is unlikely that the model captures all of the variables contributing to the system. Furthermore, this interpretation carries no notion of causation.
Establishing Causality
causality is determined by experimentally manipulating a variable. Consider a drug trial test with a placebo group and a test group. We can describe the causal effect for subject \(i\) as: \[\delta_i = y_i^{\mbox{test}}-y_i^{\mbox{control}}\]
Designed Experiments
Randomization is key to success
- Randomization is the only reliable way to make sure that the two groups are not unbalanced in some way that favors one experimental group.
- Randomization can be used to validate assumptions. (e.g. permutation tests used to test the significance of differences between groups)
On Generalization: The results of randomized experiments apply to the subjects of the experiment unless we can reasonably claim that these subjects are representative of a larger population.
On Blocking: There may be ways that subjects differ from one another that are identifiable and that can be worked into the experiment. (e.g. Gender).
Observational Data
Cannot always collect data in a designed experiment. Consider this dataset from the Democratic primary of 2008:
data( newhamp )
glimpse( newhamp )## Rows: 276
## Columns: 12
## $ votesys <fct> D, D, D, H, D, D, D, D, D, D, D, D, D, D, D, D, H, H, H, D,…
## $ Obama <int> 371, 345, 375, 92, 668, 284, 195, 216, 203, 187, 159, 209, …
## $ Clinton <int> 362, 333, 570, 89, 595, 273, 205, 184, 193, 188, 214, 233, …
## $ dem <int> 979, 913, 1305, 268, 1633, 761, 530, 532, 517, 518, 494, 57…
## $ povrate <dbl> 0.0653, 0.0380, 0.0428, 0.0669, 0.0332, 0.0586, 0.0886, 0.0…
## $ pci <int> 25940, 19773, 19986, 25627, 32667, 23163, 19540, 19540, 195…
## $ Dean <dbl> 0.27820, 0.24398, 0.20096, 0.28495, 0.24937, 0.30503, 0.209…
## $ Kerry <dbl> 0.32030, 0.36747, 0.41627, 0.33333, 0.37781, 0.39341, 0.365…
## $ white <dbl> 0.98312, 0.97349, 0.96739, 0.97892, 0.97986, 0.98301, 0.961…
## $ absentee <dbl> 0.059857, 0.050449, 0.043649, 0.107356, 0.074706, 0.053191,…
## $ population <dbl> 4693.0, 4266.0, 7006.0, 1033.0, 7033.0, 3222.0, 2792.5, 279…
## $ pObama <dbl> 0.3789581, 0.3778751, 0.2873563, 0.3432836, 0.4090631, 0.37…
look at the difference in counts for Digital vs Hand cast ballots for Obama vs Clinton
newhamp %>% group_by( votesys ) %>%
summarise( 'Obama' = sum( Obama ),
'Clinton' = sum( Clinton ) ) %>%
mutate( 'Obama_gt_percent' = (Obama - Clinton)/(Obama + Clinton)*100 )## # A tibble: 2 x 4
## votesys Obama Clinton Obama_gt_percent
## * <fct> <int> <int> <dbl>
## 1 D 86353 96890 -5.75
## 2 H 16926 14471 7.82
We can see that for digital ballots, Clinton received ~6% more votes than Obama whereas for paper ballots Obama received about 8% more of the vote.
Is this difference in votes based on voting system significant?
Here we fit a linear model to proportion of votes for Obama ~ Hand voting.
First we create a dummy variable for whether the record is for hand votes:
newhamp_trt <- newhamp %>%
mutate( treatment = case_when( votesys == 'H' ~ 1,
votesys == 'D' ~ 0 ) )
newhamp_trt %>% do( tidy( lm( pObama ~ treatment, . ) ) )## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.353 0.00517 68.1 7.76e-174
## 2 treatment 0.0425 0.00851 4.99 1.06e- 6
The p-value for the treatment group suggests that there is a significant difference between the proportion of vote for Obama given the voting system used. However, were there other variables involved? Did the voting system have some causal effect on the outcome?
In other words, is there a confounding variable?
glimpse( newhamp )## Rows: 276
## Columns: 12
## $ votesys <fct> D, D, D, H, D, D, D, D, D, D, D, D, D, D, D, D, H, H, H, D,…
## $ Obama <int> 371, 345, 375, 92, 668, 284, 195, 216, 203, 187, 159, 209, …
## $ Clinton <int> 362, 333, 570, 89, 595, 273, 205, 184, 193, 188, 214, 233, …
## $ dem <int> 979, 913, 1305, 268, 1633, 761, 530, 532, 517, 518, 494, 57…
## $ povrate <dbl> 0.0653, 0.0380, 0.0428, 0.0669, 0.0332, 0.0586, 0.0886, 0.0…
## $ pci <int> 25940, 19773, 19986, 25627, 32667, 23163, 19540, 19540, 195…
## $ Dean <dbl> 0.27820, 0.24398, 0.20096, 0.28495, 0.24937, 0.30503, 0.209…
## $ Kerry <dbl> 0.32030, 0.36747, 0.41627, 0.33333, 0.37781, 0.39341, 0.365…
## $ white <dbl> 0.98312, 0.97349, 0.96739, 0.97892, 0.97986, 0.98301, 0.961…
## $ absentee <dbl> 0.059857, 0.050449, 0.043649, 0.107356, 0.074706, 0.053191,…
## $ population <dbl> 4693.0, 4266.0, 7006.0, 1033.0, 7033.0, 3222.0, 2792.5, 279…
## $ pObama <dbl> 0.3789581, 0.3778751, 0.2873563, 0.3432836, 0.4090631, 0.37…
newhamp_trt %>%
do( tidy( lm( pObama ~ treatment + Dean + Kerry + povrate + white, . ) ) )## # A tibble: 6 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 0.596 0.132 4.53 8.97e- 6
## 2 treatment -0.00683 0.00806 -0.848 3.97e- 1
## 3 Dean 0.392 0.0614 6.38 7.54e-10
## 4 Kerry -0.269 0.0858 -3.14 1.90e- 3
## 5 povrate -0.223 0.0947 -2.35 1.94e- 2
## 6 white -0.231 0.129 -1.80 7.36e- 2
If we include the other variables, we see that our treatment variable is no longer statistically significant for predicting the proportion of votes for Obama.
Matching
Matching constructs pairs whose two members are as alike as possible with respect to the confounders.
we will use the Matching library for this:
set.seed( 123 )
matches <- GenMatch( newhamp_trt$treatment, newhamp_trt$Dean,
ties = FALSE, caliper = 0.05, pop.size = 1000 )## Loading required namespace: rgenoud
##
##
## Fri Mar 26 20:28:54 2021
## Domains:
## 0.000000e+00 <= X1 <= 1.000000e+03
##
## Data Type: Floating Point
## Operators (code number, name, population)
## (1) Cloning........................... 122
## (2) Uniform Mutation.................. 125
## (3) Boundary Mutation................. 125
## (4) Non-Uniform Mutation.............. 125
## (5) Polytope Crossover................ 125
## (6) Simple Crossover.................. 126
## (7) Whole Non-Uniform Mutation........ 125
## (8) Heuristic Crossover............... 126
## (9) Local-Minimum Crossover........... 0
##
## SOFT Maximum Number of Generations: 100
## Maximum Nonchanging Generations: 4
## Population size : 1000
## Convergence Tolerance: 1.000000e-03
##
## Not Using the BFGS Derivative Based Optimizer on the Best Individual Each Generation.
## Not Checking Gradients before Stopping.
## Using Out of Bounds Individuals.
##
## Maximization Problem.
## GENERATION: 0 (initializing the population)
## Lexical Fit..... 2.127495e-01 9.987660e-01
## #unique......... 1000, #Total UniqueCount: 1000
## var 1:
## best............ 1.000000e+00
## mean............ 5.006683e+02
## variance........ 8.388390e+04
##
## GENERATION: 1
## Lexical Fit..... 6.264505e-01 9.987660e-01
## #unique......... 604, #Total UniqueCount: 1604
## var 1:
## best............ 2.255547e-02
## mean............ 2.280858e+02
## variance........ 7.904885e+04
##
## GENERATION: 2
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 602, #Total UniqueCount: 2206
## var 1:
## best............ 2.702493e-02
## mean............ 7.389477e+01
## variance........ 3.118565e+04
##
## GENERATION: 3
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 604, #Total UniqueCount: 2810
## var 1:
## best............ 2.702493e-02
## mean............ 7.042592e+01
## variance........ 2.858671e+04
##
## GENERATION: 4
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 415, #Total UniqueCount: 3225
## var 1:
## best............ 2.702493e-02
## mean............ 6.569943e+01
## variance........ 2.806356e+04
##
## GENERATION: 5
## Lexical Fit..... 9.910633e-01 9.987660e-01
## #unique......... 425, #Total UniqueCount: 3650
## var 1:
## best............ 2.702493e-02
## mean............ 8.035088e+01
## variance........ 3.395814e+04
##
## GENERATION: 6
## Lexical Fit..... 9.947238e-01 9.987660e-01
## #unique......... 405, #Total UniqueCount: 4055
## var 1:
## best............ 2.604222e-03
## mean............ 6.399581e+01
## variance........ 2.788729e+04
##
## GENERATION: 7
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 595, #Total UniqueCount: 4650
## var 1:
## best............ 2.408582e-03
## mean............ 7.120129e+01
## variance........ 3.239014e+04
##
## GENERATION: 8
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 596, #Total UniqueCount: 5246
## var 1:
## best............ 2.408582e-03
## mean............ 7.760128e+01
## variance........ 3.490217e+04
##
## GENERATION: 9
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 403, #Total UniqueCount: 5649
## var 1:
## best............ 2.408582e-03
## mean............ 6.762662e+01
## variance........ 2.926614e+04
##
## GENERATION: 10
## Lexical Fit..... 9.970770e-01 9.987660e-01
## #unique......... 429, #Total UniqueCount: 6078
## var 1:
## best............ 2.408582e-03
## mean............ 7.201190e+01
## variance........ 2.786938e+04
##
## GENERATION: 11
## Lexical Fit..... 9.987660e-01 9.990051e-01
## #unique......... 423, #Total UniqueCount: 6501
## var 1:
## best............ 2.336515e-03
## mean............ 6.935256e+01
## variance........ 3.131574e+04
##
## GENERATION: 12
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 612, #Total UniqueCount: 7113
## var 1:
## best............ 2.364199e-03
## mean............ 6.593385e+01
## variance........ 2.785792e+04
##
## GENERATION: 13
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 592, #Total UniqueCount: 7705
## var 1:
## best............ 2.364199e-03
## mean............ 6.576213e+01
## variance........ 2.685578e+04
##
## GENERATION: 14
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 417, #Total UniqueCount: 8122
## var 1:
## best............ 2.364199e-03
## mean............ 6.042472e+01
## variance........ 2.583306e+04
##
## GENERATION: 15
## Lexical Fit..... 9.987660e-01 9.990882e-01
## #unique......... 423, #Total UniqueCount: 8545
## var 1:
## best............ 2.364199e-03
## mean............ 6.067741e+01
## variance........ 2.497988e+04
##
## GENERATION: 16
## Lexical Fit..... 9.987660e-01 9.993674e-01
## #unique......... 437, #Total UniqueCount: 8982
## var 1:
## best............ 7.518945e-04
## mean............ 6.819509e+01
## variance........ 2.980902e+04
##
## 'wait.generations' limit reached.
## No significant improvement in 4 generations.
##
## Solution Lexical Fitness Value:
## 9.987660e-01 9.993674e-01
##
## Parameters at the Solution:
##
## X[ 1] : 7.518945e-04
##
## Solution Found Generation 11
## Number of Generations Run 16
##
## Fri Mar 26 20:29:09 2021
## Total run time : 0 hours 0 minutes and 15 seconds
head( matches$matches[ ,1:2 ] )## [,1] [,2]
## [1,] 4 213
## [2,] 17 148
## [3,] 18 6
## [4,] 19 83
## [5,] 21 246
## [6,] 22 185
The output above shows some of the matched pairs: each pairs one ward that voted by hand/paper with one that voted digitally, matched on their proportion of votes for Howard Dean in the previous election.
Show all the matches when plotted pObama ~ Dean
plot( pObama ~ Dean, newhamp_trt, pch = treatment + 1 )
with( newhamp_trt, segments( Dean[ matches$matches[ ,1 ] ],
pObama[ matches$matches[ ,1 ] ],
Dean[ matches$matches[ ,2 ] ],
pObama[ matches$matches[ ,2 ] ] ) ) From the figure above, we can see that the matches are reasonably well made for values of
Dean
Is there any indication of a bias for voting system within the matched pairs?
Here we compute the difference between pairs
pdiff <- newhamp_trt$pObama[ matches$matches[ ,1 ] ] -
newhamp_trt$pObama[ matches$matches[ ,2 ] ]
t.test( pdiff )##
## One Sample t-test
##
## data: pdiff
## t = -1.8798, df = 86, p-value = 0.06352
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## -0.0328560272 0.0009183504
## sample estimates:
## mean of x
## -0.01596884
plot( pdiff ~ newhamp_trt$Dean[ matches$matches[ ,1 ] ],
xlab = 'Dean', ylab = 'Hand - Digital' )
abline( h = 0 ) The matched pairs show no clear preference for hand vs digital voting. The observed difference appears to arise because voters inclined to pick Obama are also more likely to live in hand-voting wards.
Covariate Adjustment, or Controlling for the covariate
Compare univariate and covariate outcomes for digital vs hand voting systems
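The intercepts and slopes hard-coded in the abline() calls below presumably come from a covariate-adjusted fit on Dean; a sketch of the model they appear to correspond to (covmod is a new name):
covmod <- lm( pObama ~ Dean + treatment, newhamp_trt )
coef( covmod )   # intercept and Dean slope used below; the treatment coefficient
                 # gives the small vertical shift between the two fitted lines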
plot( pObama ~ Dean, newhamp_trt, pch = treatment + 1 )
abline( h = c( 0.353, 0.353 + 0.042 ), lty = 1:2 )
abline( 0.221, 0.5229 )
abline( 0.221 - 0.005, 0.5229, lty = 2 )
with( newhamp_trt, segments( Dean[ matches$matches[ ,1 ] ],
pObama[ matches$matches[ ,1 ] ],
Dean[ matches$matches[ ,2 ] ],
pObama[ matches$matches[ ,2 ] ] ) )Qualitative Support for Causation
- Strength - large effects
- Consistency - similar effects found in other studies
- Specificity - the association is specific to the treatment and outcome of interest
- Temporality - the cause precedes the effect
- Gradient - a dose-response (roughly linear) relationship
- Plausibility - does it make sense?
- Experiment - supporting experimental evidence, where available
Exercise 5.2
Use the odor dataset with odor as the response and temp as a predictor. Consider all possible models that also include all, some or none of the other two predictors. Report the coefficient for temperature, its standard error, t-statistic and p-value in each case. Discuss what stays the same, what changes and why. Which model is best?
data( odor )
glimpse( odor )## Rows: 15
## Columns: 4
## $ odor <dbl> 66, 39, 43, 49, 58, 17, -5, -40, 65, 7, 43, -22, -31, -35, -26
## $ temp <dbl> -1, 1, -1, 1, -1, 1, -1, 1, 0, 0, 0, 0, 0, 0, 0
## $ gas <dbl> -1, -1, 1, 1, 0, 0, 0, 0, -1, 1, -1, 1, 0, 0, 0
## $ pack <dbl> 0, 0, 0, 0, -1, -1, 1, 1, -1, -1, 1, 1, 0, 0, 0
dta <- odor
dta.r <- abs(cor(dta)) # get correlations
dta.col <- dmat.color(dta.r) # get colors
# reorder variables so those with highest correlation
# are closest to the diagonal
dta.o <- order.single(dta.r)
cpairs(dta, dta.o, panel.colors=dta.col, gap=.5,
main="Matrix plot of features" )mod_stem <- 'odor ~ temp'
mod_set <- c( '+ gas', '+ pack', '+ gas + pack' )
mod <- c()
mod_tempc <- c()
mod_se <- c()
mod_t <- c()
mod_p <- c()
for ( m in 1:length( mod_set ) ) {
mod_str <- paste( mod_stem, mod_set[ m ] )
mod[ m ] <- mod_str
modfitsum <- summary( lm( mod_str, odor ) )
mod_tempc[ m ] <- modfitsum$coefficients[ 2,1 ]
mod_se[ m ] <- modfitsum$coefficients[ 2,2 ]
mod_t[ m ] <- modfitsum$coefficients[ 2,3 ]
mod_p[ m ] <- modfitsum$coefficients[ 2,4 ]
}
modfitsum##
## Call:
## lm(formula = mod_str, data = odor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -50.200 -17.138 1.175 20.300 62.925
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.200 9.298 1.635 0.130
## temp -12.125 12.732 -0.952 0.361
## gas -17.000 12.732 -1.335 0.209
## pack -21.375 12.732 -1.679 0.121
##
## Residual standard error: 36.01 on 11 degrees of freedom
## Multiple R-squared: 0.3337, Adjusted R-squared: 0.1519
## F-statistic: 1.836 on 3 and 11 DF, p-value: 0.1989
odor_mods_df <- data.frame( 'mod' = mod,
'temp_coef' = mod_tempc,
'temp_se' = mod_se,
'temp_t' = mod_t,
'temp_pval' = mod_p )
glimpse( odor_mods_df )## Rows: 3
## Columns: 5
## $ mod <chr> "odor ~ temp + gas", "odor ~ temp + pack", "odor ~ temp + ga…
## $ temp_coef <dbl> -12.125, -12.125, -12.125
## $ temp_se <dbl> 13.66271, 13.14072, 12.73201
## $ temp_t <dbl> -0.8874523, -0.9227041, -0.9523244
## $ temp_pval <dbl> 0.3922754, 0.3743416, 0.3613900
The temperature coefficient (-12.125) is identical in every fit (and would also be -12.125 in the temp-only model, which the loop omits) because the predictors are orthogonal by design; only its standard error, t-statistic and p-value change. Temperature is not statistically significant in any of the models, and adding gas and/or pack does not change that conclusion. The standard error of the temperature coefficient shrinks slightly as other terms are added (13.66 with gas, 13.14 with pack, 12.73 with both), so the t-statistic grows slightly in magnitude when both are included. Since no term reaches significance, the simplest model, odor ~ temp, is arguably the best choice here.
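A quick check that the three predictors really are pairwise orthogonal (an added sketch, not part of the exercise):
round( cor( odor[ , c( 'temp', 'gas', 'pack' ) ] ), 2 )   # all off-diagonal correlations are 0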
Categorical Predictors
Categorical features / predictors - qualitative as opposed to quantitative values
A 2-level factor
data( sexab )
glimpse( sexab )## Rows: 76
## Columns: 3
## $ cpa <dbl> 2.04786, 0.83895, -0.24139, -1.11461, 2.01468, 6.71131, 1.20814, …
## $ ptsd <dbl> 9.71365, 6.16933, 15.15926, 11.31277, 9.95384, 9.83884, 5.98491, …
## $ csa <fct> Abused, Abused, Abused, Abused, Abused, Abused, Abused, Abused, A…
by( sexab, sexab$csa, summary )## sexab$csa: Abused
## cpa ptsd csa
## Min. :-1.115 Min. : 5.985 Abused :45
## 1st Qu.: 1.415 1st Qu.: 9.374 NotAbused: 0
## Median : 2.627 Median :11.313
## Mean : 3.075 Mean :11.941
## 3rd Qu.: 4.317 3rd Qu.:14.901
## Max. : 8.647 Max. :18.993
## ------------------------------------------------------------
## sexab$csa: NotAbused
## cpa ptsd csa
## Min. :-3.1204 Min. :-3.349 Abused : 0
## 1st Qu.:-0.2299 1st Qu.: 3.544 NotAbused:31
## Median : 1.3216 Median : 5.794
## Mean : 1.3088 Mean : 4.696
## 3rd Qu.: 2.8309 3rd Qu.: 6.838
## Max. : 5.0497 Max. :10.914
boxp <- ggplot( sexab, aes( x = csa, y = ptsd ) ) +
geom_boxplot()
scatp <- ggplot( sexab, aes( x = cpa, y = ptsd, color = csa ) ) +
geom_point( show.legend = FALSE )
grid.arrange( boxp, scatp, ncol = 2 ) Those in the abused group have higher PTSD scores than the not-abused group.
run a t-test to evaluate the statistical significance of this
t.test( ptsd ~ csa, sexab, var.equal = TRUE )##
## Two Sample t-test
##
## data: ptsd by csa
## t = 8.9387, df = 74, p-value = 2.172e-13
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 5.630165 8.860273
## sample estimates:
## mean in group Abused mean in group NotAbused
## 11.941093 4.695874
use dummy variables to recode the 2-level feature variable
sexab %>% mutate( dummy = case_when( csa == 'Abused' ~ 1,
csa == 'NotAbused' ~ 0 ) ) %>%
do( tidy( lm( ptsd ~ dummy, . ) ) )## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 4.70 0.624 7.53 1.00e-10
## 2 dummy 7.25 0.811 8.94 2.17e-13
smod <- lm( ptsd ~ csa, sexab )
summary( smod )##
## Call:
## lm(formula = ptsd ~ csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0451 -2.3123 0.0951 2.1645 7.0514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.9411 0.5177 23.067 < 2e-16 ***
## csaNotAbused -7.2452 0.8105 -8.939 2.17e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 74 degrees of freedom
## Multiple R-squared: 0.5192, Adjusted R-squared: 0.5127
## F-statistic: 79.9 on 1 and 74 DF, p-value: 2.172e-13
The intercept (11.9) is the mean response for the reference level (Abused); the csaNotAbused estimate is the difference in means between the two levels.
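A quick check that these coefficients are just the group means (an added sketch):
with( sexab, tapply( ptsd, csa, mean ) )   # Abused ~ 11.94 (the intercept), NotAbused ~ 4.70;
                                           # the difference of about -7.25 is the csaNotAbused estimate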
If it is more intuitive the other way around, we can re-reference (relevel) the factor:
sexab$csa <- relevel( sexab$csa, ref = 'NotAbused' )
lmod <- lm( ptsd ~ csa, sexab )
summary( lmod )##
## Call:
## lm(formula = ptsd ~ csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.0451 -2.3123 0.0951 2.1645 7.0514
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.6959 0.6237 7.529 1.00e-10 ***
## csaAbused 7.2452 0.8105 8.939 2.17e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.473 on 74 degrees of freedom
## Multiple R-squared: 0.5192, Adjusted R-squared: 0.5127
## F-statistic: 79.9 on 1 and 74 DF, p-value: 2.172e-13
Factors and Quantitative Predictors
We can build models that express how a factor variable and a quantitative variable are jointly related to a response. Consider extending the sexual abuse model to include the physical abuse term cpa and its interaction with csa:
sexmod1 <- lm( ptsd ~ cpa + csa + cpa:csa, sexab )
summary( sexmod1 )##
## Call:
## lm(formula = ptsd ~ cpa + csa + cpa:csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1999 -2.5313 -0.1807 2.7744 6.9748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6959 0.7107 5.201 1.79e-06 ***
## cpa 0.7640 0.3038 2.515 0.0142 *
## csaAbused 6.8612 1.0747 6.384 1.48e-08 ***
## cpa:csaAbused -0.3140 0.3685 -0.852 0.3970
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.279 on 72 degrees of freedom
## Multiple R-squared: 0.5828, Adjusted R-squared: 0.5654
## F-statistic: 33.53 on 3 and 72 DF, p-value: 1.133e-13
we can see that the interaction term is not significant, so we simplify the model
sexmod2 <- lm( ptsd ~ cpa + csa, sexab )
summary( sexmod2 )##
## Call:
## lm(formula = ptsd ~ cpa + csa, data = sexab)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1567 -2.3643 -0.1533 2.1466 7.1417
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.9753 0.6293 6.317 1.87e-08 ***
## cpa 0.5506 0.1716 3.209 0.00198 **
## csaAbused 6.2728 0.8219 7.632 6.91e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.273 on 73 degrees of freedom
## Multiple R-squared: 0.5786, Adjusted R-squared: 0.5671
## F-statistic: 50.12 on 2 and 73 DF, p-value: 2.002e-14
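As a check, comparing the two fits with anova() reproduces the interaction test (an added sketch):
anova( sexmod2, sexmod1 )   # F ~ 0.73, the square of the earlier t = -0.85, with p ~ 0.40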
visualize the data with parallel regression lines
group.colors <- c( Abused = 'red', NotAbused = 'blue')
ggplot( aes( x = cpa, y = ptsd, color = csa ), data = sexab ) +
geom_point() +
geom_abline( slope = sexmod2$coefficients[2],
intercept = sexmod2$coefficients[1], color = 'blue' ) +
geom_abline( slope = sexmod2$coefficients[2],
intercept = sexmod2$coefficients[1] +
sexmod2$coefficients[3], color = 'red' ) +
scale_color_manual(values=group.colors) a simple check of the residuals for model diagnostics
sexmod2_aug <- augment( sexmod2 )
ggplot( aes( x = .fitted, y = .resid, color = csa ), data = sexmod2_aug ) +
geom_point() +
scale_color_manual(values=group.colors) We can see that there is no clear trend in the residuals and the variance of the two groups is approximately the same.
Interpretation with Interaction Terms
data( whiteside )
glimpse( whiteside )## Rows: 56
## Columns: 3
## $ Insul <fct> Before, Before, Before, Before, Before, Before, Before, Before, …
## $ Temp <dbl> -0.8, -0.7, 0.4, 2.5, 2.9, 3.2, 3.6, 3.9, 4.2, 4.3, 5.4, 6.0, 6.…
## $ Gas <dbl> 7.2, 6.9, 6.4, 6.0, 5.8, 5.8, 5.6, 4.7, 5.8, 5.2, 4.9, 4.9, 4.3,…
ggplot( aes( x = Temp, y = Gas ), data = whiteside ) +
geom_point() +
facet_grid( ~ Insul ) +
geom_smooth( method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
gas consumption is temperature dependent both before and after installing the insulation. However, less gas is used after installation.
Explore the interaction of temperature:
whitehouse_mod <- lm( Gas ~ Temp*Insul, whiteside )
summary( whitehouse_mod )##
## Call:
## lm(formula = Gas ~ Temp * Insul, data = whiteside)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97802 -0.18011 0.03757 0.20930 0.63803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.85383 0.13596 50.409 < 2e-16 ***
## Temp -0.39324 0.02249 -17.487 < 2e-16 ***
## InsulAfter -2.12998 0.18009 -11.827 2.32e-16 ***
## Temp:InsulAfter 0.11530 0.03211 3.591 0.000731 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.323 on 52 degrees of freedom
## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9235
## F-statistic: 222.3 on 3 and 52 DF, p-value: < 2.2e-16
From this model we can say that the average consumption before insulation at 0°C was 6.85. This makes interpretation awkward because 0°C sits at the extreme low end of the observed temperature range. We can recenter the temperature predictor to address this:
whiteside$ctemp <- whiteside$Temp - mean( whiteside$Temp )
whiteside_mod <- lm( Gas ~ ctemp * Insul, whiteside )
summary( whiteside_mod )##
## Call:
## lm(formula = Gas ~ ctemp * Insul, data = whiteside)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.97802 -0.18011 0.03757 0.20930 0.63803
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.93679 0.06424 76.848 < 2e-16 ***
## ctemp -0.39324 0.02249 -17.487 < 2e-16 ***
## InsulAfter -1.56787 0.08771 -17.875 < 2e-16 ***
## ctemp:InsulAfter 0.11530 0.03211 3.591 0.000731 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.323 on 52 degrees of freedom
## Multiple R-squared: 0.9277, Adjusted R-squared: 0.9235
## F-statistic: 222.3 on 3 and 52 DF, p-value: < 2.2e-16
Centering on the mean temperature gives a more natural interpretation of the coefficients: at the average temperature, the average consumption was 4.94 before insulation and 4.94 - 1.57 = 3.37 after installing the insulation.
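A quick check of those two numbers with predict() at the mean temperature (ctemp = 0); this check is an addition to the notes:
predict( whiteside_mod,
         newdata = data.frame( ctemp = 0,
                               Insul = factor( c( 'Before', 'After' ),
                                               levels = levels( whiteside$Insul ) ) ) )
# should return roughly 4.94 and 3.37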
Factors with more than 2 levels
Consider an experiment with 5 different treatment groups:
data( fruitfly )
glimpse( fruitfly )## Rows: 124
## Columns: 3
## $ thorax <dbl> 0.68, 0.68, 0.72, 0.72, 0.76, 0.76, 0.76, 0.76, 0.76, 0.80, …
## $ longevity <int> 37, 49, 46, 63, 39, 46, 56, 63, 65, 56, 65, 70, 63, 65, 70, …
## $ activity <fct> many, many, many, many, many, many, many, many, many, many, …
visualize longevity ~ thorax for all 5 groups
ggplot( aes( x = thorax, y = longevity, color = activity ), data = fruitfly ) +
geom_point() plot as a facet-wrap:
ggplot( aes( x = thorax, y = longevity, color = activity ), data = fruitfly ) +
geom_point() +
facet_wrap( ~ activity ) fit a linear model
fruitfly_mod <- lm( longevity ~ thorax * activity, fruitfly )
sumary( fruitfly_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -50.24197 21.80118 -2.3046 0.0230
## thorax 136.12676 25.95172 5.2454 7.275e-07
## activityone 6.51716 33.87083 0.1924 0.8478
## activitylow -7.75013 33.96901 -0.2282 0.8199
## activitymany -1.13943 32.52975 -0.0350 0.9721
## activityhigh -11.03803 31.28660 -0.3528 0.7249
## thorax:activityone -4.67713 40.65175 -0.1151 0.9086
## thorax:activitylow 0.87431 40.42531 0.0216 0.9828
## thorax:activitymany 6.54779 39.36002 0.1664 0.8682
## thorax:activityhigh -11.12676 38.11997 -0.2919 0.7709
##
## n = 124, p = 10, Residual SE = 10.71275, R-Squared = 0.65
plot( fruitfly_mod )anova( fruitfly_mod )## Analysis of Variance Table
##
## Response: longevity
## Df Sum Sq Mean Sq F value Pr(>F)
## thorax 1 15003.3 15003.3 130.733 < 2.2e-16 ***
## activity 4 9634.6 2408.6 20.988 5.503e-13 ***
## thorax:activity 4 24.3 6.1 0.053 0.9947
## Residuals 114 13083.0 114.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that the interaction term thorax:activity is not significant, so we can refit the model without this term:
fruitfly_mod2 <- lm( longevity ~ thorax + activity, fruitfly )But do we need both thorax and activity? We can test whether each remaining term is needed:
drop1( fruitfly_mod2, test='F' )## Single term deletions
##
## Model:
## longevity ~ thorax + activity
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 13107 589.92
## thorax 1 12368.4 25476 670.32 111.348 < 2.2e-16 ***
## activity 4 9634.6 22742 650.25 21.684 1.974e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The drop1() command tests each term relative to the full model.
sumary( fruitfly_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -48.7494 10.8500 -4.4930 1.649e-05
## thorax 134.3414 12.7312 10.5522 < 2.2e-16
## activityone 2.6372 2.9839 0.8838 0.37861
## activitylow -7.0149 2.9811 -2.3532 0.02027
## activitymany 4.1387 3.0267 1.3674 0.17410
## activityhigh -20.0037 3.0160 -6.6325 1.048e-09
##
## n = 124, p = 6, Residual SE = 10.53939, R-Squared = 0.65
plot( fruitfly_mod2 )From the 'Residuals vs Fitted' plot, we see some heteroscedasticity, which can be addressed with a log transformation of the response:
fruitfly_mod3 <- lm( log( longevity ) ~ thorax + activity, fruitfly )
plot( fruitfly_mod3 )sumary( fruitfly_mod3 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.844210 0.198818 9.2759 1.036e-15
## thorax 2.721458 0.233289 11.6656 < 2.2e-16
## activityone 0.051744 0.054678 0.9463 0.34591
## activitylow -0.123867 0.054626 -2.2676 0.02518
## activitymany 0.087910 0.055462 1.5850 0.11563
## activityhigh -0.419252 0.055266 -7.5860 8.353e-12
##
## n = 124, p = 6, Residual SE = 0.19313, R-Squared = 0.7
exp( coef(fruitfly_mod3))## (Intercept) thorax activityone activitylow activitymany activityhigh
## 6.3231018 15.2024730 1.0531064 0.8834971 1.0918894 0.6575384
The high sexual activity group was found to have a markedly shorter lifespan: exp(coef) for activityhigh is about 0.66, i.e. roughly a 34% reduction in expected longevity at a given thorax length.
Alternative Codings of Qualitative Predictors
Consider a four-level factor coded with three dummy variables.
Treatment Coding
contr.treatment( 4 )## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
Helmert Coding
contr.helmert( 4 )## [,1] [,2] [,3]
## 1 -1 -1 -1
## 2 1 -1 -1
## 3 0 2 -1
## 4 0 0 3
Sum Coding
contr.sum( 4 )## [,1] [,2] [,3]
## 1 1 0 0
## 2 0 1 0
## 3 0 0 1
## 4 -1 -1 -1
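To fit a model under one of these alternative codings, the contrasts argument of lm() can be used; a sketch reusing the fruitfly model from above (sum_coded is a new name; the fitted values are unchanged, only the coefficient parameterization differs):
sum_coded <- lm( longevity ~ thorax + activity, fruitfly,
                 contrasts = list( activity = 'contr.sum' ) )
coef( sum_coded )   # compare with the treatment-coded fruitfly_mod2 coefficients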
Exercise 14.2
Using the ‘infmort’ data, find a simple model for the infant mortality in terms of the other variables. Be alert for transformations and unusual points. Interpret your model by explaining what the regression parameter estimates mean.
data( infmort )
glimpse( infmort )## Rows: 105
## Columns: 4
## $ region <fct> Asia, Europe, Europe, Americas, Europe, Europe, Europe, Euro…
## $ income <dbl> 3426, 3350, 3346, 4751, 5029, 3312, 3403, 5040, 2009, 2298, …
## $ mortality <dbl> 26.7, 23.7, 17.0, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, …
## $ oil <fct> no oil exports, no oil exports, no oil exports, no oil expor…
region <- ggplot( infmort, aes( x = region, y = mortality ) ) +
geom_boxplot( )
oil <- ggplot( infmort, aes( x = oil, y = mortality ) ) +
geom_boxplot( )
income <- ggplot( infmort, aes( x = log(income), y = mortality, color = region ) ) +
geom_point( )
income2 <- ggplot( infmort, aes( x = log(income), y = mortality, color = oil ) ) +
geom_point( )
grid.arrange( region, oil, income, income2, ncol = 2)## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing non-finite values (stat_boxplot).
## Warning: Removed 4 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_point).
perform an AIC-based backward selection
#make our full model to work backwards from:
infmort_full <- lm( mortality ~ log( income ) + oil + region, data = infmort )
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort)
##
## Residuals:
## Min 1Q Median 3Q Max
## -157.89 -34.33 -3.33 14.68 497.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 276.479 57.696 4.792 6.09e-06 ***
## log(income) -11.234 8.666 -1.296 0.19799
## oilno oil exports -84.558 29.290 -2.887 0.00482 **
## regionEurope -84.334 33.411 -2.524 0.01326 *
## regionAsia -41.095 20.508 -2.004 0.04793 *
## regionAmericas -72.369 24.011 -3.014 0.00331 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 76.89 on 95 degrees of freedom
## (4 observations deleted due to missingness)
## Multiple R-squared: 0.3188, Adjusted R-squared: 0.283
## F-statistic: 8.894 on 5 and 95 DF, p-value: 5.906e-07
We see from the plots that there are several unusually influential observations. Here we identify them via Cook's distance and remove them from the data:
HighLeverage <- cooks.distance(infmort_full) > (4/nrow(infmort))
infmort[ HighLeverage, ]## region income mortality oil
## Algeria Africa 400 86.3 oil exports
## Iran Asia 1280 NA oil exports
## Iraq Asia 560 28.1 oil exports
## Libya Africa 3010 300.0 oil exports
## Nigeria Africa 220 58.0 oil exports
## South_Vietnam Asia 130 100.0 no oil exports
#remove high leverage points
infmort_filt <- infmort[ !HighLeverage, ]
#refit full model:
infmort_full <- lm( mortality ~ log( income ) + oil + region, data = infmort_filt )
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
infmort_backwards <- step( infmort_full, direction = 'backward', trace = 1 )## Start: AIC=826.6
## mortality ~ log(income) + oil + region
##
## Df Sum of Sq RSS AIC
## <none> 464963 826.60
## - log(income) 1 11261 476225 826.89
## - region 3 65586 530549 833.26
## - oil 1 99247 564210 843.17
summary( infmort_backwards )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
I tried this before filtering the influential points, and backward selection dropped income. It appears several influential countries with high income and oil exports also had high infant mortality rates; once those observations are removed, log(income) is retained by the AIC criterion (though its coefficient is still not individually significant). If income is not modeled on the log scale, it is dropped in the AIC-based selection.
Now trying to drop by F-test:
drop1( infmort_full, test = 'F' )## Single term deletions
##
## Model:
## mortality ~ log(income) + oil + region
## Df Sum of Sq RSS AIC F value Pr(>F)
## <none> 464963 826.60
## log(income) 1 11261 476225 826.89 2.1798 0.143325
## oil 1 99247 564210 843.17 19.2105 3.163e-05 ***
## region 3 65586 530549 833.26 4.2317 0.007589 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary( infmort_full )##
## Call:
## lm(formula = mortality ~ log(income) + oil + region, data = infmort_filt)
##
## Residuals:
## Min 1Q Median 3Q Max
## -144.37 -35.25 -1.40 16.60 420.08
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 368.684 58.620 6.289 1.13e-08 ***
## log(income) -12.568 8.513 -1.476 0.143325
## oilno oil exports -164.768 37.593 -4.383 3.16e-05 ***
## regionEurope -85.841 32.700 -2.625 0.010177 *
## regionAsia -46.597 20.576 -2.265 0.025939 *
## regionAmericas -83.093 23.829 -3.487 0.000757 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 71.88 on 90 degrees of freedom
## (3 observations deleted due to missingness)
## Multiple R-squared: 0.4001, Adjusted R-squared: 0.3667
## F-statistic: 12 on 5 and 90 DF, p-value: 6.502e-09
The intercept gives the model's estimated infant mortality rate for an oil-exporting African country (at log(income) = 0). Holding income and region fixed, a non-oil-exporting country is expected to have an infant mortality rate about 165 lower. Controlling for income and oil-export status, countries in Europe, Asia and the Americas are expected to have lower infant mortality than African countries by the estimated amounts (-86, -47 and -83 respectively).
Diagnostics
Regression Diagnostics: there are several assumptions made when applying linear models, and diagnostic methods are used to determine if those assumptions are met before a model is deployed into the wild. Model building is an iterative process, and model diagnostics can often highlight aspects that need adjustment.
potential issues with model assumptions:
- Error - we assume that errors are independent and normally distributed
- Model - we have assumed that structurally the data supports a linear relationship
- Unusual Observations - it only takes a few extreme outliers to throw off a model
Checking Error Assumptions
Constant Variance
We need to check whether the variance of the residuals is related to some feature of the model. The most useful diagnostic is a plot of the residuals against the fitted values \(\hat{y}\): if all is well, you should see constant, symmetric variation. Heteroscedasticity refers to nonconstant variance (homoscedasticity is constant variance).
an example:
data( savings )
savings_mod <- lm( sr ~ pop15 + pop75 + dpi + ddpi, data = savings )
savings_mod_aug <- augment( savings_mod )
#glimpse( savings_mod_aug )
p1 <- ggplot( savings_mod_aug, aes( x = .fitted, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug, aes( x = .fitted, y = sqrt( abs( .resid ) ) ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
ylab( expression( sqrt( hat( epsilon )))) +
theme_classic()
grid.arrange( p1, p2, ncol = 2 ) Both plots look satisfactory. A numerical check can also be applied: regress \(\sqrt{|\hat{\epsilon}|}\) on the fitted values and look for a significant slope:
sumary( lm( sqrt( abs( savings_mod_aug$.resid)) ~ savings_mod_aug$.fitted ) )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.162163 0.347880 6.2153 1.175e-07
## savings_mod_aug$.fitted -0.061366 0.034756 -1.7656 0.08382
##
## n = 50, p = 2, Residual SE = 0.63415, R-Squared = 0.06
The slope in this regression is not significantly different from zero, and the two visualizations look satisfactory, so the assumption of constant residual variance appears reasonable.
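An alternative formal check is the Breusch-Pagan test from the lmtest package (a sketch; not used in the original notes):
bptest( savings_mod )   # H0: constant error variance; a large p-value is consistent with the plots above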
The following plot shows some simulated distributions to give examples of things to look out for in data
par( mfrow = c( 2, 2 ) )
n <- 500
x <- runif( n )
plot( x, rnorm( n ), main = 'Constant Variance' )
plot( x, x*rnorm( n ), main = 'Strong NonConstant Variance' )
plot( x, sqrt( x )*rnorm( n ), main = 'Mild NonConstant Variance' )
plot( x, cos( x*pi/0.5 ) + rnorm( n, sd = 1 ), main = 'NonLinearity' )now to look at the residuals as a function of a few of the predictor variables:
p1 <- ggplot( savings_mod_aug, aes( x = pop15, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop15' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug, aes( x = pop75, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop75' ) +
theme_classic()
p3 <- ggplot( savings_mod_aug, aes( x = dpi, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'dpi' ) +
theme_classic()
p4 <- ggplot( savings_mod_aug, aes( x = ddpi, y = .resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'ddpi' ) +
theme_classic()
grid.arrange( p1, p2, p3, p4, ncol = 2 ) The variance appears constant for pop75, but the residuals for pop15 appear to split into two groups.
Transformations can be applied to even out signs of nonconstant variance. For example, refit the model with square root transformations of dpi and ddpi, then look at the residual plots again:
savings_mod2 <- lm( sr ~ pop15 + pop75 + sqrt(dpi) + sqrt(ddpi), data = savings )
savings_mod_aug2 <- augment( savings_mod2 )
glimpse( savings_mod_aug2 )## Rows: 50
## Columns: 11
## $ .rownames <chr> "Australia", "Austria", "Belgium", "Bolivia", "Brazil", "…
## $ sr <dbl> 11.43, 12.07, 13.17, 5.75, 12.88, 8.79, 0.60, 11.90, 4.98…
## $ pop15 <dbl> 29.35, 23.32, 23.80, 41.89, 42.19, 31.72, 39.74, 44.75, 4…
## $ pop75 <dbl> 2.87, 4.41, 4.43, 1.67, 0.83, 2.85, 1.34, 0.67, 1.06, 1.1…
## $ `sqrt(dpi)` <dbl> 48.266759, 38.832847, 45.918079, 13.752454, 26.990183, 54…
## $ `sqrt(ddpi)` <dbl> 1.6941074, 1.9824228, 1.9544820, 0.4690416, 2.1354157, 1.…
## $ .fitted <dbl> 10.637063, 11.537541, 11.105882, 5.419434, 9.575882, 9.25…
## $ .hat <dbl> 0.07519878, 0.11246708, 0.08712671, 0.16599769, 0.0959366…
## $ .sigma <dbl> 3.821655, 3.822727, 3.809781, 3.823287, 3.787618, 3.82291…
## $ .cooksd <dbl> 0.0007734259, 0.0005663132, 0.0062319919, 0.0003648478, 0…
## $ .std.resid <dbl> 0.21807866, 0.14948337, 0.57138393, 0.09573560, 0.9190834…
p1 <- ggplot( savings_mod_aug2, aes( x = pop15, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop15' ) +
theme_classic()
p2 <- ggplot( savings_mod_aug2, aes( x = pop75, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'pop75' ) +
theme_classic()
p3 <- ggplot( savings_mod_aug2, aes( x = `sqrt(dpi)`, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'dpi' ) +
theme_classic()
p4 <- ggplot( savings_mod_aug2, aes( x = `sqrt(ddpi)`, y = .std.resid ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_abline( intercept = 0, slope = 0, color = 'red' ) +
ggtitle( 'ddpi' ) +
theme_classic()
grid.arrange( p1, p2, p3, p4, ncol = 2 )That looks a little more even, but not satisfactory.
Normality
Q-Q plots: compare the residuals to the ‘ideal’ normal observations
par(mfrow=c(1,2))
qqnorm( savings_mod_aug$.resid, ylab='Residuals',main='')
qqline( savings_mod_aug$.resid )
hist( savings_mod_aug$.resid, xlab = 'Residuals', main='' ) normal residuals should follow the line approximately. Here, the residuals look normal.
The Shapiro-Wilk test is a formal test for normality:
shapiro.test( savings_mod_aug$.resid )##
## Shapiro-Wilk normality test
##
## data: savings_mod_aug$.resid
## W = 0.98698, p-value = 0.8524
the \(H_0\) is that the residuals are normal. Since the p-value is large, we do not reject the null hypothesis
Finding Unusual Observations
Leverage
leverage points are extreme in the predictor space.
A half-normal plot can identify unusually large leverage values:
#we are looking for outliers in third plot:
countries <- row.names( savings )
hatv <- hatvalues( savings_mod )
head( hatv )## Australia Austria Belgium Bolivia Brazil Canada
## 0.06771343 0.12038393 0.08748248 0.08947114 0.06955944 0.15840239
sum( hatv )## [1] 5
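A common rule of thumb flags observations whose leverage exceeds 2p/n, here 2 * 5 / 50 = 0.2 (a sketch added to the notes):
hatv[ hatv > 2 * sum( hatv ) / nrow( savings ) ]   # countries with unusually high leverage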
halfnorm( hatv, labs = countries, ylab = 'Leverages' ) If we standardize the residuals, we can see how many standard deviations each point lies from the center of the distribution:
qqnorm( rstandard( savings_mod ) )
abline( 0,1 )Outliers
A point that does not fit the current model well.
Simulate some data to take a look at how outliers may impact a model fit:
testdata <- data.frame( x = 1:10, y = 1: 10+rnorm( 10 ) )
lmod <- lm( y ~ x, testdata )
#an outlier with a central predictor
p1 <- rbind( testdata, c( 5.5, 12 ) )
lmod1 <- lm( y ~ x, p1 )
plt1 <- ggplot( p1, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 5.5, y = 12 ) ) +
ggtitle( 'Outlier with little leverage' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
plt1#an extra point well outside the range of the data
p2 <- rbind( testdata, c( 15, 15.1 ) )
lmod2 <- lm( y ~ x, p2 )
plt2 <- ggplot( p2, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 15, y = 15.1 ) ) +
ggtitle( 'Large leverage but is not an Outlier' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
#an outlier with leverage
p3 <- rbind( testdata, c( 15, 5.1 ) )
lmod3 <- lm( y ~ x, p3 )
plt3 <- ggplot( p3, aes( x = x, y = y ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = "lm", alpha = .15 ) +
geom_smooth( data = testdata, method = 'lm', alpha = 0.15 ) +
theme_classic() +
geom_point( shape = 19, alpha = 0.5, size = 3, color = 'red', aes( x = 15, y = 5.1 ) ) +
ggtitle( 'Large leverage and an Outlier' ) +
xlim( c( 0, 15 ) ) +
ylim( c( 0, 18 ) )
grid.arrange( plt1, plt2, plt3, ncol = 2 )Bonferroni Correction
stud <- rstudent( lmod )
stud[ which.max( abs( stud ) ) ]## 1
## -2.699505
#compute the Bonferroni critical value, 0.05/(2n) with n = 50 and df = 44, i.e. for the savings model
#(the studentized residual above came from the small simulated dataset):
qt( 0.05 / ( 50*2 ), 44 )## [1] -3.525801
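For the savings model itself, which those degrees of freedom correspond to, the analogous check would be (a sketch; stud_sav is a new name):
stud_sav <- rstudent( savings_mod )
stud_sav[ which.max( abs( stud_sav ) ) ]   # compare the largest |studentized residual| with the critical value above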
A dataset with multiple outliers
data( star, package = 'faraway' )
plt1 <- ggplot( star, aes( x = temp, y = light ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm' ) +
theme_classic()
plt1## `geom_smooth()` using formula 'y ~ x'
assess the data for outliers:
range( rstudent( lm( light ~ temp, star ) ) )## [1] -2.049393 1.905847
star2 <- star %>% filter( temp >3.6 )
plt1 + geom_smooth( data = star2, method = 'lm' )## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
No individual point stands out from the studentized residuals, yet when the four data points that represent the giant stars are removed, the linear relationship between light intensity and temperature is described by very different coefficients.
Influential Observations
Influential Data Points: observations whose removal causes a notable shift in the fitted model.
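For reference, Cook's distance combines the size of the residual with the leverage of the point (standard definition, stated here for context): \[D_i = \frac{r_i^2}{p}\,\frac{h_i}{1-h_i}\] where \(r_i\) is the standardized residual, \(h_i\) the leverage, and \(p\) the number of parameters.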
the Cook statistic is a good way to show influential data points and can be visualized as a half-normal plot:
sumary( savings_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56608654 7.35451611 3.8842 0.0003338
## pop15 -0.46119315 0.14464222 -3.1885 0.0026030
## pop75 -1.69149768 1.08359893 -1.5610 0.1255298
## dpi -0.00033690 0.00093111 -0.3618 0.7191732
## ddpi 0.40969493 0.19619713 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
cook <- cooks.distance( savings_mod )
halfnorm( cook, labs = countries, ylab = "Cook's distances" )
Now we evaluate the fit after removing the data point with the largest Cook's distance: Libya
savings_modi <- lm( sr ~ pop15 + pop75 + dpi + ddpi,
savings,
subset = ( cook < max( cook ) ) )
sumary( savings_modi )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.52404598 8.22402631 2.9820 0.004655
## pop15 -0.39144013 0.15790949 -2.4789 0.017084
## pop75 -1.28086692 1.14518206 -1.1185 0.269430
## dpi -0.00031890 0.00092933 -0.3432 0.733119
## ddpi 0.61027903 0.26877842 2.2706 0.028122
##
## n = 49, p = 5, Residual SE = 3.79481, R-Squared = 0.36
plot( savings_modi )
Checking the Structure of the Model:
Partial Regression (Added Variable) Plots: a look at the marginal relationship between the response and a predictor after the effect of the other predictors has been removed. The partial regression plot lets us focus on the relationship between one predictor and the response as in a simple regression. Better for detecting outliers and influential points.
Partial Residual Plots: the partial residuals for a predictor plotted against that predictor. Better for detecting nonlinearity.
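For reference, the partial residual for predictor \(j\) adds that predictor's fitted contribution back onto the overall residual (standard definition): \[\hat{\epsilon} + \hat{\beta}_j x_j = y - \sum_{k \ne j}\hat{\beta}_k x_k\] which is essentially what residuals( savings_mod, type = 'partial' ) returns, column by column.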
d <- residuals( lm( sr ~ pop75 + dpi + ddpi, savings ) )
m <- residuals( lm( pop15 ~ pop75 + dpi + ddpi, savings ) )
p <- predict( savings_mod, type = 'term', term = 'pop15' )
r <- residuals( savings_mod, type = 'partial' )[,'pop15']
dat <- data.frame( 'd' = d, 'm' = m, 'p' = p, 'r' = r, 'pop15' = savings$pop15 )
preg <- ggplot( dat, aes( x = m, y = d ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm', se = FALSE ) +
xlab( 'pop15 residuals' ) +
ylab( 'Savings residuals' ) +
ggtitle( 'Partial Regression / Added Variable Plot' ) +
theme_classic()
pres <- ggplot( dat, aes( x = pop15.1, y = r ) ) +
geom_point( fill = NA, shape = 21, alpha = 0.5, size = 3 ) +
geom_smooth( method = 'lm', se = FALSE ) +
ylab( 'Partial for pop15' ) +
xlab( 'pop15' ) +
ggtitle( 'Partial Residual Plot' ) +
theme_classic()
grid.arrange( preg, pres, ncol = 2 )## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
Exercise 6.1
Using the sat dataset, fit a model with the total SAT score as the response and expend, salary, ratio and takers as predictors. Perform regression diagnostics on this model to answer the following questions. Display any plots that are relevant. Do not provide any plots about which you have nothing to say. Suggest possible improvements or corrections to the model where appropriate.
data( sat )
glimpse( sat )## Rows: 50
## Columns: 7
## $ expend <dbl> 4.405, 8.963, 4.778, 4.459, 4.992, 5.443, 8.817, 7.030, 5.718, …
## $ ratio <dbl> 17.2, 17.6, 19.3, 17.1, 24.0, 18.4, 14.4, 16.6, 19.1, 16.3, 17.…
## $ salary <dbl> 31.144, 47.951, 32.175, 28.934, 41.078, 34.571, 50.045, 39.076,…
## $ takers <int> 8, 47, 27, 6, 45, 29, 81, 68, 48, 65, 57, 15, 13, 58, 5, 9, 11,…
## $ verbal <int> 491, 445, 448, 482, 417, 462, 431, 429, 420, 406, 407, 468, 488…
## $ math <int> 538, 489, 496, 523, 485, 518, 477, 468, 469, 448, 482, 511, 560…
## $ total <int> 1029, 934, 944, 1005, 902, 980, 908, 897, 889, 854, 889, 979, 1…
sat_mod <- lm( total ~ expend + salary + ratio + takers, data = sat )
sumary( sat_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1045.97154 52.86976 19.7839 < 2.2e-16
## expend 4.46259 10.54653 0.4231 0.6742
## salary 1.63792 2.38725 0.6861 0.4962
## ratio -3.62423 3.21542 -1.1271 0.2657
## takers -2.90448 0.23126 -12.5594 2.607e-16
##
## n = 50, p = 5, Residual SE = 32.70199, R-Squared = 0.82
- Check the constant variance assumption for the errors
plot( fitted( sat_mod ), residuals( sat_mod ),
xlab = 'Fitted', ylab = 'Residuals' )
abline( h = 0 )
Some signs of nonlinearity? nonconstant variance?…
plot( fitted( sat_mod ), sqrt( abs( residuals( sat_mod ) ) ),
xlab = 'Fitted', ylab = expression( sqrt( hat( epsilon ))) )
abline( h = 0 )
This looks more satisfactory?
- Check the normality assumption
qqnorm( residuals( sat_mod ) )
qqline( residuals( sat_mod ) )
That looks reasonably normal
- Check for large leverage points
halfnorm( hatvalues( sat_mod ) )
- Check for outliers
qqnorm( rstandard( sat_mod ) )
abline( 0,1 )
- Check for influential points
cook <- cooks.distance( sat_mod )
halfnorm( cook,5 )
Observation 44 appears to be an outlier with large leverage, and hence an influential point
- Check the structure of the relationship between the predictors and the response
termplot( sat_mod, partial.resid = TRUE, terms = 3 )
Transformation
Transformations of the response and/or predictors can improve the fit and correct violations of model assumptions.
When you use a log transformation on the response, the regression coefficients have a particular interpretation: \[\mbox{log}\hat{y} = \hat{\beta}_0 + \hat{\beta}_1x_1 + \cdots + \hat{\beta}_px_p\] \[\hat{y} = e^{\hat{\beta}_0}e^{\hat{\beta}_1x_1} \cdots e^{\hat{\beta}_px_p}\]
The Box-Cox transformation is a popular way to determine a transformation of the response.
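For reference, the Box-Cox family is (standard definition; \(\lambda\) is chosen to maximize the profile log-likelihood): \[g_\lambda(y) = \begin{cases} \dfrac{y^\lambda - 1}{\lambda} & \lambda \ne 0 \\ \mbox{log}\, y & \lambda = 0 \end{cases}\]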
Transforming the response can make the model harder to interpret, so we do not want to do it unless it is really necessary. Here is an example with the savings data:
data( savings )
glimpse( savings )## Rows: 50
## Columns: 5
## $ sr <dbl> 11.43, 12.07, 13.17, 5.75, 12.88, 8.79, 0.60, 11.90, 4.98, 10.78…
## $ pop15 <dbl> 29.35, 23.32, 23.80, 41.89, 42.19, 31.72, 39.74, 44.75, 46.64, 4…
## $ pop75 <dbl> 2.87, 4.41, 4.43, 1.67, 0.83, 2.85, 1.34, 0.67, 1.06, 1.14, 3.93…
## $ dpi <dbl> 2329.68, 1507.99, 2108.47, 189.13, 728.47, 2982.88, 662.86, 289.…
## $ ddpi <dbl> 2.87, 3.93, 3.82, 0.22, 4.56, 2.43, 2.67, 6.51, 3.08, 2.80, 3.99…
savings_mod <- lm( sr ~ pop15 + pop75 + dpi + ddpi, savings )
boxcox( savings_mod, plotit = T )
boxcox( savings_mod, plotit = T, lambda = seq( 0.5, 1.5, by = 0.1 ) )
Transforming the Predictors
Broken Stick Regression: Sometimes we have reason to believe that different linear regression models apply to different regions of the data. Broken stick regression uses two basis functions, one active on each side of a chosen knot: \[y = \beta_0 + \beta_1\mbox{B}_1(x) + \beta_2\mbox{B}_2(x) + \epsilon\]
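A minimal sketch of the two basis functions, assuming a knot at pop15 = 35 and using the savings data purely for illustration (the knot location and data choice are assumptions, not taken from the text above):
#hockey-stick basis functions about a knot c
lhs <- function( x, c = 35 ) ifelse( x < c, c - x, 0 )   #active below the knot
rhs <- function( x, c = 35 ) ifelse( x < c, 0, x - c )   #active above the knot
broken_mod <- lm( sr ~ lhs( pop15 ) + rhs( pop15 ), data = savings )
sumary( broken_mod )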
Polynomials. response surfaces.
Splines
B-spline basis functions
funky <- function( x ) sin( 2 * pi * x^3 )^3
x = seq( 0,1,by = 0.01 )
y = funky( x ) + 0.1*rnorm( 101 )
matplot( x, cbind( y, funky(x) ), type = "pl", pch = 20, lty = 1, col = 1 )
look at the fits for orthogonal polynomial bases of order 4 and 12:
g4 <- lm( y ~ poly( x, 4 ) )
g12 <- lm( y ~ poly( x, 12 ) )
matplot( x, cbind( y, g4$fit, g12$fit ), type = "pll", pch = 20, lty = c( 1,2 ), col = 1 )
The order-4 polynomial does not fit the data well enough, while the order-12 polynomial appears to overfit.
try with splines:
knots <- c( 0, 0, 0, 0, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.85, 0.9, 1, 1, 1, 1 )
bx <- splineDesign( knots, x )
lmodb <- lm( y ~ bx -1 )
matplot( x, bx, type = 'l', col = 1 )
matplot( x, cbind( y, lmodb$fit ), type = 'pl', pch = 20, lty = 1, col = 1 )
Smoothing spline
ssf <- smooth.spline( x,y )
matplot( x, cbind( y, ssf$y ), type = 'pl', lty = 1, pch = 20, col = 1 )
Missing Data
Types of Missing Data
- Missing Cases: we fail to observe a complete case
- Incomplete Values: not a full series of data
- Missing Values: missed incidences
Types of Missingness:
- Missing Completely at Random (MCAR): the probability of a value being missing does not depend on any data, observed or unobserved
- Missing at Random (MAR): the probability of missingness depends only on other, observed variables
- Missing Not at Random (MNAR): the probability of missingness depends on the missing value itself (or on an unobserved variable)
Deletion
data( chmiss, package = 'faraway')
glimpse( chmiss )## Rows: 47
## Columns: 6
## $ race <dbl> 10.0, 22.2, 19.6, 17.3, 24.5, 54.0, 4.9, 7.1, 5.3, 21.5, 43.1…
## $ fire <dbl> 6.2, 9.5, 10.5, 7.7, 8.6, 34.1, 11.0, 6.9, 7.3, 15.1, 29.1, 2…
## $ theft <dbl> 29, 44, 36, 37, 53, 68, 75, 18, 31, NA, 34, 14, 11, 11, 22, N…
## $ age <dbl> 60.4, 76.5, NA, NA, 81.4, 52.6, 42.6, 78.5, 90.1, 89.8, 82.7,…
## $ involact <dbl> NA, 0.1, 1.2, 0.5, 0.7, 0.3, 0.0, 0.0, NA, 1.1, 1.9, 0.0, 0.0…
## $ income <dbl> 11.744, 9.323, 9.948, 10.656, 9.730, 8.231, 21.480, 11.104, 1…
standard summaries are a handy first peek to notice missing values
summary( chmiss )## race fire theft age
## Min. : 1.00 Min. : 2.00 Min. : 3.00 Min. : 2.00
## 1st Qu.: 3.75 1st Qu.: 5.60 1st Qu.: 22.00 1st Qu.:48.30
## Median :24.50 Median : 9.50 Median : 29.00 Median :64.40
## Mean :35.61 Mean :11.42 Mean : 32.65 Mean :59.97
## 3rd Qu.:57.65 3rd Qu.:15.10 3rd Qu.: 38.00 3rd Qu.:78.25
## Max. :99.70 Max. :36.20 Max. :147.00 Max. :90.10
## NA's :4 NA's :2 NA's :4 NA's :5
## involact income
## Min. :0.0000 Min. : 5.583
## 1st Qu.:0.0000 1st Qu.: 8.564
## Median :0.5000 Median :10.694
## Mean :0.6477 Mean :10.736
## 3rd Qu.:0.9250 3rd Qu.:12.102
## Max. :2.2000 Max. :21.480
## NA's :3 NA's :2
the dataset is small enough that we can also look at the missing values by record (row)
rowSums( is.na( chmiss ) )## 60626 60640 60613 60657 60614 60610 60611 60625 60618 60647 60622 60631 60646
## 1 0 1 1 0 0 0 0 1 1 0 0 1
## 60656 60630 60634 60641 60635 60639 60651 60644 60624 60612 60607 60623 60608
## 0 0 1 0 0 0 1 1 0 0 1 0 1
## 60616 60632 60609 60653 60615 60638 60629 60636 60621 60637 60652 60620 60619
## 1 0 1 0 0 0 1 0 1 0 0 1 0
## 60649 60617 60655 60643 60628 60627 60633 60645
## 1 1 0 0 1 0 0 1
Look for patterns in the missing data:
image( is.na( chmiss ), axes = FALSE, col = gray(1:0) )
axis( 2, at = 0:5/5, labels = colnames( chmiss ) )
axis( 1, at = 0:46/46, labels = row.names( chmiss ), las = 2 )
The missing values seem to be randomly peppered throughout the data.
One strategy is to drop every case containing a missing value. However, this reduces the number of observations from 47 to 27, and the smaller sample size results in larger standard errors for the regression fit, as demonstrated here:
modmiss <- lm( data = chmiss, involact ~ . )
sumary( modmiss )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.1164827 0.6057615 -1.8431 0.0794750
## race 0.0104867 0.0031283 3.3522 0.0030180
## fire 0.0438757 0.0103190 4.2519 0.0003557
## theft -0.0172198 0.0059005 -2.9184 0.0082154
## age 0.0093766 0.0034940 2.6837 0.0139041
## income 0.0687006 0.0421558 1.6297 0.1180775
##
## n = 27, p = 6, Residual SE = 0.33822, R-Squared = 0.79
data( chredlin, package = 'faraway' )
modfull <- lm( involact ~ ., data = chredlin )
sumary( modfull )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.6290917 0.5119934 -1.2287 0.226360
## race 0.0088999 0.0026376 3.3743 0.001655
## fire 0.0390701 0.0086378 4.5232 5.334e-05
## theft -0.0102104 0.0029224 -3.4938 0.001178
## age 0.0084194 0.0029192 2.8841 0.006292
## income 0.0246958 0.0320917 0.7695 0.446094
## sides 0.0240311 0.1250544 0.1922 0.848585
##
## n = 47, p = 7, Residual SE = 0.33913, R-Squared = 0.75
Deleting cases is a simple strategy, but we end up throwing away information that might allow for a more precise inference.
Single Imputation
a tempting solution is to impute missing values with the column means
cmeans <- colMeans( chmiss, na.rm = TRUE )
mchm <- chmiss
for( i in c( 1:4,6 ) ) mchm[ is.na( chmiss[,i]),i ] <- cmeans[i]
imod <- lm( involact ~ ., mchm )
sumary( imod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0708021 0.5094531 0.1390 0.890203
## race 0.0071173 0.0027057 2.6305 0.012245
## fire 0.0287418 0.0093855 3.0624 0.004021
## theft -0.0030590 0.0027457 -1.1141 0.272242
## age 0.0060795 0.0032079 1.8952 0.065695
## income -0.0270917 0.0316782 -0.8552 0.397791
##
## n = 44, p = 6, Residual SE = 0.38412, R-Squared = 0.68
Notice how the coefficients are dragged closer to zero. Imputing with the mean flattens the data, which shrinks the apparent importance of the predictors. Mean imputation is not recommended except where the fraction of missing values is small.
Using regression methods to predict values of covariates
lmodr <- lm( logit( race/100 ) ~ fire + theft + age + income, data = chmiss )
ilogit( predict( lmodr, chmiss[ is.na( chmiss$race ),]))*100## 60646 60651 60616 60617
## 0.4190909 14.7320193 84.2653995 21.3121261
compare these predicted values to the real data
chredlin$race[ is.na( chmiss$race ) ]## [1] 1.0 13.4 62.3 36.4
The regression fill-in will also introduce a bias toward zero in the coefficients and reduce the variance, but the effect is not as acute as with imputation by the mean.
Multiple Imputation
The problem with single imputation is that the imputed value, be it a mean or a regression prediction, tends to be less variable than the value we would have observed, because it carries none of the error variation normally present in real data. Multiple imputation puts that variation back in.
Here we use the Amelia package to perform multiple imputation:
set.seed( 123 )
chimp <- amelia( chmiss, m=25 )## -- Imputation 1 --
##
##   1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  21 22 23 24 25 26 27 28 29 30 31 32 33 34
##
## ( EM iteration counters for imputations 2 through 25 omitted )
now fit a linear model to the 25 imputed datasets and store the results:
betas <- NULL
ses <- NULL
for(i in 1:chimp$m ){
lmod <- lm( involact ~ race + fire + theft + age, chimp$imputations[[i]])
betas <- rbind( betas, coef( lmod ) )
ses <- rbind( ses, coef( summary( lmod ) )[,2])
}
use the mi.meld() function to combine the coefficient estimates and standard errors across the imputations:
cr <- mi.meld( q = betas, se = ses )
cr## $q.mi
## (Intercept) race fire theft age
## [1,] -0.2364977 0.007759897 0.03523587 -0.008675917 0.007214505
##
## $se.mi
## (Intercept) race fire theft age
## [1,] 0.163568 0.002042961 0.008428774 0.004698103 0.00251655
Exercises
13.1
13.2
Problems with Predictors
Errors in Predictors
The measurement error issue: what happens when a predictor is observed with error?
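For a simple regression where the true predictor is observed with independent error of variance \(\sigma_\delta^2\), the least squares slope is attenuated toward zero (standard errors-in-variables result, stated here for context): \[E(\hat{\beta}_1) \approx \beta_1\,\frac{\sigma_x^2}{\sigma_x^2 + \sigma_\delta^2}\]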
data( cars )
cars_mod <- lm( dist ~ speed, data = cars )
cars_p1 <- ggplot( cars, aes( x = speed, y = dist ) ) +
geom_point() +
geom_smooth( method = 'lm', se = F, color = 'black' ) +
theme_classic()
cars_p1## `geom_smooth()` using formula 'y ~ x'
the effect of adding measurement error to the predictor
cars_mod1 <- lm( dist ~ I( speed + rnorm( dim( cars )[1] )), data = cars )
cars_mod2 <- lm( dist ~ I( speed + 2*rnorm( dim( cars )[1] )), data = cars )
cars_mod5 <- lm( dist ~ I( speed + 5*rnorm( dim( cars )[1] )), data = cars )
cars_p1 +
geom_abline(slope = cars_mod1$coefficients[2], intercept = cars_mod1$coefficients[1], color = 'yellow' ) +
geom_abline(slope = cars_mod2$coefficients[2], intercept = cars_mod2$coefficients[1], color = 'orange' ) +
geom_abline(slope = cars_mod5$coefficients[2], intercept = cars_mod5$coefficients[1], color = 'red' )## `geom_smooth()` using formula 'y ~ x'
The slope gets shallower as the error increases.
Here we simulate the effects of adding normal random error:
vv <- rep( 1:5/10, each=1000 )
slopes <- numeric( 5000 )
for ( i in 1:5000 ) slopes[i] <- lm( dist ~ I( speed + sqrt( vv[i] )*rnorm( dim( cars )[1] )), data = cars )$coef[2]
#plot the mean slopes for each variance
betas <- c( coef( cars_mod )[2], colMeans( matrix( slopes, nrow=1000 ) ) )
variances <- c( 0, 1:5/10 ) + 0.5
plot( variances, betas, xlim = c( 0,1 ), ylim = c( 3.86,4 ) )
#fit a linear model and extrapolate to 0 variance
gv <- lm( betas ~ variances )
coef( gv )## (Intercept) variances
## 3.9949761 -0.1276996
points( 0, gv$coef[1], pch=3 )
The predicted value of \(\hat{\beta}\) at variance equal to zero, that is no measurement error, is 4.0.
Here is a prediction result using the simex package:
set.seed( 123 )
lmod <- lm( dist ~ speed, cars, x = T )
simout <- simex( lmod, "speed", 0.5, B=1000 )
simout##
## Naive model:
## lm(formula = dist ~ speed, data = cars, x = T)
##
## SIMEX-Variables: speed
## Number of Simulations: 1000
##
## Coefficients:
## (Intercept) speed
## -18.01 3.96
Changes in Scale
most methods work more reliably when variables are on roughly similar scales.
sumary( savings_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56608654 7.35451611 3.8842 0.0003338
## pop15 -0.46119315 0.14464222 -3.1885 0.0026030
## pop75 -1.69149768 1.08359893 -1.5610 0.1255298
## dpi -0.00033690 0.00093111 -0.3618 0.7191732
## ddpi 0.40969493 0.19619713 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
rescale to measure income in thousands instead of dollars
savings_mod_scale <- lm( sr ~ pop15 + pop75 + I(dpi/1000) + ddpi, data = savings )
sumary( savings_mod_scale )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.56609 7.35452 3.8842 0.0003338
## pop15 -0.46119 0.14464 -3.1885 0.0026030
## pop75 -1.69150 1.08360 -1.5610 0.1255298
## I(dpi/1000) -0.33690 0.93111 -0.3618 0.7191732
## ddpi 0.40969 0.19620 2.0882 0.0424711
##
## n = 50, p = 5, Residual SE = 3.80267, R-Squared = 0.34
another approach is to use the scale() function: the regression coefficients then represent the effect of a one standard unit increase in the predictor on the response, in standard units
savings_scaled <- data.frame( scale( savings ) )
savings_mod_scale2 <- lm( sr ~ ., savings_scaled )
sumary( savings_mod_scale2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8091e-16 1.2003e-01 0.0000 1.000000
## pop15 -9.4204e-01 2.9545e-01 -3.1885 0.002603
## pop75 -4.8731e-01 3.1218e-01 -1.5610 0.125530
## dpi -7.4508e-02 2.0592e-01 -0.3618 0.719173
## ddpi 2.6243e-01 1.2567e-01 2.0882 0.042471
##
## n = 50, p = 5, Residual SE = 0.84873, R-Squared = 0.34
Now, with the predictors on a common scale, it is helpful to plot the estimates with their confidence intervals
edf <- data.frame( coef( savings_mod_scale2 ), confint( savings_mod_scale2 ) )[-1,]
names( edf ) <- c( 'Estimate', 'lb', 'ub' )
p <- ggplot( aes( y = Estimate, ymin = lb, ymax = ub, x = row.names( edf ) ), data = edf ) +
geom_pointrange() +
coord_flip() +
xlab( 'Predictor' ) +
geom_hline( yintercept = 0, col = gray( 0.75 ) ) +
ggtitle( 'standardized coefficients', subtitle = 'confidence intervals shown' )
p
Scaling with binary predictors:
#take a look at the variable pop15 for age
ggplot( aes( x = pop15 ), data = savings ) +
geom_histogram()## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The data is bimodal. We can create a binary variable for age that divides at pop15 = 35
savings$age <- ifelse( savings$pop15 > 35, 0, 1 )
A binary predictor taking the values 0/1 with equal probability has a standard deviation of one half. This suggests scaling the other continuous predictors by two SDs rather than one:
savings$dpis <- (savings$dpi - mean( savings$dpi ))/(2*sd( savings$dpi ) )
savings$ddpis <- (savings$ddpi - mean( savings$ddpi ))/(2*sd( savings$ddpi ) )
sumary( lm( sr ~ age + dpis + ddpis, savings ) )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.8176 1.0106 6.7464 2.19e-08
## age 5.2841 1.5849 3.3341 0.001697
## dpis -1.5642 1.6093 -0.9720 0.336127
## ddpis 2.4681 1.1082 2.2272 0.030866
##
## n = 50, p = 4, Residual SE = 3.79990, R-Squared = 0.32
Collinearity
exact collinearity: some predictors are exact linear combinations of others, so \(X^TX\) is singular
multicollinearity: \(X^TX\) is close to singular, but not exactly singular
detecting collinearity:
- correlation matrix values close to \(\pm 1\) indicate large pairwise collinearities
- a regression of \(x_i\) on all the other predictors gives \(R^2_i\); an \(R^2_i\) close to 1 is bad and means one predictor can be predicted by a linear combination of the others
- examine the eigenvalues of \(X^TX\): eigenvalues close to 0 denote collinearity
data( seatpos, package = 'faraway' )
seatpos_mod <- lm( hipcenter ~ ., data = seatpos )
sumary( seatpos_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 436.432128 166.571619 2.6201 0.01384
## Age 0.775716 0.570329 1.3601 0.18427
## Weight 0.026313 0.330970 0.0795 0.93718
## HtShoes -2.692408 9.753035 -0.2761 0.78446
## Ht 0.601345 10.129874 0.0594 0.95307
## Seated 0.533752 3.761894 0.1419 0.88815
## Arm -1.328069 3.900197 -0.3405 0.73592
## Thigh -1.143119 2.660024 -0.4297 0.67056
## Leg -6.439046 4.713860 -1.3660 0.18245
##
## n = 38, p = 9, Residual SE = 37.72029, R-Squared = 0.69
none of the individual predictors is significant, yet the \(R^2\) is fairly large.
now to look at the pairwise correlations
seatpos_cor <- round( cor( seatpos[,-9] ), 2 )
seatpos_cor## Age Weight HtShoes Ht Seated Arm Thigh Leg
## Age 1.00 0.08 -0.08 -0.09 -0.17 0.36 0.09 -0.04
## Weight 0.08 1.00 0.83 0.83 0.78 0.70 0.57 0.78
## HtShoes -0.08 0.83 1.00 1.00 0.93 0.75 0.72 0.91
## Ht -0.09 0.83 1.00 1.00 0.93 0.75 0.73 0.91
## Seated -0.17 0.78 0.93 0.93 1.00 0.63 0.61 0.81
## Arm 0.36 0.70 0.75 0.75 0.63 1.00 0.67 0.75
## Thigh 0.09 0.57 0.72 0.73 0.61 0.67 1.00 0.65
## Leg -0.04 0.78 0.91 0.91 0.81 0.75 0.65 1.00
ok, that’s not very intuitive.
ggcorrplot( seatpos_cor )
There are several large pairwise correlations.
now to check the eigenvalue decomposition
x <- model.matrix( seatpos_mod )[,-1]
e <- eigen( t(x) %*% x )
e$val## [1] 3.653671e+06 2.147948e+04 9.043225e+03 2.989526e+02 1.483948e+02
## [6] 8.117397e+01 5.336194e+01 7.298209e+00
sqrt( e$val[1]/e$val )## [1] 1.00000 13.04226 20.10032 110.55123 156.91171 212.15650 261.66698
## [8] 707.54911
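These ratios are the condition numbers (standard definition, noted here for context): \[\kappa_j = \sqrt{\lambda_1/\lambda_j}\] Values much larger than about 30 are commonly taken to signal troublesome collinearity, so the largest values here are a clear warning.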
compute the variance inflation factors:
vif( x )## Age Weight HtShoes Ht Seated Arm Thigh
## 1.997931 3.647030 307.429378 333.137832 8.951054 4.496368 2.762886
## Leg
## 6.694291
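For reference, the variance inflation factor for predictor \(j\) is (standard definition): \[VIF_j = \frac{1}{1 - R^2_j}\] where \(R^2_j\) comes from regressing \(x_j\) on the other predictors; the huge values for HtShoes and Ht reflect their near-perfect correlation.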
let’s look at the correlation for all the length measurements:
seatpos_length_cor <-round( cor( x[,3:8] ), 2 )
seatpos_length_cor## HtShoes Ht Seated Arm Thigh Leg
## HtShoes 1.00 1.00 0.93 0.75 0.72 0.91
## Ht 1.00 1.00 0.93 0.75 0.73 0.91
## Seated 0.93 0.93 1.00 0.63 0.61 0.81
## Arm 0.75 0.75 0.63 1.00 0.67 0.75
## Thigh 0.72 0.73 0.61 0.67 1.00 0.65
## Leg 0.91 0.91 0.81 0.75 0.65 1.00
ggcorrplot( seatpos_length_cor )
let's take a look at model performance when we use just one length predictor, height
seatpos_mod2 <- lm( hipcenter ~ Age + Weight + Ht, data = seatpos )
sumary( seatpos_mod2 )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 528.2977290 135.3129471 3.9043 0.0004257
## Age 0.5195041 0.4080385 1.2732 0.2115928
## Weight 0.0042707 0.3117199 0.0137 0.9891491
## Ht -4.2119048 0.9990562 -4.2159 0.0001737
##
## n = 38, p = 4, Residual SE = 36.48611, R-Squared = 0.66
We get nearly the same explanatory power with a much simpler model.
Model Selection
The problem of selecting the best subset of predictors.
Occam’s Razor: among several plausible explanations for a phenomenon, the simplest is the best.
Hierarchical Models
Do not remove the lower order terms in the presence of higher order terms.
Testing-based procedures
- Backward Elimination - start with the kitchen sink (all predictors). One by one, remove the predictor with the highest p-value that is \(\gt \alpha\)
- Forward Selection - start with nothing. Try a model fit with each remaining variable and add the one that gives the smallest p-value. Repeat this process until no remaining variable would enter with a small enough p-value.
- Stepwise Regression - a combination of forwards and backwards elimination.
data( state )
statedata = data.frame( state.x77, row.names = state.abb )
state_mod <- lm( Life.Exp ~ ., statedata )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0943e+01 1.7480e+00 40.5859 < 2.2e-16
## Population 5.1800e-05 2.9187e-05 1.7748 0.08318
## Income -2.1804e-05 2.4443e-04 -0.0892 0.92934
## Illiteracy 3.3820e-02 3.6628e-01 0.0923 0.92687
## Murder -3.0112e-01 4.6621e-02 -6.4590 8.68e-08
## HS.Grad 4.8929e-02 2.3323e-02 2.0979 0.04197
## Frost -5.7350e-03 3.1432e-03 -1.8246 0.07519
## Area -7.3832e-08 1.6682e-06 -0.0443 0.96491
##
## n = 50, p = 8, Residual SE = 0.74478, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Area )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.0989e+01 1.3875e+00 51.1652 < 2.2e-16
## Population 5.1883e-05 2.8788e-05 1.8023 0.07852
## Income -2.4440e-05 2.3429e-04 -0.1043 0.91740
## Illiteracy 2.8459e-02 3.4163e-01 0.0833 0.93400
## Murder -3.0182e-01 4.3344e-02 -6.9634 1.454e-08
## HS.Grad 4.8472e-02 2.0667e-02 2.3454 0.02369
## Frost -5.7758e-03 2.9702e-03 -1.9446 0.05839
##
## n = 50, p = 7, Residual SE = 0.73608, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Illiteracy )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.1066e+01 1.0289e+00 69.0669 < 2.2e-16
## Population 5.1149e-05 2.7095e-05 1.8878 0.06566
## Income -2.4771e-05 2.3160e-04 -0.1070 0.91531
## Murder -3.0001e-01 3.7042e-02 -8.0992 2.907e-10
## HS.Grad 4.7758e-02 1.8591e-02 2.5689 0.01367
## Frost -5.9099e-03 2.4678e-03 -2.3948 0.02095
##
## n = 50, p = 6, Residual SE = 0.72773, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Income )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.02712853 0.95285296 74.5415 < 2.2e-16
## Population 0.00005014 0.00002512 1.9960 0.052005
## Murder -0.30014880 0.03660946 -8.1987 1.775e-10
## HS.Grad 0.04658225 0.01482706 3.1417 0.002968
## Frost -0.00594329 0.00242087 -2.4550 0.018018
##
## n = 50, p = 5, Residual SE = 0.71969, R-Squared = 0.74
remove the predictor with the highest p-val:
state_mod <- update( state_mod, . ~ . - Population )
sumary( state_mod )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 71.0363788 0.9832622 72.2456 < 2.2e-16
## Murder -0.2830652 0.0367313 -7.7064 8.039e-10
## HS.Grad 0.0499487 0.0152011 3.2859 0.001950
## Frost -0.0069117 0.0024475 -2.8240 0.006988
##
## n = 50, p = 4, Residual SE = 0.74267, R-Squared = 0.71
However, it is important to understand that the variables omitted from the model may still be related to the response.
Drawbacks to step-based procedures
- it is possible to miss the optimal model
- the p-vals should not be treated too literally
- variables that get dropped may still be correlated to the response of interest
- step-wise variable selection methods tend to pick models that are smaller than desirable for prediction purposes
Criterion-Based Procedures
Choosing a model that optimizes a criterion.
Kullback-Leibler information (distance): \(I(f,g)=\int f(x)\,\mbox{log}\left(\frac{f(x)}{g(x|\theta)}\right)dx\)
but we don't know \(f\). Instead, use the Akaike Information Criterion: \(AIC = -2L(\hat{\theta}) + 2p\)
Choose the model that minimizes the AIC. AIC naturally balances between fit and simplicity in model selection
Bayesian Information Criterion: \(BIC = -2L(\hat{\theta}) + p\,\mbox{log}(n)\), which penalizes larger models more heavily than AIC.
b <- regsubsets( Life.Exp ~ ., data = statedata )
rs <- summary( b )
rs$which## (Intercept) Population Income Illiteracy Murder HS.Grad Frost Area
## 1 TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE
## 2 TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
## 3 TRUE FALSE FALSE FALSE TRUE TRUE TRUE FALSE
## 4 TRUE TRUE FALSE FALSE TRUE TRUE TRUE FALSE
## 5 TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
## 6 TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## 7 TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
AIC <- 50*log(rs$rss/50)+(2:8)*2
plot( AIC ~ I(1:7), ylab = 'AIC', xlab="Number of Predictors")
AIC is minimized by a choice of 4 predictors
Using \(R^2_{adj}\)
plot( 2:8, rs$adjr2, xlab = 'No. of parameters', ylab = 'Adjusted R-squared')
which.max( rs$adjr2 )## [1] 4
Mallow’s Cp statistic
plot( 2:8, rs$cp, xlab = 'No. of parameters', ylab = 'Cp statistic')
abline( 0,1 )
step()
state_mod <- lm( Life.Exp ~ ., data = statedata )
step( state_mod )## Start: AIC=-22.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost + Area
##
## Df Sum of Sq RSS AIC
## - Area 1 0.0011 23.298 -24.182
## - Income 1 0.0044 23.302 -24.175
## - Illiteracy 1 0.0047 23.302 -24.174
## <none> 23.297 -22.185
## - Population 1 1.7472 25.044 -20.569
## - Frost 1 1.8466 25.144 -20.371
## - HS.Grad 1 2.4413 25.738 -19.202
## - Murder 1 23.1411 46.438 10.305
##
## Step: AIC=-24.18
## Life.Exp ~ Population + Income + Illiteracy + Murder + HS.Grad +
## Frost
##
## Df Sum of Sq RSS AIC
## - Illiteracy 1 0.0038 23.302 -26.174
## - Income 1 0.0059 23.304 -26.170
## <none> 23.298 -24.182
## - Population 1 1.7599 25.058 -22.541
## - Frost 1 2.0488 25.347 -21.968
## - HS.Grad 1 2.9804 26.279 -20.163
## - Murder 1 26.2721 49.570 11.569
##
## Step: AIC=-26.17
## Life.Exp ~ Population + Income + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## - Income 1 0.006 23.308 -28.161
## <none> 23.302 -26.174
## - Population 1 1.887 25.189 -24.280
## - Frost 1 3.037 26.339 -22.048
## - HS.Grad 1 3.495 26.797 -21.187
## - Murder 1 34.739 58.041 17.456
##
## Step: AIC=-28.16
## Life.Exp ~ Population + Murder + HS.Grad + Frost
##
## Df Sum of Sq RSS AIC
## <none> 23.308 -28.161
## - Population 1 2.064 25.372 -25.920
## - Frost 1 3.122 26.430 -23.877
## - HS.Grad 1 5.112 28.420 -20.246
## - Murder 1 34.816 58.124 15.528
##
## Call:
## lm(formula = Life.Exp ~ Population + Murder + HS.Grad + Frost,
## data = statedata)
##
## Coefficients:
## (Intercept) Population Murder HS.Grad Frost
## 7.103e+01 5.014e-05 -3.001e-01 4.658e-02 -5.943e-03
The sequence of variable removal is the same as with backward elimination and the model selected is the same as for AIC.
Check for high leverage:
h <- lm.influence( state_mod )$hat
names(h) <- state.abb
rev( sort( h ) )## AK CA HI NV NM TX NY
## 0.80952226 0.40885689 0.37876168 0.36524576 0.32472157 0.28416353 0.25694978
## WA OR ND LA CT UT RI
## 0.22268196 0.22183076 0.21968606 0.19494860 0.19362508 0.19097575 0.17082010
## MD AL AZ MS FL IL PA
## 0.16407127 0.16110483 0.15890792 0.15571602 0.14856937 0.13743337 0.13209846
## NJ SD ME SC MI WY VT
## 0.12993167 0.12528365 0.12181903 0.11757971 0.11725238 0.11638093 0.11504122
## MA GA MO KY AR CO WV
## 0.11274391 0.11028889 0.11026698 0.10909216 0.10447656 0.10250875 0.09964990
## NC DE NH ID OH MT MN
## 0.09361765 0.09322149 0.08980829 0.08755753 0.08751277 0.08638978 0.07617370
## TN WI VA OK IA NE KS
## 0.07006383 0.06835416 0.06393987 0.06349971 0.06200255 0.05749338 0.05537644
## IN
## 0.05198210
Alaska has high leverage. retry fitting without Alaska:
b <- regsubsets( Life.Exp ~ ., data = statedata, subset = ( state.abb!="AK"))
rs <- summary( b )
rs$which[ which.max( rs$adjr2 ),]## (Intercept) Population Income Illiteracy Murder HS.Grad
## TRUE TRUE FALSE FALSE TRUE TRUE
## Frost Area
## TRUE TRUE
Area is now included in the model
Now observe the data and see if transformations are necessary
stripchart( data.frame( scale( statedata ) ), method = 'jitter', las = 2, vertical = TRUE )
Both Area and Population are skewed; re-evaluate the model with log transformations of these predictors:
b <- regsubsets( Life.Exp ~ log( Population ) + Income + Illiteracy + Murder + HS.Grad + Frost + log( Area ), data = statedata )
rs <- summary( b )
rs$which[which.max( rs$adjr2),]## (Intercept) log(Population) Income Illiteracy Murder
## TRUE TRUE FALSE FALSE TRUE
## HS.Grad Frost log(Area)
## TRUE TRUE FALSE
While selecting models, think about:
- Do the models have similar qualitative consequences?
- Do they make similar predictions?
- What is the cost of measuring the predictors?
- Which has the best diagnostics?
Shrinkage Methods
More predictors in a model mean more information. This chapter looks at how to shrink the additional information into a more useful (interpretable) form.
Principal Components
Principal Component Analysis (PCA): finding low-dimensional linear structure in higher dimensional data.
find the orthogonal directions of the greatest variation in the data
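Concretely (standard formulation), the principal components are the linear combinations \(z_j = Xu_j\), where the \(u_j\) are the eigenvectors of the (scaled) covariance matrix of \(X\), ordered by decreasing eigenvalue, so the first few components capture most of the variance.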
cfat <- fat[,9:18]
prfat <- prcomp( cfat, scale = TRUE )
dim( prfat$rot )## [1] 10 10
dim( prfat$x )## [1] 252 10
summary( prfat )## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.6498 0.85301 0.81909 0.70114 0.54708 0.52831 0.45196
## Proportion of Variance 0.7021 0.07276 0.06709 0.04916 0.02993 0.02791 0.02043
## Cumulative Proportion 0.7021 0.77490 0.84199 0.89115 0.92108 0.94899 0.96942
## PC8 PC9 PC10
## Standard deviation 0.40539 0.27827 0.2530
## Proportion of Variance 0.01643 0.00774 0.0064
## Cumulative Proportion 0.98586 0.99360 1.0000
round( prfat$rot[,1],2 )## neck chest abdom hip thigh knee ankle biceps forearm wrist
## 0.33 0.34 0.33 0.35 0.33 0.33 0.25 0.32 0.27 0.30
Mahalanobis Distance: is a measure of the distance of a point from the mean that adjusts for the correlation in the data.
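For reference (standard definition), the Mahalanobis distance of observation \(x_i\) from a center \(\hat{\mu}\) with covariance estimate \(\hat{\Sigma}\) is \[d_i = \sqrt{(x_i - \hat{\mu})^T\hat{\Sigma}^{-1}(x_i - \hat{\mu})}\] Here robust estimates of the center and covariance from cov.rob() are used.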
robfat <- cov.rob( cfat )
md <- mahalanobis( cfat, center = robfat$center, cov=robfat$cov )
n <- nrow( cfat ); p <- ncol( cfat )
plot( qchisq( 1:n/(n+1), p), sort(md),
xlab = expression( paste( chi^2,"quantiles")),
ylab = "Sorted Mahalanobis Distances")
abline( 0,1 )
use the 1st 2 PCs to model the data
lmodpcr <- lm( fat$brozek ~ prfat$x[,1:2] )
sumary( lmodpcr )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.93849 0.32913 57.5416 < 2.2e-16
## prfat$x[, 1:2]PC1 1.84198 0.12446 14.8003 < 2.2e-16
## prfat$x[, 1:2]PC2 -3.55053 0.38661 -9.1837 < 2.2e-16
##
## n = 252, p = 3, Residual SE = 5.22473, R-Squared = 0.55
Take a few representative predictors based on the largest coefficients seen in the PCs: from the first two PCs, go with abdomen and the contrast (ankle - abdomen). Abdomen wasn't the largest coefficient in PC1, but using it keeps the model down to two features, whereas going with the largest (hip) would add a third predictor.
lmodr <- lm( fat$brozek ~ scale( abdom ) + I( scale( ankle ) - scale( abdom ) ), data = cfat )
sumary( lmodr )## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.93849 0.27937 67.7891 < 2.2e-16
## scale(abdom) 5.76286 0.32840 17.5485 < 2.2e-16
## I(scale(ankle) - scale(abdom)) -0.99504 0.31403 -3.1686 0.001723
##
## n = 252, p = 3, Residual SE = 4.43492, R-Squared = 0.68
There we go: we have a simple model that fits as well as the 10 predictor model. We can interpret it similarly to the previous model but it is easier to explain to others. Future studies might be done more cheaply because we might only need these two measures.
Build a model to predict the fat content of new samples using the 100 absorbances which can be measured more easily:
data( meatspec, package = 'faraway' )
#glimpse( meatspec )
trainmeat <- meatspec[ 1:172, ]
testmeat <- meatspec[ 173: 215,]
meat_mod <- lm( fat ~ ., trainmeat )
# r^2 for the kitchen sink
summary( meat_mod )$r.squared## [1] 0.9970196
compare the rmse for the train and test sets
rmse <- function( x,y ) sqrt( mean( (x-y )^2 ) )
rmse( fitted( meat_mod ), trainmeat$fat )## [1] 0.6903167
rmse( predict( meat_mod, testmeat ), testmeat$fat )## [1] 3.814
Performance is much worse for the test data.
Okay, so there are 100 features in this dataset. We probably don't need them all. Next try a stepwise feature selection method, sketched below:
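A minimal sketch of one way to do this, assuming backward elimination by AIC via step() (the retained-feature count and RMSE will differ somewhat from the figures quoted below):
meat_step <- step( meat_mod, trace = 0 )              #backward elimination by AIC
length( coef( meat_step ) ) - 1                       #number of retained predictors
rmse( predict( meat_step, testmeat ), testmeat$fat )  #test-set RMSE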
About 28 features were dropped and performance improved a bit.
Now let’s go with PCR
meatpca <- prcomp( trainmeat[, -101 ] )
round( meatpca$sdev, 3 )## [1] 5.055 0.511 0.282 0.168 0.038 0.025 0.014 0.011 0.005 0.003 0.002 0.002
## [13] 0.001 0.001 0.001 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [25] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [37] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [49] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [61] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [73] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [85] 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000
## [97] 0.000 0.000 0.000 0.000
matplot( 1:100, meatpca$rot[,1:3], type = 'l', xlab = 'Frequency', ylab = '', col = 1 )
PCR
meatpcr <- pcr( fat ~ ., data = trainmeat, ncomp = 50 )
rmse( predict( meatpcr, ncomp = 4 ), trainmeat$fat )## [1] 4.064745
par(mfrow=c(1,2))
#plot the 100 slope coefs for the kitchen sink
plot( meat_mod$coef[-1], xlab = 'Frequency', ylab = 'Coefficient', type = 'l' )
coefplot( meatpcr, ncomp = 4, xlab = 'Frequency', main = '' )
we used 4 PCs, why? How to select the number: a scree plot (plot of the component SDs)
plot( meatpca$sdev[ 1:10 ], type = 'l', ylab = 'SD of PC', xlab = 'PC number' )
General rule of thumb: use PCs up to the last 'elbow' in the scree plot
How well does 4 PCs work with the test data?
rmse( predict( meatpcr, testmeat, ncomp = 4 ), testmeat$fat )## [1] 4.533982
pcrmse <- RMSEP( meatpcr, newdata = testmeat )
plot( pcrmse, main = '' )
which.min( pcrmse$val )## [1] 28
pcrmse$val[ 28 ]## [1] 1.854858
The best result occurs for 28 PCs.
Cross-validation
set.seed( 123 )
meatpcr <- pcr( fat ~ ., data = trainmeat, validation = 'CV', ncomp = 50 )
meatCV <- RMSEP( meatpcr, estimate = 'CV' )
plot( meatCV, main = '' )
minPCs <- which.min( meatCV$val )
ypred <- predict( meatpcr, testmeat, ncomp = minPCs )
rmse( ypred, testmeat$fat )## [1] 2.28158
This gives a much improved RMSE.
Partial Least Squares
a method for relating a set of input variables to outputs; unlike PCR, the latent components are constructed using the response as well as the predictors
apply PLS using CV to select components:
par(mfrow=c(1,2))
set.seed( 123 )
meatpls <- plsr( fat ~ ., data = meatspec[1:172,], ncomp=50, validation = "CV" )
coefplot( meatpls, ncomp = 4, xlab = "Frequency", main = "" )
plsCV <- RMSEP( meatpls, estimate = 'CV' )
plot( plsCV, main="" )
From the right-hand figure, we see we need around 15 components
see how it performs on the train set:
ypred <- predict( meatpls, ncomp = 15 )
rmse( ypred, trainmeat$fat )## [1] 1.889838
see how it performs on the test set:
ytpred <- predict( meatpls, testmeat, ncomp = 15 )
rmse( ytpred, testmeat$fat )## [1] 1.971828
We did slightly better than PCR.
Ridge Regression
Suppose that the predictors have been centered by their means and scaled by their standard deviations and that the response has been centered.
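Ridge regression shrinks the coefficients by penalizing their squared size (standard formulation, stated here for context): it minimizes \(\sum_i (y_i - x_i^T\beta)^2 + \lambda\sum_j\beta_j^2\), giving \[\hat{\beta}_{ridge} = (X^TX + \lambda I)^{-1}X^Ty\] A minimal sketch of how a ridge fit to the meat spectroscopy training data might look using lm.ridge() from MASS; the lambda grid here is purely illustrative:
ridge_mod <- lm.ridge( fat ~ ., trainmeat, lambda = seq( 0, 5e-8, len = 21 ) )
matplot( ridge_mod$lambda, coef( ridge_mod ), type = 'l' )  #coefficient paths as lambda grows
MASS::select( ridge_mod )                                   #HKB, LW and GCV suggestions for lambda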
Lasso Regression
Lasso regression is pretty good when there are lots of predictors. Lasso is particularly good when we believe the effects are sparse; that few of the predictors have an effect.
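The lasso replaces the ridge penalty with an \(\ell_1\) constraint (standard formulation): minimize \(\sum_i (y_i - x_i^T\beta)^2\) subject to \(\sum_j |\beta_j| \le t\). Shrinking \(t\) sets some coefficients exactly to zero, which is why the lasso also performs variable selection.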
trainy <- trainmeat$fat
trainx <- as.matrix( trainmeat[,-101] )
meatlasso <- lars( trainx, trainy )
now the CV
set.seed( 123 )
cvout <- cv.lars( trainx, trainy )
cvout$index[ which.min( cvout$cv ) ]## [1] 0.01010101
testx <- as.matrix( testmeat[,-101] )
predlars <- predict( meatlasso, testx, s=0.0101, mode='fraction')
rmse( testmeat$fat, predlars$fit )## [1] 2.132265
That beats the PCR results, though PLS remains slightly better.
predlars <- predict( meatlasso, s = 0.0101, type = "coef", mode="fraction" )
plot( predlars$coef, type = 'h', ylab="Coefficient" )
sum( predlars$coef != 0 )## [1] 20
Only 20 of the 100 predictors receive nonzero coefficients.