Load Packages and Dataset
Packages used to import and analyse the data include
broom - elegant view of the linear regression model’s results
sigr - succinct and correct stat summaries for reports
dplyr - for manipulating data
WVplots - for plotting gain curve (for model evaluation)
vtreat - for cross-validation on data
A simple demo on LM
note that LM was already covered in previous course so below coding parts are relatively straightforward with limited commentaries.
a couple of notes: instead of writing the y~x inside the lm model, we can also define their relationships in as.formula
lm function is from the stats package which is by default loaded
summary(unemployment)## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
fmla <- as.formula(female_unemployment ~ male_unemployment)
fmla## female_unemployment ~ male_unemployment
unemployment_model <- lm(fmla, data = unemployment)
unemployment_model##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
Examining a model
# Print unemployment_model
unemployment_model##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Coefficients:
## (Intercept) male_unemployment
## 1.4341 0.6945
# Call summary() on unemployment_model to get more details
summary(unemployment_model)##
## Call:
## lm(formula = fmla, data = unemployment)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.77621 -0.34050 -0.09004 0.27911 1.31254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.43411 0.60340 2.377 0.0367 *
## male_unemployment 0.69453 0.09767 7.111 1.97e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared: 0.8213, Adjusted R-squared: 0.8051
## F-statistic: 50.56 on 1 and 11 DF, p-value: 1.966e-05
# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.821 0.805 0.580 50.6 0.0000197 1 -10.3 26.6 28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."
model itself reports coefficients and the model build (dependent and independent variables)
summary function provides more details on fitness of the model
glance - get the details in a tidier form
wrapFTest (within the sigr package) provides the results from F Test
Making some predictions
- note that we used newrates in the prediction to see how much female_unemployment rate is when there is a given unemployment rate for male.
# unemployment is in your workspace
summary(unemployment)## male_unemployment female_unemployment
## Min. :2.900 Min. :4.000
## 1st Qu.:4.900 1st Qu.:4.400
## Median :6.000 Median :5.200
## Mean :5.954 Mean :5.569
## 3rd Qu.:6.700 3rd Qu.:6.100
## Max. :9.800 Max. :7.900
# newrates is in your workspace
newrates## male_unemployment
## 1 5
# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)
# load the ggplot2 package
library(ggplot2)
# Make a plot to compare predictions to actual (prediction on x axis).
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) +
geom_point() +
geom_abline(color = "blue")# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred## 1
## 4.906757
Multivariate linear regression
This part was covered in previous exercise too - therefore the walk-through is relatively simple below.
head(bloodpressure)## blood_pressure age weight
## 1 132 52 173
## 2 143 59 184
## 3 153 67 194
## 4 162 73 211
## 5 154 64 196
## 6 168 74 220
summary(bloodpressure)## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
Fitting the model
# bloodpressure is in the workspace
summary(bloodpressure)## blood_pressure age weight
## Min. :128.0 Min. :46.00 Min. :167
## 1st Qu.:140.0 1st Qu.:56.50 1st Qu.:186
## Median :153.0 Median :64.00 Median :194
## Mean :150.1 Mean :62.45 Mean :195
## 3rd Qu.:160.5 3rd Qu.:69.50 3rd Qu.:209
## Max. :168.0 Max. :74.00 Max. :220
# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla## blood_pressure ~ age + weight
# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)
# Print bloodpressure_model and call summary()
bloodpressure_model##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Coefficients:
## (Intercept) age weight
## 30.9941 0.8614 0.3349
summary(bloodpressure_model)##
## Call:
## lm(formula = fmla, data = bloodpressure)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4640 -1.1949 -0.4078 1.8511 2.6981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 30.9941 11.9438 2.595 0.03186 *
## age 0.8614 0.2482 3.470 0.00844 **
## weight 0.3349 0.1307 2.563 0.03351 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared: 0.9768, Adjusted R-squared: 0.9711
## F-statistic: 168.8 on 2 and 8 DF, p-value: 2.874e-07
Visualization
# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)
# plot the results
ggplot(bloodpressure, aes(prediction, blood_pressure)) +
geom_point() +
geom_abline(color = "blue")Evaluating a model graphically
plot ground truth vs predictions
the residual plot
the gain curve
A demo
Make a prediction
unemployment$predictions <- predict(unemployment_model, unemployment)Plot against the ground truth
ggplot(unemployment, aes(predictions, female_unemployment)) +
geom_point() +
geom_abline()calculate the residuals (actual results - predicted)
unemployment <- unemployment %>%
mutate(residuals = female_unemployment - predictions)Plot the residuals - evaluate model predictions by comparing them to ground truth and by examining prediction error.
ggplot(unemployment, aes(predictions, residuals)) +
geom_pointrange(aes(ymin = 0, ymax = residuals)) +
geom_hline(yintercept = 0, linetype = 3) +
ggtitle("residuals vs. linear model prediction")Gain curve to evaluate the model
For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.
GainCurvePlot(frame, xvar, truthvar, title)
frame - dataframe
xvar, truthvar - strings naming the prediction and actual outcome columns of the frame. please note both of them are STRINGS hence quoted in quotes which in my opinion does not make sense.
title - title of the plot
GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")- a relative gini coefficient close to one shows that model correctly sorts high unemployment situations from lower ones.
Root mean squared error RMSE
- typical prediction error of the prediction model on the data
rename residual - use dplyr’s rename function rename(df, newname = oldname)
unemployment <- unemployment %>%
rename(res = residuals)calculate the RMSE, compare it to standard deviation of female_unemployment.
unemployment %>%
summarize(rmse = sqrt(mean(res^2)), std = sd(female_unemployment))## rmse std
## 1 0.5337612 1.314271
- an RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.
R Squared
R_prep <- unemployment %>%
summarize(fe_mean = mean(female_unemployment),
tss = sum((female_unemployment- fe_mean)^2),
rss = sum(res^2),
rho = cor(predictions, female_unemployment))
R_prep$rho^2## [1] 0.8213157
The linear correlation of two variables, x and y measures the strength of the linear relationship between them. When x and y are respectively:
the outcomes of a regression model that minimizes squared-error (i.e. linear regression) and
the true outcomes of the training data
Then the square of the correlation is the same as \(R^2\) . Below shows the verification.
R_prep %>%
summarize(rsq = 1 - rss/tss)## rsq
## 1 0.8213157
Get R-squared from glance for checking.
(rsq_glance <- glance(unemployment_model)$r.squared)## [1] 0.8213157
The calculated R is the same as the one from glance. and an R-squared close to one suggests a model that predicts well.
Properly Training a Model
over-fitting: if high \(R^2\) results on training data but rather low \(R^2\) results on test data
split the dataset to two - randomly select rows - for training and for testing datasets
if not enough data for splitting into training and test - use cross-validation to estimate a model’s out of sample performance. need a package called vtreat.
n-fold cross-validation, partition the data into n-subsets
for three partitions A, B, and C for example, train a model on A and B to predict C; train a model on A and C to predict B, and train a model on B and C to predict A.
Split of datasets
We use the mpg dataset from ggplot2 package.
summary(mpg)## manufacturer model displ year
## Length:234 Length:234 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :2004
## Mean :3.472 Mean :2004
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:234 Length:234 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.889 Mean :16.86
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :35.00
## hwy fl class
## Min. :12.00 Length:234 Length:234
## 1st Qu.:18.00 Class :character Class :character
## Median :24.00 Mode :character Mode :character
## Mean :23.44
## 3rd Qu.:27.00
## Max. :44.00
dim(mpg)## [1] 234 11
(N <- nrow(mpg))## [1] 234
Calculate how many rows 75% of N should be. Round() is used here to get an integer result.
target <- round(0.75 * N, 0)Create vector of N uniform random variables: gp
gp <- runif(N)Create the 75%/25% split for training and testing datasets
mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]Examine the number of rows in each datasets. Additional note is that a random split won’t always produce sets of exactly X% and (100-X)% of the data (due to rounding), but it should be close.
nrow(mpg_train)## [1] 175
nrow(mpg_test)## [1] 59
Train a model using test/train split
summary(mpg_train)## manufacturer model displ year
## Length:175 Length:175 Min. :1.600 Min. :1999
## Class :character Class :character 1st Qu.:2.400 1st Qu.:1999
## Mode :character Mode :character Median :3.300 Median :1999
## Mean :3.457 Mean :2003
## 3rd Qu.:4.600 3rd Qu.:2008
## Max. :7.000 Max. :2008
## cyl trans drv cty
## Min. :4.000 Length:175 Length:175 Min. : 9.00
## 1st Qu.:4.000 Class :character Class :character 1st Qu.:14.00
## Median :6.000 Mode :character Mode :character Median :17.00
## Mean :5.817 Mean :16.93
## 3rd Qu.:8.000 3rd Qu.:19.00
## Max. :8.000 Max. :33.00
## hwy fl class
## Min. :12.00 Length:175 Length:175
## 1st Qu.:18.00 Class :character Class :character
## Median :25.00 Mode :character Mode :character
## Mean :23.62
## 3rd Qu.:27.00
## Max. :44.00
create a formula to express y ~ x (cty: city fuel efficiency; hwy: highway fuel efficiency)
(fmla <- as.formula(cty ~ hwy))## cty ~ hwy
use lm to build a model from mpg_train that predicts cty from hwy
mpg_model <- lm(fmla, data = mpg_train)
summary(mpg_model)##
## Call:
## lm(formula = fmla, data = mpg_train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8445 -0.8159 -0.1079 0.7032 2.6096
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.33686 0.37731 3.543 0.000508 ***
## hwy 0.66031 0.01552 42.550 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.186 on 173 degrees of freedom
## Multiple R-squared: 0.9128, Adjusted R-squared: 0.9123
## F-statistic: 1811 on 1 and 173 DF, p-value: < 2.2e-16
Application of the model
Define a few functions for convenience of calculating intermediate results
rmse <- function(predcol, ycol) {
res = predcol - ycol
sqrt(mean(res^2))
}r_squared <- function(predcol, ycol) {
tss = sum((ycol - mean(ycol))^2)
rss = sum((predcol - ycol)^2)
1 - rss/tss
}Predict cty from hwy for both training and test set
mpg_train$pred <- predict(mpg_model, mpg_train)
mpg_test$pred <- predict(mpg_model, mpg_test)Use rmse to evaluate rmse for both the test and training sets to check if performance is similar.
(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))## [1] 1.179452
(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))## [1] 1.457028
Evaluate the r-squared on both training and test data.
(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))## [1] 0.912781
(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))## [1] 0.9121356
plot the predictions against the outcome on the test data
ggplot(mpg_test, aes(x = pred, y = cty)) +
geom_point(color = "red") +
geom_abline(linetype = 3)Create a cross validation plan
there are many ways to implement an n-fold cross validation plan. we will use vtreat to create one and examine it.
kWayCrossValidation() creates a cross validation plan with the following call:
splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)
where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.
Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.
The resulting splitPlan is a list of nSplits elements; each element contains two vectors:
train: the indices ofdframethat will form the training setapp: the indices ofdframethat will form the test (or application) set
In below exercise we will create a 3-fold cross-validation plan for the data set mpg.
Coding of the plan
get number of rows in mpg
nRows <- nrow(mpg)implement 3-fold cross-fold plan with vtreat. two other arguments are not used here so just set as NULL.
splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)
str(splitPlan)## List of 3
## $ :List of 2
## ..$ train: int [1:156] 2 3 5 8 9 10 11 12 13 14 ...
## ..$ app : int [1:78] 92 32 154 181 36 210 221 161 193 231 ...
## $ :List of 2
## ..$ train: int [1:156] 1 2 4 5 6 7 8 10 11 12 ...
## ..$ app : int [1:78] 58 179 136 121 112 86 149 54 187 230 ...
## $ :List of 2
## ..$ train: int [1:156] 1 3 4 6 7 9 13 14 15 16 ...
## ..$ app : int [1:78] 26 17 124 234 27 11 47 170 5 141 ...
## - attr(*, "splitmethod")= chr "kwaycross"
Evaluating a modelling procedure using n-fold cross-validation
we will use splitPlan, the 3-fold cross validation plan to make predictions from a model that predicts mpg\$cty from mpg$\hwy.
run a 3-fold cross validation plan from splitplan - this seems to be overly complicated (thought there must be a more elegant way to carry out a cross-validation).
k <- 3 # Number of folds
mpg$pred.cv <- 0
for(i in 1:k) {
split <- splitPlan[[i]]
model <- lm(cty ~ hwy, data = mpg[split$train, ])
mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}Predict from a full model
mpg$pred <- predict(lm (cty ~ hwy, data = mpg))rmse of the full model’s predictions
rmse(mpg$pred, mpg$cty)## [1] 1.247045
rmse of the cross-validation predictions
rmse(mpg$pred.cv, mpg$cty)## [1] 1.249317
Note that cross-validation validates the modeling process, not the actual model.