ML_Regression Recap_Supervised Learning

Load Packages and Dataset

Packages used to import and analyse the data include

broom - elegant view of the linear regression model’s results
sigr - succinct and correct stat summaries for reports
dplyr - for manipulating data
WVplots - for plotting gain curve (for model evaluation)
vtreat - for cross-validation on data

A simple demo on LM

note that LM was already covered in previous course so below coding parts are relatively straightforward with limited commentaries.
a couple of notes: instead of writing the y~x inside the lm model, we can also define their relationships in as.formula
lm function is from the stats package which is by default loaded

summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

fmla <- as.formula(female_unemployment ~ male_unemployment)
fmla

## female_unemployment ~ male_unemployment

unemployment_model <- lm(fmla, data = unemployment)
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

Examining a model

# Print unemployment_model
unemployment_model

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Coefficients:
##       (Intercept)  male_unemployment  
##            1.4341             0.6945

# Call summary() on unemployment_model to get more details
summary(unemployment_model)

## 
## Call:
## lm(formula = fmla, data = unemployment)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.77621 -0.34050 -0.09004  0.27911  1.31254 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.43411    0.60340   2.377   0.0367 *  
## male_unemployment  0.69453    0.09767   7.111 1.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5803 on 11 degrees of freedom
## Multiple R-squared:  0.8213, Adjusted R-squared:  0.8051 
## F-statistic: 50.56 on 1 and 11 DF,  p-value: 1.966e-05

# Call glance() on unemployment_model to see the details in a tidier form
glance(unemployment_model)

## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic   p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>     <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.821         0.805 0.580      50.6 0.0000197     1  -10.3  26.6  28.3
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

# Call wrapFTest() on unemployment_model to see the most relevant details
wrapFTest(unemployment_model)

## [1] "F Test summary: (R2=0.8213, F(1,11)=50.56, p=1.966e-05)."

model itself reports coefficients and the model build (dependent and independent variables)
summary function provides more details on fitness of the model
glance - get the details in a tidier form
wrapFTest (within the sigr package) provides the results from F Test

Making some predictions

note that we used newrates in the prediction to see how much female_unemployment rate is when there is a given unemployment rate for male.

# unemployment is in your workspace
summary(unemployment)

##  male_unemployment female_unemployment
##  Min.   :2.900     Min.   :4.000      
##  1st Qu.:4.900     1st Qu.:4.400      
##  Median :6.000     Median :5.200      
##  Mean   :5.954     Mean   :5.569      
##  3rd Qu.:6.700     3rd Qu.:6.100      
##  Max.   :9.800     Max.   :7.900

# newrates is in your workspace
newrates

##   male_unemployment
## 1                 5

# Predict female unemployment in the unemployment data set
unemployment$prediction <- predict(unemployment_model, unemployment)

# load the ggplot2 package
library(ggplot2)

# Make a plot to compare predictions to actual (prediction on x axis). 
ggplot(unemployment, aes(x = prediction, y = female_unemployment)) + 
  geom_point() +
  geom_abline(color = "blue")

# Predict female unemployment rate when male unemployment is 5%
pred <- predict(unemployment_model, newrates)
# Print it
pred

##        1 
## 4.906757

Multivariate linear regression

This part was covered in previous exercise too - therefore the walk-through is relatively simple below.

head(bloodpressure)

##   blood_pressure age weight
## 1            132  52    173
## 2            143  59    184
## 3            153  67    194
## 4            162  73    211
## 5            154  64    196
## 6            168  74    220

summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

Fitting the model

# bloodpressure is in the workspace
summary(bloodpressure)

##  blood_pressure       age            weight   
##  Min.   :128.0   Min.   :46.00   Min.   :167  
##  1st Qu.:140.0   1st Qu.:56.50   1st Qu.:186  
##  Median :153.0   Median :64.00   Median :194  
##  Mean   :150.1   Mean   :62.45   Mean   :195  
##  3rd Qu.:160.5   3rd Qu.:69.50   3rd Qu.:209  
##  Max.   :168.0   Max.   :74.00   Max.   :220

# Create the formula and print it
fmla <- blood_pressure ~ age + weight
fmla

## blood_pressure ~ age + weight

# Fit the model: bloodpressure_model
bloodpressure_model <- lm(fmla, data = bloodpressure)

# Print bloodpressure_model and call summary() 
bloodpressure_model

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Coefficients:
## (Intercept)          age       weight  
##     30.9941       0.8614       0.3349

summary(bloodpressure_model)

## 
## Call:
## lm(formula = fmla, data = bloodpressure)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4640 -1.1949 -0.4078  1.8511  2.6981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  30.9941    11.9438   2.595  0.03186 * 
## age           0.8614     0.2482   3.470  0.00844 **
## weight        0.3349     0.1307   2.563  0.03351 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.318 on 8 degrees of freedom
## Multiple R-squared:  0.9768, Adjusted R-squared:  0.9711 
## F-statistic: 168.8 on 2 and 8 DF,  p-value: 2.874e-07

Visualization

# predict blood pressure using bloodpressure_model :prediction
bloodpressure$prediction <- predict(bloodpressure_model, bloodpressure)

# plot the results
ggplot(bloodpressure, aes(prediction, blood_pressure)) + 
    geom_point() +
    geom_abline(color = "blue")

Evaluating a model graphically

plot ground truth vs predictions
the residual plot
the gain curve

A demo

Make a prediction

unemployment$predictions <- predict(unemployment_model, unemployment)

Plot against the ground truth

ggplot(unemployment, aes(predictions, female_unemployment)) +
  geom_point() +
  geom_abline()

calculate the residuals (actual results - predicted)

unemployment <- unemployment %>% 
  mutate(residuals = female_unemployment - predictions)

Plot the residuals - evaluate model predictions by comparing them to ground truth and by examining prediction error.

ggplot(unemployment, aes(predictions, residuals)) +
  geom_pointrange(aes(ymin = 0, ymax = residuals)) +
  geom_hline(yintercept = 0, linetype = 3) +
  ggtitle("residuals vs. linear model prediction")

Gain curve to evaluate the model

For situations where order is more important than exact values, the gain curve helps you check if the model’s predictions sort in the same order as the true outcome.

GainCurvePlot(frame, xvar, truthvar, title)

frame - dataframe
xvar, truthvar - strings naming the prediction and actual outcome columns of the frame. please note both of them are STRINGS hence quoted in quotes which in my opinion does not make sense.
title - title of the plot

GainCurvePlot(unemployment, "predictions", "female_unemployment", "Unemployment model")

a relative gini coefficient close to one shows that model correctly sorts high unemployment situations from lower ones.

Root mean squared error RMSE

typical prediction error of the prediction model on the data

rename residual - use dplyr’s rename function rename(df, newname = oldname)

unemployment <- unemployment %>% 
  rename(res = residuals)

calculate the RMSE, compare it to standard deviation of female_unemployment.

unemployment %>% 
 summarize(rmse = sqrt(mean(res^2)), std = sd(female_unemployment))

##        rmse      std
## 1 0.5337612 1.314271

an RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.

R Squared

R_prep <- unemployment %>% 
  summarize(fe_mean = mean(female_unemployment),
            tss = sum((female_unemployment- fe_mean)^2), 
            rss = sum(res^2), 
            rho = cor(predictions, female_unemployment))
R_prep$rho^2

## [1] 0.8213157

The linear correlation of two variables, x and y measures the strength of the linear relationship between them. When x and y are respectively:

the outcomes of a regression model that minimizes squared-error (i.e. linear regression) and
the true outcomes of the training data

Then the square of the correlation is the same as $R^2$ . Below shows the verification.

R_prep %>% 
  summarize(rsq = 1 - rss/tss)

##         rsq
## 1 0.8213157

Get R-squared from glance for checking.

(rsq_glance <- glance(unemployment_model)$r.squared)

## [1] 0.8213157

The calculated R is the same as the one from glance. and an R-squared close to one suggests a model that predicts well.

Properly Training a Model

over-fitting: if high $R^2$ results on training data but rather low $R^2$ results on test data
split the dataset to two - randomly select rows - for training and for testing datasets
if not enough data for splitting into training and test - use cross-validation to estimate a model’s out of sample performance. need a package called vtreat.
- n-fold cross-validation, partition the data into n-subsets
- for three partitions A, B, and C for example, train a model on A and B to predict C; train a model on A and C to predict B, and train a model on B and C to predict A.

Split of datasets

We use the mpg dataset from ggplot2 package.

summary(mpg)

##  manufacturer          model               displ            year     
##  Length:234         Length:234         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :2004  
##                                        Mean   :3.472   Mean   :2004  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:234         Length:234         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.889                                         Mean   :16.86  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :35.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:234         Length:234        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :24.00   Mode  :character   Mode  :character  
##  Mean   :23.44                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

dim(mpg)

## [1] 234  11

(N <- nrow(mpg))

## [1] 234

Calculate how many rows 75% of N should be. Round() is used here to get an integer result.

target <- round(0.75 * N, 0)

Create vector of N uniform random variables: gp

gp <- runif(N)

Create the 75%/25% split for training and testing datasets

mpg_train <- mpg[gp < 0.75, ]
mpg_test <- mpg[gp >= 0.75, ]

Examine the number of rows in each datasets. Additional note is that a random split won’t always produce sets of exactly X% and (100-X)% of the data (due to rounding), but it should be close.

nrow(mpg_train)

## [1] 175

nrow(mpg_test)

## [1] 59

Train a model using test/train split

summary(mpg_train)

##  manufacturer          model               displ            year     
##  Length:175         Length:175         Min.   :1.600   Min.   :1999  
##  Class :character   Class :character   1st Qu.:2.400   1st Qu.:1999  
##  Mode  :character   Mode  :character   Median :3.300   Median :1999  
##                                        Mean   :3.457   Mean   :2003  
##                                        3rd Qu.:4.600   3rd Qu.:2008  
##                                        Max.   :7.000   Max.   :2008  
##       cyl           trans               drv                 cty       
##  Min.   :4.000   Length:175         Length:175         Min.   : 9.00  
##  1st Qu.:4.000   Class :character   Class :character   1st Qu.:14.00  
##  Median :6.000   Mode  :character   Mode  :character   Median :17.00  
##  Mean   :5.817                                         Mean   :16.93  
##  3rd Qu.:8.000                                         3rd Qu.:19.00  
##  Max.   :8.000                                         Max.   :33.00  
##       hwy             fl               class          
##  Min.   :12.00   Length:175         Length:175        
##  1st Qu.:18.00   Class :character   Class :character  
##  Median :25.00   Mode  :character   Mode  :character  
##  Mean   :23.62                                        
##  3rd Qu.:27.00                                        
##  Max.   :44.00

create a formula to express y ~ x (cty: city fuel efficiency; hwy: highway fuel efficiency)

(fmla <- as.formula(cty ~ hwy))

## cty ~ hwy

use lm to build a model from mpg_train that predicts cty from hwy

mpg_model <- lm(fmla, data = mpg_train)
summary(mpg_model)

## 
## Call:
## lm(formula = fmla, data = mpg_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8445 -0.8159 -0.1079  0.7032  2.6096 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.33686    0.37731   3.543 0.000508 ***
## hwy          0.66031    0.01552  42.550  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.186 on 173 degrees of freedom
## Multiple R-squared:  0.9128, Adjusted R-squared:  0.9123 
## F-statistic:  1811 on 1 and 173 DF,  p-value: < 2.2e-16

Application of the model

Define a few functions for convenience of calculating intermediate results

rmse <- function(predcol, ycol) {
  res = predcol - ycol
  sqrt(mean(res^2))
}

r_squared <- function(predcol, ycol) {
  tss = sum((ycol - mean(ycol))^2)
  rss = sum((predcol - ycol)^2)
  1 - rss/tss
}

Predict cty from hwy for both training and test set

mpg_train$pred <- predict(mpg_model, mpg_train)
mpg_test$pred <- predict(mpg_model, mpg_test)

Use rmse to evaluate rmse for both the test and training sets to check if performance is similar.

(rmse_train <- rmse(mpg_train$pred, mpg_train$cty))

## [1] 1.179452

(rmse_test <- rmse(mpg_test$pred, mpg_test$cty))

## [1] 1.457028

Evaluate the r-squared on both training and test data.

(rsq_train <- r_squared(mpg_train$pred, mpg_train$cty))

## [1] 0.912781

(rsq_test <- r_squared(mpg_test$pred, mpg_test$cty))

## [1] 0.9121356

plot the predictions against the outcome on the test data

ggplot(mpg_test, aes(x = pred, y = cty)) + 
  geom_point(color = "red") + 
  geom_abline(linetype = 3)

Create a cross validation plan

there are many ways to implement an n-fold cross validation plan. we will use vtreat to create one and examine it.

kWayCrossValidation() creates a cross validation plan with the following call:

splitPlan <- kWayCrossValidation(nRows, nSplits, dframe, y)

where nRows is the number of rows of data to be split, and nSplits is the desired number of cross-validation folds.

Strictly speaking, dframe and y aren’t used by kWayCrossValidation; they are there for compatibility with other vtreat data partitioning functions. You can set them both to NULL.

The resulting splitPlan is a list of nSplits elements; each element contains two vectors:

train: the indices of dframe that will form the training set
app: the indices of dframe that will form the test (or application) set

In below exercise we will create a 3-fold cross-validation plan for the data set mpg.

Coding of the plan

get number of rows in mpg

nRows <- nrow(mpg)

implement 3-fold cross-fold plan with vtreat. two other arguments are not used here so just set as NULL.

splitPlan <- kWayCrossValidation(nRows, 3, NULL, NULL)
str(splitPlan)

## List of 3
##  $ :List of 2
##   ..$ train: int [1:156] 2 3 5 8 9 10 11 12 13 14 ...
##   ..$ app  : int [1:78] 92 32 154 181 36 210 221 161 193 231 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 2 4 5 6 7 8 10 11 12 ...
##   ..$ app  : int [1:78] 58 179 136 121 112 86 149 54 187 230 ...
##  $ :List of 2
##   ..$ train: int [1:156] 1 3 4 6 7 9 13 14 15 16 ...
##   ..$ app  : int [1:78] 26 17 124 234 27 11 47 170 5 141 ...
##  - attr(*, "splitmethod")= chr "kwaycross"

Evaluating a modelling procedure using n-fold cross-validation

we will use splitPlan, the 3-fold cross validation plan to make predictions from a model that predicts mpg\$cty from mpg$\hwy.

run a 3-fold cross validation plan from splitplan - this seems to be overly complicated (thought there must be a more elegant way to carry out a cross-validation).

k <- 3 # Number of folds
mpg$pred.cv <- 0 
for(i in 1:k) {
  split <- splitPlan[[i]]
  model <- lm(cty ~ hwy, data = mpg[split$train, ])
  mpg$pred.cv[split$app] <- predict(model, newdata = mpg[split$app, ])
}

Predict from a full model

mpg$pred <- predict(lm (cty ~ hwy, data = mpg))

rmse of the full model’s predictions

rmse(mpg$pred, mpg$cty)

## [1] 1.247045

rmse of the cross-validation predictions

rmse(mpg$pred.cv, mpg$cty)

## [1] 1.249317

Note that cross-validation validates the modeling process, not the actual model.