Week 3: Class 14/02/2023


Today we’ll learn a way to evaluate the predictive performance of a model by simulating what happens in real life: it is called cross-validation.


When the output is a binary variable

We start by considering models where the output (target) variable is binary.


Example

Let us create a toy example: a very small data set where we can control everything before we use an [R] library.

DF<-data.frame(y=c(0,1,0,1,0,0,1,1,0,1,0,1,0,0,1),x=c(5,8,4,9,3,2,8,8,6,3,2,7,1,2,0))

The idea of cross-validation is to split the data set into two parts. One part, called the training sample, is used to estimate the predictive model. The other part, called the test sample, is used to evaluate the estimated model on observations that played no role in the estimation.

Why do we do it?

Because in real life you have a data set to estimate the model and then you use the model to predict (what your customers will do, for instance). Forecasting inevitably involves errors. This exercise tries to simulate real-life errors, in order to assess “the errors you could make when you receive the information of new people not used in the model”.

index<-c(1,3,5,6,8,9,11,12,14,15)  # rows that will form the training sample
train<-DF[index,]   # training sample
test<-DF[-index,]   # test sample (the remaining rows)

And we get the training and test subsamples

Next, we run a model with the training data set

model<-glm(y~x, data=train, family="binomial")
summary(model)
## 
## Call:
## glm(formula = y ~ x, family = "binomial", data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0335  -0.8007  -0.6455   0.6563   2.0709  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -2.0198     1.5574  -1.297    0.195
## x             0.2786     0.3071   0.907    0.364
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12.217  on 9  degrees of freedom
## Residual deviance: 11.322  on 8  degrees of freedom
## AIC: 15.322
## 
## Number of Fisher Scoring iterations: 3

and now, we apply this model to the “test” data set. Will the model predict the \(y\) variable properly?

pred<-predict(model,newdata=test, type="response")
pred
##         2         4         7        10        13 
## 0.5519784 0.6194524 0.5519784 0.2343125 0.1491564
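These predicted probabilities come directly from the logistic formula with the estimated coefficients; for example, for observation 2 (where \(x=8\)):

\[ \hat{p}=\frac{1}{1+e^{-(-2.0198+0.2786\times 8)}}\approx 0.552 \]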

Create, by hand, a cross table with the prediction errors. In this very simple example it is easy to see that we only have one misclassified observation (observation 10, whose true \(y\) is 1 but whose predicted probability is below 0.5),

driving us to the classification table (with a 0.5 cutoff):

                 Predicted 0   Predicted 1
    Actual 0          1             0
    Actual 1          1             3
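A quick way to build this table in R, assuming the usual 0.5 cutoff:

pred_class<-ifelse(pred>0.5,1,0)            # turn predicted probabilities into 0/1 classes
table(Actual=test$y,Predicted=pred_class)   # cross table of actual vs predicted classes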

  • Compute the accuracy of the model. What do you think? Is it a good predictive model?

Note that the accuracy can be computed as \(Acc=\frac{TP+TN}{n}=\frac{3+1}{5}=0.8\), which seems like a good model. Although the sample is so small that it is only useful as a toy example :)
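In R, this is simply the proportion of correct classifications:

mean(pred_class==test$y)  # accuracy on the test sample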

How to do it in R?

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(ggplot2)
library(lattice)
train_control<- trainControl(method="cv", number=5,p=0.75, savePredictions = TRUE)  # 5-fold CV; note that p is only used by leave-group-out methods, not by method="cv"

model<- train(as.factor(y)~x, data=DF, trControl=train_control, method = "glm",family = "binomial" )
model
## Generalized Linear Model 
## 
## 15 samples
##  1 predictor
##  2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12, 12, 11, 12, 13 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.7333333  0.48
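Because we set savePredictions = TRUE, caret keeps the held-out predictions of every fold, and confusionMatrix() applied to the trained object returns the cross-validated confusion matrix (averaged over the folds):

head(model$pred)        # held-out predictions, fold by fold
confusionMatrix(model)  # resampled confusion matrix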

Week 3: Class 16/02/2023


Today we’ll learn what the Kappa measure is and how to cross-validate linear regression models.


To understand the idea of the Kappa index, first we need to recall the classification table:

                 Predicted 0   Predicted 1
    Actual 0         TN            FP
    Actual 1         FN            TP

As you can see, there are four possible outcomes:

  • True Positive and True Negative (when our prediction is good)
  • False Negative and False Positive (when our prediction is bad)

The Kappa index tries to take all these quantities into account. Its expression turns out to be:

\[ Kappa=\frac{2\times(TP\times TN- FN\times FP)}{(TP+FP)\times(FP+TN)+(TP+FN)\times (FN+TN)} \]

Obviously, you don’t need to know it by heart (that would be completely useless: save space in your memory for better and more important things).
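As an illustration, plugging in the toy classification table from the previous class (TP = 3, TN = 1, FN = 1, FP = 0):

\[ Kappa=\frac{2\times(3\times 1-1\times 0)}{(3+0)\times(0+1)+(3+1)\times(1+1)}=\frac{6}{11}\approx 0.55 \]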

  • Interpretation. There is no standardized interpretation of the Kappa statistic; we can understand it as a measure of predictive ability complementary to accuracy. Landis and Koch consider 0–0.20 slight, 0.21–0.40 fair, 0.41–0.60 moderate, 0.61–0.80 substantial, and 0.81–1 almost perfect. Fleiss considers kappas above 0.75 excellent, 0.40–0.75 fair to good, and below 0.40 poor. It is important to note that both scales are somewhat arbitrary.


Cross-validation in a linear regression model

What if our target variable (the \(y\)) is quantitative and continuous? One simple model to train is a linear regression such as

\[ y=\beta_0+\beta_1x+u \]

How do we measure the error?

  • Here we do not build a cross table, since the target variable can take a wide range of possible values!

DF<-data.frame(y=c(10,7,11,6,11,13,9,9.5,11.5,12,14,7,15,12.5,12.5),x=c(5,8,4,9,3,2,8,8,6,3,2,7,1,2,0))

Just like in the previous class, we randomly select a training and test set

index<-c(2,4,5,6,9,10,12,14,15)
train<-DF[index,]
test<-DF[-index,]

Thus we have split our data set into two pieces: a training sample and a test sample.

Now, we train the model

model<-lm(y~x,data=train)
summary(model)
## 
## Call:
## lm(formula = y ~ x, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.3636 -0.6023 -0.4091  0.5568  2.4773 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.8636     0.7606  18.227  3.7e-07 ***
## x            -0.8068     0.1426  -5.657 0.000769 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.261 on 7 degrees of freedom
## Multiple R-squared:  0.8205, Adjusted R-squared:  0.7949 
## F-statistic: 32.01 on 1 and 7 DF,  p-value: 0.0007685

And we predict the test sample using the model we have trained before:

pred<-predict(model, newdata=test)
pred
##         1         3         7         8        11        13 
##  9.829545 10.636364  7.409091  7.409091 12.250000 13.056818

Now, we have to evaluate the model’s performance. Again, there are several error measures; here we’ll describe two of them:

  • MSE (Mean Squared Error) is the average of the squared errors on the test set. Why squared? Because, in general (not in this toy example), we’ll have positive and negative errors, so the plain average would tend to zero and would lead us to believe our model has a smaller error than it really does. Squaring the errors gets rid of this important problem…
  • … but once we have the MSE, we usually take its square root (the RMSE). Doing so, we can interpret the error in terms of the units of \(y\).
  • Another widely used measure is the MAE (Mean Absolute Error): we take each error in absolute value and compute the average. See the formulas and the sketch below.
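Formally, if \(\hat{y}_i\) is the prediction for observation \(i\) of a test sample of size \(n\):

\[ MSE=\frac{1}{n}\sum_{i=1}^{n}(y_i-\hat{y}_i)^2, \qquad RMSE=\sqrt{MSE}, \qquad MAE=\frac{1}{n}\sum_{i=1}^{n}|y_i-\hat{y}_i| \]

A minimal sketch computing the three measures by hand on our test sample:

err<-test$y-pred      # prediction errors on the test sample
mse<-mean(err^2)      # mean squared error
rmse<-sqrt(mse)       # root mean squared error, in the units of y
mae<-mean(abs(err))   # mean absolute error
c(MSE=mse,RMSE=rmse,MAE=mae)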

In [R], this can be implemented with caret as follows:

library(caret)
library(ggplot2)
library(lattice)
train_control<- trainControl(method="cv", number=5,p=0.75, savePredictions = TRUE)

model<- train(y~x, data=DF, trControl=train_control, method = "lm" )  #feel the difference with respect to the previous example
model
## Linear Regression 
## 
## 15 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 11, 11, 13, 13, 12 
## Resampling results:
## 
##   RMSE      Rsquared   MAE     
##   1.386579  0.9187232  1.228895
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
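caret also stores the per-fold metrics behind these averages; a quick way to inspect them:

model$resample  # RMSE, Rsquared and MAE for each of the 5 folds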

Exercises

    1. Use the “Advertising” data set that accompanies the ISLR textbook (it is distributed as a CSV file on the book’s website rather than inside the ISLR package). Find information about the meaning of the variables and the units of measurement.
library(readr)
Advertising <- read_csv("Advertising.csv")
## New names:
## • `` -> `...1`
## Rows: 200 Columns: 5
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## dbl (5): ...1, TV, radio, newspaper, sales
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(Advertising)

The Advertising data set. Sales are in thousands of units. TV, radio, and newspaper budgets are in thousands of dollars, for 200 different markets.

    2. Use a graphical tool to discuss which variable seems to be the best predictor of sales: TV, newspaper, or radio?
par(mfrow=c(2, 2))
plot(Advertising$TV,Advertising$sales)
plot(Advertising$radio,Advertising$sales)
plot(Advertising$newspaper,Advertising$sales)

  • We can see that, according to the plots, TV advertising has the strongest relationship with sales, radio advertising a weaker one, and newspaper advertising hardly any relationship at all.
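We can complement the plots with a numeric check (a quick sketch, using the column names shown by read_csv above):

cor(Advertising[,c("TV","radio","newspaper","sales")])  # pairwise correlations with sales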
    3. Perform a cross-validation exercise with these three models:

\[ sales=\beta_0+\beta_1\times TV+u \]

library(caret)
library(ggplot2)
library(lattice)
train_control<- trainControl(method="cv", number=10,p=0.75, savePredictions = TRUE)

model<- train(sales~TV, data=Advertising, trControl=train_control, method = "lm" )  
model
## Linear Regression 
## 
## 200 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 182, 180, 180, 180, 180, ... 
## Resampling results:
## 
##   RMSE     Rsquared   MAE     
##   3.25084  0.6125928  2.562669
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

\[ sales=\beta_0+\beta_1\times Radio+u \]

train_control<- trainControl(method="cv", number=10,p=0.75, savePredictions = TRUE)

model<- train(sales~radio, data=Advertising, trControl=train_control, method = "lm" )  
model
## Linear Regression 
## 
## 200 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 179, 179, 180, 180, 180, 181, ... 
## Resampling results:
## 
##   RMSE      Rsquared   MAE    
##   4.248898  0.3771947  3.35248
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

\[ sales=\beta_0+\beta_1\times Newspaper+u \]

train_control<- trainControl(method="cv", number=10,p=0.75, savePredictions = TRUE)

model<- train(sales~newspaper, data=Advertising, trControl=train_control, method = "lm" )  
model
## Linear Regression 
## 
## 200 samples
##   1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 180, 180, 181, 180, 180, 180, ... 
## Resampling results:
## 
##   RMSE      Rsquared    MAE     
##   5.078061  0.09944607  4.171885
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE
    4. Compare the results with the p-value and the R-squared obtained by each model.
mod1<-lm(sales~TV, data=Advertising)
mod2<-lm(sales~radio, data=Advertising)
mod3<-lm(sales~newspaper, data=Advertising)

summary(mod1)
## 
## Call:
## lm(formula = sales ~ TV, data = Advertising)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
summary(mod2)
## 
## Call:
## lm(formula = sales ~ radio, data = Advertising)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16
summary(mod3)
## 
## Call:
## lm(formula = sales ~ newspaper, data = Advertising)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
  • First comparison, statistical significance versus predictive power: the three variables have p-values under 0.05, so all of them are statistically significant. However, the RMSE differs across models: the smallest is for TV advertising and the largest for newspaper advertising (about 1.5 times the TV error). So statistical significance alone does not guarantee predictive power.
  • Second comparison, R-squared versus predictive power: the R-squared is an in-sample fit measure, while the RMSE and MAE from cross-validation are out-of-sample measures. We can see some relationship between the R-squared and the RMSE (or MAE), but it is not proportional: for instance, the \(R^2\) of the TV model is 0.61 while the \(R^2\) of the newspaper model is 0.05 (about 12 times lower), yet the RMSE only grows from 3.25 to 5.08.