2 Data Visualization and Exploration

2.1 Fundamentals of ggplot

Installing and loading ggplot:

# CHUNK 1
#install.packages("ggplot2")  # uncomment this line the first time you use ggplot2
library(ggplot2)

Breaking down the example below and the general structure of a ggplot:

  1. The ggplot() function initializes the plot and specifies the data used to generate the plot.
  2. Mappings are specified by the aes() function – determining the role of different variables in the generated plot.
  3. Geom functions specify which objects to include in the plot (e.g. lines, bars, histograms, box plots)

CHUNK 2 should generate a scatterplot of mpg versus weight from the mtcars dataset:

# CHUNK 2
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point()

2.1.1 Basic Features

This section covers:

  • Mapping aesthetics
  • Useful Geoms
  • Faceting

Mapping Aesthetics

Note that if you include the color argument inside the aes() function, ggplot treats the value as a variable to be mapped to the color aesthetic. So in this example, it maps a constant variable whose only value is the string “blue”. Since no such variable exists in the data, ggplot creates one and assigns every point the first default color in its palette (a salmon/red hue), along with a legend labeled “blue”.

The second plot shows how to actually color the points blue – the color argument must be placed outside of aes() and specified directly in the geom function (here geom_point()).

# CHUNK 3 and CHUNK 4
p1 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = "blue")) +
  geom_point() + theme(legend.position = "none")  # wrong, including color in aes()

p2 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(color = "blue")  # correct, if we wanted to specify the geom color to be blue.

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Next we use color as a way to map in Automatic/Manual – note how the plots differ when the am variable is treated as a categorical factor versus a numeric:

# CHUNK 5 and 6
mtcars$am.fac <- factor(mtcars$am, levels = c(0, 1),
                        labels = c("Automatic", "Manual"))

p1 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = am.fac)) +
  geom_point()  # am treated as a categorical var/factor

p2 <- ggplot(data = mtcars, mapping = aes(x = wt, y = mpg, color = am)) +
  geom_point()  # am treated as a numeric var

grid.arrange(p1, p2, ncol = 2)

Useful Geoms

Note some other geoms to remember include:

  • geom_bar() - bar chart, for univariate/categorical; fill, alpha
  • geom_boxplot() - boxplot, for univariate/numerical to compare spread/distribution and outliers; bivariate numerical and categorical; fill, alpha
  • geom_histogram() - histogram, for univariate/numerical to compare shape and skewness of distr; bivariate numerical and categorical; fill, alpha, bins
  • geom_point() - scatterplot, for bivariate/numerical to visualize relationship; color, alpha, shape, size
  • geom_smooth() - smoothed line, for drawing apparent signal/pattern; color, fill, method, se

The examples below show an application of geom_point() and geom_smooth() and how varying arguments affect the resulting plots. The biggest difference is that in the second plot the color argument is moved into geom_point(), so the color mapping (fill has been omitted) applies only to the point geom; geom_smooth() therefore fits a single “smoothing” line over the entire data set rather than one line per group.

# CHUNK 7 and 8
p1 <- 
  ggplot(data = mtcars, mapping = aes(x = wt, y = mpg,
                                    color = am.fac, fill = am.fac)) +
  geom_point(size = 2, alpha = 0.4) +
  geom_smooth(method = "lm", alpha = 0.2)  # includes fill in the aesthetic so the points will match the se shading

p2 <- 
  ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(aes(color = am.fac), size = 2, alpha = 0.4) +  # note that we isolate the color map to the point geom only
  geom_smooth(method = "lm")

grid.arrange(p1, p2, ncol = 2)

CHUNKS 9-10 removed, see manual

Faceting

Next, we explore faceting our plots, which is a useful way to display two or more groups of observations in separate, small-multiple plots placed side by side. Note that:

  • facet_wrap() used when there is only one faceting variable as seen below:
# CHUNK 11 - Scatterplots of Wt vs MPG, faceted by am.fac
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(size = 5, alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ am.fac)  # the faceting variable is preceded by a tilde; a second argument such as ncol controls the layout

  • facet_grid() used when there are two faceting variables to produce a two-dimensional facet grid:
# CHUNK 12 - Scatterplots of Wt vs MPG, faceted by am.fac and vs.fac
mtcars$vs.fac <- factor(mtcars$vs, levels = c(0, 1),
                        labels = c("V-shaped", "straight"))
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point(size = 5, alpha = 0.4) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_grid(vs.fac ~ am.fac)  # the variable on the left of the ~ defines the rows, the one on the right defines the columns

CHUNKS 13 removed, see manual

2.1.2 Customizing Plots

We can further customize our plots by adding:

  • Axes
  • Titles, subtitles and captions
  • Displaying multiple plots at once, using gridExtra

Axes

Axes can be adjusted so that scales between variables are properly set or to zoom in/out of plots in order to get the best interpretation out of our generated plots. Some functions that can help us do that include:

  • xlim(), ylim() take on two-element vectors to indicate the lower and upper limits of the axes
  • scale_x_log10() which will automatically adjust the scale of the axis to log10.
    • Useful for highly skewed variables (high dollar amounts).
    • Can also include labels = scales::dollar to format the axis labels as dollar amounts
  • scale_y_log10() – same as above, but for the y-axis (a short sketch follows this list)
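
As a minimal sketch of the log-scale idea (shown on the mtcars data used above purely for illustration, and assuming the scales package is installed):

library(ggplot2)
ggplot(data = mtcars, mapping = aes(x = disp, y = mpg)) +
  geom_point() +
  scale_x_log10(labels = scales::dollar)  # log10 x-axis with dollar-formatted labels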

Titles, subtitles and captions

The labs() function takes the following arguments:

  • x
  • y
  • title
  • subtitle
  • caption

Alternatively, the axis labels and title can be added using xlab(), ylab() and ggtitle() respectively.

See CHUNK 14 below where labels and limits on the x-axis are applied:

# CHUNK 14
ggplot(data = mtcars, mapping = aes(x = wt, y = mpg)) +
  geom_point() +
  labs(title = "Automobile Dataset",
       x = "Weight",
       y = "Miles per Gallon") +
  xlim(2, 4)  # you can also use xlim(c(2, 4))
## Warning: Removed 8 rows containing missing values (geom_point).

Displaying multiple graphs

Using the grid.arrange() function from the gridExtra package allows us to place several ggplots in a single figure for comparison.

See CHUNK 15 below for an example of two plots displayed side-by-side: wt vs mpg and disp vs mpg

# CHUNK 15
# uncomment the following line if you haven't installed the package
#install.packages("gridExtra")
library(gridExtra)
p1 <- ggplot(data = mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  xlab("Weight") +
  ylab("Miles per Gallon")
p2 <- ggplot(data = mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  xlab("Displacement") +
  ylab("Miles per Gallon")
grid.arrange(p1, p2, ncol = 2)

2.2 Data Exploration

We will now apply our data visualization techniques – as well as implement basic summary statistics to perform basic exploratory data analysis. For example, we may want to:

  • Explore the relationship each variable in a given dataset has with our target
  • Describe the observed distribution of each variable and determine if any transformations may be appropriate before fitting a model

EDA is particularly useful for:

  • Performing “sniff” tests to ensure that the observed data is usable (i.e. no egregious data errors)
  • Detecting any errors/outliers that may derail our analysis
  • Drawing out relationships between variables
  • Determining if any transformations are necessary
  • Deciding what kind of model would be most appropriate for the given data and problem

In the following sections, we cover two basic types of EDA:

  • Univariate data exploration (focused on one variable)
  • Bivariate data exploration (focused on the relationship between two variables)

2.2.1 Univariate Data Exploration

Univariate data exploration focuses on the distribution of one variable at a time – taking on two forms:

  1. Statistical summaries:
    • Summarize via numerical statistics (min, med, mean, max, variance, freq); Provides a broad overview of the distribution of the variable
    • Easy to compare different variables
  2. Graphical summaries:
    • Visualizes the distribution of a variable via histograms, box plots and bar charts
    • Offers a quick glimpse of the overall distribution
    • Often more informative than statistical summaries
    • May be better at revealing outliers

Two types of variables we’ll often encounter are numeric and categorical.

Numeric variables

Statistical summaries will often entail:

  • Describing the central tendency via the mean or median, using the summary() function
  • Dispersion/spread can be measured via variance/standard deviation and IQR

Graphical summaries will often entail:

  • Primarily creating histograms and box plots
  • Histograms divide obs into equally spaced bins to visualize the count or rel freq of each bin
    • Mostly used to learn about the overall shape of the distribution
  • Box plots visualize the distribution by marking the 25% and 75% quantiles as well as the median, and call out any outliers (points lying more than 1.5 x the IQR beyond the quartiles)
    • Useful for summarizing key numeric statistics (quartiles/median, outliers, range) and comparing the relative magnitude of these summary statistics across different levels of another categorical variable

Using the following case study (Personal injury insurance claims):

# CHUNK 1
persinj <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/persinj_a.csv")  # load personal injury data

The chunk below returns the summary stats of the observed settled claim amount:

# CHUNK 2
summary(persinj$amt) 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    6297   13854   38367   35123 4485797

We note that the amt variable takes on a wide range, and its mean is much larger than the median, indicating a right skew. The examples below further investigate this by splitting the amt variable between two groups: with legal representation and without:

# CHUNK 3 (Example 2.2.1)
# Group by legrep, Summary stats
persinj.0 <- persinj[persinj$legrep == 0, ]
persinj.1 <- persinj[persinj$legrep == 1, ]
summary(persinj.0$amt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    4061   11164   32398   29641 2798362
summary(persinj.1$amt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      20    7305   15309   41775   38761 4485797
sd(persinj.0$amt)
## [1] 77820.33
sd(persinj.1$amt)
## [1] 97541.38
# CHUNK 4, histogram of amount
library(ggplot2)
ggplot(persinj, aes(x = amt)) +
  geom_histogram(fill = "blue", alpha = 0.7) +
  xlim(0, 100000)  # note the geom_histogram only requires an x aes

# CHUNK 5, attempt to correct for skew by log-transforming; play with bins argument
p1 <- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram(fill = "blue", alpha = 0.7) +
  ggtitle("Default value")
p2 <- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram(bins = 20, fill = "blue", alpha = 0.7) +
  ggtitle("Bins = 20")
p3 <- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram(bins = 40, fill = "blue", alpha = 0.7) +
  ggtitle("Bins = 40")
p4 <- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram(bins = 80, fill = "blue", alpha = 0.7) +
  ggtitle("Bins = 80")
library(gridExtra)
grid.arrange(p1, p2, p3, p4, ncol = 2)

The initial histogram of amt supports the right skew we suspected from the summary stats. The following histograms show how log-transforming the variable creates a more symmetric/less skewed shape. The bins argument is worth experimenting with, as it can reveal different insights into the shape of the distribution.

Boxplots for amt are drawn in CHUNK 6 below. We create two versions, one raw and one log-transformed, to demonstrate how a large skewness can make the raw box plot hard to interpret beyond the skewness. Note that in the raw box plot there are still several very large outliers.

# CHUNK 6
p1 <- ggplot(persinj, aes(y = amt)) +
  geom_boxplot()  # note that box plots require a y aes() but not an x aes()
p2 <- ggplot(persinj, aes(y = log(amt))) +
  geom_boxplot()
grid.arrange(p1, p2, ncol = 2)

Categorical Variables

Statistical summaries will often entail:

  • Frequency tables – relative freq are more useful than summary stats such as min, max and med since categorical variables do not always have a natural order.
    • Can use the table() function in R

Graphical summaries will often entail:

  • Bar charts – good for comparing relative magnitude/frequency across multiple levels within the categorical variable
    • Since freq tables can be hard to read for multiple levels

Revisiting the following case study (Personal injury insurance claims), we first generate freq and rel frequency tables for the injury code variable (inj). Then we generate bar charts for our visualization.

# CHUNK 7 - freq and rel freq tables
table(persinj$inj)  # frequency/obs of different injury code
## 
##     1     2     3     4     5     6     9 
## 15638  3376  1133   189   188   256  1256
table(persinj$inj)/nrow(persinj)  # relative frequency of different injury codes
## 
##           1           2           3           4           5           6 
## 0.709656925 0.153203848 0.051415865 0.008576874 0.008531494 0.011617353 
##           9 
## 0.056997640
# CHUNK 8 - bar charts for inj codes
# first convert inj and legrep to factors (original data type is integer)
persinj$inj <- as.factor(persinj$inj)
persinj$legrep <- as.factor(persinj$legrep)

p1 <- ggplot(persinj, aes(x = inj)) +
  geom_bar()  # bar chart using counts on y-axis

p2 <- ggplot(persinj, aes(x = inj)) +
  geom_bar(aes(y = ..prop.., group = 1))  # bar chart using rel proportion as the y aes, need group = 1

grid.arrange(p1, p2, ncol = 2)

Note that both freq tables and the bar charts show us:

  • Most obs fall under inj type 1 (no injury)
  • No fatal injuries are observed (type 8)
  • There are a few non-recorded (type 9)

2.2.2 Bivariate Data Exploration

EDA is more interesting when we analyze multiple variables together to reveal any relationships, patterns, outliers. Three types of bivariate combinations we can explore include:

  • Numeric vs Numeric
  • Numeric vs Categorical
  • Categorical vs Categorical

Numeric vs Numeric

This typically involves generating a scatterplot to give us a sense of the nature/shape of the relationship (e.g. increasing/decreasing, quadratic/curved).

CHUNK 9 below shows two examples – one with a raw y-var and another with the y-var log-transformed. We see how log-transformations often reveal a more linear relationship b/w two variables.

# CHUNK 9
p1 <- ggplot(persinj, aes(x = op_time, y = amt)) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE)  # raw amt

p2 <- ggplot(persinj, aes(x = op_time, y = log(amt))) +
  geom_point(alpha = 0.05) +
  geom_smooth(method = "lm", se = FALSE)  # log(amt)

grid.arrange(p1, p2, ncol = 2)

CHUNK 10 below introduces a third, categorical variable (in this case legrep, represented by the color of each point) in order to inspect how the relationship between the two numeric variables changes at different levels of the third/categorical variable – i.e., an interaction effect. Note how:

  • The main effect b/w amt and op_time: positive, represented by the grey line
  • The main effect of legrep on amt: for those with legrep, amt starts higher
  • The interaction effect – the slope is flatter for those with legrep, so amt doesn’t increase as fast with op_time as it does for those without
    • The legrep effect on amt largely disappears at large op_time
# CHUNK 10

persinj$legrep <- factor(persinj$legrep, levels = c(0, 1))

ggplot(persinj, aes(x = op_time, y = log(amt))) +
  geom_point(aes(color = legrep), alpha = 0.25) +
  geom_smooth(aes(color = legrep), method = "lm", se = FALSE) +
  geom_smooth(method = "lm", se = FALSE, color = "grey40")

Numeric vs Categorical

The best way to convey the relationship is to use split box plots – a series of box plots of the numeric variable, split by the different levels of a categorical variable. These show the conditional distribution of the numeric variable, given a specified level of the categorical variable.

See CHUNKS 11-12 to see:

  • How the distribution of log(amt) changes at different levels of injury code
    • log(amt) increases up to inj code 4, then decreases
  • How the distribution of log(amt) changes b/w those with and without legrep
    • log(amt) higher for those with legrep vs without
  • All three variables are visualized in the third plot where fill represents legrep groups and the boxplots are split by inj code
# CHUNK 11
persinj$inj <- factor(persinj$inj)

p1 <- ggplot(persinj, aes(x = inj, y = log(amt), fill = inj)) +
  geom_boxplot()

p2 <- ggplot(persinj, aes(x = legrep, y = log(amt), fill = legrep)) +
  geom_boxplot() 

grid.arrange(p1, p2, ncol = 2)

# CHUNK 12
ggplot(persinj, aes(x = inj, y = log(amt), fill = legrep)) +
  geom_boxplot()  # combines the two plots into one, using fill for legrep

We can also use histograms although they’re not as effective as split box plots to visualize grouped distributions. We can either create:

  • Stacked histograms – highlights contribution of each categorical level to the overall distribution
  • Faceted histograms – better for side-by-side comparisons

See CHUNK 13 for an example of both:

# CHUNK 13
p1 <- ggplot(persinj, aes(x = log(amt), fill = legrep)) +
  geom_histogram(bins = 50)  # stacked

p2 <- ggplot(persinj, aes(x = log(amt))) +
  geom_histogram(fill = "blue") +
  facet_wrap(~ legrep)  # faceted

grid.arrange(p1, p2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Categorical vs Categorical

We can construct:

  • Two-way frequency tables: shows # of obs in each combination
    • uses the table() function with two arguments for each cat variable
    • see CHUNK 14 for an example
  • Visually we can use:
    • Stacked bar charts: highlights contribution, although scales may be confusing
    • Side-by-side/dodged bar charts: retain the original scale and are better for comparing between groups – see CHUNK 15 for an example of both
# CHUNK 14
table(persinj$legrep, persinj$inj)
##    
##         1     2     3     4     5     6     9
##   0  5571  1152   374    56    85   121   649
##   1 10067  2224   759   133   103   135   607
# CHUNK 15
p1 <- ggplot(persinj, aes(x = inj, fill = legrep)) +
  geom_bar()  # stacked bar by default

p2 <- ggplot(persinj, aes(x = inj, fill = legrep)) +
  geom_bar(position = "dodge")  # dodge/side-by-side

grid.arrange(p1, p2, ncol = 2)

3 Linear Models

3.1 Fundamental PA Concepts

Some of the fundamental predictive analytics concepts highlighted in this exam include:

  • Basic terminology used to describe different predictive models and considerations
  • Model validation to evaluate a model’s predictive accuracy
  • Bias-variance tradeoff
  • Feature generation and selection

3.1.1 Basic Terminology

Purpose of Predictive Analytics

Essentially, we want to “predict” a target variable based on a set of closely related variables. We’re interested in answering business problems or questions with a “data-driven” answer. Examples include:

  • What demographic variables (age, race, education) if any influence salary? Is there a relationship to begin with?
  • How can an insurer’s experience influence what it’ll set its premium at?
  • What key factors drive injuries/accidents?

Classification of Variables

There are two ways we can classify a variable:

  • By role: (1) the target/response and (2) the explanatory/predictor or feature
  • By nature: (1) numeric vs (2) categorical
    • Numeric variables are numbers with an associated range; usually continuous but can be discrete;
    • Categorical variables take predefined values (called levels); variables do not necessarily have to be ordinal

Supervised vs Unsupervised Learning

  • Supervised learning refers to where there is a target variable that “supervises” or guides our analysis.
    • The goal is to understand the relationship b/w the variable and the predictors and/or make accurate predictions based on the predictors
    • Examples include linear models, GLMs and decision trees
  • Unsupervised learning refers to when a target variable is absent.
    • The goal is to extract relationships/structures between different variables in the data.
    • Examples include principal components analysis and cluster analyses
    • Used in conjunction with supervised learning techniques (e.g. as an EDA tool)

Regression vs Classification Problems

  • Regression problems – typically supervised learning problems with a numeric/continuous target;
    • Notable exception: logistic regression where the target is binary which is considered a classification problem
  • Classification problems – typically a supervised learning problem with a categorical/discrete target
    • Predictive models are often referred to as a classifier
    • Examples include logistic regression and Poisson regression
    • Decision trees – classification trees specifically

3.1.2 Model Validation

One of our main objectives is to ensure that our predictive model not only performs well against the data used to generate and select that model, but also on unseen data. I.e. how well will our model work when future, unseen data is thrown at it?

Training, Validation and Testing Sets

In a predictive analytics context, it’s common to partition our dataset into 2 or 3 subsets (different from classical stats where we fit our model on the entire dataset)

  • Training: Typically the largest that’s used to develop our predictive model
    • Used to estimate the signal/pattern \(f(\boldsymbol{X})\) and model parameters/hyperparameters
    • For multiple/competing models, the same training set should be used to fit each model so that they’re all comparable
  • Validation: Used to assess the predictive performance of our predictive model(s) using a performance metric (e.g. AIC, BIC, RMSE, AUC)
    • Since the validation set wasn’t used to develop our model, it should be more objective in assessing model performance
    • For multiple/competing models, the same validation set should be used to compare the models against each other via some metric (i.e. which one performs best on a validation set)
    • Can be used to tune hyperparameters, help select a model and features
  • Testing: Ultimate set to evaluate the predictive performance of the chosen model and obtain an independent measure of its prediction accuracy on unseen data
    • Used at the END of the model development process
    • The final word on how well the selected model performs
  • Common splits include 70/20/10%, with anything between 60–80% going to the training set – a short sketch of partitioning data in R follows this list
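
A minimal sketch of the training/test split using caret::createDataPartition() (the function used in the Section 3.3 case study); the 70/30 split and the mtcars data are purely illustrative:

# illustrative 70/30 split of mtcars, stratified on the target mpg
library(caret)
set.seed(100)
train.idx <- createDataPartition(mtcars$mpg, p = 0.7, list = FALSE)
data.train <- mtcars[train.idx, ]   # used to fit/train the model
data.test  <- mtcars[-train.idx, ]  # held out to evaluate predictive performance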

Cross-validation

  • Often useful when the data size is small, so splitting into three sets may undermine the credibility of the model
  • Cross-validation offers a useful alternative where the data is split into only a training and test set (e.g. 80/20) and:
    • The training set is further split into \(k\) folds, where each fold in turn serves as a validation set
    • The model is fit \(k\) times, each time using the other \(k-1\) folds
    • Each fold is used exactly once to generate predicted values/validate the trained model
    • A performance metric from each of these iterations is saved and aggregated (often as an average); a hand-rolled sketch follows this list
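
A hand-rolled sketch of \(k\)-fold cross-validation for a simple linear model (purely illustrative; in practice packages such as caret automate this):

set.seed(100)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))   # randomly assign each obs to a fold
cv.rmse <- numeric(k)
for (i in 1:k) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != i, ])      # train on the other k-1 folds
  pred <- predict(fit, newdata = mtcars[folds == i, ])  # predict on the held-out fold
  cv.rmse[i] <- sqrt(mean((mtcars$mpg[folds == i] - pred)^2))
}
mean(cv.rmse)  # aggregate the k performance metrics (here, an average RMSE)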

Performance Metrics

The metrics to evaluate/compare competing models will often depend on the nature of the target variable (numeric vs categorical) and/or the nature of the prediction problem (regression vs classification), as well as the type of model used (GLM vs decision tree). Note that within cross-validation, these metrics would be calculated and saved for each model and \(k\)th iteration and then aggregated (e.g. averaged) to represent the overall metric for that specific model.

  • For Regression problems: We can use RMSE (root mean squared error), which represents the size of the difference between our observed target values and the predicted target values. Essentially the square root of the average squared prediction error (on the validation/test set)
    • The smaller, the better
    • Note that RMSE’s advantage over MSE is that it’s on a more similar scale to the observed target variable

\[\begin{equation} RMSE = \sqrt{\frac{1}{n_{\text{test}}}\sum_{i \in \text{test}}(y_i - \hat{y}_i)^{2}} \end{equation}\]

  • For Classification problems: the predicted values are labels, not numbers, so we can’t handle them algebraically.
    • Prediction errors such as \(y - \hat{y}\) may not be as useful or make much sense
    • Rather, use the Classification Error Rate which is defined as the proportion of obs in the test set that are incorrectly classified (i.e. false positives plus false negatives)

Note that both RMSE and classification errors do not rely on the distribution of the target variable. If we have prior info on the target variable’s distribution, we can also consider a likelihood based metric (e.g. AIC, BIC)
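
As a small illustrative sketch of the two metrics above (the vectors below are toy stand-ins for observed vs predicted values on a test set):

# toy vectors standing in for observed vs predicted values on a test set
y <- c(10, 12, 15)
y.hat <- c(11, 11, 16)
sqrt(mean((y - y.hat)^2))   # RMSE for a regression problem

labels <- c("A", "B", "A")
labels.hat <- c("A", "A", "A")
mean(labels != labels.hat)  # classification error rate (proportion misclassified)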

3.1.3 Bias-Variance Tradeoff

THE MAIN IDEA

  • A COMPLEX MODEL \(\nRightarrow\) A PREDICTIVE MODEL
    • Prediction accuracy is not the same as goodness-of-fit
    • Complex models typically fit well to the training data but often perform poorly against unseen/future data (i.e. when we apply our validation or test sets).
    • Less complexity (less overfitting) will usually lead to more prediction accuracy (less variance)

Decomposition of the Expected Test Prediction Error

The loss function that we are often trying to minimize when selecting a well-performing model can be broken down into the following components:

  • Bias: the difference between the expected value of what we’re trying to predict and what the actual target will be (i.e. the difference between our prediction using the estimated signal versus the actual future targets)
    • In general, bias will be low in a complex model since it has more flexibility to capture the underlying signal in the training data (not necessarily the TRUE signal in the unseen data)
  • Variance: The amount by which our predictions would change if we estimated it using a different training set
    • Ideally, our estimated signal function and model would perform with good stability across different training sets
    • More complex models tend to have higher variance though since small changes in the training data will cause large changes in our estimated signal function/model (i.e. overfit the data)
  • Irreducible/inherent error: This is the variance of the noise which is independent of the predictive model but inherent in the target variable
    • Cannot be reduced no matter how good our model is
    • The combined bias and variance components make up the reducible error – the part of the error that can be reduced by improving the model or using a more appropriate one

An intrinsic conflict exists between a model with low bias and a model with low variance. More complex models tend to have low bias but high variance compared to less flexible/more general models; complex models also tend to have interpretability and computational issues. This relationship is known as the bias-variance tradeoff. At first, adding complexity typically causes bias to drop faster than variance rises, so the expected loss decreases; eventually the increase in variance outweighs the decrease in bias and the expected loss ticks back up. That turning point is where the model starts overfitting – capturing the noise in the data instead of the signal.

CHUNK 1 and 2 removed – see manual for an example demonstrating the bias-variance tradeoff

3.1.4 Feature Generation and Selection

To strike the balance between a parsimonious model that produces stable predictions and one flexible enough to capture complex signals in the data, we need to control model complexity – achieved through feature generation and selection.

Variables vs Features

  • Variables: Raw measurement that is recorded and sourced from the original dataset before any transformations
  • Features: Derivations from the original variables; provides an alternative, more useful view of the information contained in the dataset

Feature Generation

Feature generation is the process in which new features are developed from the existing variables in the data. Features serve as the final inputs/predictors in our model. The goal (with the bias-variance tradeoff in mind) is to generate features that enhance the flexibility of the model and lower the bias of the predictions, at the expense of some increase in variance.

Some examples include:

  • Using Issue Year to form Policy Duration
  • Using Age to form an indicator that flags whether an observation is for a person under or over retirement age
  • Using Gender and Smoker status to form a combined feature (e.g. male smoker, female nonsmoker)
  • Using any “generic variables” and applying PCA to form a PC or linear combination that does a better job at enhancing our model

Feature generation typically is more vital in the GLM model building process. Other practical examples include:

  • Transformations (E.g. log-transforming dollar amounts)
  • Dummy/binarized variables, representing categorical variables
  • Interaction terms

Feature Selection

Feature selection can be seen as the opposite of generation, where we try to remove/drop features that don’t add much predictive power. Dropping unnecessary features reduces the dimension of the data, places a check on model complexity and helps prevent overfitting.

Feature selection is vital to constructing both GLMs and decision trees – and particularly relevant for categorical predictors, where each unique level is its own feature. This can lead to high dimensionality and undermine prediction precision (inflating variance). Two strategies to reduce the dimensions of categorical predictors include:

  • Combining similar level categories
    • The ideal categorical variable will have homogeneous/similar obs (in terms of the target) within each level and different behavior across levels (typically with respect to the mean/median)
    • If two levels are more similar than different, then potentially they can be combined without losing too much information
  • Combining sparse categories with others
    • If a categorical level has too few observations, then it is a natural candidate to combine with other levels
    • Small sample sizes may make it difficult to reliably estimate the effects of these categories on the target variable (the estimates won’t be statistically significant)
    • It is advisable to fold sparse levels into other levels that exhibit similar behavior (with respect to the target variable); if none do, we can bucket them into an “other” level
    • ON THE EXAM: JUSTIFY COMBINING LEVELS WITH A BETTER REASONING THAN JUST “LOW OBSERVATIONS”

3.2 Linear Models: Theory

Note that while for this exam we’ll use R to fit, select and evaluate our models without worrying too much about the technicalities behind the scenes, we’re still expected to convey our conceptual understanding of what is done at each step and provide a high-level description of our model, feature selection and model validation process.

3.2.1 Model Formulation

Model Equation

The linear (regression) model postulates that a target/response variable \(Y\) (assumed to be continuous) is related to \(p\) predictors \(X_{1}, X_{2}, ..., X_{p}\) via an approximately linear relationship:

\[\begin{equation} Y = \beta_{0} + \beta_{1} X_{1} + ... + \beta_{p} X_{p} + \varepsilon \end{equation}\]

where

  • \(p\) is the number of predictors and
  • \(\beta_{0}, ..., \beta_{p}\) are the regression coefficients/parameters
  • \(\varepsilon\) is the unobservable zero-mean random error term

Taking expectations, the model for the target mean takes the form:

\[\begin{equation} \mathbb{E}[Y] = \beta_{0} + \beta_{1} X_{1} + ... + \beta_{p} X_{p} \end{equation}\]

Note that \(\mathbb{E}[Y]\) is linear with respect to the coefficients, not necessarily the predictors \(X_{1}, ..., X_{p}\), so a quadratic model like \(Y = \beta_{0} + \beta_{1} X_{1}^{2} + ... + \beta_{p} X_{p} + \varepsilon\) would still be considered a linear model.

  • \(\beta_0\) is the intercept, i.e. the expected value of \(Y\) when all \(X_j\)’s are \(0\)
  • \(\beta_j\) is the regression coefficient or slope of the \(j\)th predictor. It represents the expected effect of that predictor on \(Y\), holding all other predictors fixed/constant

Model Fitting by OLS

One way of estimating the unknown coefficients (represented here as a vector \(\boldsymbol{\beta}\), with estimates \(\boldsymbol{\hat{\beta}}\)) is through ordinary least squares (OLS), where we assume the linear model

\[\begin{equation} Y_i = \beta_{0} + \beta_{1} X_{i1} + ... + \beta_{p} X_{ip} + \varepsilon_i, i = 1, ..., n \end{equation}\]

We also assume that:

\[\begin{equation} \varepsilon_i \stackrel{iid}{\sim} \mathcal{N}(0, \sigma^2) \end{equation}\]

Note that this leads to a few of our model assumptions:

  • The error term has a zero mean and constant variance \(\sigma^2\)
    • The constant variance part is referred to as the homoscedasticity assumption (i.e. the variance of the errors does not change across observations)
  • The conditional distribution of \(Y_i\) given our observations \(\boldsymbol{X}\) is normal/Gaussian with mean equal to the linear predictor and variance equal to the constant \(\sigma^2\).

Ordinary least squares (OLS) estimates our parameters \(\boldsymbol{\hat{\beta}}\) by choosing the parameter values that minimize the sum of squared differences between the observed target and the fitted/predicted values, i.e. \(\sum_{i} (y_i - \hat{y}_i)^2\).

See the manual for additional notes on the matrix representation and solution of OLS; for reference, a quick summary of the matrix form is shown below, followed by a short numerical sketch.

\[\begin{equation} \begin{bmatrix} Y_1\\ Y_2\\ \vdots\\ Y_n\\ \end{bmatrix} = \begin{bmatrix} 1 & X_{11} & X_{12} & \cdots & X_{1p}\\ 1 & X_{21} & X_{22} & \cdots & X_{2p}\\ \vdots & \vdots & \vdots & \ddots & \vdots\\ 1 & X_{n1} & X_{n2} & \cdots & X_{np}\\ \end{bmatrix} \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p\\ \end{bmatrix} + \begin{bmatrix} \varepsilon_1\\ \varepsilon_2\\ \vdots\\ \varepsilon_n\\ \end{bmatrix} \end{equation}\]

which can be written shorthand as:

\[\begin{equation} \boldsymbol{Y} = \boldsymbol{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon} \end{equation}\]

where:

  • \(\boldsymbol{Y}\) is the \(n \times 1\) response vector
  • \(\boldsymbol{X}\) is the design matrix
  • \(\boldsymbol{\beta}\) is the vector of \(p + 1\) regression coefficients
  • \(\boldsymbol{\varepsilon}\) is the vector of our error terms
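
For reference, a small numerical sketch of the closed-form OLS solution \(\boldsymbol{\hat{\beta}} = (\boldsymbol{X}^{T}\boldsymbol{X})^{-1}\boldsymbol{X}^{T}\boldsymbol{Y}\), using mtcars purely for illustration and confirming it matches lm():

X <- model.matrix(mpg ~ wt + hp, data = mtcars)  # design matrix (first column is all 1s)
Y <- mtcars$mpg
beta.hat <- solve(t(X) %*% X) %*% t(X) %*% Y     # (X'X)^(-1) X'Y
beta.hat
coef(lm(mpg ~ wt + hp, data = mtcars))           # identical estimates from lm()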

Model Quantities

After we run the OLS model in R and the least squares estimates of \(\boldsymbol{\beta}\) are computed, we should also review the model output/quantities and what they tell us about our linear model:

  • Predicted values: \(\boldsymbol{\hat{Y}} = \hat{\beta_0} + \hat{\beta_1}X_1^* + ... + \hat{\beta_p}X_p^*\) is our predicted value of interest and \((1, X^*_1, ..., X^*_p)\) are a particular set of predictor values.
    • When the \(\boldsymbol{X}\) values come from the training set, the corresponding \(\boldsymbol{\hat{Y}}\) vector contains our fitted values
    • Ideally, we want our fitted values to be similar to the actual observed targets but not overly close (overfitting)
    • On the test set, however, we want our predicted values to be as close to the observed targets as possible, which would indicate high predictive power
  • Residual: \(e_i = Y_i - \hat{Y_i}\)
    • Raw residuals are the difference b/w the observed target and the predicted value (either training or test set)
    • Hint at patterns in the underlying population that our model fails to capture
    • Crucial in model diagnostics
  • t-statistic \(t(\hat{\beta_j}) = \hat{\beta_j}/(SE(\hat{\beta_j}))\)
    • Ratio of the corresponding least squares coefficient estimate to its estimated standard deviation (standard error)
    • Used to hypothesis test \(\beta_j = 0\), i.e. statistical significance
    • Low p-values indicate that given all the other predictors, the corresponding predictor has a statistically significant effect on the response
    • P-values represent the probability that the given test-statistic or a more extreme value would occur, given the null hyp is true
  • Coefficient of determination, \(R^2\)
    • Proportion of variance of the target variable that can be explained by the fitted model
    • On the training set, it’s a measure from 0 to 1
    • BIG PROBLEM Always increases with the addition of predictors;
    • Adjusted \(R^2\) tries to alleviate this by adding a penalty term for the number of parameters, so the improvement in model quality needs to outweigh the penalty of increasing parameters
  • F-statistic
    • Tests the joint significance of all the predictors, except the intercept term
    • The null hyp is \(\beta_1 = \beta_2 = \cdots = \beta_p = 0\)
    • Alt hyp is if at least one is significant
    • Does not indicate which predictor is significant or predictive if the null hyp is rejected at small p-values

3.2.2 Model Evaluation

After we fit our model, we should (1) evaluate its quality using some performance metrics and (2) diagnose it for any abnormalities:

Performance Metrics

Since the target of a linear model is numeric/continuous, one common measure is the TEST RMSE (root mean squared error).

  • Other performance measures include the loglikelihood on the TEST set and adjusted \(R^2\) (again on the TEST set)
  • Note that the RMSE, adjusted \(R^2\) and loglikelihood are all functions of the sum of squared prediction errors \(\sum(Y_i - \hat{Y}_i)^2\), so they’re essentially equivalent metrics
  • RMSE is probably the most interpretable since it is on the same unit scale as the target and gives us a better sense of the prediction error

Loglikelihood-based Performance Metrics:

  • For linear and GLM models, the target variable is assumed to follow a specified distribution (e.g. gaussian, Poisson, gamma) so we can employ general performance metrics based on penalized likelihood
  • AIC, or the Akaike Information Criterion defined as \(-2l + 2p\), where \(l\) is the loglikelihood and \(p\) is the number of parameters
    • Note that the goodness of fit on the training data (through \(l\)) is offset by model complexity represented by \(p\)
    • The smaller the AIC, the better
  • BIC, or the Bayesian Information Criterion, defined as \(-2l + \ln(n_{tr})p\), where \(n_{tr}\) is the size of our training set
    • Note that BIC is very similar to AIC, however the penalty term will typically be larger than in AIC (as long as the training set has at least 8 observations, so that \(\ln(n_{tr}) > 2\))
    • This means that BIC is typically more conservative than AIC, in that each additional parameter must yield a larger improvement in fit to the training data to be justified
    • tl;dr: BIC is more stringent than AIC for complex models and would be more conservative in feature selection (a small computational sketch follows this list)
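
A small computational sketch of obtaining these criteria in R (the lm() fit on mtcars is purely illustrative):

fit <- lm(mpg ~ wt + hp, data = mtcars)  # toy fit for illustration
logLik(fit)  # the loglikelihood l (R counts the error variance as a parameter)
AIC(fit)     # -2l + 2p
BIC(fit)     # -2l + ln(n)p; heavier penalty than AIC once n > 7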

Model Diagnostics

Beyond evaluating the predictive performance of our model, we also should diagnose how well the model aligns with the model assumptions (error normality, zero-mean and constant variance). Model diagnostics are tools deployed to identify evidence against model assumptions and inform model refinements to improve how well it aligns to those assumptions.

  • E.g. residuals/standardized residuals to detect abnormalities (wrt target values), leverages to detect abnormalities wrt predictor values and Cook’s distance to detect influential observations (both target and predictor values)

For PA, however, we’ll focus on the following plots (a short sketch of producing them in R follows the list):

  • Residuals vs Fitted:
    • Residuals plotted against the fitted values (predictions on the training set)
    • Used to check model specification
    • Used to check homoscedasticity (constant variance)
    • A good model would show residuals that possess the same variance (no fanning out on the plot), that are equally spread out with no pattern, and that are symmetric in either direction
    • Any systematic patterns (e.g. U-shape) and/or significantly non-uniform spread in residuals indicate an inadequate model equation and/or heteroscedasticity respectively
  • Normal Q-Q Plots:
    • Normal quantile-quantile plots graph the quantiles of the standardized residuals (residuals/standard error) against the theoretical standard normal quantiles
    • Used to check the normality of the random errors
    • If errors are normal then the points should fall along a 45 degree line that passes through the origin, since this line represents how the points would lie if the residuals (as an estimate for errors) were perfectly normal
    • Usually departures (typically at the tails) suggest that the normality assumption isn’t completely fulfilled
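
A brief sketch of producing both plots from a fitted lm object in base R (the fit below is a toy example on mtcars):

fit <- lm(mpg ~ wt + hp, data = mtcars)  # toy fit for illustration
plot(fit, which = 1)  # Residuals vs Fitted
plot(fit, which = 2)  # Normal Q-Q plot of the standardized residuals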

3.2.3 Predictors and Interpreting Coefficients

Feature generation involves specifying exactly what linear model to use in the first place. Typically we want to fine tune the features/predictors we include in the model with the goal of improving model flexibility to best capture the signal in the data.

Predictors can be split between numeric and categorical:

Numeric Predictors

A numeric predictor, \(X_j\), is simple to handle in our model (for the most part). The regression coefficient can be rewritten as \(\beta_j = \frac{\partial \mathbb{E}[Y]}{\partial X_j}\) since:

\[\begin{equation} \mathbb{E}[Y] = \beta_0 + \beta_j X_j + \mbox{terms not involving } X_j \end{equation}\]

Therefore we can interpret \(\beta_j\) as the expected change in the target \(Y\) per unit increase in \(X_j\), holding all other predictors constant.

For the exam, it is important we understand how to interpret coefficients (wrt sign and magnitude). The statement/interpretation will often depend on:

  • Type of predictor (numeric or categorical)
  • Interaction Term or Not
  • In the GLM case, what link function was used

If the relationship between \(Y\) and \(X_j\) is NOT linear, then we can fit a polynomial regression such as \(Y = \beta_0 + \beta_1 X_j + \beta_2 X_j^2 + ... + \varepsilon\). In this case: \(\frac{\partial \mathbb{E}[Y]}{\partial X_j} = \beta_1 + 2 \beta_2 X_j\), which means that the expected effect \(X_j\) has on \(Y\) now actually depends on \(X_j\).

Note that the tradeoff made here is that a polynomial regression gives our model more flexibility to capture the signal in the data better, but the regression coefficients are now harder to interpret.
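
As a minimal sketch of specifying a polynomial term in R (mtcars again used purely for illustration; I() protects the squared term inside the formula):

# quadratic term added via I(); poly(wt, 2) would be an (orthogonal) alternative
fit.quad <- lm(mpg ~ wt + I(wt^2), data = mtcars)
summary(fit.quad)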

Categorical Predictors

Categorical predictors have to be treated differently from quantitative ones – we’ll use a smoking example where:

\[\begin{equation} \text{Smoking} = \begin{cases} \text{Smoker} \\ \text{Non-Smoker} \\ \text{Unknown} \\ \end{cases} \end{equation}\]

Categorical variables must be binarized before being incorporated into a linear model (although most model fitting functions in R will automatically binarize categorical variables for us). See below for an example using the smoking example above:

## [1] "Smoker before Binarizing"
##   observation     smoker
## 1           1     Smoker
## 2           2 Non-Smoker
## 3           3     Smoker
## 4           4 Non-Smoker
## 5           5 Non-Smoker
## 6           6    Unknown
## [1] "Smoker - Binarized"
##   smoker.Non-Smoker smoker.Smoker smoker.Unknown
## 1                 0             1              0
## 2                 1             0              0
## 3                 0             1              0
## 4                 1             0              0
## 5                 1             0              0
## 6                 0             0              1
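
A sketch of how such indicator columns could be produced by hand with model.matrix() (the smoking data frame below is a hypothetical reconstruction of the table above; model fitting functions such as lm() perform this binarization automatically):

# hypothetical reconstruction of the smoker variable shown above
smoking <- data.frame(smoker = c("Smoker", "Non-Smoker", "Smoker",
                                 "Non-Smoker", "Non-Smoker", "Unknown"))
# full set of dummy columns (the -1 removes the intercept so every level gets a column)
model.matrix(~ smoker - 1, data = smoking)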

After binarizing, the categorical variable enters our linear model via the following form:

\[\begin{equation} Y = \beta_0 + \beta_1 \times \text{smoker.Smoker} + \beta_2 \times \text{smoker.Unknown} + \varepsilon \end{equation}\]

Note the following on model form and interpretation:

  • The smoker.Non-Smoker is left out – two reasons follow:
    • Intuitive: There’s duplicate info. We don’t need it since \(0\) values in the other variables imply that the obs is flagged as a non-Smoker
    • Technical: A perfect linear relationship between the predictors (in this case the dummy variables) destabilizes the mathematical procedure used to fit the model and would also undermine the precision of the coefficient estimates (high standard errors)
    • We always leave one level out as the “baseline”; R will auto-assign the first alpha-numeric level as the baseline, HOWEVER we should typically set the baseline to the level containing the most obs (in order to ensure the significance statistics are the most accurate)
  • A categorical predictor with \(k\) levels will be represented in the model by \(k - 1\) dummy predictors
  • Coefficients can be interpreted as: “holding all other predictors constant, the expected target differs from that of the baseline level by \(\beta_j\)”

Interactions

An interaction term in our linear model represents how the relationship between our target \(Y\) and one of our predictors \(X_i\) varies/depends on another predictor \(X_j\). Examples include:

  • The relationship between Income and Age may depend on Gender
  • The relationship between claim amount and age may vary by region
  • The relationship between test scores and income may depend on race

Two types of interactions we’ll encounter are (1) between continuous and categorical and (2) between categorical and categorical

Interactions b/w Continuous and Categorical

  • If \(X_1\) is continuous and \(X_2\) is categorical then:

\[\mathbb{E}[Y] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2\] \[= \begin{cases} \beta_0 + \beta_1 X_1 \text{ , if } X_2 = 0 \\ (\beta_0 + \beta_2) + (\beta_1 + \beta_3) X_1 \text{ , if } X_2 = 1 \end{cases}\]

  • Note that the coefficient of the interaction term, \(\beta_3\), modifies \(\beta_1\) when \(X_2 = 1\). This is the interaction effect (i.e. how our categorical variable changes the relationship between our target and the continuous predictor)
  • We also note that \(\beta_2\) adjusts the intercept term \(\beta_0\) – this is the main effect our categorical variable \(X_2\) has on our response
  • If \(\beta_3\) ends up statistically insignificant, then we can state that there is no evidence of an interaction effect between \(X_1\) and \(X_2\)
  • Graphically, the presence of an interaction term means the smoothed lines on a scatterplot of \(Y\) against \(X_1\) would differ across levels of the categorical variable \(X_2\) (a minimal lm() sketch of specifying such an interaction follows)
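
A minimal lm() sketch of specifying such an interaction, reusing the persinj variables from Section 2 (op_time continuous, legrep categorical); the * operator expands to the main effects plus the interaction term:

# log(amt) ~ op_time + legrep + op_time:legrep
fit.int <- lm(log(amt) ~ op_time * legrep, data = persinj)
summary(fit.int)  # the op_time:legrep coefficient is the interaction effect (beta_3)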

Interactions b/w Two Categorical Vars

  • Similar to the algebra behind an interaction between a continuous and categorical variable, where the interaction term is represented as the product of the two categorical features
  • Intuitively, an interaction b/w two categorical features would mean the same: how the response is related to one categorical predictor is dependent on the level of a separate categorical variable
    • E.g. Income’s relationship to Race may be dependent on Gender, or the effect Race has on Income is affected by Gender
  • Graphically, we can use split box plots (split by one categorical predictor), faceted by a second categorical predictor – see an example below
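
A sketch of such a plot, reusing the persinj variables from Section 2 (boxes split by legrep on the x-axis and faceted by injury code):

# split box plots of log(amt) by legrep, faceted by inj
library(ggplot2)
ggplot(persinj, aes(x = legrep, y = log(amt), fill = legrep)) +
  geom_boxplot() +
  facet_wrap(~ inj)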

3.2.4a Feature Selection

Given that we understand all the potential predictors that we could use in our model, the next goal is to decide on the form of the model – i.e. start eliminating unnecessary features to arrive at a parsimonious model that doesn’t overfit the training data (lower variance) and performs better against a test dataset. There are a few ways we can perform feature selection, including:

  • t-tests, dropping statistically insignificant features
  • Automated techniques such as stepwise selection algorithms (a minimal step() sketch follows this list):
    • Backward selection: Start with a saturated model (all features, including interactions) and then work backwards by cutting off predictors with the least (stat) significance. Repeat until no features can be dropped to improve the model
    • Forward selection: Start off with no features and add features that improve the model the most until no features can be added to improve the model
    • Typically, a loglikelihood metric such as AIC or BIC is used to benchmark model performance
    • NOTE although stepwise algorithms are computationally efficient to execute, they may overlook potentially well-performing models since there’s no guarantee they will yield the best subset of features (due to their dependence on the order in which features are added or dropped)
  • Another form of feature selection is to implement REGULARIZATION
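
Returning to the stepwise algorithms above, a minimal step() sketch of backward selection by AIC (the toy lm() fit on mtcars is illustrative only; MASS::stepAIC() works similarly, and setting k = log(n) would effectively switch the criterion to BIC):

full <- lm(mpg ~ wt + hp + disp + qsec, data = mtcars)  # start from a larger model
step(full, direction = "backward", k = 2)               # drop features while AIC improves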

3.2.4b Regularization

Regularization is an alternative form of feature selection (versus stepwise algorithms) used to reduce model complexity. Instead of iterating through features to either drop or add, regularization adjusts the estimated coefficients by shrinking the estimated magnitude of their effect on the target. Ideally, the features with the least predictive power have their coefficients reduced towards zero.

How Regularization Works

When fitting a linear model using OLS, we estimate the regression coefficients \(\boldsymbol{\beta}\) by minimizing the residual sum of squares (RSS)

\[\begin{equation} \sum_{i = 1}^{n} [Y_i - (\beta_0 + \beta_1 X_{i1} + ... + \beta_p X_{ip})]^2 \end{equation}\]

With regularization, we estimate the regression coefficients by minimizing a similar function, with an added regularization term:

\[\begin{equation} \sum_{i = 1}^{n} [Y_i - (\beta_0 + \beta_1 X_{i1} + ... + \beta_p X_{ip})]^2 + \lambda f_R(\boldsymbol{\beta}) \end{equation}\]

where

  • \(\boldsymbol{\beta}\) represents the vector of coefficients to be estimated
  • \(\lambda \ge 0\) is our regularization parameter; it controls how much to regularize and represents how strongly we prefer a simpler model
  • \(f_R (\boldsymbol{\beta})\) is the regularization penalty – this is what distinguishes methods (e.g. Ridge, Lasso, elastic-net)
    • \(f_R (\boldsymbol{\beta}) = \sum_{j=1}^{p} \beta_j^2\) corresponds to Ridge regression
    • \(f_R (\boldsymbol{\beta}) = \sum_{j=1}^{p} | \beta_j |\) corresponds to Lasso
    • \(f_R (\boldsymbol{\beta}) = (1- \alpha)\sum_{j=1}^{p} \beta_j^2 + \alpha \sum_{j=1}^{p} | \beta_j |\) corresponds to elastic-net regression
    • Note that elastic-net is a general version of both Ridge and Lasso, where \(\alpha\) is a mixing parameter – increasing leans towards Lasso, decreasing towards Ridge
  • The spirit is to minimize RSS while also trying to minimize model complexity via minimizing the regularization penalty
    • Note that Ridge (1) will not shrink coefficients exactly to 0 (even at a high \(\lambda\)), and (2) its penalty is relatively heavier for coefficients with large magnitude (\(|\beta_j| > 1\))
    • Lasso (1) can shrink coefficients completely to 0 (if \(\lambda\) is high enough) – a form of feature selection – and (2) its penalty is relatively heavier for coefficients with small magnitude (\(|\beta_j| < 1\)). Thus lasso produces simpler, sparser models that carry fewer features – useful if we only want to identify the KEY factors
    • ALSO, note that the intercept \(\beta_0\) is not part of the regularization penalty since it’s not associated with any predictors

By reducing model complexity, and thus the variance of the predictions of the model (at the expense of introducing a little more bias), the goal is to increase the predictive power of our model.

Effects of the regularization parameter

Note that unlike OLS, regularized regression produces a family of coefficient estimates that depend on the \(\lambda\) specified in the model (i.e. \(\boldsymbol{\hat{\beta}}_\lambda\) rather than \(\boldsymbol{\hat{\beta}}\)). \(\lambda\) represents the tradeoff between model fit and model complexity:

  • \(\lambda = 0\): no regularization (the OLS estimates are recovered)
  • As \(\lambda\) increases, the effect of regularization strengthens – coefficient estimates shrink towards zero, model flexibility drops, variance decreases and bias increases

Hyperparameter Tuning

\(\lambda\) and \(\alpha\) are considered hyperparameters – pre-specified inputs that go into model fitting, but aren’t determined via the optimization procedure.

We typically “tune” these parameters via cross-validation by:

  • Construct a grid of possible values for \((\lambda, \alpha)\)
  • Run cross-validation: for each candidate \((\lambda, \alpha)\), fit the regularized regression on the training folds and evaluate its accuracy on the held-out fold, across all folds of the cross-validation
  • Select the \((\lambda, \alpha)\) pair that yields the smallest cross-validation error (e.g. average RMSE); a minimal cv.glmnet() sketch follows this list
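
A minimal cv.glmnet() sketch of this tuning, assuming the glmnet package is installed; cv.glmnet() cross-validates over a path of \(\lambda\) values for a fixed \(\alpha\), so the \(\alpha\) grid is handled with a small loop (mtcars is used purely for illustration):

library(glmnet)
set.seed(100)
X <- model.matrix(mpg ~ ., data = mtcars)[, -1]  # predictor matrix (drop the intercept column)
Y <- mtcars$mpg
for (a in c(0, 0.5, 1)) {                        # small alpha grid: ridge, elastic-net, lasso
  cv <- cv.glmnet(X, Y, alpha = a, nfolds = 5)   # cross-validates over a lambda path
  cat("alpha =", a, " best lambda =", cv$lambda.min,
      " CV error =", min(cv$cvm), "\n")
}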

Pros/Cons of Regularization Techniques

PROS

  • Can be implemented using the glmnet() function in R, which requires binarization of categorical variables – a benefit, since the effect of EACH factor level is then shrunk/assessed separately, not just the predictor as a whole
  • Computationally more efficient than stepwise algorithms

CONS

  • May not produce the most interpretable models (especially with Ridge, since all features will be retained)
  • Numeric features need to be standardized to make sure they’re on the same scale during the model fitting process. This may make the coefficients harder to interpret
  • glmnet() is restricted in terms of model forms – it cannot accommodate all GLM distributions (such as the gamma)

3.3 Case Study 1: Fitting a Linear Model in R

In this case study, our objective is to:

  • Fit a multiple linear regression model using lm() and interpret the summary() output
  • Generate additional features, including interaction and polynomial terms
  • Partition the data into training/test sets using the createDataPartition() function
  • Generate predictions using the predict() function
  • Compare the predictive performance of different linear models

We’ll take a look at the ADVERTISING data set – from the str() function we see that there are 200 observations and 4 variables (after dropping the index variable X)

# CHUNK 1
ad <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/Advertising.csv")
head(ad)
##   X    TV radio newspaper sales
## 1 1 230.1  37.8      69.2  22.1
## 2 2  44.5  39.3      45.1  10.4
## 3 3  17.2  45.9      69.3   9.3
## 4 4 151.5  41.3      58.5  18.5
## 5 5 180.8  10.8      58.4  12.9
## 6 6   8.7  48.9      75.0   7.2
# Remove the index column X (remember to append the name of the dataset: ad$X)
ad$X <- NULL

str(ad)  # first 3 variables represent the amount of budget in thousands of dollars, sales represents the unit sales
## 'data.frame':    200 obs. of  4 variables:
##  $ TV       : num  230.1 44.5 17.2 151.5 180.8 ...
##  $ radio    : num  37.8 39.3 45.9 41.3 10.8 48.9 32.8 19.6 2.1 2.6 ...
##  $ newspaper: num  69.2 45.1 69.3 58.5 58.4 75 23.5 11.6 1 21.2 ...
##  $ sales    : num  22.1 10.4 9.3 18.5 12.9 7.2 11.8 13.2 4.8 10.6 ...

Our objective here is to predict sales using the advertising dollars spent through different media channels as predictors. We want to produce recommendations/solutions to the company to boost sales.

Data Exploration

Our first step should always be to get a basic understanding of the key characteristics of the variables in our data.

# CHUNK 2 - Summary statistics
summary(ad)
##        TV             radio          newspaper          sales      
##  Min.   :  0.70   Min.   : 0.000   Min.   :  0.30   Min.   : 1.60  
##  1st Qu.: 74.38   1st Qu.: 9.975   1st Qu.: 12.75   1st Qu.:10.38  
##  Median :149.75   Median :22.900   Median : 25.75   Median :12.90  
##  Mean   :147.04   Mean   :23.264   Mean   : 30.55   Mean   :14.02  
##  3rd Qu.:218.82   3rd Qu.:36.525   3rd Qu.: 45.10   3rd Qu.:17.40  
##  Max.   :296.40   Max.   :49.600   Max.   :114.00   Max.   :27.00

Only newspaper looks a little right-skewed since its mean is higher than its median – it also looks like there is an outlier at the max.

We’ll supplement our summary statistics with some graphs – since they’re all numeric variables, we’ll construct some histograms to gauge the shape of each variable’s distribution. (Box plots could also be drawn.)

# CHUNK 3 - Univariate Data Exploration
library(ggplot2)
library(gridExtra)

p1 <- ggplot(ad, aes(x = sales)) +
  geom_histogram()

p2 <- ggplot(ad, aes(x = TV)) +
  geom_histogram()

p3 <- ggplot(ad, aes(x = radio)) +
  geom_histogram()

p4 <- ggplot(ad, aes(x = newspaper)) +
  geom_histogram()

grid.arrange(p1, p2, p3, p4, ncol = 2)

From the histograms we note:

  • Sales seems normally distributed (symmetric/bell-shaped) – a linear model seems appropriate
  • The other variables don’t look normal, but they are not terribly skewed and show no obvious outliers
  • Only the newspaper variable seems right-skewed, but we won’t transform it or investigate outliers here

In CHUNK 4, we perform some bivariate data exploration by generating scatterplots to gauge the relationships between our potential predictors and the sales variable:

# CHUNK 4 - Bivariate data exploration
p1 <- ggplot(ad, aes(x = TV, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p2 <- ggplot(ad, aes(x = radio, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p3 <- ggplot(ad, aes(x = newspaper, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

grid.arrange(p1, p2, p3, ncol = 2)

The scatterplots show a clear positive relationship between TV and sales. The relationship between radio and sales is also positive, but the errors seem to fan out for higher values. The relationship between sales and newspaper, however, isn’t clear.
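
As a numeric complement to the scatterplots, a quick sketch of the pairwise correlations (all four variables are numeric, so cor() can be applied to the whole data frame):

cor(ad)  # check which medium has the strongest linear association with sales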

3.3.1 Simple Linear Regression

We’ll first fit a “simple” linear regression model (just one predictor). From our plots, we see that there seems to be a strong positive linear relationship with TV, so we’ll begin with that.

# CHUNK 5 - fitting the SLR
model.slr <- lm(sales ~ TV, data = ad)
model.slr
## 
## Call:
## lm(formula = sales ~ TV, data = ad)
## 
## Coefficients:
## (Intercept)           TV  
##     7.03259      0.04754
# CHUNK 6 - interpreting  the summary output
s <- summary(model.slr)
s
## 
## Call:
## lm(formula = sales ~ TV, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.3860 -1.9545 -0.1913  2.0671  7.2124 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.032594   0.457843   15.36   <2e-16 ***
## TV          0.047537   0.002691   17.67   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.259 on 198 degrees of freedom
## Multiple R-squared:  0.6119, Adjusted R-squared:  0.6099 
## F-statistic: 312.1 on 1 and 198 DF,  p-value: < 2.2e-16
# CHUNK 7
confint(model.slr)
##                  2.5 %     97.5 %
## (Intercept) 6.12971927 7.93546783
## TV          0.04223072 0.05284256

Fitting the model and interpreting the summary

  • CHUNK 5: We fit the model. The object will return the coefficients and call the original model that was fit
    • The output can be written as: \(\widehat{\text{sales}} = 7.03259 + 0.04754 \times \text{TV}\)
    • We can interpret the intercept \(\beta_0\) as: when no amount is spent on TV advertising, the expected amount of sales is estimated to be about 7033. Note that the interpretation of \(\beta_0\) may not always make sense (e.g. AGE = 0 for income)
    • The slope, \(\beta_{TV}\), can be interpreted as: for every additional $1000 spent on TV advertising, the expected amount of sales is estimated to increase by 47.5 units.
    • Note that the lm object returned is a list with several useful components, such as: coefficients, residuals, fitted.values

There are also supporting functions for lm() objects that include:

  • summary(): (See CHUNK 6 above) Returns a detailed analysis of the fitted model
    • Note that the t-statistics and p-values represent the significance test for each predictor as discussed earlier
    • Overall model quality can be assessed by looking at the \(R^2\) and the F-statistic (which tests whether at least one predictor is significant, against the null that all slope coefficients are zero)
  • coefficients() or coef(): Returns the coefficient estimates
  • confint(): (See CHUNK 7 above) Returns the confidence intervals (95% by default) of the estimated parameters
  • residuals() or resid(): Returns a vector of the raw residuals \(e_i = y_i - \hat{y_i}\)
  • anova(): Returns the ANOVA table for the fitted model; also can be used to compare against other linear models
  • AIC() or BIC(): Returns the AIC or BIC of the model
  • plot(): Returns four diagnostic plots for evaluating the appropriateness of the model
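
A minimal sketch applying a few of these helpers to the model.slr object fitted above (output omitted):

coef(model.slr)         # coefficient estimates
head(resid(model.slr))  # first few raw residuals
AIC(model.slr)          # Akaike information criterion
BIC(model.slr)          # Bayesian information criterion
anova(model.slr)        # ANOVA table for the fitted model
par(mfrow = c(2, 2))    # arrange the four diagnostic plots in a 2 x 2 grid
plot(model.slr)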

Making predictions using the predict() function

Given the estimated parameters, we can use our fitted model to produce a set of predicted (fitted) values \(\boldsymbol{\hat{y}}\) for a given set of predictors \(\boldsymbol{X}\):

# CHUNK 8  - pull first few fitted values
head(predict(model.slr))
##         1         2         3         4         5         6 
## 17.970775  9.147974  7.850224 14.234395 15.627218  7.446162
# same as (since it is a simple linear model with no transformation)

head(fitted.values(model.slr))
##         1         2         3         4         5         6 
## 17.970775  9.147974  7.850224 14.234395 15.627218  7.446162
# CHUNK 9 - pull fitted values on a new dataset (useful to apply on TEST data)
df <- data.frame(TV = seq(0, 300, by = 100))
predict(model.slr, newdata = df)
##         1         2         3         4 
##  7.032594 11.786258 16.539922 21.293586

Additionally, the predict() function can return a confidence (for the mean of the target) or a prediction interval (for the individual target variables):

# CHUNK 10
pred.cint <- predict(model.slr, interval = "confidence", level = 0.95)
head(pred.cint)  # confidence interval (narrower; reflects sampling error only)
##         fit       lwr       upr
## 1 17.970775 17.337774 18.603775
## 2  9.147974  8.439101  9.856848
## 3  7.850224  7.024932  8.675515
## 4 14.234395 13.779384 14.689405
## 5 15.627218 15.138794 16.115642
## 6  7.446162  6.582865  8.309460
pred.int <- predict(model.slr, interval = "prediction", level = 0.95)
## Warning in predict.lm(model.slr, interval = "prediction", level = 0.95): predictions on current data refer to _future_ responses
head(pred.int)  # prediction interval (error is wider, comes from sampling and idiosyncratic error)
##         fit        lwr      upr
## 1 17.970775 11.5135459 24.42800
## 2  9.147974  2.6828666 15.61308
## 3  7.850224  1.3713181 14.32913
## 4 14.234395  7.7921786 20.67661
## 5 15.627218  9.1825560 22.07188
## 6  7.446162  0.9623058 13.93002
# plot prediction bands/interval
ad.pred <- cbind(ad, pred.int)  # append lwr and upper bounds of prediction intervals
ggplot(ad.pred, aes(x = TV, y = sales)) +
  geom_point() +
  geom_smooth(method = "lm") +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed")

CHUNK 11 is an additional exercise that finds the TV observation with the narrowest 95% prediction interval:

# CHUNK 11 - determine the value of TV obs with the narrowest 95% pred interval for sales
ad.pred[which.min(ad.pred$upr - ad.pred$lwr), "TV"]
## [1] 147.3
mean(ad.pred$TV)  # the standard error of the prediction is typically smallest when x is closest to the sample mean of x, so this serves as a check
## [1] 147.0425
#sum((ad.pred$TV - mean(ad.pred$TV))^2)

CHUNK 12 is an example where we fit sales against radio and fit another model using newspaper instead. Which medium would we choose, including the model with TV we fit earlier?

From the outputs below, probably TV – the adjusted \(R^2\) indicates that TV explains more of the variation in the training observations of sales. Note that it is NOT a good idea to judge which predictor is most predictive based on the magnitude of the estimated \(\beta\) alone, since that ignores the variability/standard error of the estimate.

# CHUNK 12
slr.radio <- lm(sales ~ radio, data = ad)
slr.newspaper <- lm(sales ~ newspaper, data = ad)

summary(slr.radio)
## 
## Call:
## lm(formula = sales ~ radio, data = ad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.7305  -2.1324   0.7707   2.7775   8.1810 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.31164    0.56290  16.542   <2e-16 ***
## radio        0.20250    0.02041   9.921   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.275 on 198 degrees of freedom
## Multiple R-squared:  0.332,  Adjusted R-squared:  0.3287 
## F-statistic: 98.42 on 1 and 198 DF,  p-value: < 2.2e-16
summary(slr.newspaper)
## 
## Call:
## lm(formula = sales ~ newspaper, data = ad)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.2272  -3.3873  -0.8392   3.5059  12.7751 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.35141    0.62142   19.88  < 2e-16 ***
## newspaper    0.05469    0.01658    3.30  0.00115 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.092 on 198 degrees of freedom
## Multiple R-squared:  0.05212,    Adjusted R-squared:  0.04733 
## F-statistic: 10.89 on 1 and 198 DF,  p-value: 0.001148
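
To make the comparison explicit, we can extract the adjusted \(R^2\) of each single-predictor model programmatically (a sketch using the models fitted above):

summary(model.slr)$adj.r.squared        # TV
summary(slr.radio)$adj.r.squared        # radio
summary(slr.newspaper)$adj.r.squared    # newspaper -- smallest of the three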

3.3.2 Multiple Linear Regression

We now incorporate all three predictors into our model for a more informative model, and also test various combinations of the three to improve our prediction accuracy. We want to:

  • Develop linear models to predict sales on the basis of the three ad media and identify which medium is most significant in affecting sales
  • Quantify the prediction accuracy of the linear models developed and make a recommendation as to which model to use

We’ll try 4 different models:

Model 1: All three media

# CHUNK 13
# Long way
model.1 <- lm(sales ~ TV + radio + newspaper, data = ad)
# OR the shorthand...
model.1 <- lm(sales ~ ., data = ad)  # use a . after the tilde for all predictors in the data set
s <- summary(model.1)
s
## 
## Call:
## lm(formula = sales ~ ., data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8277 -0.8908  0.2418  1.1893  2.8292 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.938889   0.311908   9.422   <2e-16 ***
## TV           0.045765   0.001395  32.809   <2e-16 ***
## radio        0.188530   0.008611  21.893   <2e-16 ***
## newspaper   -0.001037   0.005871  -0.177     0.86    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.686 on 196 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8956 
## F-statistic: 570.3 on 3 and 196 DF,  p-value: < 2.2e-16

We can interpret the coefficients here as:

  • When no amount is spent on advertising in any of the three media types, the expected amount of sales is estimated to be about 2.94 (thousand).

  • Holding all other predictors/variables constant/fixed, every 1 (thousand) increase in TV spending leads to an expected increase in sales of about 45.8 units.
    • Similar interpretations can be drawn from the other two variables.

We also note the following re: significance:

  • The p-values for TV and radio are almost zero so we conclude that there is enough evidence that these advertising dollars via these two media have a statistically significant effect on sales (a favorable one that is consistent with our expectations)
  • Additionally given TV and radio, newspaper is NOT statistically significant anymore (and the coefficient is actually negative – not positive).
  • Note that the RSE dropped to 1.686 (from 3.259 in the SLR) and \(R^2\) has increased to about 0.897
  • The F-stat is also high with a small p-value indicating that there is enough evidence that at least one of the advertising media is an important predictor of sales

(Technical aside): See CHUNK 14 to see how newspaper not appearing significant in our model is consistent with our previous SLR, where newspaper did appear significant when it was the sole predictor:

# CHUNK 14 (Example 3.3.3)
# Calculating partial correlation by definition
m1 <- lm(sales ~ TV + radio, data = ad)  #regress target on all except predictor of int
m2 <- lm(newspaper ~ TV + radio, data = ad) #regress predict of int on all other predictors

cor(m1$residuals, m2$residuals)  # partial cor of newspaper and sales almost 0
## [1] -0.01262147
# Alternative formula to calculate partial correlation
t <- s$coefficients["newspaper", "t value"]
p <- length(coef(model.1)) - 1  # number of predictors in full model
t / sqrt(t^2 + nrow(ad) - p - 1) # partial cor of newspaper and sales almost 0
## [1] -0.01262147
cor(ad$sales, ad$newspaper)  # ordinary correlation is higher (around 0.23)
## [1] 0.228299
cor(ad$radio, ad$newspaper)  # ordinary cor between radio and newspaper
## [1] 0.3541038

Note that the partial correlation between sales and newspaper is almost zero, even though the ordinary correlation is around 0.23 – this is due to the correlation newspaper has with radio, which is the variable that actually drives sales.

Model 2: Only TV and Radio (drop newspaper)

In CHUNK 15 we refine our model by dropping newspaper since it proved to be insignificant with the other predictors available:

# CHUNK 15
model.2 <- lm(sales ~ TV + radio, data = ad)
summary(model.2)
## 
## Call:
## lm(formula = sales ~ TV + radio, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7977 -0.8752  0.2422  1.1708  2.8328 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.92110    0.29449   9.919   <2e-16 ***
## TV           0.04575    0.00139  32.909   <2e-16 ***
## radio        0.18799    0.00804  23.382   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.681 on 197 degrees of freedom
## Multiple R-squared:  0.8972, Adjusted R-squared:  0.8962 
## F-statistic: 859.6 on 2 and 197 DF,  p-value: < 2.2e-16
# OR regress sales on all media except newspaper
# model.2 <- lm(sales ~ . - newspaper, data = ad)

# Use anova() to compare saturated model and the model.2 (without newspaper)

anova(model.1, model.2)
## Analysis of Variance Table
## 
## Model 1: sales ~ TV + radio + newspaper
## Model 2: sales ~ TV + radio
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    196 556.83                           
## 2    197 556.91 -1 -0.088717 0.0312 0.8599

Note that:

  • No significant drop in \(R^2\); the adjusted \(R^2\) and the F-statistic have both increased.
  • The ANOVA test reflects the same result as the t-test on newspaper – newspaper is insignificant, and the p-value of the F-test equals that of the t-test (equivalently, the F-statistic is the square of the t-statistic, as verified in the sketch below)
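
A quick sketch of that equivalence using the objects above (the F-statistic of the comparison equals the square of newspaper’s t-statistic):

anova(model.1, model.2)$F[2]  # F-statistic for dropping newspaper
summary(model.1)$coefficients["newspaper", "t value"]^2  # squared t-statistic -- same value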

See manual for CHUNK 16 - shows 3D plot using scatter3d() that plots sales against TV and radio

Model 3: Model with TV and radio interaction

Next we test for an interaction effect between TV and radio on sales (e.g. the effect TV spending has on sales will depend on the level of radio spending):

# CHUNK 17
model.3 <- lm(sales ~ TV * radio, data = ad)

# OR
# model.3 <- lm(sales ~ TV + radio + TV:radio, data = ad)

summary(model.3)
## 
## Call:
## lm(formula = sales ~ TV * radio, data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3366 -0.4028  0.1831  0.5948  1.5246 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.750e+00  2.479e-01  27.233   <2e-16 ***
## TV          1.910e-02  1.504e-03  12.699   <2e-16 ***
## radio       2.886e-02  8.905e-03   3.241   0.0014 ** 
## TV:radio    1.086e-03  5.242e-05  20.727   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9435 on 196 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9673 
## F-statistic:  1963 on 3 and 196 DF,  p-value: < 2.2e-16

Note that:

  • There seems to be sufficient evidence that the effect of TV or radio advertising spending on sales depends on the spending in the other medium (i.e. their effects are not independent)
  • The positive sign on the interaction effect indicates that additional spending on either radio or TV boosts the effect the other type of advertising has on sales
  • The adjusted R-squared increases considerably to about 0.9673
  • The RSE also sharply decreases to 0.9435, indicating a better-fitting model
  • Model 3 so far is the superior model
  • Note it’s not always the case that the lower-order variables (e.g. the main effects) are significant in the presence of an interaction term. From the hierarchical principle, though, we prefer to keep the main effects in (usually so we maintain the typical interpretation of the interaction)

See manual for CHUNK 18, example finding the optimal ad dollar mix

Model 4: Model with TV and Radio + Interaction + Polynomial Terms

Now we add a polynomial term (for TV) to capture a more curved relationship between TV and sales given radio. Note that the addition of the higher-order term changes the interpretation of our coefficients: we can no longer say that TV’s effect on sales is quantified by its estimated \(\beta\) and the interaction effect given radio; the effect now also depends on the amount of TV spending itself:

\[ \frac{\partial}{\partial\text{TV}} \mathbb{E}[\text{sales}] = \beta_1 + 2 \beta_2 \times \text{TV} + \beta_3 \times \text{radio} \]

# CHUNK 19
model.4 <- lm(sales ~ TV * radio + I(TV^2), data = ad)  # I() insulates the expression so TV^2 is computed arithmetically rather than interpreted as a formula term
summary(model.4)
## 
## Call:
## lm(formula = sales ~ TV * radio + I(TV^2), data = ad)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.9949 -0.2969 -0.0066  0.3798  1.1686 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.137e+00  1.927e-01  26.663  < 2e-16 ***
## TV           5.092e-02  2.232e-03  22.810  < 2e-16 ***
## radio        3.516e-02  5.901e-03   5.959 1.17e-08 ***
## I(TV^2)     -1.097e-04  6.893e-06 -15.920  < 2e-16 ***
## TV:radio     1.077e-03  3.466e-05  31.061  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6238 on 195 degrees of freedom
## Multiple R-squared:  0.986,  Adjusted R-squared:  0.9857 
## F-statistic:  3432 on 4 and 195 DF,  p-value: < 2.2e-16

Note that:

  • The coefficient for \(TV^2\) is significant and reduces the RSE, with a small increase in the adjusted \(R^2\)
  • All coefficients are highly significant
  • Best model we’ve fit so far
  • We may consider excluding \(TV^2\), though, if we’d rather have a model whose coefficients are easier to interpret than one that fits the training data better.
    • Also we may want to test against a test dataset to measure how well the model performs on unseen data.
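
Tying this back to the partial-derivative expression above, a small sketch that evaluates the marginal effect of TV from the named coefficients of model.4 (the evaluation points TV = 150 and radio = 25 are arbitrary illustrations):

b <- coef(model.4)
b["TV"] + 2 * b["I(TV^2)"] * 150 + b["TV:radio"] * 25  # expected change in sales per extra $1000 of TV at these spending levels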

3.3.3 Evaluation of Linear Models

Although RSE and adjusted \(R^2\) can be used as indirect estimates of the test error of the linear model, it’s best to measure predictive performance using an independent test data set.

Again, remember that:

  • A training set – data used to construct our model and estimate parameters
  • A test set – data used to evaluate and make predictions of the model fitted on the training set

Creating the Training/Test Split

# CHUNK 20
#install.packages("caret")  # uncomment this line the first time you use caret
library(caret)
set.seed(1)  # set the random seed so that the results are reproducible
partition <- createDataPartition(ad$sales, p = 0.7, list = FALSE)  # returns a matrix of row indices for a stratified sample of ~70% of the data
train <- ad[partition, ]
test <- ad[-partition, ]

  • Note that createDataPartition() automatically generates a stratified sample based on the distribution of the target variable (instead of simple random sampling) – useful for ensuring that the training and test sets are both comparable and representative of the given dataset (a non-stratified alternative is sketched below)
    • Should point this out in the exam when describing how the training/test sets were determined
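
For contrast, a plain (non-stratified) random split could be sketched with base R’s sample(); the train.srs/test.srs names are just illustrative, and createDataPartition() remains preferable for the stratification reason above:

set.seed(1)
idx <- sample(seq_len(nrow(ad)), size = floor(0.7 * nrow(ad)))  # simple random sample of row indices
train.srs <- ad[idx, ]
test.srs <- ad[-idx, ]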

We can check for any imbalances in the partition by running summary statistics and comparing the distribution of variables between sets:

# CHUNK 21
print("TRAIN")
## [1] "TRAIN"
summary(train$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.60   10.43   12.90   13.96   17.30   26.20
print("TEST")
## [1] "TEST"
summary(test$sales)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.20   10.32   12.60   14.19   17.40   27.00

Fitting on Training and Testing on Test

We’ll now refit the previous four models for sales and type of ad spending (plus a fifth that adds a radio\(^2\) term) using the training set. We’ll then compare the models on the test set and see if model.4 still ends up the best.

# CHUNK 22 - Fit using Training set
model.1.tr <- lm(sales ~ TV + radio + newspaper, data = train)
model.2.tr <- lm(sales ~ TV + radio, data = train)
model.3.tr <- lm(sales ~ TV * radio, data = train)
model.4.tr <- lm(sales ~ TV * radio + I(TV^2), data = train)
model.5.tr <- lm(sales ~ TV * radio + I(TV^2) + I(radio^2), data = train)
# CHUNK 23 - Test
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

print("TRAIN")
## [1] "TRAIN"
rmse(train$sales, predict(model.1.tr))
## [1] 1.701546
rmse(train$sales, predict(model.2.tr))
## [1] 1.704158
rmse(train$sales, predict(model.3.tr))
## [1] 0.9645623
rmse(train$sales, predict(model.4.tr))
## [1] 0.6433654
rmse(train$sales, predict(model.5.tr))
## [1] 0.6394058
print("TEST")
## [1] "TEST"
rmse(test$sales, predict(model.1.tr, newdata = test))
## [1] 1.602705
rmse(test$sales, predict(model.2.tr, newdata = test))
## [1] 1.581734
rmse(test$sales, predict(model.3.tr, newdata = test))
## [1] 0.8645022
rmse(test$sales, predict(model.4.tr, newdata = test))
## [1] 0.5473428
rmse(test$sales, predict(model.5.tr, newdata = test))
## [1] 0.5589234

Interestingly, we note that the RMSE (a metric we previously did not use) of the models on the training set produces a conclusion consistent with what we previously thought: among the first four, model.4 has the best predictive power. This also makes sense since these models are (roughly) nested, with increasing levels of complexity, so the training fit can only improve as terms are added.

HOWEVER, on the TEST data set, we see that model.4 (interaction and TV polynomial, but no radio\(^2\) term) yields a lower RMSE than model.5, indicating that model.4 is slightly better on unseen data. This shows that predictive power deteriorates once the model becomes unnecessarily complicated (overfitting the training data).

3.4 Case Study 2: Feature Selection and Regularization

This next case study requires a more advanced/sophisticated linear model, building on the techniques employed in the previous case study, including:

  • Factoring in categorical predictors into the model
    • including binarizing categorical predictors explicitly using the dummyVars() function
  • Feature selection for a large set of predictors
    • including implementing tools like the stepAIC() function
  • Generate and interpret diagnostic plots to assess how well our model conforms to model assumptions
  • Perform regularization (to boost predictive power by decreasing model variance)
    • including employing ridge and lasso regression via glmnet()
    • tuning hyperparameters using cross validation and the cv.glmnet() function

3.4.1 Preparatory Work

Data Description/Data Exploration: the Target

We’ll use the Credit dataset in this case study that includes a wide variety of variables – both quantitative/numeric and categorical:

# CHUNK 1 - load credit data
Credit <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/Credit.csv")
# OR load the data directly from the ISLR package and delete the first column
#library(ISLR)
#data(Credit)
#Credit$X <- NULL

summary(Credit)
##      Income           Limit           Rating          Cards      
##  Min.   : 10.35   Min.   :  855   Min.   : 93.0   Min.   :1.000  
##  1st Qu.: 21.01   1st Qu.: 3088   1st Qu.:247.2   1st Qu.:2.000  
##  Median : 33.12   Median : 4622   Median :344.0   Median :3.000  
##  Mean   : 45.22   Mean   : 4736   Mean   :354.9   Mean   :2.958  
##  3rd Qu.: 57.47   3rd Qu.: 5873   3rd Qu.:437.2   3rd Qu.:4.000  
##  Max.   :186.63   Max.   :13913   Max.   :982.0   Max.   :9.000  
##       Age          Education        Gender    Student   Married  
##  Min.   :23.00   Min.   : 5.00   Female:207   No :360   No :155  
##  1st Qu.:41.75   1st Qu.:11.00   Male  :193   Yes: 40   Yes:245  
##  Median :56.00   Median :14.00                                   
##  Mean   :55.67   Mean   :13.45                                   
##  3rd Qu.:70.00   3rd Qu.:16.00                                   
##  Max.   :98.00   Max.   :20.00                                   
##             Ethnicity      Balance       
##  African American: 99   Min.   :   0.00  
##  Asian           :102   1st Qu.:  68.75  
##  Caucasian       :199   Median : 459.50  
##                         Mean   : 520.01  
##                         3rd Qu.: 863.00  
##                         Max.   :1999.00
str(Credit)  # note that all categorical variables have already been converted as factors
## 'data.frame':    400 obs. of  11 variables:
##  $ Income   : num  14.9 106 104.6 148.9 55.9 ...
##  $ Limit    : int  3606 6645 7075 9504 4897 8047 3388 7114 3300 6819 ...
##  $ Rating   : int  283 483 514 681 357 569 259 512 266 491 ...
##  $ Cards    : int  2 3 4 3 2 4 2 2 5 3 ...
##  $ Age      : int  34 82 71 36 68 77 37 87 66 41 ...
##  $ Education: int  11 15 11 11 16 10 12 9 13 19 ...
##  $ Gender   : Factor w/ 2 levels "Female","Male": 2 1 2 1 2 2 1 2 1 1 ...
##  $ Student  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 2 ...
##  $ Married  : Factor w/ 2 levels "No","Yes": 2 2 1 1 2 1 1 1 1 2 ...
##  $ Ethnicity: Factor w/ 3 levels "African American",..: 3 2 2 2 3 3 1 2 3 1 ...
##  $ Balance  : int  333 903 580 964 331 1151 203 872 279 1350 ...
# CHUNK 2 - histogram of our target (Balance)
library(ggplot2)
ggplot(Credit, aes(x = Balance)) +
  geom_histogram()

# Check prop of obs where target/Balance is zero
nrow(Credit[Credit$Balance == 0, ])/nrow(Credit)
## [1] 0.225

Note the following:

  • There are 400 obs and 11 variables to consider
  • Balance (credit balance) is our target variable – quantitative
    • Potentially right-skewed, may need to be transformed (confirmed in CHUNK 2)
    • A sizable share of the obs in this dataset have a 0 balance (about 22.5%), which will make logging our target problematic (log 0 is undefined) (see CHUNK 2)
    • The objective is to interpret key financial and demographic factors that relate to higher/lower balance using a linear model
  • Income, Limit, Rating, Cards, Age and Education are all quantitative predictors
  • Gender, Student, Married and Ethnicity are qualitative
  • Note that some predictors in practice may be sensitive to include in our model (e.g. ethnicity, gender)

Data Exploration: Numeric Predictors

CHUNK 3 runs a for loop to generate scatterplots of all the numeric predictors against the Balance:

# CHUNK 3 - Bivariate data exploration
# first save the names of the numeric predictors as a vector
vars.numeric <- colnames(Credit[, 1:6])
for (i in vars.numeric) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    labs(x = i)
  print(plot)
}

Note that:

  • Limit and Rating have a strong positive linear relationship with Balance
  • Balance and Income look linearly and positively related, but not as strongly
  • No real relationship between cards, age or education wrt Balance
  • Seems like only limit, rating and income may be important predictors, affecting Balance positively
  • NOTE: Graphs for Limit and Rating look almost exactly alike – in CHUNK 4 we check their correlation and plot them against each other, and conclude that one of them should be eliminated to avoid multicollinearity issues (and we only need one of them to retain essentially the same predictive power):
# CHUNK 4
cor(Credit$Limit, Credit$Rating)  # close to 100%
## [1] 0.9968797
ggplot(Credit, aes(x = Limit, y = Rating)) +
  geom_point()  # almost exactly linear

# we can also check for any other highly correlated predictors using a correlation matrix:
cor(Credit[, c(1:6, 11)])
##                Income       Limit      Rating       Cards         Age
## Income     1.00000000  0.79208834  0.79137763 -0.01827261 0.175338403
## Limit      0.79208834  1.00000000  0.99687974  0.01023133 0.100887922
## Rating     0.79137763  0.99687974  1.00000000  0.05323903 0.103164996
## Cards     -0.01827261  0.01023133  0.05323903  1.00000000 0.042948288
## Age        0.17533840  0.10088792  0.10316500  0.04294829 1.000000000
## Education -0.02769198 -0.02354853 -0.03013563 -0.05108422 0.003619285
## Balance    0.46365646  0.86169727  0.86362516  0.08645635 0.001835119
##              Education      Balance
## Income    -0.027691982  0.463656457
## Limit     -0.023548534  0.861697267
## Rating    -0.030135627  0.863625161
## Cards     -0.051084217  0.086456347
## Age        0.003619285  0.001835119
## Education  1.000000000 -0.008061576
## Balance   -0.008061576  1.000000000
Credit$Limit <- NULL

Data Exploration: Categorical Predictors

Next we explore the relationship between our Balance and the categorical predictors by first using a series of box plots:

# CHUNK 5 - Bivariate Data Exploration
# Save the names of the categorical predictors as a vector
vars.categorical <- c("Gender", "Student", "Married", "Ethnicity")
for (i in vars.categorical) {
  plot <- ggplot(Credit, aes(x = Credit[, i], y = Balance)) +
    geom_boxplot() +
    labs(x = i)
  print(plot)
}

We note:

  • Balance seems to only be different between whether the obs is a Student or not, with the median balance for students being higher than non-students
  • Overall, from the graphs only Income, Rating and Student seem like promising predictors of Balance

Selecting/Investigating Interactions

Interactions mean that the expected effect of one predictor on the target is dependent on another predictor. Note that:

  • We can either start by using prior knowledge to hypothesize which variables may have a likely interaction
  • Interactions can exist between predictors whose main effects aren’t significant
  • Still best to start with examining interactions through likely predictors using prior knowledge or by examining the “promising” predictors (in this case: Income, Rating and Student)

To investigate whether Student (our categorical predictor) interacts with Income or Rating’s effects on balance, we’ll generate scatterplots using color to group the student categories in CHUNK 6:

# CHUNK 6
p1 <- ggplot(Credit, aes(x = Income, y = Balance, color = Student)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

p2 <- ggplot(Credit, aes(x = Rating, y = Balance, color = Student)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

library(gridExtra)
grid.arrange(p1, p2, ncol = 2)

Note that:

  • There seems to be an interaction between Student and Income – notably, students’ credit balance seems to increase with income at a slower rate than non-students’ (see the sketch after this list)
    • This suggests we should include an \(Income \times Student\) interaction term in our model
  • No resemblance of an interaction at all between Student and Rating, as indicated by the parallel lines
    • We may still include it and let the feature selection process remove it if it proves insignificant
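
A quick numeric sketch of the differing slopes, fitting Balance on Income separately within each Student group (coefficients only):

coef(lm(Balance ~ Income, data = Credit[Credit$Student == "Yes", ]))  # slope for students
coef(lm(Balance ~ Income, data = Credit[Credit$Student == "No", ]))   # slope for non-students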

Create the training/test sets before building our model

# CHUNK 7
library(caret)
set.seed(42)
partition <- createDataPartition(Credit$Balance, p = 0.75, list = FALSE)
train <- Credit[partition, ]
test <- Credit[-partition, ]

print("TRAIN")
## [1] "TRAIN"
mean(train$Balance)
## [1] 531.1362
print("TEST")
## [1] "TEST"
mean(test$Balance)  # close to train set -- not as close as we'd prefer, but not egregiously different. We can use a different seed and see if we get something different though, although the difference may be attributed to the skewed nature of Balance (many zeroes)
## [1] 486.202

3.4.2 Feature Selection

Fitting the First Model

We’ll initially start by fitting 11 features (our 9 predictors and the 2 interactions):

# CHUNK 8
model.full <- lm(Balance ~ . + Income:Student + Rating:Student, data = train)
summary(model.full)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student + Rating:Student, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -204.83  -67.69  -10.17   60.36  300.13 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -568.87494   39.26643 -14.488  < 2e-16 ***
## Income               -7.41624    0.27847 -26.633  < 2e-16 ***
## Rating                3.93051    0.06245  62.936  < 2e-16 ***
## Cards                 5.52797    4.12314   1.341  0.18107    
## Age                  -0.73763    0.33635  -2.193  0.02910 *  
## Education            -0.30429    1.81060  -0.168  0.86665    
## GenderMale           10.56953   11.31508   0.934  0.35103    
## StudentYes          235.25194   53.25361   4.418 1.41e-05 ***
## MarriedYes           -4.01562   11.86499  -0.338  0.73528    
## EthnicityAsian       14.61313   16.33913   0.894  0.37187    
## EthnicityCaucasian   19.08945   14.06755   1.357  0.17585    
## Income:StudentYes    -2.47014    0.80176  -3.081  0.00226 ** 
## Rating:StudentYes     0.85412    0.20479   4.171 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.26 on 288 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9584 
## F-statistic: 577.2 on 12 and 288 DF,  p-value: < 2.2e-16

We note a few things re: our categorical predictors:

  • The lm() function will automatically binarize the categorical predictors and set a baseline for each based on alphanumeric order (e.g. MarriedNo is the baseline for the Married factors)
    • We can override this by defining the baseline level using the factor() function before fitting our model
  • The interpretation for categorical variables, such as StudentYes can be stated as: the expected balance for students is estimated to be 235.25 higher than the baseline (non-students), holding all other predictors constant.
  • Note that some of our other categorical predictors, such as Ethnicity have multiple levels (more than 2) and are insignificant. Sometimes (although probably not the case for this case study) high dimensional categorical predictors can cause coefficients to be insignificant due to small samples, leading to unreliable estimates with a high standard error. One solution is to combine levels as discussed previously

Releveling

Typically we also want to set the baseline to the level with the most observations (instead of the first level alphanumerically), to produce more stable coefficient estimates and more accurate significance tests regarding the coefficients. A baseline with a low number of observations could cause a level’s effect on the response to appear insignificant solely due to the small sample size/potential high variability of the baseline.

Note that there is no actual impact to the predictions produced by the model, although the estimates may change since the baseline has changed.
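
For a single factor, releveling is one call to relevel(); the loop in CHUNK 9 below simply automates this across all categorical predictors. A sketch for Married, whose most common level is "Yes" (the result is not assigned back here, since the loop handles that):

relevel(Credit$Married, ref = "Yes")  # returns the factor with "Yes" as the baseline level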

We achieve releveling using the relevel function as shown in the for loop in CHUNK 9:

# CHUNK 9
for (i in vars.categorical){
  # Use the table() function to calculate the frequencies for each factor
  table <- as.data.frame(table(Credit[, i]))
  # Determine the level with the highest frequency
  max <- which.max(table[, 2])
  # Save the name of the level with the highest frequency
  level.name <- as.character(table[max, 1])
  # Set the baseline level to the most populous level
  Credit[, i] <- relevel(Credit[, i], ref = level.name)
}

summary(Credit[, vars.categorical])  # Note that the new baseline is first now (MarriedYes and EthnicityCaucasian)
##     Gender    Student   Married              Ethnicity  
##  Female:207   No :360   Yes:245   Caucasian       :199  
##  Male  :193   Yes: 40   No :155   African American: 99  
##                                   Asian           :102

Now we refit our model after releveling to levels with most obs:

# CHUNK 10

# To make sure factors in the training set are releveled
train <- Credit[partition, ]
test <- Credit[-partition, ]

# Refit model with releveled factors
model.full <- lm(Balance ~ . + Income:Student + Rating:Student, data = train)

summary(model.full)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student + Rating:Student, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -204.83  -67.69  -10.17   60.36  300.13 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -553.80111   36.77148 -15.061  < 2e-16 ***
## Income                      -7.41624    0.27847 -26.633  < 2e-16 ***
## Rating                       3.93051    0.06245  62.936  < 2e-16 ***
## Cards                        5.52797    4.12314   1.341  0.18107    
## Age                         -0.73763    0.33635  -2.193  0.02910 *  
## Education                   -0.30429    1.81060  -0.168  0.86665    
## GenderMale                  10.56953   11.31508   0.934  0.35103    
## StudentYes                 235.25194   53.25361   4.418 1.41e-05 ***
## MarriedNo                    4.01562   11.86499   0.338  0.73528    
## EthnicityAfrican American  -19.08945   14.06755  -1.357  0.17585    
## EthnicityAsian              -4.47632   13.93650  -0.321  0.74830    
## Income:StudentYes           -2.47014    0.80176  -3.081  0.00226 ** 
## Rating:StudentYes            0.85412    0.20479   4.171 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.26 on 288 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9584 
## F-statistic: 577.2 on 12 and 288 DF,  p-value: < 2.2e-16

Note the following changes after releveling:

  • The estimates (coefficient, t-statistic and standard error) have all changed to reflect the new baselines
    • EthnicityAsian’s p-value is now the highest (before it was MarriedYes); this is important especially if our feature selection was determined by an algorithm where features with the highest p-values were first to drop.

Next Step: Removing non-Predictive Features

Since some of our features (maybe even most) may lack predictive power and result in overfitting (where the abundance of features improves the fit on the training data but makes the model too complex to predict well on unseen data), we need to start removing some and decide which features to retain.

One way we can go about cutting features is to drop statistically insignificant features from the model and retain the significant ones – the key being ONE feature at a time (since removing one feature may make another significant).
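
A quick sketch of locating the single least significant feature from the summary output (the pvals name is just illustrative):

pvals <- summary(model.full)$coefficients[-1, "Pr(>|t|)"]  # p-values, intercept excluded
pvals[which.max(pvals)]  # candidate feature to drop first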

Before we begin removing features, we should first Binarize our variables for a few reasons (despite functions like lm() automatically binarizing):

  • Without explicit binarization, functions like drop1() and stepAIC() treat a factor variable as a single feature (they either retain or drop the whole categorical predictor, all levels included)
  • Explicitly binarizing allows these methods to drop/retain individual factor levels
    • Useful to determine which levels can be combined with the baseline or with other levels
  • We achieve this using the dummyVars() function as shown in CHUNK 11:
# CHUNK 11 (Binarization)
library(caret)
binarizer <- dummyVars(paste("~", paste(vars.categorical, collapse = "+")),  # no target var in the formula
                       data = Credit, fullRank = TRUE)  # full Rank will drop the baseline
# OR type out the categorical predictors one by one
#binarizer <- dummyVars(~ Gender + Student + Married + Ethnicity,
#                       data = Credit, fullRank = TRUE)
binarized_vars <- data.frame(predict(binarizer, newdata = Credit))
head(binarized_vars)
##   Gender.Male Student.Yes Married.No Ethnicity.African.American
## 1           1           0          0                          0
## 2           0           1          0                          0
## 3           1           0          1                          0
## 4           0           0          1                          0
## 5           1           0          0                          0
## 6           1           0          1                          0
##   Ethnicity.Asian
## 1               0
## 2               1
## 3               1
## 4               1
## 5               0
## 6               0

Note the following:

  • The dummyVars() function works by feeding it a formula (no response) with the predictors to binarize, and a data.frame, along with specifying whether to include or exclude the baseline
    • Note that the PA modules and exam solutions say that dummyVars() requires categorical predictors to be characters (not factors), but that’s not true. We can run factors fine through dummyVars with no impact to our results
  • Note the fullRank argument represents whether to choose a full rank or less than full rank parameterization of the model matrix
    • TRUE will leave out the baseline, GOOD for regression
    • FALSE will include all levels, GOOD for PCA and cluster analyses
  • Only Ethnicity really needs to be binarized since it has more than 2 levels
  • We combine the binarized vars with the original dataset in CHUNK 12
    • We MUST be careful to also drop the original predictors (DO NOT repeat predictors to avoid RANK DEFICIENCY)
# CHUNK 12
Credit.bin <- cbind(Credit, binarized_vars)  # combine binarized variables
head(Credit.bin)
##    Income Rating Cards Age Education Gender Student Married Ethnicity
## 1  14.891    283     2  34        11   Male      No     Yes Caucasian
## 2 106.025    483     3  82        15 Female     Yes     Yes     Asian
## 3 104.593    514     4  71        11   Male      No      No     Asian
## 4 148.924    681     3  36        11 Female      No      No     Asian
## 5  55.882    357     2  68        16   Male      No     Yes Caucasian
## 6  80.180    569     4  77        10   Male      No      No Caucasian
##   Balance Gender.Male Student.Yes Married.No Ethnicity.African.American
## 1     333           1           0          0                          0
## 2     903           0           1          0                          0
## 3     580           1           0          1                          0
## 4     964           0           0          1                          0
## 5     331           1           0          0                          0
## 6    1151           1           0          1                          0
##   Ethnicity.Asian
## 1               0
## 2               1
## 3               1
## 4               1
## 5               0
## 6               0
# remove original predictors
Credit.bin$Gender <- NULL
Credit.bin$Student <- NULL
Credit.bin$Married <- NULL
Credit.bin$Ethnicity <- NULL

head(Credit.bin)
##    Income Rating Cards Age Education Balance Gender.Male Student.Yes
## 1  14.891    283     2  34        11     333           1           0
## 2 106.025    483     3  82        15     903           0           1
## 3 104.593    514     4  71        11     580           1           0
## 4 148.924    681     3  36        11     964           0           0
## 5  55.882    357     2  68        16     331           1           0
## 6  80.180    569     4  77        10    1151           1           0
##   Married.No Ethnicity.African.American Ethnicity.Asian
## 1          0                          0               0
## 2          0                          0               1
## 3          1                          0               1
## 4          1                          0               1
## 5          0                          0               0
## 6          1                          0               0
# repartition with the binarized vars; note we don't need to generate new indices, just reuse the same ones
train.bin <- Credit.bin[partition, ]
test.bin <- Credit.bin[-partition, ]

Feature Selection: Using drop1()

drop1() allows us to conduct backward selection (start with a complicated model and cut out features). The function will return the AICs of each model with the specified feature dropped – remember, a lower AIC is preferable:

# CHUNK 13 - on NON (explicitly) Binarized model
drop1(model.full, k = 2)  # by default k = 2 (i.e. AIC col represents AIC)
## Single term deletions
## 
## Model:
## Balance ~ Income + Rating + Cards + Age + Education + Gender + 
##     Student + Married + Ethnicity + Income:Student + Rating:Student
##                Df Sum of Sq     RSS    AIC
## <none>                      2724393 2768.3
## Cards           1     17004 2741397 2768.2
## Age             1     45496 2769889 2771.3
## Education       1       267 2724660 2766.3
## Gender          1      8254 2732647 2767.2
## Married         1      1084 2725476 2766.4
## Ethnicity       2     17551 2741944 2766.2
## Income:Student  1     89791 2814184 2776.1
## Rating:Student  1    164545 2888938 2784.0
drop1(model.full, k = log(nrow(train)))  # AIC col represents BIC (k = logn instead of 2)
## Single term deletions
## 
## Model:
## Balance ~ Income + Rating + Cards + Age + Education + Gender + 
##     Student + Married + Ethnicity + Income:Student + Rating:Student
##                Df Sum of Sq     RSS    AIC
## <none>                      2724393 2816.5
## Cards           1     17004 2741397 2812.7
## Age             1     45496 2769889 2815.8
## Education       1       267 2724660 2810.8
## Gender          1      8254 2732647 2811.7
## Married         1      1084 2725476 2810.9
## Ethnicity       2     17551 2741944 2807.0
## Income:Student  1     89791 2814184 2820.6
## Rating:Student  1    164545 2888938 2828.4

Note that:

  • Each candidate deletion removes the entire variable (e.g. Ethnicity) since the model passed through the function was our NON-binarized one
    • Ethnicity is related to 2 df since there are 2 non-baseline levels
    • Without explicit binarization the function treats Ethnicity as one predictor, ignoring the possibility that only one of the two non-baseline levels is insignificant (while the other one is significant wrt to the baseline)
  • Based on AIC, most improvement by dropping Ethnicity (same as BIC)
  • We’d then repeat by feeding a new fitted model WITHOUT Ethnicity through drop1() and iterate until dropping a feature no longer improves our model fit (sketched below)
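
A sketch of that next iteration, assuming Ethnicity is dropped first (update() refits without re-typing the full formula; the model.no.eth name is illustrative):

model.no.eth <- update(model.full, . ~ . - Ethnicity)  # refit without Ethnicity
drop1(model.no.eth, k = 2)  # re-examine the remaining features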

Let’s now try drop1() on the binarized model:

# CHUNK 14 - on (explicitly) Binarized model
model.full.bin <- lm(Balance ~ . + Income:Student.Yes + Rating:Student.Yes,
                     data = train.bin)
summary(model.full.bin)
## 
## Call:
## lm(formula = Balance ~ . + Income:Student.Yes + Rating:Student.Yes, 
##     data = train.bin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -204.83  -67.69  -10.17   60.36  300.13 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                -553.80111   36.77148 -15.061  < 2e-16 ***
## Income                       -7.41624    0.27847 -26.633  < 2e-16 ***
## Rating                        3.93051    0.06245  62.936  < 2e-16 ***
## Cards                         5.52797    4.12314   1.341  0.18107    
## Age                          -0.73763    0.33635  -2.193  0.02910 *  
## Education                    -0.30429    1.81060  -0.168  0.86665    
## Gender.Male                  10.56953   11.31508   0.934  0.35103    
## Student.Yes                 235.25194   53.25361   4.418 1.41e-05 ***
## Married.No                    4.01562   11.86499   0.338  0.73528    
## Ethnicity.African.American  -19.08945   14.06755  -1.357  0.17585    
## Ethnicity.Asian              -4.47632   13.93650  -0.321  0.74830    
## Income:Student.Yes           -2.47014    0.80176  -3.081  0.00226 ** 
## Rating:Student.Yes            0.85412    0.20479   4.171 4.02e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.26 on 288 degrees of freedom
## Multiple R-squared:  0.9601, Adjusted R-squared:  0.9584 
## F-statistic: 577.2 on 12 and 288 DF,  p-value: < 2.2e-16
drop1(model.full.bin, k = 2)
## Single term deletions
## 
## Model:
## Balance ~ Income + Rating + Cards + Age + Education + Gender.Male + 
##     Student.Yes + Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes + Rating:Student.Yes
##                            Df Sum of Sq     RSS    AIC
## <none>                                  2724393 2768.3
## Cards                       1     17004 2741397 2768.2
## Age                         1     45496 2769889 2771.3
## Education                   1       267 2724660 2766.3
## Gender.Male                 1      8254 2732647 2767.2
## Married.No                  1      1084 2725476 2766.4
## Ethnicity.African.American  1     17419 2741812 2768.2
## Ethnicity.Asian             1       976 2725369 2766.4
## Income:Student.Yes          1     89791 2814184 2776.1
## Rating:Student.Yes          1    164545 2888938 2784.0

We now note the following:

  • The summary output is identical to the automatically binarized model
  • The drop1() output for the explicitly binarized model however seems to suggest that it’s better to drop Education first, rather than one of the levels for Ethnicity
  • Ethnicity is also treated as two separate non-base levels, instead of just one single predictor
  • Note that the drop1() algorithm respects the hierarchy principle – main effects involved in an interaction (Income, Rating, Student.Yes) are not offered for removal while their interaction terms remain in the model; only the interaction terms and the stand-alone features are candidates for dropping

Feature Selection: Using stepAIC() - PREFERRED

The stepAIC() function runs either forward or backward selection based on AIC or BIC and terminates once the model fit can no longer be improved. At each step, the feature added or dropped is the one that improves the selection criterion the most relative to the current model.

Note the specifications:

  • Selection process: Use the argument direction = “backward” or “forward”
  • Selection criterion: Use the argument k = 2 (AIC, the default) or k = log(nrow(<training data.frame>)) (BIC)
  • See CHUNK 15 to see backwards selection using AIC implemented via the stepAIC() function
# CHUNK 15
#install.packages("MASS")  # uncomment this line the first time you use MASS
library(MASS)
model.backward <- stepAIC(model.full.bin)
## Start:  AIC=2768.3
## Balance ~ Income + Rating + Cards + Age + Education + Gender.Male + 
##     Student.Yes + Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Education                   1       267 2724660 2766.3
## - Ethnicity.Asian             1       976 2725369 2766.4
## - Married.No                  1      1084 2725476 2766.4
## - Gender.Male                 1      8254 2732647 2767.2
## - Cards                       1     17004 2741397 2768.2
## - Ethnicity.African.American  1     17419 2741812 2768.2
## <none>                                    2724393 2768.3
## - Age                         1     45496 2769889 2771.3
## - Income:Student.Yes          1     89791 2814184 2776.1
## - Rating:Student.Yes          1    164545 2888938 2784.0
## 
## Step:  AIC=2766.33
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Ethnicity.Asian             1      1045 2725705 2764.4
## - Married.No                  1      1103 2725763 2764.5
## - Gender.Male                 1      8155 2732815 2765.2
## - Cards                       1     17094 2741754 2766.2
## - Ethnicity.African.American  1     17600 2742260 2766.3
## <none>                                    2724660 2766.3
## - Age                         1     45305 2769965 2769.3
## - Income:Student.Yes          1     89550 2814210 2774.1
## - Rating:Student.Yes          1    164305 2888965 2782.0
## 
## Step:  AIC=2764.45
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Married.No                  1      1259 2726964 2762.6
## - Gender.Male                 1      8349 2734054 2763.4
## - Ethnicity.African.American  1     16662 2742367 2764.3
## - Cards                       1     16801 2742506 2764.3
## <none>                                    2725705 2764.4
## - Age                         1     45195 2770900 2767.4
## - Income:Student.Yes          1     88669 2814374 2772.1
## - Rating:Student.Yes          1    163417 2889122 2780.0
## 
## Step:  AIC=2762.59
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Ethnicity.African.American + Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Gender.Male                 1      8654 2735618 2761.5
## - Ethnicity.African.American  1     16083 2743048 2762.4
## - Cards                       1     17506 2744470 2762.5
## <none>                                    2726964 2762.6
## - Age                         1     44184 2771148 2765.4
## - Income:Student.Yes          1     90501 2817466 2770.4
## - Rating:Student.Yes          1    167204 2894168 2778.5
## 
## Step:  AIC=2761.54
## Balance ~ Income + Rating + Cards + Age + Student.Yes + Ethnicity.African.American + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Ethnicity.African.American  1     16455 2752073 2761.3
## - Cards                       1     17641 2753259 2761.5
## <none>                                    2735618 2761.5
## - Age                         1     45032 2780650 2764.5
## - Income:Student.Yes          1     93630 2829248 2769.7
## - Rating:Student.Yes          1    170632 2906251 2777.8
## 
## Step:  AIC=2761.35
## Balance ~ Income + Rating + Cards + Age + Student.Yes + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                      Df Sum of Sq     RSS    AIC
## - Cards               1     17287 2769361 2761.2
## <none>                            2752073 2761.3
## - Age                 1     47526 2799599 2764.5
## - Income:Student.Yes  1     98062 2850136 2769.9
## - Rating:Student.Yes  1    173001 2925074 2777.7
## 
## Step:  AIC=2761.23
## Balance ~ Income + Rating + Age + Student.Yes + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                      Df Sum of Sq     RSS    AIC
## <none>                            2769361 2761.2
## - Age                 1     44411 2813771 2764.0
## - Income:Student.Yes  1     98347 2867708 2769.7
## - Rating:Student.Yes  1    175231 2944592 2777.7
summary(model.backward)
## 
## Call:
## lm(formula = Balance ~ Income + Rating + Age + Student.Yes + 
##     Income:Student.Yes + Rating:Student.Yes, data = train.bin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -198.713  -68.905   -3.523   59.063  289.741 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -542.28051   23.17499 -23.399  < 2e-16 ***
## Income               -7.45135    0.27482 -27.114  < 2e-16 ***
## Rating                3.93723    0.06158  63.940  < 2e-16 ***
## Age                  -0.72304    0.33299  -2.171  0.03071 *  
## Student.Yes         232.28427   52.84056   4.396 1.54e-05 ***
## Income:Student.Yes   -2.56507    0.79384  -3.231  0.00137 ** 
## Rating:Student.Yes    0.87034    0.20179   4.313 2.20e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.05 on 294 degrees of freedom
## Multiple R-squared:  0.9594, Adjusted R-squared:  0.9586 
## F-statistic:  1159 on 6 and 294 DF,  p-value: < 2.2e-16

We note the following about the results of our backwards selected model:

  • stepAIC() will return the results of all iterations, showing the AIC in ascending order
  • The resulting model only includes Income, Rating, Age, Student and the interactions between Income and Student and Rating and Student
    • All of which are statistically significant at a 5% sig level
    • Consistent with what we gathered from our EDA earlier
    • In the exam, always summarize and compare the results of the resulting model (i.e. is everything significant? Is it consistent with our EDA? Does it make sense?)

Next, in CHUNK 16, we try implementing stepAIC() for forward selection, which requires an additional argument, scope, a list specifying the upper and lower limits of the model search. We’ll only note that the resulting model is identical to the backwards-selected model:

# CHUNK 16 - Forward Selection

# first fit the null model (i.e., model with no predictors)
model.null <- lm(Balance ~ 1, data = train.bin)

model.forward <- stepAIC(model.null, direction = "forward",
                         scope = list(upper = model.full.bin, lower = model.null))
## Start:  AIC=3713.8
## Balance ~ 1
## 
##                              Df Sum of Sq      RSS    AIC
## + Rating                      1  52118308 16130877 3281.6
## + Income                      1  16318500 51930685 3633.6
## + Student.Yes                 1   4815192 63433993 3693.8
## + Cards                       1    630541 67618644 3713.0
## <none>                                    68249185 3713.8
## + Gender.Male                 1     95205 68153981 3715.4
## + Married.No                  1     83052 68166134 3715.4
## + Age                         1     80005 68169181 3715.4
## + Ethnicity.African.American  1       422 68248763 3715.8
## + Ethnicity.Asian             1       387 68248799 3715.8
## + Education                   1       235 68248951 3715.8
## 
## Step:  AIC=3281.63
## Balance ~ Rating
## 
##                              Df Sum of Sq      RSS    AIC
## + Income                      1   8477509  7653369 3059.2
## + Student.Yes                 1   4454246 11676631 3186.4
## + Age                         1    408692 15722186 3275.9
## + Married.No                  1    109061 16021816 3281.6
## <none>                                    16130877 3281.6
## + Ethnicity.African.American  1    105708 16025169 3281.7
## + Cards                       1     90760 16040117 3281.9
## + Education                   1     17131 16113746 3283.3
## + Gender.Male                 1     15381 16115496 3283.3
## + Ethnicity.Asian             1     10848 16120029 3283.4
## 
## Step:  AIC=3059.21
## Balance ~ Rating + Income
## 
##                              Df Sum of Sq     RSS    AIC
## + Student.Yes                 1   4650209 3003160 2779.6
## + Age                         1     95929 7557440 3057.4
## + Married.No                  1     82760 7570608 3057.9
## <none>                                    7653369 3059.2
## + Ethnicity.Asian             1     33142 7620226 3059.9
## + Ethnicity.African.American  1     15649 7637719 3060.6
## + Education                   1     11275 7642093 3060.8
## + Gender.Male                 1      4288 7649080 3061.0
## + Cards                       1       806 7652562 3061.2
## 
## Step:  AIC=2779.63
## Balance ~ Rating + Income + Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## + Rating:Student.Yes          1     83132 2920027 2773.2
## + Age                         1     57608 2945552 2775.8
## + Ethnicity.African.American  1     20915 2982245 2779.5
## <none>                                    3003160 2779.6
## + Cards                       1     15989 2987171 2780.0
## + Gender.Male                 1     13881 2989279 2780.2
## + Ethnicity.Asian             1      3777 2999383 2781.2
## + Married.No                  1      2863 3000297 2781.3
## + Income:Student.Yes          1       921 3002239 2781.5
## + Education                   1       237 3002923 2781.6
## 
## Step:  AIC=2773.18
## Balance ~ Rating + Income + Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## + Income:Student.Yes          1    106256 2813771 2764.0
## + Age                         1     52320 2867708 2769.7
## + Ethnicity.African.American  1     23651 2896376 2772.7
## <none>                                    2920027 2773.2
## + Cards                       1     14186 2905842 2773.7
## + Gender.Male                 1     13792 2906236 2773.8
## + Ethnicity.Asian             1      1494 2918534 2775.0
## + Married.No                  1      1401 2918626 2775.0
## + Education                   1         5 2920022 2775.2
## 
## Step:  AIC=2764.02
## Balance ~ Rating + Income + Student.Yes + Rating:Student.Yes + 
##     Income:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## + Age                         1     44411 2769361 2761.2
## <none>                                    2813771 2764.0
## + Ethnicity.African.American  1     18521 2795250 2764.0
## + Cards                       1     14172 2799599 2764.5
## + Gender.Male                 1     10042 2803729 2764.9
## + Married.No                  1       322 2813449 2766.0
## + Ethnicity.Asian             1       208 2813563 2766.0
## + Education                   1       141 2813631 2766.0
## 
## Step:  AIC=2761.23
## Balance ~ Rating + Income + Student.Yes + Age + Rating:Student.Yes + 
##     Income:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## <none>                                    2769361 2761.2
## + Cards                       1   17287.2 2752073 2761.3
## + Ethnicity.African.American  1   16101.2 2753259 2761.5
## + Gender.Male                 1    9158.6 2760202 2762.2
## + Married.No                  1    1522.7 2767838 2763.1
## + Education                   1     410.7 2768950 2763.2
## + Ethnicity.Asian             1      82.0 2769279 2763.2
# upper should be the most saturated model we want to go to; lower should be the most bare model we want to go to

summary(model.forward)
## 
## Call:
## lm(formula = Balance ~ Rating + Income + Student.Yes + Age + 
##     Rating:Student.Yes + Income:Student.Yes, data = train.bin)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -198.713  -68.905   -3.523   59.063  289.741 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -542.28051   23.17499 -23.399  < 2e-16 ***
## Rating                3.93723    0.06158  63.940  < 2e-16 ***
## Income               -7.45135    0.27482 -27.114  < 2e-16 ***
## Student.Yes         232.28427   52.84056   4.396 1.54e-05 ***
## Age                  -0.72304    0.33299  -2.171  0.03071 *  
## Rating:Student.Yes    0.87034    0.20179   4.313 2.20e-05 ***
## Income:Student.Yes   -2.56507    0.79384  -3.231  0.00137 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 97.05 on 294 degrees of freedom
## Multiple R-squared:  0.9594, Adjusted R-squared:  0.9586 
## F-statistic:  1159 on 6 and 294 DF,  p-value: < 2.2e-16

CHUNK 17 runs both backward and forward selection using BIC as the criterion instead (by setting k = log(n) in stepAIC()), and we note the following:

  • Both backward and forward selection yield identical models; compared to the AIC-selected model, there is one fewer predictor (Age)
    • This makes sense since BIC is more conservative in terms of retaining predictors in our model due to a higher penalty for increased parameterization relative to AIC
# CHUNK 17 (Example 3.4.1)
model.backward.BIC <- stepAIC(model.full.bin, k = log(nrow(train.bin)))
## Start:  AIC=2816.5
## Balance ~ Income + Rating + Cards + Age + Education + Gender.Male + 
##     Student.Yes + Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Education                   1       267 2724660 2810.8
## - Ethnicity.Asian             1       976 2725369 2810.9
## - Married.No                  1      1084 2725476 2810.9
## - Gender.Male                 1      8254 2732647 2811.7
## - Cards                       1     17004 2741397 2812.7
## - Ethnicity.African.American  1     17419 2741812 2812.7
## - Age                         1     45496 2769889 2815.8
## <none>                                    2724393 2816.5
## - Income:Student.Yes          1     89791 2814184 2820.6
## - Rating:Student.Yes          1    164545 2888938 2828.4
## 
## Step:  AIC=2810.82
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Ethnicity.Asian + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Ethnicity.Asian             1      1045 2725705 2805.2
## - Married.No                  1      1103 2725763 2805.2
## - Gender.Male                 1      8155 2732815 2806.0
## - Cards                       1     17094 2741754 2807.0
## - Ethnicity.African.American  1     17600 2742260 2807.1
## - Age                         1     45305 2769965 2810.1
## <none>                                    2724660 2810.8
## - Income:Student.Yes          1     89550 2814210 2814.8
## - Rating:Student.Yes          1    164305 2888965 2822.7
## 
## Step:  AIC=2805.23
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Married.No + Ethnicity.African.American + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Married.No                  1      1259 2726964 2799.7
## - Gender.Male                 1      8349 2734054 2800.4
## - Ethnicity.African.American  1     16662 2742367 2801.3
## - Cards                       1     16801 2742506 2801.4
## - Age                         1     45195 2770900 2804.5
## <none>                                    2725705 2805.2
## - Income:Student.Yes          1     88669 2814374 2809.2
## - Rating:Student.Yes          1    163417 2889122 2817.1
## 
## Step:  AIC=2799.66
## Balance ~ Income + Rating + Cards + Age + Gender.Male + Student.Yes + 
##     Ethnicity.African.American + Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Gender.Male                 1      8654 2735618 2794.9
## - Ethnicity.African.American  1     16083 2743048 2795.7
## - Cards                       1     17506 2744470 2795.9
## - Age                         1     44184 2771148 2798.8
## <none>                                    2726964 2799.7
## - Income:Student.Yes          1     90501 2817466 2803.8
## - Rating:Student.Yes          1    167204 2894168 2811.9
## 
## Step:  AIC=2794.91
## Balance ~ Income + Rating + Cards + Age + Student.Yes + Ethnicity.African.American + 
##     Income:Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## - Ethnicity.African.American  1     16455 2752073 2791.0
## - Cards                       1     17641 2753259 2791.1
## - Age                         1     45032 2780650 2794.1
## <none>                                    2735618 2794.9
## - Income:Student.Yes          1     93630 2829248 2799.3
## - Rating:Student.Yes          1    170632 2906251 2807.4
## 
## Step:  AIC=2791
## Balance ~ Income + Rating + Cards + Age + Student.Yes + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                      Df Sum of Sq     RSS    AIC
## - Cards               1     17287 2769361 2787.2
## - Age                 1     47526 2799599 2790.4
## <none>                            2752073 2791.0
## - Income:Student.Yes  1     98062 2850136 2795.8
## - Rating:Student.Yes  1    173001 2925074 2803.7
## 
## Step:  AIC=2787.18
## Balance ~ Income + Rating + Age + Student.Yes + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                      Df Sum of Sq     RSS    AIC
## - Age                 1     44411 2813771 2786.3
## <none>                            2769361 2787.2
## - Income:Student.Yes  1     98347 2867708 2792.0
## - Rating:Student.Yes  1    175231 2944592 2799.9
## 
## Step:  AIC=2786.26
## Balance ~ Income + Rating + Student.Yes + Income:Student.Yes + 
##     Rating:Student.Yes
## 
##                      Df Sum of Sq     RSS    AIC
## <none>                            2813771 2786.3
## - Income:Student.Yes  1    106256 2920027 2791.7
## - Rating:Student.Yes  1    188468 3002239 2800.1
model.forward.BIC <- stepAIC(model.null, direction = "forward",
                             scope = list(upper = model.full.bin, lower = model.null),
                             k = log(nrow(train.bin)))  # k = log(n) makes the criterion BIC
## Start:  AIC=3717.51
## Balance ~ 1
## 
##                              Df Sum of Sq      RSS    AIC
## + Rating                      1  52118308 16130877 3289.0
## + Income                      1  16318500 51930685 3641.0
## + Student.Yes                 1   4815192 63433993 3701.2
## <none>                                    68249185 3717.5
## + Cards                       1    630541 67618644 3720.4
## + Gender.Male                 1     95205 68153981 3722.8
## + Married.No                  1     83052 68166134 3722.8
## + Age                         1     80005 68169181 3722.9
## + Ethnicity.African.American  1       422 68248763 3723.2
## + Ethnicity.Asian             1       387 68248799 3723.2
## + Education                   1       235 68248951 3723.2
## 
## Step:  AIC=3289.04
## Balance ~ Rating
## 
##                              Df Sum of Sq      RSS    AIC
## + Income                      1   8477509  7653369 3070.3
## + Student.Yes                 1   4454246 11676631 3197.5
## + Age                         1    408692 15722186 3287.0
## <none>                                    16130877 3289.0
## + Married.No                  1    109061 16021816 3292.7
## + Ethnicity.African.American  1    105708 16025169 3292.8
## + Cards                       1     90760 16040117 3293.1
## + Education                   1     17131 16113746 3294.4
## + Gender.Male                 1     15381 16115496 3294.5
## + Ethnicity.Asian             1     10848 16120029 3294.5
## 
## Step:  AIC=3070.33
## Balance ~ Rating + Income
## 
##                              Df Sum of Sq     RSS    AIC
## + Student.Yes                 1   4650209 3003160 2794.5
## <none>                                    7653369 3070.3
## + Age                         1     95929 7557440 3072.2
## + Married.No                  1     82760 7570608 3072.8
## + Ethnicity.Asian             1     33142 7620226 3074.7
## + Ethnicity.African.American  1     15649 7637719 3075.4
## + Education                   1     11275 7642093 3075.6
## + Gender.Male                 1      4288 7649080 3075.9
## + Cards                       1       806 7652562 3076.0
## 
## Step:  AIC=2794.46
## Balance ~ Rating + Income + Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## + Rating:Student.Yes          1     83132 2920027 2791.7
## + Age                         1     57608 2945552 2794.3
## <none>                                    3003160 2794.5
## + Ethnicity.African.American  1     20915 2982245 2798.1
## + Cards                       1     15989 2987171 2798.6
## + Gender.Male                 1     13881 2989279 2798.8
## + Ethnicity.Asian             1      3777 2999383 2799.8
## + Married.No                  1      2863 3000297 2799.9
## + Income:Student.Yes          1       921 3002239 2800.1
## + Education                   1       237 3002923 2800.1
## 
## Step:  AIC=2791.71
## Balance ~ Rating + Income + Student.Yes + Rating:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## + Income:Student.Yes          1    106256 2813771 2786.3
## <none>                                    2920027 2791.7
## + Age                         1     52320 2867708 2792.0
## + Ethnicity.African.American  1     23651 2896376 2795.0
## + Cards                       1     14186 2905842 2795.9
## + Gender.Male                 1     13792 2906236 2796.0
## + Ethnicity.Asian             1      1494 2918534 2797.3
## + Married.No                  1      1401 2918626 2797.3
## + Education                   1         5 2920022 2797.4
## 
## Step:  AIC=2786.26
## Balance ~ Rating + Income + Student.Yes + Rating:Student.Yes + 
##     Income:Student.Yes
## 
##                              Df Sum of Sq     RSS    AIC
## <none>                                    2813771 2786.3
## + Age                         1     44411 2769361 2787.2
## + Ethnicity.African.American  1     18521 2795250 2790.0
## + Cards                       1     14172 2799599 2790.4
## + Gender.Male                 1     10042 2803729 2790.9
## + Married.No                  1       322 2813449 2791.9
## + Ethnicity.Asian             1       208 2813563 2791.9
## + Education                   1       141 2813631 2792.0

AIC vs BIC, Backward vs Forward

AIC vs BIC

  • The two differ in the scale of the penalty: the per-parameter penalty is log(n) for BIC versus 2 for AIC, so BIC penalizes increased complexity more harshly
  • BIC requires the increased goodness of fit to be larger (relative to AIC) in order to justify the additional parameterization
  • BIC will favor models with fewer features and represents a more conservative approach to selection – good when we just want to identify key factors
  • AIC is good if we want to consider all possibly significant factors

Backward vs Forward

  • Forward selection will usually produce a model with fewer features, more aligned with the conservative approach to model selection
  • Backward selection will produce a model with more features (relative to forward), more aligned with the exploratory, all-possibilities spirit of identifying factors
  • Based on these considerations and our goal to only identify key factors in Balance, the model.forward.BIC model best aligns with the goals of the task

3.4.3 Model Validation

Now that we have a model with our selected features, we'll validate it against the test set and see how the various models perform when predicting unseen data.

Which Model is Most Predictive?

Now that we have different models of varying complexity fitted to the training set, we can compare their predictive performance on the common test set to see which one is most predictive. Here we can answer questions such as: does the model selected via our stepwise algorithm perform better than the more complicated full model?

In CHUNK 18, we compare the TEST RMSE of four of our fitted models:

  1. null – the intercept-only model (\(\beta_0\) only)
  2. full.bin – the full model, completely saturated with our two interaction terms
  3. the backward-selected model using AIC
  4. the backward-selected model using BIC – yields the lowest RMSE, so it is considered best using this metric
    • We'll choose this one since not only is it simpler (and in turn easier to interpret), it is also more predictive than the model selected using AIC in the sense that its test RMSE is lower.

REMEMBER, on the exam it is helpful to include a summary table showing the models up for comparison and their TEST performance metrics to support the choice of model (a short sketch of such a table follows CHUNK 18).

# CHUNK 18
# define the rmse() function again
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

rmse(test$Balance, predict(model.null, newdata = test.bin))
## [1] 403.7726
rmse(test$Balance, predict(model.full.bin, newdata = test.bin))
## [1] 108.8133
rmse(test$Balance, predict(model.backward, newdata = test.bin))
## [1] 108.3417
rmse(test$Balance, predict(model.backward.BIC, newdata = test.bin))
## [1] 107.8142
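
As suggested above, a compact comparison table can support the model choice. Below is a minimal sketch that reuses the rmse() function and the fitted objects from CHUNK 18; the data.frame layout is just one possible presentation, not part of the original chunks.

# (Sketch) collect the test RMSEs from CHUNK 18 into a small summary table
rmse.summary <- data.frame(
  model = c("Intercept-only", "Full (with interactions)", "Backward AIC", "Backward BIC"),
  test.rmse = c(rmse(test$Balance, predict(model.null, newdata = test.bin)),
                rmse(test$Balance, predict(model.full.bin, newdata = test.bin)),
                rmse(test$Balance, predict(model.backward, newdata = test.bin)),
                rmse(test$Balance, predict(model.backward.BIC, newdata = test.bin))))
rmse.summary[order(rmse.summary$test.rmse), ]  # lowest (best) test RMSE first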

Model Diagnostics

Now that we have selected a model, we should follow up by diagnosing it to check for any abnormalities and whether our model assumptions are appropriate with respect to the data.

Base R’s plot() will generate diagnostic plots when a model object is passed through as seen in CHUNK 19:

# CHUNK 19
plot(model.backward.BIC)

We note the following:

  • “Residuals vs Fitted”
    • If the fitted model has captured the relationship between the target and its predictors, the plot should have no discernible pattern and the residuals should be spread out erratically (no fanning out)
    • Our plot shows a U-shaped pattern, which may suggest nonlinear relationships between our predictors and the target; we may want to add polynomial terms to fix this (e.g. \(Income^2\) or \(Rating^2\); see the sketch after this list)
    • No heteroscedasticity (the constant variance assumption seems okay)
  • “Normal QQ”
    • If our normality assumption holds, the points should fall along the diagonal line
    • In our plot we see the tails of the distribution (i.e. extreme values) lying above the diagonal, indicating a fatter-tailed distribution
  • “Scale-Location”
    • Primarily a scaled version of the Residuals vs Fitted, but better to assess the constant variance/homoscedasticity assumption
    • The red line should be flat, indicating constant variance
    • Same conclusion: U-shaped, but homoscedasticity seems fine
  • “Residuals vs Leverage”
    • Helps identify influential data points in the model; points outside of Cook’s distance indicate influential points (points that affect the estimated results of our model)
    • No influential points seem present
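
As flagged in the "Residuals vs Fitted" bullet above, one possible remedy for the U-shaped pattern is to add polynomial terms. Below is a minimal sketch of such a refit; it is not part of the original analysis, and the squared Income term is purely illustrative.

# (Sketch) refit the BIC-selected model with a squared Income term via I()
model.quad <- lm(Balance ~ Income + I(Income^2) + Rating + Student.Yes +
                   Income:Student.Yes + Rating:Student.Yes, data = train.bin)
rmse(test$Balance, predict(model.quad, newdata = test.bin))  # compare against the earlier test RMSEs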

3.4.4 Regularization

  • Regularization is an alternative to forward and backward selection for identifying features that could be removed
  • Regularization is different though since it adjusts our coefficient estimates by including a penalty term for additional parameterization to prevent overfitting
  • It achieves this by reducing coefficient estimates’ magnitudes to near zero or even zero (in the case of Lasso regression)

In our Credit case study, we'll explore lasso regression due to its feature selection property (it can reduce coefficient estimates of insignificant features exactly to zero).
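
For reference, lasso coefficient estimates minimize the residual sum of squares plus an \(\ell_1\) penalty on the slope coefficients, with \(\lambda \ge 0\) controlling the strength of the shrinkage:

\[ \hat{\boldsymbol{\beta}}^{\text{lasso}} = \underset{\boldsymbol{\beta}}{\arg\min} \left\{ \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\} \]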

Fitting Penalized Regression Models: glmnet()

Instead of the lm() function, we can use the glmnet() function to fit regularized regression models, but note that it has different syntax and requires a design matrix to be passed in.

First, CHUNK 20 sets up the training design matrix using the model.matrix() function. Note that one advantage of the model.matrix() function is that it automatically binarizes our categorical variables (we’re not using the train.bin data.frame, so the factor variables are automatically binarized for us).

This is helpful since glmnet() can only work with quantitative inputs.

# CHUNK 20 -- create design matrix to fit full model with the two interaction terms
X.train <- model.matrix(Balance ~ . + Rating:Student + Income:Student,
                        data = train)  # identical input to lm()'s formula argument
head(X.train)  # print out a few rows of the design matrix
##    (Intercept)  Income Rating Cards Age Education GenderMale StudentYes
## 1            1  14.891    283     2  34        11          1          0
## 3            1 104.593    514     4  71        11          1          0
## 4            1 148.924    681     3  36        11          0          0
## 5            1  55.882    357     2  68        16          1          0
## 10           1  71.061    491     3  41        19          0          1
## 13           1  80.616    394     1  57         7          0          0
##    MarriedNo EthnicityAfrican American EthnicityAsian Rating:StudentYes
## 1          0                         0              0                 0
## 3          1                         0              1                 0
## 4          1                         0              1                 0
## 5          0                         0              0                 0
## 10         0                         1              0               491
## 13         0                         0              1                 0
##    Income:StudentYes
## 1              0.000
## 3              0.000
## 4              0.000
## 5              0.000
## 10            71.061
## 13             0.000

In CHUNK 21, we fit the lasso regression using the glmnet() function; we make our model specifications, including fitting multiple lambdas (the regularization parameter), and extract the output using the coef() function:

# CHUNK 21
#install.packages("glmnet")  # uncomment this line the first time you use glmnet
library(glmnet)
## Loading required package: Matrix
## Loaded glmnet 3.0-1
lasso <- glmnet(x = X.train, y = train$Balance, family = "gaussian",
                lambda = 10^(0:3), alpha = 1)  
# pass the design matrix, target, family distr of target, specify lambda (the regularization parameter) and alpha = 1 for lasso

# first way to extract coefficient estimates
lasso$a0  # contains an intercept for each lambda
##        s0        s1        s2        s3 
##  531.1362 -183.3551 -516.8619 -555.1325
lasso$beta  # contains coefficient estimates for each lambda (1000, 100, 10, 1 respectively)
## 13 x 4 sparse Matrix of class "dgCMatrix"
##                           s0        s1          s2          s3
## (Intercept)                . .           .           .        
## Income                     . .          -6.4047003  -7.3563171
## Rating                     . 1.9563912   3.6787565   3.9137820
## Cards                      . .           0.1275813   4.9974186
## Age                        . .          -0.3299576  -0.7005179
## Education                  . .           .           .        
## GenderMale                 . .           .           8.6338520
## StudentYes                 . .         265.3799583 250.1347442
## MarriedNo                  . .           .           3.0059538
## EthnicityAfrican American  . .          -0.1034776 -16.3731282
## EthnicityAsian             . .           .          -0.7407126
## Rating:StudentYes          . 0.2380456   0.3450852   0.7266730
## Income:StudentYes          . .           .          -1.8914534
# second way, preferred in ISLR
coef(lasso)
## 14 x 4 sparse Matrix of class "dgCMatrix"
##                                 s0           s1           s2           s3
## (Intercept)               531.1362 -183.3550989 -516.8618738 -555.1324995
## (Intercept)                 .         .            .            .        
## Income                      .         .           -6.4047003   -7.3563171
## Rating                      .         1.9563912    3.6787565    3.9137820
## Cards                       .         .            0.1275813    4.9974186
## Age                         .         .           -0.3299576   -0.7005179
## Education                   .         .            .            .        
## GenderMale                  .         .            .            8.6338520
## StudentYes                  .         .          265.3799583  250.1347442
## MarriedNo                   .         .            .            3.0059538
## EthnicityAfrican American   .         .           -0.1034776  -16.3731282
## EthnicityAsian              .         .            .           -0.7407126
## Rating:StudentYes           .         0.2380456    0.3450852    0.7266730
## Income:StudentYes           .         .            .           -1.8914534
mean(train$Balance)
## [1] 531.1362

From the coefficient output, we note:

  • At the largest lambda (s0, lambda = 1000), all coefficients are shrunk to 0, leaving only the intercept (the sample mean of the response)
  • As lambda decreases, more coefficients are left in the model – although Education is still dropped at a lambda of 1
  • Note that glmnet() does NOT respect the hierarchical principle so main effect predictors can be dropped while the interaction term is kept (e.g. s1 where Rating:StudentYes is kept but StudentYes isn’t)

Hyperparameter Tuning: cv.glmnet()

Lambda is selected a priori (alpha as well, had we not already decided to go with lasso given the context of our objective). We can tune lambda using cross-validation via the cv.glmnet() function, which produces an "optimal" lambda by comparing cross-validation errors.

As a reminder, cross-validation here means cv.glmnet() runs k iterations for each lambda, where a metric is recorded on each held-out fold, averaged across all folds, and saved for that lambda – i.e. the cross-validation error. The output then shows a range of lambdas and their corresponding cross-validation errors; presumably we choose the lambda with the lowest error.

Additionally, the cv.glmnet() function is usually used with the plot() function that will return the cross-validation curve where the xval error is plotted as a function of the lambda used. The plot is useful since:

  • Red dots represent the cross-validation error corresponding to different values of \(\lambda\)
  • Numbers at the top indicate how many features have nonzero coefficients for the given value of \(\lambda\) or \(\log \lambda\)
  • First vertical dotted line corresponds to the value of \(\lambda\) or \(\log \lambda\) where the cross-validation error is minimized
    • Can be directly extracted by calling the lambda.min object in the model list
  • The second vertical dotted line corresponds to the one-standard-error rule; its lambda can be extracted with the lambda.1se component and leads to a more regularized regression (see the sketch after CHUNK 22)

CHUNK 22 runs the cross-validation, plots the cross-validation errors against each lambda, and extracts the lambda associated with the smallest cross-validation error:

# CHUNK 22
set.seed(1111)
m <- cv.glmnet(x = X.train, y = train$Balance,
               family = "gaussian", alpha = 1)  # default is 10 folds, use MSE for gaussian

plot(m)  # plot of xval errors

m$lambda.min # smallest xval error
## [1] 0.5630216
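
To illustrate the one-standard-error rule mentioned above, the lambda.1se component of the cv.glmnet() object (and the coefficients at that value) can be extracted directly. A minimal sketch using the object m from CHUNK 22; the printed values will depend on the seed and fold assignments.

# (Sketch) the largest lambda within one standard error of the minimum xval error
m$lambda.1se               # a more regularized choice than m$lambda.min
coef(m, s = "lambda.1se")  # coefficient estimates at that lambda (more zeros expected)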

Now we refit our lasso regression using the “optimal” lambda determined by cv.glmnet(), summarize the estimated coefficients, set up a design matrix for our TEST data, then calculate the TEST RMSE of our Lasso model in CHUNK 23.

We note that the test RMSE is about 108.6 – slightly higher than that of our backward-selected BIC model (107.8). Additionally, the lasso model is harder to explain due to its complexity, so we conclude our previous selection still stands.

# CHUNK 23
# Fit the lasso using the best value of lambda
lasso.best <- glmnet(x = X.train, y = train$Balance, family = "gaussian",
                 lambda = m$lambda.min, alpha = 1)

# Look at the coefficient estimates
lasso.best$beta
## 13 x 1 sparse Matrix of class "dgCMatrix"
##                                    s0
## (Intercept)                 .        
## Income                     -7.3739775
## Rating                      3.9185005
## Cards                       5.2233269
## Age                        -0.7141180
## Education                  -0.1011369
## GenderMale                  9.4128722
## StudentYes                237.8787238
## MarriedNo                   3.3217738
## EthnicityAfrican American -17.6213968
## EthnicityAsian             -2.5916152
## Rating:StudentYes           0.8047331
## Income:StudentYes          -2.2025826
# Set up the design matrix for the test set
X.test <- model.matrix(Balance ~ . + Rating:Student + Income:Student,
                       data = test)

# Make predictions on the test set and calculate test RMSE
lasso.best.pred <- predict(lasso.best, newx = X.test)  # note the newx argument is used instead of newdata since X.test is a matrix, not data.frame
rmse(test$Balance, lasso.best.pred)
## [1] 108.6479

4 Generalized Linear Models

Linear models are useful but they rely on the assumption that the target variable is normally distributed (along with other associated assumptions on the error). Applying a linear model on these targets can technically be done, but the estimated model may not fit the data well and produce poor predictions. Some targets of interest that defy these assumptions include:

  • Binary targets – when we want to estimate the probability of a Bernoulli event occurring
    • E.g. claim/no claim, death/no death
  • Count targets – when we want to estimate the number of times a certain event occurs over a specified period of time; usually non-negative INTEGERS
    • E.g. claim count, number of deaths
  • Positive, highly skewed continuous targets – when we want to estimate a continuous target that only takes positive values and is highly skewed (to the right)
    • E.g. claim amount, income

Generalized linear models (GLMs) provide a unifying framework for dealing with a rich class of non-normally distributed target variables and allow for more complex relationships between the target and its predictors.

4.1 Conceptual Foundations of GLMs

Note that many of the techniques used to develop and construct linear models also apply to GLMs, so this section focuses only on the advantages/issues specific to GLMs, such as:

  • Selection of target distributions and link functions
    • Including the use of offsets and weights
  • Interpretation of the results of a GLM
  • Evaluation of GLMs for binary target variables
    • e.g. confusion matrices, classification errors, ROC curves/AUC

GLMs, Explained

GLMs provide more flexibility and substantially widen the scope of potential applications in two respects:

  1. Distribution of the target variable: The target no longer has to be normally distributed; it only has to belong to the exponential family of distributions
  2. Relationship between the target mean and the linear predictor: Instead of using our previous model for the target mean, \[ \mathbb{E}[Y] = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p \] we denote the linear predictor \(\boldsymbol{X} \boldsymbol{\beta}\) by \(\eta\) and generalize the model so that, instead of the target mean itself equaling \(\eta\), a function of the target mean equals the linear predictor \(\eta\) – so that:

\[ g(\mathbb{E}[Y]) = \eta = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p \]

  • where \(g()\) is called the link function, that links the target mean \(\mathbb{E}[Y]\) to the linear predictor, \(\eta\) (i.e. the link function links the random component, \(Y\) to our systematic component \(\eta\)).

  • \(g()\) is any monotonic function that allows the inverse of the link function to estimate the target mean:

\[ \mathbb{E}[Y] = g^{-1}(\eta) = g^{-1}(\beta_0 + \beta_1 X_1 + ... + \beta_p X_p) \]

  • This allows us to model situations where the effects of the predictors on the target mean are more complex than having an additive effect, without transforming the target observations.

  • GLMs can be defined by the assumed distribution of the target variable and the specified link function \(g()\)

  • Note that the linear regression model is a GLM with a normally distributed target and an “identity” link function where \(g(\mathbb{E}[Y]) = \mathbb{E}[Y]\)

IMPORTANT: Note that a GLM that uses a log link is NOT the same as a linear regression model fit on a log-transformed Y. The former assumes the mean of the target is positive (although the target observations themselves can technically take negative values), while the latter assumes the target is log-normally distributed and thus positive (a small contrast sketch follows below).

  • tl;dr: \(g(\mathbb{E}[Y]) \neq \mathbb{E}[g(Y)]\)
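
A minimal sketch of this distinction using simulated data; the data and object names are purely illustrative and not part of the case studies.

# (Sketch) GLM with a log link vs. linear model on log(Y)
set.seed(42)
x <- runif(200, 0, 10)
y <- exp(1 + 0.2 * x + rnorm(200, sd = 0.3))   # positive, right-skewed target

fit.loglink <- glm(y ~ x, family = gaussian(link = "log"))  # models log(E[Y]) = beta0 + beta1*x
fit.logy    <- lm(log(y) ~ x)                               # models E[log(Y)] = beta0 + beta1*x

# Predictions on the original scale are not the same:
head(predict(fit.loglink, type = "response"))
head(exp(predict(fit.logy)))  # exponentiating E[log(Y)] is not the same as E[Y]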

4.1.2 Model Statistics

Parameter Estimates

  • Instead of using OLS, GLMs use maximum likelihood estimation (MLE) to estimate the unknown parameters \(\boldsymbol{\beta}\)
    • Note that for linear models, the MLEs coincide with the OLS estimates, so they produce the same results
  • Generally speaking, MLE chooses the parameters so that the likelihood of observing the given data is maximized
    • Again, OLS is similar in spirit in that the parameters are chosen so that the error between fitted values and observations is minimized
  • MLEs have desirable asymptotic properties: unbiasedness, efficiency, and normality
  • One issue with MLE is that it can run into convergence issues, especially when a non-canonical link is used

Deviance

  • Since the target is no longer assumed to be normal in GLMs, goodness-of-fit measures such as \(R^2\) do not apply to GLMs
  • Instead, DEVIANCE serves as a likelihood-based goodness-of-fit measure evaluating how well the model fits the training dataset
  • Deviance measures how much the fitted GLM departs from the saturated GLM (a perfectly fitting model): \(D = 2(l_{saturated} - l_{fitted})\)
    • A lower deviance indicates a closer fit (relative to the fully saturated model); it can be pulled directly from a fitted glm, as sketched after this list
    • This differs from \(R^2\), which measures our fitted model against the primitive intercept-only model
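
In R, these quantities can be read off a fitted glm object directly. A minimal sketch, assuming a fitted object named fit (a hypothetical placeholder, not an object from the case studies):

# (Sketch) deviance-related quantities for a fitted glm object `fit`
deviance(fit)      # residual deviance, D = 2 * (l_saturated - l_fitted)
fit$null.deviance  # deviance of the intercept-only model, a useful reference point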

Deviance Residuals

  • With GLMs, the raw residuals \(e_i = y_i - \hat{\mu_i}\) may not mean much since the target, and in turn the residuals, are no longer assumed to be normally distributed (the target need not be normal) or to have constant variance (the variance can vary with the target mean across different observations)
    • This makes it hard to benchmark and reliably compare the raw residuals of different GLMs
  • Deviance residuals (\(d_i\)) are the signed square root of the contribution of the \(i\)th observation to the deviance
    • \(\sum_{i=1}^{n} d_i^2 = D\)
  • Assuming the GLM is adequate, deviance residuals are approximately normally distributed and homoscedastic for most target distributions, so they can be compared against the normal distribution to evaluate how well the fitted GLM describes the given data (see the sketch after this list)
    • Similar diagnoses are done with raw residuals when using linear models
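
A minimal sketch of extracting and checking deviance residuals, again assuming a hypothetical fitted glm object named fit:

# (Sketch) deviance residuals from a fitted glm object `fit`
d <- residuals(fit, type = "deviance")
sum(d^2)              # reproduces the residual deviance D
qqnorm(d); qqline(d)  # rough normality check of the deviance residuals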

Penalized Likelihood Measures: AIC/BIC

  • The deviance of a GLM is similar to \(R^2\) in linear models in that it will always improve (decrease) with higher parameterization/saturation of the model, even if the extra features add little predictive power
  • A more meaningful measure is a metric that also penalizes unnecessary model complexity, such as AIC and BIC:
    • \(\text{AIC} = -2l + 2p\)
    • \(\text{BIC} = -2l + \log (n_{tr}) p\)
    • Same as before, BIC is more conservative in terms of selecting features (prefers parsimony); both can be computed directly in R, as sketched below
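
A minimal sketch of pulling these measures for a hypothetical fitted glm object named fit:

# (Sketch) penalized likelihood measures for a fitted glm object `fit`
AIC(fit)                      # -2*loglik + 2p
BIC(fit)                      # -2*loglik + log(n)*p
AIC(fit, k = log(nobs(fit)))  # equivalent to BIC(fit)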

4.1.3 Performance Metrics for Classifiers

Recall that the RMSE is a common and reliable performance metric for linear models on numeric targets. Now that GLMs allow us to use categorical target variables, we need to explore performance metrics specific to classifiers:

Confusion Matrices

  • A tabular display of how the predictions of a binary classifier line up with the observed classes
  • REQUIRES a specified cutoff/threshold to decide on the class assignment (e.g. if the estimated prob > 0.65, then classify as positive/event will occur)
  • See below for an example:

                 Actual-No                  Actual-Yes
    Pred-No      True Negative              False Negative (Type II)
    Pred-Yes     False Positive (Type I)    True Positive
  • Classification Error Rate = \(\frac{\text{False Neg} + \text{False Pos}}{n}\)
    • \(n\) represents the number of observations in either the training or test set depending what the error rate is representing (usually test set)
  • Sensitivity: Relative frequency of correctly predicting an event, when the event actually takes place
    • How well we predict yes out of the actual yes’s
    • \(\text{sensitivity} = \frac{TP}{TP + FN}\)
  • Specificity: Relative frequency of correctly predicting a NON-event, when there is indeed no event
    • How well we predict no out of the actual no’s
    • \(\text{specificity} = \frac{TN}{TN + FP}\)
  • The higher the sensitivity and specificity, the more attractive the classifier (a small numeric sketch of these metrics follows this list)
  • The TEST set’s confusion matrix is usually of interest (although we can compare between the training and test matrices to see how much our classifier overfit the data)
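
A small numeric sketch of the formulas above; the counts are made up purely for illustration.

# (Sketch) classification metrics from hypothetical confusion-matrix counts
TP <- 40; TN <- 80; FP <- 20; FN <- 10
n <- TP + TN + FP + FN
(FP + FN) / n   # classification error rate
TP / (TP + FN)  # sensitivity: how well we predict yes among the actual yes's
TN / (TN + FP)  # specificity: how well we predict no among the actual no's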

Changing the Cutoff

  • The cutoff should be selected so that both sensitivity and specificity are close to 1
  • In general, adjusting the cutoff will make tradeoffs between sensitivty and specificity
    • A cutoff set to 1 (100% predicted no's) will cause sensitivity to be 0 and specificity to be 1
    • Decreasing the cutoff will produce more True Positives and fewer False Negatives, but also more False Positives and fewer True Negatives – so sensitivity increases and specificity decreases
    • If the cutoff is 0 (100% predicted yes’s) then sensitivity will be 1 and specificity will be 0
  • Cutoffs should be determined weighing the benefits of catching positive cases and the costs of missing positive cases – it’ll depend on the context of the problem
    • Fraud detection: we want high sensitivity but that’ll mean some actually non-fraudulent claims will be flagged as fraudulent. In this case, we’d better be safe than sorry – so lower the cutoff

ROC Curves

  • Instead of manually choosing a cutoff, the Receiver Operating Characteristic (ROC) curve is a graphical tool that plots sensitivity against specificity for a given classifier over the full cutoff range [0, 1]
    • Usually \(1 - \text{specificity}\) is plotted on the x-axis and sensitivity on the y-axis
    • Each point on the ROC curve represents a cutoff
    • All ROC curves start at (0, 0) and end at (1, 1), related to the tradeoffs of sliding the cutoffs between 0 and 1
  • A perfect classifier will yield a point of (0, 1), so a ROC curve that increases to 1 as soon as it leaves the origin indicates a good classifier
  • Thus the area under the [ROC] curve (AUC) is a good representation of predictive ability, with an AUC of 1 being the max/perfect predictor, while an AUC of 0.5 (the diagonal) means the classifier is no better than random guessing
    • AUC is a good benchmark for judging the absolute and relative performance of different classifiers
  • See Example 4.1.3 below:
# Example 4.1.3
library(caret)
obs <- c(0, 0, 1, 0, 1, 1)
pred <- c(0.25, 0.40, 0.50, 0.60, 0.75, 0.80)

# 0.7 is the cutoff
# need to pass factor vectors of pred and obs based on cutoff through confusionMatrix arguments
confusionMatrix(factor(1*(pred > .7)), factor(obs), positive = "1")  
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction 0 1
##          0 3 1
##          1 0 2
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.3588, 0.9958)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.1094          
##                                           
##                   Kappa : 0.6667          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.6667          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3333          
##       Balanced Accuracy : 0.8333          
##                                           
##        'Positive' Class : 1               
## 
library(pROC)
roc.model <- roc(obs, pred)  # do not need to factorize
plot(roc.model)
points(c(1, 1, 1, 2/3, 2/3, 1/3, 0), c(0, 1/3, 2/3, 2/3, 1, 1, 1),
       pch = 16)

auc(roc.model)
## Area under the curve: 0.8889

We note the following from the confusionMatrix() output:

  • At a cutoff of 0.7, the sensitivity (the proportion of actual positives correctly flagged) is 0.667 and the specificity is 1, meaning the model correctly flagged all observed negatives
  • The first jump in the ROC curve corresponds to the 0.7 cutoff

4.2 Case Study 1: GLMs for Continuous Targets

In this case study we’ll revisit the personal injury insurance dataset with the focus being on:

  • Selecting appropriate distributions and link functions for a positive, continuous target with a right skew
  • Fitting a GLM using the glm() function
  • Making predictions and comparing predictive performance between different GLMs
  • Generating and interpreting diagnostic plots for a GLM
  • Note that since there are only 4 variables in the entire dataset, feature selection is less of a concern in this example

4.2.1 Preparatory Steps

Background

Note the following:

  • The objective is to build a GLM to predict the size of settled claims using related risk factors
    • By selecting the most promising GLM and quantifying its predictive accuracy
  • Claims with 0 payments are excluded in this example
  • Our options for the type of GLM to fit include:
    • Log-transform amt and fit a normal linear model on log(amt)
    • Use a GLM with a normal distribution and a log-link to ensure the target mean is non-negative
    • Use a GLM with a continuous target distribution that is positive-valued and captures the skewness of amt (e.g. gamma or inverse Gaussian), along with a log link
# CHUNK 1 - load in the data, summary stats
persinj <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/persinj_b.csv")
summary(persinj)
##       amt               inj           legrep          op_time     
##  Min.   :     10   Min.   :1.00   Min.   :0.0000   Min.   : 0.10  
##  1st Qu.:   6297   1st Qu.:1.00   1st Qu.:0.0000   1st Qu.:23.00  
##  Median :  13854   Median :1.00   Median :1.0000   Median :45.90  
##  Mean   :  38367   Mean   :1.83   Mean   :0.6366   Mean   :46.33  
##  3rd Qu.:  35123   3rd Qu.:2.00   3rd Qu.:1.0000   3rd Qu.:69.30  
##  Max.   :4485797   Max.   :9.00   Max.   :1.0000   Max.   :99.10
str(persinj)  # inj and legrep should be factorized, although legrep is just binary so not necessary
## 'data.frame':    22036 obs. of  4 variables:
##  $ amt    : num  87.8 353.6 688.8 172.8 43.3 ...
##  $ inj    : int  1 1 1 1 1 6 1 1 1 1 ...
##  $ legrep : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ op_time: num  0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 0.1 ...
# CHUNK 2 - factorize inj and legrep
persinj$inj <- as.factor(persinj$inj)
persinj$legrep <- as.factor(persinj$legrep)
summary(persinj)
##       amt          inj       legrep       op_time     
##  Min.   :     10   1:15638   0: 8008   Min.   : 0.10  
##  1st Qu.:   6297   2: 3376   1:14028   1st Qu.:23.00  
##  Median :  13854   3: 1133             Median :45.90  
##  Mean   :  38367   4:  189             Mean   :46.33  
##  3rd Qu.:  35123   5:  188             3rd Qu.:69.30  
##  Max.   :4485797   6:  256             Max.   :99.10  
##                    9: 1256

Note that:

  • It is important to factorize inj since the numbers represent groups; if the model treats inj as numeric, it implies that the estimated effect of a one-unit change in inj is the same for every consecutive change (e.g. the effect is the same between 1 and 2 as between 2 and 3)
    • inj will have 1 as the baseline/reference level, which has the most obs, so no releveling is needed (a relevel() sketch follows this list)
  • legrep is not required to be factorized since it has just 2 levels, but we'll do so anyway
    • Also note that the baseline/reference is 0 even though it has fewer obs; we'll retain it since it makes sense for the baseline to represent the absence of legal representation
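
For completeness, a minimal sketch of how relevel() would be used if a different baseline were desired; it is not needed here since level 1 already has the most observations.

# (Sketch) explicitly set the reference level of a factor with relevel()
persinj$inj <- relevel(persinj$inj, ref = "1")  # "1" is already the reference, so this is a no-op
levels(persinj$inj)                             # the first level shown is the baseline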

4.2.2 Model Construction and Evaluation

Setup the Training/Test Sets

CHUNK 3 uses createDataPartition to split our persinj dataset between training and test subsets. We’ll note that:

  • The maximums vary greatly, causing the means to differ due to the highly skewed nature of amt
  • The medians are comparable, so we're okay
  • As a reminder, we fit our model on the training set and assess performance by calculating RMSE on the test set
# CHUNK 3
library(caret)
set.seed(2019)
partition <- createDataPartition(y = persinj$amt, p = .75, list = FALSE)
train <- persinj[partition, ]
test <- persinj[-partition, ]
summary(train$amt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      20    6297   13852   38670   35118 4485797
summary(test$amt)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    6299   13858   37458   35140 1450818

Benchmark Model: Normal - log(amt)

As a benchmark, we fit a normal OLS model on a log-transformed amt in CHUNK 4. The log transformation is informed by our previous plot, which showed log(amt) behaving more normally than raw amt, as we'd expect. Additionally, we include an interaction between legrep and op_time since our plot in Section 2 seemed to imply there could potentially be one.

# CHUNK 4
glm.ols <- glm(log(amt) ~ inj + legrep * op_time,
               family = gaussian(link = "identity"), data = train)
summary(glm.ols)
## 
## Call:
## glm(formula = log(amt) ~ inj + legrep * op_time, family = gaussian(link = "identity"), 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.4217  -0.5805   0.0631   0.6688   4.4581  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7.5212622  0.0269021 279.579  < 2e-16 ***
## inj2             0.5912759  0.0237858  24.858  < 2e-16 ***
## inj3             0.8161053  0.0385945  21.146  < 2e-16 ***
## inj4             0.7958558  0.0901834   8.825  < 2e-16 ***
## inj5             0.6462279  0.0894089   7.228 5.12e-13 ***
## inj6             0.3823662  0.0777895   4.915 8.95e-07 ***
## inj9            -0.8151999  0.0367039 -22.210  < 2e-16 ***
## legrep1          0.9120834  0.0339725  26.848  < 2e-16 ***
## op_time          0.0358738  0.0005093  70.431  < 2e-16 ***
## legrep1:op_time -0.0101808  0.0006384 -15.946  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.146774)
## 
##     Null deviance: 35056  on 16527  degrees of freedom
## Residual deviance: 18942  on 16518  degrees of freedom
## AIC: 49180
## 
## Number of Fisher Scoring iterations: 2

Note the following:

  • Although the lm() function would fit the same model, the summary() output would slightly differ
  • For example, instead of \(R^2\) and an F-stat, the null deviance and the residual deviance are shown, as well as the AIC
  • Predictions using our estimated linear model and the resulting RMSEs are done in CHUNK 5 below
    • Note that the RMSE of our benchmark model (72888) is considerably smaller than that of the intercept-only model (where the predictions are all the same – the average amt in the training set)
# CHUNK 5
# define the rmse() function again for later use
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

pred.ols <- exp(predict(glm.ols, newdata = test)) # must exponentiate since predictions are on the log scale
head(pred.ols)  # should return a vector of the predicted amt on the original scale
##         3        10        13        21        24        27 
## 1853.5345 1853.5345 4609.6856  820.2833  820.2833  820.2833
rmse(test$amt, pred.ols)  # RMSE on the test set
## [1] 72888.33
rmse(test$amt, mean(train$amt))  # RMSE of an intercept-only model (predictions would just be the sample mean of the response - amt - in the training set)
## [1] 79370.9

4.2.3 Model Validation and Interpretation

Now that we know that GLM 2 (gamma-log) performs best out of the 3 GLMs and 1 OLS benchmark model we fitted, as indicated by having the smallest TEST RMSE, we’ll proceed to validate this specific model.

Additionally, we note that the summary() output for GLM 2 shows that all predictors are statistically significant, so we’ll retain all our features.
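
For reference, here is a minimal sketch of how the gamma-log fit referenced above as GLM 2 could be produced on the training set; the object name glm.gamma matches the one used in CHUNK 9, and the call is an assumption that mirrors glm.final in CHUNK 10 but uses train.

# (Sketch) GLM 2 - gamma target distribution with a log link, fit on the training set
glm.gamma <- glm(amt ~ inj + legrep * op_time,
                 family = Gamma(link = "log"), data = train)

pred.gamma <- predict(glm.gamma, newdata = test, type = "response")  # predictions on the original scale
rmse(test$amt, pred.gamma)                                           # test RMSE for comparison with CHUNK 5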

Model Diagnostics

  • CHUNK 9 passes the glm.gamma object through the plot() function to generate the usual four diagnostic plots – note that since these are GLMs, the function automatically picks up the deviance residuals so we can conduct comparable checks to when diagnosing a regular linear model
# CHUNK 9
plot(glm.gamma)

We note the following:

  • “Residuals vs Fitted” shows the standardized deviance residuals against the predicted values on the training set
    • Most points scatter around 0 in an erratic manner (no pattern) and are equally spread out (no fanning), indicating the specified model seems appropriate and aligns with our model assumptions
    • We do note a few outliers (54222, 10224, 19970) and also call out that the positive residuals are much higher in magnitude than the negative ones
  • “Normal Q-Q” tests how well the deviance residuals conform to our normality assumption
    • We see that the upper extreme values lie well above the normal reference line, indicating a fatter right tail (i.e. the standardized deviance residuals are much larger than they would be under a normal distribution, so the data is more right-skewed than the gamma distribution implies)
  • The “Scale-Location” plot also highlights the outliers previously called out, but indicates no issues with the constant variance of the deviance residuals
  • The “Residuals vs Leverage” plot highlights the outliers again, but none appear to be true influential points that rise above Cook's distance and distort our model estimates

Model Interpretation

Other than the data indicating a larger right skew than the gamma distribution underlying our model, no other assumptions seem to be egregiously violated. Thus we fit GLM 2 (gamma-log) on the entire dataset to get more robust estimates and predictions (CHUNK 10).

# CHUNK 10
glm.final <- glm(amt ~ inj + legrep * op_time,
                 family = Gamma(link = "log"), data = persinj)
summary(glm.final)
## 
## Call:
## glm(formula = amt ~ inj + legrep * op_time, family = Gamma(link = "log"), 
##     data = persinj)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.5289  -0.9142  -0.3648   0.1873   8.2209  
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.2033907  0.0279443 293.562  < 2e-16 ***
## inj2             0.6281073  0.0248995  25.226  < 2e-16 ***
## inj3             0.8882054  0.0402472  22.069  < 2e-16 ***
## inj4             1.1199670  0.0951566  11.770  < 2e-16 ***
## inj5             1.3963478  0.0955128  14.619  < 2e-16 ***
## inj6             0.8867918  0.0816012  10.867  < 2e-16 ***
## inj9            -0.6205268  0.0381881 -16.249  < 2e-16 ***
## legrep1          0.4437842  0.0354423  12.521  < 2e-16 ***
## op_time          0.0343052  0.0005303  64.685  < 2e-16 ***
## legrep1:op_time -0.0050443  0.0006663  -7.571 3.86e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1.675808)
## 
##     Null deviance: 44010  on 22035  degrees of freedom
## Residual deviance: 22242  on 22026  degrees of freedom
## AIC: 487535
## 
## Number of Fisher Scoring iterations: 8

We note the following:

  • All features – inj type, legrep, op_time, and the interaction between legrep and op_time – have a statistically significant effect on the expected claim size
    • The interaction can be interpreted as that operational time’s effect on claim size will vary depending on whether the insured has legal representation or not
  • Again, the log link allows us to interpret each \(\hat{\beta_j}\) as a \((e^{\hat{\beta}_j} - 1) \times 100 \%\) change in the expected claim size for a one-unit change in the predictor, or relative to the baseline for categorical predictors (see the sketch after this list)
    • e.g. The expected claim amount is estimated to be about 142% (\((e^{0.887} - 1) \times 100\%\)) higher for an insured with injury type 6 (death) than for the baseline (inj1), holding all other predictors constant.
    • We note that all injury types more severe than the baseline (inj1) are estimated to have a positive/increasing effect on the expected claim size (except inj9, unknown injury type), taking all other factors into consideration.
    • We also note that whether the insured has legal representation and operational time have similar effects.
    • We also note that those with legal representation will have an estimated smaller expected claim size than those without for higher operational times, holding all other factors constant.
  • In addition to providing numeric interpretations of the estimates, we should be sure to comment on (1) whether the results are reasonable/make sense/are consistent with our priors and (2) how they relate to the overall objective/business problem
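
These percentage-change interpretations can be computed directly from the fitted model; a minimal sketch using the glm.final object from CHUNK 10:

# (Sketch) multiplicative effects on the expected claim size implied by the log link
exp(coef(glm.final))              # factor change per unit/level change in each predictor
(exp(coef(glm.final)) - 1) * 100  # approximate percentage change, e.g. ~142% for inj6 as noted above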

4.3 Case Study 2: GLMs for Binary Targets

In this case-study we’ll be looking at a situation where we need to predict a binary target variable and perform classifications – which have some unique features. Specifically we’ll hit the following:

  • Combining factor levels to reduce the dimensions of the data
  • Selecting an appropriate link for binary targets
  • Implementing different kinds of GLMs for binary targets
  • Interpreting the results of a fitted logistic regression (logit link)
  • Using the results of the GLM to determine cutoffs for practical considerations

4.3.1 Preparatory Steps

Data Description

We’ll take a look at the vehins dataset (CHUNK 1) – note the following:

  • Over 67k obs and 10 variables – the target being clm, a binary variable indicating whether a claim was incurred for that policyholder during the policy period
  • The objective is to identify key factors leading to claim occurrence
  • Varying exposures for each record/policyholder
  • The mean of clm is 0.06814, i.e. about 6.8% of the 67k policies incurred a claim
  • Since the goal is to develop a model that will help us predict claim occurrence based on policyholder characteristics, potential predictors cannot include information we would only know AFTER a claim is incurred (e.g. numclaims and claimcst) – hence we drop these two (CHUNK 2)
  • There are 53 obs where the vehicle value is 0 – doesn’t make practical sense so let’s remove
  • veh_age and agecat can be binarized/converted to factors (CHUNK 3A)
    • Again, these should be converted so the estimated coefficients don't impose the restriction that a one-unit increase (2 -> 3, 3 -> 4) is associated with the same change in the linear predictor (holding all other predictors fixed)
# CHUNK 1
vehins <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/vehins.csv")
# The original version of "vehins.csv" can be accessed using the following code
#library(insuranceData)
#data(dataCar)
#vehins <- dataCar

summary(vehins)
##       clm             exposure          veh_value        numclaims      
##  Min.   :0.00000   Min.   :0.002738   Min.   : 0.000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   1st Qu.:0.00000  
##  Median :0.00000   Median :0.446270   Median : 1.500   Median :0.00000  
##  Mean   :0.06814   Mean   :0.468651   Mean   : 1.777   Mean   :0.07276  
##  3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   3rd Qu.:0.00000  
##  Max.   :1.00000   Max.   :0.999316   Max.   :34.560   Max.   :4.00000  
##                                                                         
##     claimcst          veh_body        veh_age      gender    area     
##  Min.   :    0.0   SEDAN  :22233   Min.   :1.000   F:38603   A:16312  
##  1st Qu.:    0.0   HBACK  :18915   1st Qu.:2.000   M:29253   B:13341  
##  Median :    0.0   STNWG  :16261   Median :3.000             C:20540  
##  Mean   :  137.3   UTE    : 4586   Mean   :2.674             D: 8173  
##  3rd Qu.:    0.0   TRUCK  : 1750   3rd Qu.:4.000             E: 5912  
##  Max.   :55922.1   HDTOP  : 1579   Max.   :4.000             F: 3578  
##                    (Other): 2532                                      
##      agecat     
##  Min.   :1.000  
##  1st Qu.:2.000  
##  Median :3.000  
##  Mean   :3.485  
##  3rd Qu.:5.000  
##  Max.   :6.000  
## 
str(vehins)
## 'data.frame':    67856 obs. of  10 variables:
##  $ clm      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ exposure : num  0.304 0.649 0.569 0.318 0.649 ...
##  $ veh_value: num  1.06 1.03 3.26 4.14 0.72 2.01 1.6 1.47 0.52 0.38 ...
##  $ numclaims: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ claimcst : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ veh_body : Factor w/ 13 levels "BUS","CONVT",..: 4 4 13 11 4 5 8 4 4 4 ...
##  $ veh_age  : int  3 2 2 2 4 3 3 2 4 4 ...
##  $ gender   : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 2 1 1 ...
##  $ area     : Factor w/ 6 levels "A","B","C","D",..: 3 1 5 4 3 3 1 2 1 2 ...
##  $ agecat   : int  2 4 2 2 2 4 4 6 3 4 ...
# CHUNK 2
vehins$claimcst <- NULL  # cannot serve as a predictor in this context
vehins$numclaims <- NULL # cannot serve as a predictor in this context

nrow(vehins[vehins$veh_value == 0, ])  # check how many obs have 0 vehicle value, doesn't make sense
## [1] 53
vehins <- vehins[vehins$veh_value > 0, ]  # remove these obs

# CHUNK 3A
# Before conversion
class(vehins$agecat)
## [1] "integer"
class(vehins$veh_age)
## [1] "integer"
vehins$agecat <- as.factor(vehins$agecat)
vehins$veh_age <- as.factor(vehins$veh_age)

# After conversion
class(vehins$agecat)
## [1] "factor"
class(vehins$veh_age)
## [1] "factor"

Data Exploration

In CHUNK 3B, we relevel all the categorical variables so that the baseline/reference level is the one with the most observations:

  • veh_body (no change, SEDAN was already the baseline)
  • gender (no change, F was already the baseline)
  • veh_age (changed 3 to be the baseline)
  • area (changed C to be the baseline)
  • agecat (changed 4 to be the baseline)

# CHUNK 3B

# Save the names of the factor variables as a vector
# Relevel
vars.cat <- c("veh_age", "veh_body", "gender", "area", "agecat")
for (i in vars.cat){
  table <- as.data.frame(table(vehins[, i]))
  max <- which.max(table[, 2])
  level.name <- as.character(table[max, 1])
  vehins[, i] <- relevel(vehins[, i], ref = level.name)
}

summary(vehins)
##       clm             exposure          veh_value         veh_body    
##  Min.   :0.00000   Min.   :0.002738   Min.   : 0.180   SEDAN  :22232  
##  1st Qu.:0.00000   1st Qu.:0.219028   1st Qu.: 1.010   HBACK  :18915  
##  Median :0.00000   Median :0.446270   Median : 1.500   STNWG  :16234  
##  Mean   :0.06811   Mean   :0.468481   Mean   : 1.778   UTE    : 4586  
##  3rd Qu.:0.00000   3rd Qu.:0.709103   3rd Qu.: 2.150   TRUCK  : 1742  
##  Max.   :1.00000   Max.   :0.999316   Max.   :34.560   HDTOP  : 1579  
##                                                        (Other): 2515  
##  veh_age   gender    area      agecat   
##  3:20060   F:38584   C:20530   4:16179  
##  1:12254   M:29219   A:16302   1: 5734  
##  2:16582             B:13328   2:12868  
##  4:18907             D: 8161   3:15757  
##                      E: 5907   5:10722  
##                      F: 3575   6: 6543  
## 

Now veh_value is the only numeric variable left to analyze (other than clm and exposure):

  • Ranges from 0.180 to 34.560 (or 1,800 to 345,600 in dollar terms) – a wide range
  • The mean is greater than the median, indicating right skewness – we’ll confirm this with the histogram generated in CHUNK 4
  • We’ll log-transform it and null the original variable
# CHUNK 4
library(ggplot2)
ggplot(vehins, aes(x = veh_value)) +
  geom_histogram()

vehins$log_veh_value <- log(vehins$veh_value)

vehins$veh_value <- NULL

Next we analyze the (bivariate) relationship between the response (clm) and the only other numeric variable, log_veh_value, in CHUNK 5 – via a split box plot, since clm is technically categorical. We note the following:

  • log_veh_value seems to be slightly higher for those with clm = 1. We’ll see how this positive relationship is quantified in our GLM later.
# CHUNK 5
ggplot(vehins, aes(x = factor(clm), y = log_veh_value)) +
  geom_boxplot(fill = "red")

Since the other variables are all categorical, to explore each bivariate relationship with clm (which is also categorical), a bar plot (stacked or side-by-side) may be used to check for any relationships (see CHUNK 6, which uses a for loop). If there are any relationships, we’d expect differences in the response across the different levels of a variable. We note the following:

  • Slightly higher claim occurrence for veh_age 2
  • BUS types have a much higher claim occurrence, while CONVT, MIBUS and UTE seem to have lower occurrences
  • No discernible difference between genders
  • Higher occurrence in area F
  • Occurrence seems to decline with increasing age bands
# CHUNK 6
for (i in vars.cat) {
  plot <- ggplot(vehins, aes(x = vehins[, i], fill = factor(clm))) +
    geom_bar(position = "fill") +  # makes bar a percentage
    labs(x = i, y = "percent") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
  print(plot)
}

In CHUNK 7, we’ll look at summary statistics, splitting each variable into its unique levels and comparing the mean claim occurrence rate and the number of obs within each level. These summaries will be helpful in deciding which levels should be folded into one another for variables with a high number of levels like veh_body, area and agecat.

In CHUNK 8, we use the grouped statistics to make the following level combinations:

  • For veh_body, we can fold all levels except BUS, CONVT and MCARA into one giant category called HYBRID, since all except these 3 have similar occurrence rates
    • Although the three unusual levels have relatively small n’s, their occurrence rates are different enough from one another that they should be kept separate
  • For area, we can fold A/C/D/E together since they have similar occurrence rates – call it BASE. Fold B/F into another level called OTHER
  • For agecat, 2/3/4 can be folded into one level called 2 and 5/6 can be folded into a new level 3
  • For the exam, if asked to reduce/combine levels, we should do so in a way that makes sense (intuitively) and is backed by summary statistics such as similar means/medians; avoid justifying combinations solely by small observation counts, although levels with very few observations should be folded into another level
# CHUNK 7
library(plyr) 
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:gridExtra':
## 
##     combine
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
for (i in vars.cat) {
  x <- vehins %>% 
    group_by_(i) %>% # underscore after group_by lets it pass through string of var name
    dplyr::summarise(mean = mean(clm),
              n = n()
    )
  print(x)
}
## Warning: group_by_() is deprecated. 
## Please use group_by() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.
## # A tibble: 4 x 3
##   veh_age   mean     n
##   <fct>    <dbl> <int>
## 1 3       0.0679 20060
## 2 1       0.0672 12254
## 3 2       0.0759 16582
## 4 4       0.0620 18907
## # A tibble: 13 x 3
##    veh_body   mean     n
##    <fct>     <dbl> <int>
##  1 SEDAN    0.0664 22232
##  2 BUS      0.184     38
##  3 CONVT    0.0370    81
##  4 COUPE    0.0872   780
##  5 HBACK    0.0668 18915
##  6 HDTOP    0.0823  1579
##  7 MCARA    0.116    121
##  8 MIBUS    0.0601   716
##  9 PANVN    0.0824   752
## 10 RDSTR    0.0741    27
## 11 STNWG    0.0721 16234
## 12 TRUCK    0.0683  1742
## 13 UTE      0.0567  4586
## # A tibble: 2 x 3
##   gender   mean     n
##   <fct>   <dbl> <int>
## 1 F      0.0686 38584
## 2 M      0.0675 29219
## # A tibble: 6 x 3
##   area    mean     n
##   <fct>  <dbl> <int>
## 1 C     0.0688 20530
## 2 A     0.0664 16302
## 3 B     0.0723 13328
## 4 D     0.0607  8161
## 5 E     0.0652  5907
## 6 F     0.0783  3575
## # A tibble: 6 x 3
##   agecat   mean     n
##   <fct>   <dbl> <int>
## 1 4      0.0682 16179
## 2 1      0.0863  5734
## 3 2      0.0724 12868
## 4 3      0.0705 15757
## 5 5      0.0572 10722
## 6 6      0.0556  6543
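Note that group_by_() is deprecated in newer versions of dplyr, as the warning above indicates. A sketch of an equivalent loop without the deprecated verb (assuming dplyr >= 1.0.0 for across()/all_of()) might look like:

# Same grouped summaries without the deprecated group_by_()
for (i in vars.cat) {
  x <- vehins %>%
    group_by(across(all_of(i))) %>%  # group by the column whose name is stored in i
    dplyr::summarise(mean = mean(clm),
                     n = n())
  print(x)
}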
# CHUNK 8 

# Brief note on how mapvalues works: takes in the factor variable, the vector of old levels, and the vector of new levels to assign
# veh_body
var <- "veh_body"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
                          c("HYBRID", "BUS", "CONVT", "HYBRID", "HYBRID",
                            "HYBRID", "MCARA", "HYBRID", "HYBRID", "HYBRID",
                            "HYBRID", "HYBRID", "HYBRID"))

# area
var <- "area"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
                          c("BASE", "BASE", "OTHER", "BASE", "BASE", "OTHER"))

# agecat
var <- "agecat"
var.levels <- levels(vehins[, var])
vehins[,var] <- mapvalues(vehins[, var], var.levels,
                          c("2", "1", "2", "2", "3", "3"))

# Relevel
for (i in vars.cat){
  table <- as.data.frame(table(vehins[, i]))
  max <- which.max(table[, 2])
  level.name <- as.character(table[max, 1])
  vehins[, i] <- relevel(vehins[, i], ref = level.name)
}

summary(vehins)
##       clm             exposure          veh_body     veh_age   gender   
##  Min.   :0.00000   Min.   :0.002738   HYBRID:67563   3:20060   F:38584  
##  1st Qu.:0.00000   1st Qu.:0.219028   BUS   :   38   1:12254   M:29219  
##  Median :0.00000   Median :0.446270   CONVT :   81   2:16582            
##  Mean   :0.06811   Mean   :0.468481   MCARA :  121   4:18907            
##  3rd Qu.:0.00000   3rd Qu.:0.709103                                     
##  Max.   :1.00000   Max.   :0.999316                                     
##     area       agecat    log_veh_value     
##  BASE :50900   2:44804   Min.   :-1.71480  
##  OTHER:16903   1: 5734   1st Qu.: 0.00995  
##                3:17265   Median : 0.40547  
##                          Mean   : 0.38675  
##                          3rd Qu.: 0.76547  
##                          Max.   : 3.54270
# check level summary stats again
for (i in vars.cat) {
  x <- vehins %>% 
    group_by_(i) %>% # underscore after group_by lets it pass through string of var name
    dplyr::summarise(mean = mean(clm),
              n = n()
    )
  print(x)
}
## # A tibble: 4 x 3
##   veh_age   mean     n
##   <fct>    <dbl> <int>
## 1 3       0.0679 20060
## 2 1       0.0672 12254
## 3 2       0.0759 16582
## 4 4       0.0620 18907
## # A tibble: 4 x 3
##   veh_body   mean     n
##   <fct>     <dbl> <int>
## 1 HYBRID   0.0680 67563
## 2 BUS      0.184     38
## 3 CONVT    0.0370    81
## 4 MCARA    0.116    121
## # A tibble: 2 x 3
##   gender   mean     n
##   <fct>   <dbl> <int>
## 1 F      0.0686 38584
## 2 M      0.0675 29219
## # A tibble: 2 x 3
##   area    mean     n
##   <fct>  <dbl> <int>
## 1 BASE  0.0663 50900
## 2 OTHER 0.0735 16903
## # A tibble: 3 x 3
##   agecat   mean     n
##   <fct>   <dbl> <int>
## 1 2      0.0702 44804
## 2 1      0.0863  5734
## 3 3      0.0566 17265

Next we’ll consider an interaction (where a predictor’s relationship with claim occurrence depends on another predictor). CHUNK 9 explores a possible interaction between veh_body and log_veh_value. We note the following:

  • It does seem like there is an interaction between log_veh_value and veh_body: for some veh_body levels, a higher vehicle value is associated with a higher claim occurrence (HYBRID, CONVT and MCARA), while for BUS the opposite is true (i.e. higher claim occurrence is associated with a lower value)
  • It seems we should include an interaction term for these two predictors
# CHUNK 9
ggplot(vehins, aes(x = factor(clm), y = log_veh_value)) +
  geom_boxplot() +
  facet_wrap(~ veh_body) +
  ylim(-1, 1)

4.3.2 Model Construction and Selection

We begin constructing our GLM by first selecting a target distribution (binomial, obviously) and contemplating good candidates for the link function.

Splitting the Train/Test Set

  • Same as usual, use the createDataPartition function to generate indices
  • Remember to set.seed() for reproducible results!
  • The mean occurrence rates in the train/test sets are similar, so both sets are representative – a testament to how well the built-in stratification within createDataPartition() works – and we can proceed
# CHUNK 10
library(caret)
set.seed(4769)  # remember to set.seed for reproducible results!!
partition <- createDataPartition(y = vehins$clm, p = .75, list = FALSE)

train <- vehins[partition, ]
test <- vehins[-partition, ]
mean(train$clm)
## [1] 0.06870784
mean(test$clm)
## [1] 0.06631268

Training our GLM

Next, we’ll fit two separate GLMs (one using the logit link and another using the probit link), passing in all the predictors we prepared earlier along with the interaction terms for log_veh_value and veh_body. Note that we’re excluding the exposure variable from our model for now.

Comparing the AICs on the training set (CHUNK 11), the two models perform almost identically (slightly better for the probit model), but since the logit model is more interpretable, we’ll proceed with the logit link.

  • Note that deviance (similar to \(R^2\)) is only a goodness-of-fit measure and does not penalize overparameterization – thus, alternative metrics that do, such as AIC or BIC, are preferable for comparing models

# CHUNK 11
logit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body,
                  data = train, family = binomial)
probit.full <- glm(clm ~ . - exposure + log_veh_value:veh_body,
                   data = train, family = binomial(link = "probit"))

AIC(logit.full)
## [1] 25348.85
AIC(probit.full)
## [1] 25348.12

Constructing a Confusion Matrix

Now to gauge how well our logistic regression model actually performs (still on the training set), we’ll produce a confusion matrix (CHUNK 12) by specifying an arbitrary cutoff (0.1).

We note the following:

  • High specificity – the model is good at detecting non-occurrence of claims (reminder: specificity = correctly predicted negatives / actual negatives = TN / (TN + FP))
  • Low sensitivity – the model is not great at detecting claim occurrence (reminder: sensitivity = correctly predicted positives / actual positives = TP / (TP + FN)); see the quick check after the CHUNK 12 output below
  • These are partly due to the fact that most policies/obs do not lead to a claim (the data contains relatively few clm = 1 policies)
# CHUNK 12
cutoff <- 0.1  # you can try other values

# Generate predicted probabilities
pred <- predict(logit.full, type = "response")

# Turn predicted probabilities into predicted classes
class <- 1*(pred > cutoff) # OR ifelse(pred > cutoff, 1, 0)

confusionMatrix(factor(class), factor(train$clm), positive = "1")  # positive = "1" also meaning claim occurence
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 46709  3405
##          1   650    89
##                                           
##                Accuracy : 0.9203          
##                  95% CI : (0.9179, 0.9226)
##     No Information Rate : 0.9313          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.0185          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.02547         
##             Specificity : 0.98628         
##          Pos Pred Value : 0.12043         
##          Neg Pred Value : 0.93205         
##              Prevalence : 0.06871         
##          Detection Rate : 0.00175         
##    Detection Prevalence : 0.01453         
##       Balanced Accuracy : 0.50587         
##                                           
##        'Positive' Class : 1               
## 
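As a quick check of the sensitivity and specificity definitions above, we can recompute them by hand from the counts printed in the confusion matrix (a minimal sketch using the numbers shown above):

# Recompute sensitivity and specificity from the confusion matrix counts
TP <- 89;    FN <- 3405  # actual claims: correctly vs. incorrectly predicted
TN <- 46709; FP <- 650   # actual non-claims: correctly vs. incorrectly predicted
TP / (TP + FN)  # sensitivity, approx 0.02547
TN / (TN + FP)  # specificity, approx 0.98628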

Feature Selection

Next we’ll begin selecting/removing features that lack predictive power (and would lead to overfitting). We’ll use the stepAIC() function/stepwise algorithms to determine which features to remove.

Additionally, for factor variables we’ll explicitly binarize our variables to determine which levels seem to have no significant effect relative to the baseline, and thus potentially condense our factor variables further by combining more levels.

CHUNK 13 runs the summary output – we note the following:

  • Many features are present in our model due to multiple levels and multiple factors being included
  • Only a few features seem statistically significant, namely veh_bodyBUS, areaOTHER, the agecat levels and log_veh_value
  • Some of these features may appear insignificant due to other highly correlated predictors (multicollinearity)
  • Before running stepAIC(), we’ll explicitly binarize our variables so that individual levels (not whole factor variables) can be removed if their effect is insignificant (CHUNK 14)
    • From the output, some of the veh_body levels and veh_age levels may be better off combined with other levels such as the baseline
    • We’ll also explicitly binarize the interaction term for log_veh_value and veh_body
    • Remember to delete the original variables after binding in the binarized versions
# CHUNK 13
summary(logit.full)
## 
## Call:
## glm(formula = clm ~ . - exposure + log_veh_value:veh_body, family = binomial, 
##     data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1054  -0.3962  -0.3710  -0.3442   2.5928  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.666221   0.040753 -65.425  < 2e-16 ***
## veh_bodyBUS                  1.893682   0.614643   3.081  0.00206 ** 
## veh_bodyCONVT                0.344542   1.097079   0.314  0.75348    
## veh_bodyMCARA               -0.676286   1.037468  -0.652  0.51449    
## veh_age1                    -0.126600   0.056819  -2.228  0.02587 *  
## veh_age2                     0.063206   0.048253   1.310  0.19023    
## veh_age4                    -0.008732   0.052120  -0.168  0.86695    
## genderM                     -0.036598   0.036311  -1.008  0.31350    
## areaOTHER                    0.140962   0.039342   3.583  0.00034 ***
## agecat1                      0.240163   0.057624   4.168 3.08e-05 ***
## agecat3                     -0.236353   0.044210  -5.346 8.98e-08 ***
## log_veh_value                0.184019   0.038525   4.777 1.78e-06 ***
## veh_bodyBUS:log_veh_value   -2.120656   1.262239  -1.680  0.09294 .  
## veh_bodyCONVT:log_veh_value -0.580723   0.642281  -0.904  0.36591    
## veh_bodyMCARA:log_veh_value  1.135352   0.860314   1.320  0.18694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25455  on 50852  degrees of freedom
## Residual deviance: 25319  on 50838  degrees of freedom
## AIC: 25349
## 
## Number of Fisher Scoring iterations: 5
# CHUNK 14 - explicitly binarize
library(caret)
binarizer <- dummyVars(~ log_veh_value * veh_body + agecat + veh_age,
                       data = vehins,
                       fullRank = TRUE)
binarized.vars <- data.frame(predict(binarizer, newdata = vehins))
head(binarized.vars)
##   log_veh_value veh_body.BUS veh_body.CONVT veh_body.MCARA agecat.1
## 1    0.05826891            0              0              0        0
## 2    0.02955880            0              0              0        0
## 3    1.18172720            0              0              0        0
## 4    1.42069579            0              0              0        0
## 5   -0.32850407            0              0              0        0
## 6    0.69813472            0              0              0        0
##   agecat.3 veh_age.1 veh_age.2 veh_age.4 log_veh_value.veh_bodyBUS
## 1        0         0         0         0                         0
## 2        0         0         1         0                         0
## 3        0         0         1         0                         0
## 4        0         0         1         0                         0
## 5        0         0         0         1                         0
## 6        0         0         0         0                         0
##   log_veh_value.veh_bodyCONVT log_veh_value.veh_bodyMCARA
## 1                           0                           0
## 2                           0                           0
## 3                           0                           0
## 4                           0                           0
## 5                           0                           0
## 6                           0                           0
vehins.bin <- cbind(vehins, binarized.vars)  # attach the binarized variables

# remove the original factor variables (important)
vehins.bin$veh_age <- NULL
vehins.bin$agecat <- NULL
vehins.bin$veh_body <- NULL
vehins.bin$log_veh_value <- NULL

head(vehins.bin)  # full binarized dataset
##   clm  exposure gender area log_veh_value veh_body.BUS veh_body.CONVT
## 1   0 0.3039014      F BASE    0.05826891            0              0
## 2   0 0.6488706      F BASE    0.02955880            0              0
## 3   0 0.5694730      F BASE    1.18172720            0              0
## 4   0 0.3175907      F BASE    1.42069579            0              0
## 5   0 0.6488706      F BASE   -0.32850407            0              0
## 6   0 0.8542094      M BASE    0.69813472            0              0
##   veh_body.MCARA agecat.1 agecat.3 veh_age.1 veh_age.2 veh_age.4
## 1              0        0        0         0         0         0
## 2              0        0        0         0         1         0
## 3              0        0        0         0         1         0
## 4              0        0        0         0         1         0
## 5              0        0        0         0         0         1
## 6              0        0        0         0         0         0
##   log_veh_value.veh_bodyBUS log_veh_value.veh_bodyCONVT
## 1                         0                           0
## 2                         0                           0
## 3                         0                           0
## 4                         0                           0
## 5                         0                           0
## 6                         0                           0
##   log_veh_value.veh_bodyMCARA
## 1                           0
## 2                           0
## 3                           0
## 4                           0
## 5                           0
## 6                           0

CHUNK 15 takes the binarized full dataset and implements the exact same train/test split as before, reusing the randomized split indices from before (partition). We then run the full logit model again – the summary output should be exactly the same (again, we’re only explicitly binarizing so that the stepAIC() algorithm is able to remove individual levels if they’re insignificant).

# CHUNK 15
# Set up the binarized training and test sets
train <- vehins.bin[partition, ]
test <- vehins.bin[-partition, ]

# Fit the logistic regression model to the binarized training set
logit.full <- glm(clm ~ . - exposure, data = train, family = binomial)
summary(logit.full)
## 
## Call:
## glm(formula = clm ~ . - exposure, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1054  -0.3962  -0.3710  -0.3442   2.5928  
## 
## Coefficients:
##                              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.666221   0.040753 -65.425  < 2e-16 ***
## genderM                     -0.036598   0.036311  -1.008  0.31350    
## areaOTHER                    0.140962   0.039342   3.583  0.00034 ***
## log_veh_value                0.184019   0.038525   4.777 1.78e-06 ***
## veh_body.BUS                 1.893682   0.614643   3.081  0.00206 ** 
## veh_body.CONVT               0.344542   1.097079   0.314  0.75348    
## veh_body.MCARA              -0.676286   1.037468  -0.652  0.51449    
## agecat.1                     0.240163   0.057624   4.168 3.08e-05 ***
## agecat.3                    -0.236353   0.044210  -5.346 8.98e-08 ***
## veh_age.1                   -0.126600   0.056819  -2.228  0.02587 *  
## veh_age.2                    0.063206   0.048253   1.310  0.19023    
## veh_age.4                   -0.008732   0.052120  -0.168  0.86695    
## log_veh_value.veh_bodyBUS   -2.120656   1.262239  -1.680  0.09294 .  
## log_veh_value.veh_bodyCONVT -0.580723   0.642281  -0.904  0.36591    
## log_veh_value.veh_bodyMCARA  1.135352   0.860314   1.320  0.18694    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25455  on 50852  degrees of freedom
## Residual deviance: 25319  on 50838  degrees of freedom
## AIC: 25349
## 
## Number of Fisher Scoring iterations: 5

In CHUNK 16, the stepAIC() function is used twice to run a backward selection algorithm, first using AIC as its criterion and then using BIC.

  • Again, backward selection tends to yield more features while forward selection yields fewer
  • Similarly, AIC tends to yield more features than BIC
  • The backward AIC algorithm selected a model with area, log_veh_value, veh_body.BUS, agecat.1, agecat.3, veh_age.1, veh_age.2, and the interactions between log_veh_value and veh_bodyBUS and veh_bodyMCARA
    • veh_age.2 and the interaction with veh_bodyBUS are the only two insignificant features
  • The backward BIC algorithm selected a model with far fewer features, only including areaOTHER, log_veh_value and the agecat levels
    • All the interaction terms were dropped
    • All retained features are significant
    • The difference between the AIC- and BIC-selected models makes sense since BIC places a much harsher penalty on the number of parameters than AIC, especially given the large number of obs in the training set
  • Since the objective is to identify only the key factors driving claim occurrence, we’ll choose the more parsimonious BIC model
# CHUNK 16
library(MASS)
logit.AIC <- stepAIC(logit.full)
## Start:  AIC=25348.85
## clm ~ (exposure + gender + area + log_veh_value + veh_body.BUS + 
##     veh_body.CONVT + veh_body.MCARA + agecat.1 + agecat.3 + veh_age.1 + 
##     veh_age.2 + veh_age.4 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA) - exposure
## 
##                               Df Deviance   AIC
## - veh_age.4                    1    25319 25347
## - veh_body.CONVT               1    25319 25347
## - veh_body.MCARA               1    25319 25347
## - log_veh_value.veh_bodyCONVT  1    25320 25348
## - gender                       1    25320 25348
## - veh_age.2                    1    25321 25349
## - log_veh_value.veh_bodyMCARA  1    25321 25349
## <none>                              25319 25349
## - log_veh_value.veh_bodyBUS    1    25322 25350
## - veh_age.1                    1    25324 25352
## - veh_body.BUS                 1    25326 25354
## - area                         1    25332 25360
## - agecat.1                     1    25335 25363
## - log_veh_value                1    25342 25370
## - agecat.3                     1    25348 25376
## 
## Step:  AIC=25346.87
## clm ~ gender + area + log_veh_value + veh_body.BUS + veh_body.CONVT + 
##     veh_body.MCARA + agecat.1 + agecat.3 + veh_age.1 + veh_age.2 + 
##     log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - veh_body.CONVT               1    25319 25345
## - veh_body.MCARA               1    25319 25345
## - log_veh_value.veh_bodyCONVT  1    25320 25346
## - gender                       1    25320 25346
## - log_veh_value.veh_bodyMCARA  1    25321 25347
## <none>                              25319 25347
## - veh_age.2                    1    25321 25347
## - log_veh_value.veh_bodyBUS    1    25322 25348
## - veh_age.1                    1    25324 25350
## - veh_body.BUS                 1    25326 25352
## - area                         1    25332 25358
## - agecat.1                     1    25336 25362
## - log_veh_value                1    25347 25373
## - agecat.3                     1    25348 25374
## 
## Step:  AIC=25344.97
## clm ~ gender + area + log_veh_value + veh_body.BUS + veh_body.MCARA + 
##     agecat.1 + agecat.3 + veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + 
##     log_veh_value.veh_bodyCONVT + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - veh_body.MCARA               1    25319 25343
## - gender                       1    25320 25344
## - log_veh_value.veh_bodyMCARA  1    25321 25345
## - log_veh_value.veh_bodyCONVT  1    25321 25345
## <none>                              25319 25345
## - veh_age.2                    1    25321 25345
## - log_veh_value.veh_bodyBUS    1    25322 25346
## - veh_age.1                    1    25324 25348
## - veh_body.BUS                 1    25326 25350
## - area                         1    25332 25356
## - agecat.1                     1    25336 25360
## - log_veh_value                1    25347 25371
## - agecat.3                     1    25349 25373
## 
## Step:  AIC=25343.43
## clm ~ gender + area + log_veh_value + veh_body.BUS + agecat.1 + 
##     agecat.3 + veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + 
##     log_veh_value.veh_bodyCONVT + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - gender                       1    25321 25343
## - log_veh_value.veh_bodyCONVT  1    25321 25343
## <none>                              25319 25343
## - veh_age.2                    1    25322 25344
## - log_veh_value.veh_bodyBUS    1    25323 25345
## - log_veh_value.veh_bodyMCARA  1    25323 25345
## - veh_age.1                    1    25325 25347
## - veh_body.BUS                 1    25327 25349
## - area                         1    25332 25354
## - agecat.1                     1    25336 25358
## - log_veh_value                1    25348 25370
## - agecat.3                     1    25349 25371
## 
## Step:  AIC=25342.52
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - log_veh_value.veh_bodyCONVT  1    25322 25342
## <none>                              25321 25343
## - veh_age.2                    1    25323 25343
## - log_veh_value.veh_bodyBUS    1    25324 25344
## - log_veh_value.veh_bodyMCARA  1    25324 25344
## - veh_age.1                    1    25325 25345
## - veh_body.BUS                 1    25328 25348
## - area                         1    25333 25353
## - agecat.1                     1    25337 25357
## - log_veh_value                1    25348 25368
## - agecat.3                     1    25351 25371
## 
## Step:  AIC=25342.41
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## <none>                              25322 25342
## - veh_age.2                    1    25325 25343
## - log_veh_value.veh_bodyBUS    1    25326 25344
## - log_veh_value.veh_bodyMCARA  1    25326 25344
## - veh_age.1                    1    25327 25345
## - veh_body.BUS                 1    25330 25348
## - area                         1    25335 25353
## - agecat.1                     1    25339 25357
## - log_veh_value                1    25348 25366
## - agecat.3                     1    25353 25371
summary(logit.AIC)
## 
## Call:
## glm(formula = clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + 
##     agecat.3 + veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + 
##     log_veh_value.veh_bodyMCARA, family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0936  -0.3957  -0.3710  -0.3448   2.5219  
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                 -2.68557    0.02922 -91.906  < 2e-16 ***
## areaOTHER                    0.14069    0.03933   3.577 0.000348 ***
## log_veh_value                0.17631    0.03457   5.101 3.38e-07 ***
## veh_body.BUS                 1.87999    0.61420   3.061 0.002207 ** 
## agecat.1                     0.23894    0.05758   4.150 3.33e-05 ***
## agecat.3                    -0.23955    0.04409  -5.433 5.54e-08 ***
## veh_age.1                   -0.11617    0.05539  -2.097 0.035953 *  
## veh_age.2                    0.07383    0.04569   1.616 0.106107    
## log_veh_value.veh_bodyBUS   -2.09924    1.26072  -1.665 0.095889 .  
## log_veh_value.veh_bodyMCARA  0.59658    0.28817   2.070 0.038430 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25455  on 50852  degrees of freedom
## Residual deviance: 25322  on 50843  degrees of freedom
## AIC: 25342
## 
## Number of Fisher Scoring iterations: 5
logit.BIC <- stepAIC(logit.full, k = log(nrow(train)))
## Start:  AIC=25481.4
## clm ~ (exposure + gender + area + log_veh_value + veh_body.BUS + 
##     veh_body.CONVT + veh_body.MCARA + agecat.1 + agecat.3 + veh_age.1 + 
##     veh_age.2 + veh_age.4 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA) - exposure
## 
##                               Df Deviance   AIC
## - veh_age.4                    1    25319 25471
## - veh_body.CONVT               1    25319 25471
## - veh_body.MCARA               1    25319 25471
## - log_veh_value.veh_bodyCONVT  1    25320 25471
## - gender                       1    25320 25472
## - veh_age.2                    1    25321 25472
## - log_veh_value.veh_bodyMCARA  1    25321 25472
## - log_veh_value.veh_bodyBUS    1    25322 25474
## - veh_age.1                    1    25324 25476
## - veh_body.BUS                 1    25326 25478
## <none>                              25319 25481
## - area                         1    25332 25483
## - agecat.1                     1    25335 25487
## - log_veh_value                1    25342 25493
## - agecat.3                     1    25348 25500
## 
## Step:  AIC=25470.59
## clm ~ gender + area + log_veh_value + veh_body.BUS + veh_body.CONVT + 
##     veh_body.MCARA + agecat.1 + agecat.3 + veh_age.1 + veh_age.2 + 
##     log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - veh_body.CONVT               1    25319 25460
## - veh_body.MCARA               1    25319 25460
## - log_veh_value.veh_bodyCONVT  1    25320 25461
## - gender                       1    25320 25461
## - log_veh_value.veh_bodyMCARA  1    25321 25462
## - veh_age.2                    1    25321 25462
## - log_veh_value.veh_bodyBUS    1    25322 25463
## - veh_age.1                    1    25324 25465
## - veh_body.BUS                 1    25326 25467
## <none>                              25319 25471
## - area                         1    25332 25472
## - agecat.1                     1    25336 25476
## - log_veh_value                1    25347 25488
## - agecat.3                     1    25348 25489
## 
## Step:  AIC=25459.84
## clm ~ gender + area + log_veh_value + veh_body.BUS + veh_body.MCARA + 
##     agecat.1 + agecat.3 + veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + 
##     log_veh_value.veh_bodyCONVT + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - veh_body.MCARA               1    25319 25450
## - gender                       1    25320 25450
## - log_veh_value.veh_bodyMCARA  1    25321 25451
## - log_veh_value.veh_bodyCONVT  1    25321 25451
## - veh_age.2                    1    25321 25451
## - log_veh_value.veh_bodyBUS    1    25322 25453
## - veh_age.1                    1    25324 25454
## - veh_body.BUS                 1    25326 25456
## <none>                              25319 25460
## - area                         1    25332 25462
## - agecat.1                     1    25336 25466
## - log_veh_value                1    25347 25477
## - agecat.3                     1    25349 25479
## 
## Step:  AIC=25449.47
## clm ~ gender + area + log_veh_value + veh_body.BUS + agecat.1 + 
##     agecat.3 + veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + 
##     log_veh_value.veh_bodyCONVT + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - gender                       1    25321 25440
## - log_veh_value.veh_bodyCONVT  1    25321 25441
## - veh_age.2                    1    25322 25441
## - log_veh_value.veh_bodyBUS    1    25323 25442
## - log_veh_value.veh_bodyMCARA  1    25323 25442
## - veh_age.1                    1    25325 25444
## - veh_body.BUS                 1    25327 25446
## <none>                              25319 25450
## - area                         1    25332 25451
## - agecat.1                     1    25336 25455
## - log_veh_value                1    25348 25467
## - agecat.3                     1    25349 25468
## 
## Step:  AIC=25439.72
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyCONVT + 
##     log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - log_veh_value.veh_bodyCONVT  1    25322 25431
## - veh_age.2                    1    25323 25431
## - log_veh_value.veh_bodyBUS    1    25324 25432
## - log_veh_value.veh_bodyMCARA  1    25324 25432
## - veh_age.1                    1    25325 25434
## - veh_body.BUS                 1    25328 25436
## <none>                              25321 25440
## - area                         1    25333 25441
## - agecat.1                     1    25337 25445
## - log_veh_value                1    25348 25456
## - agecat.3                     1    25351 25459
## 
## Step:  AIC=25430.77
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + veh_age.2 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - veh_age.2                    1    25325 25423
## - log_veh_value.veh_bodyBUS    1    25326 25423
## - log_veh_value.veh_bodyMCARA  1    25326 25424
## - veh_age.1                    1    25327 25424
## - veh_body.BUS                 1    25330 25427
## <none>                              25322 25431
## - area                         1    25335 25433
## - agecat.1                     1    25339 25436
## - log_veh_value                1    25348 25446
## - agecat.3                     1    25353 25451
## 
## Step:  AIC=25422.54
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + log_veh_value.veh_bodyBUS + log_veh_value.veh_bodyMCARA
## 
##                               Df Deviance   AIC
## - log_veh_value.veh_bodyMCARA  1    25328 25415
## - log_veh_value.veh_bodyBUS    1    25328 25415
## - veh_body.BUS                 1    25332 25419
## - veh_age.1                    1    25335 25422
## <none>                              25325 25423
## - area                         1    25337 25424
## - agecat.1                     1    25342 25429
## - agecat.3                     1    25355 25442
## - log_veh_value                1    25367 25454
## 
## Step:  AIC=25415.04
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1 + log_veh_value.veh_bodyBUS
## 
##                             Df Deviance   AIC
## - log_veh_value.veh_bodyBUS  1    25332 25408
## - veh_body.BUS               1    25335 25411
## - veh_age.1                  1    25339 25415
## <none>                            25328 25415
## - area                       1    25341 25417
## - agecat.1                   1    25345 25421
## - agecat.3                   1    25358 25434
## - log_veh_value              1    25372 25448
## 
## Step:  AIC=25407.62
## clm ~ area + log_veh_value + veh_body.BUS + agecat.1 + agecat.3 + 
##     veh_age.1
## 
##                 Df Deviance   AIC
## - veh_body.BUS   1    25336 25401
## - veh_age.1      1    25342 25407
## <none>                25332 25408
## - area           1    25344 25409
## - agecat.1       1    25349 25414
## - agecat.3       1    25361 25426
## - log_veh_value  1    25375 25440
## 
## Step:  AIC=25400.48
## clm ~ area + log_veh_value + agecat.1 + agecat.3 + veh_age.1
## 
##                 Df Deviance   AIC
## - veh_age.1      1    25346 25400
## <none>                25336 25401
## - area           1    25348 25402
## - agecat.1       1    25353 25407
## - agecat.3       1    25365 25419
## - log_veh_value  1    25379 25433
## 
## Step:  AIC=25400.09
## clm ~ area + log_veh_value + agecat.1 + agecat.3
## 
##                 Df Deviance   AIC
## <none>                25346 25400
## - area           1    25359 25402
## - agecat.1       1    25362 25405
## - agecat.3       1    25377 25420
## - log_veh_value  1    25379 25423
summary(logit.BIC)
## 
## Call:
## glm(formula = clm ~ area + log_veh_value + agecat.1 + agecat.3, 
##     family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5332  -0.3957  -0.3730  -0.3442   2.5141  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.67961    0.02730 -98.165  < 2e-16 ***
## areaOTHER      0.14087    0.03931   3.584 0.000339 ***
## log_veh_value  0.16440    0.02859   5.750 8.90e-09 ***
## agecat.1       0.23614    0.05747   4.109 3.98e-05 ***
## agecat.3      -0.23941    0.04398  -5.443 5.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 25455  on 50852  degrees of freedom
## Residual deviance: 25346  on 50848  degrees of freedom
## AIC: 25356
## 
## Number of Fisher Scoring iterations: 5
AIC(logit.AIC)
## [1] 25342.41
AIC(logit.BIC)
## [1] 25355.91

Testing Predictions/Adjusting for Exposure

Before calculating predictions on our TEST set, we need to confront the fact that we’ve ignored the exposure variable so far – i.e. we’ve assumed each obs/policyholder experienced the same exposure over the policy year, which isn’t actually the case. In reality, some policies began midway through the year and so have shorter exposures than a policy that was in effect for the whole year.

One way we can correct for this is to use the exposure variable and assume that the probability of a claim occurring is proportionally reduced by the amount of exposure (we’re reducing since we assumed an exposure of 1 full year, which is actually the maximum).

Mathematically:

  • The exposure-adjusted probability is \(\pi^* = t\pi\), where \(t\) is the exposure (the fraction of the year the policy was in force)
  • The predicted exposure-adjusted probability is then \(\hat{\pi}^* = t \hat{\pi}\)

CHUNK 17 rescales the predictions generated by the three models we previously fit (full, AIC and BIC) using exposure and then generates a ROC curve and the AUC to compare each model’s predictive performance on the TEST dataset:

  • FULL-AUC: 0.6589
  • AIC-AUC: 0.6587
  • BIC-AUC: 0.6592
  • Note that if the ROC/AUCs were run on these three models without adjusting for exposure, all three models would show AUCs close to 0.5 (indicating poor performance across the board)

The BIC-selected model performs best according to the test AUC – which is reassuring, despite the BIC model having fewer features than the other two.

# CHUNK 17
library(pROC)
pred.full <- test$exposure * predict(logit.full, newdata = test, type = "response")
roc.full <- roc(test$clm, pred.full)  # arguments: actual responses, predicted probabilities
auc(roc.full)
## Area under the curve: 0.6589
pred.AIC <- test$exposure * predict(logit.AIC, newdata = test, type = "response")
roc.AIC <- roc(test$clm, pred.AIC)
#plot(roc.AIC) # we can also plot the roc() returned list
auc(roc.AIC)
## Area under the curve: 0.6587
pred.BIC <- test$exposure * predict(logit.BIC, newdata = test, type = "response")
roc.BIC <- roc(test$clm, pred.BIC)
auc(roc.BIC)
## Area under the curve: 0.6592

Model Interpretation

Finally, we’ll refit our selected model on the entire dataset for more robust parameter estimates and call it the final model (CHUNK 18). This fit will be used to produce our final interpretations.

Since we’ve used the logit link, we’re able to interpret the model on an odds basis (exclusive to the logit link) or on a probability basis (which is available for all link functions):

Odds-Based Interpretation

  • Coefficients are interpreted in terms of multiplicative changes/percentage changes with respect to the odds of a claim occurring
  • See the table below, and the sketch that follows, for the appropriate odds-based interpretations of each coefficient in terms of multiplicative changes
Odds-Based Interpretation of the Final Model
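Since the table above is presented as an image, here is a minimal sketch of how those multiplicative odds changes could be recovered from the fitted model (assuming logit.final has been fit as in CHUNK 18 below):

# Multiplicative change in the odds of a claim for a one-unit change in each feature
exp(coef(logit.final))
# Equivalent percentage changes in the odds
(exp(coef(logit.final)) - 1) * 100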

Probability-Based Interpretation

  • Estimated probabilities are hard to extract directly from logistic regression results since the probability function \(p = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}\) is a nonlinear function of the linear predictor
  • One possible approach is to set a baseline as the “average policyholder” who has the mean or median feature values, and then examine the isolated impact of each feature on the estimated probability
    • See CHUNK 19 for an example
  • Based on this “average policyholder”, we can say:
    • Living in OTHER areas compared to BASE increases the estimated prob of a claim occurring by \(0.0746 - 0.068 = 0.66\%\)
    • A 10% increase (in proportion) in log_veh_value increases the estimated prob of a claim occurring by \(0.0684 - 0.068 = 0.04\%\)
    • Being in age category 1 increases the estimated prob of a claim occurring by \(0.0837 - 0.068 = 1.57\%\)
    • Being in age category 3 changes the estimated prob of a claim occurring by \(0.05567 - 0.068 = -1.23\%\)
# CHUNK 18 - FINAL model
logit.final <- glm(clm ~ area + log_veh_value + agecat.1 + agecat.3,
                   data = vehins.bin, family = binomial)
summary(logit.final)
## 
## Call:
## glm(formula = clm ~ area + log_veh_value + agecat.1 + agecat.3, 
##     family = binomial, data = vehins.bin)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5173  -0.3923  -0.3724  -0.3461   2.5019  
## 
## Coefficients:
##               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)   -2.67942    0.02367 -113.200  < 2e-16 ***
## areaOTHER      0.09974    0.03449    2.892  0.00383 ** 
## log_veh_value  0.15962    0.02481    6.433 1.25e-10 ***
## agecat.1       0.22451    0.05055    4.441 8.93e-06 ***
## agecat.3      -0.21333    0.03786   -5.635 1.76e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 33728  on 67802  degrees of freedom
## Residual deviance: 33608  on 67798  degrees of freedom
## AIC: 33618
## 
## Number of Fisher Scoring iterations: 5
# CHUNK 19
new.data <- data.frame(area = c("BASE", "OTHER", "BASE", "BASE", "BASE"),
                       log_veh_value = c(0.38675, 0.38675, 0.42543, 0.38675, 0.38675),
                       agecat.1 = c(0, 0, 0, 1, 0),
                       agecat.3 = c(0, 0, 0, 0, 1))

new.data  # first element reflects the "average" policyholder/obs, and the subsequent obs represent changes for each feature in the final model
##    area log_veh_value agecat.1 agecat.3
## 1  BASE       0.38675        0        0
## 2 OTHER       0.38675        0        0
## 3  BASE       0.42543        0        0
## 4  BASE       0.38675        1        0
## 5  BASE       0.38675        0        1
predict(logit.final, newdata = new.data, type = "response")
##          1          2          3          4          5 
## 0.06800846 0.07460946 0.06840083 0.08369446 0.05567071

5 Decision Trees

  • Decision trees are another supervised learning technique that can be used to generate predictions of some target variable given a set of predictors
  • Whereas a GLM is built on the construct that some function of the expected target is related to a linear combination of specified predictors, a decision tree takes all combinations of those predictors (or features) and tries to divide the entire feature space into non-overlapping, exhaustive regions of relatively similar/homogeneous observations with respect to the response
  • Decision tree predictions are made by locating the region the new observation/data point of interest would belong to and either use the average response value or majority class belonging to that region to form the prediction

5.1 Conceptual Foundations of Decision Trees

5.1.1 Base Decision Trees - The simplest case

  • Decision trees develop a series of classification rules or splits based on the values of the predictor variables, where observations in the same group based on these classifications/splits are relatively stable and thus much easier to analyze and predict
    • That is, within these subgroups, observations are more like one another with respect to the response
  • A quantitative target variable is predicted using the sample mean of the target in the group as the predicted value – aka regression trees
  • A qualitative target variable is predicted using the most common class of the target variable in that group as the predicted value – aka classification trees
  • These two cases earn the method the CART name, or Classification and Regression Trees

A Simple Illustration

  • Given that there are two predictors \(X_1\) and \(X_2\), a decision tree can look like figure (a) shown below
    • The tree can be interpreted as follows: the first rule sorts based on whether the observation’s \(X_1 < 10\); if so, the observation gets sorted into the first region, \(R_1\)
    • If not, it moves to the next split, where it faces the second rule checking whether \(X_2 < 8\); if so, it gets sorted into \(R_2\); if not, into \(R_3\) (see the sketch below)
  • Essentially the decision tree predictive model takes a two-dimensional space (defined by \(X_1\) and \(X_2\)) and partitions it into three regions \(R_i\), \(i \in \{1, 2, 3\}\) (see figure (b) below)
Decision Tree Figures
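A tiny sketch of the prediction rule implied by the figure (purely illustrative; the function name and test points are not from the source):

# Region assignment implied by the example tree
assign_region <- function(x1, x2) {
  if (x1 < 10) {
    "R1"              # first split: X1 < 10
  } else if (x2 < 8) {
    "R2"              # second split (only reached when X1 >= 10): X2 < 8
  } else {
    "R3"
  }
}
assign_region(5, 12)   # "R1"
assign_region(15, 3)   # "R2"
assign_region(15, 12)  # "R3"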

Commonly Used Terms

  • Node: A point on a decision tree corresponding to a subset of the data, which may be split further via binary splits
  • Root node: Node at the top of the tree, representing the full dataset
  • Terminal nodes/Leaves: Nodes at the bottom of the tree where there are no further splits
  • Binary tree: Tree where each node only has two children
  • Depth: Number of branches/splits from the tree’s root node to the furthest terminal node; a measure of complexity

How Are Splits Determined?

  • Fundamental to decision trees is understanding how the series of binary classification rules is designed, i.e. how does the tree know which splits to make at each point in the tree?
    • Note that binary splits usually separate observations into two groups according to the value of one and only one of the predictors (i.e. we can’t use both \(X_1\) and \(X_2\) to determine the same split). This is what gives the partitioning in our simple example the visual of small rectangles within the larger rectangle (the feature space)
  • So how does the decision tree algorithm (1) select the predictor to define the split and (2) choose the cutoff level to use at the split (e.g. \(X > \text{cutoff}\))?
    • Intuitively, the algorithm should split the observations on the value of an “influential” variable such that the split results in the greatest reduction in variability of the target variable within the resulting subgroups/regions
    • The result would be relatively “pure” or homogeneous target observations in the resulting two nodes
  • In order to determine the greatest improvement in “purity”, or rather the greatest decrease in “impurity”, we need a way to measure the degree of heterogeneity – the impurity measure depends on the type of target we’re trying to predict

Impurity Measures

Regression Trees

  • For a certain node (not necessarily a terminal node) \(R_m\), the RSS (residual sum of squares) would be used to measure “impurity”

\[ RSS_m = \sum_{i \in R_m} (y_i - \hat{y_{R_m}})^2 \]

  • \(\hat{y}_{R_m}\) is the mean of the target variable in \(R_m\) and serves as our prediction for the target variable there
  • In terms of RSS, the first split should be such that the overall RSS on both sides of the split/branch is smallest among all possible choices of (1) the split predictor \(X_j\) and (2) the cutoff level \(s\)

Classification Trees

Classification trees are a bit more involved since there are several impurity measures to choose from. Given a node \(R_m\) and a target variable with \(K\) distinct classes, three common measures are:

  1. Entropy, where \(p_k\) is the proportion of target observations in the \(k\)th class in the given tree node:

\[ E = - \sum_{k = 1} ^ K p_k \log_2(p_k) \]

  • By construction, entropy increases with the degree of impurity in the node
  • Smallest value is 0, when one class is completely dominant (PURITY)
  • Largest value is \(\log_2 K\), attained when \(p_k = 1/K\) for each \(k\) (i.e. observations are evenly split across classes, IMPURITY)
  2. Gini index

\[ G = \sum_{k=1}^K p_k (1-p_k) = 1 - \sum_{k=1}^Kp_k^2 \]

  • Again, a higher \(G\) value indicates higher impurity
  3. Classification error, \(E = 1 - \max_k p_k\)
  • Defined as the proportion of observations in the node that are incorrectly classified, i.e. that do not belong to the node’s most common class

Note that in most cases the choice of node impurity measure doesn’t materially affect the performance or results of a classification tree. Entropy and Gini are differentiable, so they are technically more amenable to numerical optimization; a quick numeric comparison of the two is sketched below.
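As a quick numeric illustration of these impurity measures (a minimal sketch; the helper functions are not from the source), compare a nearly pure node to an evenly split node for a binary target:

# Entropy and Gini for a vector of class proportions p (assumed to sum to 1)
entropy <- function(p) -sum(ifelse(p > 0, p * log2(p), 0))
gini    <- function(p) 1 - sum(p^2)

entropy(c(0.9, 0.1))  # nearly pure node: about 0.469
entropy(c(0.5, 0.5))  # evenly split node: log2(2) = 1, the maximum for K = 2

gini(c(0.9, 0.1))     # 0.18
gini(c(0.5, 0.5))     # 0.5, the maximum for K = 2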

SO… how are splits determined?

  • Given these impurity measures, the best binary split at each step is the one that MAXIMIZES the REDUCTION in the chosen impurity measure
  • That is, splits are made to REDUCE IMPURITY the MOST
  • The resulting subgroups are then split further, recursively, until we reach a stopping point
    • e.g. the number of observations in the resulting node falls below a certain threshold, or no additional split could reduce impurity in a substantial way
    • Note that without these stopping points, decision trees can become very complex – cutting them down is a process known as pruning

Controlling Complexity via Pruning

  • One of the biggest drawbacks of decision trees is that they are prone to overfitting – trees made too complex in an attempt to capture the signal in the training data end up capturing the noise in the training data too

  • For decision trees, the complexity/size of the tree can be measured by the number of splits or the number of leaves/terminal nodes
    • Numerous terminal nodes may make the tree difficult to interpret and produce predictions with high variance (when confronted with unseen data)
    • One reason predictions become unstable for a tree with a large number of leaves is that the final splits are often based on a small number of observations, which are more susceptible to noise
  • Decision trees therefore need strategies employed to find the right degree of complexity that optimizes the bias-variance tradeoff and maximizes predictive performance on independent test data

  • One way to control complexity is to pre-specify an impurity reduction threshold
    • Starting at the root node, a split will only be made if the reduction in impurity exceeds a certain threshold
    • This can be problematic if a “not-so-good” split isn’t made early in the tree even when it could be followed by a “good” split – thus a fairly short-sighted method
  • The preferred way of controlling complexity is PRUNING (somewhat like stepwise selection); see the sketch below
    • Start with an overblown tree and then retroactively remove splits that do not fulfill the impurity reduction threshold
    • Brings the tree down to a desired size
    • “Prunes” back branches that offer little predictive power, preventing overfitting while making the tree more predictive and interpretable
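A minimal sketch of how pruning is typically carried out in R with the rpart package (the package, its cp complexity parameter, and the use of the vehins data here are assumptions for illustration, not steps taken in the case study above):

# Illustrative only: grow a deliberately large tree, then prune it back
library(rpart)

tree.full <- rpart(factor(clm) ~ area + log_veh_value + agecat + veh_age,
                   data = vehins, method = "class",
                   control = rpart.control(cp = 0.0001, minbucket = 5))

printcp(tree.full)  # cross-validated error for each candidate subtree size

# Prune back to the subtree whose cp minimizes the cross-validated error
cp.min <- tree.full$cptable[which.min(tree.full$cptable[, "xerror"]), "CP"]
tree.pruned <- prune(tree.full, cp = cp.min)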

Pros and Cons of Decision Trees

Relative to GLMs, the pros/cons of decision trees are as follows:

PROS

  • Given there aren’t too many buckets/leaves, trees are easy to interpret and explain to a non-technical audience because of the logical if/then nature of classification rules
    • GLMs (estimated coefficients and outputs) can be difficult to interpret in some cases (some link functions less interpretable than others)
  • Great at handling nonlinear relationships and do not require variable transformations – any monotonic transformations yield the same tree since the ranking of feature values is preserved
    • GLMs rely on a linear predictor and often require some transformation for handling nonlinear relationships
  • Good at recognizing interaction effects between variables on the response
    • No need to specify potential interactions before fitting a tree as is the case with GLMs
  • Categorical predictors are automatically handled without the need for binarization or selecting a baseline level
    • GLMs require factor/categorical variables to be translated as a numeric via binarization
  • Variables are automatically selected – variables that aren’t significant are filtered out of the tree and the most important appear at the top
    • GLMs require additional procedures i.e. feature selection
  • Decision trees are non-parametric and distribution-free in nature, so they're less susceptible to model misspecification
    • GLMs require specified assumptions regarding how to parameterize the model
  • Decision trees can handle missing data fairly easily

Cons

  • Decision trees are prone to overfitting and susceptible to producing unstable predictions with high variance, especially for complicated trees
    • More sophisticated methods such as bagging/random forests and boosting can be deployed to mitigate this drawback
  • Small changes in the training data can lead to a large change in the fitted tree and predictions (i.e. sensitive to the training data, another form of overfitting issues)
  • Tend to favor categorical features with many levels over those with fewer, due to the sheer number of ways the many levels of a factor can be split into two groups
    • Combining levels before fitting a decision tree however can help with this
  • Lack of model diagnostic tools for decision trees

5.1.2 Ensemble Trees: Random Forests

Motivation Behind Ensemble Methods

  • Decision trees on their own are sensitive to noise and tend to overfit the training data even with pruning
    • Small changes in the training data can have significant effects on the resulting refitted decision tree, thus predictions can be highly unstable
  • Ensemble methods attempt to hedge against overfitting and can substantially improve predictive performance in many cases by creating multiple base models, instead of relying on a single model
    • The results of these multiple models are then combined to make predictions
    • This allows the model to capture complex relationships in the data, since the multiple base models each work on different parts of the complex relationship, while also reducing the variability of the model's predictions (the idea is similar to how the variance of an average is lower than the variance of a single component – see the short numerical check after this list)
    • Overall, produces better predictions in terms of both bias and variance
  • Common ensemble methods include bagging (specifically random forests) and boosting
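A quick simulation (a minimal sketch, not one of the numbered chunks) illustrating the "variance of an average" idea: averaging \(B\) independent predictions, each with the same variance, cuts the variance by roughly a factor of \(B\):

set.seed(42)
B <- 100
var(rnorm(10000))                       # variance of a single N(0, 1) "prediction": close to 1
var(replicate(10000, mean(rnorm(B))))   # variance of the average of B such predictions: close to 1/B = 0.01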

Random Forests – the Basics

  • Random forests entail generating multiple bootstrapped samples (sampling with replacement) of the training set and fitting base trees in parallel, independently, on each of the bootstrapped training samples

  • After generating multiple bootstrapped samples of the training dataset, a base unpruned tree is trained on each of the \(B\) bootstrapped training sets (where \(B\) is the number of bootstrapped samples) – see the hand-rolled sketch after the schematic below

  • The results from these base trees are then combined to form an overall prediction
    • For quantitative targets, the overall prediction is the average of the \(B\) base predictions
    • For qualitative targets, a “majority vote” of where the observation should be classified is taken from all \(B\) base predictions
  • See below for a schematic of how random forests work to take a single training set, bootstrap that set to create multiple \(B\) training sets, fit multiple base decision trees to then arrive at a final prediction

Schematic of the mechanics of a random forest
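To make the mechanics concrete, here is a minimal hand-rolled sketch (not one of the numbered chunks) of the bootstrap-and-average idea for a numeric target, using rpart trees on the built-in mtcars data purely for illustration. It shows only the bagging portion; the per-split feature sampling that turns bagging into a random forest is discussed next.

library(rpart)
set.seed(42)
B <- 100                                            # number of bootstrapped samples
preds <- matrix(NA, nrow = nrow(mtcars), ncol = B)  # one column of predictions per base tree
for (b in 1:B) {
  idx <- sample(seq_len(nrow(mtcars)), replace = TRUE)             # bootstrap the training set
  tree.b <- rpart(mpg ~ ., data = mtcars[idx, ],
                  control = rpart.control(cp = 0, minbucket = 1))  # unpruned base tree
  preds[, b] <- predict(tree.b, newdata = mtcars)
}
pred.bagged <- rowMeans(preds)  # quantitative target: average the B base predictions
head(pred.bagged)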

Randomization at Each Split

  • One of the distinguishing characteristics of a random forest is that at each split (as each base decision tree is grown) a random sample of \(m\) predictors is chosen as the split candidates out of the \(p\) available features
    • Whichever of the \(m\) chosen predictors contributes to the greatest reduction in impurity is used to determine the split
    • A new random sample of \(m\) predictors is drawn at every split (so a predictor chosen at an earlier split in the same base tree can be chosen again)
    • Typical choice of \(m\) is \(\sqrt{p}\) for classification and \(p/3\) for regression but can be tuned via cross-validation as part of the model-building process
    • Sampling only a subset of the available features as split candidates is beneficial because it prevents the base trees from looking similar to each other – it promotes a diverse forest of trees with a diverse set of features across splits, helping reduce variance when the predictions are averaged (see the sketch below)
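A hedged sketch of the same idea using the randomForest package directly (the case study later fits random forests through caret, whose method = "rf" wraps this package); mtcars and the object name rf.fit are used purely for illustration:

# install.packages("randomForest")  # uncomment the first time you use the package
library(randomForest)
set.seed(42)
p <- ncol(mtcars) - 1                      # number of available predictors
rf.fit <- randomForest(mpg ~ ., data = mtcars,
                       ntree = 200,
                       mtry = floor(p / 3))  # regression rule of thumb: m = p/3 predictors per split
rf.fit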

Pros and Cons of Random Forests (relative to a single tree)

Pros

  • As an ensemble method, produces more robust predictions than a single decision tree
    • Even though the \(B\) base trees are unpruned (have low bias, high variance), averaging the results of the trees contributes substantial variance reduction – especially when the number of base trees \(B\) is large
    • Note that bias is generally not reduced by random forests

Cons

  • Random forests are not as interpretable as a single decision tree
    • Hundreds or even thousands of decision trees cannot be visualized in a tree-like diagram, and a random forest loses much of the interpretability of a single tree that can be followed from root to leaf
    • Less transparent in that it’s difficult to see how predictions depend on each feature (i.e. the whole process seems like a black box)
  • Random forests can be computationally expensive compared to a base decision tree due to the computational burden of fitting multiple base decision trees

5.1.3 Ensemble Methods: Boosting

  • Boosting adopts a different approach to ensemble learning than random forests; boosting works on the principle of sequential learning
    • Instead of making independent bootstrapped samples and averaging predictions across the different trees, boosting builds a sequence of interdependent trees using information from previously grown trees
  • Specifically, boosting involves an iterative process where a tree is fit to the residuals of the preceding fit (using the same set of predictors) – see the hand-rolled sketch after this list
    • A scaled-down version of the current tree's predictions is subtracted from the preceding residuals to form new residuals – then the process repeats
    • The effect is that each tree focuses on the observations that the preceding trees predicted poorly
    • Thus although random forests address model variance, boosting focuses on model bias, so it’s better at capturing the signal in the data
  • The overall prediction of the boosted tree is the sum of the scaled-down prediction of each model
    • The use of scaled-down predictions is meant to slow down the learning process (it is generally true that processes that learn more slowly learn better)
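A minimal hand-rolled sketch (not one of the numbered chunks) of the boosting principle for a numeric target, again using mtcars purely for illustration: each small tree is fit to the current residuals, and its prediction is scaled down by the learning rate eta before being added to the running prediction.

library(rpart)
set.seed(42)
eta <- 0.1                              # learning rate: scales down each tree's contribution
n.trees <- 100
boost.dat <- mtcars[, c("mpg", "wt", "hp", "disp")]
pred <- rep(0, nrow(boost.dat))         # running boosted prediction
for (b in 1:n.trees) {
  boost.dat$res <- boost.dat$mpg - pred                          # current residuals
  tree.b <- rpart(res ~ wt + hp + disp, data = boost.dat,
                  control = rpart.control(maxdepth = 2, cp = 0)) # small tree fit to residuals
  pred <- pred + eta * predict(tree.b, newdata = boost.dat)      # add the scaled-down prediction
}
head(cbind(actual = boost.dat$mpg, boosted = pred))  # overall prediction = sum of scaled-down trees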

Boosting vs Random Forests/Base Trees

  • Boosted models often perform better than random forests in terms of prediction accuracy since they emphasize reducing bias
    • More prone to overfitting than random forests
    • Also entails high computational costs (similar to random forests)
  • Compared to single decision trees, boosted models are again better at prediction accuracy (since boosting reduces bias significantly)
    • Less interpretable (just like random forests)
    • More computationally costly (again, just like random forests)

5.2 Mini-Case Study: A Toy Decision Tree

  • In this “mini” case-study, we’ll construct a toy decision tree on a small scale dataset
    • The small scale will allow us to perform some calculations by hand and replicate/confirm the R output
  • The objectives covered in this case study are to:
    • Fit a regression tree using rpart()
    • Produce a graphical tree using rpart.plot
    • Interpret the output produced by rpart()
    • Prune the regression tree using prune()

First, we'll set up the data in CHUNK 1 – \(Y\) is our numeric response and \(X_1\), \(X_2\) are our two predictor variables (both numeric)

# CHUNK 1
# Clean the working memory
rm(list = ls())

# Set up toy variables
X1 <- c(1, 2, 3, 4, 2, 1)
X2 <- c(0, 1, 2, 1, 2, 1)
Y <- c(1.2, 2.1, 1.5, 3.0, 2.0, 1.6)

dat <- data.frame(X1, X2, Y)

Main Function: rpart()

  • rpart() – short for recursive partitioning – is one of many packages and functions used to fit a decision tree
  • Using method = "anova" will fit a regression tree, while method = "class" will fit a classification tree; if the method is left unspecified, R will try to figure out the appropriate one on its own
  • The control argument uses the rpart.control() function to specify a list of “controlling” parameters that tells the tree when to stop growing. Some parameters of rpart.control() include:
    • minsplit: Min number of obs that must exist in the node for a split to be attempted; the lower, the more complicated tree
    • minbucket: Min number of observations in any terminal node (or bucket); the lower the more complicated
    • cp: The “complexity parameter”; value between 0 and 1 that specifies the min amount of impurity reduction required for a split to be made; the higher the less complicated the tree will be; additionally note that the cp parameter is used to pre-prune splits that are obviously not worthwhile (e.g. a cp of 0.001)
    • maxdepth: Sets the max number of branches from the tree’s root node to the furthest terminal node; the larger, the more complex
  • Very often the control parameters will need to be tuned as part of the tree construction process to get the right level of tree complexity (see the short illustration after this list). Two other arguments worth noting:
    • xval: The number of folds used in the built-in cross-validation; no direct effect on tree complexity, but it affects how the performance of the automatically fitted trees is assessed by cross-validation
    • parms: Limited to categorical targets only; describes parameters that guide how splits are evaluated (e.g. "gini" or "information" as the impurity measure)
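As a hedged illustration (not the fit used in CHUNK 2 below), here is how the control parameters described above would be passed to rpart(); the particular values and the object name dt.illus are chosen purely for demonstration:

library(rpart)
dt.illus <- rpart(Y ~ X1 + X2, data = dat, method = "anova",
                  control = rpart.control(minsplit = 2,    # min obs in a node to attempt a split
                                          minbucket = 1,   # min obs in any terminal node
                                          cp = 0.01,       # min impurity reduction for a split
                                          maxdepth = 5,    # max branches from root to any leaf
                                          xval = 6))       # folds for built-in cross-validation
dt.illus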

Fitting and Visualizing a Decision Tree

In CHUNK 2 we fit the decision tree using rpart() with \(Y\) as our response and \(X_1\) and \(X_2\) as our predictors; note the control parameters we specified – deliberately set so that the most complex tree will be constructed.

We plot the tree using rpart.plot – note the following:

  • The top number in each node indicates the fitted target value
  • The bottom number is the % of our obs in that node
  • The first split is whether \(X_1 < 4\), indicating between the two predictors and of all cutoff values, \(X_1 < 4\) resulted in the greatest reduction in tree impurity (measured by RSS)
  • The partitions continue with the left branch until no more node can be split to reduce the overall RSS of the model
    • We have a perfect fit (RSS is exactly 0 by the end with 1 obs in each leaf) as expected due to our control parameter specifications

Using a Decision Tree to Detect Interaction

  • From the plot in CHUNK 2, it seems like there’s an interaction between \(X_1\) and \(X_2\) – note that \(X_2\) only has a significant effect on \(Y\) if \(X_1 < 4\), as illustrated by \(X_2\) showing up as split only on the left branch
    • If \(X_1 \ge 4\) then \(X_2\) is deemed useless by the model.
  • If there were no interactions between \(X_1\) and \(X_2\) then we would expect the same set of splits that appear in the left branch to also be used in the right branch

  • CHUNK 3 runs a linear model to confirm the interaction

# CHUNK 2
# Uncomment the next two lines the first time you use these two packages
# install.packages("rpart")
# install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
dt <- rpart(Y ~ ., data = dat, control = rpart.control(minsplit = 1,
                                                       minbucket = 1,
                                                       cp = 0,
                                                       xval = 6))
rpart.plot(dt)

# CHUNK 3 -- run lm to confirm an interaction between x1 and x2
model <- lm(Y ~ X1 * X2, data = dat)
summary(model)
## 
## Call:
## lm(formula = Y ~ X1 * X2, data = dat)
## 
## Residuals:
##         1         2         3         4         5         6 
##  0.053226 -0.024194 -0.053226  0.008065  0.106452 -0.090323 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept) -0.06129    0.18810  -0.326  0.77547   
## X1           1.20806    0.10892  11.092  0.00803 **
## X2           1.31774    0.20351   6.475  0.02303 * 
## X1:X2       -0.77419    0.09995  -7.746  0.01626 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1136 on 2 degrees of freedom
## Multiple R-squared:  0.9871, Adjusted R-squared:  0.9677 
## F-statistic:    51 on 3 and 2 DF,  p-value: 0.01929

A Closer Look at Tree Splits

  • CHUNK 4 calls the output of our fitted decision tree which includes more detailed information regarding the tree’s splits

  • We note the following about the output:
    • Each node shows the (1) split, (2) number of observations, (3) deviance (in this case RSS) and (4) the fitted target value (mean of y in that node)
    • The leaf/terminal node is indicated by a *
    • Child nodes are indented and numbered as 2 × (parent node number) and 2 × (parent node number) + 1
# CHUNK 4
dt
## n= 6 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 6 2.000 1.90  
##    2) X1< 3.5 5 0.548 1.68  
##      4) X2< 0.5 1 0.000 1.20 *
##      5) X2>=0.5 4 0.260 1.80  
##       10) X1>=2.5 1 0.000 1.50 *
##       11) X1< 2.5 3 0.140 1.90  
##         22) X1< 1.5 1 0.000 1.60 *
##         23) X1>=1.5 2 0.005 2.05  
##           46) X2>=1.5 1 0.000 2.00 *
##           47) X2< 1.5 1 0.000 2.10 *
##    3) X1>=3.5 1 0.000 3.00 *

Pruning a Decision Tree

  • A notably powerful feature of rpart() is that it automatically fits a collection of decision trees corresponding to every value of cp greater than the cp value specified in the argument (and hence simpler trees), and evaluates their predictive performance using internal cross-validation.

  • This allows us to evaluate the model at each level of complexity and choose the best one
    • We can access these results by looking at the cptable element of our fitted model object (see CHUNK 5)
  • The cptable is a matrix that has the following components:
    • CP showing each cutoff value of the complexity parameter
    • nsplit showing the number of splits the tree constructed
    • rel error is the relative error (\(1 - R^2\), or RSS / Total SS), a scaled version of the training error; this should decrease with model complexity, indicating a better fit to the training data
    • xerror shows the cross-validation error; this eventually increases with model complexity, indicating overfitting detected by cross-validation; it differs from the usual cross-validation error in that it is divided by the TSS
    • xstd shows the standard error of the cross-validation error
  • We note that the scaled cross-validation errors indicate that the 5-split tree has the highest xerror although it has a 0 rel error, indicating a gross overfit
    • The overfit indicates that the tree should be pruned (CHUNK 7)
    • cp.min contains the cp associated with the smallest cross-validation error (cp = 0.72600)
    • dt.pruned uses the prune() function to prune back the tree using the cp associated with the smallest CV error we identified in the previous line
    • The prune() function then prunes back any branches that do not satisfy the new cp we fed it
    • The resulting pruned tree – as expected – is the no-split tree with only the root node, which suggests the fitted decision tree isn't useful, probably due to the very small sample size
# CHUNK 5 - Cross validation results
dt$cptable
##        CP nsplit rel error   xerror     xstd
## 1 0.72600      0    1.0000 1.440000 0.745794
## 2 0.14400      1    0.2740 2.399012 1.059821
## 3 0.06375      2    0.1300 2.220556 1.004330
## 4 0.00250      4    0.0025 2.420000 1.252211
## 5 0.00000      5    0.0000 2.420000 1.252211
#CHUNK 6 removed, extra example demonstrating explanation of rel error in cptable

# CHUNK 7
cp.min <- dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"]
cp.min
## [1] 0.726
dt.pruned <- prune(dt, cp = cp.min)
rpart.plot(dt.pruned)

5.3 Case Study: Classification Trees

In this case study we’ll build a more sophisticated classification tree, using both base and ensemble methods. Objectives follow:

  • Build a base classification tree
    • Control complexity via pruning
  • Build ensemble trees using the train() and trainControl() functions
    • Tune model parameters
  • Quantify prediction accuracy (for both base and ensemble) classification trees
  • Revisit pros/cons of decision trees relative to GLMs

5.3.1 Problem Setup and Preparatory Steps

Data Description

We'll use the Wisconsin Breast Cancer dataset – the file has \(n = 699\) records, each representing a needle aspirate sample; 458 are benign and 241 are malignant. For each sample, there are nine cytological characteristics that are possibly related to breast cancer malignancy. See the data dictionary below:

Data dictionary for breast data

The goal of this study is to identify key characteristics affecting malignancy and develop tree-based classifiers that accurately predict malignancy using a combination of the above characteristics.

In CHUNK 1 we load the breast dataset and generate some summary stats. Note the following:

  • All variables are currently treated as numeric except bNuclei
  • We make the following modifications to the data as we pre-process the data for analysis (CHUNKS 2-5):
    • We drop the ID variable – won’t be a helpful predictor (CHUNK 2)
    • The only factor variable (bNuclei) has 16 missing obs under the level “?”; we should delete them since the small number of removed records won’t bias or impair our model (CHUNK 3)
    • We'll follow the treatment of bNuclei by converting it back to an integer variable to recognize its ordered nature (that way, R won't try to create any strange splits such as bNuclei in groups 1, 2 and 5) (CHUNK 4)
    • Finally, we convert class (our response) to a factor variable to acknowledge the categorical nature of our response (CHUNK 5)
  • Remember to note the end result and summarize the changes made to the dataset as a result of cleaning/pre-processing, with justification if anything is potentially controversial (e.g. removing a large number of observations)
# CHUNK 1
# Clean the working memory
rm(list = ls())
breast <- read.csv("C:/Users/CN115792/Desktop/Exam PA/NOTES/ACTEX Stock Files/breast.csv")
summary(breast)
##        ID             thickness           size            shape       
##  Min.   :   61634   Min.   : 1.000   Min.   : 1.000   Min.   : 1.000  
##  1st Qu.:  870688   1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000  
##  Median : 1171710   Median : 4.000   Median : 1.000   Median : 1.000  
##  Mean   : 1071704   Mean   : 4.418   Mean   : 3.134   Mean   : 3.207  
##  3rd Qu.: 1238298   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000  
##  Max.   :13454352   Max.   :10.000   Max.   :10.000   Max.   :10.000  
##                                                                       
##     adhesion          eSize           bNuclei      chromatin     
##  Min.   : 1.000   Min.   : 1.000   1      :402   Min.   : 1.000  
##  1st Qu.: 1.000   1st Qu.: 2.000   10     :132   1st Qu.: 2.000  
##  Median : 1.000   Median : 2.000   2      : 30   Median : 3.000  
##  Mean   : 2.807   Mean   : 3.216   5      : 30   Mean   : 3.438  
##  3rd Qu.: 4.000   3rd Qu.: 4.000   3      : 28   3rd Qu.: 5.000  
##  Max.   :10.000   Max.   :10.000   8      : 21   Max.   :10.000  
##                                    (Other): 56                   
##    nNucleoli         mitosis           class       
##  Min.   : 1.000   Min.   : 1.000   Min.   :0.0000  
##  1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.:0.0000  
##  Median : 1.000   Median : 1.000   Median :0.0000  
##  Mean   : 2.867   Mean   : 1.589   Mean   :0.3448  
##  3rd Qu.: 4.000   3rd Qu.: 1.000   3rd Qu.:1.0000  
##  Max.   :10.000   Max.   :10.000   Max.   :1.0000  
## 
str(breast)
## 'data.frame':    699 obs. of  11 variables:
##  $ ID       : int  1000025 1002945 1015425 1016277 1017023 1017122 1018099 1018561 1033078 1033078 ...
##  $ thickness: int  5 5 3 6 4 8 1 2 2 4 ...
##  $ size     : int  1 4 1 8 1 10 1 1 1 2 ...
##  $ shape    : int  1 4 1 8 1 10 1 2 1 1 ...
##  $ adhesion : int  1 5 1 1 3 8 1 1 1 1 ...
##  $ eSize    : int  2 7 2 3 2 7 2 2 2 2 ...
##  $ bNuclei  : Factor w/ 11 levels "?","1","10","2",..: 2 3 4 6 2 3 3 2 2 2 ...
##  $ chromatin: int  3 3 3 3 3 9 3 3 1 2 ...
##  $ nNucleoli: int  1 2 1 7 1 7 1 1 1 1 ...
##  $ mitosis  : int  1 1 1 1 1 1 1 1 5 1 ...
##  $ class    : int  0 0 0 0 0 1 0 0 0 0 ...
# CHUNK 2
# Drop the ID variable
breast$ID <- NULL


# CHUNK 3
table(breast$bNuclei)
## 
##   ?   1  10   2   3   4   5   6   7   8   9 
##  16 402 132  30  28  19  30   4   8  21   9
# Drop the missing observations for bNuclei
breast <- breast[!(breast$bNuclei == "?"), ]
nrow(breast)
## [1] 683
# CHUNK 4
# Convert bNuclei to integers
breast$bNuclei <- factor(breast$bNuclei, levels = 1:10, order = TRUE)
breast$bNuclei <- as.integer(breast$bNuclei)
class(breast$bNuclei)
## [1] "integer"
table(breast$bNuclei)
## 
##   1   2   3   4   5   6   7   8   9  10 
## 402  30  28  19  30   4   8  21   9 132
# CHUNK 5
# Convert class to a factor
breast$class <- as.factor(breast$class)
summary(breast)
##    thickness           size            shape           adhesion    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 1.000   1st Qu.: 1.00  
##  Median : 4.000   Median : 1.000   Median : 1.000   Median : 1.00  
##  Mean   : 4.442   Mean   : 3.151   Mean   : 3.215   Mean   : 2.83  
##  3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##      eSize           bNuclei         chromatin        nNucleoli    
##  Min.   : 1.000   Min.   : 1.000   Min.   : 1.000   Min.   : 1.00  
##  1st Qu.: 2.000   1st Qu.: 1.000   1st Qu.: 2.000   1st Qu.: 1.00  
##  Median : 2.000   Median : 1.000   Median : 3.000   Median : 1.00  
##  Mean   : 3.234   Mean   : 3.545   Mean   : 3.445   Mean   : 2.87  
##  3rd Qu.: 4.000   3rd Qu.: 6.000   3rd Qu.: 5.000   3rd Qu.: 4.00  
##  Max.   :10.000   Max.   :10.000   Max.   :10.000   Max.   :10.00  
##     mitosis       class  
##  Min.   : 1.000   0:444  
##  1st Qu.: 1.000   1:239  
##  Median : 1.000          
##  Mean   : 1.603          
##  3rd Qu.: 1.000          
##  Max.   :10.000
table(breast$class)/nrow(breast)  # prop of benign/malignant
## 
##         0         1 
## 0.6500732 0.3499268

Visualizing the Data

After we’ve pre-processed/cleaned our dataset, we can move on to exploring the data via visuals. Since each of the potential predictors are integers on a scale of 1:10 (technically numeric), a box plot split by the binary response would be a good starting point (see CHUNK 6)

From the box plots, we note:

  • For all the predictors except mitosis, the median value is higher for samples identified as malignant than for benign ones
    • This indicates that all these characteristics (except mitosis) may be good predictors of our response
    • Additionally, samples flagged as malignant are associated with much larger and more volatile values of the predictors (the spread is wider)
  • The challenge in this case, however, is to identify which of these variables would be the most useful predictors, noting that they may all be highly correlated with one another
# CHUNK 6 - Visualizing the Data
library(ggplot2)

# Extract all variables except class
var <- names(breast)[-10]

for (i in var) {
  plot <- ggplot(breast, aes(x = class, y = breast[, i])) +
    geom_boxplot() +
    labs(y = i)
  print(plot)
}

5.3.2 Construction and Evaluation of Base Classification Trees

Now that we’ve cleaned, summarized and visualized the data – we can begin constructing our model, starting with just a base classification tree.

CHUNK 7 sets up the data partition for the training/test sets – the proportion of 1's for the response is similar between the train and test sets, demonstrating that the built-in stratification is working well and that both sets are representative of the data.

# CHUNK 7
library(caret)
set.seed(2019)

partition <- createDataPartition(y = breast$class, p = .7, list = FALSE)
train <- breast[partition, ] 
test <- breast[-partition, ]

table(train$class)/nrow(train)
## 
##         0         1 
## 0.6492693 0.3507307
table(test$class)/nrow(test)
## 
##         0         1 
## 0.6519608 0.3480392

Tree 1: A larger, complex tree

Fitting the tree

CHUNK 8 builds a large, complex tree using our training data to start off our model building process. Note the following:

  • Note we set a seed before fitting our tree so that the internal cross-validation (cptable) is reproducible
  • We identify the method as “class” since our response is categorical (anova for regression trees)
  • The control parameters under rpart.control() include
    • A min number of obs in any terminal node must be at least 2 (reasonable given the small sample size)
    • A cp of 0.0005 (i.e. the information gain/reduction in impurity must exceed this for a split to be made); we want this to be lower than the standard 0.001, with the intention of starting off with a complex tree
    • maxdepth is set to 20 branches from the root to the furthest terminal node
    • We specify "gini" as the impurity measure via the parms argument

Interpreting the Output

  • In the printed output, RSS is replaced with the number of misclassifications in the "loss" column, and the proportion of obs in each class appears in the yprob column
  • In the first/root node there are 479 obs (the size of the training set), of which 168 are misclassified (i.e. are a 1 while the node predicts 0)
  • size < 3.5 represents the first split, meaning that out of all the predictors and cutoff values:
    • A split based on size led to the greatest reduction in impurity / the greatest information gain
    • Those with size < 3.5 are sent to the left, where the predicted class is benign (class = 0)
    • Those with size >= 3.5 are sent to the right, where the predicted class is malignant (class = 1)
    • Note that the number of misclassifications has decreased from 168 to 25 and 8 respectively (or 33 in total)
  • The tree continues, and we note that some of the terminal nodes have fewer than 5 observations in them, indicating overfitting
# CHUNK 8
library(rpart)
library(rpart.plot)
set.seed(123)  # set the seed so the built-in cross-validation is reproducible

# method = "class" ensures that target is treated as a categorical variable
dt.1 <- rpart(class ~ ., data = train, method = "class",
              control = rpart.control(minbucket = 2,
                                      cp = 0.0005,
                                      maxdepth = 20),
              parms = list(split = "gini"))

dt.1  # print the output
## n= 479 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 479 168 0 (0.64926931 0.35073069)  
##    2) size< 3.5 328  25 0 (0.92378049 0.07621951)  
##      4) bNuclei< 5.5 311  10 0 (0.96784566 0.03215434)  
##        8) nNucleoli< 3.5 304   5 0 (0.98355263 0.01644737)  
##         16) bNuclei< 2.5 282   0 0 (1.00000000 0.00000000) *
##         17) bNuclei>=2.5 22   5 0 (0.77272727 0.22727273)  
##           34) thickness< 3.5 15   0 0 (1.00000000 0.00000000) *
##           35) thickness>=3.5 7   2 1 (0.28571429 0.71428571)  
##             70) eSize< 2.5 3   1 0 (0.66666667 0.33333333) *
##             71) eSize>=2.5 4   0 1 (0.00000000 1.00000000) *
##        9) nNucleoli>=3.5 7   2 1 (0.28571429 0.71428571)  
##         18) thickness< 4.5 2   0 0 (1.00000000 0.00000000) *
##         19) thickness>=4.5 5   0 1 (0.00000000 1.00000000) *
##      5) bNuclei>=5.5 17   2 1 (0.11764706 0.88235294) *
##    3) size>=3.5 151   8 1 (0.05298013 0.94701987)  
##      6) size< 4.5 29   6 1 (0.20689655 0.79310345)  
##       12) bNuclei< 7.5 13   6 1 (0.46153846 0.53846154)  
##         24) adhesion< 3.5 5   1 0 (0.80000000 0.20000000) *
##         25) adhesion>=3.5 8   2 1 (0.25000000 0.75000000)  
##           50) shape< 4.5 3   1 0 (0.66666667 0.33333333) *
##           51) shape>=4.5 5   0 1 (0.00000000 1.00000000) *
##       13) bNuclei>=7.5 16   0 1 (0.00000000 1.00000000) *
##      7) size>=4.5 122   2 1 (0.01639344 0.98360656) *
rpart.plot(dt.1)  # plot the output

Tree 2: Pruning Tree 1 using the One SE Rule

  • Some issues with Tree 1 are that it's probably overfitting our training data and that the number of splits / the complexity of the tree may not be interpretable to a non-technical audience

  • We’ll attempt to reduce this tree (pruning) to a reasonable size for both predictions and interpretation by using the different cp’s tested in rpart()’s built-in cross-validation procedure (CHUNK 9)
    • Again, remember that rpart() runs an internal cross-validation (defaulting to 10 folds), testing a set of cp values above the specified value passed to the function (in this case 0.0005)
    • The results of those cross-validation/cp’s can be accessed through the cptable element

Interpreting the CP Table/Cross Validation Results

  • Similar to regression trees, the relative error decreases with increasing tree complexity (and a lower cp threshold), indicating a better fit to the training data
    • For classification trees, the classification error is scaled by the classification error with no splits (similar to regression with RSS / TSS)
  • The cross-validation error (xerror), however, indicates predictive performance (since it uses cross-validation to check for overfitting and performance on "unseen" data)
  • plotcp() can be used to visualize the behavior of the cross-validation error at different cp’s (CHUNK 10)
    • The top of the plot shows the corresponding tree size (number of splits + 1)
  • Both the cp table and cp plot indicate that the chosen cp = 0.0005 results in the smallest cross-validation error; however, we note that the trees with 4 or 6 splits have a comparable CV error while being much smaller
    • Since our objective is to have an interpretable and predictive model, we'll employ the one-standard-error rule, which states that we can select the smallest tree whose CV error is within one SE of the smallest CV error
    • The one-SE rule is represented in the plot by the horizontal dashed line
    • In this case, the highest CV error we'll tolerate is 0.125293 (= 0.1011905 + 0.02410285) – which means we ought to choose the tree with 4 splits (cp = 0.008928571), whose CV error of 0.1130952 is below this maximum (see the short sketch below for how to extract this cp programmatically)
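The one-SE threshold and the corresponding cp can also be pulled from the cptable programmatically; a minimal sketch (object names cpt and cp.1se are illustrative) that reproduces the hard-coded row choice used in CHUNK 11 below:

cpt <- dt.1$cptable
xerr.min  <- min(cpt[, "xerror"])                     # smallest CV error
se.min    <- cpt[which.min(cpt[, "xerror"]), "xstd"]  # its standard error
threshold <- xerr.min + se.min                        # largest CV error we will tolerate
# smallest tree (fewest splits) whose CV error is within one SE of the minimum
cp.1se <- cpt[min(which(cpt[, "xerror"] <= threshold)), "CP"]
cp.1se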

Pruning the tree using a selected CP from cross-validation

  • In CHUNK 11, we run the chosen cp (0.0089286) through the prune function and plot the resulting pruned tree (dt.2)
  • Note that the pruned tree (dt.2) is considerably smaller and easier to interpret than Tree 1, with only four splits: size, bNuclei, nNucleoli and then thickness
  • Again, size seems to be the most informative predictor (as indicated by the first split) which makes intuitive sense
  • Also we note the interactions between size and the other predictors as there are further splits only on the left branch of the root node, but not on the right branch
# CHUNK 9
dt.1$cptable  # returns the cross-validation results of cp
##            CP nsplit  rel error    xerror       xstd
## 1 0.803571429      0 1.00000000 1.0000000 0.06216670
## 2 0.077380952      1 0.19642857 0.2261905 0.03520741
## 3 0.017857143      2 0.11904762 0.1488095 0.02897483
## 4 0.011904762      3 0.10119048 0.1309524 0.02727046
## 5 0.008928571      4 0.08928571 0.1130952 0.02542604
## 6 0.005952381      6 0.07142857 0.1130952 0.02542604
## 7 0.000500000     11 0.04166667 0.1011905 0.02410285
# CHUNK 10
plotcp(dt.1)  # plots the cp results

# CHUNK 11
dt.2 <- prune(dt.1, cp = dt.1$cptable[5, "CP"])  # use selected CP for pruning

rpart.plot(dt.2)

Comparing the Predictive Performance of the Two Trees

We’ll now compare the predictive performance of Tree 1 (the complex tree) versus Tree 2 (the pruned tree) on the TEST dataset.

  • CHUNK 12 uses the predict() function to generate predicted values.
    • We’re passing through the TEST data and specifying the type as “class” which will return the predicted class labels
    • Alternatively if the type was “prob” then a vector of predicted probabilities belonging to that class would be returned instead
  • CHUNK 12 also produces the confusion matrices using the predictions of the two models
    • Note that despite Tree 1 being much more complex than Tree 2, they have the same accuracy rate (0.9706)
    • The sensitivity (TPR) and specificity (TNR) are also close to 1 and identical between the two trees
    • Based on accuracy and sensitivity/specificity, we'd go with the pruned model since it returns the same performance as the more complex model and we prefer something simpler and more interpretable
  • Alternatively, we can compare the two trees in terms of AUC (CHUNK 13)
    • The roc() function will only accept the probabilities of the class = 1 and the actual responses from the test set
    • The AUC for the more complex model is slightly higher but still very similar
    • Given that tree 2 is simpler and more interpretable while returning the same level of predictive accuracy, we’d choose tree 2
# CHUNK 12
pred.1.class <- predict(dt.1, newdata = test, type = "class")
pred.2.class <- predict(dt.2, newdata = test, type = "class")

# The positive class should be "1"
confusionMatrix(pred.1.class, test$class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129   2
##          1   4  69
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9371, 0.9891)
##     No Information Rate : 0.652           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9356          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9699          
##          Pos Pred Value : 0.9452          
##          Neg Pred Value : 0.9847          
##              Prevalence : 0.3480          
##          Detection Rate : 0.3382          
##    Detection Prevalence : 0.3578          
##       Balanced Accuracy : 0.9709          
##                                           
##        'Positive' Class : 1               
## 
confusionMatrix(pred.2.class, test$class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129   2
##          1   4  69
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9371, 0.9891)
##     No Information Rate : 0.652           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9356          
##                                           
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9718          
##             Specificity : 0.9699          
##          Pos Pred Value : 0.9452          
##          Neg Pred Value : 0.9847          
##              Prevalence : 0.3480          
##          Detection Rate : 0.3382          
##    Detection Prevalence : 0.3578          
##       Balanced Accuracy : 0.9709          
##                                           
##        'Positive' Class : 1               
## 
# CHUNK 13
library(pROC)

# Extract the predicted probabilities for the second class
pred.1 <- predict(dt.1, newdata = test)[, 2]  # [, 2] selects the predicted prob for class = 1
pred.2 <- predict(dt.2, newdata = test)[, 2]

roc(test$class, pred.1, auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = test$class, predictor = pred.1, auc = TRUE)
## 
## Data: pred.1 in 133 controls (test$class 0) < 71 cases (test$class 1).
## Area under the curve: 0.9682
roc(test$class, pred.2, auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = test$class, predictor = pred.2, auc = TRUE)
## 
## Data: pred.2 in 133 controls (test$class 0) < 71 cases (test$class 1).
## Area under the curve: 0.9602

5.3.3 Construction and Evaluation of Ensemble Trees

In this subsection, we'll attempt to construct two ensemble classification trees:

  • A random forest (form of bagging)
  • A boosted tree

Also note that to fit these ensemble trees, we'll turn to the caret package instead of rpart for building and fitting our models

Ensemble 1: Random Forest

Constructing the Random Forest

In CHUNK 14 note the following:

  • The trainControl() function is used to specify how our model should perform its cross-validation
    • This function allows us to control how cross-validation is done to tune our parameters
    • CHUNK 14 specifies 5-fold cross-validation repeated 3 times, with undersampling applied
    • Note that undersampling is used with unbalanced data (where the proportions of the target classes differ); it undersamples the negative (benign) class while retaining all positive/malignant observations to produce a roughly balanced sample
  • Next, we need a grid (rf.grid) of all the possible hyperparameters that we wish to tune in cross validation
    • The grid should be a data frame whose variables are named exactly the same way as the tuning parameters are specified in the predictive model
    • For a random forest, the mtry (number of random features to consider at each split in each different tree) parameter is the most important
    • In our code we’ll try out a sequence of 1 through 9
  • train() will run our random forest model
    • We specify our control and rf.grid objects into trControl and tuneGrid respectively
    • The method argument specifies that we want to construct a random forest ("rf")
    • We specify 200 trees to be grown (default is 500); the more the better (if the data set is large, then maybe a smaller number would cut down on computation time)
    • The importance scores will be used in our Variable importance plots later
  • Calling the random forest object won't return how splits are defined (because, remember, hundreds of trees are generated)
    • Instead, some basic info about the random forest (such as the number of obs and predictors) and the cross-validation and resampling procedures is summarized
    • It also states the optimal mtry value (= 1), meaning only 1 randomly sampled predictor is considered (and hence used) at every split of every tree grown
# CHUNK 14
set.seed(1)  # because cross-validation is done
control <- trainControl(method = "repeatedcv",
                        number = 5,  # number of folds
                        repeats = 3,  # number of repetitions
                        sampling = "down")  # undersamples; "up" would oversample

rf.grid <- expand.grid(mtry = 1:9)
rf.grid
##   mtry
## 1    1
## 2    2
## 3    3
## 4    4
## 5    5
## 6    6
## 7    7
## 8    8
## 9    9
rf <- train(class ~ .,  # all variables
            data = train,
            method = "rf",  # random forest
            ntree = 200,  # number of trees, the larger the better
            importance = TRUE,  # return the "importance"
            trControl = control,
            tuneGrid = rf.grid)
rf
## Random Forest 
## 
## 479 samples
##   9 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times) 
## Summary of sample sizes: 383, 383, 384, 382, 384, 384, ... 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   1     0.9749264  0.9455864
##   2     0.9728355  0.9409089
##   3     0.9721339  0.9394974
##   4     0.9679744  0.9301140
##   5     0.9735155  0.9425218
##   6     0.9644874  0.9224921
##   7     0.9672728  0.9288399
##   8     0.9672725  0.9289430
##   9     0.9638074  0.9211887
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 1.

Evaluating the Random Forest

CHUNK 15 generates our fitted random forest model to generate predictions on the TEST data set. We generate both a confusion matrix and an ROC curve/AUC to evaluate its performance against our previous models.

  • The Confusion matrix indicates an accuracy of 98% and a sensitivity/specificity of 1 and 0.9699 respectively
    • These are improvements over our base decision tree models, although not huge ones given that the base models were already good
    • The sensitivity of 1 is, however, quite remarkable and useful in this context (we want to make sure those with malignant tumors are identified as malignant)
  • The AUC is also higher

  • Although we lose the interpretability of our model, we gain predictive power

# CHUNK 15
pred.rf.class <- predict(rf, newdata = test, type = "raw")  # by default will do "raw" and return  predicted classes
confusionMatrix(pred.rf.class, test$class, positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 129   0
##          1   4  71
##                                           
##                Accuracy : 0.9804          
##                  95% CI : (0.9506, 0.9946)
##     No Information Rate : 0.652           
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9574          
##                                           
##  Mcnemar's Test P-Value : 0.1336          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9699          
##          Pos Pred Value : 0.9467          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.3480          
##          Detection Rate : 0.3480          
##    Detection Prevalence : 0.3676          
##       Balanced Accuracy : 0.9850          
##                                           
##        'Positive' Class : 1               
## 
# Add the type = "prob" option to return predicted probabilities
pred.rf.prob <- predict(rf, newdata = test, type = "prob")[, 2]

roc(test$class, pred.rf.prob, auc = TRUE)
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = test$class, predictor = pred.rf.prob,     auc = TRUE)
## 
## Data: pred.rf.prob in 133 controls (test$class 0) < 71 cases (test$class 1).
## Area under the curve: 0.9963

Variable Importance Plots

Note that although random forests can't provide a series of easy-to-understand classification rules like a single decision tree, they can be summarized in ways that answer some useful questions – such as variable importance plots (CHUNK 16)

  • Variable importance plots rank predictors according to their contribution to the model
  • The importance score for a given predictor is the total drop in node impurity due to that predictor averaged over all trees in the random forest
  • Variables higher up on the list lead to larger improvements in the splitting criterion when used in a split
  • Our variable importance plot indicates that bNuclei, size and thickness are among the most important predictors
    • Note the difference between the importance ranking here and the pruned tree (Tree 2), where the order of importance differs (size, bNuclei, nNucleoli and then thickness in the pruned tree)
    • The ranking from the random forest should be more reliable since it's based on the combination of multiple trees rather than just one (which can be unstable)
# CHUNK 16
imp <- varImp(rf)
plot(imp, main = "Variable Importance of Classification Random Forest")

Ensemble 2: Boosted Model

We will use the train() and trainControl() functions again to produce a boosted model. Remember that compared to a random forest, a boosted model has many more parameters to tune and requires more coding.

Constructing the Boosted Model

  • In CHUNK 17, we have to rename our response (class) levels in order for the xgboost (extreme gradient boosting) function to run properly

  • Similar to the random forest, we set up a grid of the hyperparameters we'll tune (xgb.grid). Note that all SEVEN of these parameters need to be specified.
    • max_depth, min_child_weight, gamma: all determine the complexity of the underlying trees – at least one should be tuned (multiple values)
    • colsample_bytree, subsample: relate to the proportion of features and obs used in each individual tree – at least the proportion of features should be tuned (colsample_bytree)
    • nrounds: Max number of rounds/iterations allowed in the fitting process (1000 is usually high enough, the algorithm will stop automatically)
    • eta: the learning rate parameter in (0, 1) that scales the contribution of each tree – the higher the learning rate, the fewer iterations needed but the more likely the model is to overfit (typically set between 0.01 and 0.2)
  • Again similar to random forests, we use trainControl to specify our other non-tuned control parameters/cross-validation specifications
    • method set to "cv" for cross-validation (not "repeatedcv" this time)
    • number set to 5 (for 5-fold)
    • sampling set to "down" again for undersampling
  • The boosted model object (xgb.tuned) doesn't return an interpretable output (similar to random forests) since multiple trees were fit. Instead, a summary of the boosting method and the specified cross-validation and sampling procedures is returned
    • It also states the optimal hyperparameters found by the tuning process (max_depth = 7, eta = 0.01, colsample_bytree = 0.6)
# CHUNK 17
# Rename the two levels of class
breast$class <- ifelse(breast$class == "0", "B", "M")
breast$class <- as.factor(breast$class)

# Reproduce the training and test sets with the "transformed" target
train <- breast[partition, ]
test <- breast[-partition, ]

# CHUNK 18
xgb.grid <- expand.grid(max_depth = c(1, 3, 7),
                        min_child_weight = 1,
                        gamma = 0,
                        nrounds = 1000,
                        eta = c(0.01, 0.05, 0.1),
                        colsample_bytree = c(0.6, 0.9),
                        subsample = 0.6)
xgb.grid
##    max_depth min_child_weight gamma nrounds  eta colsample_bytree
## 1          1                1     0    1000 0.01              0.6
## 2          3                1     0    1000 0.01              0.6
## 3          7                1     0    1000 0.01              0.6
## 4          1                1     0    1000 0.05              0.6
## 5          3                1     0    1000 0.05              0.6
## 6          7                1     0    1000 0.05              0.6
## 7          1                1     0    1000 0.10              0.6
## 8          3                1     0    1000 0.10              0.6
## 9          7                1     0    1000 0.10              0.6
## 10         1                1     0    1000 0.01              0.9
## 11         3                1     0    1000 0.01              0.9
## 12         7                1     0    1000 0.01              0.9
## 13         1                1     0    1000 0.05              0.9
## 14         3                1     0    1000 0.05              0.9
## 15         7                1     0    1000 0.05              0.9
## 16         1                1     0    1000 0.10              0.9
## 17         3                1     0    1000 0.10              0.9
## 18         7                1     0    1000 0.10              0.9
##    subsample
## 1        0.6
## 2        0.6
## 3        0.6
## 4        0.6
## 5        0.6
## 6        0.6
## 7        0.6
## 8        0.6
## 9        0.6
## 10       0.6
## 11       0.6
## 12       0.6
## 13       0.6
## 14       0.6
## 15       0.6
## 16       0.6
## 17       0.6
## 18       0.6
control <- trainControl(method = "cv",
                        number = 5,
                        sampling = "down")

xgb.tuned <- train(class ~ .,
                   data = train,
                   method = "xgbTree",
                   trControl = control,
                   tuneGrid = xgb.grid)
xgb.tuned
## eXtreme Gradient Boosting 
## 
## 479 samples
##   9 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 383, 383, 383, 383, 384 
## Addtional sampling using down-sampling
## 
## Resampling results across tuning parameters:
## 
##   eta   max_depth  colsample_bytree  Accuracy   Kappa    
##   0.01  1          0.6               0.9707895  0.9361896
##   0.01  1          0.9               0.9686842  0.9318446
##   0.01  3          0.6               0.9707675  0.9362334
##   0.01  3          0.9               0.9686842  0.9318446
##   0.01  7          0.6               0.9728509  0.9406236
##   0.01  7          0.9               0.9707675  0.9363520
##   0.05  1          0.6               0.9603509  0.9139218
##   0.05  1          0.9               0.9666009  0.9268119
##   0.05  3          0.6               0.9707675  0.9358995
##   0.05  3          0.9               0.9686842  0.9310448
##   0.05  7          0.6               0.9645175  0.9225921
##   0.05  7          0.9               0.9728509  0.9406611
##   0.10  1          0.6               0.9603509  0.9128839
##   0.10  1          0.9               0.9582675  0.9087453
##   0.10  3          0.6               0.9707675  0.9364837
##   0.10  3          0.9               0.9645175  0.9229567
##   0.10  7          0.6               0.9624342  0.9179970
##   0.10  7          0.9               0.9707675  0.9366766
## 
## Tuning parameter 'nrounds' was held constant at a value of 1000
## 
## Tuning parameter 'min_child_weight' was held constant at a value of
##  1
## Tuning parameter 'subsample' was held constant at a value of 0.6
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 1000, max_depth =
##  7, eta = 0.01, gamma = 0, colsample_bytree = 0.6, min_child_weight =
##  1 and subsample = 0.6.

Evaluating the Boosted Model

We now evaluate the boosted model by generating predictions on the TEST dataset and then producing a confusion matrix and ROC curve/AUC (CHUNK 19)

  • The confusion matrix indicates an accuracy of 0.9706 with a sensitivity of 1 and a specificity of 0.9549
    • Again, the sensitivity of 1 is remarkable and useful in this context
    • However, the accuracy is the same as that of the two base decision trees
  • The AUC is 0.9948 – higher than both base trees, though slightly below the random forest's 0.9963 – so the boosted model is competitive in terms of prediction performance, although all models perform well
    • We may see the increase in predictive performance as secondary to model interpretability, so we may still opt for the pruned base tree so that a non-technical audience can better interpret our model, given that there's no significant increase in predictive performance
# CHUNK 19
pred.xgb.class <- predict(xgb.tuned, newdata = test)
confusionMatrix(pred.xgb.class, test$class, positive = "M")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   B   M
##          B 127   0
##          M   6  71
##                                           
##                Accuracy : 0.9706          
##                  95% CI : (0.9371, 0.9891)
##     No Information Rate : 0.652           
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9364          
##                                           
##  Mcnemar's Test P-Value : 0.04123         
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9549          
##          Pos Pred Value : 0.9221          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.3480          
##          Detection Rate : 0.3480          
##    Detection Prevalence : 0.3775          
##       Balanced Accuracy : 0.9774          
##                                           
##        'Positive' Class : M               
## 
pred.xgb.prob <- predict(xgb.tuned, newdata = test, type = "prob")[, 2]
roc(as.numeric(test$class), pred.xgb.prob, auc = TRUE)
## Setting levels: control = 1, case = 2
## Setting direction: controls < cases
## 
## Call:
## roc.default(response = as.numeric(test$class), predictor = pred.xgb.prob,     auc = TRUE)
## 
## Data: pred.xgb.prob in 133 controls (as.numeric(test$class) 1) < 71 cases (as.numeric(test$class) 2).
## Area under the curve: 0.9948

Ensemble Methods vs Base Trees

  • In this case study we note that neither the random forest nor the boosted model is able to provide a transparent model that clearly illustrates how the predictions are formed, although both improve on predictive performance relative to the two base tree models
  • Since the improvement in predictive performance from the ensemble models is incremental at best compared to the base decision trees, the loss of interpretability may be a deal-breaker
    • Thus we will select Tree 2 (the pruned model) as our final model

6 Principal Components and Cluster Analyses

6.1 Principal Component Analysis (PCA)

6.1.1 Theoretical Foundation

Motivation

When we encounter a dataset with many dimensions (i.e. a large number of variables), it may be in our interest to try to unravel the relationships between these variables in a way that produces useful features for constructing our predictive model.

Principal components analysis (PCA) is a technique that transforms a large number of possibly correlated variables into a smaller, more manageable set of representative variables that capture most of the information (in terms of variability) in the original dataset.

Principal components are simply linear combinations of the existing variables; collectively they simplify the dataset by reducing its dimension, making it more amenable to data exploration and visualization and particularly useful for feature generation.

What are PCs?

Few terms:

  • \(n\) is the number of observations, and \(i = 1, ..., n\)
  • \(p\) is the number of features, and \(j = 1, ..., p\)
  • The target variable does not play a role

For a large \(p\), the data matrix (\(n \times p\)) can be given as:

PCA Illustration

Additionally, in PCA, we typically center the observations of the features to have mean zero

  • This means that for any feature \(j\), the observations of that feature (\(x_{1j}, ... , x_{nj}\)) should have mean zero; if not, we can subtract the column mean from each observation and use the centered values as the feature values (see the short sketch below)
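A small sketch of centering in R: scale() with scale = FALSE subtracts only the column means (a few mtcars columns are used purely for illustration):

X <- as.matrix(mtcars[, c("wt", "hp", "disp")])
X.centered <- scale(X, center = TRUE, scale = FALSE)  # subtract each column's mean
round(colMeans(X.centered), 10)                       # every column mean is now (numerically) zero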

Mathematically, PCs are composite variables – normalized linear combinations of the original set of features

  • So for some \(M\), where \(M\) represents the number of PCs to use (\(M \le p\)), the \(m\)th PC can be defined as:

\[ Z_m = \phi_{1m} X_1 + ... + \phi_{pm} X_p = \sum_{j=1}^p \phi_{jm} X_j \]

The \(\phi\)’s are referred to as “loadings” of the features for the \(m\)th PC (\(Z_m\))

  • Again, remember that PCs are constructed from FEATURES (indexed by \(j\)), not observations (indexed by \(i\))

  • See the figure below that illustrates how PCA achieves dimension reduction in our dataset:

PCA Illustration

More importantly, given a data matrix \(\boldsymbol{X}\), the \(m\)th PC (\(Z_m\)) for given observation (\(i\)) can be calculated as:

\[ z_{im} = \phi_{1m} x_{i1} + \phi_{2m} x_{i2} + ... + \phi_{pm} x_{ip} = \sum_{j=1}^p \phi_{jm} x_{ij} \]

The vector \(\boldsymbol{z_m}\) is called the vector of scores of the \(m\)th PC; it contains the scores across all observations for that PC

  • Can be rewritten as: \(\boldsymbol{z_m} = \boldsymbol{X} \boldsymbol{\phi_m}\)
  • This represents the scores of the \(m\)th PC across all observations as the data matrix times the loading vectors for the \(m\)th PC
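
Stacking the \(M\) loading vectors as columns of a loading matrix \(\boldsymbol{\Phi} = (\boldsymbol{\phi_1}, ..., \boldsymbol{\phi_M})\) gives the equivalent matrix form (with dimensions shown underneath):

\[ \underset{n \times M}{\boldsymbol{Z}} = \underset{n \times p}{\boldsymbol{X}} \, \underset{p \times M}{\boldsymbol{\Phi}} \]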

How are PCs Calculated?

The objective in forming the first PC (\(Z_1\)) is that it captures as much information (wrt variability) from the dataset as possible

  • The PC loadings for the first PC (that is, \(\boldsymbol{\phi_1}\)) are determined such that they maximize the sample variance of \(Z_1\), subject to the constraint that the sum of the squared loadings for the first PC equals 1 (i.e. \(\sum_{j=1}^p \phi_{j1}^2 = 1\))
  • The constraint is important so that the loadings aren’t inflated to increase variability arbitrarily
  • Again, PC loadings are determined such that they maximize variance

Geometrically, the loadings represent a line in the p-dimensional feature space

  • The first PC is the line along which the data vary the most, in the sense that the projections of the \(n\) observations onto this line (represented by the PC scores) are as spread out as possible among all possible lines
  • The illustration below shows a simple case with \(p = 2\), where only two features are available and \(n = 5\) observations are in the data
Geo Illustration of PC1 on a 2-dim feature space

  • Given the first PC, subsequent PCs are defined in the same way (i.e. maximal-variance linear combinations) but with an additional constraint: each must be orthogonal to (uncorrelated with) all of the PCs before it
    • The orthogonality constraint ensures that subsequent PCs are uncorrelated with the previous PCs and that they’re all measuring different aspects of the variables in the dataset
    • The figure below adds a second PC to the previous example with 2 features and 5 observations:
Geo Illustration of PC1 and PC2 on a 2-dim feature space

Proportion of Variance Explained (PVE)

To reiterate, PCA is a way to condense the available feature space (\(1, ..., p\)) to something more manageable (\(1, ..., M\)) while trying to retain most of the information in the data, as measured by variance.

To quantify how much variance each of the \(M\) PCs explains, we use the PVE – this is also a way of comparing PCs against each other. A few terms:

  • Total Variance is the sum of all sample variances over all present variables
  • Variance Explained by the \(m\)th PC is the sample second moment (i.e. the sum of squared PC scores for all observations for the \(m\)th PC divided by the number of observations)

\[ \frac{1}{n} \sum_{i = 1}^n z_{im}^2 = \frac{1}{n} \sum_{i = 1}^n (\sum_{j = 1}^p \phi_{jm} x_{ij})^2 \]

  • Proportion of Variance Explained (PVE) by the \(m\)th PC is:

\[ \text{PVE}_m = \frac{\text{variance explained by mth PC}}{\text{total variance}} \]

  • The PVEs must satisfy two properties:
    • Each \(\text{PVE}_m\) is between 0 and 1
    • By definition of PCs, the \(\text{PVE}_m\)’s are monotonically decreasing in \(m\) (i.e. \(\text{PVE}_1 \ge \text{PVE}_2 \ge \text{PVE}_3 ... \ge \text{PVE}_M\)); this follows from the construction, since each PC maximizes the variance explained subject to being orthogonal to all earlier PCs
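
For centered features, the two quantities above can be written out explicitly, and the PVEs across all PCs sum to one:

\[ \text{total variance} = \sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2, \qquad \text{PVE}_m = \frac{\frac{1}{n} \sum_{i=1}^n z_{im}^2}{\sum_{j=1}^p \frac{1}{n} \sum_{i=1}^n x_{ij}^2}, \qquad \sum_{m} \text{PVE}_m = 1 \]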

Choosing the number of PCs to Use

Since PCA can return up to \(p\) PCs, the next question is – how many PCs (\(M\)) should be used? We can plot the PVE of each PC on a scree plot

  • The y-axis represents the PVE
  • The x-axis shows the index \(m\) of each PC
  • By definition, it will be a decreasing curve as we move along the x-axis as subsequent PCs have lower PVEs
  • The number of PCs required is typically chosen by eyeballing for the point at which PVE drops off to a sufficiently low level – this is known as the elbow point
    • The figure below is an example of a scree plot – note that the elbow could be identified at either PC2 or PC3
Example of a scree plot

Feature Generation Using PCA

PCA in the context of this exam will be used as a method to create useful features for predicting a given target variable. The idea, after all, was to take a large set of features and map them onto a smaller, more manageable set of PCs that still retains most of the information (in terms of variability) in the dataset.

A few notes:

  • By construction, the PCs are mutually uncorrelated, so multicollinearity is not an issue
  • Reducing the dimension of the data in turn can reduce the complexity of the model, helping improve the bias-variance tradeoff with respect to predictive accuracy
  • If a PC is added to a predictive model, it is IMPORTANT to delete the underlying variables from the model to avoid rank-deficiency issues

Additional Remarks: Centering and Scaling

We know that we center our observations so they have a zero mean (with respect to a given feature); however, we should also scale our observations before performing PCA if the variables operate on vastly different scales

  • For example, if we have two features, (1) age and (2) income, the variances of the two features would be vastly different (with income’s being much larger)
    • If we leave the different scales unchecked before PCA, then income would dominate the PC and receive a large PC loading (since income has a higher variance and the objective is to maximize variance), even though it may not be the variable that best explains the underlying structure in the data
    • Failing to scale our variables can substantially affect the results of our PCA

Additional Remarks: PCA and Categorical Features

  • prcomp() only accepts numeric variables
  • We therefore need to explicitly binarize categorical features beforehand, e.g. with dummyVars(), as sketched below
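
A minimal sketch of this workflow, using a small made-up data frame dat (all names and values here are hypothetical, for illustration only):

# Hypothetical sketch: binarize a factor with dummyVars() before running prcomp()
library(caret)

dat <- data.frame(x1 = rnorm(10),
                  x2 = rnorm(10),
                  grp = factor(rep(c("A", "B", "C"), length.out = 10)))

binarizer <- dummyVars(~ ., data = dat, fullRank = FALSE)     # dummy columns for grp
dat.bin <- as.data.frame(predict(binarizer, newdata = dat))   # all-numeric data frame

pca.bin <- prcomp(dat.bin, center = TRUE, scale. = TRUE)      # PCA now runs without error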

6.1.2 Case Study: PCA-Feature Generation

In this case study, we’ll look at crime by state data to:

  • Perform PCA using prcomp()
  • Interpret the output of a PCA and extract useful components from a prcomp object
  • Realize the importance of scaling
  • Use the PCA output to generate useful features for our predictive model

Data Description

  • USArrests contains 50 observations (for each US state) in 1973 with four variables:
    • Murder: Number of arrests per 100k residents
    • Assault: same as above for assault
    • Rape: same as above for rape
    • UrbanPop: % of population in urban areas
  • CHUNK 1 loads the data and generates some summary stats – additionally CHUNK 2 will use apply() to calculate the mean and sd() of each variable
    • The summary stats show that all four variables seem to be on different scales (so we should scale, as well as center all of them before our PCA)
    • Note that Assault has by far the largest variability (sd) of the three crime variables; the other variables also show quite different levels of variability
# CHUNK 1
# Clean the working memory
rm(list = ls())

data(USArrests)
summary(USArrests)
##      Murder          Assault         UrbanPop          Rape      
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   : 7.30  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:15.07  
##  Median : 7.250   Median :159.0   Median :66.00   Median :20.10  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :21.23  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:26.18  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :46.00
apply(USArrests, 2, mean)
##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232
apply(USArrests, 2, sd)
##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

Data Visualization

  • CHUNK 3 generates histograms of each of the four variables
    • Nothing controversial, except that Rape looks a little skewed
  • CHUNK 4 logs the Rape variable and removes the original variable from our dataset
    • The summary stats show that logging the Rape variable brings the mean and median close to each other
# CHUNK 3
library(ggplot2)

# names(USArrests) extracts the column names of the USArrests data
for (i in names(USArrests)) {
  plot <- ggplot(USArrests, aes(x = USArrests[, i])) +
    geom_histogram() +
    xlab(i)
  print(plot)
}

# CHUNK 4
USArrests$logRape <- log(USArrests$Rape)
USArrests$Rape <- NULL
summary(USArrests)
##      Murder          Assault         UrbanPop        logRape     
##  Min.   : 0.800   Min.   : 45.0   Min.   :32.00   Min.   :1.988  
##  1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50   1st Qu.:2.713  
##  Median : 7.250   Median :159.0   Median :66.00   Median :3.001  
##  Mean   : 7.788   Mean   :170.8   Mean   :65.54   Mean   :2.959  
##  3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75   3rd Qu.:3.265  
##  Max.   :17.400   Max.   :337.0   Max.   :91.00   Max.   :3.829

Implementing/Interpreting a PCA

  • Use prcomp() to perform a PCA – CHUNK 5 runs this function on our USArrests data
  • The function requires numeric input (which is why categorical features have to be explicitly binarized beforehand; not an issue here since all four variables are numeric)
  • Note that in the prcomp() function we’re setting scale = TRUE given the EDA we performed earlier

  • The summary output returns each PC and their PVE along with the cumulative PVE
    • In our example PC1 explains 63.5% of the variability in the data, while PC2 explains the next 24.65%
    • The first two PCs combined explain most of the variability in our data
  • The returned prcomp() object is a list – the most useful components being:
    • rotation: the loading matrix \(\boldsymbol\Phi\), containing each PC’s loadings on each feature
    • x: the PC score matrix \(\boldsymbol{Z} = \boldsymbol{X} \boldsymbol{\Phi}\) for all observations, where \(\boldsymbol{X}\) here is the centered/scaled version of the data matrix
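
As a quick sketch (not one of the numbered chunks), the PVEs reported by summary() can be recovered directly from the sdev component of a prcomp object – here referencing the PCA object fitted in CHUNK 5 below – and plotted as a scree plot:

# Sketch: recover the PVEs from a fitted prcomp object (PCA from CHUNK 5 below)
pve <- PCA$sdev^2 / sum(PCA$sdev^2)  # variance of each PC divided by the total variance
pve                                   # should match the "Proportion of Variance" row
cumsum(pve)                           # should match the "Cumulative Proportion" row

# A simple scree plot of the PVEs
plot(pve, type = "b", xlab = "Principal Component", ylab = "PVE", main = "Scree plot")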

Analyzing the PC Loadings

PCA becomes more meaningful when we take a deeper look into the PC loadings, paying attention to their relative signs and magnitudes, which can help illuminate the original features’ contributions to each PC and reveal potentially interesting relationships among the variables.

Taking a look at PC1, we see that:

  • Roughly the same weight is given to the 3 crime variables (Murder, Assault and logRape)
  • All 3 variables are also given the same sign
  • The similar treatment of these 3 variables indicates that the first PC is some measure of the overall crime rate
  • In contrast, PC2 places a heavy weight on UrbanPop and relatively smaller weights on the other three variables, indicating some measure of urbanization
# CHUNK 5
PCA <- prcomp(USArrests, center = TRUE, scale. = TRUE)

summary(PCA)
## Importance of components:
##                           PC1    PC2     PC3     PC4
## Standard deviation     1.5940 0.9930 0.54644 0.41769
## Proportion of Variance 0.6352 0.2465 0.07465 0.04362
## Cumulative Proportion  0.6352 0.8817 0.95638 1.00000
# CHUNK 6
PCA$rotation
##                 PC1        PC2        PC3        PC4
## Murder   -0.5377481  0.4076787 -0.2260763 -0.7025059
## Assault  -0.5725819  0.1859535 -0.4175757  0.6805893
## UrbanPop -0.2766636 -0.8817037 -0.3293236 -0.1939118
## logRape  -0.5535650 -0.1477092  0.8161287  0.0753777
PCA$x
##                        PC1         PC2         PC3          PC4
## Alabama        -1.08857173  1.08040146 -0.26485606 -0.223275549
## Alaska         -1.59517817  1.20815437  1.33114413  0.770869203
## Arizona        -1.74286818 -0.73170358 -0.10577896  0.841565601
## Arkansas        0.02590682  1.08051464  0.22523823  0.203985489
## California     -2.27067947 -1.44575752  0.17412539  0.446993002
## Colorado       -1.33306093 -0.90185347  0.80129408  0.202583949
## Connecticut     1.42803044 -1.07347295 -0.71924115 -0.017836594
## Delaware       -0.10887464 -0.35521443 -0.74485862  0.733947673
## Florida        -2.96086941  0.03372143 -0.63849855  0.003739992
## Georgia        -1.71387288  1.23179720 -0.04883230 -1.098938437
## Hawaii          0.77349867 -1.59004854  0.44126511 -0.843701602
## Idaho           1.58397955  0.20390459  0.23460177  0.525892530
## Illinois       -1.46182554 -0.71600668 -0.52965170  0.020262855
## Indiana         0.37510248 -0.17895295  0.48652752 -0.355380434
## Iowa            2.29527344 -0.08449260  0.09567755 -0.010508842
## Kansas          0.67903028 -0.29739272  0.23794778 -0.184578859
## Kentucky        0.65242081  0.92071803  0.21550950 -0.659335002
## Louisiana      -1.65881774  0.81297569 -0.54304971 -0.571442615
## Maine           2.69037404  0.45290210 -0.56662933  0.244736056
## Maryland       -1.79738288  0.40865125 -0.20272321  0.530434707
## Massachusetts   0.40120822 -1.49624706 -0.46061752  0.080089078
## Michigan       -2.00608208 -0.11940052  0.24260927 -0.021023047
## Minnesota       1.61318004 -0.64050835  0.28371243 -0.034976506
## Mississippi    -1.07406270  2.32612244 -0.59983479 -0.351446592
## Missouri       -0.75002114 -0.26625605  0.48547695 -0.132735565
## Montana         1.08263952  0.51148535  0.39584336 -0.074934548
## Nebraska        1.16116223 -0.21345845  0.32536046  0.022538197
## Nevada         -2.46253447 -0.63142774  0.58107443 -0.110366833
## New Hampshire   2.53217873  0.02594468 -0.19441144 -0.001727117
## New Jersey     -0.28894798 -1.48337457 -0.50002622 -0.351929111
## New Mexico     -1.93999527  0.15484342  0.05844813  0.375579694
## New York       -1.74258521 -0.84945712 -0.50796826 -0.078021024
## North Carolina -1.17261108  2.16876574 -0.96121883  0.762118048
## North Dakota    3.32682130  0.69445403 -0.26900260  0.226819396
## Ohio            0.10043818 -0.76926601  0.25278616 -0.445164553
## Oklahoma        0.19045838 -0.31713206  0.17099230  0.003407881
## Oregon         -0.10276031 -0.52216855  0.93078036  0.419958030
## Pennsylvania    0.82042771 -0.59315984 -0.21001998 -0.418338494
## Rhode Island    1.14053009 -1.43552698 -1.79704434  0.306299281
## South Carolina -1.41385435  1.87836907 -0.20771937  0.078222445
## South Dakota    1.96845640  0.82246941  0.36018752  0.157951039
## Tennessee      -1.06927613  0.83463470  0.38249048 -0.588994500
## Texas          -1.43284496 -0.44488373 -0.23090006 -0.692415319
## Utah            0.42817534 -1.47971429  0.47405855  0.160435090
## Vermont         2.83895331  1.42338830  0.68851605  0.257590415
## Virginia       -0.02500370  0.16519932  0.22313739 -0.189497685
## Washington      0.12670289 -0.96661387  0.70939295  0.351774974
## West Virginia   2.27377958  1.45893004 -0.15313645 -0.162180405
## Wisconsin       2.14985617 -0.58718814 -0.19646165 -0.227634892
## Wyoming         0.55399633  0.29233150 -0.15571676  0.118589497
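
As a sanity check (a sketch, not part of the original chunks), the score matrix stored in PCA$x can be reproduced by multiplying the centered/scaled data matrix by the loading matrix, i.e. \(\boldsymbol{Z} = \boldsymbol{X} \boldsymbol{\Phi}\):

# Sketch: PCA$x should equal the centered/scaled data matrix times the loading matrix
Z.manual <- scale(USArrests) %*% PCA$rotation
all.equal(Z.manual, PCA$x)  # expected to be TRUE (up to numerical precision)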

Biplots

To get another low-dimensional view of the data, the scores of the first two PCs can be plotted against each other for all observations, creating a two-dimensional scatterplot; overlaying the feature loadings as arrows gives a biplot (so called because it displays both the PC scores and the PC loadings)

  • CHUNK 7 generates a biplot of the first two PCs, with all 50 states plotted and the red arrows indicating each feature’s loading
    • From the biplot we can see that states with a high index of crime (towards the upper left of the plot) and a high index of urbanization (towards the bottom) include states like California, Nevada and Colorado
    • The similar directions of the three crime variables indicate that they are positively correlated, whereas UrbanPop is less correlated with these three since its direction differs
    • CHUNK 8 shows what the PCA and biplot outputs would look like if the variables weren’t scaled – as expected, the Assault variable dominates due to its relatively high variance, not necessarily because of its ability to explain the data
# CHUNK 7
# cex argument indicates the amount by which plotting symbols should be scaled
# cex = 0.6 means 40% smaller
# scale = 0 ensures that the arrows are scaled to represent the loadings
biplot(PCA, scale = 0, cex = 0.6)

# CHUNK 8
PCA.unscaled <- prcomp(USArrests, scale. = FALSE)
summary(PCA.unscaled)
## Importance of components:
##                            PC1      PC2     PC3     PC4
## Standard deviation     83.4978 13.98288 2.52368 0.29143
## Proportion of Variance  0.9718  0.02725 0.00089 0.00001
## Cumulative Proportion   0.9718  0.99910 0.99999 1.00000
PCA.unscaled$rotation
##                   PC1          PC2         PC3          PC4
## Murder   -0.041810210  0.047906448 -0.99725886  0.037837046
## Assault  -0.998053720  0.044131837  0.04402688  0.001669238
## UrbanPop -0.046117192 -0.997841072 -0.04561010  0.010301396
## logRape  -0.003725906 -0.008399323 -0.03815909 -0.999229430
biplot(PCA.unscaled, scale = 0, cex = 0.6)

PCA Based Feature Generation

Given our PCA results, we’ll apply our findings to generate a potentially useful feature to be used in a predictive model.

We know that PC1 explains a large chunk of the data (63.52% PVE) and that it focuses on some combination of the three crime variables – combining the three seems like a judicious move. We’ll cover three potential ways to generate this new feature:

Method 1: Take the PC as is

  • Most straightforward – just add the PC1 scores as a new feature to the USArrests data set (CHUNK 9)
    • Call the new feature crime1
    • May not work well/be helpful if we have many more features, which muddies what exactly the new feature represents (i.e. it becomes less interpretable)
# CHUNK 9
USArrests.1 <- USArrests  # make a new copy of USArrests
USArrests.1$crime1 <- PCA$x[, 1]
head(USArrests.1)
##            Murder Assault UrbanPop  logRape      crime1
## Alabama      13.2     236       58 3.054001 -1.08857173
## Alaska       10.0     263       48 3.795489 -1.59517817
## Arizona       8.1     294       80 3.433987 -1.74286818
## Arkansas      8.8     190       50 2.970414  0.02590682
## California    9.0     276       91 3.703768 -2.27067947
## Colorado      7.9     204       78 3.655840 -1.33306093

Method 2: Pick and Choose Loadings

  • Since we believe PC1 represents some sort of index of the overall crime rate, we can make an explicit decision to drop the non-crime variable (UrbanPop) from the combination
  • A new feature will be manually created using the three crime variables’ loadings (CHUNK 10)
    • Note that we NEED to scale the data matrix since the PCs were generated on the scaled observations
  • Picking and choosing which variables to include and their respective loadings allows us to interpret the newly generated feature better
# CHUNK 10
USArrests.2 <- USArrests

# the scale() function will convert the USArrests data to a numeric matrix
# so we use the as.data.frame() function to change it back to a data frame
USArrests.scaled <- as.data.frame(scale(USArrests))

USArrests.2$crime2 <- PCA$rotation[1, 1] * USArrests.scaled$Murder +
  PCA$rotation[2, 1] * USArrests.scaled$Assault +
  PCA$rotation[4, 1] * USArrests.scaled$logRape

# OR
#USArrests$crime2 <- PCA$rotation[1, 1] * scale(USArrests$Murder) +
#  PCA$rotation[2, 1] * scale(USArrests$Assault) +
#  PCA$rotation[4, 1] * scale(USArrests$logRape)

head(USArrests.2)
##            Murder Assault UrbanPop  logRape     crime2
## Alabama      13.2     236       58 3.054001 -1.2326877
## Alaska       10.0     263       48 3.795489 -1.9304293
## Arizona       8.1     294       80 3.433987 -1.4664867
## Arkansas      8.8     190       50 2.970414 -0.2711172
## California    9.0     276       91 3.703768 -1.7840493
## Colorado      7.9     204       78 3.655840 -1.0949065

Method 3: Run a second PCA

  • An alternative (and arguably more accurate) method is to re-run the PCA on only the three crime variables (CHUNK 11)
  • Since the first PCA revealed a relationship between the three variables, we may be interested in running a second PCA limited to just these three variables to get the appropriate loadings to use in generating the new feature
    • By design, the new PC1 only involves these three crime variables, and its scores can be used as the new feature
# CHUNK 11
USArrests.3 <- USArrests

# Run a PCA on only the three crime-related variables
PCA.3 <- prcomp(USArrests.3[, c(1, 2, 4)], center = TRUE, scale. = TRUE)
PCA.3$rotation
##                PC1        PC2        PC3
## Murder  -0.5839540  0.4974194 -0.6415385
## Assault -0.5970524  0.2722855  0.7545787
## logRape -0.5500237 -0.8236714 -0.1379833
USArrests.3$crime3 <- PCA.3$x[, 1]
head(USArrests.3)
##            Murder Assault UrbanPop  logRape     crime3
## Alabama      13.2     236       58 3.054001 -1.3085142
## Alaska       10.0     263       48 3.795489 -1.9744320
## Arizona       8.1     294       80 3.433987 -1.5022655
## Arkansas      8.8     190       50 2.970414 -0.2874132
## California    9.0     276       91 3.703768 -1.8219787
## Colorado      7.9     204       78 3.655840 -1.1004001
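
As a quick check (a sketch, not one of the numbered chunks), the three engineered versions of the crime feature can be compared; since the sign of a PC is arbitrary, they should be strongly correlated up to sign:

# Sketch: compare the three versions of the engineered crime feature
crime.features <- data.frame(crime1 = USArrests.1$crime1,
                             crime2 = USArrests.2$crime2,
                             crime3 = USArrests.3$crime3)
cor(crime.features)  # expect correlations close to +/- 1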

Final Note: Delete the Original Variables

  • Delete/drop the original variables from the dataset/model which underlie the new feature you plan to add to your model
  • Failure to drop the variables will cause duplicate information/multicollinearity issues or a rank deficient model
# CHUNK 12
USArrests.2$Murder <- NULL
USArrests.2$Assault <- NULL
USArrests.2$logRape <- NULL

# OR
# USArrests.2[, c(1, 2, 4)] <- NULL

head(USArrests.2)
##            UrbanPop     crime2
## Alabama          58 -1.2326877
## Alaska           48 -1.9304293
## Arizona          80 -1.4664867
## Arkansas         50 -0.2711172
## California       91 -1.7840493
## Colorado         78 -1.0949065

6.2 Cluster Analysis

Cluster analyses take a different approach to data exploration – partitioning heterogeneous observations into a set of distinct, homogeneous groups, known as clusters, in which observations share similar characteristics. The goal of clustering is to uncover interesting subgroups in the dataset at hand.

Two types of clustering are covered on the exam:

  • \(k\)-means clustering
  • Hierarchical clustering

Similar to PCA, both forms of clustering are unsupervised learning techniques (no response variable to predict) but can be helpful by discovering clusters that may serve as useful features in a predictive model.

6.2.1 K-Means Clustering

The aim of \(k\)-means clustering is to assign each observation to one and only one of the \(k\) clusters. The number of clusters (\(k\)) is specified upfront, after which the algorithm automatically searches for the best configuration of \(k\) clusters.

How the algorithm works

The \(k\) clusters should be chosen such that the variation of the observations inside each cluster is as small as possible, while the variation between each cluster should be large.

  • I.e., observations within a cluster should be alike, but the clusters themselves should be distinct

The smaller the within-cluster variation, the better the discrimination works – the algorithm attempts to optimize this by searching for the best similarity grouping. The algorithm operates as follows:

  1. Initialization: Randomly select \(k\) points in the feature space. These \(k\) points serve as the initial cluster centers

  2. Iteration: First, assign each observation to the cluster with the closest center in terms of Euclidean distance; second, recalculate the center of each of the \(k\) clusters; then repeat these two steps until the cluster assignments no longer change (a minimal sketch of this loop is given after this list)
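
As referenced above, here is a minimal from-scratch sketch of the \(k\)-means loop. It is purely illustrative (in practice we use kmeans()); simple_kmeans and X are made-up names, and edge cases such as empty clusters are ignored:

# Illustrative sketch of the k-means algorithm (use kmeans() in practice)
simple_kmeans <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  # Initialization: randomly pick k observations as the initial cluster centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  assignments <- rep(0, nrow(X))
  for (iter in 1:max.iter) {
    # Step 1: assign each observation to the closest center (squared Euclidean distance)
    d <- sapply(1:k, function(m) rowSums(sweep(X, 2, centers[m, ])^2))
    new.assignments <- max.col(-d)  # column index of the smallest distance
    # Stop once the cluster assignments no longer change
    if (all(new.assignments == assignments)) break
    assignments <- new.assignments
    # Step 2: recalculate each center as the mean of the observations assigned to it
    centers <- t(sapply(1:k, function(m) colMeans(X[assignments == m, , drop = FALSE])))
  }
  list(cluster = assignments, centers = centers)
}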

Random Initial Assignments

With each iteration of the algorithm, the within-cluster variation should be reduced and at the completion, we’re guaranteed to arrive at a LOCAL (but not necessarily global) optimum.

  • This is because the algorithm relies on randomly chosen initial cluster centers
    • A different set of initial centers may give rise to a different final set of clusters and a different local optimum
  • Outliers can especially distort cluster assignments: if one of the initial centers is too close to an outlier, then only the outlier may be assigned to that center, forming its own cluster away from the rest of the data

  • Therefore, to increase the chance of identifying a global optimum and getting a better representative cluster grouping, it’s advisable to run the algorithm many times (20 - 50) with different initial cluster assignments
    • The best solution (assignments) with the lowest within-cluster variation is selected
    • The kmeans() function’s nstart argument specifies how many times to run the cluster algorithm to search for the best initial cluster centers

Importance of Standardization

  • Similar to PCA, it’s recommended to standardize features before running a cluster analysis for the same reason:
    • Large scale variables (like income) will dominate the distance calculation and exert a disproportionate impact on the cluster arrangements
    • Standardized variables make it easier for the clustering algorithm to attach equal weight to all features when performing distance calcs

Choosing the Number of Clusters - Elbow Method

Another conundrum is how do we know what the best value of \(k\) is?

  • The elbow method provides one solution, based on the fact that a good grouping is characterized by a small within-cluster variation but a large between-cluster variation
  • Using this idea, we can plot the ratio of between-cluster variation to total variation against the value of \(k\) and locate the point where further gains “drop off” and the curve plateaus (see the decomposition below)
    • Locating the “elbow” and its corresponding \(k\) is one potential choice for an optimal/appropriate \(k\)
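
The idea rests on the standard decomposition of the total variation, which is what the case study below tracks through the ratio of between-cluster SS to total SS (a ratio that rises towards 1 as \(k\) grows, with the elbow marking where further gains become small):

\[ \underbrace{\sum_{i=1}^n \lVert x_i - \bar{x} \rVert^2}_{\text{total SS}} = \underbrace{\sum_{c=1}^k \sum_{i \in C_c} \lVert x_i - \bar{x}_c \rVert^2}_{\text{within-cluster SS}} + \underbrace{\sum_{c=1}^k n_c \lVert \bar{x}_c - \bar{x} \rVert^2}_{\text{between-cluster SS}} \]

where \(\bar{x}\) is the overall mean, \(\bar{x}_c\) is the mean of cluster \(C_c\), and \(n_c\) is its size.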

6.2.2 Hierarchical Clustering

In contrast to \(k\)-means clustering, where \(k\) is required to be specified at the start, hierarchical clustering instead produces cluster groupings in a tree-based form (a dendrogram) that has a hierarchical structure.

Inter-cluster Dissimilarity

Hierarchical clustering consists of a series of fusions (or mergers): each individual observation starts as its own cluster, and the closest pair of clusters is successively fused, one pair at a time.

How is the distance/dissimilarity between two clusters measured?

  • For the initial steps, where the clusters are single observations, the answer is straightforward – Euclidean distance
  • However, it gets more complicated when we’re trying to measure the distance between two clusters (not sole observations)
    • This will rely on the “linkage” used
  • Three common linkage functions include:
    • Complete: The maximal pairwise distance between observations in two clusters (look at the furthest pair of obs)
    • Single: The minimal pairwise distance between observations in two clusters (look at the closest pair of obs)
    • Average: The average of all pairwise distances between observations in two clusters
    • Generally, average and complete are preferred over single since they result in more balanced clusters (evenly sized)

Dendrogram

Dendrograms offer a view of the produced hierarchy of clusters, showing not only which clusters are in close proximity to one another, but also (on the vertical axis) the thresholds of inter-cluster dissimilarity at which fusions first occur.

  • Clusters joined towards the bottom of the dendrogram are rather similar to one another
  • Clusters which fuse at the top are further apart
  • A useful byproduct of the dendrogram is that we can see the cluster compositions at once for every desired number of clusters, from 1 to \(n\)
  • The “height” in a dendrogram is the distance (inter-cluster dissimilarity) at which the two clusters at that point fuse

Once a dendrogram is constructed, clusters are determined by making a horizontal cut across the dendrogram.

  • The resulting clusters formed are the distinct branches immediately below the cut
  • The lower the cut, the more clusters are created
  • No cut means 1 cluster (the entire dataset); a cut at height 0 means \(n\) clusters
    • Therefore the number of clusters varies with the cut height
    • Similar to selecting a \(k\) in \(k\)-means clustering, but the number of clusters DOES NOT need to be pre-specified as \(k\) does (i.e. changing the number of clusters won’t change the composition of the dendrogram, whereas changing \(k\) means rerunning the whole algorithm for each \(k\)); see the cutree() sketch below
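
A one-line sketch of each way of cutting, assuming an hclust object such as hc.complete from the case study below:

# Sketch: cut a dendrogram either by the number of clusters or by height
cutree(hc.complete, k = 3)  # request exactly 3 clusters
cutree(hc.complete, h = 2)  # cut at height 2; the number of clusters follows from the cut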

K-Means vs Hierarchical Clustering

Is Randomization Needed?

  • \(k\)-Means: Yes (to set the initial cluster centers)
  • Hierarchical: No

Is the number of clusters pre-specified?

  • \(k\)-Means: Yes (we set \(k\) upfront – an elbow plot of the between-cluster to total variation ratio can suggest a reasonable \(k\))
  • Hierarchical: No (although we can control the number of clusters after the algorithm is complete by specifying a cut height or number of clusters on the dendrogram)

Are the clusters nested?

  • \(k\)-Means: No
  • Hierarchical: Yes, by definition

6.2.3 Simple Case Study

In this case study, we will use the iris dataset to perform \(k\)-means clustering and hierarchical clustering. Objectives as follows:

  • Run a \(k\)-means cluster analysis, using kmeans()
  • Run a hierarchical cluster analysis, using hclust()
  • Use the output of either analysis to produce features for a predictive model

Data Description

  • We will only use three variables from the iris data (Sepal.Length, Petal.Length and Species) (CHUNK 1)
  • The summary statistics show that the values of sepal and petal length aren’t too different, but their standard deviations/variability are – thus we decide to scale these two variables (CHUNK 2)
# CHUNK 1
# Clean the working memory
rm(list = ls())

data(iris)
iris <- iris[, c(1, 3, 5)]
summary(iris)
##   Sepal.Length    Petal.Length         Species  
##  Min.   :4.300   Min.   :1.000   setosa    :50  
##  1st Qu.:5.100   1st Qu.:1.600   versicolor:50  
##  Median :5.800   Median :4.350   virginica :50  
##  Mean   :5.843   Mean   :3.758                  
##  3rd Qu.:6.400   3rd Qu.:5.100                  
##  Max.   :7.900   Max.   :6.900
# CHUNK 2
apply(iris[, -3], 2, sd)
## Sepal.Length Petal.Length 
##    0.8280661    1.7652982
iris$Sepal.Length <- scale(iris$Sepal.Length)
iris$Petal.Length <- scale(iris$Petal.Length)

apply(iris[, -3], 2, sd)
## Sepal.Length Petal.Length 
##            1            1
# alternatively, sapply(iris[, -3], sd)

Applying a K-Means Cluster Analysis

  • First, we create a subset of the iris data to only include sepal and petal length
  • Use the kmeans() function – note that since the initial centers are selected at random, we should set a seed so our results are reproducible (CHUNK 3)
    • We specify \(k = 3\) with centers = 3 (arbitrary; we can revisit this later after looking at an elbow plot)
    • Specify 20 random sets of initial centers (through nstart), with only the best result reported (again, 20-50 starts is recommended)
  • The output indicates we partitioned 150 obs into 3 groups of sizes 43, 53 and 54
    • The Clustering Vector contains the labels of the group each obs is classified to
    • Note that the model object is a list with several elements – among the most used being: cluster (with the group labels for each obs), betweenss and totss
# CHUNK 3
# kmeans() uses random initial cluster centers
set.seed(1)

# Select the variables on which a k-means cluster analysis is run
cluster_vars <- iris[, c("Sepal.Length", "Petal.Length")]
km3 <- kmeans(cluster_vars, centers = 3, nstart = 20)
km3
## K-means clustering with 3 clusters of sizes 43, 53, 54
## 
## Cluster means:
##   Sepal.Length Petal.Length
## 1    1.2255135    1.0250063
## 2    0.0365328    0.4160503
## 3   -1.0117281   -1.2245544
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 2 1 2 1 2 2 3 1 2 3 2 2 2 2 1 2 2 2 2
##  [71] 2 2 2 2 2 1 1 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 3 2 2 2 2 3 2 1 2 1 1 1
## [106] 1 2 1 1 1 1 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 2 1 1 1 1 1 2 2 1 1 1 2 1
## [141] 1 1 2 1 1 1 2 1 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 17.54315 11.87468 13.32996
##  (between_SS / total_SS =  85.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

  • After running the 3-means cluster analysis, we extract the vector of group assignments, convert it into a factor, and add it to the iris data frame (CHUNK 4) so that we can plot sepal and petal length distinguished by the clustered groups.

  • Note that we generate a second plot but instead of distinguishing the clustered groups, we distinguish them by the species (which we left out of the cluster analysis)
    • The motivation being: checking how well \(k\)-means clustering caught that these obs were clustered by their species without giving it that information (i.e. the true group assignments)
  • The verdict? It matched the true assignments fairly closely
    • This indicates that the generated group assignments would be a good predictor of an observation’s species (in place of using petal and sepal length directly).
# CHUNK 4
iris$group <- as.factor(km3$cluster)

library(ggplot2)
library(gridExtra)

p1 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, col = group)) +
  geom_point(size = 2)

p2 <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, col = Species)) +
  geom_point(size = 2)

grid.arrange(p1, p2, ncol = 2)

Choosing the best K

  • We “chose” 3 centers knowing that we were checking to see how well the \(k\)-means clusters could pick up on the 3 different species, but what if we didn’t know what \(k\) to set?

  • We can implement the elbow method as discussed previously and check beyond which \(k\) the gains in the ratio of between-cluster to total variability become minimal

  • CHUNK 5 runs the \(k\)-means algorithm for centers 1:10 (each with 20 random starts)
    • The var.exp data frame collects the bss / tss ratio for each k
    • Then an elbow plot is generated
    • The elbow plot seems to indicate that the elbow is around 3 (a \(k\) beyond 3 results in minimal gains in bss_tss)
    • The elbow plot thus supports our initial choice of \(k = 3\)
# CHUNK 5
km1 <- kmeans(cluster_vars, centers = 1, nstart = 20)
km2 <- kmeans(cluster_vars, centers = 2, nstart = 20)
km3 <- kmeans(cluster_vars, centers = 3, nstart = 20)
km4 <- kmeans(cluster_vars, centers = 4, nstart = 20)
km5 <- kmeans(cluster_vars, centers = 5, nstart = 20)
km6 <- kmeans(cluster_vars, centers = 6, nstart = 20)
km7 <- kmeans(cluster_vars, centers = 7, nstart = 20)
km8 <- kmeans(cluster_vars, centers = 8, nstart = 20)
km9 <- kmeans(cluster_vars, centers = 9, nstart = 20)
km10 <- kmeans(cluster_vars, centers = 10, nstart = 20)

var.exp <- data.frame(k = 1:10,
                      bss_tss = c(km1$betweenss/km1$totss,
                                  km2$betweenss/km2$totss,
                                  km3$betweenss/km3$totss,
                                  km4$betweenss/km4$totss,
                                  km5$betweenss/km5$totss,
                                  km6$betweenss/km6$totss,
                                  km7$betweenss/km7$totss,
                                  km8$betweenss/km8$totss,
                                  km9$betweenss/km9$totss,
                                  km10$betweenss/km10$totss))

ggplot(var.exp, aes(x = k, y = bss_tss)) +
  geom_point() +
  geom_line() +
  ggtitle("Elbow plot")
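
A more compact way to produce the same ratios (a sketch equivalent to the ten separate calls in CHUNK 5):

# Sketch: loop over k = 1 to 10 with sapply instead of ten separate kmeans() calls
set.seed(1)  # for reproducibility of the random starts
bss_tss <- sapply(1:10, function(k) {
  km <- kmeans(cluster_vars, centers = k, nstart = 20)
  km$betweenss / km$totss
})
var.exp <- data.frame(k = 1:10, bss_tss = bss_tss)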

  • Finally, CHUNK 6 drops the features used to generate the group assignments (keeping only the new group feature) to avoid any multicollinearity/rank-deficiency issues (similar to PCA).
# CHUNK 6
iris$Sepal.Length <- NULL
iris$Petal.Length <- NULL
head(iris)
##   Species group
## 1  setosa     3
## 2  setosa     3
## 3  setosa     3
## 4  setosa     3
## 5  setosa     3
## 6  setosa     3
tail(iris)
##       Species group
## 145 virginica     1
## 146 virginica     1
## 147 virginica     2
## 148 virginica     1
## 149 virginica     2
## 150 virginica     2

Applying Hierarchical Cluster Analysis

  • The hclust() function runs the hierarchical cluster analysis
  • However, it does not accept a data frame directly – it requires the inter-observation (Euclidean) distances between all \(n\) observations
    • The dist() function is used to generate this distance matrix
    • CHUNK 7 creates the same subset of the iris data used in the \(k\)-means analysis and passes the sepal and petal length variables to the dist() function to create the distance matrix
    • Note that the data is scaled for the same reason as before (different degrees of variability)
# CHUNK 7
# Clean the working memory
rm(list = ls())

# Reload the iris data
data(iris)
iris <- iris[, c(1, 3, 5)]
iris$Sepal.Length <- scale(iris$Sepal.Length)
iris$Petal.Length <- scale(iris$Petal.Length)

# Calculate the distance matrix
dist_matrix <- dist(iris[, c("Sepal.Length", "Petal.Length")])

  • CHUNK 8 passes the distance matrix through three different hierarchical clustering algorithms:
    • hc.complete using the complete linkage (max dist)
    • hc.single using the single linkage (closest dist)
    • hc.average using the average linkage (average dist)
  • The plot function is used to generate the dendrograms
    • Note that the complete and average dendrograms produce more “balanced” clusters in the sense that a horizontal cut in the dendrogram will produce clusters with similar numbers of observations in them
# CHUNK 8
hc.complete <- hclust(dist_matrix)
hc.single <- hclust(dist_matrix, method = "single")
hc.average <- hclust(dist_matrix, method = "average")

plot(hc.complete, cex = 0.3)

plot(hc.single, cex = 0.3)

plot(hc.average, cex = 0.3)

  • CHUNK 9 utilizes the cutree() function, which takes an hclust object along with the number of clusters desired (or, alternatively, the cut height) and returns the resulting group assignment for each observation

  • CHUNK 9 wraps cutree() in a larger plotting function that takes:
    • Whichever of the three hclust models we ran
    • The number of desired clusters to produce the new feature
    • The function also generates a scatterplot of sepal and petal length, and distinguishes the points by the newly generated assigned groups
    • We try out 2:5 clusters
  • Comparing our generated plot (of 3 clusters) to the plot produced from the \(k\)-means analysis, they generally share the same cluster distributions

  • A drawback with hierarchical clustering however is that there’s no systematic way to determine the number of clusters to use (as there is with \(k\)-means via the elbow method)

# CHUNK 9
# Simple function to create a plot given a data frame, an hclust object, and number of clusters
plot_cluster_slice <- function(df, hc, numclusters) {
  df$clusters <- as.factor(cutree(hc, numclusters))
  ggplot(data = df, aes(x = Sepal.Length, y = Petal.Length, col = clusters)) + 
    geom_point()
}

plot_cluster_slice(iris, hc.complete, 2)

plot_cluster_slice(iris, hc.complete, 3)

plot_cluster_slice(iris, hc.complete, 4)

plot_cluster_slice(iris, hc.complete, 5)
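
As a possible follow-up (a sketch mirroring CHUNK 6 from the \(k\)-means analysis, not part of the original chunks), the three-cluster assignment could be kept as a factor feature and the underlying variables dropped before modelling:

# Sketch: keep the 3-cluster assignment as a feature and drop the underlying variables
iris$group <- as.factor(cutree(hc.complete, 3))
iris$Sepal.Length <- NULL
iris$Petal.Length <- NULL
head(iris)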