Machine Learning Exercises

In this exercise, we will illustrate concepts from the machine learning class slides. First, we will walk through the concepts of “training” and “testing” based on the logistic regression exercise used earlier in the course. Then we will extend those techniques to new datasets. Next, we will apply k-means clustering to a sample of recent condo and apartment sales around campus. The purpose is to familiarize you with some supervised and unsupervised machine learning basics.

General housekeeping items

There are two new packages we will need to install before we get started:

install.packages('caret')
install.packages('xgboost')

Let’s begin by opening libraries and clearing the environment:

library(tidyverse)
library(readxl)
library(broom)
library(stargazer)
library(janitor)
library(scales)
library(cluster) 
library(factoextra)
library(xgboost)
library(caret)

rm(list=ls())

Set your working directory:

setwd('C:/YOURWD')

Logistic regression and confusion matrices

In the regression exercises, we illustrated logistic regression with simulated data. Logistic regression can be thought of as a machine learning method for predicting binary outcomes. Let’s recreate the example:

set.seed(42)

logit_data <- data.frame('predictor' = rnorm(50, mean = 1, sd = 1), 'import_outcome' = 1) %>% 
  bind_rows(data.frame('predictor' = rnorm(50, mean = -1, sd = 1), 'import_outcome' = 0))

Next, let’s recreate the plot including linear and logistic regression lines:

ggplot(logit_data, aes(x = predictor, y = import_outcome)) + 
  geom_point(alpha = 0.3) + 
  scale_y_continuous(breaks = seq(0,1)) +
  geom_smooth(method = 'lm', se = FALSE, color = 'blue') +
  geom_smooth(method = 'glm', method.args = list(family = 'binomial'), se = FALSE, color = 'red') +
  labs(title = 'Prediction of Important Outcome',
       y = 'Important Outcome',
       x = 'Predictor')

Now, let’s illustrate “training” and “testing” by splitting the sample into training and testing datasets. Here, we are going to randomly keep 70% of the original dataset and store it as “train_data”. The remaining observations will be stored as “test_data”.

set.seed(42) 

train_data <- slice_sample(logit_data, prop = 0.7, replace = FALSE) 
test_data <- logit_data %>% 
  anti_join(train_data)

Next, estimate a logistic regression on the training dataset and use the estimated model to predict outcomes in the testing dataset.

##Estimate the logit model using the training dataset
train_logit <- glm(import_outcome ~ predictor, data = train_data, family = 'binomial') 

##Create 'predictions' on the testing dataset
import_prediction <- augment(train_logit, type.predict = 'response', newdata = test_data) %>%
  mutate(import_outcome_pred = as.numeric(.fitted > 0.5))
Note

Above, we converted the ‘fitted’ values to a binary prediction to force the model to predict either a 1 or a 0 (e.g., greater than 0.5 is predicted to be a 1, and less than 0.5 is predicted to be a 0). However, we can tinker with that threshold to make the predictions more/less conservative.
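
For example, here is a minimal sketch of a more aggressive cutoff (the 0.3 below is an arbitrary illustration, not part of the original exercise):

##A lower cutoff classifies more observations as 1
import_prediction %>%
  mutate(import_outcome_pred = as.numeric(.fitted > 0.3)) %>%
  count(import_outcome_pred, import_outcome)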

Now that we have both actual and predicted outcomes in the same dataset, let’s create a confusion matrix.

##Combine the predicted and actual outcomes to build the confusion matrix
conf_data <- import_prediction %>%
  select(import_outcome_pred, import_outcome) %>% #select the predicted and reference/actual columns - in that order
  mutate(across(everything(), factor)) #Convert values to factors (binary variables)

table(conf_data)
                   import_outcome
import_outcome_pred  0  1
                  0 16  2
                  1  2 10

Unsurprisingly, the caret package provides a function, confusionMatrix(), to report confusion matrices (and associated statistics):

confusionMatrix(table(conf_data), mode = 'everything', positive = '1')
Confusion Matrix and Statistics

                   import_outcome
import_outcome_pred  0  1
                  0 16  2
                  1  2 10
                                          
               Accuracy : 0.8667          
                 95% CI : (0.6928, 0.9624)
    No Information Rate : 0.6             
    P-Value [Acc > NIR] : 0.00151         
                                          
                  Kappa : 0.7222          
                                          
 Mcnemar's Test P-Value : 1.00000         
                                          
            Sensitivity : 0.8333          
            Specificity : 0.8889          
         Pos Pred Value : 0.8333          
         Neg Pred Value : 0.8889          
              Precision : 0.8333          
                 Recall : 0.8333          
                     F1 : 0.8333          
             Prevalence : 0.4000          
         Detection Rate : 0.3333          
   Detection Prevalence : 0.4000          
      Balanced Accuracy : 0.8611          
                                          
       'Positive' Class : 1               
                                          
Note

Can you label the false positives, false negatives, true positives, and true negatives? Can you interpret accuracy, precision, recall, and specificity?

Also, notice that we can set what a ‘positive’ result is. Above, we set it to 1, but we can change it to 0. What happens? Why?
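
For reference, the positive class can be flipped by re-running the same call with positive = '0':

confusionMatrix(table(conf_data), mode = 'everything', positive = '0')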

Let’s try this on default predictions using a dataset from the University of California, Irvine (UCI) Machine Learning Repository called “default of credit card clients.” Download and save the Excel file to your working directory, then import it:

cc_default <- read_excel("default of credit card clients.xls", skip = 1) %>% 
  clean_names()

Let’s create a couple of predictor variables:

cc_default <- cc_default %>% 
  mutate(id = row_number(), #Observation id to facilitate matching
         dtl_ratio = bill_amt1 / limit_bal, #Debt to limit ratio
         dtl_ratio_change = (bill_amt1 - bill_amt6) / limit_bal, #Change in debt to limit ratio
         behind = if_any(pay_0:pay_6, ~ . > 0), #Indicator if ever behind on payments in last few months
         avg_status = rowMeans(across(pay_0:pay_6))) %>% #Average payment status over past few months
  select(id, default_payment_next_month, dtl_ratio, dtl_ratio_change, behind, avg_status)

Let’s try “training” and “testing” a model to predict default by splitting the sample into training and testing datasets. As above, we are going to randomly keep 70% of the original dataset and store it as “train_data”. The remaining observations will be stored as “test_data”.

set.seed(42)   
train_data <- slice_sample(cc_default, prop = 0.7, replace = FALSE)  
test_data <- cc_default %>%    
  anti_join(train_data)

train_data <- train_data %>% 
  select(-id)

test_data <- test_data %>% 
  select(-id)

Next, let’s estimate a logistic regression on the training dataset and use the estimated model to predict outcomes in the testing dataset.

##Estimate the logit model using the training dataset 
train_logit <- glm(default_payment_next_month ~ dtl_ratio + dtl_ratio_change + behind + avg_status, data = train_data, family = 'binomial')  

summary(train_logit)

Call:
glm(formula = default_payment_next_month ~ dtl_ratio + dtl_ratio_change + 
    behind + avg_status, family = "binomial", data = train_data)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)      -1.88646    0.03969 -47.534   <2e-16 ***
dtl_ratio         0.05705    0.06140   0.929    0.353    
dtl_ratio_change  0.03410    0.07458   0.457    0.647    
behindTRUE        1.37889    0.04237  32.541   <2e-16 ***
avg_status        0.33532    0.02428  13.812   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 22176  on 20999  degrees of freedom
Residual deviance: 19400  on 20995  degrees of freedom
AIC: 19410

Number of Fisher Scoring iterations: 4

##Create 'predictions' on the testing dataset
default_prediction <- augment(train_logit, type.predict = 'response', newdata = test_data) %>%   
  mutate(default_outcome_pred = as.numeric(.fitted > 0.5))

default_prediction %>%
  select(default_outcome_pred, default_payment_next_month) %>% 
  mutate(across(everything(), factor)) %>% 
  table() %>% 
  confusionMatrix(mode = 'everything', positive = '1')
Confusion Matrix and Statistics

                    default_payment_next_month
default_outcome_pred    0    1
                   0 6818 1628
                   1  184  370
                                          
               Accuracy : 0.7987          
                 95% CI : (0.7902, 0.8069)
    No Information Rate : 0.778           
    P-Value [Acc > NIR] : 9.82e-07        
                                          
                  Kappa : 0.2142          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.18519         
            Specificity : 0.97372         
         Pos Pred Value : 0.66787         
         Neg Pred Value : 0.80725         
              Precision : 0.66787         
                 Recall : 0.18519         
                     F1 : 0.28997         
             Prevalence : 0.22200         
         Detection Rate : 0.04111         
   Detection Prevalence : 0.06156         
      Balanced Accuracy : 0.57945         
                                          
       'Positive' Class : 1               
                                          
Note

Can you interpret the values for accuracy, precision, recall, and specificity? Tinker with the prediction threshold and see what happens to these values.
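
As a starting point, here is a sketch that lowers the cutoff (the 0.3 is an arbitrary choice) so the model flags more accounts as likely defaults, trading precision for recall:

##Re-classify using a lower threshold and rebuild the confusion matrix
default_prediction %>%
  mutate(default_outcome_pred = as.numeric(.fitted > 0.3)) %>%
  select(default_outcome_pred, default_payment_next_month) %>%
  mutate(across(everything(), factor)) %>%
  table() %>%
  confusionMatrix(mode = 'everything', positive = '1')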

Machine learning - Gradient Boosting (XGBoost)

Let’s repeat the analysis of the credit card data above, this time using a decision-tree-based algorithm from machine learning. Here we will use a popular technique called Extreme Gradient Boosting via the xgboost package in R.

set.seed(42)    

# Define parameters for the xgboost model
xgb_model <- xgboost(
  data = as.matrix(train_data[,-1]), #Predictor variables
  label = as.matrix(train_data[,1]), #Outcome variable
  params = list(
    booster = 'gbtree', #Default
    objective = "binary:logistic", #default is squarederror (regression) - here we are predicting a binary classification, so logistic is our objective
    eval_metric = "error",
    max_depth = 6, #Default is 6 - controls depth of the tree, deep trees are more complex and have higher chance of overfitting
    eta = 0.3, #Default 0.3 - represents the learning rate
    gamma = 0 #Default is 0, minimum split loss, larger values make the algorithm more conservative
  ),
  nrounds = 100 #Default is 100, controls the number of iterations 
)
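
The comments above note that deeper trees and more boosting rounds raise the risk of overfitting. One way to explore this is k-fold cross-validation on the training data. Below is a minimal sketch using xgb.cv() from the same package; the 5-fold split and early-stopping rule are illustrative choices, not part of the original exercise:

set.seed(42)

cv_results <- xgb.cv(
  data = as.matrix(train_data[,-1]), #Predictor variables
  label = as.matrix(train_data[,1]), #Outcome variable
  params = list(
    booster = 'gbtree',
    objective = 'binary:logistic',
    eval_metric = 'error',
    max_depth = 6,
    eta = 0.3,
    gamma = 0
  ),
  nrounds = 100,
  nfold = 5, #Number of cross-validation folds (arbitrary choice here)
  early_stopping_rounds = 10, #Stop if held-out error has not improved for 10 rounds
  verbose = FALSE
)

cv_results$best_iteration #Boosting round with the lowest held-out error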

Generate predictions and create a confusion matrix:

default_prediction <- test_data %>% 
  mutate(fitted = predict(xgb_model, as.matrix(test_data[,-1]))) %>% 
  mutate(default_outcome_pred = as.numeric(fitted > .5))

default_prediction %>%
  select(default_outcome_pred, default_payment_next_month) %>% 
  mutate(across(everything(), factor)) %>% 
  table() %>% 
  confusionMatrix(mode = 'everything', positive = '1')
Confusion Matrix and Statistics

                    default_payment_next_month
default_outcome_pred    0    1
                   0 6597 1399
                   1  405  599
                                          
               Accuracy : 0.7996          
                 95% CI : (0.7911, 0.8078)
    No Information Rate : 0.778           
    P-Value [Acc > NIR] : 3.436e-07       
                                          
                  Kappa : 0.2943          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       
                                          
            Sensitivity : 0.29980         
            Specificity : 0.94216         
         Pos Pred Value : 0.59661         
         Neg Pred Value : 0.82504         
              Precision : 0.59661         
                 Recall : 0.29980         
                     F1 : 0.39907         
             Prevalence : 0.22200         
         Detection Rate : 0.06656         
   Detection Prevalence : 0.11156         
      Balanced Accuracy : 0.62098         
                                          
       'Positive' Class : 1               
                                          

Look at feature importance:

importance_matrix <- xgb.importance(colnames(train_data[,-1]), model = xgb_model) %>%
  arrange(Gain) %>% 
  mutate(Feature = fct_inorder(Feature))
importance_matrix
            Feature      Gain      Cover  Frequency
             <fctr>     <num>      <num>      <num>
1:           behind 0.1833843 0.02719708 0.01424918
2: dtl_ratio_change 0.2288508 0.40492639 0.42747534
3:        dtl_ratio 0.2323999 0.43700954 0.42382170
4:       avg_status 0.3553650 0.13086700 0.13445378

ggplot(importance_matrix, aes(y = Feature, x = Gain)) +
  geom_bar(stat = 'identity')

Clustering

On the course site, you will find a dataset containing information on sales of condos, apartments, and townhomes from around the University of Alabama’s campus.¹ Make sure this file is saved in your working directory. Let’s begin by importing the dataset:

campus_condo_data <- read_csv('campus_condo_data.csv')

Next, let’s cluster the observations by location. First, we will isolate the latitude and longitude variables and pass them to the kmeans clustering function. Here we will specify 2 clusters (or centers) and 25 random starting configurations (R will keep the best one):

latlong <- campus_condo_data %>% 
  select(long, lat) 

set.seed(42)

clusters <- kmeans(latlong, centers = 2, nstart = 25)

str(clusters)
List of 9
 $ cluster     : int [1:660] 2 2 2 2 2 2 1 2 2 1 ...
 $ centers     : num [1:2, 1:2] -87.6 -87.5 33.2 33.2
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:2] "1" "2"
  .. ..$ : chr [1:2] "long" "lat"
 $ totss       : num 0.113
 $ withinss    : num [1:2] 0.02242 0.00996
 $ tot.withinss: num 0.0324
 $ betweenss   : num 0.081
 $ size        : int [1:2] 351 309
 $ iter        : int 1
 $ ifault      : int 0
 - attr(*, "class")= chr "kmeans"

Let’s visualize these clusters using the fviz_cluster function from the ‘factoextra’ package:

fviz_cluster(clusters, data = latlong, geom = 'point', main = 'Real Estate Clusters', xlab = 'Dimension 1: Longitude', ylab = 'Dimension 2: Latitude')

As you can see, we have discretion over the number of clusters to select. There are no hard-and-fast rules for choosing it; ultimately, it is up to you to determine the number of clusters that works best in your setting. However, it is a useful exercise to evaluate how much each additional cluster improves the classification. One way to do this is with a “scree plot” (or elbow plot), which shows how the total within-cluster sum of squares (the variation left unexplained) falls as clusters are added.

wss <- function(k) {
  kmeans(latlong, k, nstart = 25)$tot.withinss
}

k.values <- 1:15

wss_values <- map_dbl(k.values, wss)

plot(k.values, wss_values,
     type = "b", pch = 19, frame = FALSE, 
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of squares")

As you can see, the improvement in similarity among observations within each group declines as we add clusters. Let’s repeat the clustering exercise above, this time using 4 clusters.

latlong <- campus_condo_data %>% 
  select(long, lat) 

set.seed(42)
clusters <- kmeans(latlong, centers = 4, nstart = 25)

fviz_cluster(clusters, data = latlong, geom = 'point', main = 'Real Estate Clusters', xlab = 'Dimension 1: Longitude', ylab = 'Dimension 2: Latitude')

Finally, let’s add the clusters to the original dataset (here we can use the augment function from the broom package). Then let’s estimate a regression predicting the sales price of a property based on size, bedrooms, age, and location (cluster). Can you interpret the output?

campus_condo_data <- augment(clusters, campus_condo_data)

pricefit1 <- lm(log(sales_price) ~ log(square_feet) + bedrooms + age, campus_condo_data)
pricefit2 <- lm(log(sales_price) ~ log(square_feet) + bedrooms + age + .cluster, campus_condo_data)

stargazer(pricefit1, pricefit2,
          type = 'html',
          out = 'price_reg.htm', 
          omit.stat = c('f', 'ser'), 
          title = 'Real Estate Regression Results')
Real Estate Regression Results
================================================
                         Dependent variable:
                         -----------------------
                            log(sales_price)
                           (1)           (2)
------------------------------------------------
log(square_feet)         1.434***      0.903***
                         (0.113)       (0.073)
bedrooms                -0.203***      -0.033
                         (0.042)       (0.027)
age                     -0.010***     -0.013***
                         (0.001)       (0.001)
.cluster2                             -0.431***
                                       (0.034)
.cluster3                             -0.815***
                                       (0.034)
.cluster4                             -1.062***
                                       (0.037)
Constant                 2.986***      6.897***
                         (0.738)       (0.481)
------------------------------------------------
Observations               660           660
R2                        0.466         0.793
Adjusted R2               0.464         0.791
================================================
Note:                *p<0.1; **p<0.05; ***p<0.01
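
One way to make the cluster coefficients concrete is to convert the log-point estimates into approximate percent price differences relative to the omitted cluster (cluster 1). A minimal sketch using tidy() from the broom package loaded earlier:

##Cluster coefficients expressed as approximate percent differences in price
tidy(pricefit2) %>%
  filter(str_detect(term, 'cluster')) %>%
  mutate(pct_difference = exp(estimate) - 1) #e.g., exp(-0.431) - 1 is roughly a 35% lower price than cluster 1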

Footnotes

  1. The campus_condo_data dataset includes apartment, condominium, and townhouse sales from Zillow from November 2019 through May 2022; the data was collected from May 2022 through July 2022. The search includes sales in Tuscaloosa inside of 359, Warrior River, McFarland Blvd, and 37th E Street. The data was filtered to remove observations with missing values (e.g., incomplete cases), outliers (e.g., a sales price of $100), or sales unlikely to be individual transactions (e.g., multiple units included in the same sale). Latitude and longitude were obtained from Google Maps; the remaining variables were obtained from Zillow. A big thanks to Kimberly Niven, Sara Buroughs, and Taryn Johnson for collecting this data.