This article covers decision trees, as well as the advantages and disadvantages of using them. In this analysis, I train two decision trees on a Kaggle dataset containing the ELO ratings of NFL teams prior to their games, and use that information to predict the winner of a particular game (either the Home or Away team). In addition, I train a random forest classifier on a subset of features (similar to those used by one of the decision trees) to predict the same outcomes.
One of the most salient points of the article is that the main advantage of using decision trees (especially in a business context) is the explainability and intuitiveness of the model. Decision trees are a reasonably good approximation of how humans weigh the relative inputs to their decisions, and the game of twenty questions the article points to is a helpful illustration of this kind of thinking. In addition, decision trees can be conveyed to non-technical stakeholders more easily, as a simple flowchart diagram is often an effective visualization. This is actually one point on which decision trees beat out random forests: even though the ensemble method of a random forest model tends to improve predictive performance, the resulting forest is much harder to summarize in a single diagram, so some of that explainability is lost.
While decision trees can be simple yet effective, they do have some drawbacks. The primary drawback is their tendency to overfit to their training data. This comes from the tendency of decision trees to grow the number of leaf nodes with the number of features (discussed in the article under the Complexity disadvantage). The result is a tree that predicts training values well, because it grows to the shape of the training data, but that does not always generalize. Pruning decision trees back can therefore be a worthwhile endeavor. Data Science for Business (Provost & Fawcett) is a good resource that discusses handling decision trees in a business modeling context.
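As a minimal sketch of what pruning can look like with the rpart package (the games data frame and its winner outcome are hypothetical here, standing in for any classification dataset):

library(rpart)

# Hypothetical sketch: grow a deliberately deep tree, then prune it back to the
# complexity parameter (cp) with the lowest cross-validated error
full_tree <- rpart(winner ~ ., data = games, method = "class", cp = 0)
best_cp   <- full_tree$cptable[which.min(full_tree$cptable[, "xerror"]), "CP"]
pruned    <- prune(full_tree, cp = best_cp)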
For this assignment, our classification task will be to try to predict whether the home team or away team will win a given NFL game, based on the respective ELO ratings prior to the game, along with other features included in the Kaggle dataset.
We’ll be using the NFL ELO rating dataset we used in homework 1. This dataset comes from Kaggle, and contains the ELO ratings of NFL teams both before and after their matchups, as well as the points scored by each team in the game. More information on the FiveThirtyEight method which calculates these ELO ratings can be found here.
# Packages assumed loaded for this analysis (the original setup chunk is not
# shown here): tidyverse, GGally, corrplot, questionr, mice, and caret
library(tidyverse); library(GGally); library(corrplot)
library(questionr); library(mice); library(caret)

elo <- read_csv("data/nfl_elo.csv")

# Standardize a team rename and add a factor winner column (Home/Away)
elo <- elo %>%
  mutate(team1 = ifelse(team1 == "WSH", "WAS", team1),
         team2 = ifelse(team2 == "WSH", "WAS", team2),
         winner = ifelse(score1 > score2, "Home", "Away"))
elo$winner <- as.factor(elo$winner)
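Before modeling, it’s also worth a quick look at the class balance of winner, since the home-team win rate is effectively the baseline (the no-information rate) that any classifier needs to beat. A quick check might be:

# Share of games won by the home vs. away team
prop.table(table(elo$winner))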
First, let’s take a quick look at the distribution of pre-game ELO ratings of each team (home & away):
prelo1 <- elo %>%
  rename(elo = elo1_pre, team = team1) %>%
  select(c("team", "elo")) %>%
  mutate(team_type = "Home Team")

prelo2 <- elo %>%
  rename(elo = elo2_pre, team = team2) %>%
  select(c("team", "elo")) %>%
  mutate(team_type = "Away Team")

prelo <- rbind(prelo1, prelo2)

# Plot pre-game ELO ratings for home and away teams
ggplot(prelo, aes(elo, fill = team_type)) +
  geom_density(alpha = 0.2) +
  labs(x = "ELO Rating (pre-game)", title = "Pre-Matchup ELO Ratings of NFL Teams (post-1967)")
Both our distributions of pre-game ELO ratings for home and away teams look relatively normally distributed. This is helpful to know before we go into modeling, as some modeling algorithms expect normally distributed (and sometimes scaled & centered) data.
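To put rough numbers on those distributions, we could also summarise the center and spread of the pre-game ratings by home/away designation (a small sketch using the prelo frame built above):

# Center and spread of pre-game ELO ratings for home vs. away teams
prelo %>%
  group_by(team_type) %>%
  summarise(mean_elo = mean(elo), sd_elo = sd(elo))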
Let’s also take a look at the pair plot of the joint distributions between our features and the output variable (winner).
elo %>%
  select(winner, elo1_pre, elo2_pre, elo_prob1, elo_prob2, qbelo1_pre, qbelo2_pre) %>%
  ggpairs()
We’ll also want to check for collinearity between our feature variables, as well as between each feature and the output. This guards against feeding redundant inputs into our model and helps ensure the predictors we keep each explain as much of the variance in the response as possible.
# Correlation plot; pairwise-complete correlations are used because the QB
# columns still contain missing values at this stage
elo %>%
  mutate(winner = ifelse(winner == "Home", 1, 0)) %>%
  select(winner, elo1_pre, elo2_pre, elo_prob1, elo_prob2, qbelo1_pre, qbelo2_pre) %>%
  cor(use = "pairwise.complete.obs") %>%
  corrplot()
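One way to act on that correlation structure programmatically is caret’s findCorrelation(), which flags predictors whose pairwise correlations exceed a cutoff. A sketch (the 0.9 cutoff is an arbitrary choice here):

# Flag predictors with pairwise correlation above an (arbitrary) 0.9 cutoff
cor_matrix <- elo %>%
  select(elo1_pre, elo2_pre, elo_prob1, elo_prob2, qbelo1_pre, qbelo2_pre) %>%
  cor(use = "pairwise.complete.obs")
caret::findCorrelation(cor_matrix, cutoff = 0.9, names = TRUE)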
freq.na(elo)
## missing %
## importance 16810 98
## total_rating 16810 98
## playoff 16492 96
## qbelo1_pre 2162 13
## qbelo2_pre 2162 13
## qb1 2162 13
## qb2 2162 13
## qb1_value_pre 2162 13
## qb2_value_pre 2162 13
## qb1_adj 2162 13
## qb2_adj 2162 13
## qbelo_prob1 2162 13
## qbelo_prob2 2162 13
## qb1_game_value 2162 13
## qb2_game_value 2162 13
## qb1_value_post 2162 13
## qb2_value_post 2162 13
## qbelo1_post 2162 13
## qbelo2_post 2162 13
## quality 2162 13
## date 0 0
## season 0 0
## neutral 0 0
## team1 0 0
## team2 0 0
## elo1_pre 0 0
## elo2_pre 0 0
## elo_prob1 0 0
## elo_prob2 0 0
## elo1_post 0 0
## elo2_post 0 0
## score1 0 0
## score2 0 0
## winner 0 0
We see from our NA frequency table that we have some fields of interest for which we’ll want to impute missing values. We can fill in these missing values via the R mice package.
# Impute the missing QB-related ratings via predictive mean matching (pmm);
# note the imputation is run on the full dataset before the train/test split
input_cols <- c("elo1_pre", "elo2_pre",
                "elo_prob1", "elo_prob2",
                "qbelo1_pre", "qbelo2_pre",
                "qb1_value_pre", "qb2_value_pre",
                "qb1_adj", "qb2_adj")
naive_inputs <- elo[, input_cols]
imputed_qb_ratings <- mice(naive_inputs, meth = 'pmm')
##
## iter imp variable
## 1 1 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 1 2 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 1 3 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 1 4 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 1 5 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 2 1 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 2 2 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 2 3 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 2 4 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 2 5 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 3 1 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 3 2 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 3 3 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 3 4 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 3 5 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 4 1 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 4 2 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 4 3 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 4 4 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 4 5 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 5 1 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 5 2 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 5 3 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 5 4 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
## 5 5 qbelo1_pre qbelo2_pre qb1_value_pre qb2_value_pre qb1_adj qb2_adj
imputedData <- complete(imputed_qb_ratings, 1)
imputedData$winner <- elo$winner
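Before moving on, two quick imputation sanity checks are worth running: confirming that no missing values remain, and using mice’s built-in density plot to compare imputed values against observed ones:

# Confirm no missing values remain in the imputed data
colSums(is.na(imputedData))

# Compare densities of observed vs. imputed values for each imputed column
densityplot(imputed_qb_ratings)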
Before we model our data using a decision tree, we’ll need to create a training-test split for later validation. We’ll hold out 20% of the data for testing our models, which helps us gauge (and limit) overfitting.
set.seed(1234)

# Create an ID column so held-out rows can be anti-joined back out
imputedData$game_id <- 1:nrow(imputedData)

# Use 80% of the dataset as the training set and 20% as the test set
training <- imputedData %>% dplyr::sample_frac(0.80)
test <- dplyr::anti_join(imputedData, training, by = 'game_id')
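Because the split is random, a quick check that the Home/Away win rates look similar in the training and test sets can’t hurt:

# Class balance should be roughly the same on both sides of the split
prop.table(table(training$winner))
prop.table(table(test$winner))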
First we can build a decision tree model using all of the features selected above. We’ll refer to this as our “naive” model, as it naively takes all possible features as inputs (regardless of how strongly correlated they may be).
naive_model_formula <- winner ~ elo1_pre + elo2_pre + elo_prob1 + elo_prob2 +
  qbelo1_pre + qbelo2_pre + qb1_value_pre + qb2_value_pre + qb1_adj + qb2_adj

# Train a decision tree using 5-fold cross-validation
tree <- caret::train(naive_model_formula,
                     data = training,
                     method = "rpart",
                     trControl = trainControl(method = "cv", number = 5))
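To see the fitted tree as the kind of flowchart discussed earlier, one option (assuming the rpart.plot package is installed) is to plot caret’s final rpart model directly; printing the train object also shows the cross-validated accuracy at each candidate cp value:

library(rpart.plot)

# Visualize the selected decision tree as a flowchart
rpart.plot(tree$finalModel)

# Inspect the cross-validation results and the cp value caret selected
print(tree)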
Now we can change a few of the variables that are passed into our decision tree. Let’s simplify the model (in terms of the number of variables passed in) and train a decision tree with the same parameters.
simple_formula <- winner ~ elo1_pre + elo2_pre + qbelo1_pre + qbelo2_pre
tree_simple <- caret::train(simple_formula,
                            data = training,
                            method = "rpart",
                            trControl = trainControl(method = "cv", number = 5))
Decision trees are very flexible, albeit weak, learners. They have a tendency to overfit, especially as the number of leaf nodes grows. Random forests, which combine several differently-trained decision trees into a single prediction, can be leveraged to reduce the variance of your model. Luckily, the caret package in R supports training random forest models with a similar syntax.
Now we can train a random forest model on our data. We’ll use the “simple” model formula, which only takes into account the teams’ and quarterbacks’ ELO ratings prior to the matchup. This will construct an ensemble of decision trees whose predictions are combined, which should make this model more robust to new data it hasn’t seen during training.
set.seed(1234)

# Train a random forest model using 10-fold cross-validation
rf_tree <- caret::train(simple_formula,
                        data = training,
                        method = "rf",
                        trControl = trainControl(method = "cv", number = 10))
One advantage of decision trees from the pre-work article is the idea of explainability. That is, decision trees are an explainable model because they often mirror the if-else logic of human decision making, even if conditional probabilities are used. Conversely, random forest methods will tend to improve the performance of a model, as they combine the “wisdom” of several weak learners (decision trees). However, these random forest models often lose out on explainability, since they are an amalgamation of several models together, each with a different structure.
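One way to claw back some interpretability from the random forest is variable importance, which caret exposes through a common interface for both model types:

# Relative predictor importance for the random forest and the simple tree
varImp(rf_tree)
varImp(tree_simple)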
One simple way to evaluate our decision trees is via a confusion matrix, which shows the relative counts of true and false positives and negatives. In our case, an Away team winning is considered a member of the “positive” class, while a Home team victory is a negative class instance. It should be noted that there’s nothing inherently “positive” or “negative” about these outcomes; they are simply an artifact of the classification task at hand.
predictions_naive <- predict(tree, test)
confusionMatrix(test$winner, predictions_naive)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Away Home
## Away 718 714
## Home 483 1504
##
## Accuracy : 0.6499
## 95% CI : (0.6336, 0.6659)
## No Information Rate : 0.6487
## P-Value [Acc > NIR] : 0.4508
##
## Kappa : 0.2643
##
## Mcnemar's Test P-Value : 2.974e-11
##
## Sensitivity : 0.5978
## Specificity : 0.6781
## Pos Pred Value : 0.5014
## Neg Pred Value : 0.7569
## Prevalence : 0.3513
## Detection Rate : 0.2100
## Detection Prevalence : 0.4188
## Balanced Accuracy : 0.6380
##
## 'Positive' Class : Away
##
And here’s the confusion matrix for our simple model. On new data we see roughly the same accuracy and comparable diagnostic metrics from a much simpler model, whose four inputs (team and QB pre-match ELO ratings) are also easier to explain.
predictions_simple <- predict(tree_simple, test)
confusionMatrix(test$winner, predictions_simple)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Away Home
## Away 717 715
## Home 510 1477
##
## Accuracy : 0.6417
## 95% CI : (0.6254, 0.6578)
## No Information Rate : 0.6411
## P-Value [Acc > NIR] : 0.4793
##
## Kappa : 0.249
##
## Mcnemar's Test P-Value : 5.59e-09
##
## Sensitivity : 0.5844
## Specificity : 0.6738
## Pos Pred Value : 0.5007
## Neg Pred Value : 0.7433
## Prevalence : 0.3589
## Detection Rate : 0.2097
## Detection Prevalence : 0.4188
## Balanced Accuracy : 0.6291
##
## 'Positive' Class : Away
##
In this case, the accuracies of our classification models are comparable (65% for the naive model and 64% for the simplified model). Given this, we’d likely choose the simplified model, as it would be more explainable and computationally cheaper.
We can print out the confusion matrix for our random forest model as well. While the accuracy of this classifier has dropped slightly relative to the wider decision tree we trained above, a random forest approach lessens the variance of our model and likely reduces the chance of overfitting to the training data.
# Print confusion matrix for RF model
rf_predictions <- predict(rf_tree, test)
confusionMatrix(test$winner, rf_predictions)
## Confusion Matrix and Statistics
##
## Reference
## Prediction Away Home
## Away 752 680
## Home 578 1409
##
## Accuracy : 0.6321
## 95% CI : (0.6156, 0.6482)
## No Information Rate : 0.611
## P-Value [Acc > NIR] : 0.005944
##
## Kappa : 0.2366
##
## Mcnemar's Test P-Value : 0.004405
##
## Sensitivity : 0.5654
## Specificity : 0.6745
## Pos Pred Value : 0.5251
## Neg Pred Value : 0.7091
## Prevalence : 0.3890
## Detection Rate : 0.2199
## Detection Prevalence : 0.4188
## Balanced Accuracy : 0.6199
##
## 'Positive' Class : Away
##
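Since all three models were scored on the same held-out test set, we can also pull their hold-out accuracies into one small table for a side-by-side look (a sketch using the prediction vectors computed above):

# Hold-out accuracy of each model on the same test set
tibble(
  model = c("naive tree", "simple tree", "random forest"),
  accuracy = c(mean(predictions_naive == test$winner),
               mean(predictions_simple == test$winner),
               mean(rf_predictions == test$winner))
)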
This article outlines the benefits and drawbacks of a decision tree approach within a business context. Under the note of Evolve, these decision trees will need to be updated as more NFL games are played. In a production environment, checks on these decision tree (and random forest) models will need to be implemented to avoid model drift, in which the trained model gradually diverges from the real-world environment it represents. Static decision trees that are not updated will eventually become out of date, especially if their training data no longer reflects the real world.
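As a hedged sketch of what such a check might look like in practice (the new_games data frame, the retraining policy, and the 0.60 accuracy threshold are all hypothetical), one could score each batch of newly played games and flag when hold-out accuracy degrades:

# Hypothetical drift check: score newly played games and warn when accuracy
# falls below an agreed threshold, signalling that retraining may be needed
check_drift <- function(model, new_games, threshold = 0.60) {
  preds <- predict(model, new_games)
  acc <- mean(preds == new_games$winner)
  if (acc < threshold) {
    warning(sprintf("Accuracy %.3f is below %.2f; consider retraining", acc, threshold))
  }
  acc
}

# e.g. check_drift(tree_simple, latest_season_games)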