DATA 622: Homework 3

Articles

The two linked articles pursue the same end by different means: predicting Covid-19 using machine learning algorithms. At their core, both papers undertake a binary classification task (whether Covid is present in a patient or not), but they differ in the model used: the Ahmad paper relies on decision trees, while the Guhathakurata paper leverages a support vector machine (SVM).

Additionally, Guhathakurata used only a linear kernel. It’s possible their 87% accuracy figure could be improved upon by using a different SVM kernel function. However, this decision would come at the cost of greater computational complexity and, in some cases, reduced model explainability. In addition, accuracy isn’t a great evaluation metric for this classification task, since Covid data is very likely to be significantly imbalanced: most test cases will be negative, so a naive classifier could score well simply by always predicting the majority (negative) class. Both studies had similar results in terms of accuracy, recall, and F1 scores, with Guhathakurata improving precision for their SVM classifier.
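To make the point about imbalanced data concrete, here is a minimal sketch in base R. The 5% positive rate and the labels are assumptions chosen purely for illustration; they are not taken from either paper.

```r
# Synthetic labels: 50 positive and 950 negative cases (an assumed 5% rate)
truth <- factor(rep(c("positive", "negative"), times = c(50, 950)),
                levels = c("positive", "negative"))

# A naive "classifier" that always predicts the majority class
pred <- factor(rep("negative", length(truth)), levels = levels(truth))

# Accuracy looks strong, but recall shows that no positive case is ever found
accuracy <- mean(pred == truth)
recall   <- sum(pred == "positive" & truth == "positive") / sum(truth == "positive")

print(c(accuracy = accuracy, recall = recall))
```

Under these assumptions the accuracy equals the share of negative cases while recall is zero, which is why recall, precision, and F1 are more informative for this kind of task.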

Below are some articles I found that discuss using support vector machines and decision trees within the field of nuclear safety, an area of interest for me professionally.

All three apply SVMs or decision trees within the context of nuclear safety, a field in which I have a personal interest given my physics background and interest in policy. Unlike the Covid-19 articles, which share a single classification goal, these articles apply the methods to different tasks: two focus on predictive maintenance within nuclear power plants (using SVMs), while one uses decision trees for nuclear scenario planning. Specifically, the Manjunatha article focuses on anomaly detection with motor pumps, attempting to predict which will need maintenance before they break down or degrade. In this case, a multi-kernel approach is used.

All in all, the fact that these methods are used in such varying contexts speaks to their robustness. SVMs and decision trees are no longer hot topics in modeling, but in many cases simpler modeling approaches can still perform very well across disciplines.

Data Analysis Using SVM

On to a lighter topic than Covid and nuclear safety: sports gambling. First, we’ll read in the datasets used in Homework 2.

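library(dplyr)  # provides %>% and mutate()

elo <- read.csv("data/nfl_elo.csv")

# Some basic handling of team renames, as well as a winner column.
# team1 is the home team; note that ties (score1 == score2) are labeled "Away".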
elo <- elo %>% 
  mutate(team1 = ifelse(team1=="WSH", "WAS", team1),
         team2 = ifelse(team2=="WSH", "WAS", team2),
         winner = ifelse(score1 > score2, "Home", "Away"))

elo$winner <- as.factor(elo$winner)

We’ll also read in our NBA player performance dataset. This can be used for a regression task later on (predicting how many points a player will score in a given game, for instance).

library(readr)  # provides read_csv()
nba <- read_csv("data/traditional.csv")

## Rows: 702387 Columns: 30
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (5): type, player, team, home, away
## dbl  (24): gameid, playerid, MIN, PTS, FGM, FGA, FG%, 3PM, 3PA, 3P%, FTM, FT...
## date  (1): date
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(nba)

## # A tibble: 6 × 30
##     gameid date       type   playerid player team  home  away    MIN   PTS   FGM
##      <dbl> <date>     <chr>     <dbl> <chr>  <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 29600001 1996-11-01 regul…      893 Micha… CHI   BOS   CHI      43    30    10
## 2 29600001 1996-11-01 regul…      937 Scott… CHI   BOS   CHI      40    18     8
## 3 29600001 1996-11-01 regul…      677 Eric … BOS   BOS   CHI      25    14     6
## 4 29600001 1996-11-01 regul…      146 Jud B… CHI   BOS   CHI       1     0     0
## 5 29600001 1996-11-01 regul…      166 Ron H… CHI   BOS   CHI      25     7     3
## 6 29600001 1996-11-01 regul…      442 Pervi… BOS   BOS   CHI      31     7     2
## # ℹ 19 more variables: FGA <dbl>, `FG%` <dbl>, `3PM` <dbl>, `3PA` <dbl>,
## #   `3P%` <dbl>, FTM <dbl>, FTA <dbl>, `FT%` <dbl>, OREB <dbl>, DREB <dbl>,
## #   REB <dbl>, AST <dbl>, STL <dbl>, BLK <dbl>, TOV <dbl>, PF <dbl>,
## #   `+/-` <dbl>, win <dbl>, season <dbl>

Similar to Homework 2, we’ll be looking to classify games based on whether the home team (team1 in our raw dataset) or the away team wins. For a regression task, we’ll mirror our modeling from Homework 1 and attempt to predict individual NBA player performances using our nba dataset.

We’ll use the same imputation and train/test split methods we used in Homework 2. In this case, we use predictive mean matching to impute our missing values.

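library(mice)  # multiple imputation; provides mice() and complete()

# Impute missing values in the model inputs (on the full dataset, before the split)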
input_cols <- c("elo1_pre", "elo2_pre",
                "elo_prob1", "elo_prob2",
                "qbelo1_pre", "qbelo2_pre",
                "qb1_value_pre", "qb2_value_pre",
                "qb1_adj", "qb2_adj")
naive_inputs <- elo[, input_cols ]
imputed_qb_ratings <- mice(naive_inputs, method = "pmm", printFlag = FALSE)
imputedData <- complete(imputed_qb_ratings, 1)

imputedData$winner <- elo$winner

# Create train-test split
set.seed(1234)

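# create ID column (used for the anti_join below; because it stays in the
# data, a `winner ~ .` formula will also treat game_id as a predictor)
imputedData$game_id <- 1:nrow(imputedData)

# use 70% of dataset as training set and 30% as test set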
train <- imputedData %>% dplyr::sample_frac(0.70)
test  <- dplyr::anti_join(imputedData, train, by = 'game_id')
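The idea behind the predictive mean matching step above can be sketched in base R. This is a deliberately simplified, single-variable illustration on synthetic data: `pmm_impute` is a hypothetical helper, not part of mice, and the real `mice()` routine does more (for example, it introduces additional randomness into the regression fit and supports multiple imputed datasets).

```r
set.seed(1234)

# Synthetic data: y depends on x, and 20 of the 100 y values are set to NA
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n, sd = 0.5)
miss <- sample(n, 20)
y_obs <- y
y_obs[miss] <- NA

# Hypothetical helper: simplified predictive mean matching for one variable
pmm_impute <- function(y, x, k = 5) {
  obs  <- !is.na(y)
  fit  <- lm(y ~ x, subset = obs)               # regression on observed cases
  pred <- predict(fit, newdata = data.frame(x = x))
  out  <- y
  for (i in which(!obs)) {
    # Donors: the k observed cases whose predicted value is closest to case i
    dist   <- abs(pred[obs] - pred[i])
    donors <- which(obs)[order(dist)[seq_len(k)]]
    # Borrow the observed value of one randomly chosen donor
    out[i] <- y[donors[sample.int(length(donors), 1)]]
  }
  out
}

y_imputed <- pmm_impute(y_obs, x)
```

Because each missing entry is filled with a value actually observed elsewhere in the column, imputed values always stay within the range of the real data.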

Now we can fit a support vector machine to classify our winner factor. We’ll train our SVM model using the caret library.

library(caret)  # provides train(), trainControl(), and confusionMatrix()

trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
svmLinear <- train(winner ~., data = train,
              method = "svmLinear",
              trControl = trainControl,
              tuneLength=10,
              preProcess = c("center","scale"))
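As a rough illustration of what the `repeatedcv` setting with `number=10` and `repeats=3` implies, the sketch below builds the fold assignments by hand in base R. The row count `n` and the helper `held_out` are hypothetical, and caret builds its own resampling indices internally, so this shows the idea rather than caret's exact procedure.

```r
set.seed(1234)

n <- 50        # hypothetical number of training rows
k <- 10        # corresponds to number = 10
repeats <- 3   # corresponds to repeats = 3

# For each repeat, shuffle a balanced vector of fold labels 1..k
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Row indices held out in fold j of repeat r; the model is fit on the rest
held_out <- function(r, j) which(fold_ids[[r]] == j)

# Total number of resampled fits per candidate tuning value
n_fits <- repeats * k
```

Each candidate tuning value is therefore evaluated on 30 held-out folds, and the reported accuracy is an average over those fits.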

Now we can predict against our test dataset and get the confusion matrix for our classifier. In Homework 2, our random forest model reached an accuracy of 62% (our decision tree was about 64%, but is prone to high variance depending on the training data). In the context of sports prediction, higher accuracy values aren’t always feasible due to the random nature of game outcomes. Within the sports gambling community, an accuracy (for binary predictions) in the 70% range is considered very good. This is an example of how the modeling context can often determine the model performance benchmarks of interest.

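predictionsLinear <- predict(svmLinear, test, decision.values = TRUE)

# Print out confusion matrix for SVM classifier with linear kernel.
# Note: caret's signature is confusionMatrix(data = <predictions>, reference = <truth>).
# This call passes the true labels first, so in the output the rows labeled
# "Prediction" are actual outcomes and the columns labeled "Reference" are model
# predictions. Accuracy and kappa are unaffected by the swap.
confusionMatrix(test$winner, predictionsLinear)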

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Away Home
##       Away 1063 1109
##       Home  672 2285
##                                           
##                Accuracy : 0.6528          
##                  95% CI : (0.6395, 0.6658)
##     No Information Rate : 0.6617          
##     P-Value [Acc > NIR] : 0.9148          
##                                           
##                   Kappa : 0.2693          
##                                           
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.6127          
##             Specificity : 0.6732          
##          Pos Pred Value : 0.4894          
##          Neg Pred Value : 0.7727          
##              Prevalence : 0.3383          
##          Detection Rate : 0.2073          
##    Detection Prevalence : 0.4235          
##       Balanced Accuracy : 0.6430          
##                                           
##        'Positive' Class : Away            
##
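The summary statistics above follow directly from the four cell counts. A short base R check, using the row and column labels exactly as printed, reproduces them:

```r
# Cell counts copied from the printed table (rows = "Prediction", cols = "Reference")
cm <- matrix(c(1063, 672, 1109, 2285), nrow = 2,
             dimnames = list(Prediction = c("Away", "Home"),
                             Reference  = c("Away", "Home")))

total <- sum(cm)

accuracy    <- sum(diag(cm)) / total                  # correct / all
sensitivity <- cm["Away", "Away"] / sum(cm[, "Away"]) # 'Positive' class is Away
specificity <- cm["Home", "Home"] / sum(cm[, "Home"])
nir         <- max(colSums(cm)) / total               # no-information rate
balanced    <- (sensitivity + specificity) / 2

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, nir = nir, balanced = balanced), 4))
```

As printed, the accuracy (0.6528) sits slightly below the no-information rate (0.6617), which is what the large `P-Value [Acc > NIR]` reflects.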

Now let’s try to train an SVM with a non-linear kernel function. First we’ll try a radial kernel.

# Train a radial-kernel SVM with the same resampling and preprocessing settings (tuneLength left at caret's default)
svmRadial <- train(winner ~., data = train,
              method = "svmRadial",
              trControl = trainControl,
              preProcess = c("center","scale"))
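To show how the two kernels differ, here is a small base R sketch of a linear kernel and a radial (RBF) kernel. The function names, the example vectors, and the value of `sigma` are illustrative assumptions; the RBF form used here, exp(-sigma * ||x - y||^2), is one common parametrization.

```r
# Linear kernel: a plain dot product
linear_kernel <- function(x, y) sum(x * y)

# Radial (RBF) kernel: similarity decays with squared Euclidean distance
rbf_kernel <- function(x, y, sigma = 1) exp(-sigma * sum((x - y)^2))

a <- c(1, 0)
b <- c(0, 1)

k_lin_ab <- linear_kernel(a, b)  # orthogonal vectors give 0
k_rbf_ab <- rbf_kernel(a, b)     # exp(-2), since the squared distance is 2
k_rbf_aa <- rbf_kernel(a, a)     # identical points give 1
```

The linear kernel yields a flat separating hyperplane in the original feature space, while the radial kernel scores points by proximity, which allows curved decision boundaries at the cost of an extra parameter to tune.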

We can use the plot function on our radial SVM to see the classifier’s accuracy as a function of cost.

# Plot radial SVM tuning results (accuracy vs. cost)
plot(svmRadial)

Now that we have a radial-kernel SVM trained, we can predict against the test set and print our confusion matrix and model diagnostic metrics for this classifier.

predictionsRadial <- predict(svmRadial, test, decision.values = TRUE)

# Print out confusion matrix for SVM classifier with radial kernel
# (as with the linear model, the true labels are passed in the predictions slot)
confusionMatrix(test$winner, predictionsRadial)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Away Home
##       Away 1026 1146
##       Home  638 2319
##                                          
##                Accuracy : 0.6522         
##                  95% CI : (0.639, 0.6652)
##     No Information Rate : 0.6756         
##     P-Value [Acc > NIR] : 0.9998         
##                                          
##                   Kappa : 0.2648         
##                                          
##  Mcnemar's Test P-Value : <2e-16         
##                                          
##             Sensitivity : 0.6166         
##             Specificity : 0.6693         
##          Pos Pred Value : 0.4724         
##          Neg Pred Value : 0.7842         
##              Prevalence : 0.3244         
##          Detection Rate : 0.2000         
##    Detection Prevalence : 0.4235         
##       Balanced Accuracy : 0.6429         
##                                          
##        'Positive' Class : Away           
##

We see decent performance from our radial-kernel SVM (65% accuracy overall, comparable with the other methods tried). Since the radial kernel offers no marked improvement, we’d likely stay with the linear SVM in order to keep the simpler model.

While the SVM can be used for regression tasks, it is primarily used for classification, as the decision boundary it creates can separate distinct classes of data points within the feature space. In a business context, I’d likely go with the linear-kernel SVM: it produces similar results to the radial kernel but is simpler. This advantage in explainability is also partly why I’d prefer the SVM to the random forest approach, which is akin to a black box in terms of how the model is generated. While the SVM isn’t as intuitive as a linear regression, for instance, dimensionality reduction (such as PCA) could be applied first to reduce the data to two principal components, which can then be plotted along with the decision boundary. This would at least allow for a more helpful visualization of our SVM classifier and its decision boundary.
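A minimal sketch of that visualization idea follows. It uses synthetic data rather than the NFL dataset and, to stay within base R, fits the linear SVM with a hand-rolled sub-gradient descent on the hinge loss; `fit_linear_svm` and its default parameters are hypothetical, and in practice the model would more likely be refit on the two principal components with caret.

```r
set.seed(1)

# Synthetic stand-in data (NOT the NFL data): two classes, four features
n <- 200
X <- rbind(matrix(rnorm(n * 4, mean = -1), ncol = 4),
           matrix(rnorm(n * 4, mean =  1), ncol = 4))
y <- rep(c(-1, 1), each = n)   # class labels coded as -1 / +1

# Project onto the first two principal components
pca <- prcomp(X, center = TRUE, scale. = TRUE)
Z <- pca$x[, 1:2]

# Hypothetical helper: linear SVM fit by sub-gradient descent on the
# L2-regularised hinge loss (for illustration only)
fit_linear_svm <- function(Z, y, lambda = 0.01, lr = 0.1, epochs = 500) {
  w <- rep(0, ncol(Z))
  b <- 0
  for (epoch in seq_len(epochs)) {
    margin <- as.vector(y * (Z %*% w + b))
    viol <- margin < 1   # points on the wrong side of, or inside, the margin
    grad_w <- lambda * w - colSums(Z[viol, , drop = FALSE] * y[viol]) / nrow(Z)
    grad_b <- -sum(y[viol]) / nrow(Z)
    w <- w - lr * grad_w
    b <- b - lr * grad_b
  }
  list(w = w, b = b)
}

svm_fit <- fit_linear_svm(Z, y)
pred <- sign(as.vector(Z %*% svm_fit$w + svm_fit$b))
train_accuracy <- mean(pred == y)

# Plot the projected points and the boundary w1*PC1 + w2*PC2 + b = 0
plot(Z, col = ifelse(y == 1, "blue", "red"), pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Linear SVM boundary in PCA space")
abline(a = -svm_fit$b / svm_fit$w[2], b = -svm_fit$w[1] / svm_fit$w[2])
```

The trade-off is that the plotted boundary belongs to a model fit on two components only, so it approximates, rather than reproduces, a classifier trained on all of the original features.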


Andrew Bowen

2024-04-20
