Classification Cheatsheet for Machine Learning (R)

Introduction

This classification cheatsheet fits and compares seven common machine learning models on a binary wine quality dataset:

  • Source: https://github.com/sainsdataid/dataset/raw/refs/heads/main/wine-quality-binary.csv
  • Task: Predict a binary wine quality label based on physicochemical properties.

Models included:

  • Logistic Regression
  • K-Nearest Neighbors (KNN) Classification
  • Support Vector Machine (SVM) Classification
  • Decision Tree
  • Random Forest
  • XGBoost (Gradient Boosting)
  • Naive Bayes

Metrics:

  • Accuracy
  • F1-score (for the positive class)
  • Balanced Accuracy (accounts for class imbalance)

The code is written so you can adapt it easily to other binary classification datasets.

Required Packages

Main packages used:

  • tidyverse – data manipulation and basic visualization.
  • readr – fast CSV reading (part of tidyverse but loaded explicitly).
  • caret – unified interface for training ML models, cross-validation, preprocessing, and tuning.
  • e1071 – provides Naive Bayes and SVM implementations (also used internally by some caret methods).
  • kernlab – kernel-based methods, used by caret for svmRadial.
  • rpart – decision trees.
  • rpart.plot – easy visualization of rpart trees.
  • randomForest – Random Forest algorithm.
  • xgboost – efficient gradient boosting implementation.
  • MLmetrics – additional metrics such as Accuracy and F1-score (Balanced Accuracy is taken from caret's confusionMatrix instead).

# Load required packages
library(tidyverse)
library(readr)
library(caret)
library(e1071)        # Naive Bayes and SVM implementations
library(kernlab)      # Kernel methods, used by caret for svmRadial
library(rpart)        # Decision trees
library(rpart.plot)   # Tree plotting
library(randomForest) # Random Forest
library(xgboost)      # Gradient boosting
library(MLmetrics)    # Extra metrics: Accuracy, F1_Score

If you do not have these packages installed, you can run the following once:

install.packages(c(
  "tidyverse", "caret", "e1071", "kernlab",
  "rpart", "rpart.plot", "randomForest",
  "xgboost", "MLmetrics"
))
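
If some of these packages are already installed, reinstalling all of them is unnecessary. The following is a minimal sketch that only reports what is missing; missing_packages is a hypothetical helper name, not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

pkgs <- c(
  "tidyverse", "caret", "e1071", "kernlab",
  "rpart", "rpart.plot", "randomForest",
  "xgboost", "MLmetrics"
)

to_install <- missing_packages(pkgs)
to_install

# Install only what is missing (uncomment to run)
# if (length(to_install) > 0) install.packages(to_install)
```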

0. Data Loading & Train-Test Split

We use the wine-quality-binary dataset from the provided GitHub URL.

# URL for the wine quality binary dataset
url_wine <- "https://github.com/sainsdataid/dataset/raw/refs/heads/main/wine-quality-binary.csv"

# Read CSV from URL
df <- read_csv(url_wine)

# Inspect structure and column names
str(df)
## spc_tbl_ [1,143 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ id                  : num [1:1143] 1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num [1:1143] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
##  $ volatile.acidity    : num [1:1143] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
##  $ citric.acid         : num [1:1143] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
##  $ residual.sugar      : num [1:1143] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
##  $ chlorides           : num [1:1143] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
##  $ free.sulfur.dioxide : num [1:1143] 11 25 15 17 11 13 15 15 9 15 ...
##  $ total.sulfur.dioxide: num [1:1143] 34 67 54 60 34 40 59 21 18 65 ...
##  $ density             : num [1:1143] 0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num [1:1143] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
##  $ sulphates           : num [1:1143] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
##  $ alcohol             : num [1:1143] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
##  $ quality             : chr [1:1143] "LOW" "LOW" "LOW" "HIGH" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   id = col_double(),
##   ..   fixed.acidity = col_double(),
##   ..   volatile.acidity = col_double(),
##   ..   citric.acid = col_double(),
##   ..   residual.sugar = col_double(),
##   ..   chlorides = col_double(),
##   ..   free.sulfur.dioxide = col_double(),
##   ..   total.sulfur.dioxide = col_double(),
##   ..   density = col_double(),
##   ..   pH = col_double(),
##   ..   sulphates = col_double(),
##   ..   alcohol = col_double(),
##   ..   quality = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
names(df)
##  [1] "id"                   "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
# Assume the LAST column is the binary target
# Convert it to a factor (classification)
target_name <- names(df)[ncol(df)]
target_name
## [1] "quality"
df[[target_name]] <- as.factor(df[[target_name]])

# Check class distribution
table(df[[target_name]])
## 
## HIGH  LOW 
##  621  522
set.seed(123)

# Create an 80/20 train-test split stratified by the target
train_idx <- createDataPartition(df[[target_name]], p = 0.8, list = FALSE)
train <- df[train_idx, ]
test  <- df[-train_idx, ]

# Check dimensions of train and test sets
dim(train); dim(test)
## [1] 915  13
## [1] 228  13
table(train[[target_name]]); table(test[[target_name]])
## 
## HIGH  LOW 
##  497  418
## 
## HIGH  LOW 
##  124  104
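
To see what "stratified" means here, the sketch below samples 80% of the rows within each class using base R only. It is an illustration under simple assumptions, not caret's actual implementation; stratified_index and the toy labels are made up for this example.

```r
# Hypothetical helper: stratified split indices, sampling within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    n_take <- ceiling(p * length(idx))
    idx[sample.int(length(idx), n_take)]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(123)
# Toy labels with the same class counts as the wine data (621 HIGH, 522 LOW)
y_toy <- factor(rep(c("HIGH", "LOW"), times = c(621, 522)))
idx <- stratified_index(y_toy, p = 0.8)

table(y_toy[idx])    # training part
table(y_toy[-idx])   # test part
```

Because sampling happens separately per class, the class proportions in the training and test parts stay close to those of the full data.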

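To make the code generic, we will:

  • Build the model formula programmatically from target_name. Because the formula is target ~ ., every other column is used as a predictor, including the row identifier id; the outputs below reflect that. In your own projects, drop identifier columns before modeling.
  • Treat the second level of the factor as the positive class (pos_class) when thresholding probabilities and computing F1. Note that caret::confusionMatrix() defaults to the first level, so the Sensitivity, Specificity, and 'Positive' Class lines in the printed confusion matrices refer to HIGH unless you pass positive = pos_class.

# Build a formula like: target ~ .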
form <- as.formula(paste(target_name, "~ ."))

# Identify positive class as the second level of the factor
levels(train[[target_name]])
## [1] "HIGH" "LOW"
pos_class <- levels(train[[target_name]])[2]
pos_class
## [1] "LOW"
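
One reason the second level is a convenient choice: with a factor response, glm() with a binomial family treats the first level as "failure" and models the probability of the other level. The toy data below are simulated purely for illustration.

```r
set.seed(1)
# Toy data: larger x makes "LOW" more likely
x <- c(rnorm(50, mean = 0), rnorm(50, mean = 2))
y <- factor(rep(c("HIGH", "LOW"), each = 50))  # levels: "HIGH", "LOW"

fit <- glm(y ~ x, family = binomial)
p <- predict(fit, type = "response")

# Mean fitted probability per true class:
# it is higher for "LOW", so p estimates P(y == "LOW")
tapply(p, y, mean)
```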

1. Logistic Regression

Logistic regression is a classic baseline model for binary classification.

# Fit a logistic regression model (binomial family for binary outcome)
mod_log <- glm(form, data = train, family = binomial)

# Inspect model summary (coefficients, significance, etc.)
summary(mod_log)
## 
## Call:
## glm(formula = form, family = binomial, data = train)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           5.677e+01  1.054e+02   0.539 0.590231    
## id                   -1.367e-04  2.655e-04  -0.515 0.606694    
## fixed.acidity        -6.337e-02  1.325e-01  -0.478 0.632370    
## volatile.acidity      3.539e+00  6.541e-01   5.411 6.28e-08 ***
## citric.acid           1.945e+00  7.454e-01   2.610 0.009057 ** 
## residual.sugar       -3.510e-03  7.594e-02  -0.046 0.963133    
## chlorides             5.578e+00  2.292e+00   2.434 0.014935 *  
## free.sulfur.dioxide  -1.240e-02  1.137e-02  -1.091 0.275308    
## total.sulfur.dioxide  1.240e-02  3.728e-03   3.326 0.000882 ***
## density              -5.068e+01  1.077e+02  -0.471 0.637914    
## pH                    9.044e-01  9.657e-01   0.937 0.349001    
## sulphates            -3.366e+00  6.557e-01  -5.134 2.84e-07 ***
## alcohol              -9.581e-01  1.408e-01  -6.805 1.01e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1261.63  on 914  degrees of freedom
## Residual deviance:  949.39  on 902  degrees of freedom
## AIC: 975.39
## 
## Number of Fisher Scoring iterations: 4
# Predict probabilities for the positive class on the test set
prob_log <- predict(mod_log, newdata = test, type = "response")

# Convert probabilities to class labels using a 0.5 threshold
# - glm() models the probability of the second factor level (pos_class)
# - If prob > 0.5 → positive class
# - Otherwise → first level (negative class)
neg_class <- levels(train[[target_name]])[1]

pred_log <- ifelse(prob_log > 0.5, pos_class, neg_class)
pred_log <- factor(pred_log, levels = levels(train[[target_name]]))

# Confusion matrix and basic metrics
cm_log <- confusionMatrix(pred_log, test[[target_name]])
cm_log
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   98  25
##       LOW    26  79
##                                           
##                Accuracy : 0.7763          
##                  95% CI : (0.7166, 0.8287)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 2.553e-13       
##                                           
##                   Kappa : 0.5495          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7903          
##             Specificity : 0.7596          
##          Pos Pred Value : 0.7967          
##          Neg Pred Value : 0.7524          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4298          
##    Detection Prevalence : 0.5395          
##       Balanced Accuracy : 0.7750          
##                                           
##        'Positive' Class : HIGH            
## 
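
The printed matrix treats HIGH as the positive class, while the rest of the cheatsheet uses pos_class (LOW). As a small worked check, the per-class recall can be recomputed from the counts above; recall_for is a hypothetical helper written for this illustration.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = prediction, columns = reference)
cm <- matrix(
  c(98, 26, 25, 79), nrow = 2,
  dimnames = list(Prediction = c("HIGH", "LOW"), Reference = c("HIGH", "LOW"))
)

# Recall (sensitivity) of a class = correct predictions / actual members of that class
recall_for <- function(cm, cls) cm[cls, cls] / sum(cm[, cls])

recall_for(cm, "HIGH")  # matches the printed Sensitivity (0.7903)
recall_for(cm, "LOW")   # matches the printed Specificity (0.7596)
```

In other words, switching the positive class swaps Sensitivity and Specificity; Accuracy and Balanced Accuracy stay the same.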

2. KNN Classification

K-Nearest Neighbors is a non-parametric method that classifies based on majority vote of nearest neighbors.

set.seed(123)

# Define cross-validation strategy: 5-fold CV
ctrl <- trainControl(method = "cv", number = 5)

# Train KNN classifier:
# - center and scale features
# - tune k over several values
mod_knn <- train(
  form,
  data = train,
  method = "knn",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10
)

# Show best k chosen by cross-validation
mod_knn$bestTune
##    k
## 5 13
# Predict on test set
pred_knn <- predict(mod_knn, newdata = test)

# Confusion matrix for KNN
cm_knn <- confusionMatrix(pred_knn, test[[target_name]])
cm_knn
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH  101  35
##       LOW    23  69
##                                           
##                Accuracy : 0.7456          
##                  95% CI : (0.6839, 0.8008)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 2.615e-10       
##                                           
##                   Kappa : 0.4825          
##                                           
##  Mcnemar's Test P-Value : 0.1486          
##                                           
##             Sensitivity : 0.8145          
##             Specificity : 0.6635          
##          Pos Pred Value : 0.7426          
##          Neg Pred Value : 0.7500          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4430          
##    Detection Prevalence : 0.5965          
##       Balanced Accuracy : 0.7390          
##                                           
##        'Positive' Class : HIGH            
## 
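
To make the majority-vote idea concrete, here is a minimal base-R sketch for a single new observation. It is not how caret implements KNN; knn_predict_one and the toy data are hypothetical and only illustrate standardization, distance, and voting.

```r
# Hypothetical helper: classify one new point by majority vote of its k nearest neighbors
knn_predict_one <- function(x_train, y_train, x_new, k = 3) {
  # Standardize with training means and standard deviations only
  mu <- colMeans(x_train)
  s  <- apply(x_train, 2, sd)
  z_train <- scale(x_train, center = mu, scale = s)
  z_new   <- (x_new - mu) / s

  # Euclidean distance from the new point to every training row
  d <- sqrt(rowSums(sweep(z_train, 2, z_new)^2))

  # Majority vote among the k closest rows
  votes <- table(y_train[order(d)[seq_len(k)]])
  names(votes)[which.max(votes)]
}

# Toy data: two well-separated groups on very different feature scales
x_toy <- cbind(a = c(1, 2, 3, 10, 11, 12), b = c(100, 120, 110, 500, 520, 510))
y_toy <- factor(c("HIGH", "HIGH", "HIGH", "LOW", "LOW", "LOW"))

knn_predict_one(x_toy, y_toy, x_new = c(a = 2, b = 105), k = 3)
knn_predict_one(x_toy, y_toy, x_new = c(a = 11, b = 505), k = 3)
```

Standardizing first matters because distances would otherwise be dominated by the feature with the largest scale.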

3. SVM Classification

Support Vector Machines (SVM) can model complex decision boundaries using kernels.

set.seed(123)

# Train SVM with radial basis function kernel:
# - caret method "svmRadial" uses kernlab::ksvm under the hood
# - features are centered and scaled
# - tuneLength sets how many cost (C) values are tried; by default caret
#   estimates sigma from the data (kernlab::sigest) and keeps it fixed
mod_svm <- train(
  form,
  data = train,
  method = "svmRadial",
  trControl = ctrl,
  preProcess = c("center", "scale"),
  tuneLength = 10
)

# Best hyperparameters (tuned C and the estimated sigma)
mod_svm$bestTune
##        sigma C
## 6 0.07461024 8
# Predict on test set
pred_svm <- predict(mod_svm, newdata = test)

# Confusion matrix for SVM
cm_svm <- confusionMatrix(pred_svm, test[[target_name]])
cm_svm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   97  25
##       LOW    27  79
##                                           
##                Accuracy : 0.7719          
##                  95% CI : (0.7119, 0.8247)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 7.383e-13       
##                                           
##                   Kappa : 0.541           
##                                           
##  Mcnemar's Test P-Value : 0.8897          
##                                           
##             Sensitivity : 0.7823          
##             Specificity : 0.7596          
##          Pos Pred Value : 0.7951          
##          Neg Pred Value : 0.7453          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4254          
##    Detection Prevalence : 0.5351          
##       Balanced Accuracy : 0.7709          
##                                           
##        'Positive' Class : HIGH            
## 
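
As a rough illustration of what the radial kernel measures, the sketch below evaluates it by hand in kernlab's parametrization, k(x, y) = exp(-sigma * ||x - y||^2). The function rbf_kernel and the three points are made up for this example; sigma is the value reported above.

```r
# RBF kernel in kernlab's parametrization: k(x, y) = exp(-sigma * ||x - y||^2)
rbf_kernel <- function(x, y, sigma) {
  exp(-sigma * sum((x - y)^2))
}

sigma <- 0.07461024  # value reported in mod_svm$bestTune above

a <- c(0, 0)
b <- c(1, 1)
far <- c(5, 5)

rbf_kernel(a, a, sigma)    # identical points give 1
rbf_kernel(a, b, sigma)    # nearby points give a value close to 1
rbf_kernel(a, far, sigma)  # distant points give a value close to 0
```

A larger sigma makes the similarity drop off faster with distance, which allows more flexible decision boundaries.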

4. Decision Tree Classification

Decision trees split the feature space with simple rules and are easy to visualize.

# Fit a decision tree (classification) using rpart
mod_tree <- rpart(form, data = train, method = "class")

# Plot the tree structure
rpart.plot(mod_tree, main = "Decision Tree for Wine Quality")

# Predict class labels on test set
pred_tree <- predict(mod_tree, newdata = test, type = "class")

# Confusion matrix for decision tree
cm_tree <- confusionMatrix(pred_tree, test[[target_name]])
cm_tree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   92  25
##       LOW    32  79
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6886, 0.8048)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 1.042e-10       
##                                           
##                   Kappa : 0.4988          
##                                           
##  Mcnemar's Test P-Value : 0.4268          
##                                           
##             Sensitivity : 0.7419          
##             Specificity : 0.7596          
##          Pos Pred Value : 0.7863          
##          Neg Pred Value : 0.7117          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4035          
##    Detection Prevalence : 0.5132          
##       Balanced Accuracy : 0.7508          
##                                           
##        'Positive' Class : HIGH            
## 
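
For classification, rpart chooses splits by default using the Gini index. The sketch below computes Gini impurity before and after one candidate split; gini, split_gini, and the toy values are hypothetical and only illustrate the criterion, not rpart's internals.

```r
# Gini impurity of a vector of class labels: 1 - sum(p_k^2)
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Weighted impurity after splitting on x < threshold
# (assumes both sides of the split are non-empty)
split_gini <- function(x, y, threshold) {
  left <- x < threshold
  n <- length(y)
  (sum(left) / n) * gini(y[left]) + (sum(!left) / n) * gini(y[!left])
}

# Toy example: an alcohol-like feature that separates the classes fairly well
x_toy <- c(9.0, 9.2, 9.4, 9.5, 10.8, 11.0, 11.5, 12.0)
y_toy <- factor(c("LOW", "LOW", "LOW", "HIGH", "LOW", "HIGH", "HIGH", "HIGH"))

gini(y_toy)                    # impurity before the split
split_gini(x_toy, y_toy, 10)   # impurity after splitting at 10
```

A split is useful when the weighted impurity after it is lower than the impurity before it.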

5. Random Forest Classification

Random Forest builds many trees on bootstrap samples and aggregates their predictions.

set.seed(123)

# Train a Random Forest classifier:
# - ntree: number of trees
# - mtry: number of variables sampled at each split
mod_rf <- randomForest(
  form,
  data = train,
  ntree = 300,
  mtry = floor(sqrt(ncol(train) - 1)), # heuristic: sqrt of number of predictors
  importance = TRUE
)

# Variable importance table
importance(mod_rf)
##                           HIGH       LOW MeanDecreaseAccuracy MeanDecreaseGini
## id                   14.028997 10.564999             17.62732         35.10140
## fixed.acidity        11.349800  9.440919             15.22809         27.87865
## volatile.acidity     15.973863 16.112758             22.32810         46.66662
## citric.acid          10.236717  9.417036             14.44282         26.75317
## residual.sugar        5.126901  8.774959             10.21065         21.10028
## chlorides            10.583338 11.525067             15.46892         29.40577
## free.sulfur.dioxide   8.763134 10.476440             15.27219         23.55293
## total.sulfur.dioxide 21.249480 14.951319             26.16640         44.49082
## density              13.393207 10.726715             18.12404         34.90799
## pH                   12.895002  9.506639             15.88388         28.59011
## sulphates            28.404499 24.152408             34.65636         57.86420
## alcohol              28.504829 31.735833             40.70793         77.16747
varImpPlot(mod_rf, main = "Random Forest Variable Importance")

# Predict on test set
pred_rf <- predict(mod_rf, newdata = test)

# Confusion matrix for Random Forest
cm_rf <- confusionMatrix(pred_rf, test[[target_name]])
cm_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH  101  24
##       LOW    23  80
##                                           
##                Accuracy : 0.7939          
##                  95% CI : (0.7355, 0.8444)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 2.84e-15        
##                                           
##                   Kappa : 0.5842          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8145          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.8080          
##          Neg Pred Value : 0.7767          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4430          
##    Detection Prevalence : 0.5482          
##       Balanced Accuracy : 0.7919          
##                                           
##        'Positive' Class : HIGH            
## 
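
The two ingredients named above, bootstrap sampling and aggregation, can be sketched in a few lines of base R. The per-tree predictions below are invented for illustration; a real forest would obtain them from trees fitted on each bootstrap sample.

```r
set.seed(123)
n <- 10

# Bootstrap sample: draw n row indices with replacement
boot_idx <- sample.int(n, n, replace = TRUE)
oob_idx  <- setdiff(seq_len(n), boot_idx)  # rows never drawn ("out-of-bag")

# Hypothetical per-tree predictions: one row per observation, one column per tree
tree_preds <- matrix(
  c("HIGH", "HIGH", "LOW",
    "LOW",  "LOW",  "HIGH",
    "LOW",  "LOW",  "LOW"),
  ncol = 3, byrow = TRUE
)

# Aggregate by majority vote across trees
majority_vote <- function(votes) names(which.max(table(votes)))
apply(tree_preds, 1, majority_vote)
```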

6. XGBoost Classification

XGBoost is a powerful gradient boosting library; here we use it for binary classification.

# Prepare model matrices (one-hot encoding if needed) for xgboost
x_train <- model.matrix(form, data = train)[, -1]  # remove intercept
y_train <- as.numeric(train[[target_name]]) - 1    # factor to 0/1: first level = 0, second level (pos_class) = 1

x_test  <- model.matrix(form, data = test)[, -1]
y_test  <- as.numeric(test[[target_name]]) - 1

# Create DMatrix objects for xgboost
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest  <- xgb.DMatrix(data = x_test, label = y_test)

# Set basic parameters for binary classification
params <- list(
  objective = "binary:logistic",  # logistic output for binary class
  eval_metric = "logloss",
  max_depth = 5,
  eta = 0.2,
  subsample = 0.8,
  colsample_bytree = 0.8
)

set.seed(123)

# Train the XGBoost model
mod_xgb <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 200,
  verbose = 0
)

# Predict probabilities for the positive class
prob_xgb <- predict(mod_xgb, newdata = dtest)

# Convert probabilities to class labels using 0.5 threshold
pred_xgb <- ifelse(prob_xgb > 0.5, pos_class, neg_class)
pred_xgb <- factor(pred_xgb, levels = levels(train[[target_name]]))

# Confusion matrix for XGBoost
cm_xgb <- confusionMatrix(pred_xgb, test[[target_name]])
cm_xgb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   97  24
##       LOW    27  80
##                                           
##                Accuracy : 0.7763          
##                  95% CI : (0.7166, 0.8287)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 2.553e-13       
##                                           
##                   Kappa : 0.5502          
##                                           
##  Mcnemar's Test P-Value : 0.7794          
##                                           
##             Sensitivity : 0.7823          
##             Specificity : 0.7692          
##          Pos Pred Value : 0.8017          
##          Neg Pred Value : 0.7477          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4254          
##    Detection Prevalence : 0.5307          
##       Balanced Accuracy : 0.7757          
##                                           
##        'Positive' Class : HIGH            
## 
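
The label encoding is easy to get backwards, so here is a small standalone check of how a two-level factor maps to 0/1 and back. The probabilities are made up for illustration.

```r
y <- factor(c("HIGH", "LOW", "LOW", "HIGH"))  # levels: "HIGH", "LOW"

# as.numeric() on a factor returns level positions (1, 2), so subtracting 1 gives 0/1
y01 <- as.numeric(y) - 1
y01

# Label 1 corresponds to the second level, so predicted probabilities are P("LOW")
prob <- c(0.10, 0.80, 0.55, 0.40)  # made-up probabilities for illustration
pred <- factor(ifelse(prob > 0.5, levels(y)[2], levels(y)[1]), levels = levels(y))
pred
```

If the factor levels were ordered differently, the same code would silently predict the other class, so it is worth checking levels() whenever the data change.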

7. Naive Bayes

Naive Bayes assumes conditional independence between features given the class label.

# Train a Naive Bayes classifier using e1071
mod_nb <- naiveBayes(form, data = train)

# Predict on test set
pred_nb <- predict(mod_nb, newdata = test)

# Confusion matrix for Naive Bayes
cm_nb <- confusionMatrix(pred_nb, test[[target_name]])
cm_nb
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction HIGH LOW
##       HIGH   96  29
##       LOW    28  75
##                                           
##                Accuracy : 0.75            
##                  95% CI : (0.6886, 0.8048)
##     No Information Rate : 0.5439          
##     P-Value [Acc > NIR] : 1.042e-10       
##                                           
##                   Kappa : 0.4957          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7742          
##             Specificity : 0.7212          
##          Pos Pred Value : 0.7680          
##          Neg Pred Value : 0.7282          
##              Prevalence : 0.5439          
##          Detection Rate : 0.4211          
##    Detection Prevalence : 0.5482          
##       Balanced Accuracy : 0.7477          
##                                           
##        'Positive' Class : HIGH            
## 
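
For numeric predictors, the independence assumption means the class score is the prior times a product of per-feature densities. The following sketch works this out for a Gaussian model in base R; gnb_posterior and the toy data are hypothetical and are not e1071's code.

```r
# Hypothetical helper: Gaussian Naive Bayes posterior for one new observation
gnb_posterior <- function(x_train, y_train, x_new) {
  scores <- sapply(levels(y_train), function(cls) {
    rows <- y_train == cls
    prior <- mean(rows)
    mu <- colMeans(x_train[rows, , drop = FALSE])
    s  <- apply(x_train[rows, , drop = FALSE], 2, sd)
    # Independence assumption: multiply per-feature densities
    prior * prod(dnorm(x_new, mean = mu, sd = s))
  })
  scores / sum(scores)  # normalize so the posteriors sum to 1
}

x_toy <- cbind(
  alcohol   = c(9.2, 9.5, 9.4, 9.8, 11.5, 12.0, 11.8, 12.4),
  sulphates = c(0.50, 0.55, 0.52, 0.58, 0.75, 0.80, 0.70, 0.85)
)
y_toy <- factor(rep(c("LOW", "HIGH"), each = 4))

post <- gnb_posterior(x_toy, y_toy, x_new = c(alcohol = 12.1, sulphates = 0.78))
post
```

The new observation sits close to the HIGH group on both features, so nearly all posterior mass goes to HIGH.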

8. Model Comparison (Accuracy, F1, Balanced Accuracy)

We now compare all models using three metrics:

  • Accuracy
  • F1-score (for positive class = pos_class)
  • Balanced Accuracy (average of recall for each class)

# Helper: extract Balanced Accuracy from caret's confusion matrix
get_balanced_accuracy <- function(pred, truth) {
  cm <- caret::confusionMatrix(pred, truth)
  as.numeric(cm$byClass["Balanced Accuracy"])
}

# Build a summary table of metrics for each model
results <- tibble(
  Model = c(
    "Logistic Regression",
    "KNN",
    "SVM",
    "Decision Tree",
    "Random Forest",
    "XGBoost",
    "Naive Bayes"
  ),
  Accuracy = c(
    Accuracy(pred_log, test[[target_name]]),
    Accuracy(pred_knn, test[[target_name]]),
    Accuracy(pred_svm, test[[target_name]]),
    Accuracy(pred_tree, test[[target_name]]),
    Accuracy(pred_rf, test[[target_name]]),
    Accuracy(pred_xgb, test[[target_name]]),
    Accuracy(pred_nb, test[[target_name]])
  ),
  F1 = c(
    # MLmetrics::F1_Score() expects y_true first, so name the arguments explicitly
    F1_Score(y_true = test[[target_name]], y_pred = pred_log,  positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_knn,  positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_svm,  positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_tree, positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_rf,   positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_xgb,  positive = pos_class),
    F1_Score(y_true = test[[target_name]], y_pred = pred_nb,   positive = pos_class)
  ),
  Balanced_Accuracy = c(
    get_balanced_accuracy(pred_log,  test[[target_name]]),
    get_balanced_accuracy(pred_knn,  test[[target_name]]),
    get_balanced_accuracy(pred_svm,  test[[target_name]]),
    get_balanced_accuracy(pred_tree, test[[target_name]]),
    get_balanced_accuracy(pred_rf,   test[[target_name]]),
    get_balanced_accuracy(pred_xgb,  test[[target_name]]),
    get_balanced_accuracy(pred_nb,   test[[target_name]])
  )
) %>%
  arrange(desc(Accuracy))

results
## # A tibble: 7 × 4
##   Model               Accuracy    F1 Balanced_Accuracy
##   <chr>                  <dbl> <dbl>             <dbl>
## 1 Random Forest          0.794 0.773             0.792
## 2 Logistic Regression    0.776 0.756             0.775
## 3 XGBoost                0.776 0.758             0.776
## 4 SVM                    0.772 0.752             0.771
## 5 Decision Tree          0.75  0.735             0.751
## 6 Naive Bayes            0.75  0.725             0.748
## 7 KNN                    0.746 0.704             0.739
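
As a sanity check, the three metrics can be recomputed by hand from confusion-matrix counts. The sketch below uses the logistic regression counts reported earlier, with LOW as the positive class; the variable names are chosen for this example only.

```r
# Counts from the logistic regression confusion matrix (positive class = "LOW")
tp <- 79  # predicted LOW,  actual LOW
fp <- 26  # predicted LOW,  actual HIGH
fn <- 25  # predicted HIGH, actual LOW
tn <- 98  # predicted HIGH, actual HIGH

accuracy  <- (tp + tn) / (tp + fp + fn + tn)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

# Balanced accuracy: mean of the recall of each class
balanced_accuracy <- (tp / (tp + fn) + tn / (tn + fp)) / 2

round(c(Accuracy = accuracy, F1 = f1, Balanced_Accuracy = balanced_accuracy), 3)
```

The rounded values agree with the Logistic Regression row of the table (0.776, 0.756, 0.775).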

9. Final Notes

  • This cheatsheet is meant as a practical reference for quickly fitting and comparing common classification models on a real binary dataset.
  • For real-world projects, you should also consider:
    • More thorough hyperparameter tuning (e.g., grid search, random search, Bayesian optimization).
    • Proper handling of missing values and outliers.
    • Feature engineering and domain-specific transformations.
    • Using ROC curves, PR curves, and calibration plots for detailed evaluation.

You can adapt these code snippets to other binary classification problems by changing the data loading step. target_name is taken from the last column, so either keep the target there or set target_name explicitly.
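
One possible way to package the setup steps for a new dataset is sketched below. prepare_binary is a hypothetical helper, and the built-in mtcars data stand in for your own data frame.

```r
# Hypothetical helper: set up the target, formula, and class labels for a new dataset
prepare_binary <- function(df, target_name = names(df)[ncol(df)]) {
  df[[target_name]] <- as.factor(df[[target_name]])
  stopifnot(nlevels(df[[target_name]]) == 2)  # the cheatsheet assumes two classes
  list(
    data      = df,
    form      = as.formula(paste(target_name, "~ .")),
    neg_class = levels(df[[target_name]])[1],
    pos_class = levels(df[[target_name]])[2]
  )
}

# Example with a built-in dataset: predict transmission type (am) in mtcars
cars <- mtcars
cars$am <- ifelse(cars$am == 1, "manual", "automatic")
setup <- prepare_binary(cars, target_name = "am")

setup$form
setup$pos_class
```

The returned form, neg_class, and pos_class can then be used in place of the objects defined in Section 0.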