Cheatsheet Classification for Machine Learning (R)
Introduction
This classification cheatsheet demonstrates several common machine learning models using a binary wine quality dataset:
- Source: https://github.com/sainsdataid/dataset/raw/refs/heads/main/wine-quality-binary.csv
- Task: Predict a binary wine quality label based on physicochemical properties.
Models included:
- Logistic Regression
- K-Nearest Neighbors (KNN) Classification
- Support Vector Machine (SVM) Classification
- Decision Tree
- Random Forest
- XGBoost (Gradient Boosting)
- Naive Bayes
Metrics:
- Accuracy
- F1-score (for the positive class)
- Balanced Accuracy (accounts for class imbalance)
The code is written so you can adapt it easily to other binary classification datasets.
Required Packages
Main packages used:
- tidyverse – data manipulation and basic visualization.
- readr – fast CSV reading (part of tidyverse but loaded explicitly).
- caret – unified interface for training ML models, cross-validation, preprocessing, and tuning.
- e1071 – provides Naive Bayes and SVM implementations (also used internally by some caret methods).
- kernlab – kernel-based methods, used by caret for svmRadial.
- rpart – decision trees.
- rpart.plot – easy visualization of rpart trees.
- randomForest – Random Forest algorithm.
- xgboost – efficient gradient boosting implementation.
- MLmetrics – additional metrics such as F1-score, Balanced Accuracy, etc.
# Load required packages
library(tidyverse)
library(readr)
library(caret)
library(e1071) # Naive Bayes and SVM implementations
library(kernlab) # Kernel methods, used by caret for svmRadial
library(rpart) # Decision trees
library(rpart.plot) # Tree plotting
library(randomForest) # Random Forest
library(xgboost) # Gradient boosting
library(MLmetrics) # Extra metrics: F1, Balanced Accuracy
If you do not have these packages installed, you can run the following once:
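# One-time installation of all packages used in this cheatsheet
# (tidyverse already includes readr)
install.packages(c(
  "tidyverse", "caret", "e1071", "kernlab", "rpart", "rpart.plot",
  "randomForest", "xgboost", "MLmetrics"
))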
0. Data Loading & Train-Test Split
We use the wine-quality-binary dataset from the provided GitHub URL.
# URL for the wine quality binary dataset
url_wine <- "https://github.com/sainsdataid/dataset/raw/refs/heads/main/wine-quality-binary.csv"
# Read CSV from URL
df <- read_csv(url_wine)
# Inspect structure and column names
str(df)
## spc_tbl_ [1,143 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ id : num [1:1143] 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num [1:1143] 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 6.7 ...
## $ volatile.acidity : num [1:1143] 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.58 ...
## $ citric.acid : num [1:1143] 0 0 0.04 0.56 0 0 0.06 0 0.02 0.08 ...
## $ residual.sugar : num [1:1143] 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 1.8 ...
## $ chlorides : num [1:1143] 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.097 ...
## $ free.sulfur.dioxide : num [1:1143] 11 25 15 17 11 13 15 15 9 15 ...
## $ total.sulfur.dioxide: num [1:1143] 34 67 54 60 34 40 59 21 18 65 ...
## $ density : num [1:1143] 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num [1:1143] 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.28 ...
## $ sulphates : num [1:1143] 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.54 ...
## $ alcohol : num [1:1143] 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 9.2 ...
## $ quality : chr [1:1143] "LOW" "LOW" "LOW" "HIGH" ...
## - attr(*, "spec")=
## .. cols(
## .. id = col_double(),
## .. fixed.acidity = col_double(),
## .. volatile.acidity = col_double(),
## .. citric.acid = col_double(),
## .. residual.sugar = col_double(),
## .. chlorides = col_double(),
## .. free.sulfur.dioxide = col_double(),
## .. total.sulfur.dioxide = col_double(),
## .. density = col_double(),
## .. pH = col_double(),
## .. sulphates = col_double(),
## .. alcohol = col_double(),
## .. quality = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
## [1] "id" "fixed.acidity" "volatile.acidity"
## [4] "citric.acid" "residual.sugar" "chlorides"
## [7] "free.sulfur.dioxide" "total.sulfur.dioxide" "density"
## [10] "pH" "sulphates" "alcohol"
## [13] "quality"
# Assume the LAST column is the binary target
# Convert it to a factor (classification)
target_name <- names(df)[ncol(df)]
target_name
## [1] "quality"
df[[target_name]] <- as.factor(df[[target_name]])
# Check class distribution
table(df[[target_name]])
##
## HIGH LOW
## 621 522
set.seed(123)
# Create an 80/20 train-test split stratified by the target
train_idx <- createDataPartition(df[[target_name]], p = 0.8, list = FALSE)
train <- df[train_idx, ]
test <- df[-train_idx, ]
# Check dimensions of train and test sets
dim(train); dim(test)
## [1] 915 13
## [1] 228 13
# Verify the stratified class distributions in train and test
table(train[[target_name]]); table(test[[target_name]])
##
## HIGH LOW
## 497 418
##
## HIGH LOW
## 124 104
To make the code generic, we will:
- Build the model formula programmatically from target_name.
- Always treat the second level of the factor as the positive class for F1. (Note that caret::confusionMatrix reports its per-class statistics for the first level by default, as the outputs below show.)
# Build a formula like: target ~ .
form <- as.formula(paste(target_name, "~ ."))
# Identify the positive class as the second level of the factor
levels(train[[target_name]])
## [1] "HIGH" "LOW"
pos_class <- levels(train[[target_name]])[2]
pos_class
## [1] "LOW"
1. Logistic Regression
Logistic regression is a classic baseline model for binary classification.
# Fit a logistic regression model (binomial family for binary outcome)
mod_log <- glm(form, data = train, family = binomial)
# Inspect model summary (coefficients, significance, etc.)
summary(mod_log)
##
## Call:
## glm(formula = form, family = binomial, data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.677e+01 1.054e+02 0.539 0.590231
## id -1.367e-04 2.655e-04 -0.515 0.606694
## fixed.acidity -6.337e-02 1.325e-01 -0.478 0.632370
## volatile.acidity 3.539e+00 6.541e-01 5.411 6.28e-08 ***
## citric.acid 1.945e+00 7.454e-01 2.610 0.009057 **
## residual.sugar -3.510e-03 7.594e-02 -0.046 0.963133
## chlorides 5.578e+00 2.292e+00 2.434 0.014935 *
## free.sulfur.dioxide -1.240e-02 1.137e-02 -1.091 0.275308
## total.sulfur.dioxide 1.240e-02 3.728e-03 3.326 0.000882 ***
## density -5.068e+01 1.077e+02 -0.471 0.637914
## pH 9.044e-01 9.657e-01 0.937 0.349001
## sulphates -3.366e+00 6.557e-01 -5.134 2.84e-07 ***
## alcohol -9.581e-01 1.408e-01 -6.805 1.01e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1261.63 on 914 degrees of freedom
## Residual deviance: 949.39 on 902 degrees of freedom
## AIC: 975.39
##
## Number of Fisher Scoring iterations: 4
# Predict probabilities for the positive class on the test set
prob_log <- predict(mod_log, newdata = test, type = "response")
# Convert probabilities to class labels using 0.5 threshold
# - If prob > 0.5 → positive class
# - Otherwise → first level (negative class)
neg_class <- levels(train[[target_name]])[1]
pred_log <- ifelse(prob_log > 0.5, pos_class, neg_class)
pred_log <- factor(pred_log, levels = levels(train[[target_name]]))
# Confusion matrix and basic metrics
cm_log <- confusionMatrix(pred_log, test[[target_name]])
cm_log
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 98 25
## LOW 26 79
##
## Accuracy : 0.7763
## 95% CI : (0.7166, 0.8287)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 2.553e-13
##
## Kappa : 0.5495
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7903
## Specificity : 0.7596
## Pos Pred Value : 0.7967
## Neg Pred Value : 0.7524
## Prevalence : 0.5439
## Detection Rate : 0.4298
## Detection Prevalence : 0.5395
## Balanced Accuracy : 0.7750
##
## 'Positive' Class : HIGH
##
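Because glm() with a binomial family models the probability of the second factor level ("LOW" here), the coefficients above are on the log-odds scale for that class; exponentiating them gives odds ratios, which are often easier to interpret. A quick sketch using the fitted model:
# Odds ratios: multiplicative change in the odds of the modeled class
# (the second factor level, "LOW") per one-unit increase in each predictor
exp(coef(mod_log))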
2. KNN Classification
K-Nearest Neighbors is a non-parametric method that classifies based on majority vote of nearest neighbors.
set.seed(123)
# Define cross-validation strategy: 5-fold CV
ctrl <- trainControl(method = "cv", number = 5)
# Train KNN classifier:
# - center and scale features
# - tune k over several values
mod_knn <- train(
form,
data = train,
method = "knn",
trControl = ctrl,
preProcess = c("center", "scale"),
tuneLength = 10
)
# Show best k chosen by cross-validation
mod_knn$bestTune
## k
## 5 13
# Predict on test set
pred_knn <- predict(mod_knn, newdata = test)
# Confusion matrix for KNN
cm_knn <- confusionMatrix(pred_knn, test[[target_name]])
cm_knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 101 35
## LOW 23 69
##
## Accuracy : 0.7456
## 95% CI : (0.6839, 0.8008)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 2.615e-10
##
## Kappa : 0.4825
##
## Mcnemar's Test P-Value : 0.1486
##
## Sensitivity : 0.8145
## Specificity : 0.6635
## Pos Pred Value : 0.7426
## Neg Pred Value : 0.7500
## Prevalence : 0.5439
## Detection Rate : 0.4430
## Detection Prevalence : 0.5965
## Balanced Accuracy : 0.7390
##
## 'Positive' Class : HIGH
##
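Since caret keeps the resampling results on the train object, you can also visualize cross-validated accuracy across the tuned values of k with caret's default plot method:
# Cross-validated accuracy as a function of k
plot(mod_knn)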
3. SVM Classification
Support Vector Machines (SVM) can model complex decision boundaries using kernels.
set.seed(123)
# Train SVM with radial basis function kernel:
# - caret method "svmRadial" uses kernlab::ksvm under the hood
# - features are centered and scaled
# - tuneLength controls grid size for cost and sigma
mod_svm <- train(
form,
data = train,
method = "svmRadial",
trControl = ctrl,
preProcess = c("center", "scale"),
tuneLength = 10
)
# Best hyperparameters found (C and sigma)
mod_svm$bestTune
## sigma C
## 6 0.07461024 8
# Predict on test set
pred_svm <- predict(mod_svm, newdata = test)
# Confusion matrix for SVM
cm_svm <- confusionMatrix(pred_svm, test[[target_name]])
cm_svm
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 97 25
## LOW 27 79
##
## Accuracy : 0.7719
## 95% CI : (0.7119, 0.8247)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 7.383e-13
##
## Kappa : 0.541
##
## Mcnemar's Test P-Value : 0.8897
##
## Sensitivity : 0.7823
## Specificity : 0.7596
## Pos Pred Value : 0.7951
## Neg Pred Value : 0.7453
## Prevalence : 0.5439
## Detection Rate : 0.4254
## Detection Prevalence : 0.5351
## Balanced Accuracy : 0.7709
##
## 'Positive' Class : HIGH
##
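To see how the radial-kernel SVM traded off sigma and cost during tuning, the full cross-validation grid is also stored on the train object:
# Cross-validation results over the (sigma, C) tuning grid
head(mod_svm$results)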
4. Decision Tree Classification
Decision trees split the feature space with simple rules and are easy to visualize.
# Fit a decision tree (classification) using rpart
mod_tree <- rpart(form, data = train, method = "class")
# Plot the tree structure
rpart.plot(mod_tree, main = "Decision Tree for Wine Quality")
# Predict class labels on test set
pred_tree <- predict(mod_tree, newdata = test, type = "class")
# Confusion matrix for decision tree
cm_tree <- confusionMatrix(pred_tree, test[[target_name]])
cm_tree
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 92 25
## LOW 32 79
##
## Accuracy : 0.75
## 95% CI : (0.6886, 0.8048)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 1.042e-10
##
## Kappa : 0.4988
##
## Mcnemar's Test P-Value : 0.4268
##
## Sensitivity : 0.7419
## Specificity : 0.7596
## Pos Pred Value : 0.7863
## Neg Pred Value : 0.7117
## Prevalence : 0.5439
## Detection Rate : 0.4035
## Detection Prevalence : 0.5132
## Balanced Accuracy : 0.7508
##
## 'Positive' Class : HIGH
##
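rpart grows the tree with internal cross-validation over the complexity parameter cp, so a common next step is to prune at the cp with the lowest cross-validated error. A minimal sketch using the fitted tree above:
# Cross-validated error for each complexity parameter (cp)
printcp(mod_tree)
# Prune at the cp with the lowest cross-validated error (xerror)
best_cp <- mod_tree$cptable[which.min(mod_tree$cptable[, "xerror"]), "CP"]
mod_tree_pruned <- prune(mod_tree, cp = best_cp)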
5. Random Forest Classification
Random Forest builds many trees on bootstrap samples and aggregates their predictions.
set.seed(123)
# Train a Random Forest classifier:
# - ntree: number of trees
# - mtry: number of variables sampled at each split
mod_rf <- randomForest(
form,
data = train,
ntree = 300,
mtry = floor(sqrt(ncol(train) - 1)), # heuristic: sqrt of number of predictors
importance = TRUE
)
# Variable importance (numeric table; a plot sketch follows this section)
importance(mod_rf)
## HIGH LOW MeanDecreaseAccuracy MeanDecreaseGini
## id 14.028997 10.564999 17.62732 35.10140
## fixed.acidity 11.349800 9.440919 15.22809 27.87865
## volatile.acidity 15.973863 16.112758 22.32810 46.66662
## citric.acid 10.236717 9.417036 14.44282 26.75317
## residual.sugar 5.126901 8.774959 10.21065 21.10028
## chlorides 10.583338 11.525067 15.46892 29.40577
## free.sulfur.dioxide 8.763134 10.476440 15.27219 23.55293
## total.sulfur.dioxide 21.249480 14.951319 26.16640 44.49082
## density 13.393207 10.726715 18.12404 34.90799
## pH 12.895002 9.506639 15.88388 28.59011
## sulphates 28.404499 24.152408 34.65636 57.86420
## alcohol 28.504829 31.735833 40.70793 77.16747
# Predict on test set
pred_rf <- predict(mod_rf, newdata = test)
# Confusion matrix for Random Forest
cm_rf <- confusionMatrix(pred_rf, test[[target_name]])
cm_rf
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 101 24
## LOW 23 80
##
## Accuracy : 0.7939
## 95% CI : (0.7355, 0.8444)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 2.84e-15
##
## Kappa : 0.5842
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8145
## Specificity : 0.7692
## Pos Pred Value : 0.8080
## Neg Pred Value : 0.7767
## Prevalence : 0.5439
## Detection Rate : 0.4430
## Detection Prevalence : 0.5482
## Balanced Accuracy : 0.7919
##
## 'Positive' Class : HIGH
##
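For a graphical view of the importance table printed above, randomForest ships a built-in plotting helper:
# Dotcharts of MeanDecreaseAccuracy and MeanDecreaseGini
varImpPlot(mod_rf, main = "Random Forest Variable Importance")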
6. XGBoost Classification
XGBoost is a powerful gradient boosting library; here we use it for binary classification.
# Prepare model matrices (one-hot encoding if needed) for xgboost
x_train <- model.matrix(form, data = train)[, -1] # remove intercept
y_train <- as.numeric(train[[target_name]]) - 1 # factor to 0/1 (0 = first level, 1 = second level = pos_class)
x_test <- model.matrix(form, data = test)[, -1]
y_test <- as.numeric(test[[target_name]]) - 1
# Create DMatrix objects for xgboost
dtrain <- xgb.DMatrix(data = x_train, label = y_train)
dtest <- xgb.DMatrix(data = x_test, label = y_test)
# Set basic parameters for binary classification
params <- list(
objective = "binary:logistic", # logistic output for binary class
eval_metric = "logloss",
max_depth = 5,
eta = 0.2,
subsample = 0.8,
colsample_bytree = 0.8
)
set.seed(123)
# Train the XGBoost model
mod_xgb <- xgb.train(
params = params,
data = dtrain,
nrounds = 200,
verbose = 0
)
# Predict probabilities for the positive class
prob_xgb <- predict(mod_xgb, newdata = dtest)
# Convert probabilities to class labels using 0.5 threshold
pred_xgb <- ifelse(prob_xgb > 0.5, pos_class, neg_class)
pred_xgb <- factor(pred_xgb, levels = levels(train[[target_name]]))
# Confusion matrix for XGBoost
cm_xgb <- confusionMatrix(pred_xgb, test[[target_name]])
cm_xgb
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 97 24
## LOW 27 80
##
## Accuracy : 0.7763
## 95% CI : (0.7166, 0.8287)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 2.553e-13
##
## Kappa : 0.5502
##
## Mcnemar's Test P-Value : 0.7794
##
## Sensitivity : 0.7823
## Specificity : 0.7692
## Pos Pred Value : 0.8017
## Neg Pred Value : 0.7477
## Prevalence : 0.5439
## Detection Rate : 0.4254
## Detection Prevalence : 0.5307
## Balanced Accuracy : 0.7757
##
## 'Positive' Class : HIGH
##
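Instead of fixing nrounds, you can let XGBoost stop once the evaluation metric stops improving. A sketch with early stopping; note that for an honest performance estimate you would monitor a separate validation split rather than dtest:
set.seed(123)
mod_xgb_es <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 500,
  watchlist = list(eval = dtest), # ideally a held-out validation set, not the test set
  early_stopping_rounds = 20,
  verbose = 0
)
mod_xgb_es$best_iteration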
7. Naive Bayes
Naive Bayes assumes conditional independence between features given the class label.
# Train a Naive Bayes classifier using e1071
mod_nb <- naiveBayes(form, data = train)
# Predict on test set
pred_nb <- predict(mod_nb, newdata = test)
# Confusion matrix for Naive Bayes
cm_nb <- confusionMatrix(pred_nb, test[[target_name]])
cm_nb
## Confusion Matrix and Statistics
##
## Reference
## Prediction HIGH LOW
## HIGH 96 29
## LOW 28 75
##
## Accuracy : 0.75
## 95% CI : (0.6886, 0.8048)
## No Information Rate : 0.5439
## P-Value [Acc > NIR] : 1.042e-10
##
## Kappa : 0.4957
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7742
## Specificity : 0.7212
## Pos Pred Value : 0.7680
## Neg Pred Value : 0.7282
## Prevalence : 0.5439
## Detection Rate : 0.4211
## Detection Prevalence : 0.5482
## Balanced Accuracy : 0.7477
##
## 'Positive' Class : HIGH
##
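To inspect what Naive Bayes actually learned, e1071 stores the class-conditional distributions on the fitted object; for numeric features these are per-class means and standard deviations (the Gaussian assumption). For example:
# Per-class mean (column 1) and sd (column 2) for the alcohol feature
mod_nb$tables$alcohol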
8. Model Comparison (Accuracy, F1, Balanced Accuracy)
We now compare all models using three metrics:
- Accuracy
- F1-score (for the positive class, pos_class)
- Balanced Accuracy (the average of per-class recall); a manual sanity check is sketched after the helper function below.
get_balanced_accuracy <- function(pred, truth) {
cm <- caret::confusionMatrix(pred, truth)
as.numeric(cm$byClass["Balanced Accuracy"])
}
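As a sanity check on the helper above, balanced accuracy can also be computed by hand as the mean of per-class recall. A minimal sketch using the logistic regression predictions:
# Rows = predictions, columns = true labels
tab <- table(pred_log, test[[target_name]])
mean(diag(tab) / colSums(tab)) # should match cm_log's Balanced Accuracy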
# Build a summary table of metrics for each model
results <- tibble(
Model = c(
"Logistic Regression",
"KNN",
"SVM",
"Decision Tree",
"Random Forest",
"XGBoost",
"Naive Bayes"
),
Accuracy = c(
Accuracy(pred_log, test[[target_name]]),
Accuracy(pred_knn, test[[target_name]]),
Accuracy(pred_svm, test[[target_name]]),
Accuracy(pred_tree, test[[target_name]]),
Accuracy(pred_rf, test[[target_name]]),
Accuracy(pred_xgb, test[[target_name]]),
Accuracy(pred_nb, test[[target_name]])
),
F1 = c(
F1_Score(pred_log, test[[target_name]], positive = pos_class),
F1_Score(pred_knn, test[[target_name]], positive = pos_class),
F1_Score(pred_svm, test[[target_name]], positive = pos_class),
F1_Score(pred_tree, test[[target_name]], positive = pos_class),
F1_Score(pred_rf, test[[target_name]], positive = pos_class),
F1_Score(pred_xgb, test[[target_name]], positive = pos_class),
F1_Score(pred_nb, test[[target_name]], positive = pos_class)
),
Balanced_Accuracy = c(
get_balanced_accuracy(pred_log, test[[target_name]]),
get_balanced_accuracy(pred_knn, test[[target_name]]),
get_balanced_accuracy(pred_svm, test[[target_name]]),
get_balanced_accuracy(pred_tree, test[[target_name]]),
get_balanced_accuracy(pred_rf, test[[target_name]]),
get_balanced_accuracy(pred_xgb, test[[target_name]]),
get_balanced_accuracy(pred_nb, test[[target_name]])
)
) %>%
arrange(desc(Accuracy))
results
## # A tibble: 7 × 4
## Model Accuracy F1 Balanced_Accuracy
## <chr> <dbl> <dbl> <dbl>
## 1 Random Forest 0.794 0.773 0.792
## 2 Logistic Regression 0.776 0.756 0.775
## 3 XGBoost 0.776 0.758 0.776
## 4 SVM 0.772 0.752 0.771
## 5 Decision Tree 0.75 0.735 0.751
## 6 Naive Bayes 0.75 0.725 0.748
## 7 KNN 0.746 0.704 0.739
9. Final Notes
- This cheatsheet is meant as a practical reference for quickly fitting and comparing common classification models on a real binary dataset.
- For real-world projects, you should also consider:
  - More thorough hyperparameter tuning (e.g., grid search, random search, Bayesian optimization).
  - Proper handling of missing values and outliers.
  - Feature engineering and domain-specific transformations.
  - Using ROC curves, PR curves, and calibration plots for detailed evaluation (a minimal ROC sketch follows below).
You can adapt these code snippets to other binary classification problems by changing the data-loading step; because target_name is taken from the last column, the rest of the code adjusts automatically.
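As a starting point for the ROC analysis mentioned above, here is a minimal sketch using the pROC package (an assumption; it is not loaded earlier in this cheatsheet) and the logistic regression probabilities:
library(pROC)
# prob_log is P(second level) = P("LOW") from the logistic model
roc_log <- roc(response = test[[target_name]], predictor = prob_log)
auc(roc_log)
plot(roc_log, main = "ROC - Logistic Regression")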