The OneR (One Rule) classification model is a simple yet effective rule-based machine learning algorithm. It generates one rule for each feature in the dataset and then selects the rule with the lowest error rate for classification.
Despite its simplicity, OneR performs surprisingly well on many classification tasks, often competing with more complex machine learning models.
This algorithm was invented by Robert C. Holte in 1993. He introduced it in his paper titled “Very Simple Classification Rules Perform Well on Most Commonly Used Datasets”.
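To make the idea concrete, here is a minimal base R sketch of the procedure described above: build one rule per feature (predict the majority class within each level) and keep the feature whose rule makes the fewest errors. The toy data frame and the helper `one_rule()` are hypothetical and written only for illustration; the {OneR} package used in the steps below handles ties, numeric data, and missing values in its own way.

```r
# Hypothetical toy data, for illustration only
toy <- data.frame(
  outlook = c("sunny", "sunny", "rain", "rain", "overcast", "overcast"),
  windy   = c("yes", "no", "yes", "no", "yes", "no"),
  play    = c("no", "no", "no", "yes", "yes", "yes"),
  stringsAsFactors = TRUE
)

# For one feature: predict the majority class within each level,
# then count how many rows that rule gets wrong.
one_rule <- function(df, feature, target) {
  tab  <- table(df[[feature]], df[[target]])
  rule <- colnames(tab)[apply(tab, 1, which.max)]  # ties go to the first class
  names(rule) <- rownames(tab)
  errors <- sum(tab) - sum(apply(tab, 1, max))
  list(feature = feature, rule = rule, error_rate = errors / nrow(df))
}

# Build one rule per feature and keep the one with the lowest error rate
features <- setdiff(names(toy), "play")
rules    <- lapply(features, one_rule, df = toy, target = "play")
best     <- rules[[which.min(sapply(rules, function(r) r$error_rate))]]

best$feature     # feature chosen for the single rule
best$rule        # predicted class for each level of that feature
best$error_rate  # share of rows the rule misclassifies
```

On this toy data the `outlook` rule misclassifies one row out of six, while the `windy` rule misclassifies two, so `outlook` is selected.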
The steps for creating a OneR classification model are as follows:
Step 1: Install and Load the OneR Package.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.3.0
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.1.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("OneR")
library(OneR)
Step 2: Load & Prepare the Data.
# Load dataset
library(mlbench)
data(PimaIndiansDiabetes)
# Get a glimpse of the data
PimaIndiansDiabetes %>% glimpse()
## Rows: 768
## Columns: 9
## $ pregnant <dbl> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5, 7, 0, 7, 1, 1…
## $ glucose <dbl> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, 110, 168, 139,…
## $ pressure <dbl> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74, 80, 60, 72, 0,…
## $ triceps <dbl> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, 23, 19, 0, 47, 0…
## $ insulin <dbl> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, 846, 175, 0, 230…
## $ mass <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.3, 30.5, 0.0, 37…
## $ pedigree <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248, 0.134, 0.158…
## $ age <dbl> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34, 57, 59, 51, 3…
## $ diabetes <fct> pos, neg, pos, neg, pos, neg, pos, neg, pos, pos, neg, pos, n…
Step 3: Check if data is balanced.
janitor::tabyl(PimaIndiansDiabetes$diabetes)
## PimaIndiansDiabetes$diabetes n percent
## neg 500 0.6510417
## pos 268 0.3489583
Note. The data is not balanced: about 65% of the cases are neg and 35% are pos.
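One way to see why the imbalance matters is to compute the accuracy of a trivial classifier that always predicts the majority class. The short sketch below uses only the counts shown in the table above; the variable names are illustrative.

```r
# Class counts taken from the tabyl() output above
counts <- c(neg = 500, pos = 268)
props  <- counts / sum(counts)

# A classifier that always predicts the majority class
majority_class    <- names(which.max(counts))
baseline_accuracy <- max(props)

majority_class
round(baseline_accuracy, 3)
```

Always predicting neg would already be right about 65% of the time on the original data, so any accuracy reported on unbalanced data has to be read against that figure.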
Step 4: Balance the data using the {ROSE} package.
library(ROSE)
## Loaded ROSE 0.0-4
balanced_df <- ROSE(diabetes ~ ., data = PimaIndiansDiabetes, seed = 1980)$data
# Check if data is balanced.
janitor::tabyl(balanced_df$diabetes)
## balanced_df$diabetes n percent
## neg 384 0.5
## pos 384 0.5
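ROSE (Random Over-Sampling Examples) is described by its authors as a smoothed bootstrap: new rows are synthetic, drawn around observed rows rather than copied from them. The base R sketch below illustrates that general idea on hypothetical toy data with one numeric predictor. It is not the package's implementation; in particular, the helper `smoothed_bootstrap()` and the fixed bandwidth `h` are assumptions made for illustration.

```r
set.seed(1980)

# Hypothetical toy data: one numeric predictor, imbalanced two-class outcome
toy <- data.frame(
  x = c(rnorm(80, mean = 100, sd = 10), rnorm(20, mean = 140, sd = 10)),
  y = factor(rep(c("neg", "pos"), times = c(80, 20)))
)

smoothed_bootstrap <- function(df, n = nrow(df), p = 0.5, h = 5) {
  # 1. Draw the class of each new row, giving "pos" probability p
  new_y <- sample(levels(df$y), size = n, replace = TRUE, prob = c(1 - p, p))
  # 2. For each new row, pick a random observed row of that class and
  #    jitter its predictor with Gaussian noise (sd = h, an assumed bandwidth)
  new_x <- vapply(new_y, function(cls) {
    pool <- df$x[df$y == cls]
    pool[sample.int(length(pool), 1)] + rnorm(1, sd = h)
  }, numeric(1))
  data.frame(x = unname(new_x), y = factor(new_y, levels = levels(df$y)))
}

balanced_toy <- smoothed_bootstrap(toy)
table(balanced_toy$y)
```

Because the new rows are generated rather than copied, a balanced data set produced this way can contain values that never occurred in the original data, which is consistent with the glucose bin in Step 5 starting below zero.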
Step 5: Train the OneR Model.
# Discretizes all numerical data in a data frame into categorical bins
data <- optbin(balanced_df)
# Build model with best predictor
model <- OneR(diabetes ~ ., data = data, verbose = TRUE)
##
## Attribute Accuracy
## 1 * glucose 66.67%
## 2 age 61.46%
## 3 pregnant 60.81%
## 4 mass 60.55%
## 5 pressure 55.47%
## 6 pedigree 55.21%
## 7 insulin 54.43%
## 8 triceps 52.08%
## ---
## Chosen attribute due to accuracy
## and ties method (if applicable): '*'
summary(model) # View model details
##
## Call:
## OneR.formula(formula = diabetes ~ ., data = data, verbose = TRUE)
##
## Rules:
## If glucose = (-25.2,125] then diabetes = neg
## If glucose = (125,250] then diabetes = pos
##
## Accuracy:
## 512 of 768 instances classified correctly (66.67%)
##
## Contingency table:
## glucose
## diabetes (-25.2,125] (125,250] Sum
## neg * 262 122 384
## pos 134 * 250 384
## Sum 396 372 768
## ---
## Maximum in each column: '*'
##
## Pearson's Chi-squared test:
## X-squared = 84.087, df = 1, p-value < 2.2e-16
plot(model) # Plot model
Single classification
The OneR model evaluated every predictor and selected glucose as the best attribute for classification, with the highest in-sample accuracy (66.7%). Age (61.5%), pregnant (60.8%), and mass (60.6%) showed moderate predictive power, while insulin (54.4%) and triceps (52.1%) were the weakest predictors, only slightly above the 50% base rate of the balanced data. Glucose is therefore the most informative single variable for predicting diabetes in this dataset, and each of the other features contributes less on its own.
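Because the model consists of a single rule on glucose, it can be written out by hand. The sketch below restates the rule reported by `summary(model)` as a plain R function; `predict_diabetes()` is a hypothetical helper, and the threshold of 125 is read from the bin boundaries in the output above.

```r
# The learned rule, written out by hand.
# The interval (125, 250] is open on the left, so exactly 125 maps to "neg".
predict_diabetes <- function(glucose, threshold = 125) {
  factor(ifelse(glucose > threshold, "pos", "neg"), levels = c("neg", "pos"))
}

predict_diabetes(c(85, 125, 148))
```

This makes the appeal of OneR visible: the entire model is one comparison that can be checked and explained without any software.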
Step 6: Make Predictions & Evaluate Model Performance.
# Use model to predict data
prediction <- predict(model, data)
# Evaluate prediction statistics
eval_model(prediction, data)
##
## Confusion matrix (absolute):
## Actual
## Prediction neg pos Sum
## neg 262 134 396
## pos 122 250 372
## Sum 384 384 768
##
## Confusion matrix (relative):
## Actual
## Prediction neg pos Sum
## neg 0.34 0.17 0.52
## pos 0.16 0.33 0.48
## Sum 0.50 0.50 1.00
##
## Accuracy:
## 0.6667 (512/768)
##
## Error rate:
## 0.3333 (256/768)
##
## Error rate reduction (vs. base rate):
## 0.3333 (p-value < 2.2e-16)
Note. An accuracy of 67% is moderate. It is also an in-sample figure, because the predictions were made on the same data used to build the rule.
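The summary figures reported by `eval_model()` can be reproduced from the four cells of the confusion matrix. The sketch below does this in base R. The error rate reduction formula is my reading of the output (it reproduces the reported 0.3333), and sensitivity and specificity are added here as extra illustrations that are not part of the output above.

```r
# Confusion matrix from the eval_model() output (rows = prediction, cols = actual)
cm <- matrix(c(262, 122, 134, 250), nrow = 2,
             dimnames = list(Prediction = c("neg", "pos"),
                             Actual     = c("neg", "pos")))

accuracy   <- sum(diag(cm)) / sum(cm)   # correct predictions / all cases
error_rate <- 1 - accuracy

# Base rate: error made by always predicting the most frequent actual class
base_error    <- 1 - max(colSums(cm)) / sum(cm)
err_reduction <- (base_error - error_rate) / base_error

# Extra per-class measures (not shown in the output above)
sensitivity <- cm["pos", "pos"] / sum(cm[, "pos"])  # actual pos caught
specificity <- cm["neg", "neg"] / sum(cm[, "neg"])  # actual neg caught

round(c(accuracy = accuracy, error_rate = error_rate,
        err_reduction = err_reduction,
        sensitivity = sensitivity, specificity = specificity), 4)
```

With balanced classes the base error is 0.5, so cutting the error to 0.3333 corresponds to a reduction of one third relative to the base rate.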
Conclusion
The OneR model achieved an overall accuracy of 66.7% (512 correct predictions out of 768), which is significantly better than the 50% base rate of the balanced data (p-value < 2.2e-16).
The confusion matrix shows that the model correctly classified 262 negative cases and 250 positive cases, but also misclassified 134 positives as negatives and 122 negatives as positives.
The error rate was 33.3%, meaning that about one in three cases is misclassified. The model predicts better than random chance, but its moderate accuracy leaves room for improvement, and more complex models may be needed for higher classification performance.
A.M.D.G