# End-to-End Machine Learning Pipeline using the Iris Dataset (R Implementation)

**Submitted by:**

- N. A. Adarsh Pritam (011)
- Shruti Bhimjibhai Thakkar (016)
- Amrutha S Prasad (006)

**Course:** Machine Learning Operations
**Institution:** Alliance University, Bengaluru
**Submission Date:** January 2026
# Project Overview

This project implements a complete end-to-end machine learning workflow using the UCI Iris dataset, adapted to the R ecosystem. Rather than focusing solely on model accuracy, the pipeline emphasizes structured data ingestion, feature engineering, reproducible experimentation, model comparison, deployment, and validation, following Machine Learning Operations (MLOps) best practices.
The project demonstrates how traditional Python-based MLOps concepts can be faithfully translated into R using equivalent industry-standard tools.
# Dataset Description

The dataset used in this project is the Iris dataset sourced from the UCI Machine Learning Repository. It consists of 150 samples, each representing a flower from one of three species:

- *Iris setosa*
- *Iris versicolor*
- *Iris virginica*

Each species contributes 50 samples, described by four numeric measurements.
The dataset was loaded into R as a data frame and processed using base R and `tidyverse` utilities.
After ingestion, all feature names were standardized by converting them to lowercase and replacing spaces with underscores. This ensures compatibility with programmatic feature selection and downstream pipelines.
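A minimal sketch of this standardization step is shown below; the helper name and the raw-file path are illustrative, and the pattern assumes the raw UCI column names use spaces or dots as separators.

```r
# Illustrative helper (name and file path are hypothetical):
# lowercase the column names and replace spaces/dots with underscores.
standardize_names <- function(df) {
  nm <- tolower(names(df))
  nm <- gsub("[ .]+", "_", nm)
  names(df) <- nm
  df
}

iris_df <- standardize_names(read.csv("data/iris.csv"))
```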
| Attribute | Data Type | Non-Null Count | Description |
|---|---|---|---|
| sepal_length | Numeric | 150 | Sepal length (cm) |
| sepal_width | Numeric | 150 | Sepal width (cm) |
| petal_length | Numeric | 150 | Petal length (cm) |
| petal_width | Numeric | 150 | Petal width (cm) |
| class | Factor | 150 | Species label |
Schema validation (implemented as `validate_original_schema()` in the Appendix) confirmed that all attributes contain valid entries and no missing values.
# Data Cleaning and Validation

Data cleaning and validation steps were performed to ensure data consistency and reliability. These steps included duplicate removal, a review of descriptive statistics, and outlier screening, as described below.
Three duplicate records were identified and removed, reducing the dataset size from 150 to 147 samples.
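A sketch of the deduplication step, assuming the cleaned data frame is named `iris_df` as above:

```r
# Count and drop exact duplicate rows; the cleaning run reported
# three duplicates, reducing the data from 150 to 147 samples
sum(duplicated(iris_df))
iris_df <- dplyr::distinct(iris_df)
```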
Descriptive statistics were computed for all numerical features after data cleaning. The statistics confirmed stable distributions and biologically reasonable ranges for all attributes.
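One way to compute such per-feature summaries with `tidyverse` utilities (a sketch; column names follow the standardized schema above):

```r
library(dplyr)
library(tidyr)

# Long format makes it easy to summarise every numeric feature at once
iris_df |>
  pivot_longer(-class, names_to = "feature") |>
  group_by(feature) |>
  summarise(mean = mean(value), sd = sd(value),
            min = min(value), max = max(value))
```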
Outliers were detected using the interquartile range (IQR) method. A small number of potential outliers were observed in the `sepal_width` feature. Given the biological nature of the dataset, these values were retained to preserve natural variation.
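The IQR rule used here can be sketched as follows (illustrative; `iris_df` as above):

```r
# 1.5 * IQR fences for sepal_width: values outside are flagged, not removed
q     <- quantile(iris_df$sepal_width, c(0.25, 0.75))
iqr   <- q[2] - q[1]
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
iris_df$sepal_width[iris_df$sepal_width < lower | iris_df$sepal_width > upper]
```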
# Exploratory Data Analysis

Exploratory Data Analysis (EDA) was conducted to analyze feature distributions, relationships, and class separability. Visualization techniques such as histograms, boxplots, correlation matrices, and pair plots were used to understand the dataset structure.
The analysis revealed that petal-based features exhibit stronger class separation than sepal-based features, motivating their importance in subsequent feature engineering and model development stages.
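A few representative EDA commands (a sketch; the specific plotting choices are illustrative):

```r
library(ggplot2)

# Per-class distribution of one feature
ggplot(iris_df, aes(petal_length, fill = class)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity")

# Pair plot and correlation matrix over the four raw measurements
raw_feats <- c("sepal_length", "sepal_width", "petal_length", "petal_width")
pairs(iris_df[, raw_feats], col = iris_df$class)
cor(iris_df[, raw_feats])
```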
# Feature Engineering

Feature engineering was performed to capture proportional and geometric relationships between floral components. Based on EDA and ANOVA-based class separation analysis, several derived features were evaluated.
The final feature set (`petal_to_sepal_length`, `petal_to_sepal_width`, `petal_length`, `petal_width`, and `sepal_ratio`) balances discriminative power and redundancy reduction, retaining both absolute and proportional measurements that contribute significantly to class separation.
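The ANOVA-based screening can be sketched as below, ranking each candidate feature by its one-way F-statistic against the class label (assuming `df_fe` holds the engineered features, as produced by `feature_engineering()` in the Appendix):

```r
# Larger F value => stronger between-class separation relative to
# within-class spread
feature_f_stat <- function(feature, df) {
  summary(aov(df[[feature]] ~ df$class))[[1]][["F value"]][1]
}

feats <- setdiff(names(df_fe), "class")
sort(sapply(feats, feature_f_stat, df = df_fe), decreasing = TRUE)
```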
# Model Training and Selection

The following classification models were implemented using the `caret` framework in R:

- Logistic Regression (multinomial, `method = "multinom"`)
- Random Forest (`method = "rf"`)
- Decision Tree (`method = "rpart"`)
- Naive Bayes (`method = "nb"`)
Before training, all features were standardized using the `preProcess()` function from the `caret` package, which centers and scales features to zero mean and unit variance. The fitted scaler was saved as an artifact and reused during inference.
Each model was evaluated on a held-out test set using accuracy, macro-averaged precision, and macro-averaged recall. Recall was selected as the primary model selection metric to ensure that no species is systematically missed. Logistic Regression and Naïve Bayes achieved perfect test-set scores; Logistic Regression was selected as the final model due to its deterministic behavior and interpretability.
# Training Pipeline

The training pipeline is implemented as a modular and reproducible R script (listed in the Appendix). Each pipeline stage (data validation, feature engineering, train-test splitting, scaling, model training, evaluation, and artifact persistence) is executed sequentially.
Reproducibility is ensured through fixed random seeds and consistent reuse of saved preprocessing and model artifacts.
# Deployment and Model Serving

Docker was used to containerize the R-based machine learning workflow. The container installs all required R packages and launches the Plumber API, ensuring portability and environment consistency across systems (the full Dockerfile is listed in the Appendix).
To enable real-time inference, the trained model is deployed as a
RESTful web service using the Plumber framework in R.
Plumber allows R functions to be exposed as HTTP endpoints with
automatic request parsing and interactive Swagger documentation.
The API exposes the following endpoints:

- `GET /health`: service health check
- `POST /predict`: single-sample prediction from the four raw measurements
- `POST /predict-csv`: batch prediction from a CSV request body
Input validation and structured error handling are implemented to ensure robustness and reliability.
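For local development, the API can be launched directly from R; the script path and port below mirror the Dockerfile's `CMD` in the Appendix:

```r
library(plumber)

# Serve the API; the interactive Swagger UI is available under /__docs__/
pr <- plumb("api/plumber.R")
pr$run(host = "0.0.0.0", port = 7860)
```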
A sample request was issued to the `/predict` endpoint with the four raw measurements supplied as query parameters.
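A sketch of such a request from R using `httr` (the values are the classic first iris sample and are purely illustrative; the host and port assume the local setup shown above):

```r
library(httr)

res <- POST(
  "http://localhost:7860/predict",
  query = list(
    sepal_length = 5.1,
    sepal_width  = 3.5,
    petal_length = 1.4,
    petal_width  = 0.2
  )
)
content(res)
```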
The response body shows that the model
successfully classified the input sample as *setosa*, with the key
`"success": true` indicating a valid inference.
This confirms that the model serving layer is functioning correctly and
responding with accurate predictions.
## Deployment and Accessibility
The API is designed to run locally or inside a Docker container. While the R
implementation was not deployed as a public cloud service, it is fully
containerized and can be deployed on any compatible platform if required.
# Summary
This project successfully demonstrates a complete end-to-end MLOps workflow
implemented in R. The pipeline covers data ingestion, preprocessing, feature
engineering, model training, evaluation, deployment, and validation, while
maintaining reproducibility and modularity.
The report illustrates how Python-based MLOps concepts can be systematically
adapted to the R ecosystem without compromising architectural rigor or depth.
# Appendix
## Data Validation and Feature Engineering (R)
```r
library(dplyr)
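# Validate the raw schema: required columns present, strictly positive
# measurements, and only the three known species labels.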
validate_original_schema <- function(df) {
required_cols <- c(
"sepal_length",
"sepal_width",
"petal_length",
"petal_width",
"class"
)
if (!all(required_cols %in% colnames(df))) {
stop("Missing required columns")
}
if (any(df$sepal_length <= 0) ||
any(df$sepal_width <= 0) ||
any(df$petal_length <= 0) ||
any(df$petal_width <= 0)) {
stop("All numeric values must be > 0")
}
allowed_classes <- c("setosa", "versicolor", "virginica")
if (!all(df$class %in% allowed_classes)) {
stop("Invalid class labels")
}
}
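# Derive proportional/geometric features and keep only the final feature set.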
create_features <- function(df) {
df_features <- df %>%
mutate(
petal_to_sepal_length = petal_length / sepal_length,
petal_to_sepal_width = petal_width / sepal_width,
sepal_ratio = sepal_length / sepal_width
) %>%
select(
petal_to_sepal_length,
petal_to_sepal_width,
petal_length,
petal_width,
sepal_ratio,
class
)
return(df_features)
}
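# Guard against schema drift before the engineered data is used downstream.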
validate_feature_schema <- function(df) {
required_cols <- c(
"petal_to_sepal_length",
"petal_to_sepal_width",
"petal_length",
"petal_width",
"sepal_ratio",
"class"
)
if (!all(required_cols %in% colnames(df))) {
stop("Feature schema mismatch")
}
}
```

## Model Training Pipeline (R)

```r
library(caret)
library(tidyverse)
library(nnet)
library(randomForest)
library(rpart)
library(e1071)
# ---------------------------
# Load data
# ---------------------------
df <- read.csv("data/iris-processed.csv")
df$class <- as.factor(df$class)
# ---------------------------
# Feature Engineering (ONE SOURCE OF TRUTH)
# ---------------------------
feature_engineering <- function(df) {
data.frame(
petal_to_sepal_length = df$petal_length / df$sepal_length,
petal_to_sepal_width = df$petal_width / df$sepal_width,
petal_length = df$petal_length,
petal_width = df$petal_width,
sepal_ratio = df$sepal_length / df$sepal_width,
class = df$class
)
}
df_fe <- feature_engineering(df)
# ---------------------------
# Train / Test split
# ---------------------------
set.seed(66)
idx <- createDataPartition(df_fe$class, p = 0.8, list = FALSE)
train_df <- df_fe[idx, ]
test_df <- df_fe[-idx, ]
# ---------------------------
# Scaling (ONLY engineered features)
# ---------------------------
preproc <- preProcess(train_df[, -6], method = c("center", "scale"))
x_train <- predict(preproc, train_df[, -6])
x_test <- predict(preproc, test_df[, -6])
y_train <- train_df$class
y_test <- test_df$class
# ---------------------------
# Train control
# ---------------------------
ctrl <- trainControl(
method = "cv",
number = 5,
classProbs = TRUE
)
# ---------------------------
# Models to compare
# ---------------------------
models <- list(
LogisticRegression = "multinom",
RandomForest = "rf",
DecisionTree = "rpart",
NaiveBayes = "nb"  # caret's "nb" method requires the klaR package
)
results <- list()
# ---------------------------
# Train & evaluate
# ---------------------------
for (model_name in names(models)) {
cat("\n==============================\n")
cat("Training:", model_name, "\n")
cat("==============================\n")
model <- train(
x = x_train,
y = y_train,
method = models[[model_name]],
trControl = ctrl
)
preds <- predict(model, x_test)
preds <- factor(preds, levels = levels(y_test))
cm <- confusionMatrix(preds, y_test)
metrics <- list(
  accuracy = as.numeric(cm$overall["Accuracy"]),
  # macro-averages over the three classes; na.rm guards against
  # undefined precision/recall when a class is never predicted
  recall = mean(cm$byClass[, "Recall"], na.rm = TRUE),
  precision = mean(cm$byClass[, "Precision"], na.rm = TRUE)
)
print(metrics)
results[[model_name]] <- list(
model = model,
metrics = metrics
)
}
# ---------------------------
# Select best model (Recall)
# ---------------------------
best_model_name <- names(results)[which.max(
sapply(results, function(x) x$metrics$recall)
)]
best_model <- results[[best_model_name]]$model
# ---------------------------
# Save artifacts (IMPORTANT)
# ---------------------------
dir.create("artifacts", showWarnings = FALSE)
saveRDS(best_model, "artifacts/model.rds")
saveRDS(preproc, "artifacts/scaler.rds")
saveRDS(colnames(x_train), "artifacts/feature_names.rds")
cat("\n=====================================\n")
cat("Best Model Selected:", best_model_name, "\n")
cat("Artifacts saved in /artifacts\n")
cat("=====================================\n")
```

## Model Serving API (R + Plumber)

```r
library(plumber)
library(here)
# ---------------------------
# Load artifacts
# ---------------------------
model <- readRDS(here("artifacts", "model.rds"))
scaler <- readRDS(here("artifacts", "scaler.rds"))
feature_names <- readRDS(here("artifacts", "feature_names.rds"))
#* @apiTitle Iris Classification API (R + Plumber)
#* Health check
#* @get /health
function() {
list(status = "ok")
}
#* Predict a single sample
#* @post /predict
function(sepal_length,
sepal_width,
petal_length,
petal_width) {
tryCatch({
# Convert inputs
sepal_length <- as.numeric(sepal_length)
sepal_width <- as.numeric(sepal_width)
petal_length <- as.numeric(petal_length)
petal_width <- as.numeric(petal_width)
if (any(is.na(c(sepal_length, sepal_width, petal_length, petal_width))))
stop("All inputs must be numeric")
# Mirror the training-time schema rule: every measurement must be positive
if (sepal_length <= 0 || sepal_width <= 0 ||
    petal_length <= 0 || petal_width <= 0)
stop("All measurements must be > 0")
# Feature engineering (IDENTICAL to training)
input_df <- data.frame(
petal_to_sepal_length = petal_length / sepal_length,
petal_to_sepal_width = petal_width / sepal_width,
petal_length = petal_length,
petal_width = petal_width,
sepal_ratio = sepal_length / sepal_width
)
# Enforce schema
input_df <- input_df[, feature_names, drop = FALSE]
# Scale + predict
input_scaled <- predict(scaler, input_df)
pred <- predict(model, input_scaled)
list(
success = TRUE,
prediction = as.character(pred)
)
}, error = function(e) {
list(
success = FALSE,
error = e$message
)
})
}
#* Batch prediction using CSV
#* @parser csv
#* @post /predict-csv
function(req) {
tryCatch({
# plumber's built-in "csv" parser (backed by readr) exposes the
# uploaded CSV request body as a data frame via req$body
df <- as.data.frame(req$body)
required_cols <- c(
"sepal_length",
"sepal_width",
"petal_length",
"petal_width"
)
missing <- setdiff(required_cols, colnames(df))
if (length(missing) > 0)
stop(paste("Missing columns:", paste(missing, collapse = ", ")))
features <- data.frame(
petal_to_sepal_length = df$petal_length / df$sepal_length,
petal_to_sepal_width = df$petal_width / df$sepal_width,
petal_length = df$petal_length,
petal_width = df$petal_width,
sepal_ratio = df$sepal_length / df$sepal_width
)
features <- features[, feature_names, drop = FALSE]
features_scaled <- predict(scaler, features)
df$prediction <- predict(model, features_scaled)
df
}, error = function(e) {
list(success = FALSE, error = e$message)
})
}
```

## Dockerfile

```dockerfile
FROM rocker/r-ver:4.3.2
WORKDIR /app
# 'here' is used by the serving script; 'klaR' backs caret's "nb" method
RUN R -e "install.packages(c('caret','plumber','nnet','randomForest','e1071','klaR','here','mlflow','tidyverse'))"
COPY . .
EXPOSE 7860
CMD ["R", "-e", "pr <- plumber::plumb('api/plumber.R'); pr$run(host='0.0.0.0', port=7860)"]
```