M.Sc. Data Science MLOps Project Documentation

Title: ML Project Setup & Version Control


End-to-End Machine Learning Pipeline using Iris Dataset (R Implementation)


Submitted by

N. A. Adarsh Pritam (011)
Shruti Bhimjibhai Thakkar (016)
Amrutha S Prasad (006)

Course: Machine Learning Operations
Institution: Alliance University, Bengaluru
Submission Date: January 2026

Overview

This project implements a complete end-to-end machine learning workflow using the UCI Iris dataset, adapted to the R ecosystem. Rather than focusing solely on model accuracy, the pipeline emphasizes structured data ingestion, feature engineering, reproducible experimentation, model comparison, deployment, and validation, following Machine Learning Operations (MLOps) best practices.

The project demonstrates how traditional Python-based MLOps concepts can be faithfully translated into R using equivalent industry-standard tools.

Data Development

Dataset Description

The dataset used in this project is the Iris dataset sourced from the UCI Machine Learning Repository. It consists of 150 samples, each representing a flower from one of three species:

  • Iris-setosa
  • Iris-versicolor
  • Iris-virginica

The dataset was loaded into R as a data frame and processed using base R and tidyverse utilities.

Schema

After ingestion, all feature names were standardized by converting them to lowercase and replacing spaces with underscores. This ensures compatibility with programmatic feature selection and downstream pipelines.
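
A minimal sketch of this renaming step is shown below; the raw header names and the `standardize_names()` helper are illustrative rather than part of the project code.

```r
# Illustrative helper: lowercase all column names and replace spaces/dots with underscores
standardize_names <- function(df) {
  names(df) <- gsub("[ .]+", "_", tolower(names(df)))
  df
}

# Hypothetical raw frame with UCI-style headers
iris_raw <- data.frame(
  "Sepal Length" = c(5.1, 4.9),
  "Sepal Width"  = c(3.5, 3.0),
  "Petal Length" = c(1.4, 1.4),
  "Petal Width"  = c(0.2, 0.2),
  "Class"        = c("setosa", "setosa"),
  check.names = FALSE
)

names(standardize_names(iris_raw))
# [1] "sepal_length" "sepal_width"  "petal_length" "petal_width"  "class"
```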

| Attribute | Data Type | Non-Null Count | Description |
|---|---|---|---|
| sepal_length | Numeric | 150 | Sepal length (cm) |
| sepal_width | Numeric | 150 | Sepal width (cm) |
| petal_length | Numeric | 150 | Petal length (cm) |
| petal_width | Numeric | 150 | Petal width (cm) |
| class | Factor | 150 | Species label |

Schema validation confirmed that all attributes contain valid entries with no missing values.

Data Cleaning and Validation

Data cleaning and validation steps were performed to ensure data consistency and reliability. These steps included:

  • Validation of numerical ranges
  • Detection and removal of duplicate records
  • Verification of data types

Three duplicate records were identified and removed, reducing the dataset size from 150 to 147 samples.
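
A brief sketch of these checks is given below, assuming the standardized data frame is available as `df` (the object name is illustrative):

```r
library(dplyr)

# Names of the numeric feature columns in the standardized schema
num_cols <- c("sepal_length", "sepal_width", "petal_length", "petal_width")

# Verify data types and validate numerical ranges
stopifnot(all(sapply(df[num_cols], is.numeric)))
stopifnot(all(df[num_cols] > 0))

# Detect and drop exact duplicate rows
n_dupes <- sum(duplicated(df))
df <- distinct(df)
cat("Removed", n_dupes, "duplicate rows;", nrow(df), "samples remain\n")
```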

Statistical Summary

Descriptive statistics were computed for all numerical features after data cleaning. The statistics confirmed stable distributions and biologically reasonable ranges for all attributes.
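
Such a summary can be produced directly with base R and dplyr; the sketch below assumes the cleaned frame `df` from the previous step:

```r
library(dplyr)

# Five-number summaries for the numeric features
summary(df[, c("sepal_length", "sepal_width", "petal_length", "petal_width")])

# Per-class means and standard deviations
df %>%
  group_by(class) %>%
  summarise(across(where(is.numeric), list(mean = mean, sd = sd)), .groups = "drop")
```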

Outlier Detection

Outliers were detected using the Interquartile Range (IQR) method. A small number of potential outliers were observed in the sepal_width feature. Given the biological nature of the dataset, these values were retained to preserve natural variation.
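
A minimal sketch of the IQR rule as applied per feature (again assuming the cleaned frame `df`):

```r
# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
iqr_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr
}

# Count potential outliers in each numeric feature
sapply(df[, c("sepal_length", "sepal_width", "petal_length", "petal_width")],
       function(col) sum(iqr_outliers(col)))
```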

Exploratory Data Analysis

Exploratory Data Analysis (EDA) was conducted to analyze feature distributions, relationships, and class separability. Visualization techniques such as histograms, boxplots, correlation matrices, and pair plots were used to understand the dataset structure.

The analysis revealed that petal-based features exhibit stronger class separation than sepal-based features, motivating their importance in subsequent feature engineering and model development stages.
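
The sketch below reproduces two of these views, class-wise boxplots and a correlation matrix, using ggplot2 and tidyr on the assumed cleaned frame `df`:

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Boxplots of every feature split by class
df %>%
  pivot_longer(-class, names_to = "feature", values_to = "value") %>%
  ggplot(aes(x = class, y = value, fill = class)) +
  geom_boxplot() +
  facet_wrap(~ feature, scales = "free_y")

# Pairwise correlations between the numeric features
cor(df[, c("sepal_length", "sepal_width", "petal_length", "petal_width")])
```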

Feature Engineering and Selection

Feature engineering was performed to capture proportional and geometric relationships between floral components. Based on EDA and ANOVA-based class separation analysis, several derived features were evaluated.
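
As an illustration of the ANOVA-based screening, the following sketch computes a one-way ANOVA F statistic for each feature against the class label (larger values indicate stronger class separation); it assumes the cleaned frame `df`:

```r
features <- c("sepal_length", "sepal_width", "petal_length", "petal_width")

# One-way ANOVA of each feature against the class label
f_stats <- sapply(features, function(f) {
  fit <- aov(reformulate("class", response = f), data = df)
  summary(fit)[[1]][["F value"]][1]
})

sort(f_stats, decreasing = TRUE)   # petal features typically rank highest
```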

Engineered Features

  • Petal-to-sepal length ratio
  • Petal-to-sepal width ratio
  • Petal length (retained from the original measurements)
  • Petal width (retained from the original measurements)
  • Sepal length-to-width ratio

Final Feature Set

The final feature set balances discriminative power and redundancy reduction, retaining both absolute and proportional measurements that contribute significantly to class separation.

Model Development

Candidate Models

The following classification models were implemented using the caret framework in R:

  1. Logistic Regression
  2. Random Forest
  3. Decision Tree
  4. Gaussian Naïve Bayes

Feature Scaling

Before training, all features were standardized using the preProcess() function from the caret package, which centers and scales each feature to zero mean and unit variance. The fitted scaler was saved as an artifact and reused during inference.
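
The scaler round trip can be summarized as follows; the object names (`x_train_raw`, `new_features`) are placeholders, while the artifact path matches the one used in the appendix scripts:

```r
library(caret)

# Fit the scaler on the training features only; `x_train_raw` is a placeholder
# for the engineered training feature frame
scaler <- preProcess(x_train_raw, method = c("center", "scale"))
saveRDS(scaler, "artifacts/scaler.rds")        # persist alongside the model

# At inference time the identical transformation is re-applied
scaler     <- readRDS("artifacts/scaler.rds")
new_scaled <- predict(scaler, new_features)    # `new_features` is a placeholder
```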

Evaluation Metrics

Each model was evaluated on the held-out test set using accuracy, macro-averaged precision, and macro-averaged recall. Macro-averaged recall was selected as the primary model selection metric to ensure that every class is identified reliably.

Model Selection

Logistic Regression and Naïve Bayes both achieved perfect accuracy, precision, and recall on the test set. Logistic Regression was selected as the final model due to its deterministic behavior and interpretability.

Training Pipeline

The training pipeline is implemented as a modular, reproducible R script. The pipeline stages (data validation, feature engineering, train-test splitting, scaling, model training, evaluation, and artifact persistence) are executed sequentially.

Reproducibility is ensured through fixed random seeds and consistent reuse of saved preprocessing and model artifacts.

Containerization

Docker was used to containerize the R-based machine learning workflow. The container installs all required R packages and launches the Plumber API, ensuring portability and environment consistency across systems.

Model Serving

API-Based Model Serving Using Plumber

To enable real-time inference, the trained model is deployed as a RESTful web service using the Plumber framework in R. Plumber allows R functions to be exposed as HTTP endpoints with automatic request parsing and interactive Swagger documentation.

The API exposes endpoints for:

  • Health checks
  • Single-sample predictions
  • Batch predictions using CSV uploads

Input validation and structured error handling are implemented to ensure robustness and reliability.

API Testing and Output

A sample request was sent to the /predict endpoint with the following query parameters:

![](figures/2.jpeg){width=60%}

The response body shows that the model successfully classified the input sample as setosa, with the key "success": true indicating a valid inference.

This confirms that the model serving layer is functioning correctly and responding with accurate predictions.
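
For reference, an equivalent request can be issued from R with the httr package, assuming the API is running locally on port 7860 as configured in the Dockerfile:

```r
library(httr)

# POST to the /predict endpoint with the measurements as query parameters
resp <- POST(
  "http://localhost:7860/predict",
  query = list(
    sepal_length = 5.1,
    sepal_width  = 3.5,
    petal_length = 1.4,
    petal_width  = 0.2
  )
)

content(resp)
# Expected shape of the response body:
# $success    TRUE
# $prediction "setosa"
```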

Deployment and Accessibility

The API is designed to run locally or inside a Docker container. While the R implementation was not deployed as a public cloud service, it is fully containerized and can be deployed on any compatible platform if required.

Summary

This project successfully demonstrates a complete end-to-end MLOps workflow implemented in R. The pipeline covers data ingestion, preprocessing, feature engineering, model training, evaluation, deployment, and validation, while maintaining reproducibility and modularity.

The report illustrates how Python-based MLOps concepts can be systematically adapted to the R ecosystem without compromising architectural rigor or depth.

Appendix

Data Validation and Feature Engineering (R)



```r
library(dplyr)

validate_original_schema <- function(df) {
  
  required_cols <- c(
    "sepal_length",
    "sepal_width",
    "petal_length",
    "petal_width",
    "class"
  )
  
  if (!all(required_cols %in% colnames(df))) {
    stop("Missing required columns")
  }
  
  if (any(df$sepal_length <= 0) ||
      any(df$sepal_width <= 0) ||
      any(df$petal_length <= 0) ||
      any(df$petal_width <= 0)) {
    stop("All numeric values must be > 0")
  }
  
  allowed_classes <- c("setosa", "versicolor", "virginica")
  
  if (!all(df$class %in% allowed_classes)) {
    stop("Invalid class labels")
  }
}

create_features <- function(df) {
  
  df_features <- df %>%
    mutate(
      petal_to_sepal_length = petal_length / sepal_length,
      petal_to_sepal_width  = petal_width / sepal_width,
      sepal_ratio           = sepal_length / sepal_width
    ) %>%
    select(
      petal_to_sepal_length,
      petal_to_sepal_width,
      petal_length,
      petal_width,
      sepal_ratio,
      class
    )
  
  return(df_features)
}

validate_feature_schema <- function(df) {
  
  required_cols <- c(
    "petal_to_sepal_length",
    "petal_to_sepal_width",
    "petal_length",
    "petal_width",
    "sepal_ratio",
    "class"
  )
  
  if (!all(required_cols %in% colnames(df))) {
    stop("Feature schema mismatch")
  }
}
```

Model Training Pipeline (R)

```r
library(caret)
library(tidyverse)
library(nnet)
library(randomForest)
library(rpart)
library(e1071)
library(klaR)          # backs caret's "nb" (Naive Bayes) method

# ---------------------------
# Load data
# ---------------------------
df <- read.csv("data/iris-processed.csv")
df$class <- as.factor(df$class)

# ---------------------------
# Feature Engineering (ONE SOURCE OF TRUTH)
# ---------------------------
feature_engineering <- function(df) {
  data.frame(
    petal_to_sepal_length = df$petal_length / df$sepal_length,
    petal_to_sepal_width  = df$petal_width / df$sepal_width,
    petal_length = df$petal_length,
    petal_width  = df$petal_width,
    sepal_ratio  = df$sepal_length / df$sepal_width,
    class = df$class
  )
}

df_fe <- feature_engineering(df)

# ---------------------------
# Train / Test split
# ---------------------------
set.seed(66)
idx <- createDataPartition(df_fe$class, p = 0.8, list = FALSE)
train_df <- df_fe[idx, ]
test_df  <- df_fe[-idx, ]

# ---------------------------
# Scaling (ONLY engineered features)
# ---------------------------
preproc <- preProcess(train_df[, -6], method = c("center", "scale"))
x_train <- predict(preproc, train_df[, -6])
x_test  <- predict(preproc, test_df[, -6])

y_train <- train_df$class
y_test  <- test_df$class

# ---------------------------
# Train control
# ---------------------------
ctrl <- trainControl(
  method = "cv",
  number = 5,
  classProbs = TRUE
)

# ---------------------------
# Models to compare
# ---------------------------
models <- list(
  LogisticRegression = "multinom",
  RandomForest       = "rf",
  DecisionTree       = "rpart",
  NaiveBayes         = "nb"
)

results <- list()

# ---------------------------
# Train & evaluate
# ---------------------------
for (model_name in names(models)) {
  
  cat("\n==============================\n")
  cat("Training:", model_name, "\n")
  cat("==============================\n")
  
  model <- train(
    x = x_train,
    y = y_train,
    method = models[[model_name]],
    trControl = ctrl
  )
  
  preds <- predict(model, x_test)
  preds <- factor(preds, levels = levels(y_test))
  
  cm <- confusionMatrix(preds, y_test)
  
  metrics <- list(
    accuracy  = as.numeric(cm$overall["Accuracy"]),
    recall    = mean(cm$byClass[, "Recall"]),
    precision = mean(cm$byClass[, "Precision"])
  )
  
  print(metrics)
  
  results[[model_name]] <- list(
    model = model,
    metrics = metrics
  )
}

# ---------------------------
# Select best model (Recall)
# ---------------------------
best_model_name <- names(results)[which.max(
  sapply(results, function(x) x$metrics$recall)
)]

best_model <- results[[best_model_name]]$model

# ---------------------------
# Save artifacts (IMPORTANT)
# ---------------------------
dir.create("artifacts", showWarnings = FALSE)

saveRDS(best_model, "artifacts/model.rds")
saveRDS(preproc, "artifacts/scaler.rds")
saveRDS(colnames(x_train), "artifacts/feature_names.rds")

cat("\n=====================================\n")
cat("Best Model Selected:", best_model_name, "\n")
cat("Artifacts saved in /artifacts\n")
cat("=====================================\n")

API Serving (Plumber)

library(plumber)
library(here)

# ---------------------------
# Load artifacts
# ---------------------------
model         <- readRDS(here("artifacts", "model.rds"))
scaler        <- readRDS(here("artifacts", "scaler.rds"))
feature_names <- readRDS(here("artifacts", "feature_names.rds"))

#* @apiTitle Iris Classification API (R + Plumber)

#* Health check
#* @get /health
function() {
  list(status = "ok")
}

#* Predict a single sample
#* @post /predict
function(sepal_length,
         sepal_width,
         petal_length,
         petal_width) {
  
  tryCatch({
    
    # Convert inputs
    sepal_length <- as.numeric(sepal_length)
    sepal_width  <- as.numeric(sepal_width)
    petal_length <- as.numeric(petal_length)
    petal_width  <- as.numeric(petal_width)
    
    if (any(is.na(c(sepal_length, sepal_width, petal_length, petal_width))))
      stop("All inputs must be numeric")
    
    if (sepal_length <= 0 || sepal_width <= 0)
      stop("sepal_length and sepal_width must be > 0")
    
    # Feature engineering (IDENTICAL to training)
    input_df <- data.frame(
      petal_to_sepal_length = petal_length / sepal_length,
      petal_to_sepal_width  = petal_width / sepal_width,
      petal_length = petal_length,
      petal_width  = petal_width,
      sepal_ratio  = sepal_length / sepal_width
    )
    
    # Enforce schema
    input_df <- input_df[, feature_names, drop = FALSE]
    
    # Scale + predict
    input_scaled <- predict(scaler, input_df)
    pred <- predict(model, input_scaled)
    
    list(
      success = TRUE,
      prediction = as.character(pred)
    )
    
  }, error = function(e) {
    list(
      success = FALSE,
      error = e$message
    )
  })
}

#* Batch prediction using CSV
#* @post /predict-csv
function(file) {
  
  tryCatch({
    
    df <- read.csv(file$datapath)
    
    required_cols <- c(
      "sepal_length",
      "sepal_width",
      "petal_length",
      "petal_width"
    )
    
    missing <- setdiff(required_cols, colnames(df))
    if (length(missing) > 0)
      stop(paste("Missing columns:", paste(missing, collapse = ", ")))
    
    features <- data.frame(
      petal_to_sepal_length = df$petal_length / df$sepal_length,
      petal_to_sepal_width  = df$petal_width / df$sepal_width,
      petal_length = df$petal_length,
      petal_width  = df$petal_width,
      sepal_ratio  = df$sepal_length / df$sepal_width
    )
    
    features <- features[, feature_names, drop = FALSE]
    features_scaled <- predict(scaler, features)
    
    df$prediction <- predict(model, features_scaled)
    df
    
  }, error = function(e) {
    list(success = FALSE, error = e$message)
  })
}
```

Containerization (Dockerfile)

```dockerfile
FROM rocker/r-ver:4.3.2

WORKDIR /app

# klaR backs caret's "nb" model; here is used by the Plumber API script
RUN R -e "install.packages(c('caret','plumber','nnet','randomForest','rpart','e1071','klaR','here','mlflow','tidyverse'))"

COPY . .

EXPOSE 7860

CMD ["R", "-e", "pr <- plumber::plumb('api/plumber.R'); pr$run(host='0.0.0.0', port=7860)"]