SECTION 1 — INTRODUCTION

In this project, I developed an end-to-end MLOps pipeline in R to solve a real-world machine learning problem. The aim was not only to build accurate models but also to follow a structured workflow similar to industry practices, including data ingestion, validation, preprocessing, model training, evaluation, experiment tracking, and CI/CD automation.

The dataset used in this project is the Adult Income dataset from the UCI Machine Learning Repository. It contains demographic and employment-related attributes such as age, education, occupation, working hours, and marital status.

Machine Learning Task

The task is binary classification: predict whether a person earns more than 50K per year.

Classes:
- Low Income (≤ 50K)
- High Income (> 50K)

SECTION 2 — MLOps WORKFLOW MAPPING

This project follows a structured Machine Learning Operations (MLOps) lifecycle.

MLOps Stage	Implementation in This Project
Problem Definition	Income classification problem defined
Data Ingestion	Dataset automatically loaded using script
Data Validation	Schema and structure checks performed
Data Preprocessing	Encoding, cleaning, and splitting
Model Training	Logistic Regression, Random Forest, GBM
Model Evaluation	Accuracy measured on test dataset
Experiment Tracking	Metrics saved in JSON format
Reproducibility	Fixed random seed and script-based pipeline
Deployment Readiness	Handling unseen categorical levels
Monitoring (Concept)	Model can be re-evaluated on new data

SECTION 3 — DATA INGESTION & UNDERSTANDING

library(readr)
data <- read_csv("data/raw/adult.csv", show_col_types = FALSE)
head(data)

## # A tibble: 6 × 15
##     age workclass       fnlwgt education education_num marital_status occupation
##   <dbl> <chr>            <dbl> <chr>             <dbl> <chr>          <chr>     
## 1    39 State-gov        77516 Bachelors            13 Never-married  Adm-cleri…
## 2    50 Self-emp-not-i…  83311 Bachelors            13 Married-civ-s… Exec-mana…
## 3    38 Private         215646 HS-grad               9 Divorced       Handlers-…
## 4    53 Private         234721 11th                  7 Married-civ-s… Handlers-…
## 5    28 Private         338409 Bachelors            13 Married-civ-s… Prof-spec…
## 6    37 Private         284582 Masters              14 Married-civ-s… Exec-mana…
## # ℹ 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## #   capital_gain <dbl>, capital_loss <dbl>, hours_per_week <dbl>,
## #   native_country <chr>, income <chr>

Interpretation
read_csv() loads the raw file from data/raw/adult.csv as the first step of the pipeline. The preview shows the first six rows of a 15-column table.

str(data)

## spc_tbl_ [32,561 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age           : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : num [1:32561] 77516 83311 215646 234721 338409 ...
##  $ education     : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num : num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation    : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship  : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race          : chr [1:32561] "White" "White" "White" "Black" ...
##  $ sex           : chr [1:32561] "Male" "Male" "Male" "Male" ...
##  $ capital_gain  : num [1:32561] 2174 0 0 0 0 ...
##  $ capital_loss  : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr [1:32561] "<=50K" "<=50K" "<=50K" "<=50K" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   workclass = col_character(),
##   ..   fnlwgt = col_double(),
##   ..   education = col_character(),
##   ..   education_num = col_double(),
##   ..   marital_status = col_character(),
##   ..   occupation = col_character(),
##   ..   relationship = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   capital_gain = col_double(),
##   ..   capital_loss = col_double(),
##   ..   hours_per_week = col_double(),
##   ..   native_country = col_character(),
##   ..   income = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

summary(data)

##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education_num   marital_status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital_gain    capital_loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours_per_week  native_country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Interpretation
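The dataset has 32,561 rows and 15 columns: six numeric features (age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week), eight character features, and the character column income, which is the target variable.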
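
The schema check listed under Data Validation is not shown in this report, so the following is only a minimal sketch of what such a check could look like. The helper validate_schema() and the expected_schema vector are hypothetical names introduced for illustration. The expected types are copied from the str() output above, and the demo uses a three-column toy data frame rather than the real file.

```r
# Hypothetical expected schema, taken from the str() output above.
expected_schema <- c(
  age = "numeric", workclass = "character", fnlwgt = "numeric",
  education = "character", education_num = "numeric",
  marital_status = "character", occupation = "character",
  relationship = "character", race = "character", sex = "character",
  capital_gain = "numeric", capital_loss = "numeric",
  hours_per_week = "numeric", native_country = "character",
  income = "character"
)

# validate_schema() is an illustrative helper, not part of any package.
# It returns a character vector of problems; an empty vector means the data passed.
validate_schema <- function(df, schema) {
  problems <- character(0)
  missing_cols <- setdiff(names(schema), names(df))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  for (col in intersect(names(schema), names(df))) {
    actual <- if (is.numeric(df[[col]])) "numeric" else class(df[[col]])[1]
    if (actual != schema[[col]]) {
      problems <- c(problems,
                    sprintf("column %s is %s, expected %s", col, actual, schema[[col]]))
    }
  }
  problems
}

# Toy demo on a three-column subset of the schema (made-up rows).
demo_schema <- expected_schema[c("age", "workclass", "income")]
good <- data.frame(age = c(39, 50), workclass = c("State-gov", "Private"),
                   income = c("<=50K", "<=50K"), stringsAsFactors = FALSE)
bad  <- data.frame(age = c("39", "50"), workclass = c("State-gov", "Private"),
                   stringsAsFactors = FALSE)

good_problems <- validate_schema(good, demo_schema)  # no problems expected
bad_problems  <- validate_schema(bad, demo_schema)   # missing income, wrong type for age
```

A pipeline script could call stop() when the returned vector is non-empty, so that later stages never run on data that failed the check.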

SECTION 4 — DATA PREPROCESSING

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

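# Convert every character column (including income) to a factor
data <- data %>% mutate(across(where(is.character), as.factor))

# Recode the target: ">50K" becomes "high", everything else "low"
data$income <- trimws(data$income)  # trimws() also coerces the factor back to character
data$income <- ifelse(data$income == ">50K", "high", "low")
data$income <- factor(data$income, levels = c("low", "high"))  # "low" is the reference level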

library(caret)

## Loading required package: ggplot2

## Loading required package: lattice

set.seed(42)
train_index <- createDataPartition(data$income, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data  <- data[-train_index, ]
dim(train_data)

## [1] 26049    15

dim(test_data)

## [1] 6512   15

Interpretation
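Character columns are converted to factors and the target is recoded to low/high. createDataPartition() then makes an 80/20 split stratified on income, giving 26,049 training rows and 6,512 test rows. Because the seed is fixed at 42, the same split is produced on every run.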
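
To make the idea of a stratified split concrete, here is a small base R sketch. It is not caret's implementation of createDataPartition(); it only illustrates sampling within each class so that train and test keep the same class ratio. The helper stratified_index() and the toy data are assumptions made for this example.

```r
set.seed(42)  # same seed value as the report, used here only for the toy example

# Toy outcome with a 75/25 class imbalance (made-up data, not the Adult dataset).
toy <- data.frame(
  id = 1:1000,
  income = factor(rep(c("low", "high"), times = c(750, 250)),
                  levels = c("low", "high"))
)

# Sample a fraction p of the row indices within each class, then combine them.
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  sampled <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(sampled, use.names = FALSE))
}

train_idx <- stratified_index(toy$income, p = 0.8)
toy_train <- toy[train_idx, ]
toy_test  <- toy[-train_idx, ]

# Class proportions in each part; both should match the 75/25 ratio of the toy data.
prop.table(table(toy_train$income))
prop.table(table(toy_test$income))
```

Comparing the two proportion tables is a quick sanity check that can also be run on train_data and test_data after the real split.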

SECTION 5 — MODEL TRAINING

library(randomForest)

## randomForest 4.7-1.2

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:dplyr':
## 
##     combine

library(gbm)

## Loaded gbm 2.2.2

## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3

# Logistic regression on all predictors
logistic_model <- glm(income ~ ., data = train_data, family = "binomial")

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

# Random forest with 50 trees
rf_model <- randomForest(income ~ ., data = train_data, ntree = 50)

# gbm's bernoulli distribution needs a numeric 0/1 response,
# so add one and exclude the factor target from the predictors
train_data$income_num <- ifelse(train_data$income == "high", 1, 0)
gbm_model <- gbm(income_num ~ . - income, data = train_data,
                 distribution = "bernoulli", n.trees = 50,
                 interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)

SECTION 6 — MODEL EVALUATION & RESULTS

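# Align factor levels in the test set with those seen during training;
# values outside the training levels become NA
for (col in names(train_data)) {
  if (is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}

# Drop rows containing NA (including any unseen levels) before scoring
test_data <- na.omit(test_data)
test_data$income <- factor(test_data$income, levels = c("low", "high"))

# Logistic regression: classify as "high" when the predicted probability exceeds 0.5
pred_log <- ifelse(predict(logistic_model, test_data, type = "response") > 0.5, "high", "low")
acc_log <- mean(pred_log == test_data$income)

# Random forest: predict() returns class labels directly
pred_rf <- predict(rf_model, test_data)
acc_rf <- mean(pred_rf == test_data$income)

# GBM: threshold the predicted probability at 0.5 and compare with the 0/1 target
test_data$income_num <- ifelse(test_data$income == "high", 1, 0)
pred_gbm <- ifelse(predict(gbm_model, test_data, n.trees = 50, type = "response") > 0.5, 1, 0)
acc_gbm <- mean(pred_gbm == test_data$income_num)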

acc_log

## [1] 0.846898

acc_rf

## [1] 0.8559582

acc_gbm

## [1] 0.854269
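
The level-alignment loop at the start of this section is what the workflow table calls handling unseen categorical levels. The toy example below sketches the effect on a single column. The values are made up for illustration, including the unseen category "Never-worked", and are not taken from the report's data.

```r
# Levels learned at training time (toy values, not the full Adult dataset).
train_workclass <- factor(c("Private", "State-gov", "Private", "Self-emp-inc"))

# New data containing one category the model never saw ("Never-worked").
new_data <- data.frame(
  age = c(25, 41, 33),
  workclass = c("Private", "Never-worked", "State-gov"),
  stringsAsFactors = FALSE
)

# Re-level against the training levels: unseen values become NA.
new_data$workclass <- factor(new_data$workclass, levels = levels(train_workclass))
n_unseen <- sum(is.na(new_data$workclass))

# Drop the rows that cannot be scored, as the report does with na.omit().
scorable <- na.omit(new_data)
```

Because the report converts columns to factors before splitting, train and test already share the same levels, so the loop mainly matters when the models score new data. Dropping rows also means those records get no prediction. Mapping unseen values to a catch-all level would be one alternative, but that is a design choice outside this project.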

SECTION 7 — EXPERIMENT TRACKING

library(jsonlite)
metrics <- list(logistic_accuracy = acc_log,
                rf_accuracy = acc_rf,
                gbm_accuracy = acc_gbm)
write_json(metrics, "metrics/metrics.json", pretty = TRUE)
metrics

## $logistic_accuracy
## [1] 0.846898
## 
## $rf_accuracy
## [1] 0.8559582
## 
## $gbm_accuracy
## [1] 0.854269
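
One detail worth checking in the saved file: as far as I know, jsonlite's defaults write length-one vectors as JSON arrays and round numbers to four decimal digits. The sketch below shows one way to store plain scalars at full precision and read them back. The temporary file path is an assumption made so the example does not touch metrics/metrics.json.

```r
library(jsonlite)

# Accuracy values copied from the output above.
metrics <- list(
  logistic_accuracy = 0.846898,
  rf_accuracy       = 0.8559582,
  gbm_accuracy      = 0.854269
)

# Temporary file used for illustration; the report writes to metrics/metrics.json.
metrics_path <- tempfile(fileext = ".json")

# auto_unbox = TRUE writes scalars as 0.85 rather than [0.85];
# digits = NA keeps full numeric precision instead of the default rounding.
write_json(metrics, metrics_path, pretty = TRUE, auto_unbox = TRUE, digits = NA)

# Read the file back to confirm the values survive the round trip.
restored <- read_json(metrics_path)
isTRUE(all.equal(restored$rf_accuracy, metrics$rf_accuracy))
```

Reading the file back in a later run is also a simple way to compare new metrics against the previous experiment.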

SECTION 8 — CI/CD PIPELINE IMPLEMENTATION

This project implements Continuous Integration and Continuous Deployment (CI/CD) for the machine learning pipeline using GitHub and GitHub Actions. Whenever code is pushed to the repository, the entire ML workflow runs automatically without manual intervention.

Workflow Automation

A workflow configuration file (ml_pipeline.yml) is created inside:

.github/workflows/

This file defines the automated ML pipeline steps executed in a cloud environment.

Pipeline Stages Executed Automatically

Step	Purpose
Checkout Code	Downloads the latest version of project files
Setup R Environment	Installs R on the runner machine
Install Dependencies	Installs required R packages
Data Ingestion	Loads dataset automatically
Data Validation	Checks schema and data integrity
Preprocessing	Cleans and prepares features
Model Training	Trains Logistic Regression, Random Forest, and GBM models
Evaluation	Computes accuracy metrics

Continuous Integration (CI)

CI is achieved by automatically running the ML pipeline whenever new code is pushed. If data validation fails or a script raises an error, the workflow stops immediately, so no model is trained or reported from a run that failed an earlier stage.
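
The fail-fast behaviour described above can be sketched in R as a small stage runner. run_pipeline() and the three toy stages are hypothetical stand-ins for the project's scripts, which are not shown in this report. The validation stage fails on purpose to show that training is never reached.

```r
# run_pipeline() is an illustrative helper: it runs named stage functions in
# order and stops at the first one that raises an error.
run_pipeline <- function(stages) {
  completed <- character(0)
  for (name in names(stages)) {
    result <- tryCatch(
      { stages[[name]](); NULL },              # NULL signals success
      error = function(e) conditionMessage(e)  # error message signals failure
    )
    if (!is.null(result)) {
      message("Stage '", name, "' failed: ", result)
      return(list(ok = FALSE, completed = completed, failed = name))
    }
    completed <- c(completed, name)
  }
  list(ok = TRUE, completed = completed, failed = NA_character_)
}

# Toy stages standing in for the real scripts; validation fails on purpose.
stages <- list(
  ingest   = function() invisible(TRUE),
  validate = function() stop("income column is missing"),
  train    = function() invisible(TRUE)
)

status <- run_pipeline(stages)

# In a CI job, a non-zero exit code is what marks the run as failed:
# if (!status$ok) quit(status = 1)
```

A script run with Rscript already exits with a non-zero status on an uncaught error, which is enough for GitHub Actions to mark the step as failed. The explicit runner only adds a record of which stage broke.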

Continuous Deployment (CD)

Although the project does not deploy a web application, CD is demonstrated by automatically generating:

- Trained model files
- Performance metrics
- Updated experiment results

This simulates deployment readiness, where models can be moved to production after successful validation.
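
How the trained model files are written is not shown in this report. A common base R approach is saveRDS() and readRDS(), sketched below under that assumption. The model is a small stand-in fitted on the built-in mtcars data, and the artifact directory and file name are hypothetical.

```r
# Small stand-in model on a built-in dataset (mtcars), not the Adult data.
toy_model <- glm(am ~ wt, data = mtcars, family = "binomial")

# Hypothetical artifact directory; a temporary folder is used for the sketch.
artifact_dir <- file.path(tempdir(), "models")
dir.create(artifact_dir, showWarnings = FALSE, recursive = TRUE)
model_path <- file.path(artifact_dir, "logistic_model.rds")

# Serialize the fitted model, then reload it as a later step would.
saveRDS(toy_model, model_path)
reloaded <- readRDS(model_path)

# The reloaded model should reproduce the original predictions.
original_pred <- predict(toy_model, mtcars, type = "response")
reloaded_pred <- predict(reloaded, mtcars, type = "response")
isTRUE(all.equal(original_pred, reloaded_pred))
```

Checking that a reloaded model reproduces the original predictions is a cheap test to run before treating the file as ready for production.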

Benefits of This CI/CD Setup

- Ensures pipeline reproducibility
- Detects errors early
- Automates model building
- Maintains consistent results across environments
- Reduces manual effort

Interpretation

Automating the workflow with GitHub Actions means every push re-runs the full pipeline. This keeps results repeatable and surfaces failures as soon as they are introduced, in line with real-world MLOps practice.

SECTION 9 — CONCLUSION

This project demonstrates a complete MLOps pipeline in R, including automation, validation, reproducibility, experiment tracking, and CI/CD integration. On the held-out test set, Random Forest achieved the highest accuracy (0.856), followed closely by GBM (0.854) and Logistic Regression (0.847), so the three models differ by less than one percentage point. The script-based pipeline, fixed random seed, and automated runs keep the workflow repeatable and aligned with real-world ML systems.

Adult Income MLOps Project

Mihir

2026-02-06