SECTION 1 — INTRODUCTION

In this project, I developed an end-to-end MLOps pipeline in R to solve a real-world machine learning problem. The aim was not only to build accurate models but also to follow a structured workflow similar to industry practices, including data ingestion, validation, preprocessing, model training, evaluation, experiment tracking, and CI/CD automation.

The dataset used in this project is the Adult Income dataset from the UCI Machine Learning Repository. It contains demographic and employment-related attributes such as age, education, occupation, working hours, and marital status.

Machine Learning Task

The task is binary classification: predict whether a person earns more than 50K per year.

Classes:

  • Low Income (≤ 50K)
  • High Income (> 50K)


SECTION 2 — MLOps WORKFLOW MAPPING

This project follows a structured Machine Learning Operations (MLOps) lifecycle.

MLOps Stage            Implementation in This Project
---------------------  ------------------------------------------
Problem Definition     Income classification problem defined
Data Ingestion         Dataset automatically loaded using script
Data Validation        Schema and structure checks performed
Data Preprocessing     Encoding, cleaning, and splitting
Model Training         Logistic Regression, Random Forest, GBM
Model Evaluation       Accuracy measured on test dataset
Experiment Tracking    Metrics saved in JSON format
Reproducibility        Fixed random seed and script-based pipeline
Deployment Readiness   Handling unseen categorical levels
Monitoring (Concept)   Model can be re-evaluated on new data

SECTION 3 — DATA INGESTION & UNDERSTANDING

library(readr)
data <- read_csv("data/raw/adult.csv", show_col_types = FALSE)
head(data)
## # A tibble: 6 × 15
##     age workclass       fnlwgt education education_num marital_status occupation
##   <dbl> <chr>            <dbl> <chr>             <dbl> <chr>          <chr>     
## 1    39 State-gov        77516 Bachelors            13 Never-married  Adm-cleri…
## 2    50 Self-emp-not-i…  83311 Bachelors            13 Married-civ-s… Exec-mana…
## 3    38 Private         215646 HS-grad               9 Divorced       Handlers-…
## 4    53 Private         234721 11th                  7 Married-civ-s… Handlers-…
## 5    28 Private         338409 Bachelors            13 Married-civ-s… Prof-spec…
## 6    37 Private         284582 Masters              14 Married-civ-s… Exec-mana…
## # ℹ 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## #   capital_gain <dbl>, capital_loss <dbl>, hours_per_week <dbl>,
## #   native_country <chr>, income <chr>

Interpretation
read_csv() loads the raw file data/raw/adult.csv into a tibble with 15 columns. Because ingestion is scripted, it runs as part of the pipeline without manual steps.

str(data)
## spc_tbl_ [32,561 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ age           : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
##  $ workclass     : chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
##  $ fnlwgt        : num [1:32561] 77516 83311 215646 234721 338409 ...
##  $ education     : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
##  $ education_num : num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
##  $ marital_status: chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
##  $ occupation    : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
##  $ relationship  : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
##  $ race          : chr [1:32561] "White" "White" "White" "Black" ...
##  $ sex           : chr [1:32561] "Male" "Male" "Male" "Male" ...
##  $ capital_gain  : num [1:32561] 2174 0 0 0 0 ...
##  $ capital_loss  : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
##  $ hours_per_week: num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
##  $ native_country: chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
##  $ income        : chr [1:32561] "<=50K" "<=50K" "<=50K" "<=50K" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   age = col_double(),
##   ..   workclass = col_character(),
##   ..   fnlwgt = col_double(),
##   ..   education = col_character(),
##   ..   education_num = col_double(),
##   ..   marital_status = col_character(),
##   ..   occupation = col_character(),
##   ..   relationship = col_character(),
##   ..   race = col_character(),
##   ..   sex = col_character(),
##   ..   capital_gain = col_double(),
##   ..   capital_loss = col_double(),
##   ..   hours_per_week = col_double(),
##   ..   native_country = col_character(),
##   ..   income = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
summary(data)
##       age         workclass             fnlwgt         education        
##  Min.   :17.00   Length:32561       Min.   :  12285   Length:32561      
##  1st Qu.:28.00   Class :character   1st Qu.: 117827   Class :character  
##  Median :37.00   Mode  :character   Median : 178356   Mode  :character  
##  Mean   :38.58                      Mean   : 189778                     
##  3rd Qu.:48.00                      3rd Qu.: 237051                     
##  Max.   :90.00                      Max.   :1484705                     
##  education_num   marital_status      occupation        relationship      
##  Min.   : 1.00   Length:32561       Length:32561       Length:32561      
##  1st Qu.: 9.00   Class :character   Class :character   Class :character  
##  Median :10.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :10.08                                                           
##  3rd Qu.:12.00                                                           
##  Max.   :16.00                                                           
##      race               sex             capital_gain    capital_loss   
##  Length:32561       Length:32561       Min.   :    0   Min.   :   0.0  
##  Class :character   Class :character   1st Qu.:    0   1st Qu.:   0.0  
##  Mode  :character   Mode  :character   Median :    0   Median :   0.0  
##                                        Mean   : 1078   Mean   :  87.3  
##                                        3rd Qu.:    0   3rd Qu.:   0.0  
##                                        Max.   :99999   Max.   :4356.0  
##  hours_per_week  native_country        income         
##  Min.   : 1.00   Length:32561       Length:32561      
##  1st Qu.:40.00   Class :character   Class :character  
##  Median :40.00   Mode  :character   Mode  :character  
##  Mean   :40.44                                        
##  3rd Qu.:45.00                                        
##  Max.   :99.00

Interpretation
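The dataset has 32,561 rows and 15 columns: six numeric features (age, fnlwgt, education_num, capital_gain, capital_loss, hours_per_week), eight categorical features stored as character, and income as the target variable.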
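
The workflow mapping lists schema and structure checks, but the validation script itself is not shown in this report. The following is a minimal base-R sketch of what such a check could look like. The names validate_schema, expected_schema, and toy are hypothetical, and the expected classes are taken from the str() output above (num as "numeric", chr as "character").

```r
# Hypothetical helper: compare a data frame against an expected schema.
# `expected` is a named character vector: column name -> expected class.
validate_schema <- function(df, expected) {
  problems <- character(0)

  # Columns that should exist but do not
  missing_cols <- setdiff(names(expected), names(df))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }

  # Columns that exist but have an unexpected class
  for (col in intersect(names(expected), names(df))) {
    actual <- class(df[[col]])[1]
    if (actual != expected[[col]]) {
      problems <- c(problems,
                    sprintf("column %s is %s, expected %s",
                            col, actual, expected[[col]]))
    }
  }

  if (nrow(df) == 0) problems <- c(problems, "data frame has no rows")
  problems
}

# Expected classes for a subset of columns (illustration only)
expected_schema <- c(age = "numeric", workclass = "character",
                     hours_per_week = "numeric", income = "character")

# Toy rows in the same shape as the real data
toy <- data.frame(age = c(39, 50),
                  workclass = c("State-gov", "Private"),
                  hours_per_week = c(40, 13),
                  income = c("<=50K", "<=50K"),
                  stringsAsFactors = FALSE)

issues_ok  <- validate_schema(toy, expected_schema)                     # no problems
issues_bad <- validate_schema(toy[, c("age", "income")], expected_schema)  # two missing columns
```

In a pipeline script, a non-empty result could be passed to stop() so that later stages never run on malformed data.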


SECTION 4 — DATA PREPROCESSING

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
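# Convert every character column (including income) to a factor
data <- data %>% mutate(across(where(is.character), as.factor))

# Recode the target: trimws() returns the labels as character text,
# which is mapped to "low"/"high" and stored as a factor with "low" first
data$income <- trimws(data$income)
data$income <- ifelse(data$income == ">50K", "high", "low")
data$income <- factor(data$income, levels = c("low", "high"))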
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(42)
train_index <- createDataPartition(data$income, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data  <- data[-train_index, ]
dim(train_data)
## [1] 26049    15
dim(test_data)
## [1] 6512   15

Interpretation
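All character columns are converted to factors, and the target is recoded to low/high with low as the first level. A stratified 80/20 split with a fixed seed (42) yields 26,049 training rows and 6,512 test rows, so the partition can be reproduced exactly.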
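
When the outcome is a factor, createDataPartition() samples within each class so that both partitions keep a similar class balance. The base-R sketch below illustrates that idea on a toy target; stratified_index is a hypothetical helper, not caret's implementation, and the class counts are made up for illustration.

```r
# Stratified split in base R: sample within each class separately so the
# train and test partitions keep roughly the same class proportions.
set.seed(42)  # fixed seed so the split is repeatable

# Toy imbalanced target (made-up counts, not the real data)
y <- factor(c(rep("low", 760), rep("high", 240)), levels = c("low", "high"))

stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)          # row indices per class
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(y, p = 0.8)
train_y <- y[train_idx]
test_y  <- y[-train_idx]

train_prop <- prop.table(table(train_y))
test_prop  <- prop.table(table(test_y))
```

Both train_prop and test_prop show the same share of the high class as the full toy vector, which is the property a stratified split is meant to preserve.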


SECTION 5 — MODEL TRAINING

library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## The following object is masked from 'package:dplyr':
## 
##     combine
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
logistic_model <- glm(income ~ ., data = train_data, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
rf_model <- randomForest(income ~ ., data = train_data, ntree = 50)

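# gbm's bernoulli distribution expects a numeric 0/1 response,
# so add a numeric copy of the target and exclude the factor version
train_data$income_num <- ifelse(train_data$income == "high", 1, 0)
gbm_model <- gbm(income_num ~ . - income, data = train_data,
                 distribution = "bernoulli", n.trees = 50,
                 interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)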
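
The level order of the target matters for the logistic model: with a factor response, glm() treats the first level as failure and the remaining level as success, so the fitted probabilities refer to "high". The sketch below checks this on simulated data (the variables x, y, and y_num are illustrative and not part of the project) and shows that a 0/1 copy of the target, as used for GBM, gives the same fit.

```r
set.seed(42)
n <- 200
x <- rnorm(n)

# Simulated target: larger x makes "high" more likely
p_high <- plogis(2 * x)
y <- factor(ifelse(runif(n) < p_high, "high", "low"),
            levels = c("low", "high"))

# Factor response: first level ("low") is failure, "high" is success
fit_factor <- glm(y ~ x, family = "binomial")
prob_high  <- predict(fit_factor, type = "response")  # estimated P(y == "high")

# Numeric 0/1 response, analogous to income_num
y_num <- ifelse(y == "high", 1, 0)
fit_numeric <- glm(y_num ~ x, family = "binomial")

coef(fit_factor)
coef(fit_numeric)
```

The two coefficient vectors agree, and the positive slope on x confirms that the predicted probability is the probability of the second level.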

SECTION 6 — MODEL EVALUATION & RESULTS

for(col in names(train_data)) {
  if(is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}
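
The loop above re-levels each test factor using the training levels. A small sketch of the effect on a single toy column (the values are illustrative): any category that was not seen in training becomes NA, which is what the subsequent na.omit() call removes.

```r
# Levels learned from the training data (toy example)
train_col <- factor(c("Private", "State-gov", "Private"))

# Raw test values, one of which never appeared in training
test_raw <- c("Private", "Never-worked", "State-gov")

# Re-level with the training levels: the unseen value turns into NA
test_col <- factor(test_raw, levels = levels(train_col))
test_col

# Number of rows that na.omit() would drop because of this column
n_unseen <- sum(is.na(test_col))
n_unseen
```

Because the dropped rows shrink the test set, it may be worth logging n_unseen alongside the accuracy so the evaluated sample size is known.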

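# Drop test rows containing NA (including unseen levels converted to NA above)
test_data <- na.omit(test_data)
# Keep the target's level order identical to training ("low" first)
test_data$income <- factor(test_data$income, levels = c("low", "high"))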
pred_log <- ifelse(predict(logistic_model, test_data, type = "response") > 0.5, "high", "low")
acc_log <- mean(pred_log == test_data$income)

pred_rf <- predict(rf_model, test_data)
acc_rf <- mean(pred_rf == test_data$income)

test_data$income_num <- ifelse(test_data$income == "high", 1, 0)
pred_gbm <- ifelse(predict(gbm_model, test_data, n.trees = 50, type = "response") > 0.5, 1, 0)
acc_gbm <- mean(pred_gbm == test_data$income_num)

acc_log
## [1] 0.846898
acc_rf
## [1] 0.8559582
acc_gbm
## [1] 0.854269
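
Accuracy is the only metric reported here. As a possible extension, the sketch below shows how a confusion matrix, a majority-class baseline, and per-class precision and recall could be derived in base R. The actual and predicted vectors are toy values, not predictions from the models above.

```r
lv <- c("low", "high")

# Toy labels for illustration only
actual    <- factor(c("low", "low", "low", "high", "high", "low", "high", "low"),
                    levels = lv)
predicted <- factor(c("low", "low", "high", "high", "low", "low", "high", "low"),
                    levels = lv)

# Rows are predictions, columns are true classes
cm <- table(Predicted = predicted, Actual = actual)

accuracy <- sum(diag(cm)) / sum(cm)

# Accuracy of always predicting the most frequent class
baseline <- max(table(actual)) / length(actual)

# Metrics for the "high" class
recall_high    <- cm["high", "high"] / sum(cm[, "high"])
precision_high <- cm["high", "high"] / sum(cm["high", ])

cm
c(accuracy = accuracy, baseline = baseline,
  recall_high = recall_high, precision_high = precision_high)
```

Comparing accuracy with the baseline shows how much a model adds over always predicting the majority class, which is useful when the classes are imbalanced.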

SECTION 7 — EXPERIMENT TRACKING

library(jsonlite)
metrics <- list(logistic_accuracy = acc_log,
                rf_accuracy = acc_rf,
                gbm_accuracy = acc_gbm)
write_json(metrics, "metrics/metrics.json", pretty = TRUE)
metrics
## $logistic_accuracy
## [1] 0.846898
## 
## $rf_accuracy
## [1] 0.8559582
## 
## $gbm_accuracy
## [1] 0.854269
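
Two jsonlite defaults affect the saved file: length-one vectors are written as JSON arrays, and numeric values are rounded to four decimal digits. The sketch below shows one way to write scalars at full precision and to make sure the output folder exists first. The metric names, values, and the temporary output path are placeholders for illustration, not results from this project.

```r
library(jsonlite)

# Placeholder values for illustration only
metrics <- list(model_a_accuracy = 0.8123456,
                model_b_accuracy = 0.7987654)

# Hypothetical output location; create the folder if it is missing
out_dir <- file.path(tempdir(), "metrics")
dir.create(out_dir, recursive = TRUE, showWarnings = FALSE)
out_file <- file.path(out_dir, "metrics.json")

# auto_unbox = TRUE writes scalars instead of one-element arrays;
# digits = NA keeps maximum precision instead of the default 4 digits
write_json(metrics, out_file, pretty = TRUE, auto_unbox = TRUE, digits = NA)

# Read the file back to confirm the values survived the round trip
restored <- read_json(out_file)
restored$model_a_accuracy
```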

SECTION 8 — CI/CD PIPELINE IMPLEMENTATION

This project implements Continuous Integration and Continuous Deployment (CI/CD) for the machine learning pipeline using GitHub and GitHub Actions. Whenever code is pushed to the repository, the entire ML workflow runs automatically without manual intervention.

Workflow Automation

A workflow configuration file (ml_pipeline.yml) is created inside:

.github/workflows/

This file defines the automated ML pipeline steps executed in a cloud environment.

Pipeline Stages Executed Automatically

Step                  Purpose
--------------------  ---------------------------------------------------------
Checkout Code         Downloads the latest version of project files
Setup R Environment   Installs R on the runner machine
Install Dependencies  Installs required R packages
Data Ingestion        Loads dataset automatically
Data Validation       Checks schema and data integrity
Preprocessing         Cleans and prepares features
Model Training        Trains Logistic Regression, Random Forest, and GBM models
Evaluation            Computes accuracy metrics

Continuous Integration (CI)

CI is achieved by automatically testing the ML pipeline whenever new code is pushed. If data validation fails or scripts produce an error, the workflow stops immediately. This prevents incorrect models from being built.
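
The fail-fast behaviour described above can be sketched as a small stage runner in R. This is an illustration under assumptions, not the project's actual script: run_stages and the three toy stages are hypothetical. A script run with Rscript already exits with a non-zero status on an uncaught error, which is what causes a CI job to be marked as failed.

```r
# Hypothetical stage runner: executes stages in order and stops at the
# first one that raises an error. Returns the failing stage's name,
# or NULL when every stage succeeds.
run_stages <- function(stages) {
  for (name in names(stages)) {
    message("Running stage: ", name)
    ok <- tryCatch({
      stages[[name]]()
      TRUE
    }, error = function(e) {
      message("Stage '", name, "' failed: ", conditionMessage(e))
      FALSE
    })
    if (!ok) return(name)
  }
  NULL
}

# Toy stages: validation fails, so training must never run
executed <- character(0)
stages <- list(
  ingest   = function() executed <<- c(executed, "ingest"),
  validate = function() stop("schema check failed"),
  train    = function() executed <<- c(executed, "train")
)

failed <- run_stages(stages)
failed     # name of the stage that stopped the run
executed   # stages that actually completed

# In a script run by CI, a non-zero exit status fails the job:
# if (!is.null(failed)) quit(status = 1)
```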

Continuous Deployment (CD)

Although the project does not deploy a web application, CD is demonstrated by automatically generating:

  • Trained model files
  • Performance metrics
  • Updated experiment results

This simulates deployment readiness, where models can be moved to production after successful validation.
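
One common way to produce a trained model file in R is saveRDS(), with readRDS() used to restore it later. The sketch below uses a small model fitted on the built-in mtcars data as a stand-in for the project's models; the file name and temporary folder are hypothetical.

```r
# Stand-in model fitted on a built-in dataset (not the project's model)
fit <- glm(am ~ wt, data = mtcars, family = "binomial")

# Hypothetical artifact location
model_path <- file.path(tempdir(), "logistic_model.rds")
saveRDS(fit, model_path)

# Later, for example in a separate scoring script, restore and predict
reloaded <- readRDS(model_path)
new_rows <- data.frame(wt = c(2.5, 3.5))

p_original <- predict(fit, new_rows, type = "response")
p_reloaded <- predict(reloaded, new_rows, type = "response")
```

The reloaded model returns the same predictions as the original, which is the property needed before an artifact is promoted to another environment.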

Benefits of This CI/CD Setup

  • Ensures pipeline reproducibility
  • Detects errors early
  • Automates model building
  • Maintains consistent results across environments
  • Reduces manual effort

Interpretation

This integration shows that the project follows real-world MLOps practices by automating the machine learning workflow using GitHub Actions, ensuring reliability, repeatability, and continuous testing.


SECTION 9 — CONCLUSION

This project demonstrates an end-to-end MLOps pipeline in R, covering automation, validation, reproducibility, experiment tracking, and CI/CD integration. Random Forest achieved the highest test accuracy (0.856), followed closely by GBM (0.854) and Logistic Regression (0.847). The structured, script-based pipeline keeps the workflow repeatable and aligned with how real-world ML systems are built.