In this project, I developed an end-to-end MLOps pipeline in R to solve a real-world machine learning problem. The aim was not only to build accurate models but also to follow a structured workflow similar to industry practices, including data ingestion, validation, preprocessing, model training, evaluation, experiment tracking, and CI/CD automation.
The dataset used in this project is the Adult Income dataset from the UCI Machine Learning Repository. It contains demographic and employment-related attributes such as age, education, occupation, working hours, and marital status.
The objective is a binary classification task: predicting whether a person earns more than 50K annually.

Classes:

- Low Income (≤ 50K)
- High Income (> 50K)
This project follows a structured Machine Learning Operations (MLOps) lifecycle.
| MLOps Stage | Implementation in This Project |
|---|---|
| Problem Definition | Income classification problem defined |
| Data Ingestion | Dataset automatically loaded using script |
| Data Validation | Schema and structure checks performed |
| Data Preprocessing | Encoding, cleaning, and splitting |
| Model Training | Logistic Regression, Random Forest, GBM |
| Model Evaluation | Accuracy measured on test dataset |
| Experiment Tracking | Metrics saved in JSON format |
| Reproducibility | Fixed random seed and script-based pipeline |
| Deployment Readiness | Handling unseen categorical levels |
| Monitoring (Concept) | Model can be re-evaluated on new data |
library(readr)
data <- read_csv("data/raw/adult.csv", show_col_types = FALSE)
head(data)
## # A tibble: 6 × 15
## age workclass fnlwgt education education_num marital_status occupation
## <dbl> <chr> <dbl> <chr> <dbl> <chr> <chr>
## 1 39 State-gov 77516 Bachelors 13 Never-married Adm-cleri…
## 2 50 Self-emp-not-i… 83311 Bachelors 13 Married-civ-s… Exec-mana…
## 3 38 Private 215646 HS-grad 9 Divorced Handlers-…
## 4 53 Private 234721 11th 7 Married-civ-s… Handlers-…
## 5 28 Private 338409 Bachelors 13 Married-civ-s… Prof-spec…
## 6 37 Private 284582 Masters 14 Married-civ-s… Exec-mana…
## # ℹ 8 more variables: relationship <chr>, race <chr>, sex <chr>,
## # capital_gain <dbl>, capital_loss <dbl>, hours_per_week <dbl>,
## # native_country <chr>, income <chr>
Interpretation
The dataset is automatically ingested and loaded as part of the
pipeline.
str(data)
## spc_tbl_ [32,561 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ age : num [1:32561] 39 50 38 53 28 37 49 52 31 42 ...
## $ workclass : chr [1:32561] "State-gov" "Self-emp-not-inc" "Private" "Private" ...
## $ fnlwgt : num [1:32561] 77516 83311 215646 234721 338409 ...
## $ education : chr [1:32561] "Bachelors" "Bachelors" "HS-grad" "11th" ...
## $ education_num : num [1:32561] 13 13 9 7 13 14 5 9 14 13 ...
## $ marital_status: chr [1:32561] "Never-married" "Married-civ-spouse" "Divorced" "Married-civ-spouse" ...
## $ occupation : chr [1:32561] "Adm-clerical" "Exec-managerial" "Handlers-cleaners" "Handlers-cleaners" ...
## $ relationship : chr [1:32561] "Not-in-family" "Husband" "Not-in-family" "Husband" ...
## $ race : chr [1:32561] "White" "White" "White" "Black" ...
## $ sex : chr [1:32561] "Male" "Male" "Male" "Male" ...
## $ capital_gain : num [1:32561] 2174 0 0 0 0 ...
## $ capital_loss : num [1:32561] 0 0 0 0 0 0 0 0 0 0 ...
## $ hours_per_week: num [1:32561] 40 13 40 40 40 40 16 45 50 40 ...
## $ native_country: chr [1:32561] "United-States" "United-States" "United-States" "United-States" ...
## $ income : chr [1:32561] "<=50K" "<=50K" "<=50K" "<=50K" ...
## - attr(*, "spec")=
## .. cols(
## .. age = col_double(),
## .. workclass = col_character(),
## .. fnlwgt = col_double(),
## .. education = col_character(),
## .. education_num = col_double(),
## .. marital_status = col_character(),
## .. occupation = col_character(),
## .. relationship = col_character(),
## .. race = col_character(),
## .. sex = col_character(),
## .. capital_gain = col_double(),
## .. capital_loss = col_double(),
## .. hours_per_week = col_double(),
## .. native_country = col_character(),
## .. income = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
summary(data)
## age workclass fnlwgt education
## Min. :17.00 Length:32561 Min. : 12285 Length:32561
## 1st Qu.:28.00 Class :character 1st Qu.: 117827 Class :character
## Median :37.00 Mode :character Median : 178356 Mode :character
## Mean :38.58 Mean : 189778
## 3rd Qu.:48.00 3rd Qu.: 237051
## Max. :90.00 Max. :1484705
## education_num marital_status occupation relationship
## Min. : 1.00 Length:32561 Length:32561 Length:32561
## 1st Qu.: 9.00 Class :character Class :character Class :character
## Median :10.00 Mode :character Mode :character Mode :character
## Mean :10.08
## 3rd Qu.:12.00
## Max. :16.00
## race sex capital_gain capital_loss
## Length:32561 Length:32561 Min. : 0 Min. : 0.0
## Class :character Class :character 1st Qu.: 0 1st Qu.: 0.0
## Mode :character Mode :character Median : 0 Median : 0.0
## Mean : 1078 Mean : 87.3
## 3rd Qu.: 0 3rd Qu.: 0.0
## Max. :99999 Max. :4356.0
## hours_per_week native_country income
## Min. : 1.00 Length:32561 Length:32561
## 1st Qu.:40.00 Class :character Class :character
## Median :40.00 Mode :character Mode :character
## Mean :40.44
## 3rd Qu.:45.00
## Max. :99.00
Interpretation
The dataset contains both numerical and categorical features with income
as the target variable.
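The str() and summary() calls above are inspected by eye. The schema and structure checks listed in the lifecycle table could be made automatic along the following lines. This is a minimal base R sketch, not the project's validation script: the helper name validate_schema, the expected-type vector, and the three-column toy data frame are all assumptions made for illustration.

```r
# Hypothetical helper: stop with an informative error if the data frame
# does not match the expected column names and types.
validate_schema <- function(df, expected_types) {
  missing_cols <- setdiff(names(expected_types), names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  for (col in names(expected_types)) {
    actual <- class(df[[col]])[1]
    if (actual != expected_types[[col]]) {
      stop("Column '", col, "' has type ", actual,
           ", expected ", expected_types[[col]])
    }
  }
  has_na <- vapply(df[names(expected_types)], anyNA, logical(1))
  if (any(has_na)) {
    stop("Missing values in: ", paste(names(has_na)[has_na], collapse = ", "))
  }
  invisible(TRUE)
}

# Toy data frame mimicking three of the Adult columns (values are illustrative)
toy <- data.frame(
  age = c(39, 50, 38),
  workclass = c("State-gov", "Self-emp-not-inc", "Private"),
  income = c("<=50K", "<=50K", ">50K"),
  stringsAsFactors = FALSE
)
expected <- c(age = "numeric", workclass = "character", income = "character")

ok <- validate_schema(toy, expected)

# A frame with the wrong type for age fails validation
bad <- transform(toy, age = as.character(age))
bad_result <- tryCatch(validate_schema(bad, expected),
                       error = function(e) conditionMessage(e))
```

Because the helper signals failure with stop(), an unexpected schema would halt a script run non-interactively instead of passing silently into training.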
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
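# Convert every character column to a factor so the models treat it as categorical
data <- data %>% mutate(across(where(is.character), as.factor))
# Recode the target: strip stray whitespace, then map ">50K" to "high" and all other values to "low"
data$income <- trimws(as.character(data$income))
data$income <- ifelse(data$income == ">50K", "high", "low")
data$income <- factor(data$income, levels = c("low", "high"))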
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
set.seed(42)
train_index <- createDataPartition(data$income, p = 0.8, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]
dim(train_data)
## [1] 26049 15
dim(test_data)
## [1] 6512 15
Interpretation
Converting character columns to factors prepares the categorical features for modeling,
and the seeded 80/20 split, stratified on income, makes the train-test partition reproducible.
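createDataPartition samples within each level of the outcome, so the class proportions in the training and test sets stay close to those of the full data. The base R sketch below illustrates that idea on synthetic labels. The helper name stratified_index is hypothetical, and the code is a simplified illustration, not caret's implementation.

```r
# Hypothetical helper: sample a fraction p of row indices within each class
stratified_index <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    # Indexing with sample.int() avoids sample()'s special case for length-1 input
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(42)
# Synthetic, imbalanced labels (illustrative only, not the Adult data)
y <- factor(rep(c("low", "high"), times = c(760, 240)), levels = c("low", "high"))

train_idx <- stratified_index(y, p = 0.8)
train_share <- prop.table(table(y[train_idx]))
test_share  <- prop.table(table(y[-train_idx]))

# Re-running with the same seed reproduces the same split
set.seed(42)
train_idx_again <- stratified_index(y, p = 0.8)
```

Comparing train_share and test_share is a quick way to confirm that both partitions keep the same class balance.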
library(randomForest)
## randomForest 4.7-1.2
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## The following object is masked from 'package:dplyr':
##
## combine
library(gbm)
## Loaded gbm 2.2.2
## This version of gbm is no longer under development. Consider transitioning to gbm3, https://github.com/gbm-developers/gbm3
# Logistic regression on all predictors
logistic_model <- glm(income ~ ., data = train_data, family = "binomial")
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
# Random forest with 50 trees
rf_model <- randomForest(income ~ ., data = train_data, ntree = 50)
# gbm's bernoulli distribution needs a 0/1 response, so add a numeric target
# and exclude the factor target from the predictors
train_data$income_num <- ifelse(train_data$income == "high", 1, 0)
gbm_model <- gbm(income_num ~ . - income, data = train_data,
                 distribution = "bernoulli", n.trees = 50,
                 interaction.depth = 3, shrinkage = 0.1, verbose = FALSE)
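# Align factor levels in the test set with those seen during training;
# any value not present in the training levels becomes NA
for (col in names(train_data)) {
  if (is.factor(train_data[[col]])) {
    test_data[[col]] <- factor(test_data[[col]], levels = levels(train_data[[col]]))
  }
}
# Drop test rows containing NA, so accuracy is computed on the remaining rows only
test_data <- na.omit(test_data)
test_data$income <- factor(test_data$income, levels = c("low", "high"))
# Logistic regression: classify as "high" when the predicted probability exceeds 0.5
pred_log <- ifelse(predict(logistic_model, test_data, type = "response") > 0.5, "high", "low")
acc_log <- mean(pred_log == test_data$income)
# Random forest: predict() returns class labels directly
pred_rf <- predict(rf_model, test_data)
acc_rf <- mean(pred_rf == test_data$income)
# GBM: compare thresholded probabilities against the 0/1 target
test_data$income_num <- ifelse(test_data$income == "high", 1, 0)
pred_gbm <- ifelse(predict(gbm_model, test_data, n.trees = 50, type = "response") > 0.5, 1, 0)
acc_gbm <- mean(pred_gbm == test_data$income_num)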
acc_log
## [1] 0.846898
acc_rf
## [1] 0.8559582
acc_gbm
## [1] 0.854269
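Accuracy alone can hide how a model behaves on the less frequent class. The base R sketch below shows how a confusion matrix, precision, recall, and a majority-class baseline could be computed alongside it. The label vectors are hand-made for illustration and are not predictions from the models above.

```r
# Illustrative labels and predictions (hand-made, not output of the models above)
actual <- factor(c("low", "low", "low", "low", "low", "low",
                   "high", "high", "high", "high"),
                 levels = c("low", "high"))
predicted <- factor(c("low", "low", "low", "low", "low", "high",
                      "high", "high", "low", "low"),
                    levels = c("low", "high"))

# Rows are predictions, columns are true classes
cm <- table(predicted = predicted, actual = actual)

accuracy  <- sum(diag(cm)) / sum(cm)
precision <- cm["high", "high"] / sum(cm["high", ])  # share of predicted "high" that is correct
recall    <- cm["high", "high"] / sum(cm[, "high"])  # share of true "high" that is found

# Accuracy of always predicting the most frequent class
baseline <- max(table(actual)) / length(actual)
```

Comparing a model's accuracy with the baseline shows how much of the score comes from the class imbalance itself.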
library(jsonlite)
metrics <- list(logistic_accuracy = acc_log,
rf_accuracy = acc_rf,
gbm_accuracy = acc_gbm)
write_json(metrics, "metrics/metrics.json", pretty = TRUE)
metrics
## $logistic_accuracy
## [1] 0.846898
##
## $rf_accuracy
## [1] 0.8559582
##
## $gbm_accuracy
## [1] 0.854269
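write_json passes its extra arguments to toJSON, which by default rounds numbers to 4 decimal digits and writes length-one vectors as JSON arrays. The sketch below shows one way to store the metrics as plain scalars at full precision together with some run metadata. The accuracy values are the ones reported above; the metadata fields (seed, n_trees) and the temporary output path are assumptions made for illustration.

```r
library(jsonlite)

# Accuracies as reported above, plus run metadata (the metadata fields are an assumption)
run_record <- list(
  seed = 42,
  n_trees = 50,
  metrics = list(
    logistic_accuracy = 0.846898,
    rf_accuracy = 0.8559582,
    gbm_accuracy = 0.854269
  )
)

# Write to a temporary file so the sketch does not touch metrics/metrics.json
path <- tempfile(fileext = ".json")
write_json(run_record, path, pretty = TRUE, auto_unbox = TRUE, digits = NA)

# Read the record back and pick the best model by accuracy
restored <- read_json(path)
best_model <- names(which.max(unlist(restored$metrics)))
```

Storing the seed and model settings next to the metrics makes it easier to tell which configuration produced a given result when runs are compared later.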
This project implements Continuous Integration and Continuous Deployment (CI/CD) for the machine learning pipeline using GitHub and GitHub Actions. Whenever code is pushed to the repository, the entire ML workflow runs automatically without manual intervention.
A workflow configuration file (ml_pipeline.yml) is created inside .github/workflows/.
This file defines the automated ML pipeline steps executed in a cloud environment.
| Step | Purpose |
|---|---|
| Checkout Code | Downloads the latest version of project files |
| Setup R Environment | Installs R on the runner machine |
| Install Dependencies | Installs required R packages |
| Data Ingestion | Loads dataset automatically |
| Data Validation | Checks schema and data integrity |
| Preprocessing | Cleans and prepares features |
| Model Training | Trains Logistic Regression, Random Forest, and GBM models |
| Evaluation | Computes accuracy metrics |
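A workflow covering the steps in the table could look roughly like the fragment below. This is a sketch, not the project's actual ml_pipeline.yml: the job name, the package list, and the entry-point script scripts/run_pipeline.R are assumptions, and the single "Run pipeline" step stands in for the ingestion, validation, preprocessing, training, and evaluation stages.

```yaml
# .github/workflows/ml_pipeline.yml (sketch; script path and package list are assumptions)
name: ml-pipeline

on:
  push:

jobs:
  run-pipeline:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up R
        uses: r-lib/actions/setup-r@v2
        with:
          use-public-rspm: true

      - name: Install dependencies
        run: Rscript -e 'install.packages(c("readr", "dplyr", "caret", "randomForest", "gbm", "jsonlite"))'

      # Hypothetical entry point that runs ingestion, validation,
      # preprocessing, training, and evaluation in order
      - name: Run pipeline
        run: Rscript scripts/run_pipeline.R
```

Splitting the last step into one step per stage would make the failing stage visible directly in the Actions log.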
CI is achieved by automatically testing the ML pipeline whenever new code is pushed. If data validation fails or scripts produce an error, the workflow stops immediately. This prevents incorrect models from being built.
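The stop-on-failure behaviour relies on Rscript exiting with a non-zero status when a script raises an uncaught error, which GitHub Actions reports as a failed step. The base R sketch below shows one way a pipeline script could run its stages in order and stop at the first failure. The runner name run_stages and the placeholder stages are hypothetical.

```r
# Hypothetical runner: execute named stage functions in order and stop at
# the first one that raises an error.
run_stages <- function(stages) {
  completed <- character(0)
  for (name in names(stages)) {
    result <- tryCatch({
      stages[[name]]()
      NULL
    }, error = function(e) conditionMessage(e))
    if (!is.null(result)) {
      return(list(ok = FALSE, failed = name, message = result, completed = completed))
    }
    completed <- c(completed, name)
  }
  list(ok = TRUE, failed = NA_character_, message = NA_character_, completed = completed)
}

# Placeholder stages; the validation stage fails on purpose
stages <- list(
  ingest   = function() invisible(TRUE),
  validate = function() stop("income column is missing"),
  train    = function() invisible(TRUE)
)

outcome <- run_stages(stages)

# In a real pipeline script, a failed run would end the process with a
# non-zero exit code, for example:
# if (!outcome$ok) quit(status = 1)
```

Here the train stage never runs, because the runner returns as soon as validate fails.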
Although the project does not deploy a web application, CD is demonstrated by automatically generating pipeline outputs on every run, such as the accuracy metrics saved to metrics/metrics.json.
This simulates deployment readiness, where models can be moved to production after successful validation.
Interpretation
This integration shows that the project follows real-world MLOps practices by automating the machine learning workflow using GitHub Actions, ensuring reliability, repeatability, and continuous testing.
This project demonstrates a complete MLOps pipeline in R, including automation, validation, reproducibility, experiment tracking, and CI/CD integration. Random Forest achieved the highest test accuracy (0.856), with GBM (0.854) and Logistic Regression (0.847) close behind. The structured pipeline ensures the workflow is reliable and aligned with real-world ML systems.