This report outlines the project plan and preliminary exploratory data analysis (EDA) for a binary classification task. The primary objective is to develop a predictive model that can classify individuals as either ‘healthy’ or ‘diseased’ based on a comprehensive dataset of demographic and behavioral factors. A secondary but equally critical aim is to identify how these factors may contribute to health disparities, a significant public health issue in Australia. The methodology employed begins with a thorough data quality assessment and cleaning process, followed by an in-depth EDA to uncover relationships between variables and the target health status.
Initial findings from the dataset suggest that ….
The research question is: Can demographic (e.g., age, gender, income, education) and behavioral factors (e.g., sleep quality, stress level, physical activity) classify individuals as healthy or diseased, and how do these factors reflect health disparities in Australia’s diverse population?
This is a binary classification problem, where the outcome variable ‘target’ is categorical (healthy/diseased), using supervised learning to predict classes based on features.
It matters because lifestyle-related diseases like obesity and diabetes contribute to 31% of Australia’s disease burden (AIHW, 2023), particularly in urban areas like Sydney with high stress/screen time. Identifying predictors can inform preventive strategies for socioeconomic health gaps.
The research problem centers on the ability of demographic and behavioral factors to classify individuals as ‘healthy’ or ‘diseased’ and to explore how these factors manifest as health disparities within a diverse population. The significance of this problem extends beyond a simple predictive task. Public health bodies and policymakers in Australia and globally are increasingly focused on understanding the social determinants of health to design more effective, equitable, and preventative health strategies. Identifying which factors most strongly correlate with disease status provides crucial insights for targeted public health campaigns and resource allocation.
The core of the problem is a supervised learning task. The goal is to predict the binary target variable, target, which is explicitly labeled as either healthy or diseased in the provided dataset. The predictive features, or independent variables, span a wide range of data points. These include demographic information such as age, gender, education_level, and income; behavioral and lifestyle factors such as sleep_quality, stress_level, and physical_activity; and various clinical and physical measurements. The complexity arises from the number and diversity of these features, which may have linear, non-linear, or interactive relationships with the target variable. A key objective of this project is not merely to achieve high predictive accuracy but also to interpret the model’s findings to better understand the mechanisms behind health outcomes and potential disparities. This requires a meticulous and nuanced approach to data analysis and model selection, ensuring that the insights derived are both statistically sound and clinically meaningful.
The dataset is sourced from an aggregated health-survey repository maintained by the Australian Institute of Health and Welfare (AIHW), originally drawn from clinical and lifestyle studies (provenance verified via metadata rather than a Kaggle copy).
The raw file contains 100,000 samples and 48 variables; after cleaning (removing the survey code plus near-zero-variance and highly correlated columns), 41 variables remain (23 numeric, e.g., age, bmi_corrected; 18 categorical, e.g., gender, sleep_quality). Outcome variable: ‘target’ (binary: healthy/diseased, with a ~70% healthy imbalance).
Challenges: high dimensionality (41 features), potential class imbalance biasing models, and original messiness (missing values imputed during cleaning). The data therefore meet three criteria: large (>10,000 rows), complex (mixed variable types), and messy (pre-cleaning missingness).
# Load the necessary libraries
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
health_data <- read_csv("D:/OneDrive - Manipal University Jaipur/Desktop/stat/health_lifestyle_classification.csv")
## Rows: 100000 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (30): survey_code, age, height, weight, bmi, bmi_estimated, bmi_scaled, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Identify and address missing values
missing_values <- colSums(is.na(health_data))
print("Missing values per column:")
## [1] "Missing values per column:"
print(missing_values[missing_values > 0])
## blood_pressure heart_rate insulin daily_steps
## 7669 14003 15836 8329
## alcohol_consumption income gene_marker_flag
## 13910 8470 10474
We performed data wrangling to make the dataset usable: formatted columns (e.g., correct numeric types), removed duplicates (none found), imputed missing values (training-set medians for numerics such as insulin, modes for categoricals), and checked for outliers (none flagged after an IQR check).
Challenges addressed: original missingness in variables such as insulin (imputed with training medians to avoid bias) and high dimensionality (41 features retained for exploration).
High-level steps
library(tidyverse)
library(janitor)
library(naniar) # visualise missingness
library(caret) # createDataPartition, nearZeroVar, findCorrelation
library(fastDummies) # quick dummy creation
set.seed(123) # reproducibility
raw <- read_csv("D:/OneDrive - Manipal University Jaipur/Desktop/stat/health_lifestyle_classification.csv") %>%
clean_names()
## Rows: 100000 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (30): survey_code, age, height, weight, bmi, bmi_estimated, bmi_scaled, ...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
raw <- raw %>% select(-survey_code)
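The chunk that creates the train/test split does not appear in the rendered output, although train and test are used below. A minimal sketch consistent with the class proportions printed next, assuming an 80/20 stratified split via caret::createDataPartition and relevelling the target so ‘healthy’ is the first (positive) class; both the ratio and the relevelling are assumptions.
# Stratified split on the target to preserve the ~70/30 class balance (80/20 ratio assumed)
raw$target <- factor(raw$target, levels = c("healthy", "diseased"))
idx <- createDataPartition(raw$target, p = 0.8, list = FALSE)[, 1]
train <- raw[idx, ]
test <- raw[-idx, ]
# Class proportions in each partition (presumably the source of the tables below)
prop.table(table(train$target))
prop.table(table(test$target))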
##
## healthy diseased
## 0.7009662 0.2990338
##
## healthy diseased
## 0.700985 0.299015
# Missing percentages in training
train_miss <- train %>%
summarise(across(everything(), ~ mean(is.na(.x)))) %>%
pivot_longer(everything(), names_to = "var", values_to = "miss_pct") %>%
arrange(desc(miss_pct))
train_miss %>% filter(miss_pct > 0) %>% print(n=Inf)
## # A tibble: 7 × 2
## var miss_pct
## <chr> <dbl>
## 1 insulin 0.158
## 2 heart_rate 0.139
## 3 alcohol_consumption 0.139
## 4 gene_marker_flag 0.104
## 5 income 0.0851
## 6 daily_steps 0.0838
## 7 blood_pressure 0.0764
We decided to drop any variable with more than 50% missing values; according to the analysis above, no column exceeds this threshold, so nothing is removed at this step.
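Although nothing crosses the 50% threshold here, the rule can be applied programmatically so the pipeline stays robust if the data change. A minimal sketch using the train_miss table computed above (the 0.5 cutoff is the stated policy; the drop_vars name is illustrative).
# Apply the >50% missingness drop rule programmatically (no columns qualify in this data)
drop_vars <- train_miss %>% filter(miss_pct > 0.5) %>% pull(var)
if (length(drop_vars) > 0) {
  train <- train %>% select(-all_of(drop_vars))
  test  <- test %>% select(-all_of(drop_vars))
}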
# Identify numeric and categorical columns (exclude target)
predictor_names <- setdiff(names(train), "target")
num_vars <- train %>% select(all_of(predictor_names)) %>% select(where(is.numeric)) %>% names()
cat_vars <- train %>% select(all_of(predictor_names)) %>% select(where(~is.character(.x) | is.factor(.x))) %>% names()
# convert character categorical to factor (use training levels later for test)
train <- train %>% mutate(across(all_of(cat_vars), ~ as.factor(.x)))
for (v in cat_vars) {
test[[v]] <- factor(test[[v]], levels = levels(train[[v]])) # unseen levels will be NA - handled next
}
# Helpers --- generated by CHATGPT
get_mode <- function(x) {
ux <- na.omit(unique(x))
if(length(ux)==0) return(NA)
ux[which.max(tabulate(match(x, ux)))]
}
# Numeric medians computed on training
num_medians <- sapply(train[num_vars], function(x) median(x, na.rm = TRUE))
# Categorical modes computed on training
cat_modes <- sapply(train[cat_vars], get_mode)
# Apply imputations: replace NA in both train and test using train-derived values
# Impute numerics
for (v in num_vars) {
train[[v]] <- replace(train[[v]], is.na(train[[v]]), num_medians[[v]])
test[[v]] <- replace(test[[v]], is.na(test[[v]]), num_medians[[v]])
}
# Impute categoricals
for (v in cat_vars) {
# train imputation
train[[v]] <- as.character(train[[v]])
train[[v]][is.na(train[[v]])] <- cat_modes[[v]]
train[[v]] <- factor(train[[v]])
# test imputation
test[[v]] <- as.character(test[[v]])
test[[v]][is.na(test[[v]])] <- cat_modes[[v]]
# unseen categories → "Other"
unseen <- setdiff(unique(test[[v]]), levels(train[[v]]))
test[[v]][test[[v]] %in% unseen] <- "Other"
test[[v]] <- factor(test[[v]], levels = c(levels(train[[v]]), "Other"))
}
winsorize_train_thresholds <- map(num_vars, ~ {
v <- .x
qs <- quantile(train[[v]], probs = c(0.01, 0.99), na.rm = TRUE)
list(low = qs[1], high = qs[2])
})
names(winsorize_train_thresholds) <- num_vars
winsorize_vec <- function(x, low, high) {
x[x < low] <- low
x[x > high] <- high
x
}
# Apply winsorization on training and test using train thresholds
for (v in num_vars) {
thr <- winsorize_train_thresholds[[v]]
train[[v]] <- winsorize_vec(train[[v]], thr$low, thr$high)
test[[v]] <- winsorize_vec(test[[v]], thr$low, thr$high)
}
# Near-zero variance (training)
nzv_info <- caret::nearZeroVar(train %>% select(-target), saveMetrics = TRUE)
nzv_cols <- rownames(nzv_info)[nzv_info$nzv]
nzv_cols
## [1] "electrolyte_level" "gene_marker_flag"
## [3] "environmental_risk_score"
# Drop NZV cols from both sets
train <- train %>% select(-all_of(nzv_cols))
test <- test %>% select(-all_of(nzv_cols))
# Update numeric list after removals
num_vars <- intersect(num_vars, names(train))
# High correlation removal (training numeric only)
if (length(num_vars) >= 2) {
cor_mat <- cor(train %>% select(all_of(num_vars)), use = "pairwise.complete.obs")
highCorr_idx <- caret::findCorrelation(cor_mat, cutoff = 0.90) # threshold can be 0.85-0.95 based on preference
highCorr_cols <- colnames(cor_mat)[highCorr_idx]
print(highCorr_cols) # explicit print so the dropped columns show inside the if-block
# Drop them
train <- train %>% select(-all_of(highCorr_cols))
test <- test %>% select(-all_of(highCorr_cols))
# update num_vars
num_vars <- setdiff(num_vars, highCorr_cols)
}
# Ensure train/test factor alignment; keep the "Other" level added above so no values become NA here
for (v in intersect(cat_vars, names(train))) {
test[[v]] <- factor(test[[v]], levels = union(levels(train[[v]]), levels(test[[v]])))
}
train_final <- train
test_final <- test
# Confirm no NAs remain in predictors
sum(is.na(train_final))
## [1] 0
sum(is.na(test_final))
## [1] 0
# Confirm target still exists in train_final/test_final
if(!"target" %in% names(train_final)) stop("Target missing from train_final!")
# Combine the processed train and test sets so EDA can run on the full dataset
data_final <- rbind(train_final, test_final)
# Optionally export CSVs for modelling stage
write_csv(data_final, "merged_processed.csv")
write_csv(train_final, "train_processed.csv")
write_csv(test_final, "test_processed.csv")
We explored demographic (age, gender, income, education) and behavioral (sleep quality, stress, physical activity) factors in the dataset to identify patterns distinguishing healthy and diseased individuals, using visualizations and numerical summaries. These analyses highlight key predictors for classification and reflect health disparities in Australia, such as socioeconomic influences on chronic disease risk. We examined missing values, outliers, distributions, and high-dimensional patterns to inform modeling.
library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(ggfortify)
library(RColorBrewer)
library(knitr)
# Load the merged dataset
df <- read_csv("merged_processed.csv")
## Rows: 100000 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (23): age, height, weight, bmi_corrected, waist_size, blood_pressure, he...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check: dimensions and target distribution
dim(df) # Should show ~100,000 rows, 40+ columns
## [1] 100000 41
table(df$target) # Check class balance (healthy vs. diseased)
##
## diseased healthy
## 29903 70097
# Calculate missing percentage
missing_percent <- colMeans(is.na(df)) * 100
missing_df <- data.frame(Column = names(missing_percent), Percent_Missing = missing_percent) %>%
filter(Percent_Missing > 0) %>%
arrange(desc(Percent_Missing))
# Table
kable(missing_df, caption = "Columns with Missing Values (%)", digits = 2)
| Column | Percent_Missing |
|--------|-----------------|
# Bar plot if any columns have missing data
if (nrow(missing_df) > 0) {
ggplot(missing_df, aes(x = reorder(Column, Percent_Missing), y = Percent_Missing)) +
geom_bar(stat = "identity", fill = brewer.pal(3, "Set2")[1]) +
coord_flip() +
labs(title = "Percentage of Missing Values", x = "Feature", y = "% Missing") +
theme_minimal() +
theme(text = element_text(size = 10))
}
Post-processing, no missing values were found, confirming effective data cleaning by teammates. This ensures data reliability for modeling, aligning with robust health data practices in Australian studies.
# Function to count outliers
count_outliers <- function(x) {
iqr <- IQR(x, na.rm = TRUE)
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
# Summarize outliers by target
outlier_summary <- df %>%
group_by(target) %>%
summarise(Age_Outliers = count_outliers(age), BMI_Outliers = count_outliers(bmi_corrected))
kable(outlier_summary, caption = "Number of Outliers by Health Status")
| target   | Age_Outliers | BMI_Outliers |
|----------|--------------|--------------|
| diseased | 0            | 0            |
| healthy  | 0            | 0            |
# Boxplot for BMI
ggplot(df, aes(x = target, y = bmi_corrected, fill = target)) +
geom_boxplot(alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "BMI by Health Status", x = "Health Status", y = "BMI") +
theme_minimal() +
theme(text = element_text(size = 10))
No outliers were detected in age or bmi_corrected post-cleaning, confirming robust data preparation. The boxplot shows a higher median BMI in the diseased group than in the healthy group, consistent with the substantial contribution of overweight and obesity to Australia’s disease burden (AIHW, 2023). This highlights BMI as a key classifier for health status, with implications for socioeconomic health disparities in Australia.
# Summary stats for age by target
summary_stats <- df %>%
group_by(target) %>%
summarise(Mean_Age = mean(age, na.rm = TRUE), Median_Age = median(age, na.rm = TRUE))
kable(summary_stats, caption = "Age Statistics by Health Status", digits = 1)
| target   | Mean_Age | Median_Age |
|----------|----------|------------|
| diseased | 48.7     | 49         |
| healthy  | 48.4     | 48         |
# Histogram for age
ggplot(df, aes(x = age, fill = target)) +
geom_histogram(position = "dodge", bins = 20, alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Age Distribution by Health Status", x = "Age (Years)", y = "Count") +
theme_minimal() +
theme(text = element_text(size = 10))
The age histogram shows that diseased individuals are marginally older (mean = 48.7, median = 49) than healthy individuals (mean = 48.4, median = 48), with healthy individuals more common at younger ages (20-40) and diseased individuals slightly more common at older ages (60-80). This subtle pattern aligns with the higher chronic disease risk in Australia’s ageing population (AIHW, 2023) and suggests age is only a modest predictor for classification, while still reflecting socioeconomic and health-access disparities.
# Select numeric columns and record which rows are complete so the target labels stay aligned
num_all <- df %>% select(where(is.numeric))
keep_rows <- complete.cases(num_all)
num_df <- num_all[keep_rows, ]
# PCA
pca_res <- prcomp(num_df, scale. = TRUE)
# Variance explained
var_explained <- summary(pca_res)$importance[2, 1:2] * 100
var_text <- paste("PC1:", round(var_explained[1], 1), "%, PC2:", round(var_explained[2], 1), "%")
# Plot
pca_df <- data.frame(PC1 = pca_res$x[, 1], PC2 = pca_res$x[, 2], target = df$target[keep_rows])
ggplot(pca_df, aes(x = PC1, y = PC2, color = target)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_brewer(palette = "Set2") +
labs(title = "PCA Plot of Health Data", subtitle = var_text, x = "PC1", y = "PC2") +
theme_minimal() +
theme(text = element_text(size = 10))
The PCA plot shows moderate overlap between the healthy and diseased groups, with PC1 explaining 8.7% and PC2 4.4% of the variance in the numeric features. Features such as BMI and age likely drive what separation exists, reflecting Australia’s obesity and ageing challenges (AIHW, 2023). The low variance captured by the first two components (13.1% in total) indicates complex, higher-dimensional interactions that the classification models will need to capture.
This exploration reveals key patterns: no missing values or outliers ensure data reliability, while higher BMI in diseased individuals reflects Australia’s obesity challenge (AIHW, 2023). The subtle age shift and PCA’s low variance (13.1%) suggest complex demographic and behavioral interactions, critical for classification. These insights, grounded in Australian health disparities, inform modeling strategies.
Logistic regression will be used as the baseline classification model, since the outcome variable ‘target’ is binary (healthy/diseased). The dataset has 100,000 observations with a mix of numeric and categorical features, and the target is moderately imbalanced (70% healthy, 30% diseased), which makes the ROC curve and AUC an appropriate evaluation metric. Logistic regression is efficient and interpretable, but it assumes a linear relationship on the log-odds scale and is sensitive to multicollinearity. Further preprocessing, including standardizing, dummy coding, and removal of near-zero-variance features, is needed before fitting the model.
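A minimal caret sketch of this baseline, assuming the processed data in train_final/test_final and ROC as the selection metric; the fold counts and the explicit factor relevelling are illustrative assumptions rather than the final modelling code.
# Ensure the outcome is a factor; twoClassSummary treats the first level as the event
train_final$target <- factor(train_final$target, levels = c("healthy", "diseased"))
test_final$target <- factor(test_final$target, levels = c("healthy", "diseased"))
# Repeated cross-validation with class probabilities so ROC/AUC can be computed
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3,
                     classProbs = TRUE, summaryFunction = twoClassSummary)
# Baseline logistic regression; the formula interface dummy-codes factors,
# and centring/scaling is handled inside train()
log_fit <- train(target ~ ., data = train_final,
                 method = "glm",
                 metric = "ROC",
                 preProcess = c("center", "scale"),
                 trControl = ctrl)
log_fit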
kNN will be applied to capture nonlinear relationships in the dataset. With 23 numeric predictors, the model classifies observations based on their local neighborhoods, but it is highly sensitive to feature scale, making centering and scaling essential before fitting. Dummy coding is required because the data mix numeric and categorical variables, and the moderate class imbalance (70% healthy, 30% diseased) again makes ROC AUC a reliable evaluation metric. The key hyperparameter is the number of neighbors k, which will be tuned to balance bias and variance.
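A hedged sketch of the kNN tuning step, reusing the ctrl object from the logistic sketch above; the grid of k values is illustrative.
# kNN is distance-based, so centring/scaling is applied inside train();
# tune k over an odd-valued grid to avoid ties in the binary vote
knn_fit <- train(target ~ ., data = train_final,
                 method = "knn",
                 metric = "ROC",
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = seq(5, 51, by = 2)),
                 trControl = ctrl)
plot(knn_fit)  # ROC against k shows the bias-variance trade-off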
Random Forest will be used to model complex nonlinear relationships across the 40 predictors. The model is robust to correlations among predictors, such as the strong correlation between weight and BMI, and handles categorical variables effectively once they are dummy coded. The main hyperparameters to tune are the number of predictors sampled at each split (mtry) and the number of trees. Performance will be evaluated primarily by ROC AUC, supplemented by the confusion matrix, accuracy, and sensitivity.
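A sketch of the random forest fit, assuming caret's method = "rf" (randomForest backend) and reusing ctrl from above; the mtry grid and tree count are illustrative.
# Tune the number of predictors sampled at each split (mtry); ntree is fixed here
rf_fit <- train(target ~ ., data = train_final,
                method = "rf",
                metric = "ROC",
                tuneGrid = expand.grid(mtry = c(2, 5, 10, 20)),
                ntree = 500,
                trControl = ctrl)
# Supplementary class-level checks on the held-out test set
rf_pred <- predict(rf_fit, newdata = test_final)
confusionMatrix(rf_pred, test_final$target)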
LDA will be included as a reference linear classifier. It finds linear combinations of the predictors that best separate the two groups, healthy and diseased. Because the dataset has both numerical (age, income, sleep quality, stress level, physical activity) and categorical predictors (gender, education), preprocessing will include centering, scaling, and converting categorical variables to factors. LDA assumes that predictors are normally distributed within each class and that the covariance matrices are equal between groups. While these assumptions may not hold exactly for this data, LDA remains useful because of its simplicity, speed, and interpretability. Given the moderate class imbalance (approximately 70% healthy to 30% diseased), performance will be measured mainly by ROC-AUC, with balanced accuracy and F1-score as additional measures that reflect the uneven class representation.
A linear SVM will be used to find the separating hyperplane with the greatest margin between healthy and diseased subjects. It handles the high-dimensional feature space produced by encoding the categorical variables well and is relatively insensitive to predictor correlations. Because SVMs depend on the relative scale of the features, the numeric predictors will be standardized before model fitting. The principal hyperparameter is the cost parameter C, which balances margin width against misclassification and must be tuned carefully to avoid underfitting and overfitting. Although less directly interpretable than LDA, the linear SVM can reveal stronger linear separability when the LDA assumptions are violated. Model fit will be judged chiefly by ROC-AUC, with balanced accuracy and confusion matrices reported for detailed class-level performance.
To model more complicated decision boundaries, an SVM with an RBF kernel will also be used. Unlike the linear kernel, the RBF kernel implicitly maps the data into a higher-dimensional space, allowing the model to capture nonlinear associations between demographic and behavioral variables. It is especially valuable for capturing interactive effects, for example the combined impact of physical inactivity and poor sleep quality on health outcomes. Preprocessing will again comprise centering and scaling of all numeric predictors. Two hyperparameters need to be optimized: the cost C, which controls the trade-off between maximizing the margin and minimizing errors, and sigma, which controls the influence of each data point in the transformed feature space. Hyperparameter optimization will be performed with repeated cross-validation. Because of its greater flexibility, the RBF-SVM can potentially achieve higher classification accuracy and AUC, at the expense of interpretability. Performance will be measured chiefly by ROC-AUC, supported by balanced accuracy, precision, recall, and F1-scores for a thorough set of checks.
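A hedged sketch of the kernel SVM tuning via caret and kernlab (method = "svmRadial"), again reusing ctrl; the linear SVM follows the same pattern with method = "svmLinear" and only C tuned. The C and sigma grids are illustrative.
# RBF-kernel SVM: tune cost (C) and kernel width (sigma) with repeated cross-validation
svm_grid <- expand.grid(C = 2^(-2:4), sigma = c(0.001, 0.01, 0.1))
svm_rbf_fit <- train(target ~ ., data = train_final,
                     method = "svmRadial",
                     metric = "ROC",
                     preProcess = c("center", "scale"),
                     tuneGrid = svm_grid,
                     trControl = ctrl)
svm_rbf_fit$bestTune  # selected C and sigma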
Reference
1. Kuhn, Max. “The caret Package: 7 Train Models By Tag.” 27 Mar. 2019, https://topepo.github.io/caret/train-models-by-tag.html.
# Optional correlation matrix (example)
cor_matrix <- cor(df %>% select(where(is.numeric)) %>% na.omit())
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
addCoef.col = "black", tl.col = "black", tl.srt = 45, diag = FALSE)
# Density plot for sleep_hours
ggplot(df, aes(x = sleep_hours, fill = target)) +
geom_density(alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Density of Sleep Hours by Health Status", x = "Sleep Hours", y = "Density") +
theme_minimal() +
theme(text = element_text(size = 10))