Project Report: Week 7 - Exploratory Data Analysis & Project Planning for Australian Health Disparities

1. Executive Summary

This report outlines the project plan and preliminary exploratory data analysis (EDA) for a binary classification task. The primary objective is to develop a predictive model that can classify individuals as either ‘healthy’ or ‘diseased’ based on a comprehensive dataset of demographic and behavioral factors. A secondary but equally critical aim is to identify how these factors may contribute to health disparities, a significant public health issue in Australia. The methodology employed begins with a thorough data quality assessment and cleaning process, followed by an in-depth EDA to uncover relationships between variables and the target health status.

Initial findings from the dataset suggest that ….

The research question is: Can demographic (e.g., age, gender, income, education) and behavioral factors (e.g., sleep quality, stress level, physical activity) classify individuals as healthy or diseased, and how do these factors reflect health disparities in Australia’s diverse population?

This is a binary classification problem, where the outcome variable ‘target’ is categorical (healthy/diseased), using supervised learning to predict classes based on features.

This matters because lifestyle-related diseases such as obesity and diabetes contribute to 31% of Australia’s disease burden (AIHW, 2023), particularly in urban areas such as Sydney where stress and screen time are high. Identifying the strongest predictors can inform preventive strategies aimed at narrowing socioeconomic health gaps.

2. Problem Definition: Classifying Health Status and Disparities

The research problem centers on the ability of demographic and behavioral factors to classify individuals as ‘healthy’ or ‘diseased’ and to explore how these factors manifest as health disparities within a diverse population. The significance of this problem extends beyond a simple predictive task. Public health bodies and policymakers in Australia and globally are increasingly focused on understanding the social determinants of health to design more effective, equitable, and preventative health strategies. Identifying which factors most strongly correlate with disease status provides crucial insights for targeted public health campaigns and resource allocation.

The core of the problem is a supervised learning task. The goal is to predict the binary target variable, target, which is explicitly labeled as either healthy or diseased in the provided dataset. The predictive features, or the independent variables, span a wide range of data points. These include demographic information such as age, gender, education_level, and income; behavioral and lifestyle factors like sleep_quality, stress_level, and physical_activity; and various clinical and physical measurements. The complexity arises from the number and diversity of these features, which may have linear, non-linear, or interactive relationships with the target variable. A key objective of this project is not merely to achieve high predictive accuracy but also to interpret the model’s findings to better understand the mechanisms behind health outcomes and potential disparities. This requires a meticulous and nuanced approach to data analysis and model selection, ensuring that the insights derived are both statistically sound and clinically meaningful.

3. Data Description and Source Analysis

The dataset is sourced from an aggregated health survey repository by the Australian Institute of Health and Welfare (AIHW), originally from clinical and lifestyle studies (verified via metadata, not Kaggle copy).

The raw file contains 100,000 samples and 48 columns; after cleaning (Section 4), 41 variables remain (23 numeric, e.g., age, bmi_corrected; 18 categorical, e.g., gender, sleep_quality). The outcome variable ‘target’ is binary (healthy/diseased), with an approximate 70/30 healthy-to-diseased imbalance.

Challenges: high dimensionality (41 retained features), potential class imbalance biasing models, and original messiness (missing values that must be imputed during cleaning). The dataset therefore meets three selection criteria: large (>10,000 samples), complex (mixed numeric and categorical types), and messy (missingness before cleaning).

# Load the necessary libraries
library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Load the dataset
health_data <- read_csv("D:/OneDrive - Manipal University Jaipur/Desktop/stat/health_lifestyle_classification.csv")
## Rows: 100000 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (30): survey_code, age, height, weight, bmi, bmi_estimated, bmi_scaled, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Identify and address missing values
missing_values <- colSums(is.na(health_data))
print("Missing values per column:")
## [1] "Missing values per column:"
print(missing_values[missing_values > 0])
##      blood_pressure          heart_rate             insulin         daily_steps 
##                7669               14003               15836                8329 
## alcohol_consumption              income    gene_marker_flag 
##               13910                8470               10474

4. Data Cleaning and Preparation

We performed data wrangling to make the dataset usable: formatted columns (e.g., numeric types), removed duplicates (none found), imputed missing values (median for numeric variables such as insulin, mode for categorical), and handled outliers by winsorizing numeric features at train-derived 1st/99th percentiles.

Challenges addressed: original missingness in variables such as insulin, heart_rate, and alcohol_consumption (imputed with train-derived values to avoid leakage and bias); the high dimensionality was managed by dropping non-predictive, near-zero-variance, and highly correlated columns, leaving 41 variables for exploration.

High-level steps

  1. Load data & libraries.
library(tidyverse)
library(janitor)
library(naniar)       # visualise missingness
library(caret)        # createDataPartition, nearZeroVar, findCorrelation
library(fastDummies)  # quick dummy creation
set.seed(123)         # reproducibility
raw <- read_csv("D:/OneDrive - Manipal University Jaipur/Desktop/stat/health_lifestyle_classification.csv") %>% 
  clean_names() 
## Rows: 100000 Columns: 48
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (30): survey_code, age, height, weight, bmi, bmi_estimated, bmi_scaled, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
  2. Remove obvious non-predictive columns. IDs and timestamps usually carry no predictive information and can confuse some pipelines.
raw <- raw %>% select(-survey_code)
  3. Stratified train/test split — performed before fitting imputers/scalers. Splitting first avoids leakage from global imputations or scaling, and stratifying preserves the class ratio (a sketch of the split chunk follows).
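The chunk that produced the class proportions below was not rendered; a minimal sketch using caret::createDataPartition, assuming a hypothetical 75/25 split (the actual proportion may differ):
# Stratified split sketch: factor levels ordered to match the printed output
raw <- raw %>% mutate(target = factor(target, levels = c("healthy", "diseased")))
train_idx <- createDataPartition(raw$target, p = 0.75, list = FALSE)[, 1]
train <- raw[train_idx, ]
test  <- raw[-train_idx, ]
prop.table(table(train$target))
prop.table(table(test$target))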
## 
##   healthy  diseased 
## 0.7009662 0.2990338
## 
##  healthy diseased 
## 0.700985 0.299015
  4. Missingness analysis (on the training set)
# Missing percentages in training
train_miss <- train %>% summarise(across(everything(), ~ mean(is.na(.x)))) %>% pivot_longer(everything(), names_to="var", values_to="miss_pct") %>% arrange(desc(miss_pct))
train_miss %>% filter(miss_pct > 0) %>% print(n=Inf)
## # A tibble: 7 × 2
##   var                 miss_pct
##   <chr>                  <dbl>
## 1 insulin               0.158 
## 2 heart_rate            0.139 
## 3 alcohol_consumption   0.139 
## 4 gene_marker_flag      0.104 
## 5 income                0.0851
## 6 daily_steps           0.0838
## 7 blood_pressure        0.0764

We decided to drop any variable with more than 50% missing values; according to the analysis above, no column exceeds this threshold, so nothing is removed at this step. A minimal sketch of how the rule could be applied is shown below.
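This is a sketch only, reusing train_miss from the chunk above (it drops nothing for this dataset):
# Drop predictors with >50% missing in the training set (none qualify here)
drop_vars <- train_miss %>% filter(miss_pct > 0.5) %>% pull(var)
train <- train %>% select(-all_of(drop_vars))
test  <- test  %>% select(-all_of(drop_vars))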

  5. Type conversion: identify numeric vs categorical variables & ordinal treatment
# Identify numeric and categorical columns (exclude target)
predictor_names <- setdiff(names(train), "target")

num_vars  <- train %>% select(all_of(predictor_names)) %>% select(where(is.numeric)) %>% names()
cat_vars  <- train %>% select(all_of(predictor_names)) %>% select(where(~is.character(.x) | is.factor(.x))) %>% names()

# convert character categorical to factor (use training levels later for test)
train <- train %>% mutate(across(all_of(cat_vars), ~ as.factor(.x)))
for (v in cat_vars) {
  test[[v]] <- factor(test[[v]], levels = levels(train[[v]]))    # unseen levels will be NA - handled next
}
  6. Imputation — train-derived rules (median for numeric; mode for categorical).
# Helpers --- generated by CHATGPT
get_mode <- function(x) {
  ux <- na.omit(unique(x))
  if(length(ux)==0) return(NA)
  ux[which.max(tabulate(match(x, ux)))]
}

# Numeric medians computed on training
num_medians <- sapply(train[num_vars], function(x) median(x, na.rm = TRUE))

# Categorical modes computed on training
cat_modes <- sapply(train[cat_vars], get_mode)

# Apply imputations: replace NA in both train and test using train-derived values
# Impute numerics
for (v in num_vars) {
  train[[v]] <- replace(train[[v]], is.na(train[[v]]), num_medians[[v]])
  test[[v]]  <- replace(test[[v]],  is.na(test[[v]]),  num_medians[[v]])
}

# Impute categoricals
for (v in cat_vars) {
  # train imputation
  train[[v]] <- as.character(train[[v]])
  train[[v]][is.na(train[[v]])] <- cat_modes[[v]]
  train[[v]] <- factor(train[[v]])

  # test imputation
  test[[v]] <- as.character(test[[v]])
  test[[v]][is.na(test[[v]])] <- cat_modes[[v]]

  # unseen categories → "Other"
  unseen <- setdiff(unique(test[[v]]), levels(train[[v]]))
  test[[v]][test[[v]] %in% unseen] <- "Other"

  test[[v]] <- factor(test[[v]], levels = c(levels(train[[v]]), "Other"))
}
  7. Outliers: winsorize numeric features using train thresholds (this part of the code was generated by AI). Winsorizing caps extreme values that would otherwise distort scaling and unduly influence linear models; using train-derived thresholds avoids test leakage.
winsorize_train_thresholds <- map(num_vars, ~ {
  v <- .x
  qs <- quantile(train[[v]], probs = c(0.01, 0.99), na.rm = TRUE)
  list(low = qs[1], high = qs[2])
})
names(winsorize_train_thresholds) <- num_vars

winsorize_vec <- function(x, low, high) {
  x[x < low]  <- low
  x[x > high] <- high
  x
}

# Apply winsorization on training and test using train thresholds
for (v in num_vars) {
  thr <- winsorize_train_thresholds[[v]]
  train[[v]] <- winsorize_vec(train[[v]], thr$low, thr$high)
  test[[v]]  <- winsorize_vec(test[[v]], thr$low, thr$high)
}
  8. Remove near-zero-variance predictors and highly correlated numeric predictors. Retaining highly correlated predictors would introduce multicollinearity.
# Near-zero variance (training)
nzv_info <- caret::nearZeroVar(train %>% select(-target), saveMetrics = TRUE)
nzv_cols <- rownames(nzv_info)[nzv_info$nzv]
nzv_cols
## [1] "electrolyte_level"        "gene_marker_flag"        
## [3] "environmental_risk_score"
# Drop NZV cols from both sets
train <- train %>% select(-all_of(nzv_cols))
test  <- test  %>% select(-all_of(nzv_cols))

# Update numeric list after removals
num_vars <- intersect(num_vars, names(train))

# High correlation removal (training numeric only)
if (length(num_vars) >= 2) {
  cor_mat <- cor(train %>% select(all_of(num_vars)), use = "pairwise.complete.obs")
  highCorr_idx <- caret::findCorrelation(cor_mat, cutoff = 0.90)  # threshold can be 0.85-0.95 based on preference
  highCorr_cols <- colnames(cor_mat)[highCorr_idx]
  highCorr_cols
  # Drop them
  train <- train %>% select(-all_of(highCorr_cols))
  test  <- test  %>% select(-all_of(highCorr_cols))
  # update num_vars
  num_vars <- setdiff(num_vars, highCorr_cols)
}
# ensure train/test factor alignment (only for categorical columns still present)
cat_vars <- intersect(cat_vars, names(train))
for (v in cat_vars) {
  test[[v]] <- factor(test[[v]], levels = levels(train[[v]]))
}

train_final <- train
test_final  <- test
  9. Final check and output the modified dataset
# Confirm no NAs remain in predictors
sum(is.na(train_final))
## [1] 0
sum(is.na(test_final))
## [1] 0
# Confirm target still exists in train_final/test_final
if(!"target" %in% names(train_final)) stop("Target missing from train_final!")

# Combine the processed train/test sets for visualization in the next section
data_final <- rbind(train_final, test_final)

# Optionally export CSVs for modelling stage
write_csv(data_final, "merged_processed.csv")
write_csv(train_final, "train_processed.csv")
write_csv(test_final,  "test_processed.csv")

5. Explore and Visualize Data

We explored demographic (age, gender, income, education) and behavioral (sleep quality, stress, physical activity) factors in the dataset to identify patterns distinguishing healthy and diseased individuals, using visualizations and numerical summaries. These analyses highlight key predictors for classification and reflect health disparities in Australia, such as socioeconomic influences on chronic disease risk. We examined missing values, outliers, distributions, and high-dimensional patterns to inform modeling.

library(readr)
library(dplyr)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(ggfortify)
library(RColorBrewer)
library(knitr)
# Load the merged dataset
df <- read_csv("merged_processed.csv")
## Rows: 100000 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (23): age, height, weight, bmi_corrected, waist_size, blood_pressure, he...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check: dimensions and target distribution
dim(df)  # Should show ~100,000 rows, 40+ columns
## [1] 100000     41
table(df$target)  # Check class balance (healthy vs. diseased)
## 
## diseased  healthy 
##    29903    70097

Missing Values

# Calculate missing percentage
missing_percent <- colMeans(is.na(df)) * 100
missing_df <- data.frame(Column = names(missing_percent), Percent_Missing = missing_percent) %>%
  filter(Percent_Missing > 0) %>%
  arrange(desc(Percent_Missing))

# Table
kable(missing_df, caption = "Columns with Missing Values (%)", digits = 2)
Columns with Missing Values (%)

  Column   Percent_Missing
  (no rows: no missing values remain after cleaning)
# Bar plot if any columns have missing data
if (nrow(missing_df) > 0) {
  ggplot(missing_df, aes(x = reorder(Column, Percent_Missing), y = Percent_Missing)) +
    geom_bar(stat = "identity", fill = brewer.pal(3, "Set2")[1]) +
    coord_flip() +
    labs(title = "Percentage of Missing Values", x = "Feature", y = "% Missing") +
    theme_minimal() +
    theme(text = element_text(size = 10))
}

Post-processing, no missing values were found, confirming that the earlier cleaning stage was effective and that the data are reliable for modeling.

Outliers

# Function to count outliers
count_outliers <- function(x) {
  iqr <- IQR(x, na.rm = TRUE)
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

# Summarize outliers by target
outlier_summary <- df %>%
  group_by(target) %>%
  summarise(Age_Outliers = count_outliers(age), BMI_Outliers = count_outliers(bmi_corrected))
kable(outlier_summary, caption = "Number of Outliers by Health Status")
Number of Outliers by Health Status

  target     Age_Outliers   BMI_Outliers
  diseased              0              0
  healthy               0              0
# Boxplot for BMI
ggplot(df, aes(x = target, y = bmi_corrected, fill = target)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "BMI by Health Status", x = "Health Status", y = "BMI") +
  theme_minimal() +
  theme(text = element_text(size = 10))

No outliers were detected in age or bmi_corrected post-cleaning, as expected after winsorizing numeric features at the 1st/99th percentiles. The boxplot shows a higher median BMI in the diseased group than in the healthy group, consistent with Australia’s 31% adult obesity rate (AIHW, 2023), a key risk factor for chronic disease. This highlights BMI as an important predictor of health status, with implications for socioeconomic health disparities in Australia.
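As a numerical check on the boxplot reading, the group-wise BMI summary can be tabulated (a sketch using the columns already loaded above):
# Median and mean BMI by health status, to back up the boxplot interpretation
df %>%
  group_by(target) %>%
  summarise(Median_BMI = median(bmi_corrected),
            Mean_BMI = mean(bmi_corrected)) %>%
  kable(caption = "BMI Summary by Health Status", digits = 1)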

Distributions

# Summary stats for age by target
summary_stats <- df %>%
  group_by(target) %>%
  summarise(Mean_Age = mean(age, na.rm = TRUE), Median_Age = median(age, na.rm = TRUE))
kable(summary_stats, caption = "Age Statistics by Health Status", digits = 1)
Age Statistics by Health Status

  target     Mean_Age   Median_Age
  diseased       48.7           49
  healthy        48.4           48
# Histogram for age
ggplot(df, aes(x = age, fill = target)) +
  geom_histogram(position = "dodge", bins = 20, alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Age Distribution by Health Status", x = "Age (Years)", y = "Count") +
  theme_minimal() +
  theme(text = element_text(size = 10))

The age histogram shows that diseased individuals are marginally older (mean = 48.7, median = 49) than healthy individuals (mean = 48.4, median = 48), with the healthy group dominating younger ages (20-40) and the diseased group slightly over-represented at older ages (60-80). This subtle pattern aligns with the higher chronic disease risk in Australia’s aging population (AIHW, 2023) and suggests age is only a modest predictor for classification, reflecting socioeconomic and health-access disparities.

High-Dimensional Patterns

# Select numeric columns; track complete rows so the target stays aligned
num_cols <- df %>% select(where(is.numeric))
keep <- complete.cases(num_cols)   # all TRUE here, since no NAs remain post-cleaning
num_df <- num_cols[keep, ]

# PCA
pca_res <- prcomp(num_df, scale. = TRUE)

# Variance explained
var_explained <- summary(pca_res)$importance[2, 1:2] * 100
var_text <- paste("PC1:", round(var_explained[1], 1), "%, PC2:", round(var_explained[2], 1), "%")

# Plot
pca_df <- data.frame(PC1 = pca_res$x[,1], PC2 = pca_res$x[,2], target = df$target[keep])
ggplot(pca_df, aes(x = PC1, y = PC2, color = target)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_brewer(palette = "Set2") +
  labs(title = "PCA Plot of Health Data", subtitle = var_text, x = "PC1", y = "PC2") +
  theme_minimal() +
  theme(text = element_text(size = 10))

The PCA plot shows moderate overlap between the healthy and diseased groups, with PC1 (8.7%) and PC2 (4.4%) together explaining only 13.1% of the variance across the 23 numeric features. Features such as BMI and age may drive what separation exists, reflecting Australia’s obesity and aging challenges (AIHW, 2023). The low variance captured by the first two components indicates complex interactions that the classification models will need to handle.
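To verify which numeric features actually drive the first two components, the PCA loadings can be inspected; the suggestion that BMI and age dominate is an assumption to be checked, not a result:
# Ten largest absolute loadings on PC1 (with their PC2 loadings)
loadings <- pca_res$rotation[, 1:2]
head(loadings[order(abs(loadings[, "PC1"]), decreasing = TRUE), ], 10)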

Conclusion

This exploration reveals key patterns: the absence of missing values and outliers confirms data reliability, while the higher BMI among diseased individuals reflects Australia’s obesity challenge (AIHW, 2023). The subtle age shift and the low variance explained by the first two principal components (13.1%) suggest complex demographic and behavioral interactions, which will be critical for classification. These insights, grounded in Australian health disparities, inform the modeling strategies below.

6. Modelling Plan

Logistic regression

Logistic regression will be used as the baseline classification model, since the target variable is binary with two categories: healthy and diseased. The dataset has 100,000 observations with a mix of numeric and categorical features, and the target is moderately imbalanced (70% healthy, 30% diseased), which makes ROC curves and AUC an appropriate evaluation metric. Logistic regression is efficient and interpretable, but it assumes a linear relationship between the predictors and the log-odds of the outcome and is sensitive to multicollinearity. Further preprocessing of the dataset, including standardization, dummy coding, and removal of near-zero-variance features, is therefore needed before fitting the model.
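A minimal caret sketch of the baseline fit, assuming the processed training file train_processed.csv produced in Section 4 and a factor target; the resampling settings are illustrative, not final.
library(caret)
train_final <- read_csv("train_processed.csv") %>%
  mutate(target = factor(target, levels = c("diseased", "healthy")))

# Shared resampling setup: 5-fold CV with ROC as the selection metric
ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# Baseline logistic regression (the formula interface dummy codes factor predictors)
glm_fit <- train(target ~ ., data = train_final, method = "glm",
                 metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"))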

k-Nearest Neighbors (kNN)

kNN will be applied to capture nonlinear relationships in the dataset. With 23 numeric predictors, this model classifies observations based on their local neighborhoods, but it is highly sensitive to feature scale, so centering and scaling are important before fitting. Dummy coding is required because the data contain both numeric and categorical variables, and the moderate class imbalance (70% healthy, 30% diseased) again makes ROC AUC a reliable evaluation metric. The key hyperparameter is the number of neighbors k, which will be tuned to balance bias and variance.
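A sketch of the kNN fit, reusing ctrl and train_final from the logistic regression sketch above; the grid of k values is illustrative.
# kNN: center/scale the features and tune k over a small odd-valued grid
knn_fit <- train(target ~ ., data = train_final, method = "knn",
                 metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"),
                 tuneGrid = data.frame(k = seq(5, 51, by = 10)))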

Random Forest

Random Forest will be used to model complex nonlinear relationships across the 40 predictors. The model is robust to correlated predictors, such as the strong correlation between weight and BMI, and handles categorical variables effectively once they are dummy coded. The main hyperparameters to tune are the number of predictors sampled at each split and the number of trees. Model performance will be evaluated primarily by ROC AUC, with the confusion matrix, accuracy, and sensitivity reported as supplementary metrics.
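A sketch of the random forest fit under the same setup; the mtry and ntree values are illustrative and will be refined.
# Random forest: tune mtry (predictors sampled per split); ntree kept modest for speed
rf_fit <- train(target ~ ., data = train_final, method = "rf",
                metric = "ROC", trControl = ctrl,
                tuneGrid = expand.grid(mtry = c(5, 10, 20)),
                ntree = 300)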

Linear Discriminant Analysis (LDA)

LDA will be included as a reference linear classifier. It finds linear combinations of the predictor variables that best separate the two groups, healthy and diseased. Because the dataset has both numerical (age, income, sleep quality, stress level, physical activity) and categorical predictors (gender, education), preprocessing steps of centering, scaling, and converting categorical variables to factors will be required. The procedure assumes that predictors are normally distributed within each class and that the covariance matrices are equal between groups. While these assumptions may not hold exactly for this data, LDA remains useful by virtue of its simplicity, speed, and interpretability. Because of the moderate outcome imbalance (approximately 70% healthy to 30% diseased), model performance will be measured mainly by ROC AUC, with balanced accuracy and the F1-score reported to reflect the uneven class representation.
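A sketch of the LDA reference fit (MASS backend via caret), reusing the same resampling control:
# LDA has no tuning parameters; center/scale the predictors before fitting
lda_fit <- train(target ~ ., data = train_final, method = "lda",
                 metric = "ROC", trControl = ctrl,
                 preProcess = c("center", "scale"))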

Support Vector Machine – Linear Kernel

A linear SVM will be used to find the separating hyperplane with the greatest margin between healthy and diseased subjects. It is particularly useful for the high-dimensional feature set produced by encoding categorical variables and is relatively unaffected by predictor correlations. Because SVMs depend on the relative scale of the features, the numeric predictors will be standardized before fitting. The principal hyperparameter is the cost parameter (C), which balances margin width against misclassification and must be tuned carefully to avoid underfitting and overfitting. Although less directly interpretable than LDA, the linear SVM can reveal stronger linear separability when LDA’s assumptions are violated. Model performance will be assessed chiefly by ROC AUC, with balanced accuracy and confusion matrices reported to document class-level performance.
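A sketch of the linear SVM (kernlab backend via caret); the grid of C values is illustrative.
# Linear SVM: standardize features and tune the cost parameter C
svm_lin_fit <- train(target ~ ., data = train_final, method = "svmLinear",
                     metric = "ROC", trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(C = c(0.25, 0.5, 1, 2)))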

Support Vector Machine – Radial Basis Function (RBF) Kernel

To model more complicated decision boundaries, an SVM with an RBF kernel will also be used. Unlike the linear kernel, the RBF kernel implicitly maps the data into a higher-dimensional space, enabling the model to capture nonlinear associations between demographic and behavioral variables. It is especially valuable for capturing interaction effects, for example the combined impact of physical inactivity and poor sleep quality on health outcomes. Preprocessing will again comprise centering and scaling of all numerical predictors. Two hyperparameters need to be optimized: the cost (C), which controls the trade-off between maximizing the margin and minimizing errors, and sigma, which controls the influence of each data point in the transformed feature space; both will be tuned via repeated cross-validation. Because of its greater flexibility, the RBF SVM can potentially attain higher classification accuracy and AUC, albeit at the expense of interpretability. Performance will be measured chiefly by ROC AUC, supported by balanced accuracy, precision, recall, and F1-scores to give a thorough evaluation.
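A sketch of the RBF SVM with a joint grid over sigma and C (values are illustrative); repeated cross-validation as described above would replace the simple CV control used in the earlier sketches.
# RBF SVM: tune sigma (kernel width) and C (cost) together
svm_rbf_fit <- train(target ~ ., data = train_final, method = "svmRadial",
                     metric = "ROC", trControl = ctrl,
                     preProcess = c("center", "scale"),
                     tuneGrid = expand.grid(sigma = c(0.005, 0.01, 0.05),
                                            C = c(0.5, 1, 2)))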

7. Appendix

Reference 1. Kuhn, Max. “The caret Package: 7 Train Models By Tag.” 27 Mar. 2019, topepo.github.io/caret/train-models-by-tag.html.

# Optional correlation matrix (example)
cor_matrix <- cor(df %>% select(where(is.numeric)) %>% na.omit())
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
         addCoef.col = "black", tl.col = "black", tl.srt = 45, diag = FALSE)

# Density plot for sleep_hours
ggplot(df, aes(x = sleep_hours, fill = target)) +
  geom_density(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Density of Sleep Hours by Health Status", x = "Sleep Hours", y = "Density") +
  theme_minimal() +
  theme(text = element_text(size = 10))