Group Project: Health and Lifestyle Classification

4. Explore and Visualize

We explored demographic (age, gender, income, education) and behavioral (sleep quality, stress, physical activity) factors in the dataset to identify patterns distinguishing healthy and diseased individuals, using visualizations and numerical summaries. These analyses highlight key predictors for classification and reflect health disparities in Australia, such as socioeconomic influences on chronic disease risk. We examined missing values, outliers, distributions, and high-dimensional patterns to inform modeling.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(corrplot)

## corrplot 0.95 loaded

library(ggfortify)
library(RColorBrewer)
library(knitr)

# Set working directory to the folder containing the dataset
setwd("D:/OneDrive - Manipal University Jaipur/Desktop/stat")
# Load the merged dataset
df <- read_csv("merged_processed.csv")

## Rows: 100000 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (23): age, height, weight, bmi_corrected, waist_size, blood_pressure, he...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Quick check: dimensions and target distribution
dim(df)  # Should show ~100,000 rows, 40+ columns

## [1] 100000     41

table(df$target)  # Check class balance (healthy vs. diseased)

## 
## diseased  healthy 
##    29903    70097
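
The counts above imply a split of roughly 30/70 between the two classes, which matters for every later comparison of raw counts. As a minimal sketch, the proportions can be computed directly. The `counts` vector below re-enters the two numbers printed above rather than reading the dataset, and `baseline_acc` is a name introduced here for illustration only.

```r
# Class counts copied from the table(df$target) output above
counts <- c(diseased = 29903, healthy = 70097)

# Share of each class, in percent
props <- round(100 * counts / sum(counts), 1)
print(props)
# diseased  healthy
#     29.9     70.1

# Accuracy of a trivial model that always predicts the majority class
baseline_acc <- max(counts) / sum(counts)
print(round(baseline_acc, 3))  # 0.701
```

Any classifier built later would need to beat this majority-class baseline to be useful.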

Missing Values

# Calculate missing percentage
missing_percent <- colMeans(is.na(df)) * 100
missing_df <- data.frame(Column = names(missing_percent), Percent_Missing = missing_percent) %>%
  filter(Percent_Missing > 0) %>%
  arrange(desc(Percent_Missing))

# Table
kable(missing_df, caption = "Columns with Missing Values (%)", digits = 2)

Columns with Missing Values (%)
Column	Percent_Missing

# Bar plot, drawn only if at least one column has missing data
if (nrow(missing_df) > 0) {
  ggplot(missing_df, aes(x = reorder(Column, Percent_Missing), y = Percent_Missing)) +
    geom_bar(stat = "identity", fill = brewer.pal(3, "Set2")[1]) +
    coord_flip() +
    labs(title = "Percentage of Missing Values", x = "Feature", y = "% Missing") +
    theme_minimal() +
    theme(text = element_text(size = 10))
}

After cleaning, no column contains missing values: the table above is empty, and the bar plot is skipped because missing_df has zero rows. This confirms that the cleaning step carried out by teammates handled all gaps, so no further imputation is needed before modeling.
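
A completeness check like this can also be written as an assertion, so that a later re-merge that introduces gaps fails immediately. The sketch below runs on a small hypothetical data frame (`toy`), not on the project data. The column names are borrowed from the dataset, but the values are invented.

```r
# Hypothetical toy data frame standing in for the cleaned dataset
toy <- data.frame(
  age = c(34, 51, 47),
  bmi_corrected = c(22.5, 31.0, 27.4),
  target = c("healthy", "diseased", "healthy")
)

# Number of missing values per column; all zeros means the data are complete
na_counts <- colSums(is.na(toy))
print(na_counts)

# Stop with an error if any column contains a missing value
stopifnot(all(na_counts == 0))
```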

Outliers

# Function to count outliers
count_outliers <- function(x) {
  iqr <- IQR(x, na.rm = TRUE)
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

# Summarize outliers by target
outlier_summary <- df %>%
  group_by(target) %>%
  summarise(Age_Outliers = count_outliers(age), BMI_Outliers = count_outliers(bmi_corrected))
kable(outlier_summary, caption = "Number of Outliers by Health Status")

Number of Outliers by Health Status
target	Age_Outliers	BMI_Outliers
diseased	0	0
healthy	0	0

# Boxplot for BMI
ggplot(df, aes(x = target, y = bmi_corrected, fill = target)) +
  geom_boxplot(alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "BMI by Health Status", x = "Health Status", y = "BMI") +
  theme_minimal() +
  theme(text = element_text(size = 10))

No outliers were detected in age or bmi_corrected under the 1.5 × IQR rule, applied separately within each health-status group, which indicates that extreme values were already handled during cleaning. The boxplot shows a higher median BMI in the diseased group than in the healthy group. This is consistent with obesity being a key risk factor for chronic disease in Australia, where the obesity rate is 31% (AIHW, 2023). BMI is therefore a strong candidate predictor of health status, and it is relevant to the socioeconomic health disparities considered in this project.
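
To make the 1.5 × IQR rule concrete, the sketch below applies it to ten hypothetical values. The helper `iqr_fences()` is introduced here for illustration and is not part of the project code. It uses the same quartiles as count_outliers() but returns the two cut-offs instead of a count.

```r
# Same 1.5 * IQR rule as count_outliers(), but returning the fences themselves
iqr_fences <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Hypothetical values: nine ordinary points and one extreme one
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

fences <- iqr_fences(x)
print(fences)  # lower = -3.5, upper = 14.5

# Values outside the fences are flagged as outliers
flagged <- x[x < fences["lower"] | x > fences["upper"]]
print(flagged)  # 50
```

Here the quartiles are 3.25 and 7.75, so the IQR is 4.5 and only the value 50 falls outside the fences.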

Distributions

# Summary stats for age by target
summary_stats <- df %>%
  group_by(target) %>%
  summarise(Mean_Age = mean(age, na.rm = TRUE), Median_Age = median(age, na.rm = TRUE))
kable(summary_stats, caption = "Age Statistics by Health Status", digits = 1)

Age Statistics by Health Status
target	Mean_Age	Median_Age
diseased	48.7	49
healthy	48.4	48

# Histogram for age
ggplot(df, aes(x = age, fill = target)) +
  geom_histogram(position = "dodge", bins = 20, alpha = 0.7) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Age Distribution by Health Status", x = "Age (Years)", y = "Count") +
  theme_minimal() +
  theme(text = element_text(size = 10))

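The summary statistics show that diseased individuals are only marginally older (mean = 48.7, median = 49) than healthy individuals (mean = 48.4, median = 48), a difference of 0.3 years in the mean. The histogram suggests that healthy individuals are relatively more common at younger ages (20-40) and diseased individuals slightly more common at older ages (60-80). Because the healthy class is about 2.3 times larger than the diseased class, raw bar heights should be compared with care. The direction of the pattern is consistent with the higher chronic disease risk in Australia’s aging population (AIHW, 2023), but its size suggests that age on its own is a weak predictor for classification.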
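
With unequal class sizes, a count histogram makes the larger class look dominant in every age range. One way to compare the shapes of the two distributions is to look at within-class proportions per age band. The sketch below uses simulated data (`toy`) with a 700/300 split, not the project dataset, and the band edges are illustrative choices.

```r
# Hypothetical simulated data, NOT the project dataset: 700 healthy, 300 diseased
set.seed(1)
toy <- data.frame(
  age = c(round(runif(700, 18, 80)), round(runif(300, 18, 80))),
  target = rep(c("healthy", "diseased"), times = c(700, 300))
)

# Bin age into bands (band edges are illustrative choices)
toy$age_band <- cut(toy$age, breaks = c(17, 40, 60, 80),
                    labels = c("18-40", "41-60", "61-80"))

# Column-wise proportions: each class sums to 1, so the classes are comparable
band_share <- prop.table(table(toy$age_band, toy$target), margin = 2)
print(round(band_share, 3))
```

In ggplot2, mapping y to `after_stat(density)` in geom_histogram should give a comparable per-class normalization.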

High-Dimensional Patterns

# Select numeric columns and keep only complete rows
num_all <- df %>% select(where(is.numeric))
keep <- complete.cases(num_all)  # row mask, reused below to align target
num_df <- num_all[keep, ]

# PCA on standardized features
pca_res <- prcomp(num_df, scale. = TRUE)

# Variance explained
var_explained <- summary(pca_res)$importance[2, 1:2] * 100
var_text <- paste("PC1:", round(var_explained[1], 1), "%, PC2:", round(var_explained[2], 1), "%")

# Plot (target is subset with the same row mask used for the PCA input)
pca_df <- data.frame(PC1 = pca_res$x[,1], PC2 = pca_res$x[,2], target = df$target[keep])
ggplot(pca_df, aes(x = PC1, y = PC2, color = target)) +
  geom_point(alpha = 0.6, size = 1.5) +
  scale_color_brewer(palette = "Set2") +
  labs(title = "PCA Plot of Health Data", subtitle = var_text, x = "PC1", y = "PC2") +
  theme_minimal() +
  theme(text = element_text(size = 10))

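The PCA plot shows moderate overlap between the healthy and diseased groups. PC1 (8.7%) and PC2 (4.4%) together explain only 13.1% of the variance in the 23 numeric features that enter the PCA (the 18 character columns are excluded). PC2's share is barely above the 100/23 ≈ 4.3% that a single standardized feature would contribute if the features were uncorrelated, so a two-dimensional projection hides most of the variation. Features such as BMI and age may contribute to the separation, in line with Australia’s obesity and aging challenges (AIHW, 2023), but the loadings were not inspected here, so this remains a hypothesis. Overall, the result suggests that classification will depend on combinations of many features rather than on one or two dominant directions.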
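
Which features drive a principal component can be read from the loadings, and how many components are needed can be read from the cumulative variance. The sketch below shows both steps on the built-in mtcars dataset as a stand-in, since the project data are not reproduced here. The names `pca_demo`, `var_share`, `cum_share`, `uniform_share` and `pc1_loadings` are introduced for illustration.

```r
# Stand-in data: the built-in mtcars dataset (11 numeric columns), NOT the project data
pca_demo <- prcomp(mtcars, scale. = TRUE)

# Share of variance per component, and the running total
var_share <- pca_demo$sdev^2 / sum(pca_demo$sdev^2)
cum_share <- cumsum(var_share)
print(round(100 * cum_share, 1))

# Share each component would have if the features were uncorrelated
uniform_share <- 1 / ncol(mtcars)
print(round(100 * uniform_share, 1))

# Features with the largest absolute loadings on PC1
pc1_loadings <- sort(abs(pca_demo$rotation[, "PC1"]), decreasing = TRUE)
print(head(pc1_loadings, 3))
```

Applied to pca_res, the same steps would show whether BMI and age actually carry large loadings on PC1 and PC2.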

Conclusion

This exploration yields three findings. First, the cleaned data contain no missing values and no IQR-rule outliers in age or BMI, so they are reliable for modeling. Second, median BMI is higher in diseased individuals, consistent with Australia’s obesity challenge (AIHW, 2023). Third, the age shift between groups is small and the first two principal components capture only 13.1% of the variance, which suggests that classification will rely on interactions among many demographic and behavioral features. These insights, interpreted against Australian health disparities, inform the modeling strategy.
