[Placeholder for teammates]
[Placeholder]
[Placeholder]
We explored demographic (age, gender, income, education) and behavioral (sleep quality, stress, physical activity) factors in the dataset to identify patterns that distinguish healthy from diseased individuals, using visualizations and numerical summaries. The aim is to flag candidate predictors for classification and to relate them to known health disparities in Australia, such as socioeconomic influences on chronic disease risk. We examined missing values, outliers, distributions, and high-dimensional structure to inform modeling.
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(ggfortify)
library(RColorBrewer)
library(knitr)
# Set working directory to the folder containing the dataset
setwd("D:/OneDrive - Manipal University Jaipur/Desktop/stat")
# Load the merged dataset
df <- read_csv("merged_processed.csv")
## Rows: 100000 Columns: 41
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (18): gender, sleep_quality, alcohol_consumption, smoking_level, mental_...
## dbl (23): age, height, weight, bmi_corrected, waist_size, blood_pressure, he...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Quick check: dimensions and target distribution
dim(df) # Should show ~100,000 rows, 40+ columns
## [1] 100000 41
table(df$target) # Check class balance (healthy vs. diseased)
##
## diseased healthy
## 29903 70097
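The counts above correspond to roughly 30% diseased and 70% healthy, so a classifier that always predicts "healthy" would already reach about 70% accuracy. The short sketch below reproduces that arithmetic from the printed counts; the variable names are illustrative and not part of the original analysis.

```r
# Class counts copied from the table(df$target) output above
class_counts <- c(diseased = 29903, healthy = 70097)

# Share of each class and the accuracy of always predicting the majority class
class_share <- class_counts / sum(class_counts)
majority_baseline <- max(class_share)

print(round(100 * class_share, 1))
print(majority_baseline)
```

Any model evaluated later is more informative when compared against this majority-class baseline than against 50%.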
# Calculate missing percentage
missing_percent <- colMeans(is.na(df)) * 100
missing_df <- data.frame(Column = names(missing_percent), Percent_Missing = missing_percent) %>%
filter(Percent_Missing > 0) %>%
arrange(desc(Percent_Missing))
# Table
kable(missing_df, caption = "Columns with Missing Values (%)", digits = 2)
Column | Percent_Missing |
---|---|
# Bar plot, drawn only if at least one column has missing data
if (nrow(missing_df) > 0) {
ggplot(missing_df, aes(x = reorder(Column, Percent_Missing), y = Percent_Missing)) +
geom_bar(stat = "identity", fill = brewer.pal(3, "Set2")[1]) +
coord_flip() +
labs(title = "Percentage of Missing Values", x = "Feature", y = "% Missing") +
theme_minimal() +
theme(text = element_text(size = 10))
}
After preprocessing, no column contains missing values (the table above is empty), which indicates that the cleaning step carried out by teammates handled all missing entries. Every row can therefore be used for modeling without further imputation.
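The check above counts only values that R already treats as NA. By default, read_csv() converts empty strings and the literal "NA" to NA, so placeholder codes such as "unknown" in character columns would not show up in the table. A minimal sketch of a stricter check follows; the helper name count_missing_like, the placeholder list, and the toy data frame are assumptions made for illustration and should be adapted to the codes actually used in the dataset.

```r
# Hypothetical helper: count NA values plus placeholder-coded strings per column.
# The placeholder list is an assumption, not taken from the dataset.
count_missing_like <- function(data, placeholders = c("", "unknown", "n/a")) {
  vapply(data, function(col) {
    flagged <- is.na(col)
    if (is.character(col)) {
      # Case-insensitive match after trimming surrounding whitespace
      flagged <- flagged | (tolower(trimws(col)) %in% placeholders)
    }
    sum(flagged)
  }, integer(1))
}

# Toy data for illustration only
toy <- data.frame(
  age = c(34, NA, 51),
  sleep_quality = c("good", "Unknown", "poor"),
  stringsAsFactors = FALSE
)

missing_counts <- count_missing_like(toy)
print(missing_counts)
```

On the real data this would be called as count_missing_like(df), and any nonzero entry would point to a column worth inspecting by hand.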
# Function to count outliers
count_outliers <- function(x) {
iqr <- IQR(x, na.rm = TRUE)
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}
# Summarize outliers by target
outlier_summary <- df %>%
group_by(target) %>%
summarise(Age_Outliers = count_outliers(age), BMI_Outliers = count_outliers(bmi_corrected))
kable(outlier_summary, caption = "Number of Outliers by Health Status")
target | Age_Outliers | BMI_Outliers |
---|---|---|
diseased | 0 | 0 |
healthy | 0 | 0 |
# Boxplot for BMI
ggplot(df, aes(x = target, y = bmi_corrected, fill = target)) +
geom_boxplot(alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "BMI by Health Status", x = "Health Status", y = "BMI") +
theme_minimal() +
theme(text = element_text(size = 10))
No outliers were detected in age or bmi_corrected under the 1.5 × IQR rule, applied within each health-status group, which is consistent with the cleaning already performed. The boxplot shows a higher median BMI in the diseased group than in the healthy group. This fits with obesity being a key risk factor for chronic diseases, given Australia’s 31% obesity rate (AIHW, 2023), and one tied to socioeconomic health disparities. BMI is therefore a strong candidate predictor of health status.
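The boxplot comparison can be backed by numbers. The sketch below computes, for each group, the median and the 1.5 × IQR fences that the outlier rule uses; group_box_stats is a hypothetical helper and the toy vectors are made up for illustration.

```r
# Hypothetical helper: per-group median and 1.5 * IQR fences for one numeric vector
group_box_stats <- function(values, groups) {
  stats <- lapply(split(values, groups), function(x) {
    q <- quantile(x, c(0.25, 0.5, 0.75), na.rm = TRUE, names = FALSE)
    iqr <- q[3] - q[1]
    c(median = q[2],
      lower_fence = q[1] - 1.5 * iqr,
      upper_fence = q[3] + 1.5 * iqr)
  })
  # One row per group, one column per statistic
  do.call(rbind, stats)
}

# Toy data for illustration only
toy_bmi <- c(22, 24, 26, 28, 27, 29, 31, 33)
toy_status <- rep(c("healthy", "diseased"), each = 4)

box_stats <- group_box_stats(toy_bmi, toy_status)
print(box_stats)
```

On the real data the call would be group_box_stats(df$bmi_corrected, df$target); values outside a group's fences are the ones count_outliers() would flag.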
# Summary stats for age by target
summary_stats <- df %>%
group_by(target) %>%
summarise(Mean_Age = mean(age, na.rm = TRUE), Median_Age = median(age, na.rm = TRUE))
kable(summary_stats, caption = "Age Statistics by Health Status", digits = 1)
target | Mean_Age | Median_Age |
---|---|---|
diseased | 48.7 | 49 |
healthy | 48.4 | 48 |
# Histogram for age
ggplot(df, aes(x = age, fill = target)) +
geom_histogram(position = "dodge", bins = 20, alpha = 0.7) +
scale_fill_brewer(palette = "Set2") +
labs(title = "Age Distribution by Health Status", x = "Age (Years)", y = "Count") +
theme_minimal() +
theme(text = element_text(size = 10))
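The age summary shows that diseased individuals are only marginally older (mean 48.7, median 49) than healthy individuals (mean 48.4, median 48). The histogram suggests that the healthy group is relatively more concentrated at younger ages (20-40) and the diseased group at older ages (60-80). Because the histogram plots counts and the healthy group is about 2.3 times larger (70,097 vs. 29,903), bar heights reflect class size as well as age, so the shapes should be compared with care. This subtle pattern is consistent with higher chronic disease risk in Australia’s aging population (AIHW, 2023). With a mean difference of only 0.3 years, age on its own is at most a weak predictor for classification.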
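With 100,000 rows, even a very small age difference can reach statistical significance, so an effect size is more informative than a p-value alone. The sketch below pairs a Welch t-test with Cohen's d; cohens_d is a hypothetical helper written here for illustration, and the toy vectors are not taken from the dataset.

```r
# Hypothetical helper: Cohen's d using the pooled standard deviation
cohens_d <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  pooled_sd <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / pooled_sd
}

# Toy data for illustration only
toy_diseased <- c(50, 52, 54)
toy_healthy <- c(48, 50, 52)

# t.test() performs the Welch (unequal variance) test by default
welch <- t.test(toy_diseased, toy_healthy)
d <- cohens_d(toy_diseased, toy_healthy)

print(welch$p.value)
print(d)
```

On the real data the inputs would be df$age[df$target == "diseased"] and df$age[df$target == "healthy"]. A value of d close to zero would confirm that the age shift, while visible in the summary table, is small relative to the spread of ages.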
# Select numeric columns
num_df <- df %>% select(where(is.numeric)) %>% na.omit()
# PCA
pca_res <- prcomp(num_df, scale. = TRUE)
# Variance explained
var_explained <- summary(pca_res)$importance[2, 1:2] * 100
var_text <- paste("PC1:", round(var_explained[1], 1), "%, PC2:", round(var_explained[2], 1), "%")
# Plot
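# Align target with the rows kept by na.omit(): complete.cases() must be evaluated
# on the original numeric columns, not on the already-filtered num_df
pca_df <- data.frame(PC1 = pca_res$x[,1], PC2 = pca_res$x[,2], target = df$target[complete.cases(select(df, where(is.numeric)))])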
ggplot(pca_df, aes(x = PC1, y = PC2, color = target)) +
geom_point(alpha = 0.6, size = 1.5) +
scale_color_brewer(palette = "Set2") +
labs(title = "PCA Plot of Health Data", subtitle = var_text, x = "PC1", y = "PC2") +
theme_minimal() +
theme(text = element_text(size = 10))
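The PCA plot shows moderate overlap between healthy and diseased groups, with PC1 (8.7%) and PC2 (4.4%) together explaining only 13.1% of the variance. The PCA was run on the 23 numeric columns only, not on all 41 columns. With standardized inputs, each variable contributes 1/23 ≈ 4.3% of the total variance, so PC2 at 4.4% is barely above the share of a single variable, which points to weak linear correlation among the numeric features. Features such as BMI and age may contribute to the partial separation, in line with Australia’s obesity and aging challenges (AIHW, 2023), but this should be confirmed from the component loadings. Overall, the first two components capture too little structure to separate the classes, which suggests that classification models will need to exploit many features and their interactions.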
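Which variables drive a component can be read from the loadings stored in the rotation matrix returned by prcomp(), and the variance shares can be compared against the equal-share value 1/p for p standardized variables. The following is a minimal sketch on simulated data; the toy columns a to d are assumptions for illustration and stand in for the numeric columns of the dataset.

```r
# Simulated, roughly uncorrelated data for illustration only
set.seed(1)
toy_num <- data.frame(a = rnorm(200), b = rnorm(200), c = rnorm(200), d = rnorm(200))

toy_pca <- prcomp(toy_num, scale. = TRUE)

# Share of total variance captured by each component
var_share <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)

# Share each component would have if all variables were uncorrelated
baseline <- 1 / ncol(toy_num)

# Variables ranked by the absolute size of their PC1 loading
top_loadings <- sort(abs(toy_pca$rotation[, "PC1"]), decreasing = TRUE)

print(round(100 * var_share, 1))
print(baseline)
print(top_loadings)
```

Applied to pca_res, the same ranking would show whether bmi_corrected and age actually carry the largest PC1 loadings or whether other features dominate.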
This exploration reveals key patterns. The cleaned data contain no missing values and no IQR-rule outliers in age or BMI, and diseased individuals have a higher median BMI, consistent with Australia’s obesity challenge (AIHW, 2023). The small age shift and the low variance captured by the first two principal components (13.1%) suggest that no single feature or simple linear combination separates the classes, so classification will likely depend on interactions among demographic and behavioral factors. These insights, read against Australian health disparities, inform the modeling strategy.
# Optional correlation matrix (example)
cor_matrix <- cor(df %>% select(where(is.numeric)) %>% na.omit())
corrplot(cor_matrix, method = "color", type = "upper", order = "hclust",
addCoef.col = "black", tl.col = "black", tl.srt = 45, diag = FALSE)
[Placeholder]