This report analyzes the processed Cleveland subset of the Heart Disease dataset from the UCI Machine Learning Repository. We examine the distributions of the numeric features, test each for normality, and compare feature values between patients with and without heart disease.
# Load the heart disease dataset from UCI
data_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.cleveland.data"
heart_data <- read.csv(data_url, header = FALSE, na.strings = "?")
# Assign column names based on UCI documentation
colnames(heart_data) <- c("age", "sex", "cp", "trestbps", "chol", "fbs", "restecg",
"thalach", "exang", "oldpeak", "slope", "ca", "thal", "num")
# Remove missing values and create binary outcome
heart_data <- na.omit(heart_data)
heart_data$HeartDisease <- ifelse(heart_data$num == 0, "No", "Yes")
None of the numeric features in the Heart Disease dataset met the normality assumption: the Shapiro-Wilk test returned a very low p-value for every feature. Even age and maximum heart rate (thalach), which came closest to normal, had significant p-values. The QQ-plots supported these findings, showing departures from the straight line expected under normality. Note that several of the columns treated as numeric here (for example sex, cp, fbs, restecg, exang, slope, and thal) are integer codes for categories, so a normality test on them is expected to reject. Because none of the variables appear normally distributed, nonparametric tests may be more appropriate for the group comparisons that follow.
# Select the numeric columns, excluding the original outcome code
numeric_features <- setdiff(names(heart_data)[sapply(heart_data, is.numeric)], "num")
# Shapiro-Wilk test and QQ-plot for each feature, four plots per page
par(mfrow = c(2, 2))
for (feature in numeric_features) {
  print(paste("Normality test for", feature))
  print(shapiro.test(heart_data[[feature]]))
  qqnorm(heart_data[[feature]], main = paste("QQ-plot of", feature))
  qqline(heart_data[[feature]], col = "red")
}
par(mfrow = c(1, 1))  # restore the single-plot layout
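The loop above prints one test result at a time, which makes the features hard to compare. As a minimal sketch, assuming a compact table is wanted, the helper below gathers the Shapiro-Wilk statistic and p-value for each column into one data frame. The function name `shapiro_summary` and the synthetic `demo` data are hypothetical and are not part of the original analysis.

```r
# Hypothetical helper (not in the original report): collect Shapiro-Wilk
# results for several columns of a data frame into a single table.
shapiro_summary <- function(df, features) {
  rows <- lapply(features, function(f) {
    res <- shapiro.test(df[[f]])
    data.frame(
      feature = f,
      W = unname(res$statistic),
      p_value = res$p.value,
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, rows)
  out[order(out$p_value), ]  # smallest p-value (strongest rejection) first
}

# Synthetic stand-in data so the sketch runs without downloading anything
set.seed(1)
demo <- data.frame(
  roughly_normal = rnorm(200, mean = 150, sd = 20),
  right_skewed = rexp(200, rate = 1)
)
normality_table <- shapiro_summary(demo, names(demo))
print(normality_table)
```

With the report's data, the same helper could be called as `shapiro_summary(heart_data, c("age", "trestbps", "chol", "thalach", "oldpeak"))` to restrict the check to the continuous measurements.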
library(ggplot2)
library(reshape2)
# Reshape to long format: one row per (patient, feature) value
data_melt <- melt(heart_data[, c(numeric_features, "HeartDisease")], id.vars = "HeartDisease")
ggplot(data_melt, aes(x = variable, y = value, fill = HeartDisease)) +
  geom_boxplot() +
  theme_minimal() +
  labs(title = "Boxplots of Features by Heart Disease Outcome", x = "Features", y = "Values")
for (feature in numeric_features) {
  group_no <- heart_data[heart_data$HeartDisease == "No", feature]
  group_yes <- heart_data[heart_data$HeartDisease == "Yes", feature]
  # Use Welch's t-test only if both groups pass Shapiro-Wilk; otherwise use
  # the Wilcoxon rank sum test
  if (shapiro.test(group_no)$p.value > 0.05 && shapiro.test(group_yes)$p.value > 0.05) {
    test_result <- t.test(group_no, group_yes)
  } else {
    test_result <- wilcox.test(group_no, group_yes)
  }
  cat("Hypothesis test for", feature, ":\n")
  print(test_result)
  cat("\n")
}
## Hypothesis test for age :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 7916.5, p-value = 3.673e-05
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for sex :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 8096.5, p-value = 1.667e-06
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for cp :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 5475.5, p-value = 1.276e-15
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for trestbps :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 9292.5, p-value = 0.02346
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for chol :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 9492, p-value = 0.04669
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for fbs :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 10936, p-value = 0.9574
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for restecg :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 9119, p-value = 0.004216
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for thalach :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 16399, p-value = 1.675e-13
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for exang :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 6615.5, p-value = 4.216e-13
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for oldpeak :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 5833.5, p-value = 1.539e-12
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for slope :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 6897, p-value = 7.272e-10
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for ca :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 5426.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
##
##
## Hypothesis test for thal :
##
## Wilcoxon rank sum test with continuity correction
##
## data: group_no and group_yes
## W = 5118.5, p-value < 2.2e-16
## alternative hypothesis: true location shift is not equal to 0
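Because thirteen tests are run at the 0.05 level, some readers may want to see how the borderline results hold up under a multiplicity adjustment. The following is a minimal sketch, not part of the original analysis, that applies Holm's method through `p.adjust` to the p-values printed above. The names `raw_p`, `holm_p`, and `comparison` are hypothetical, and the two results reported only as `< 2.2e-16` are entered at that upper bound as an assumption.

```r
# p-values copied from the test output above. "ca" and "thal" were reported
# only as "< 2.2e-16", so that bound is used as a stand-in (assumption).
raw_p <- c(
  age = 3.673e-05, sex = 1.667e-06, cp = 1.276e-15, trestbps = 0.02346,
  chol = 0.04669, fbs = 0.9574, restecg = 0.004216, thalach = 1.675e-13,
  exang = 4.216e-13, oldpeak = 1.539e-12, slope = 7.272e-10,
  ca = 2.2e-16, thal = 2.2e-16
)

# Holm's step-down adjustment controls the family-wise error rate
holm_p <- p.adjust(raw_p, method = "holm")

comparison <- data.frame(
  raw = raw_p,
  holm = holm_p,
  significant_after_holm = holm_p < 0.05
)
print(comparison[order(comparison$raw), ])
```

Sorting by the raw p-value makes it easy to check whether the weaker results, such as trestbps and chol, remain below 0.05 after adjustment.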
For cholesterol (chol), whose values vary widely and overlap heavily between the two groups, many of the 1000 t-tests produced p-values above 0.05, especially at the smaller sample sizes, so the groups were not consistently distinguishable. In contrast, for maximum heart rate (thalach), which has lower variance and less overlap, nearly every test yielded a very low p-value, showing a clear difference between the groups even with small samples. Increasing the sample size from 10 to 20 per group made the results more consistent in both cases: chol produced more significant tests, and thalach kept its very low p-values. The simulation illustrates how both the separation between groups and the sample size affect the outcome of a hypothesis test.
set.seed(123)
features_to_analyze <- c("chol", "thalach")
sample_sizes <- c(10, 15, 20)
for (feature in features_to_analyze) {
  for (n in sample_sizes) {
    # Draw n patients per group 1000 times and keep the t-test p-value
    pvals <- replicate(1000, {
      group_no <- sample(heart_data[heart_data$HeartDisease == "No", feature], n)
      group_yes <- sample(heart_data[heart_data$HeartDisease == "Yes", feature], n)
      t.test(group_no, group_yes)$p.value
    })
    hist(pvals, main = paste("P-value Distribution for", feature, "\nSample Size:", n),
         xlab = "P-value", col = "lightblue", border = "gray")
    abline(v = 0.05, col = "red", lwd = 2)  # mark the 0.05 threshold
  }
}
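The histograms show the shape of each p-value distribution, but the share of tests falling below 0.05 can also be reported as a single number per feature and sample size. The sketch below is one way to do that, assuming the same resampling scheme as the simulation above. The helper `rejection_rate` is hypothetical, and the four synthetic vectors use assumed means and standard deviations rather than values taken from the dataset.

```r
# Hypothetical helper: fraction of resampled t-tests with p < alpha
rejection_rate <- function(x, y, n, reps = 1000, alpha = 0.05) {
  pvals <- replicate(reps, {
    t.test(sample(x, n), sample(y, n))$p.value
  })
  mean(pvals < alpha)
}

set.seed(123)
# Synthetic stand-ins (assumed parameters, not the heart disease data):
# one pair of groups that overlaps heavily, one that is well separated
overlapping_a <- rnorm(150, mean = 240, sd = 50)
overlapping_b <- rnorm(150, mean = 250, sd = 50)
separated_a <- rnorm(150, mean = 160, sd = 20)
separated_b <- rnorm(150, mean = 135, sd = 20)

sizes <- c(10, 15, 20)
rates <- data.frame(
  n = sizes,
  overlapping = sapply(sizes, function(n) {
    rejection_rate(overlapping_a, overlapping_b, n)
  }),
  separated = sapply(sizes, function(n) {
    rejection_rate(separated_a, separated_b, n)
  })
)
print(rates)
```

With the report's data, the two groups of a feature could be passed in directly, for example `rejection_rate(heart_data$chol[heart_data$HeartDisease == "No"], heart_data$chol[heart_data$HeartDisease == "Yes"], n = 10)`.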
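This analysis examined the normality of the clinical measures, tested whether each feature differs between patients with and without heart disease, and explored how sample size affects the distribution of p-values. At the 0.05 level, every feature except fasting blood sugar (fbs, p = 0.9574) differed significantly between the two groups, although the differences for trestbps (p = 0.02346) and chol (p = 0.04669) were comparatively weak.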