The data comes from the UCI Machine Learning Repository, and this specific data set describes patients at the VA hospital in Long Beach, California (for more detailed information, go to this data set description).
Use the following code to load the data. An internet connection is required. Install the packages beforehand if need be.
library("tidyverse") # loads dplyr (%>%, mutate) and ggplot2
library("readr") #enables read_csv() function
col_attributes <- c("age", "sex", "cp",
"trestbps", "chol", "fbs",
"restecg", "thalach", "exang",
"oldpeak", "slope", "ca",
"thal", "num") # used later to rename the columns
# heart_df <- read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/heart-disease/processed.va.data", col_names = col_attributes)
heart_df <- read_csv("http://mlr.cs.umass.edu/ml/machine-learning-databases/heart-disease/processed.va.data", col_names = col_attributes) #use this link if the UCI website is down
heart_df <- as_tibble(heart_df) # tbl_df() is deprecated in favor of as_tibble()
heart_df$chol[heart_df$chol < 100] <- NA # re-labels implausibly low values as NA (chol is still character here, so "<" compares strings)
heart_df
## # A tibble: 200 x 14
## age sex cp trestbps chol fbs restecg thalach exang oldpeak
## <int> <int> <int> <chr> <chr> <chr> <int> <chr> <chr> <chr>
## 1 63 1 4 140 260 0 1 112 1 3
## 2 44 1 4 130 209 0 1 127 0 0
## 3 60 1 4 132 218 0 1 140 1 1.5
## 4 55 1 4 142 228 0 1 149 1 2.5
## 5 66 1 3 110 213 1 2 99 1 1.3
## 6 66 1 3 120 <NA> 0 1 120 0 -0.5
## 7 65 1 4 150 236 1 1 105 1 0
## 8 60 1 3 180 <NA> 0 1 140 1 1.5
## 9 60 1 3 120 <NA> ? 0 141 1 2
## 10 60 1 2 160 267 1 1 157 0 0.5
## # ... with 190 more rows, and 4 more variables: slope <chr>, ca <chr>,
## # thal <chr>, num <int>
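Several columns above are parsed as character because the file marks missing entries with "?". The following is a minimal sketch, not the author's code, of how such placeholders could be turned into NA at read time so the columns come in as numbers. It uses base R's read.csv() on a small made-up inline excerpt (the values and the shortened column list are assumptions for illustration) rather than the real file.

```r
# Hypothetical three-row excerpt in the same comma-separated layout as the file;
# the values are made up for illustration.
raw_text <- "63,1,4,140,260,0\n66,1,3,120,?,0\n60,1,3,120,0,?"

demo_df <- read.csv(text = raw_text, header = FALSE,
                    col.names = c("age", "sex", "cp", "trestbps", "chol", "fbs"),
                    na.strings = "?")   # treat "?" as missing while parsing

# chol is numeric now, so this comparison is numeric rather than string-based
demo_df$chol[!is.na(demo_df$chol) & demo_df$chol < 100] <- NA

str(demo_df)
```

With the placeholders handled while reading, a later as.numeric() conversion would no longer be needed for these columns.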
The num column in the data frame refers to risk of a heart attack, with 0 = “no risk” and 4 = “high risk”. Here we test the claim that the average risk level is 1.6 against the alternative that it is higher, using the following null and alternative hypotheses, respectively: \[H_{0}: \mu = 1.6\] \[H_{a}: \mu > 1.6\]
n <- nrow(heart_df)
null_value <- 1.6
sample_means <- replicate(100, mean(sample(heart_df$num, n*0.60, replace = TRUE)))
df <- data.frame(sample_means, null_value)
Set the threshold at the 95th percentile, and create a classification column that labels results in the sampling distribution that are above the threshold as being in the “reject region” (“fail to reject” otherwise).
threshold <- quantile(sample_means, 0.95)
df <- df %>%
mutate(classification = ifelse(sample_means > threshold, "reject region", "fail to reject"))
ggplot(df, aes(x = sample_means, fill = classification)) +
geom_dotplot(binwidth = 0.01) +
geom_vline(aes(xintercept = null_value), col = "red") +
labs(caption = "Vertical line at null value",
title = "Sampling Distribution",
x = "sample means",
y = "proportion")
The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. Here we approximate the p-value empirically as the proportion of sample means that are less than or equal to the null value.
p_value <- mean(sample_means <= null_value)
p_value
## [1] 0.78
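Because the sample means are drawn at random, the empirical p-value changes from run to run. The following is a minimal sketch of the same resampling idea made reproducible with a fixed seed. The helper name resample_means(), the seed, and the data vector are all assumptions for illustration; the values are made up and are not the VA data.

```r
set.seed(1)  # assumed seed, only so repeated runs of this sketch agree

# Hypothetical helper: means of `reps` resamples, each using a fraction `frac` of the data
resample_means <- function(x, reps = 100, frac = 0.60) {
  size <- floor(length(x) * frac)
  replicate(reps, mean(sample(x, size, replace = TRUE), na.rm = TRUE))
}

# Made-up risk scores on the 0-4 scale
x <- c(0, 1, 1, 2, 3, 0, 4, 2, 1, 1, 3, 2, 0, 1, 2, 1, 3, 0, 2, 1)
null_value <- 1.6

sample_means <- resample_means(x, reps = 1000)

# Right-tailed setup: share of resampled means at or below the null value
p_value <- mean(sample_means <= null_value)
p_value
```

Using more replicates (1000 here instead of 100) makes the proportion less noisy, at the cost of a little more computation.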
In practice, instead of simulating, we use R's t.test() function, which computes the p-value from the t distribution.
t.test(heart_df$num, mu = 1.6, alternative = "greater")
##
## One Sample t-test
##
## data: heart_df$num
## t = -0.92778, df = 199, p-value = 0.8227
## alternative hypothesis: true mean is greater than 1.6
## 95 percent confidence interval:
## 1.377505 Inf
## sample estimates:
## mean of x
## 1.52
Visually, and since the p-value 0.8226768 > 0.05, we fail to reject the claim at the \(\alpha = 0.05\) significance level.
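To see where the t statistic and p-value in such output come from, the following sketch computes them by hand for a right-tailed one-sample test and compares the result with t.test(). The data vector is made up for illustration and is not the VA data.

```r
# Made-up scores on the 0-4 scale
x <- c(0, 1, 1, 2, 3, 0, 4, 2, 1, 1)
mu0 <- 1.6

n_obs <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n_obs))            # distance from mu0 in standard errors
p_manual <- pt(t_stat, df = n_obs - 1, lower.tail = FALSE)   # area to the right of t_stat

res <- t.test(x, mu = mu0, alternative = "greater")
c(manual = p_manual, t_test = res$p.value)
```

The two p-values agree because t.test() with alternative = "greater" reports the upper-tail area of the t distribution with n - 1 degrees of freedom.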
A total cholesterol level less than 240 milligrams per deciliter (mg/dL) is desirable for adults (source). Test the claim that these patients are healthy with the following setup: \[H_{0}: \mu = 240\] \[H_{a}: \mu < 240\]
#converts characters to numbers
heart_df$chol <- as.numeric(heart_df$chol)
null_value <- 240
sample_means <- replicate(100,
mean(sample(heart_df$chol, n*0.60, replace = TRUE),
na.rm = TRUE))
df <- data.frame(sample_means, null_value)
Set the threshold at the 5th percentile, and create a classification column that labels results in the sampling distribution that are below the threshold as being in the “reject region” (“fail to reject” otherwise).
threshold <- quantile(sample_means, 0.05)
df <- df %>%
mutate(classification = ifelse(sample_means < threshold, "reject region", "fail to reject"))
ggplot(df, aes(x = sample_means, fill = classification)) +
geom_dotplot(binwidth = 1) +
geom_vline(aes(xintercept = null_value), col = "red") +
labs(caption = "Vertical line at null value",
title = "Sampling Distribution",
x = "sample means",
y = "proportion")
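The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. Here we approximate the p-value empirically as the proportion of sample means that are greater than or equal to the null value.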
p_value <- mean(sample_means >= null_value)
p_value
## [1] 0.48
In practice, instead of simulating, we use R's t.test() function, which computes the p-value from the t distribution.
t.test(heart_df$chol, mu = null_value, alternative = "less")
##
## One Sample t-test
##
## data: heart_df$chol
## t = -0.097874, df = 143, p-value = 0.4611
## alternative hypothesis: true mean is less than 240
## 95 percent confidence interval:
## -Inf 246.8524
## sample estimates:
## mean of x
## 239.5694
Visually, and since the p-value 0.4611 > 0.05, we fail to reject the claim at the \(\alpha = 0.05\) significance level.
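The cholesterol output reports df = 143 rather than 199, which is consistent with only 144 of the 200 values being non-missing: t.test() drops NA entries before computing. The following sketch illustrates this on a made-up vector (not the VA data).

```r
# Made-up cholesterol readings (mg/dL) with two missing entries
chol <- c(260, 209, 218, NA, 213, 236, NA, 267, 228, 245)

res <- t.test(chol, mu = 240, alternative = "less")

n_used <- sum(!is.na(chol))   # observations actually used by the test
unname(res$parameter)         # degrees of freedom reported by t.test()
n_used - 1                    # the same number, computed by hand
```

Checking the degrees of freedom this way is a quick sanity test that the missing-value recoding removed the expected number of rows.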
Test the claim that heart-risk patients are elderly with the following setup: \[H_{0}: \mu = 65\] \[H_{a}: \mu \neq 65\]
null_value <- 65
sample_means <- replicate(100,
mean(sample(heart_df$age, n*0.60, replace = TRUE),
na.rm = TRUE))
df <- data.frame(sample_means, null_value)
Set the thresholds at the 2.5th and 97.5th percentiles, and create a classification column that labels results in the sampling distribution that are outside the thresholds as being in the “reject region” (“fail to reject” otherwise).
left_threshold <- quantile(sample_means, 0.025)
right_threshold <- quantile(sample_means, 0.975)
df <- df %>%
mutate(classification = ifelse(sample_means < left_threshold | sample_means > right_threshold,
"reject region", "fail to reject"))
ggplot(df, aes(x = sample_means, fill = classification)) +
geom_dotplot(binwidth = 0.1) +
geom_vline(aes(xintercept = null_value), col = "red") +
labs(caption = "Vertical line at null value",
title = "Sampling Distribution",
x = "sample means",
y = "proportion")
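The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. Because this test is two-sided, we approximate the p-value empirically as twice the smaller of the two tail proportions: the proportion of sample means at or below the null value and the proportion at or above it.
p_value <- 2 * min(mean(sample_means <= null_value), mean(sample_means >= null_value))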
p_value
## [1] 0
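A two-sided empirical p-value is commonly taken as twice the smaller tail proportion, capped at 1. The following is a minimal sketch under that assumption; two_sided_p() is a hypothetical helper, not a function from any package, and the resampled means are made up for illustration.

```r
# Hypothetical helper: twice the smaller tail proportion, capped at 1
two_sided_p <- function(sample_means, null_value) {
  lower <- mean(sample_means <= null_value)  # share at or below the null value
  upper <- mean(sample_means >= null_value)  # share at or above the null value
  min(1, 2 * min(lower, upper))
}

# Made-up resampled means clustered near 59
demo_means <- c(58.7, 59.1, 59.4, 59.0, 59.8, 58.9, 59.3, 59.6)

two_sided_p(demo_means, 65)    # 0: no resampled mean reaches 65
two_sided_p(demo_means, 59.2)  # 1: the null value sits in the middle of the means
```

An empirical p-value of exactly 0 only means that none of the simulated means was as extreme as the null value; with a finite number of replicates it is better read as "smaller than 1 / (number of replicates)".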
In practice, instead of simulating, we use R's t.test() function, which computes the p-value from the t distribution.
t.test(heart_df$age, mu = null_value, alternative = "two.sided")
##
## One Sample t-test
##
## data: heart_df$age
## t = -10.229, df = 199, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 65
## 95 percent confidence interval:
## 58.26075 60.43925
## sample estimates:
## mean of x
## 59.35
Visually, and since the p-value \(5.2793067 \times 10^{-20} < 0.05\), we reject the claim at the \(\alpha = 0.05\) significance level.
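For a two-sided test, the decision can also be read off the confidence interval: the 95 percent interval in the output above (58.26 to 60.44) does not contain 65, which matches the rejection at the 0.05 level. The following sketch checks that correspondence on a made-up vector of ages (not the VA data).

```r
# Made-up ages
age <- c(55, 61, 58, 63, 60, 57, 62, 59, 64, 56)
mu0 <- 65

res <- t.test(age, mu = mu0, alternative = "two.sided", conf.level = 0.95)

outside_ci <- mu0 < res$conf.int[1] || mu0 > res$conf.int[2]  # null value outside the interval?
rejects    <- res$p.value < 0.05                              # test rejects at alpha = 0.05?
c(outside_ci = outside_ci, rejects = rejects)
```

The two flags always agree for a two-sided t-test when the confidence level is 1 minus the significance level, since both are derived from the same t statistic.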