This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.
DUE: Tuesday January 14th 11:59 pm
For this assignment you will apply what you learned about basic statistical analysis to a data set on heart disease. The assignment should be completed as a new RMD document and knitted. Submit the html knitted document and the RMD document.
If you are piping long codes, please use the # to explain your work.
To help you out I have set up another data set.
To get you started here is the link to the data https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat
and here is a vector of column names:
age, sex, chestPain, restBP, cholesterol, fastGlucose, restEC, maxHeartRate, exAngina, oldPeak, slope, numVessels, thal, disease
Here is what they mean
age-- 1. age
sex-- 2. sex
cp-- 3. chest pain type (4 values)
restbp-- 4. resting blood pressure
chol-- 5. serum cholesterol in mg/dl
fbs-- 6. fasting blood sugar > 120 mg/dl
restecg-- 7. resting electrocardiographic results (values 0,1,2)
maxach -- 8. maximum heart rate achieved
exang -- 9. exercise induced angina
oldpeak -- 10. oldpeak = ST depression induced by exercise relative to rest
slope -- 11. the slope of the peak exercise ST segment
num -- 12. number of major vessels (0-3) colored by fluoroscopy
thal -- 13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
disease -- 14. 1 = healthy; 2 = sick
Attributes types
Real: 1,4,5,8,10,12 Ordered:11, Binary: 2,6,9 Nominal:7,3,13
QUESTIONS
Starting with a new notebook,completed the numbered tasks/questions below. Put each task as at least one code chunk or more if needed.
Complete this question on your own using short answers 1-2 sentences (0.5 point each).
Be sure to briefly describe what each of the code chunks is doing and why you are doing the analysis/calculation.
Only use the # comments in the code chunks to organize and annotate the code, DO NOT use this to discuss results.
After the code chunk be sure to assess the findings in the free text. Consider is this expected is this result being used in a subsequent step? If using a statistical test was the result significant/non-significant? At the end make any conclusions from your findings. You can work alone or in a small group of 2-3 people. But you must submit your own work.
This code chunk will read in the data remotely from a url and add the labels on to the column and format some of the categorical variables to be more readable. Often these are 0 and 1 or 1 and 2. It is convent to people readable letters or words. For example sex will be converted from 0 and 1 to M and F.
url<-"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"
raw_heart <- read.csv(url, sep=" ", header = F)
names <- c("age", "sex", "cp", "restbp",
"chol", "fbs", "restecg", "maxach", "exang", "oldpeak", "slope", "num",
"thal","disease")
data_heart <- data.frame(raw_heart)
colnames(data_heart) <- names
data_heart$sex <- factor(data_heart$sex, labels=c("F", "M"))
data_heart$disease <- factor(data_heart$disease, labels=c("H", "S"))
data_heart$ID <- as.numeric(rownames(data_heart))+1284
data_heart$logCholesterol <- log10(data_heart$chol)
NOTE: the call to the internet may cause problems when knitting the document to HTML. To solve this you should run the code to create the heart.dat table. Then save the table to your drive. Create a new code chunk to load the table from your drive. Comment the code to prevent it from running during knitting.
# to save the data file, only need to do this once then comment out
save(data_heart, file= "data_heart.Rdata")
# to load the data file, need to do this every time, this will load the object heart.dat
load("data_heart.Rdata")
table_disease_sex <- table(data_heart$disease, data_heart$sex) # This creates a contingency table
print(table_disease_sex)
##
## F M
## H 67 83
## S 20 100
chi_test <- chisq.test(table_disease_sex) # chi-square test can determine if there is a significant relationship between disease and sex
print(chi_test)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: table_disease_sex
## X-squared = 22.667, df = 1, p-value = 1.926e-06
The results demonstrate a p-value less than 0.05 so the null hypothesis is rejected and it can be inferred that there is a significant association between disease and sex. Thus, men are more sick than women by a significant amount.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
variables <- c("age", "restbp", "chol")
outliers <- data_heart %>%
select(all_of(variables)) %>% # This will select the columns that will be used
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>% # Reshape data to long
group_by(Variable) %>% # Group by each variable
summarise( # Reduces the data to a single row
Q1 = quantile(Value, 0.25),
Q3 = quantile(Value, 0.75),
IQR = Q3 - Q1, # Calculates IQR
Lower_Bound = Q1 - 1.5 * IQR, # Calculates lower bound
Upper_Bound = Q3 + 1.5 * IQR, # Calculates upper bound
Number_of_Outliers = sum(Value < Lower_Bound | Value > Upper_Bound) # Counts the number of outliers
)
print(outliers)
## # A tibble: 3 × 7
## Variable Q1 Q3 IQR Lower_Bound Upper_Bound Number_of_Outliers
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 age 48 61 13 28.5 80.5 0
## 2 chol 213 280 67 112. 380. 5
## 3 restbp 120 140 20 90 170 9
Age had 0 outliers, resting blood pressure had 9 outliers, and cholesterol had 5 outliers. These outliers represent extreme values that fall outside the calculated lower and upper bounds
shapiro_results <- data.frame(
Variable = c("age", "restbp", "chol"),
p_Value = c(
shapiro.test(data_heart$age)$p.value, # Extract p-value from each Shapiro-Wilk test for each variable
shapiro.test(data_heart$restbp)$p.value,
shapiro.test(data_heart$chol)$p.value
)
)
shapiro_results$Normal_Status <- ifelse(shapiro_results$p_Value >= 0.05, "Yes", "No")
# Adds a column to indicate whether the variable is normally distributed or not based on the p-value (greater than or equal to 0.05)
print(shapiro_results)
## Variable p_Value Normal_Status
## 1 age 2.765234e-02 No
## 2 restbp 3.738947e-06 No
## 3 chol 1.078634e-08 No
qqnorm(data_heart$age) # Q-Q plot for age
qqline(data_heart$age)
qqnorm(data_heart$restbp) # Q-Q plot for resting blood pressure
qqline(data_heart$restbp)
qqnorm(data_heart$chol) # Q-Q plot for cholesterol
qqline(data_heart$chol)
The Shapiro-Wilk test demonstrates that none of the variables (age,
resting blood pressure, cholesterol) are normally distributed, as all of
their p-values are less than 0.05.
males_data <- data_heart %>%
filter(sex == "M") # Filters data to only include males
variables <- c("age","restbp", "chol")
summary_table <- data.frame(
Variable = character(), # Name of variable
Presented_As = character(), # Whether the mean or median was calculated
Value_of_Sick_Group = numeric(),
Value_of_Healthy_Group = numeric(),
Pvalue = numeric(),
Test_for_significance = character() # Name of the test performed
)
for (v in variables) { # Extracts data for sick and healthy individuals
sick_data <- males_data[males_data$disease =="S", v]
healthy_data <- males_data[males_data$disease == "H",v]
normal_sickdata <- shapiro.test(sick_data)$p.value >= 0.05
normal_healthydata <- shapiro.test(healthy_data)$p.value >= 0.05
if(normal_sickdata & normal_healthydata) {
test_result <- t.test(sick_data, healthy_data) # If both groups are normally distributed, then perform a T-test
presented_as <- "Mean"
value_sick <- mean(sick_data)
value_healthy <- mean(healthy_data)
test_name <- "T-test"
} else {
test_result <- wilcox.test(sick_data, healthy_data) # If either group is not normal, then perform a U-test
presented_as <- "Median"
value_sick <- median(sick_data)
value_healthy <- median(healthy_data)
test_name <- "U-test"
}
summary_table <- rbind( # Add the results to the table
summary_table,
data.frame(
Variable = v, # Current variable (age, resting blood pressure, cholesterol)
Presented_As = presented_as, # Mean or Median
Value_of_Sick_Group = value_sick,
Value_of_Healthy_Group = value_healthy,
Pvalue = test_result$p.value, # P-value from the test conducted
Test_for_significance = test_name # Name of the test conducted
)
)
}
print (summary_table)
## Variable Presented_As Value_of_Sick_Group Value_of_Healthy_Group Pvalue
## 1 age Mean 56.04 51.19277 0.0001734191
## 2 restbp Median 129.00 130.00000 0.5214372201
## 3 chol Mean 249.76 233.72289 0.0078357185
## Test_for_significance
## 1 T-test
## 2 U-test
## 3 T-test
Age and cholesterol both use T-test and resting blood pressure uses U-test. The results suggest that age and cholesterol are important variables in men between the sick and healthy group, while resting blood pressure does not show a significant difference.
Good luck and have fun 🍀🍀🍀