R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

TAKE HOME ASSIGNMENT

DUE: Tuesday January 14th 11:59 pm

For this assignment you will apply what you learned about basic statistical analysis to a data set on heart disease. The assignment should be completed as a new RMD document and knitted. Submit the html knitted document and the RMD document.

If you are piping long codes, please use the # to explain your work.

To help you out I have set up another data set.

To get you started here is the link to the data https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat

and here is a vector of column names:

age, sex, chestPain, restBP, cholesterol, fastGlucose, restEC, maxHeartRate, exAngina, oldPeak, slope, numVessels, thal, disease

Here is what they mean

  age-- 1. age       
  sex-- 2. sex       
  cp-- 3. chest pain type  (4 values)       
  restbp-- 4. resting blood pressure  
  chol-- 5. serum cholesterol in mg/dl      
  fbs-- 6. fasting blood sugar > 120 mg/dl       
  restecg-- 7. resting electrocardiographic results  (values 0,1,2) 
 maxach -- 8. maximum heart rate achieved  
 exang -- 9. exercise induced angina    
 oldpeak -- 10. oldpeak = ST depression induced by exercise relative to rest   
slope  -- 11. the slope of the peak exercise ST segment     
num  -- 12. number of major vessels (0-3) colored by fluoroscopy        
thal -- 13.  thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
disease -- 14. 1 = healthy; 2 = sick

Attributes types

Real: 1,4,5,8,10,12 Ordered:11, Binary: 2,6,9 Nominal:7,3,13

QUESTIONS

Starting with a new notebook,completed the numbered tasks/questions below. Put each task as at least one code chunk or more if needed.

Concept quesitons (20%)

Complete this question on your own using short answers 1-2 sentences (0.5 point each).

  1. How do you recognize a function? Functions can be recognized by the () that follows the name of the function. They also often contain an argument within the parenthesis that will tailor how the function works.
  2. What is the name of the setting for a function that can be modified? The settings for a function are called arguments.
  3. What is the main difference between a matrix and data frame? The main difference is that a matrix contains the same data type while a data frame can contain different data types for each column.
  4. What are the three main data types? Give some examples. Numeric data types represent numbers, which can be integers or real numbers (examples: 5, 2.5) Character data types represent text or strings, which are in between “” (examples: “PSL376”) Logical data types represent whether a statement is TRUE or FALSE (example: 10 > 9 will evaluate to TRUE)

Coding activity (80%)

Be sure to briefly describe what each of the code chunks is doing and why you are doing the analysis/calculation.

Only use the # comments in the code chunks to organize and annotate the code, DO NOT use this to discuss results.

After the code chunk be sure to assess the findings in the free text. Consider is this expected is this result being used in a subsequent step? If using a statistical test was the result significant/non-significant? At the end make any conclusions from your findings. You can work alone or in a small group of 2-3 people. But you must submit your own work.

1. Read in the data tables. We will do this together!

This code chunk will read in the data remotely from a url and add the labels on to the column and format some of the categorical variables to be more readable. Often these are 0 and 1 or 1 and 2. It is convent to people readable letters or words. For example sex will be converted from 0 and 1 to M and F.

url<-"https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"

raw_heart <- read.csv(url, sep=" ", header = F)
names <- c("age", "sex", "cp", "restbp", 
  "chol", "fbs", "restecg", "maxach", "exang", "oldpeak", "slope", "num", 
  "thal","disease")
data_heart <- data.frame(raw_heart)
colnames(data_heart) <- names

data_heart$sex <- factor(data_heart$sex, labels=c("F", "M"))
data_heart$disease <- factor(data_heart$disease, labels=c("H", "S"))
data_heart$ID <- as.numeric(rownames(data_heart))+1284
data_heart$logCholesterol <- log10(data_heart$chol)

NOTE: the call to the internet may cause problems when knitting the document to HTML. To solve this you should run the code to create the heart.dat table. Then save the table to your drive. Create a new code chunk to load the table from your drive. Comment the code to prevent it from running during knitting.

# to save the data file, only need to do this once then comment out
save(data_heart, file= "data_heart.Rdata") 

# to load the data file, need to do this every time, this will load the object heart.dat
load("data_heart.Rdata")

2. Create a 2x2 table that describes the distribution of the cohort by disease and sex. Perform a statistical test to determine whether there are significant differences. Write 1-2 sentences that describes your findings. (2 points)

table_disease_sex <- table(data_heart$disease, data_heart$sex) # This creates a contingency table
print(table_disease_sex)
##    
##       F   M
##   H  67  83
##   S  20 100
chi_test <- chisq.test(table_disease_sex) # chi-square test can determine if there is a significant relationship between disease and sex
print(chi_test)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table_disease_sex
## X-squared = 22.667, df = 1, p-value = 1.926e-06

The results demonstrate a p-value less than 0.05 so the null hypothesis is rejected and it can be inferred that there is a significant association between disease and sex. Thus, men are more sick than women by a significant amount.

3. Perform an outlier check on age, resting BP, and cholesterol in the dataset using the IQR method. Briefly, any value that is outside of the lower (1st Qu. - 1.5xIQR) and higher (3rd Qu. + 1.5xIQR) fences will be considered outliers here. Write 1-2 sentences that describes your findings. (2 points)

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
variables <- c("age", "restbp", "chol")


outliers <- data_heart %>%
  select(all_of(variables)) %>% # This will select the columns that will be used 
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>% # Reshape data to long
  group_by(Variable) %>% # Group by each variable
  summarise( # Reduces the data to a single row 
    Q1 = quantile(Value, 0.25), 
    Q3 = quantile(Value, 0.75), 
    IQR = Q3 - Q1,             # Calculates IQR
    Lower_Bound = Q1 - 1.5 * IQR, # Calculates lower bound
    Upper_Bound = Q3 + 1.5 * IQR, # Calculates upper bound
    Number_of_Outliers = sum(Value < Lower_Bound | Value > Upper_Bound) # Counts the number of outliers
  )

print(outliers)
## # A tibble: 3 × 7
##   Variable    Q1    Q3   IQR Lower_Bound Upper_Bound Number_of_Outliers
##   <chr>    <dbl> <dbl> <dbl>       <dbl>       <dbl>              <int>
## 1 age         48    61    13        28.5        80.5                  0
## 2 chol       213   280    67       112.        380.                   5
## 3 restbp     120   140    20        90         170                    9

Age had 0 outliers, resting blood pressure had 9 outliers, and cholesterol had 5 outliers. These outliers represent extreme values that fall outside the calculated lower and upper bounds

4. Perform a normality check on age, resting BP, and cholesterol in the dataset using a statisical test and if needed a graphical test. Create a table that shows the variable name, p-value, and normal status (yes/no). Write 1-2 sentences that describes your findings. (2 points)

shapiro_results <- data.frame(
  Variable = c("age", "restbp", "chol"),
  p_Value = c(
    shapiro.test(data_heart$age)$p.value, # Extract p-value from each Shapiro-Wilk test for each variable
    shapiro.test(data_heart$restbp)$p.value,
    shapiro.test(data_heart$chol)$p.value
  )
)

shapiro_results$Normal_Status <- ifelse(shapiro_results$p_Value >= 0.05, "Yes", "No") 
# Adds a column to indicate whether the variable is normally distributed or not based on the p-value (greater than or equal to 0.05)

print(shapiro_results)
##   Variable      p_Value Normal_Status
## 1      age 2.765234e-02            No
## 2   restbp 3.738947e-06            No
## 3     chol 1.078634e-08            No
qqnorm(data_heart$age) # Q-Q plot for age
qqline(data_heart$age)

qqnorm(data_heart$restbp) # Q-Q plot for resting blood pressure
qqline(data_heart$restbp)

qqnorm(data_heart$chol) # Q-Q plot for cholesterol
qqline(data_heart$chol)

The Shapiro-Wilk test demonstrates that none of the variables (age, resting blood pressure, cholesterol) are normally distributed, as all of their p-values are less than 0.05.

5. Create a table that summarises the 3 variables (i.e., age, rest BP, and cholesterol) by disease status (S = sick; H = healthy) in males only. Briefly, it should be formatted accordingly: Variable, Presented As (mean/median, specific one), Value of Sick Group, Value of Healthy Group, Pvalue, Test for Significance (u-test or t-test, specific one). Write 1-2 sentences that describes your findings. (2 points)

males_data <- data_heart %>%
  filter(sex == "M") # Filters data to only include males

variables <- c("age","restbp", "chol")

summary_table <- data.frame(
  Variable = character(), # Name of variable
  Presented_As = character(), # Whether the mean or median was calculated
  Value_of_Sick_Group = numeric(),
  Value_of_Healthy_Group = numeric(),
  Pvalue = numeric(),
  Test_for_significance = character() # Name of the test performed
)

for (v in variables) { # Extracts data for sick and healthy individuals
  sick_data <- males_data[males_data$disease =="S", v]
  healthy_data <- males_data[males_data$disease == "H",v]
  
  normal_sickdata <- shapiro.test(sick_data)$p.value >= 0.05
  normal_healthydata <- shapiro.test(healthy_data)$p.value >= 0.05
  
  if(normal_sickdata & normal_healthydata) {
    test_result <- t.test(sick_data, healthy_data) # If both groups are normally distributed, then perform a T-test
    presented_as <- "Mean" 
    value_sick <- mean(sick_data) 
    value_healthy <- mean(healthy_data)
    test_name <- "T-test"
  } else {
    test_result <- wilcox.test(sick_data, healthy_data) # If either group is not normal, then perform a U-test
    presented_as <- "Median"
    value_sick <- median(sick_data)
    value_healthy <- median(healthy_data)
    test_name <- "U-test"
  }

  summary_table <- rbind( # Add the results to the table 
    summary_table,
    data.frame(
      Variable = v, # Current variable (age, resting blood pressure, cholesterol)
      Presented_As = presented_as, # Mean or Median
      Value_of_Sick_Group = value_sick,
      Value_of_Healthy_Group = value_healthy,
      Pvalue = test_result$p.value, # P-value from the test conducted
      Test_for_significance = test_name # Name of the test conducted
    )
  )
}

print (summary_table)
##   Variable Presented_As Value_of_Sick_Group Value_of_Healthy_Group       Pvalue
## 1      age         Mean               56.04               51.19277 0.0001734191
## 2   restbp       Median              129.00              130.00000 0.5214372201
## 3     chol         Mean              249.76              233.72289 0.0078357185
##   Test_for_significance
## 1                T-test
## 2                U-test
## 3                T-test

Age and cholesterol both use T-test and resting blood pressure uses U-test. The results suggest that age and cholesterol are important variables in men between the sick and healthy group, while resting blood pressure does not show a significant difference.

Good luck and have fun 🍀🍀🍀