To complete this exercise, you will use a selection of data from the Duke cardiac catheterization coronary artery disease diagnostic dataset provided below (Department of Biostatistics, 2020).
| Sex | Age | Duration symptoms coronary artery disease (days) | Cholesterol level mg | Significant coronary disease | Three vessel or left main disease |
|---|---|---|---|---|---|
| F | 63 | medium | 192 | no | no |
| M | 62 | medium | 222 | yes | yes |
| M | 56 | long | 224 | yes | yes |
| F | 59 | short | 286 | no | no |
| M | 38 | medium | 275 | yes | no |
| M | 52 | medium | 204 | yes | no |
| M | 74 | long | 285 | yes | yes |
| M | 44 | short | 159 | yes | no |
| M | 65 | long | 205 | yes | yes |
| F | 63 | short | 312 | no | no |
Table 2.1.1. Selection of data from the Duke cardiac catheterization coronary artery disease diagnostic dataset (Department of Biostatistics, 2020).
Copy and paste your R code for each of the following:
6a Enter a data frame containing the above data in R. (Hint: use read.csv and import the file Duke_Cardiac_Data.csv)
Duke_Cardiac_Data <- read.csv("Duke_Cardiac_Data.csv")
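If the CSV file is not at hand, the same data frame can be typed in directly from Table 2.1.1. The following is a minimal sketch; it assumes the column names that read.csv() would derive from the table header, matching the names shown in the str() output further down.

```r
# Build the data frame by hand from Table 2.1.1.
# Assumption: column names mirror those read.csv() generates from the header row.
Duke_Cardiac_Data <- data.frame(
  Sex = c("F", "M", "M", "F", "M", "M", "M", "M", "M", "F"),
  Age = c(63L, 62L, 56L, 59L, 38L, 52L, 74L, 44L, 65L, 63L),
  Duration.symptoms.coronary.artery.disease..days. =
    c("medium", "medium", "long", "short", "medium",
      "medium", "long", "short", "long", "short"),
  Cholesterol.level.mg =
    c(192L, 222L, 224L, 286L, 275L, 204L, 285L, 159L, 205L, 312L),
  Significant.coronary.disease =
    c("no", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  Three.vessel.or.left.main.disease =
    c("no", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE  # keep text columns as character, as read.csv() does in R >= 4.0
)
```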
6b What is the sample size of the dataset?
sample_size <- nrow(Duke_Cardiac_Data)
sample_size
## [1] 10
6c How many variables are in the dataset?
sample_vars <- ncol(Duke_Cardiac_Data)
sample_vars
## [1] 6
6d List the variables that are quantitative.
# The command str() delivers the structure of the data frame. From its result we can read how many variables are quantitative (= num or int) and which ones they are:
str(Duke_Cardiac_Data)
## 'data.frame': 10 obs. of 6 variables:
## $ Sex : chr "F" "M" "M" "F" ...
## $ Age : int 63 62 56 59 38 52 74 44 65 63
## $ Duration.symptoms.coronary.artery.disease..days.: chr "medium" "medium" "long" "short" ...
## $ Cholesterol.level.mg : int 192 222 224 286 275 204 285 159 205 312
## $ Significant.coronary.disease : chr "no" "yes" "yes" "no" ...
## $ Three.vessel.or.left.main.disease : chr "no" "yes" "yes" "no" ...
# There are 2 quantitative variables: Age and Cholesterol.level.mg (both int)
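Instead of reading the types off the str() output by eye, the quantitative columns can also be picked out programmatically. This is a rough sketch on a small stand-in data frame; the name `df` and its three rows are illustrative only, not the full dataset.

```r
# Hypothetical stand-in data frame with mixed column types.
df <- data.frame(
  Sex = c("F", "M", "M"),
  Age = c(63L, 62L, 56L),
  Cholesterol.level.mg = c(192L, 222L, 224L),
  stringsAsFactors = FALSE
)

# is.numeric() is TRUE for both integer (int) and double (num) columns,
# so sapply() yields one TRUE/FALSE flag per column.
quantitative <- names(df)[sapply(df, is.numeric)]
quantitative
```

Applied to Duke_Cardiac_Data, the same expression should return the names of the int columns reported by str().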
6e List the variables that are categorical. Further, define any categorical variables as ordinal or nominal.
# Sex; nominal
# Duration.symptoms.coronary.artery.disease..days.; ordinal
# Significant.coronary.disease; nominal
# Three.vessel.or.left.main.disease; nominal
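The ordinal nature of the duration variable can be made explicit in R with an ordered factor. The sketch below assumes the natural ordering short < medium < long; this ordering is not stated in the dataset itself.

```r
# Duration values as they appear in Table 2.1.1.
duration <- c("medium", "medium", "long", "short", "medium",
              "medium", "long", "short", "long", "short")

# Assumption: the categories are ordered short < medium < long.
duration_ord <- factor(duration,
                       levels = c("short", "medium", "long"),
                       ordered = TRUE)

is.ordered(duration_ord)           # TRUE for an ordered factor
table(duration_ord)                # counts listed in level order
duration_ord[1] < duration_ord[3]  # comparisons are defined: medium < long
```

Nominal variables such as Sex would use a plain factor() without ordered = TRUE, because their categories have no rank.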
6f Compute the median, mean, standard deviation and coefficient of variation of cholesterol levels.
median_Duke <- median(Duke_Cardiac_Data$Cholesterol.level.mg)
mean_Duke <- mean(Duke_Cardiac_Data$Cholesterol.level.mg)
sd_Duke <- sd(Duke_Cardiac_Data$Cholesterol.level.mg)
CV_Duke <- sd_Duke/mean_Duke*100
median_Duke
## [1] 223
mean_Duke
## [1] 236.4
sd_Duke
## [1] 49.87362
CV_Duke
## [1] 21.09713
6g Compute the mode of the duration, significant coronary disease and three vessel or left main disease variables.
# I didn't find a workable answer of my own, so the given solution is posted here for completeness
levels<-unique(Duke_Cardiac_Data$Duration.symptoms.coronary.artery.disease..days.)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Duration.symptoms.coronary.artery.disease..days.,levels)))]
## [1] "medium"
levels<-unique(Duke_Cardiac_Data$Significant.coronary.disease)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Significant.coronary.disease,levels)))]
## [1] "yes"
levels<-unique(Duke_Cardiac_Data$Three.vessel.or.left.main.disease)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Three.vessel.or.left.main.disease,levels)))]
## [1] "no"
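Base R has no built-in function for the statistical mode, which is why a workaround is needed. A shorter alternative to the given solution can be sketched with table(); the helper name `stat_mode` is hypothetical, and on ties it returns the first category in table() order (alphabetical for character data), which may differ from the given solution.

```r
# Hypothetical helper: most frequent value of a categorical vector.
stat_mode <- function(x) {
  counts <- table(x)                # frequency of each distinct value
  names(counts)[which.max(counts)]  # label of the first maximum
}

# Columns as they appear in Table 2.1.1.
duration <- c("medium", "medium", "long", "short", "medium",
              "medium", "long", "short", "long", "short")
significant <- c("no", "yes", "yes", "no", "yes",
                 "yes", "yes", "yes", "yes", "no")
three_vessel <- c("no", "yes", "yes", "no", "no",
                  "no", "yes", "no", "yes", "no")

stat_mode(duration)
stat_mode(significant)
stat_mode(three_vessel)
```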
6h Compute the z-scores for the cholesterol level values.
# The parentheses ensure the mean is subtracted before dividing by the standard deviation
zscores <- (Duke_Cardiac_Data$Cholesterol.level.mg -
mean(Duke_Cardiac_Data$Cholesterol.level.mg)) /
sd(Duke_Cardiac_Data$Cholesterol.level.mg)
round(zscores, 2)
## [1] -0.89 -0.29 -0.25 0.99 0.77 -0.65 0.97 -1.55 -0.63 1.52
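A z-score standardizes each value as (x - mean) / sd, so the results should have mean 0 and standard deviation 1. As a sanity check, the following sketch compares the manual calculation with base R's scale(); the vector name `cholesterol` is used here only for illustration.

```r
# Cholesterol values from Table 2.1.1.
cholesterol <- c(192, 222, 224, 286, 275, 204, 285, 159, 205, 312)

# Manual z-scores: subtract the mean first, then divide by the sample sd.
z_manual <- (cholesterol - mean(cholesterol)) / sd(cholesterol)

# scale() centers and scales by default; as.numeric() drops the matrix attributes.
z_scale <- as.numeric(scale(cholesterol))

all.equal(z_manual, z_scale)  # the two approaches should agree
mean(z_manual)                # approximately 0
sd(z_manual)                  # 1
```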
6i Estimate the standard error of the mean for cholesterol level.
SE_Duke <- sd_Duke/sqrt(sample_size)
SE_Duke
## [1] 15.77142
6j Use R to produce a scatterplot using the cholesterol level and age data.
plot(Cholesterol.level.mg ~ Age, data = Duke_Cardiac_Data)