MoDS Week 2 Exercise 6

To complete this exercise, you will use a selection of data from the Duke cardiac catheterization coronary artery disease diagnostic dataset provided below (Department of Biostatistics, 2020).

Sex | Age | Duration symptoms coronary artery disease (days) | Cholesterol level mg | Significant coronary disease | Three vessel or left main disease
F   | 63  | medium | 192 | no  | no
M   | 62  | medium | 222 | yes | yes
M   | 56  | long   | 224 | yes | yes
F   | 59  | short  | 286 | no  | no
M   | 38  | medium | 275 | yes | no
M   | 52  | medium | 204 | yes | no
M   | 74  | long   | 285 | yes | yes
M   | 44  | short  | 159 | yes | no
M   | 65  | long   | 205 | yes | yes
F   | 63  | short  | 312 | no  | no

Table 2.1.1. Selection of data from the Duke cardiac catheterization coronary artery disease diagnostic dataset (Department of Biostatistics, 2020).

Copy and paste your R code for each of the following:

6a Enter a data frame containing the above data in R. (Hint: use read.csv and import the file Duke_Cardiac_Data.csv)

Duke_Cardiac_Data <- read.csv("Duke_Cardiac_Data.csv")
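
If the CSV file is not at hand, the same data frame can be typed in directly from Table 2.1.1. The sketch below is one way to do that. The column names are chosen to match what read.csv produces (see the str() output in 6d), and the object name Duke_Cardiac_Typed is made up for illustration.

```r
# Build the data frame by hand from Table 2.1.1 (no CSV file needed).
# Duke_Cardiac_Typed is an illustrative name, not part of the exercise.
Duke_Cardiac_Typed <- data.frame(
  Sex = c("F", "M", "M", "F", "M", "M", "M", "M", "M", "F"),
  Age = c(63L, 62L, 56L, 59L, 38L, 52L, 74L, 44L, 65L, 63L),
  Duration.symptoms.coronary.artery.disease..days. =
    c("medium", "medium", "long", "short", "medium",
      "medium", "long", "short", "long", "short"),
  Cholesterol.level.mg =
    c(192L, 222L, 224L, 286L, 275L, 204L, 285L, 159L, 205L, 312L),
  Significant.coronary.disease =
    c("no", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"),
  Three.vessel.or.left.main.disease =
    c("no", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no"),
  stringsAsFactors = FALSE  # keep text columns as chr, as read.csv does in R >= 4.0
)
str(Duke_Cardiac_Typed)
```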

6b What is the sample size of the dataset?

sample_size <- nrow(Duke_Cardiac_Data)
sample_size
## [1] 10

6c How many variables are in the dataset?

sample_vars <- ncol(Duke_Cardiac_Data)
sample_vars
## [1] 6

6d List the variables that are quantitative.

# str() shows the structure of the data frame. Columns of type num or int are quantitative, so the output tells us which variables qualify:
str(Duke_Cardiac_Data)
## 'data.frame':    10 obs. of  6 variables:
##  $ Sex                                             : chr  "F" "M" "M" "F" ...
##  $ Age                                             : int  63 62 56 59 38 52 74 44 65 63
##  $ Duration.symptoms.coronary.artery.disease..days.: chr  "medium" "medium" "long" "short" ...
##  $ Cholesterol.level.mg                            : int  192 222 224 286 275 204 285 159 205 312
##  $ Significant.coronary.disease                    : chr  "no" "yes" "yes" "no" ...
##  $ Three.vessel.or.left.main.disease               : chr  "no" "yes" "yes" "no" ...
# There are 2 quantitative variables: Age and Cholesterol.level.mg (both int)
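
Reading the types off str() works for six columns, but the names can also be pulled out programmatically. A minimal sketch, using a small stand-in data frame (df, made up for illustration) with the same column types as the exercise data:

```r
# Stand-in data frame with a chr column and two int columns.
df <- data.frame(
  Sex = c("F", "M", "M"),
  Age = c(63L, 62L, 56L),
  Cholesterol.level.mg = c(192L, 222L, 224L),
  stringsAsFactors = FALSE
)
# sapply() runs is.numeric() on every column; TRUE marks int and num columns.
quantitative_vars <- names(df)[sapply(df, is.numeric)]
quantitative_vars
## [1] "Age"                  "Cholesterol.level.mg"
```

The same line should work on Duke_Cardiac_Data by replacing df with that object.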

6e List the variables that are categorical. Further, define any categorical variables as ordinal or nominal.

# Sex; nominal
# Duration.symptoms.coronary.artery.disease..days.; ordinal
# Significant.coronary.disease; nominal
# Three.vessel.or.left.main.disease; nominal
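
Because the duration variable is ordinal, R can be told about the ordering by storing it as an ordered factor. The sketch below assumes the natural order short < medium < long; the vector is copied from Table 2.1.1 and the names duration and duration_ord are illustrative.

```r
# Duration values in table order.
duration <- c("medium", "medium", "long", "short", "medium",
              "medium", "long", "short", "long", "short")
# ordered = TRUE records that short < medium < long.
duration_ord <- factor(duration,
                       levels = c("short", "medium", "long"),
                       ordered = TRUE)
table(duration_ord)                # counts are listed in level order
duration_ord[1] < duration_ord[3]  # comparisons work: medium < long
```

A nominal variable such as Sex would use factor() without ordered = TRUE, since its levels have no meaningful order.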

6f Compute the median, mean, standard deviation and coefficient of variation of cholesterol levels.

median_Duke <- median(Duke_Cardiac_Data$Cholesterol.level.mg)
mean_Duke <- mean(Duke_Cardiac_Data$Cholesterol.level.mg)
sd_Duke <- sd(Duke_Cardiac_Data$Cholesterol.level.mg)
CV_Duke <- sd_Duke/mean_Duke*100
median_Duke
## [1] 223
mean_Duke
## [1] 236.4
sd_Duke
## [1] 49.87362
CV_Duke
## [1] 21.09713

6g Compute the mode of the duration, significant coronary disease and three vessel or left main disease variables.

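# I could not find a working solution of my own, so the provided solution is reproduced here for completeness.
# unique() lists the distinct values, match() maps each observation to its position in that list,
# tabulate() counts how often each position occurs, and which.max() picks the most frequent one.
levels<-unique(Duke_Cardiac_Data$Duration.symptoms.coronary.artery.disease..days.)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Duration.symptoms.coronary.artery.disease..days.,levels)))]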
## [1] "medium"
levels<-unique(Duke_Cardiac_Data$Significant.coronary.disease)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Significant.coronary.disease,levels)))]
## [1] "yes"
levels<-unique(Duke_Cardiac_Data$Three.vessel.or.left.main.disease)
levels[which.max(tabulate(match(Duke_Cardiac_Data$Three.vessel.or.left.main.disease,levels)))]
## [1] "no"
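
Base R has no built-in function for the statistical mode (mode() returns the storage type instead), so a small helper built on table() is a common alternative. The sketch below is one possible version; stat_mode is a made-up name and the vectors are copied from Table 2.1.1.

```r
# Hypothetical helper: returns the most frequent value of x as a string.
# On ties, which.max() returns the first maximum, and table() sorts its
# categories alphabetically, so the alphabetically first tied value wins.
stat_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

duration <- c("medium", "medium", "long", "short", "medium",
              "medium", "long", "short", "long", "short")
significant <- c("no", "yes", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no")
three_vessel <- c("no", "yes", "yes", "no", "no", "no", "yes", "no", "yes", "no")

stat_mode(duration)
stat_mode(significant)
stat_mode(three_vessel)
```

Note the tie-breaking differs from the unique()/match() approach, which returns the tied value that appears first in the data.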

6h Compute the z-scores for the cholesterol level values.

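# The subtraction needs its own parentheses: without them, R divides the mean by the
# standard deviation first and subtracts that single number from every value.
zscores <- (Duke_Cardiac_Data$Cholesterol.level.mg -
              mean(Duke_Cardiac_Data$Cholesterol.level.mg)) /
  sd(Duke_Cardiac_Data$Cholesterol.level.mg)
round(zscores, 2)
##  [1] -0.89 -0.29 -0.25  0.99  0.77 -0.65  0.97 -1.55 -0.63  1.52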
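
A z-score subtracts the mean first and then divides by the standard deviation, so z = (x - mean) / sd. As a cross-check, base R's scale() performs the same centering and scaling; the sketch below compares the two on the cholesterol values from Table 2.1.1 (the variable names are illustrative).

```r
cholesterol <- c(192, 222, 224, 286, 275, 204, 285, 159, 205, 312)

# Manual z-scores: centre on the mean, then divide by the sample sd.
z_manual <- (cholesterol - mean(cholesterol)) / sd(cholesterol)

# scale() returns a one-column matrix; as.vector() drops it to a plain vector.
z_scale <- as.vector(scale(cholesterol))

all.equal(z_manual, z_scale)  # TRUE when both methods agree
mean(z_manual)                # should be 0 up to floating-point error
sd(z_manual)                  # should be 1
```

A quick sanity check for any set of z-scores is that they average to zero and have a standard deviation of one.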

6i Estimate the standard error of the mean for cholesterol level.

SE_Duke <- sd_Duke/sqrt(sample_size)
SE_Duke
## [1] 15.77142

6j Use R to produce a scatterplot using the cholesterol level and age data.

plot(Cholesterol.level.mg ~ Age, data = Duke_Cardiac_Data)