Introduction

General information

The exam consists of 2 parts, each of which asks you to analyze a different dataset. For the first part, the dataset is included in an R package, and you need to install the package to access the data. Your analysis should be done in R and your answers should be given as R code. For example, if the question is

Question 0 (Example)

  1. Draw a random sample of size 100 from N(0,1).
  2. Produce a histogram for the sample.

Your solution should be

Solution for question 0.1:

x<-rnorm(100,0,1)

Solution for question 0.2:

hist(x)

You do not need to explain your R code. For example, you do not need to write: “the function hist() was used to produce the histogram.” Your answers to the questions should be the R code that you used to produce the output.

What do you need to submit as a solution for the exam?

You need to submit the following materials:

  1. An R Markdown file that can be used to conduct the analysis.
  2. A PDF version of the solution (produced from the R Markdown file).

What do you not need to write?

You do not need to interpret the results. For example, if the question is to fit a one-way ANOVA model, you do not need to formulate the model or interpret the results. This means, for example, that you do not need to write “the p-value is 0.007 indicating a significant effect of the factor.”

How to submit the solution?

Solutions to the questions should be written in an R Markdown file (see above) and saved in the folder in which the exam questions are available.

Part 1: the Bernard data

In this part of the exam, the questions focus on the Bernard dataset, which is part of the pubh R package. To access the data, you need to install the package. More information can be found at https://www.rdocumentation.org/packages/pubh/versions/1.2.5/topics/Bernard. Use the code below to access the data.

#install.packages("pubh")
library(pubh)
data("Bernard")
names(Bernard)
## [1] "id"       "treat"    "race"     "fate"     "apache"   "o2del"    "followup"
## [8] "temp0"    "temp10"

Question 1

  1. The variable treat represents the treatment group of the subjects. How many patients were treated with Ibuprofen?
  2. Produce a 2x2 table that shows the number of patients treated with Ibuprofen or Placebo by mortality status at 30 days (alive/dead, the variable fate). Define two new R objects (p1 and p2) equal to the proportion of deaths in the Ibuprofen and Placebo groups, respectively, and calculate the difference between the two proportions.
  3. Use a barplot to visualize the distribution of the treatment groups across the levels of mortality status and produce Figure 1.1.

Solution Q1.1

table(Bernard$treat)
## 
##   Placebo Ibuprofen 
##       231       224

Solution Q1.2

table(Bernard$treat, Bernard$fate)
##            
##             Alive Dead
##   Placebo     139   92
##   Ibuprofen   140   84
# Percentage of deaths in each group (counts taken from the table above)
p1 <- 84/(140+84)*100   # Ibuprofen
p2 <- 92/(92+139)*100   # Placebo
p_diff=abs(p1-p2)
print(p_diff)
## [1] 2.32684
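
The counts typed by hand above can also be read off the table itself. The sketch below rebuilds the 2x2 table from the counts printed in Solution Q1.2 (so it runs without the pubh package) and uses prop.table() to get row-wise proportions. The object names tab and props are illustrative and not part of the exam.

```r
# Rebuild the 2x2 table from the counts shown in Solution Q1.2
tab <- matrix(c(139, 92,
                140, 84),
              nrow = 2, byrow = TRUE,
              dimnames = list(treat = c("Placebo", "Ibuprofen"),
                              fate  = c("Alive", "Dead")))

# Row-wise proportions: each row (treatment group) sums to 1
props <- prop.table(tab, margin = 1)

p1 <- props["Ibuprofen", "Dead"]  # proportion of deaths, Ibuprofen
p2 <- props["Placebo", "Dead"]    # proportion of deaths, Placebo
p_diff <- p1 - p2                 # signed difference, on the 0-1 scale
p_diff
```

With the real data, tab <- table(Bernard$treat, Bernard$fate) could take the place of the matrix. The difference is about -0.023, i.e. 2.3 percentage points, which agrees in magnitude with the value printed above.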

Solution Q1.3

ggplot(data = Bernard, mapping = aes(x=fate, fill=treat)) +
  geom_bar(position = "stack", width=0.5) +
  labs(y = "Count",
       x = "fate") +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Question 2

  1. What are the mean and standard deviation of the baseline temperature (the variable temp0) of the patients by mortality status?
  2. Use a boxplot to visualize the distribution of the baseline temperature of the patients by mortality status as shown in Figure 2.1.
  3. Produce Figure 2.2. Note that the red dots are the means for dead and alive patients.
  4. Calculate a \(95\%\) confidence interval for the mean difference of the baseline temperature of the patients using a t distribution.
  5. Sort the Bernard dataset according to the variable temp0 and print the 7 observations with the highest value of temp0.

Solution Q2.1

summary(Bernard$temp0)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   33.10   37.50   38.17   38.01   38.67   41.67
library(tidyverse)
Bernard %>% 
  group_by(fate) %>% 
  summarise(mean=mean(temp0),
            sd=sd(temp0))
##   fate   mean    sd
##   Alive  38.1 0.943
##   Dead   37.8 1.34
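
The grouped summary relies on temp0 having no missing values (the summary() output lists none); mean() and sd() return NA as soon as a group contains one. Below is a minimal base-R sketch of the same by-group calculation with na.rm = TRUE. The data frame toy and its values are made up for illustration and are not taken from Bernard.

```r
# Hypothetical toy data (not from Bernard); one temperature is missing
toy <- data.frame(
  fate  = factor(c("Alive", "Alive", "Alive", "Dead", "Dead", "Dead")),
  temp0 = c(37.0, 38.0, NA, 36.5, 37.5, 38.5)
)

# Without na.rm = TRUE the "Alive" mean and sd would be NA
group_mean <- tapply(toy$temp0, toy$fate, mean, na.rm = TRUE)
group_sd   <- tapply(toy$temp0, toy$fate, sd, na.rm = TRUE)

# Collect the results in one small table
data.frame(fate = names(group_mean),
           mean = as.numeric(group_mean),
           sd   = as.numeric(group_sd))
```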

Solution Q2.2

ggplot(Bernard, aes(y=temp0, fill=fate))+
  geom_boxplot()

Solution Q2.3

ggplot(Bernard, aes(y=temp0, x=fate, color=fate))+
  geom_jitter(width = 0.2)+
  # `fun` replaces the deprecated `fun.y` argument (ggplot2 >= 3.3.0)
  stat_summary(
    fun = "mean",
    geom = "point",
    position = position_dodge(width = 0.75), 
    color = "red",
    size = 3)

Solution Q2.4

t_test_result <- t.test(Bernard$temp0 ~ Bernard$treat)
conf_interval <- t_test_result$conf.int
cat("95% Confidence Interval for the Mean Difference:", conf_interval, "\n")
## 95% Confidence Interval for the Mean Difference: -0.136423 0.2794215
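
To make explicit what t.test() computes here, the sketch below reproduces a Welch (unequal-variance) 95% confidence interval by hand and checks it against t.test(). It uses simulated values for two hypothetical groups, g1 and g2, rather than the Bernard data, so the numbers themselves carry no meaning.

```r
set.seed(1)
# Simulated temperatures for two hypothetical groups (not the Bernard data)
g1 <- rnorm(30, mean = 38.0, sd = 1.0)
g2 <- rnorm(25, mean = 37.8, sd = 1.3)

# Squared standard errors of the two group means
v1 <- var(g1) / length(g1)
v2 <- var(g2) / length(g2)
se <- sqrt(v1 + v2)  # standard error of the difference in means

# Welch-Satterthwaite approximation to the degrees of freedom
df <- (v1 + v2)^2 / (v1^2 / (length(g1) - 1) + v2^2 / (length(g2) - 1))

# 95% confidence interval based on the t distribution
ci_manual <- (mean(g1) - mean(g2)) + c(-1, 1) * qt(0.975, df) * se

# t.test() uses the Welch test by default, so the two should agree
ci_ttest <- t.test(g1, g2)$conf.int
all.equal(ci_manual, as.numeric(ci_ttest))
```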

Solution Q2.5

Bernard %>% 
  arrange(desc(temp0)) %>% 
  head(7)   # keep the 7 rows with the highest temp0
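
Sorting and keeping the top rows can also be done in base R. As a sketch that runs without the pubh package, the example below applies the same idea to the built-in ToothGrowth data frame (used in Part 2), sorting by len instead of temp0. The object name top7 is illustrative.

```r
data("ToothGrowth")

# order() returns the row indices that sort len from largest to smallest
sorted <- ToothGrowth[order(ToothGrowth$len, decreasing = TRUE), ]

# Keep the first 7 rows of the sorted data frame
top7 <- head(sorted, 7)
top7
```

For Bernard, the same pattern would read head(Bernard[order(Bernard$temp0, decreasing = TRUE), ], 7).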

Question 3

Use the Bernard dataset to conduct the following analysis:

  1. Remove all missing values from the data.
  2. Select only the observations for which the mortality status at 30 days is “Dead”.
  3. Calculate the percentage of each race among the patients who died.
  4. Calculate the minimum and maximum apache score of the patients (the variable apache).
  5. Produce a plot visualizing the distribution of race across the levels of the treatment as shown in Figure 3.1.

Solution Q3.1

Bernard_clean <- na.omit(Bernard)

Solution Q3.2

Bernard_dead <- Bernard_clean %>% 
  filter(fate=="Dead")

Solution Q3.3

prop.table(table(Bernard_dead$race))*100
## 
##            White African American            Other 
##         76.59574         21.27660          2.12766

Solution Q3.4

min(Bernard_dead$apache)
## [1] 8
max(Bernard_dead$apache)
## [1] 34

Solution Q3.5

ggplot(data = Bernard_dead, mapping = aes(x=treat, fill=race)) +
  geom_bar(position = "dodge") +
  labs(y = "Count",
       x = "treat") +
  theme_gray()

Part 2: the ToothGrowth data

In this part we use the data ToothGrowth which is a data frame in R. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods (Supplement type): orange juice (OJ) or ascorbic acid (a form of vitamin C and coded as VC). Use the code below to access the data.

data("ToothGrowth")
names(ToothGrowth)
## [1] "len"  "supp" "dose"
head(ToothGrowth)
##    len supp dose
## 1  4.2   VC  0.5
## 2 11.5   VC  0.5
## 3  7.3   VC  0.5
## 4  5.8   VC  0.5
## 5  6.4   VC  0.5
## 6 10.0   VC  0.5

Question 4

  1. Conduct a two-sample t-test of tooth length by supplement type (VC vs. OJ) for subjects with dose > 0.5.
  2. For the new data (i.e. observations with dose > 0.5), fit a one-way ANOVA model with tooth length as the response variable and dose as a factor.
  3. Produce Figure 4.1 (note that the red point is the mean of the group).
  4. Produce Figure 4.2.

Solution Q4.1

len_supp <- ToothGrowth %>% 
  filter(dose>0.5)

t.test(len_supp$len~len_supp$supp)
## 
##  Welch Two Sample t-test
## 
## data:  len_supp$len by len_supp$supp
## t = 1.8397, df = 31.273, p-value = 0.07533
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
##  -0.3166175  6.1666175
## sample estimates:
## mean in group OJ mean in group VC 
##           24.380           21.455

Solution Q4.2

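# dose is stored as numeric; wrap it in factor() so it enters the model as a factor
model <- aov(len ~ factor(dose), data = len_supp)
summary(model)
##                 Df Sum Sq Mean Sq F value   Pr(>F)    
## factor(dose)     1  405.1   405.1   24.02 1.81e-05 ***
## Residuals       38  641.1    16.9                     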
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
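
With only two dose levels left (1 and 2), treating dose as numeric or as a factor gives the same one-degree-of-freedom F test. The two specifications diverge once there are three or more levels, as the sketch below illustrates on the full ToothGrowth data. The object names fit_num and fit_fac are illustrative.

```r
data("ToothGrowth")

# dose as a numeric covariate: a single slope, so 1 degree of freedom
fit_num <- aov(len ~ dose, data = ToothGrowth)

# dose as a factor: one mean per level, so (3 levels - 1) = 2 degrees of freedom
fit_fac <- aov(len ~ factor(dose), data = ToothGrowth)

# Degrees of freedom for the dose term and the residuals
df_num <- summary(fit_num)[[1]]$Df
df_fac <- summary(fit_fac)[[1]]$Df
df_num
df_fac
```

Only the second fit is a one-way ANOVA in the usual sense; the first is a simple linear regression of len on dose reported in ANOVA form.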

Solution Q4.3

ggplot(len_supp, aes(x=supp, y=len, fill=supp))+
  geom_violin()+
  geom_point()+
  stat_summary(
    fun = "mean",
    geom = "point",
    position = position_dodge(width = 0.75), 
    color = "red",
    size = 5)

Solution Q4.4

class(len_supp$dose)
## [1] "numeric"
len_supp$dose <- as.factor(len_supp$dose)
str(len_supp)
## 'data.frame':    40 obs. of  3 variables:
##  $ len : num  16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 ...
##  $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dose: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(len_supp, aes(y=len, fill=dose))+
  geom_boxplot()+
  facet_wrap(~supp, nrow = 2)