The exam consists of 2 parts, each asking you to analyze a different dataset. For the first part, the dataset is included in an R package, and you need to install the package to access the data. Your analysis should be done using R and your answers should be given as R code. For example, if the question is
Your solution should be
x<-rnorm(100,0,1)
hist(x)
You do not need to explain your R code. For example, you do not need to write: “the function hist() was used to produce the histogram.” Your answers to the questions should be the R code that you used to produce the output.
You need to submit the following materials:
You do not need to interpret the results! For example, if the question is to fit a One-Way ANOVA model, you do not need to formulate the model or interpret the results. This means, for example, that you do not need to write “the p-value is 0.007 indicating a significant effect of the factor.”
Solutions to the questions should be provided in an R Markdown file (see above) and saved in the folder in which the exam questions are available.
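As an illustration of a code-only answer, a minimal sketch for a hypothetical One-Way ANOVA question is given here. The `PlantGrowth` data frame (shipped with base R) and the object name `fit` are assumptions chosen for the example; they are not part of the exam.

```r
# Hypothetical question: fit a One-Way ANOVA of plant weight on treatment group.
# PlantGrowth ships with base R and is used here for illustration only.
data("PlantGrowth")

# Fit the model and print the ANOVA table; no interpretation is required.
fit <- aov(weight ~ group, data = PlantGrowth)
summary(fit)
```

The answer stops at the printed ANOVA table, with no model formulation and no comment on the p-value.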
In this part of the exam, the questions focus on the Bernard dataset, which is part of the pubh R package. To access the data you need to install the package. More information can be found at https://www.rdocumentation.org/packages/pubh/versions/1.2.5/topics/Bernard. Use the code below to access the data.
#install.packages("pubh")
library(pubh)
data("Bernard")
names(Bernard)
## [1] "id" "treat" "race" "fate" "apache" "o2del" "followup"
## [8] "temp0" "temp10"
table(Bernard$treat)
##
## Placebo Ibuprofen
## 231 224
table(Bernard$treat, Bernard$fate)
##
## Alive Dead
## Placebo 139 92
## Ibuprofen 140 84
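# Percentage of deaths in each treatment arm (counts taken from the table above)
p1 <- 92 / (92 + 139) * 100   # Placebo
p2 <- 84 / (140 + 84) * 100   # Ibuprofen
p_diff <- abs(p1 - p2)
print(p_diff)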
## [1] 2.32684
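The same difference can be obtained without typing the counts into the formulas by hand. The sketch below is one possible approach, assuming the counts are first stored in a matrix (here copied from the table above so that the block runs without the pubh package); the names `counts` and `row_pct` are illustrative.

```r
# Counts copied from the treat-by-fate table shown above.
counts <- matrix(c(139, 92,
                   140, 84),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(treat = c("Placebo", "Ibuprofen"),
                                 fate  = c("Alive", "Dead")))

# Row-wise percentages: share of each fate within each treatment arm.
row_pct <- prop.table(counts, margin = 1) * 100

# Absolute difference in the percentage of deaths between the two arms.
p_diff <- abs(row_pct["Placebo", "Dead"] - row_pct["Ibuprofen", "Dead"])
print(p_diff)
```

With the package loaded, `prop.table(table(Bernard$treat, Bernard$fate), margin = 1)` should give the same row proportions directly from the data.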
ggplot(data = Bernard, mapping = aes(x=fate, fill=treat)) +
geom_bar(position = "stack", width=0.5) +
labs(y = "Count",
x = "fate") +
theme_gray() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
summary(Bernard$temp0)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 33.10 37.50 38.17 38.01 38.67 41.67
library(tidyverse)
Bernard %>%
group_by(fate) %>%
summarise(mean=mean(temp0),
sd=sd(temp0))
| fate | mean | sd |
|---|---|---|
| Alive | 38.1 | 0.943 |
| Dead | 37.8 | 1.34 |
ggplot(Bernard, aes(y=temp0, fill=fate))+
geom_boxplot()
ggplot(Bernard, aes(y = temp0, x = fate, color = fate)) +
  geom_jitter(width = 0.2) +
  # `fun` replaces the `fun.y` argument, deprecated as of ggplot2 3.3.0
  stat_summary(
    fun = "mean",
    geom = "point",
    position = position_dodge(width = 0.75),
    color = "red",
    size = 3)
t_test_result <- t.test(temp0 ~ treat, data = Bernard)
conf_interval <- t_test_result$conf.int
cat("95% Confidence Interval for the Mean Difference:", conf_interval, "\n")
## 95% Confidence Interval for the Mean Difference: -0.136423 0.2794215
Bernard %>%
arrange(desc(temp0)) %>%
print(c(7))
## # A tibble: 455 × 9
## # ℹ 445 more rows
## # ℹ 9 more variables: id <dbl>, …
Use the Bernard dataset to conduct the following analysis:
Bernard_clean <- na.omit(Bernard)
Bernard_dead <- Bernard_clean %>%
filter(fate=="Dead")
prop.table(table(Bernard_dead$race))*100
##
## White African American Other
## 76.59574 21.27660 2.12766
min(Bernard_dead$apache)
## [1] 8
max(Bernard_dead$apache)
## [1] 34
ggplot(data = Bernard_dead, mapping = aes(x = treat, fill = race)) +
  geom_bar(position = "dodge") +
  # the x axis shows the treatment arm, so it is labelled accordingly
  labs(y = "Count",
       x = "treat") +
  theme_gray()
In this part we use the ToothGrowth dataset, a data frame that ships with R. The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods (supplement type): orange juice (coded as OJ) or ascorbic acid (a form of vitamin C, coded as VC). Use the code below to access the data.
data("ToothGrowth")
names(ToothGrowth)
## [1] "len" "supp" "dose"
head(ToothGrowth)
| len | supp | dose |
|---|---|---|
| 4.2 | VC | 0.5 |
| 11.5 | VC | 0.5 |
| 7.3 | VC | 0.5 |
| 5.8 | VC | 0.5 |
| 6.4 | VC | 0.5 |
| 10 | VC | 0.5 |
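
The design described above (60 animals, three dose levels, two supplement types) can be checked with a simple cross-tabulation. This is a minimal sketch using base R only; the object name `design` is illustrative.

```r
data("ToothGrowth")

# Number of animals for each combination of supplement type and dose.
design <- table(ToothGrowth$supp, ToothGrowth$dose)
design

# Total number of animals in the study.
sum(design)
```

Each of the six supplement-by-dose cells should contain the same number of animals if the design is balanced.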
len_supp <- ToothGrowth %>%
filter(dose>0.5)
t.test(len_supp$len~len_supp$supp)
##
## Welch Two Sample t-test
##
## data: len_supp$len by len_supp$supp
## t = 1.8397, df = 31.273, p-value = 0.07533
## alternative hypothesis: true difference in means between group OJ and group VC is not equal to 0
## 95 percent confidence interval:
## -0.3166175 6.1666175
## sample estimates:
## mean in group OJ mean in group VC
## 24.380 21.455
model <- aov(len ~ dose, data = len_supp)
summary(model)
## Df Sum Sq Mean Sq F value Pr(>F)
## dose 1 405.1 405.1 24.02 1.81e-05 ***
## Residuals 38 641.1 16.9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
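Because only two dose levels (1 and 2) remain after filtering, the ANOVA above compares two groups, and its F statistic should equal the square of the t statistic from a pooled-variance two-sample t-test. The sketch below checks this under that assumption; it rebuilds the subset with base R so that it runs on its own, and the names `f_value` and `t_value` are illustrative.

```r
# Same subset as above, built with base R so no extra packages are needed.
data("ToothGrowth")
len_supp <- subset(ToothGrowth, dose > 0.5)

# F statistic from the One-Way ANOVA of length on dose.
f_value <- summary(aov(len ~ dose, data = len_supp))[[1]][["F value"]][1]

# t statistic from the pooled-variance (var.equal = TRUE) two-sample t-test.
t_value <- unname(t.test(len ~ dose, data = len_supp, var.equal = TRUE)$statistic)

# The two entries should agree up to floating-point error.
c(F = f_value, t_squared = t_value^2)
```

Note that this equivalence holds for the pooled-variance test only; the Welch test that `t.test()` runs by default does not assume equal variances.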
ggplot(len_supp, aes(x = supp, y = len, fill = supp)) +
  geom_violin() +
  geom_point() +
  # `fun` replaces the `fun.y` argument, deprecated as of ggplot2 3.3.0
  stat_summary(
    fun = "mean",
    geom = "point",
    position = position_dodge(width = 0.75),
    color = "red",
    size = 5)
class(len_supp$dose)
## [1] "numeric"
len_supp$dose <- as.factor(len_supp$dose)
str(len_supp)
## 'data.frame': 40 obs. of 3 variables:
## $ len : num 16.5 16.5 15.2 17.3 22.5 17.3 13.6 14.5 18.8 15.5 ...
## $ supp: Factor w/ 2 levels "OJ","VC": 2 2 2 2 2 2 2 2 2 2 ...
## $ dose: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
ggplot(len_supp, aes(y=len, fill=dose))+
geom_boxplot()+
facet_wrap(~supp, nrow = 2)