1. (10 points) It is currently estimated that the proportion of people with SARS-COV-2 who are asymptomatic is 0.35. Suppose you randomly sample a cohort of 300 individuals who test positive for SARS-COV-2. What is the probability that at least 100 of those individuals with SARS-COV-2 are asymptomatic?
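# P(X >= 100) for X ~ Binomial(300, 0.35); summing dbinom over 100:300 gives the upper tail, and *100 converts it to a percentage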
answer <- sum(dbinom(100:300,size=300,prob=0.35))*100
answer
## [1] 74.61005

The probability is approximately 0.746, i.e., 74.6%.
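
Equivalently, the upper-tail probability can be computed directly with pbinom (output not shown):

# P(X >= 100) = 1 - P(X <= 99), where X ~ Binomial(300, 0.35)
pbinom(99, size = 300, prob = 0.35, lower.tail = FALSE)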

  1. (10 points) You are conducting a pilot study to see whether an intensive cardiovascular workout program reduces resting heart rate (unit: beats per minute) in patients. You collect data on 10 individuals at baseline and again 6 months later at the conclusion of the study. These data are below:
HR_baseline <- c(81,70,72,62,64,60,64,65,72,71)
HR_tx <- c(75,70,67, 60, 54, 62, 63,66, 71,64)
  1. Which hypothesis test should be used and why?
hist(HR_baseline)

hist(HR_tx)

I would use a Wilcoxon (rank-based) test, since n is only 10 and, even though the data may be approximately normal, the histograms do not clearly demonstrate it. Also, because the same participants are measured at baseline and again at 6 months, the two sets of measurements are correlated (paired).

The null hypothesis is that there is no difference between the baseline and post-treatment heart rates. The alternative hypothesis is that the two heart rate measurements are not equal.

  1. Use R to conduct the hypothesis test you selected in part (A). Assume a two-tailed test and Type I error rate of 0.05. What is the p-value? Should you reject or fail to reject the null hypothesis?
wilcox.test(HR_baseline,HR_tx,alternative = 'two.sided')
## Warning in wilcox.test.default(HR_baseline, HR_tx, alternative = "two.sided"):
## cannot compute exact p-value with ties
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  HR_baseline and HR_tx
## W = 62, p-value = 0.3831
## alternative hypothesis: true location shift is not equal to 0
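
Since the two measurements come from the same participants, the paired (signed-rank) form of the test could also be requested; a minimal sketch (output not shown):

# paired version, since each subject contributes a baseline and a follow-up value
wilcox.test(HR_baseline, HR_tx, paired = TRUE, alternative = "two.sided")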

The p-value is 0.3831. Since this is greater than the Type I error rate of 0.05, we fail to reject the null hypothesis.

  1. Since this is a pilot study, suppose you decide to increase your Type I error rate to 0.1. Would you reject or fail to reject the null hypothesis?

Since the p-value (0.3831) is still higher than 0.1, we would still fail to reject the null hypothesis. (For comparison, a Welch two-sample t-test on the same data, shown below, also gives p > 0.1.)

t.test(HR_baseline,HR_tx)
## 
##  Welch Two Sample t-test
## 
## data:  HR_baseline and HR_tx
## t = 1.0546, df = 17.967, p-value = 0.3056
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -2.877773  8.677773
## sample estimates:
## mean of x mean of y 
##      68.1      65.2
  1. (20 points) A university is interested in determining whether there is an association between the use of performance-enhancing drugs among athletes and collegiate sport played. The university sampled 200 student-athletes, for which the counts of positive and negative test results for performance-enhancing drugs by sport are below:

OBSERVED VALUES

                  Positive Test    Negative Test
Track & Field           9               40
Basketball              1               39
Tennis                  4               50
Football                2               55
  1. Calculate the expected cell values for the above 4 x 2 contingency table.

EXPECTED VALUES

                  Positive Test            Negative Test
Track & Field     (49)(16)/200 = 3.92      (49)(184)/200 = 45.08
Basketball        (40)(16)/200 = 3.20      (40)(184)/200 = 36.80
Tennis            (54)(16)/200 = 4.32      (54)(184)/200 = 49.68
Football          (57)(16)/200 = 4.56      (57)(184)/200 = 52.44
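
These values can be checked in R by taking the outer product of the row and column totals of the observed table; a minimal sketch using the counts above:

# observed counts: rows = sports, columns = (Positive, Negative)
obs <- matrix(c(9, 40,
                1, 39,
                4, 50,
                2, 55),
              nrow = 4, byrow = TRUE,
              dimnames = list(c("Track & Field", "Basketball", "Tennis", "Football"),
                              c("Positive", "Negative")))
# expected count for each cell = (row total x column total) / grand total
outer(rowSums(obs), colSums(obs)) / sum(obs)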

  1. Which hypothesis test should be used and why?

We should use Fisher's exact test, because 50% of the expected cell values are less than 5. To use the chi-square test, fewer than 20% of the expected values should be below 5.

  1. Use R to calculate a p-value.
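# observed counts entered column-wise: positives (9, 1, 4, 2) then negatives (40, 39, 50, 55)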
fisher.test(matrix(c(9,1,4,2,40,39,50,55),nrow=4))
## 
##  Fisher's Exact Test for Count Data
## 
## data:  matrix(c(9, 1, 4, 2, 40, 39, 50, 55), nrow = 4)
## p-value = 0.02663
## alternative hypothesis: two.sided

The p-value is 0.02663.

  1. Is there a statistical difference in testing results across the four sports?

Yes; since the p-value (0.02663) is lower than the Type I error rate of 0.05, we reject the null hypothesis and conclude that there is a statistically significant difference in testing results across the four sports.

  1. (48 points) A study analyzed the number of hours that 300 teenagers (ages 13-19) and 300 tweens (ages 8-12) spent on screens for entertainment. Answer the following questions relating to this data:
  1. Load into R the file “screen_time.csv”
setwd("/Users/victorleon/Desktop/biostats")
exam_data <- read.csv('screen_time.csv')
  1. Write code to create subsets of the data by the following groups:

• “teenager”: individuals ages 13-19

table(exam_data$Age)
## 
##  8  9 10 11 12 13 14 15 16 17 18 19 
## 71 54 59 56 60 45 47 37 42 37 45 47
age13 <- subset(exam_data,Age >=13)
age13_19 <- subset(age13,Age <= 19)
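# equivalently, in a single step: age13_19 <- subset(exam_data, Age >= 13 & Age <= 19)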


table(age13_19$Age)
## 
## 13 14 15 16 17 18 19 
## 45 47 37 42 37 45 47

• “tween”: individuals ages 8-12

age8 <- subset(exam_data,Age >=8)
age8_12 <- subset(age8,Age <= 12)

table(age8_12$Age)
## 
##  8  9 10 11 12 
## 71 54 59 56 60
  1. Make two histograms of the variable “ScreenTime” (one for teenagers and the other for tweens). Are the histograms normally distributed?
hist(age13_19$ScreenTime)

hist(age8_12$ScreenTime)

I would say yes; neither histogram is perfectly bell-shaped, but both show an approximately normal distribution.
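
Normality could also be checked more formally with a Shapiro-Wilk test (output not shown):

shapiro.test(age13_19$ScreenTime)
shapiro.test(age8_12$ScreenTime)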

  1. Conduct hypothesis testing on the variable “ScreenTime” by comparing the number of hours of screen time between teenagers and tweens:
  1. What are the null and alternative hypotheses?

H0: There is no difference in screen time between teenagers (ages 13-19) and tweens (ages 8-12).

H1: There is a difference in screen time between the two groups.

  1. Which statistical test(s) should be used, and why?

An independent two-sample t-test, because the two groups consist of different individuals (there is no pairing or correlation between groups). An F-test can also be used to compare the variances of the two groups.

  1. Perform the statistical test(s) that you selected in (ii). Assume a two-sided test and Type I error rate of 0.05. Report a p-value(s). Would you reject or fail to reject the null hypothesis?
t.test(age13_19$ScreenTime,age8_12$ScreenTime)
## 
##  Welch Two Sample t-test
## 
## data:  age13_19$ScreenTime and age8_12$ScreenTime
## t = 27.626, df = 563.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  2.470443 2.848620
## sample estimates:
## mean of x mean of y 
##  7.396473  4.736941
#p-value < 0.001
var.test(age13_19$ScreenTime,age8_12$ScreenTime)
## 
##  F test to compare two variances
## 
## data:  age13_19$ScreenTime and age8_12$ScreenTime
## F = 0.60435, num df = 299, denom df = 299, p-value = 1.506e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4815557 0.7584509
## sample estimates:
## ratio of variances 
##          0.6043478
#pvalue < 0.001

For the t-test, the p-value is far below 0.05, so I would reject the null hypothesis of equal mean screen time between teenagers and tweens. The F-test also indicates unequal variances, which supports using the Welch t-test (the R default), since it does not assume equal variances.

  1. What is the 95% confidence interval about the difference of the two means. Does the interval make sense given the p-value in part (iii)? Why or why not?

Yes, the confidence interval makes sense given the p-value: the 95% interval (2.47, 2.85) does not include 0, which is consistent with rejecting the null hypothesis of no difference at the 0.05 level.
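
The interval can be read from the Welch t-test output above, or extracted directly:

# 95% confidence interval for the difference in means (teenagers minus tweens)
t.test(age13_19$ScreenTime, age8_12$ScreenTime)$conf.int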

The dataset “screen_time.csv” also includes a variable “Location”, where 0 indicates that the individual lives in an urban environment, 1 indicates suburban, and 2 indicates rural. Answer the following questions relating to this additional variable:

  1. Make box-plots for the variable “ScreenTime” stratified by each of the three locations.
urban <- subset(exam_data, Location == 0)
subur <- subset(exam_data, Location == 1)
rural <- subset(exam_data, Location == 2)


library("ggplot2")
ggplot(exam_data,aes(x=as.factor(Location),y=ScreenTime)) + geom_boxplot()

# or
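# note: combining columns with data.frame() assumes the three location groups are the same size; the ggplot approach above does not require this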

box_data <- data.frame(urban$ScreenTime,subur$ScreenTime,rural$ScreenTime)
names(box_data)[1] <- 'URBAN'
names(box_data)[2] <- 'SUBURBAN'
names(box_data)[3] <- 'RURAL'

boxplot(box_data)

  1. Run a one-way ANOVA test where the independent factor variable is “Location” and the response variable is “ScreenTime”. What is the p-value? How would you interpret this result?
test_aov <- aov(ScreenTime ~ as.factor(Location), data = exam_data)
summary(test_aov) 
##                      Df Sum Sq Mean Sq F value Pr(>F)  
## as.factor(Location)   2   25.6  12.777   4.086 0.0173 *
## Residuals           597 1866.7   3.127                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is 0.0173, which is below 0.05, so mean screen time differs significantly across the three locations.

  1. If the result in (F) is statistically significant, run a Tukey HSD test and state which pairwise-comparisons are statistically significant.
TukeyHSD(test_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = ScreenTime ~ as.factor(Location), data = exam_data)
## 
## $`as.factor(Location)`
##           diff        lwr         upr     p adj
## 1-0 -0.2994132 -0.7148840  0.11605763 0.2085947
## 2-0 -0.5024314 -0.9179022 -0.08696062 0.0128835
## 2-1 -0.2030182 -0.6184890  0.21245255 0.4848745

The only statistically significant pairwise comparison is rural vs. urban (2-0), with an adjusted p-value of 0.0129; suburban vs. urban (1-0) and rural vs. suburban (2-1) are not significant (adjusted p-values of 0.21 and 0.48, respectively).

Finally, answer the following questions related to the dataset “screen_time.csv”:

  1. Find the correlation between the variable “ScreenTime” and “Age”
cor(exam_data$ScreenTime,exam_data$Age, method = c("pearson"))
## [1] 0.6788333
  1. Find the correlation between the variable “ScreenTime” and “Location”
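# Location is coded 0 = urban, 1 = suburban, 2 = rural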
cor(exam_data$ScreenTime,exam_data$Location, method = c("spearman"))
## [1] -0.1193716
  1. Fit a single linear regression model where “ScreenTime” is the dependent variable and “Age” is the continuous independent variable. What are the slope and intercept terms in this model? (Note: you do not need to check model assumptions for this question, but in practice when conducting your own research please remember to!)
hist(exam_data$ScreenTime)

hist(exam_data$Age)

plot(exam_data$Age, exam_data$ScreenTime)

result <- lm(exam_data$ScreenTime ~ exam_data$Age)
summary(result)
## 
## Call:
## lm(formula = exam_data$ScreenTime ~ exam_data$Age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4858 -0.8821  0.0115  0.8689  4.5434 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.62509    0.20358   7.983 7.35e-15 ***
## exam_data$Age  0.34245    0.01515  22.607  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.306 on 598 degrees of freedom
## Multiple R-squared:  0.4608, Adjusted R-squared:  0.4599 
## F-statistic: 511.1 on 1 and 598 DF,  p-value: < 2.2e-16

The intercept is 1.62509 and the slope (the coefficient for Age) is 0.34245; that is, each additional year of age is associated with an estimated 0.34 additional hours of screen time.
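
As a quick check on the coefficients, the fitted line can be evaluated at an illustrative age (15 years here):

# predicted screen time (hours) for a 15-year-old, using the coefficients above
1.62509 + 0.34245 * 15   # about 6.8 hours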

  1. (12 pts) You’re preparing to run a two-arm study (control vs. intervention), but your budget will only allow you to enroll 400 patients per arm. Assuming a continuous variable outcome, a detectable clinical difference of 0.35, Type I error rate of 0.05, and pooled estimate of variance between the two groups of 3, what power would your study yield?

Given: Type I error rate = 0.05; clinical difference = 0.35; n = 400 patients per arm; pooled variance = 3; power = ?

Effect size used below: d = clinical difference / variance = 0.35/3 ≈ 0.1167, a small effect size (see the note at the end regarding the conventional definition of Cohen's d).

# via R function

library(pwr)
pwr.t.test(n = 400, d = 0.1167, sig.level = 0.05, type = "two.sample")
## 
##      Two-sample t test power calculation 
## 
##               n = 400
##               d = 0.1167
##       sig.level = 0.05
##           power = 0.3778404
##     alternative = two.sided
## 
## NOTE: n is number in *each* group

With d = 0.1167, the power of the study is approximately 37.8%.
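
Note that Cohen's d is conventionally defined as the mean difference divided by the pooled standard deviation (the square root of the variance) rather than the variance itself; with SD = sqrt(3) ≈ 1.73, d ≈ 0.35/1.73 ≈ 0.20, which would yield a higher power. A sketch of that alternative calculation (output not shown):

library(pwr)
# d based on the pooled SD rather than the variance
d_alt <- 0.35 / sqrt(3)
pwr.t.test(n = 400, d = d_alt, sig.level = 0.05, type = "two.sample")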