answer <- sum(dbinom(100:300,size=300,prob=0.35))*100
answer
## [1] 74.61005
The chance is 74.6%.
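The sum over 100:300 is the upper tail of a binomial distribution, P(X >= 100) with 300 trials and success probability 0.35. As a cross-check, the same tail can be read from the cumulative distribution function. This is a minimal sketch; the names `p_sum` and `p_tail` are introduced here for illustration only.

```r
# Direct sum of the binomial probabilities for 100, 101, ..., 300 successes
p_sum <- sum(dbinom(100:300, size = 300, prob = 0.35))

# Same upper tail from the CDF: P(X >= 100) = P(X > 99)
p_tail <- pbinom(99, size = 300, prob = 0.35, lower.tail = FALSE)

# The two approaches should agree up to floating-point error
all.equal(p_sum, p_tail)

# Expressed as a percentage, as in the answer above
p_tail * 100
```

Note that `lower.tail = FALSE` gives P(X > q), so the cutoff passed to `pbinom` is 99 rather than 100.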
HR_baseline <- c(81,70,72,62,64,60,64,65,72,71)
HR_tx <- c(75,70,67, 60, 54, 62, 63,66, 71,64)
hist(HR_baseline)
hist(HR_tx)
I would use the Wilcoxon test, since n is only 10 and, even though the data may be normally distributed, the histograms do not demonstrate it. The two measurements also appear to come from the same participants (initial and final), so they are correlated; for paired data the signed-rank form of the test (paired = TRUE) applies, whereas the call below is the unpaired rank sum form.
The null hypothesis is that there is no difference between the two heart rate measurements. The alternative hypothesis is that the two heart rate measurements are not equal.
wilcox.test(HR_baseline,HR_tx,alternative = 'two.sided')
## Warning in wilcox.test.default(HR_baseline, HR_tx, alternative = "two.sided"):
## cannot compute exact p-value with ties
##
## Wilcoxon rank sum test with continuity correction
##
## data: HR_baseline and HR_tx
## W = 62, p-value = 0.3831
## alternative hypothesis: true location shift is not equal to 0
The p-value is 0.3831, which is greater than the type I error rate of 0.05, so we fail to reject the null hypothesis.
t.test(HR_baseline,HR_tx)
##
## Welch Two Sample t-test
##
## data: HR_baseline and HR_tx
## t = 1.0546, df = 17.967, p-value = 0.3056
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.877773 8.677773
## sample estimates:
## mean of x mean of y
## 68.1 65.2
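Both calls above treat the two heart rate vectors as independent samples. If each position in the vectors is the same participant measured twice (an assumption taken from the reasoning above, not something the data alone confirm), a paired analysis could be sketched as follows. The names `hr_diff`, `paired_wilcox` and `paired_t` are introduced here for illustration only.

```r
HR_baseline <- c(81, 70, 72, 62, 64, 60, 64, 65, 72, 71)
HR_tx       <- c(75, 70, 67, 60, 54, 62, 63, 66, 71, 64)

# Within-participant differences (baseline minus treatment)
hr_diff <- HR_baseline - HR_tx
mean(hr_diff)

# Wilcoxon signed-rank test on the paired measurements.
# Ties and zero differences prevent an exact p-value, so R falls back to a
# normal approximation and warns; the warning is suppressed here.
paired_wilcox <- suppressWarnings(
  wilcox.test(HR_baseline, HR_tx, paired = TRUE, alternative = "two.sided")
)
paired_wilcox

# Paired t-test, equivalent to a one-sample t-test on hr_diff
paired_t <- t.test(HR_baseline, HR_tx, paired = TRUE)
paired_t
```

The paired tests work on the ten differences rather than on two groups of ten values, so their p-values will generally differ from the unpaired results shown above.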
EXPECTED VALUES
                 Positive Test             Negative Test
Track & Field    (49)(16)/200 = 3.92       (49)(184)/200 = 45.08
Basketball       (40)(16)/200 = 3.20       (40)(184)/200 = 36.80
Tennis           (54)(16)/200 = 4.32       (54)(184)/200 = 49.68
Football         (57)(16)/200 = 4.56       (57)(184)/200 = 52.44
We should use Fisher's exact test because 50% of the expected values (4 of 8 cells) are less than 5. The chi-square test requires that no more than 20% of the expected values be below 5.
fisher.test(matrix(c(9,1,4,2,40,39,50,55),nrow=4))
##
## Fisher's Exact Test for Count Data
##
## data: matrix(c(9, 1, 4, 2, 40, 39, 50, 55), nrow = 4)
## p-value = 0.02663
## alternative hypothesis: two.sided
The p-value is 0.02663.
Yes, the result is statistically significant since the p-value is lower than the type I error rate (0.05). Therefore, we reject the null hypothesis and conclude that the proportion of positive tests is not the same across the four sports.
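The expected counts used to choose between the chi-square and Fisher tests are (row total)(column total)/(grand total), and can be computed from the observed matrix rather than by hand. The sketch below assumes the rows of the matrix passed to `fisher.test` are in the order Track & Field, Basketball, Tennis, Football, which matches the row totals 49, 40, 54 and 57; the names `observed`, `expected` and `share_below_5` are illustrative.

```r
# Observed counts: first column positive tests, second column negative tests
observed <- matrix(
  c(9, 1, 4, 2, 40, 39, 50, 55),
  nrow = 4,
  dimnames = list(
    c("Track & Field", "Basketball", "Tennis", "Football"),
    c("Positive", "Negative")
  )
)

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
round(expected, 2)

# Proportion of cells with an expected count below 5
share_below_5 <- mean(expected < 5)
share_below_5
```

If `share_below_5` exceeds 0.20, the usual rule of thumb for the chi-square approximation is not met and Fisher's exact test is the safer choice.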
setwd("/Users/victorleon/Desktop/biostats")
exam_data <- read.csv('screen_time.csv')
• “teenager”: individuals ages 13-19
table(exam_data$Age)
##
## 8 9 10 11 12 13 14 15 16 17 18 19
## 71 54 59 56 60 45 47 37 42 37 45 47
age13_19 <- subset(exam_data, Age >= 13 & Age <= 19)
table(age13_19$Age)
##
## 13 14 15 16 17 18 19
## 45 47 37 42 37 45 47
• “tween”: individuals ages 8-12
age8_12 <- subset(exam_data, Age >= 8 & Age <= 12)
table(age8_12$Age)
##
## 8 9 10 11 12
## 71 54 59 56 60
hist(age13_19$ScreenTime)
hist(age8_12$ScreenTime)
Yes. The histograms are not perfectly symmetric, but both groups are approximately normally distributed.
H0: There is no difference in mean screen time between the 13-19 and 8-12 age groups.
H1: There is a difference in mean screen time between the 13-19 and 8-12 age groups.
Independent-samples t-test, because the two groups are made up of different individuals and are therefore not correlated. An F-test is used to compare the variances of the two groups.
t.test(age13_19$ScreenTime,age8_12$ScreenTime)
##
## Welch Two Sample t-test
##
## data: age13_19$ScreenTime and age8_12$ScreenTime
## t = 27.626, df = 563.72, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 2.470443 2.848620
## sample estimates:
## mean of x mean of y
## 7.396473 4.736941
#p-value < 0.001
var.test(age13_19$ScreenTime,age8_12$ScreenTime)
##
## F test to compare two variances
##
## data: age13_19$ScreenTime and age8_12$ScreenTime
## F = 0.60435, num df = 299, denom df = 299, p-value = 1.506e-05
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4815557 0.7584509
## sample estimates:
## ratio of variances
## 0.6043478
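#p-value < 0.001
I would reject the null hypothesis because the t-test p-value is very low (p < 2.2e-16). The F-test (p = 1.506e-05) also indicates that the two variances differ, which the Welch t-test used above does not assume to be equal.
Yes, the 95% confidence interval for the difference in means (2.47 to 2.85) makes sense: it does not include 0, which is consistent with rejecting the null hypothesis.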
The dataset “screen_time.csv” also includes a variable “Location”, where 0 indicates that the individual lives in an urban environment, 1 indicates suburban, and 2 indicates rural. Answer the following questions relating to this additional variable:
urban <- subset(exam_data, Location == 0)
subur <- subset(exam_data, Location == 1)
rural <- subset(exam_data, Location == 2)
library("ggplot2")
ggplot(exam_data,aes(x=as.factor(Location),y=ScreenTime)) + geom_boxplot()
# or
box_data <- data.frame(urban$ScreenTime,subur$ScreenTime,rural$ScreenTime)
names(box_data)[1] <- 'URBAN'
names(box_data)[2] <- 'SUBURBAN'
names(box_data)[3] <- 'RURAL'
boxplot(box_data)
test_aov <- aov(ScreenTime ~ as.factor(Location), data = exam_data)
summary(test_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Location) 2 25.6 12.777 4.086 0.0173 *
## Residuals 597 1866.7 3.127
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is 0.0173, which is below 0.05, so mean screen time is not the same across the three locations.
TukeyHSD(test_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = ScreenTime ~ as.factor(Location), data = exam_data)
##
## $`as.factor(Location)`
## diff lwr upr p adj
## 1-0 -0.2994132 -0.7148840 0.11605763 0.2085947
## 2-0 -0.5024314 -0.9179022 -0.08696062 0.0128835
## 2-1 -0.2030182 -0.6184890 0.21245255 0.4848745
Finally, answer the following questions related to the dataset “screen_time.csv”:
cor(exam_data$ScreenTime,exam_data$Age, method = c("pearson"))
## [1] 0.6788333
cor(exam_data$ScreenTime,exam_data$Location, method = c("spearman"))
## [1] -0.1193716
hist(exam_data$ScreenTime)
hist(exam_data$Age)
plot(exam_data$ScreenTime,exam_data$Age)
result <- lm(exam_data$ScreenTime ~ exam_data$Age)
summary(result)
##
## Call:
## lm(formula = exam_data$ScreenTime ~ exam_data$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.4858 -0.8821 0.0115 0.8689 4.5434
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.62509 0.20358 7.983 7.35e-15 ***
## exam_data$Age 0.34245 0.01515 22.607 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.306 on 598 degrees of freedom
## Multiple R-squared: 0.4608, Adjusted R-squared: 0.4599
## F-statistic: 511.1 on 1 and 598 DF, p-value: < 2.2e-16
The intercept is 1.62509 and the slope is 0.34245, so predicted screen time increases by about 0.34 for each additional year of age.
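To make the roles of the two coefficients concrete, the fitted line can be evaluated by hand. This is a small sketch using the estimates printed in the summary above; the helper `predict_screen_time` is a hypothetical name introduced for illustration and is not part of the original analysis.

```r
# Coefficients copied from the regression summary above
intercept <- 1.62509  # predicted ScreenTime when Age = 0
slope     <- 0.34245  # change in predicted ScreenTime per one-year increase in Age

# Fitted line: ScreenTime = intercept + slope * Age
predict_screen_time <- function(age) intercept + slope * age

# Predicted screen time for a 15-year-old
predict_screen_time(15)

# Difference between consecutive ages equals the slope
diff(predict_screen_time(c(13, 14)))
```

The intercept corresponds to Age = 0, which is outside the observed range of 8 to 19, so it is best read as an anchor for the line rather than a meaningful prediction.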
Type I error = 0.05
Clinical difference = 0.35
Patients = 400
Variance = 3
Power = ?
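The effect size is computed here as clinical difference/variance = 0.35/3 = 0.1167 (a small effect size). Note that Cohen's d, which pwr.t.test expects, divides by the standard deviation rather than the variance: 0.35/sqrt(3) ≈ 0.202.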
# via R function
library(pwr)
pwr.t.test(n=400,d=0.1167,sig.level = 0.05, type = ('two.sample'))
##
## Two-sample t test power calculation
##
## n = 400
## d = 0.1167
## sig.level = 0.05
## power = 0.3778404
## alternative = two.sided
##
## NOTE: n is number in *each* group
The power of the study is 37.78%
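Cohen's d is conventionally the mean difference divided by the standard deviation, so the value 0.1167 used above (difference divided by variance) understates the effect size. The following sketch recomputes the power with base R's `power.t.test`, assuming that the stated variance of 3 is the common within-group variance and that 400 is the number of patients per group, as in the call above. The names `delta`, `sd_common`, `d` and `res` are illustrative.

```r
delta     <- 0.35               # clinically meaningful difference
variance  <- 3                  # assumed common within-group variance
sd_common <- sqrt(variance)     # standard deviation

# Standardized effect size: difference divided by the standard deviation
d <- delta / sd_common
d

# Power of a two-sided, two-sample t-test with 400 patients per group
res <- power.t.test(
  n = 400, delta = delta, sd = sd_common,
  sig.level = 0.05, type = "two.sample", alternative = "two.sided"
)
res$power
```

The same result should come from `pwr.t.test(n = 400, d = d, sig.level = 0.05, type = "two.sample")`. If 400 is instead the total number of patients, `n` would be 200 per group.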