Type I error = 0.05 Type 2 error = 0.1 = Power is 0.9 Pooled estimate of variance = 2 Clinical difference = 0.5 Patients = ?
First we need to find the qnorm to plug in the numbers in the formula.
qnorm(0.9)
## [1] 1.281552
# The result is 1.282
m = (((1.96+1.282)2)2(2)2)/0.5^2 m is 336.3 ( in humans we count a whole number for decimals ) -. so this is 337 Since we will have 2 arms this is 2 x 337 = 674 patients for the study ( two arms )
Suppose you can actually only enroll 300 patients per arm. Assuming the same clinical difference, Type I error rate, and standard deviation, what power would your study yield?
Type I error = 0.05 Type 2 error = ? Pooled estimate of variance = 2 Clinical difference = 0.5 Sample size = 300
We will plug in the values in the formula:
300 = (((1.96+Z1-B)^2)2 (2)2)/0.52
Z1-B is 1.10 , therefore we will do pnorm to find the area under the curve
pnorm(1.10)
## [1] 0.8643339
#This is 0.8643
The power is 86.43%
Db <- read.csv('diabetes.csv')
We will use pearson correlation
cor(Db$Age,Db$A1C, method = c('pearson'))
## [1] -0.04120664
The correlation is -0.0412
We will use Spearmans correlation
cor(Db$EducationLevel,Db$A1C, method=c('spearman'))
## [1] -0.7995521
The correlation is -0.8
First is always look to look at the data and explore the histogram and a plot to visualize.
hist(Db$A1C)
plot(Db$EducationLevel,Db$A1C)
modelo1 <- lm(Db$A1C~Db$EducationLevel)
summary(modelo1)
##
## Call:
## lm(formula = Db$A1C ~ Db$EducationLevel)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.05633 -0.17733 0.02267 0.18317 1.52267
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.77733 0.01961 345.53 <2e-16 ***
## Db$EducationLevel -0.46050 0.01519 -30.31 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3039 on 598 degrees of freedom
## Multiple R-squared: 0.6057, Adjusted R-squared: 0.6051
## F-statistic: 918.7 on 1 and 598 DF, p-value: < 2.2e-16
The intercept is 6.777 and the Slope is -0.46 ( and the p-value under 2.2e-16)
The slope is definitely significant considering the type I error of 0.05
#mean We want to make sure that the residuals means is close to zero, since is a step for diagnostics.
mean(modelo1$residuals)
## [1] 1.232959e-17
hist(modelo1$residuals)
The mean is 1.23e-17, which is almost zero, therefore, assumption i) is met.
#homoscedasticity
Then, we want to check how the data points are scattered and how the regression line will relate.
plot(modelo1)
#Diagnostic plots
The line looks close to 0 and there is not much of a dramatic difference so its center well. The Q-Q plot looks great, the dots are aligning in the center with the line. In the Residuals vs. Leverage looks like there are a couple of outliers. The residuals don’t look perfectly distributed, but I don’t think it matters that much because the overall distribution doesn’t look bad.
Something that is tricky here is that he plots look to show more caterogical rather than ordinal, so only tell us the range of these values.
cor.test(Db$EducationLevel,modelo1$residuals,method=c('spearman'))
## Warning in cor.test.default(Db$EducationLevel, modelo1$residuals, method =
## c("spearman")): Cannot compute exact p-value with ties
##
## Spearman's rank correlation rho
##
## data: Db$EducationLevel and modelo1$residuals
## S = 36225822, p-value = 0.8781
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.006275615
The residuals are not normally distributed. I have mixed feelings and I am not sure on using spearman, since educationallevel is categorical, but I am not sure what other diagnostic to use instead.
Built a model for A1C and Age
modelo2 <- lm(Db$A1C~Db$Age)
summary(modelo2)
##
## Call:
## lm(formula = Db$A1C ~ Db$Age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.52989 -0.31584 0.00875 0.37303 1.97713
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.500872 0.183548 35.418 <2e-16 ***
## Db$Age -0.002342 0.002322 -1.009 0.314
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4835 on 598 degrees of freedom
## Multiple R-squared: 0.001698, Adjusted R-squared: 2.859e-05
## F-statistic: 1.017 on 1 and 598 DF, p-value: 0.3136
Build a model for A1C and GENDER
modelo3 <- lm(Db$A1C ~ Db$GENDER)
summary(modelo3)
##
## Call:
## lm(formula = Db$A1C ~ Db$GENDER)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.54274 -0.29858 0.00142 0.35726 2.00142
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.29858 0.02577 244.444 <2e-16 ***
## Db$GENDERMale 0.04416 0.04008 1.102 0.271
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4834 on 598 degrees of freedom
## Multiple R-squared: 0.002026, Adjusted R-squared: 0.0003574
## F-statistic: 1.214 on 1 and 598 DF, p-value: 0.271
The corresponding P- values for Age and GENDER related to A1C were 0.314 and 0.271, which implies that there is not a statistical relationship between these values. Therefore, its unnecessary for make a multivariate linear regression.
First we need to parse into subgroups
E0 <- subset(Db,EducationLevel ==0)
E1 <- subset(Db,EducationLevel ==1)
E2 <- subset(Db,EducationLevel ==2)
Then we want to explore the means in each group
mean(E0$A1C)
## [1] 6.7525
mean(E1$A1C)
## [1] 6.3665
mean(E2$A1C)
## [1] 5.8315
Then, we want to observe the differences between the groups by visualization tools.
library('ggplot2')
ggplot(Db,aes(x=as.factor(EducationLevel),y=A1C)) + geom_boxplot()
The box plots helps visualize how this groups might be different. In our case we see some differences in the mean, but we are not 100% confident that this is significant, we need further testing.
To have a more complete analysis , we will run ANOVA as a diagnostics
r_aov <- aov(A1C ~ as.factor(Db$EducationLevel), data=Db)
summary(r_aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Db$EducationLevel) 2 85.56 42.78 468.8 <2e-16 ***
## Residuals 597 54.48 0.09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The P-value is close to zero, which tell us that the means are very significant.
Now, we have to test for pairwise-comparisons
TukeyHSD(r_aov)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = A1C ~ as.factor(Db$EducationLevel), data = Db)
##
## $`as.factor(Db$EducationLevel)`
## diff lwr upr p adj
## 1-0 -0.386 -0.4569747 -0.3150253 0
## 2-0 -0.921 -0.9919747 -0.8500253 0
## 2-1 -0.535 -0.6059747 -0.4640253 0
I would predict from the P-value shown in the ANOVA that this test result is significant, and the p-adjusted value for Tukey HSD is 0 which indicated that the comparisons are significantly different.
modelo4 <- lm(Db$A1C~factor((Db$EducationLevel)))
summary(modelo4)
##
## Call:
## lm(formula = Db$A1C ~ factor((Db$EducationLevel)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.0315 -0.1665 0.0335 0.1475 1.5475
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.75250 0.02136 316.13 <2e-16 ***
## factor((Db$EducationLevel))1 -0.38600 0.03021 -12.78 <2e-16 ***
## factor((Db$EducationLevel))2 -0.92100 0.03021 -30.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3021 on 597 degrees of freedom
## Multiple R-squared: 0.611, Adjusted R-squared: 0.6097
## F-statistic: 468.8 on 2 and 597 DF, p-value: < 2.2e-16
The A1C mean when EducationalLevel = 0 is 6.75250 , which is the intercept.
I see that the estimates between EducationLevel 1 and EducationLevel 2 are closer to each other than EducationLevel = 0 .
EducationLevel =1 has a -0.386 estimate, which results in substraction fof the Intercept from Educaiton level = 0 (6.75250), ending in showing a mean difference from A1C of 6.3665 ( (6.75250) + Estimate (-0.38600 )
EducationLevel =2 has a -0.921 estimate, which results in substraction fof the Intercept from Educaiton level = 0 (6.75250), ending in showing a mean difference from A1C of 5.832 ( (6.75250 ) + Estimate (-0.921) Ed