1. You’re preparing to run an experiment where you will enroll patients into two arms (control vs. intervention). Your outcome will be a continuous variable. How many patients would you need to enroll per arm to detect a clinical difference of 0.5? Assume a Type I error rate of 0.05, Type II error rate of 0.1, and a pooled estimate of standard deviation between the two groups of 2.

Type I error = 0.05 Type 2 error = 0.1 = Power is 0.9 Pooled estimate of variance = 2 Clinical difference = 0.5 Patients = ?

First we need to find the qnorm to plug in the numbers in the formula.

qnorm(0.9)
## [1] 1.281552
# The result is 1.282

m = (((1.96+1.282)2)2(2)2)/0.5^2 m is 336.3 ( in humans we count a whole number for decimals ) -. so this is 337 Since we will have 2 arms this is 2 x 337 = 674 patients for the study ( two arms )

Suppose you can actually only enroll 300 patients per arm. Assuming the same clinical difference, Type I error rate, and standard deviation, what power would your study yield?

Type I error = 0.05 Type 2 error = ? Pooled estimate of variance = 2 Clinical difference = 0.5 Sample size = 300

We will plug in the values in the formula:

300 = (((1.96+Z1-B)^2)2 (2)2)/0.52

Z1-B is 1.10 , therefore we will do pnorm to find the area under the curve

pnorm(1.10) 
## [1] 0.8643339
#This is 0.8643 

The power is 86.43%

  1. Load into R the dataset “diabetes.csv” and answer the following questions:
Db <- read.csv('diabetes.csv')
  1. Find the correlation between the variables “Age” and “A1C”

We will use pearson correlation

cor(Db$Age,Db$A1C, method = c('pearson'))
## [1] -0.04120664

The correlation is -0.0412

  1. Find the correlation between the variables “EducationLevel” and “A1C”

We will use Spearmans correlation

cor(Db$EducationLevel,Db$A1C, method=c('spearman'))
## [1] -0.7995521

The correlation is -0.8

  1. Fit a single linear regression model where “EducationLevel” is the independent variable and “A1C” is the dependent variable. What are the slope and intercept terms in this model? Is the p-value for the slope term significant assuming a Type I error rate of 0.05?

First is always look to look at the data and explore the histogram and a plot to visualize.

hist(Db$A1C)

plot(Db$EducationLevel,Db$A1C)

modelo1 <- lm(Db$A1C~Db$EducationLevel)
summary(modelo1)
## 
## Call:
## lm(formula = Db$A1C ~ Db$EducationLevel)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.05633 -0.17733  0.02267  0.18317  1.52267 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.77733    0.01961  345.53   <2e-16 ***
## Db$EducationLevel -0.46050    0.01519  -30.31   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3039 on 598 degrees of freedom
## Multiple R-squared:  0.6057, Adjusted R-squared:  0.6051 
## F-statistic: 918.7 on 1 and 598 DF,  p-value: < 2.2e-16

The intercept is 6.777 and the Slope is -0.46 ( and the p-value under 2.2e-16)

The slope is definitely significant considering the type I error of 0.05

  1. Check the following assumptions of the model in (c). Which assumptions are met/unmet?
  2. Mean of the residuals is zero
  1. Homoscedasticity of residuals
  2. Normally distributed residuals
  3. x variable and residuals are uncorrelated

#mean We want to make sure that the residuals means is close to zero, since is a step for diagnostics.

mean(modelo1$residuals)
## [1] 1.232959e-17
hist(modelo1$residuals)

The mean is 1.23e-17, which is almost zero, therefore, assumption i) is met.

#homoscedasticity

Then, we want to check how the data points are scattered and how the regression line will relate.

plot(modelo1)

#Diagnostic plots

The line looks close to 0 and there is not much of a dramatic difference so its center well. The Q-Q plot looks great, the dots are aligning in the center with the line. In the Residuals vs. Leverage looks like there are a couple of outliers. The residuals don’t look perfectly distributed, but I don’t think it matters that much because the overall distribution doesn’t look bad.

Something that is tricky here is that he plots look to show more caterogical rather than ordinal, so only tell us the range of these values.

cor.test(Db$EducationLevel,modelo1$residuals,method=c('spearman'))
## Warning in cor.test.default(Db$EducationLevel, modelo1$residuals, method =
## c("spearman")): Cannot compute exact p-value with ties
## 
##  Spearman's rank correlation rho
## 
## data:  Db$EducationLevel and modelo1$residuals
## S = 36225822, p-value = 0.8781
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
##          rho 
## -0.006275615

The residuals are not normally distributed. I have mixed feelings and I am not sure on using spearman, since educationallevel is categorical, but I am not sure what other diagnostic to use instead.

  1. Using again single linear regression, is there a significant relationship between “A1C” (dependent variable) and “Age” (independent variable) or “A1C” (dependent variable) and “GENDER” (independent variable)? If so, include those variables in a multivariable regression equation along with EducationLevel (independent variable).

Built a model for A1C and Age

modelo2 <- lm(Db$A1C~Db$Age)

summary(modelo2)
## 
## Call:
## lm(formula = Db$A1C ~ Db$Age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52989 -0.31584  0.00875  0.37303  1.97713 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.500872   0.183548  35.418   <2e-16 ***
## Db$Age      -0.002342   0.002322  -1.009    0.314    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4835 on 598 degrees of freedom
## Multiple R-squared:  0.001698,   Adjusted R-squared:  2.859e-05 
## F-statistic: 1.017 on 1 and 598 DF,  p-value: 0.3136

Build a model for A1C and GENDER

modelo3 <- lm(Db$A1C ~ Db$GENDER)
summary(modelo3)
## 
## Call:
## lm(formula = Db$A1C ~ Db$GENDER)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.54274 -0.29858  0.00142  0.35726  2.00142 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.29858    0.02577 244.444   <2e-16 ***
## Db$GENDERMale  0.04416    0.04008   1.102    0.271    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4834 on 598 degrees of freedom
## Multiple R-squared:  0.002026,   Adjusted R-squared:  0.0003574 
## F-statistic: 1.214 on 1 and 598 DF,  p-value: 0.271

The corresponding P- values for Age and GENDER related to A1C were 0.314 and 0.271, which implies that there is not a statistical relationship between these values. Therefore, its unnecessary for make a multivariate linear regression.

  1. Run a one-way ANOVA test where the independent factor variable is “EducationLevel” and the response variable is “A1C”. Is the test result significant, and if so, what is the interpretation of the result, and run a Tukey HSD test to look at which pairwise-comparisons are significant?

First we need to parse into subgroups

E0 <- subset(Db,EducationLevel ==0)
E1 <- subset(Db,EducationLevel ==1)
E2 <- subset(Db,EducationLevel ==2)

Then we want to explore the means in each group

mean(E0$A1C)
## [1] 6.7525
mean(E1$A1C)
## [1] 6.3665
mean(E2$A1C)
## [1] 5.8315

Then, we want to observe the differences between the groups by visualization tools.

library('ggplot2')
ggplot(Db,aes(x=as.factor(EducationLevel),y=A1C)) + geom_boxplot()

The box plots helps visualize how this groups might be different. In our case we see some differences in the mean, but we are not 100% confident that this is significant, we need further testing.

To have a more complete analysis , we will run ANOVA as a diagnostics

r_aov <- aov(A1C ~ as.factor(Db$EducationLevel), data=Db)
summary(r_aov)
##                               Df Sum Sq Mean Sq F value Pr(>F)    
## as.factor(Db$EducationLevel)   2  85.56   42.78   468.8 <2e-16 ***
## Residuals                    597  54.48    0.09                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The P-value is close to zero, which tell us that the means are very significant.

Now, we have to test for pairwise-comparisons

TukeyHSD(r_aov)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = A1C ~ as.factor(Db$EducationLevel), data = Db)
## 
## $`as.factor(Db$EducationLevel)`
##       diff        lwr        upr p adj
## 1-0 -0.386 -0.4569747 -0.3150253     0
## 2-0 -0.921 -0.9919747 -0.8500253     0
## 2-1 -0.535 -0.6059747 -0.4640253     0

I would predict from the P-value shown in the ANOVA that this test result is significant, and the p-adjusted value for Tukey HSD is 0 which indicated that the comparisons are significantly different.

  1. Re-run the single linear regression model from part c, but now treating EducationLevel as an independent factor variable. How do the coefficients in this model relate to the mean of A1C within each of the three levels in the factor variable EducationLevel?
modelo4 <- lm(Db$A1C~factor((Db$EducationLevel)))
summary(modelo4)
## 
## Call:
## lm(formula = Db$A1C ~ factor((Db$EducationLevel)))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0315 -0.1665  0.0335  0.1475  1.5475 
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   6.75250    0.02136  316.13   <2e-16 ***
## factor((Db$EducationLevel))1 -0.38600    0.03021  -12.78   <2e-16 ***
## factor((Db$EducationLevel))2 -0.92100    0.03021  -30.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3021 on 597 degrees of freedom
## Multiple R-squared:  0.611,  Adjusted R-squared:  0.6097 
## F-statistic: 468.8 on 2 and 597 DF,  p-value: < 2.2e-16

The A1C mean when EducationalLevel = 0 is 6.75250 , which is the intercept.

I see that the estimates between EducationLevel 1 and EducationLevel 2 are closer to each other than EducationLevel = 0 .

EducationLevel =1 has a -0.386 estimate, which results in substraction fof the Intercept from Educaiton level = 0 (6.75250), ending in showing a mean difference from A1C of 6.3665 ( (6.75250) + Estimate (-0.38600 )

EducationLevel =2 has a -0.921 estimate, which results in substraction fof the Intercept from Educaiton level = 0 (6.75250), ending in showing a mean difference from A1C of 5.832 ( (6.75250 ) + Estimate (-0.921) Ed