tumorvol.csv.
(Hint: see ?read.table and see the first Lab for examples
of how to read the data in. Call this object tumorvol.
Check that the categorical variables are factor variables, convert them
if not.)> getwd()
[1] "/Users/chelseabrown/Documents/R/Biostats402/Lab 4"
> tumorval <- read.csv('tumorvol.csv')
> str(tumorval)
'data.frame': 200 obs. of 4 variables:
$ trt : chr "trt" "trt" "trt" "trt" ...
$ genotype: chr "CC" "CC" "CC" "CC" ...
$ age : num 16.3 17.1 18.7 14.3 18.6 ...
$ vol : int 14 78 222 806 1352 72 802 179 27 368 ...
vol ~ trt + genotype + trt:genotype. Call this object
fit. Use powerTransform() to estimate \(lambda\) for the Box-Cox transofrmation.
Then, use testTransform() functions to “test”” for common
lambda values -1, -1/2, -1/3, 0, 1/3, 1/2, 1. Describe what the
testTransform() is testing. Describe your findings for the
various \(\lambda\)’s. What
transformation will you recommend? Does it make biological sense? Why or
why not?The testTransform output shows us that all transformations have a p-value > .05 (we fail to reject the null hypothesis that the lamda value is equal to -1, for example), except for the lamda = .33 (p=0.7). This means we reject the null hypothesis and lamda = 0.33 which shows us the same information as the powerTransform() test, which has a Rounded Power output of .33, so we should transform our outcome variable to .33. Biologically this transformation doesn’t really make sense and will make the results of our model very difficult to interpret.
> fit <- lm(vol ~ trt + genotype + trt:genotype, data = tumorval)
> powerTransform(fit, family = "bcPower")
Estimated transformation parameter
Y1
0.3556054
> summary(powerTransform(fit, family = "bcPower")) #rounded Pwr = 0.33 for Y
bcPower Transformation to Normality
Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
Y1 0.3556 0.33 0.2454 0.4658
Likelihood ratio test that transformation parameter is equal to 0
(log transformation)
LRT df pval
LR test, lambda = (0) 51.85581 1 5.973e-13
Likelihood ratio test that no transformation is needed
LRT df pval
LR test, lambda = (1) 89.89819 1 < 2.22e-16
> ## Testing out difference power transformation
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(-1))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(-1/2))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(-1/3))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(0))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(1/3))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(1/2))
> powerTransform(fit, family = "bcPower") %>%
+ testTransform(lambda=c(1))
fit2. How
do the two models compare?The R-squared value is larger in the transformed model (0.44 vs 0.33) and the p-value is more significant as well (but both are < .0001). The model parameters and interepts are vastly different since fit2 is fit on a different scale. Particularly of interest, is the significance of trttrt:genotypeCT in fit 2 as this interaction term does not appear in the first model prior to transformation. All other variables of significance are the same between the 2 models (trt and trttrt:genotypeTT).
> # Fitting a model with transformed variable
> tumorval$vol_transformed <- (tumorval$vol)^0.33
> fit2 <- lm(vol_transformed ~ trt + genotype + trt:genotype, data = tumorval)
> summary(fit)
Call:
lm(formula = vol ~ trt + genotype + trt:genotype, data = tumorval)
Residuals:
Min 1Q Median 3Q Max
-990.25 -307.97 -73.07 272.20 1991.00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1239.25 89.09 13.911 < 2e-16 ***
trttrt -922.10 125.99 -7.319 6.47e-12 ***
genotypeCT 157.00 125.99 1.246 0.214209
genotypeTT -96.90 154.30 -0.628 0.530750
trttrt:genotypeCT 230.85 178.17 1.296 0.196635
trttrt:genotypeTT 857.15 218.22 3.928 0.000119 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 563.4 on 194 degrees of freedom
Multiple R-squared: 0.3333, Adjusted R-squared: 0.3162
F-statistic: 19.4 on 5 and 194 DF, p-value: 1.204e-15
> summary(fit2)
Call:
lm(formula = vol_transformed ~ trt + genotype + trt:genotype,
data = tumorval)
Residuals:
Min 1Q Median 3Q Max
-4.1448 -1.2732 0.0753 1.2197 5.2432
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.1000 0.3008 33.579 < 2e-16 ***
trttrt -4.0546 0.4254 -9.532 < 2e-16 ***
genotypeCT 0.5644 0.4254 1.327 0.18611
genotypeTT -0.2549 0.5210 -0.489 0.62517
trttrt:genotypeCT 1.7031 0.6016 2.831 0.00513 **
trttrt:genotypeTT 3.9325 0.7368 5.338 2.62e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.902 on 194 degrees of freedom
Multiple R-squared: 0.4401, Adjusted R-squared: 0.4257
F-statistic: 30.5 on 5 and 194 DF, p-value: < 2.2e-16
> compareCoefs(fit, fit2) #compare coefficients more efficiently
Calls:
1: lm(formula = vol ~ trt + genotype + trt:genotype, data = tumorval)
2: lm(formula = vol_transformed ~ trt + genotype + trt:genotype, data =
tumorval)
Model 1 Model 2
(Intercept) 1239.250 10.100
SE 89.086 0.301
trttrt -922.100 -4.055
SE 125.987 0.425
genotypeCT 157.000 0.564
SE 125.987 0.425
genotypeTT -96.900 -0.255
SE 154.302 0.521
trttrt:genotypeCT 230.850 1.703
SE 178.173 0.602
trttrt:genotypeTT 857.150 3.933
SE 218.216 0.737
Normal distribution of residuals is now satisfied.
> qqnorm(residuals(fit))
> qqline(residuals(fit))
> qqnorm(residuals(fit2))
> qqline(residuals(fit2))
## 2. Is the interaction between genotype and treatment significant?
Explain why. (Hint: analysis of variance…) Yes, it is statistically
significant for both models.
> Anova(fit)
> Anova(fit2)
fit3. Looking at the estimates
of parameters for genotype:age interaction, do you expect the
interaction to be significant? Why or why not?No, I don’t expect an interaction since age is shouldn’t be related to genetic expression. This is correct when looking at Anova() as age:genotype is not significant.
> fit3 <- lm(vol_transformed ~ trt + genotype + age + trt:genotype + genotype:age, data = tumorval)
> Anova(fit3)
The main effect for age and interaction effecst are not significant (all p’s > .05)
> summary(fit3)
Call:
lm(formula = vol_transformed ~ trt + genotype + age + trt:genotype +
genotype:age, data = tumorval)
Residuals:
Min 1Q Median 3Q Max
-4.2031 -1.2401 0.0082 1.2201 4.9700
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.134349 1.169146 9.523 < 2e-16 ***
trttrt -3.999295 0.430598 -9.288 < 2e-16 ***
genotypeCT 0.507342 1.675857 0.303 0.76242
genotypeTT -0.109872 1.993364 -0.055 0.95610
age -0.070256 0.076724 -0.916 0.36098
trttrt:genotypeCT 1.636813 0.606100 2.701 0.00754 **
trttrt:genotypeTT 3.919738 0.743035 5.275 3.57e-07 ***
genotypeCT:age 0.007199 0.107283 0.067 0.94657
genotypeTT:age -0.007823 0.128505 -0.061 0.95152
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.907 on 191 degrees of freedom
Multiple R-squared: 0.4463, Adjusted R-squared: 0.4231
F-statistic: 19.24 on 8 and 191 DF, p-value: < 2.2e-16
age on tumor volume
based on fit3? If not, fit a model where we can in fact
interpret the effect of age, respecting the marginality
principle.> fit4 <- lm(vol_transformed ~ trt + genotype + age + trt:genotype, data = tumorval)
> summary(fit4)
Call:
lm(formula = vol_transformed ~ trt + genotype + age + trt:genotype,
data = tumorval)
Residuals:
Min 1Q Median 3Q Max
-4.2021 -1.2375 0.0182 1.2219 4.9636
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.11622 0.75863 14.653 < 2e-16 ***
trttrt -4.00026 0.42578 -9.395 < 2e-16 ***
genotypeCT 0.61796 0.42573 1.452 0.14826
genotypeTT -0.22852 0.51979 -0.440 0.66069
age -0.06902 0.04733 -1.458 0.14636
trttrt:genotypeCT 1.63674 0.60156 2.721 0.00711 **
trttrt:genotypeTT 3.91578 0.73473 5.330 2.73e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.897 on 193 degrees of freedom
Multiple R-squared: 0.4462, Adjusted R-squared: 0.429
F-statistic: 25.92 on 6 and 193 DF, p-value: < 2.2e-16
anova() to compare the last two models. Compare
this to the results in part 4.The last model is not statistically significant (p>.05), and is not a better fit to the data than fit3. The p-value for age:genotype is the same as the p-value comparing the 2 models.
> anova(fit3, fit4)