Question 1: Transformations

1. Read in the tumor volume data from tumorvol.csv. (Hint: see ?read.table and see the first Lab for examples of how to read the data in. Call this object tumorvol. Check that the categorical variables are factor variables, convert them if not.)

> getwd()
   [1] "/Users/chelseabrown/Documents/R/Biostats402/Lab 4"
> tumorval <- read.csv('tumorvol.csv')
> str(tumorval)
   'data.frame':    200 obs. of  4 variables:
    $ trt     : chr  "trt" "trt" "trt" "trt" ...
    $ genotype: chr  "CC" "CC" "CC" "CC" ...
    $ age     : num  16.3 17.1 18.7 14.3 18.6 ...
    $ vol     : int  14 78 222 806 1352 72 802 179 27 368 ...

2. Fit a regression model vol ~ trt + genotype + trt:genotype. Call this object fit. Use powerTransform() to estimate \(lambda\) for the Box-Cox transofrmation. Then, use testTransform() functions to “test”” for common lambda values -1, -1/2, -1/3, 0, 1/3, 1/2, 1. Describe what the testTransform() is testing. Describe your findings for the various \(\lambda\)’s. What transformation will you recommend? Does it make biological sense? Why or why not?

The testTransform output shows us that all transformations have a p-value > .05 (we fail to reject the null hypothesis that the lamda value is equal to -1, for example), except for the lamda = .33 (p=0.7). This means we reject the null hypothesis and lamda = 0.33 which shows us the same information as the powerTransform() test, which has a Rounded Power output of .33, so we should transform our outcome variable to .33. Biologically this transformation doesn’t really make sense and will make the results of our model very difficult to interpret.

> fit <- lm(vol ~ trt + genotype + trt:genotype, data = tumorval)
> powerTransform(fit, family = "bcPower")
   Estimated transformation parameter 
          Y1 
   0.3556054
> summary(powerTransform(fit, family = "bcPower")) #rounded Pwr = 0.33 for Y
   bcPower Transformation to Normality 
      Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
   Y1    0.3556        0.33       0.2454       0.4658
   
   Likelihood ratio test that transformation parameter is equal to 0
    (log transformation)
                              LRT df      pval
   LR test, lambda = (0) 51.85581  1 5.973e-13
   
   Likelihood ratio test that no transformation is needed
                              LRT df       pval
   LR test, lambda = (1) 89.89819  1 < 2.22e-16
> ## Testing out difference power transformation
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(-1))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(-1/2))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(-1/3))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(0))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(1/3))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(1/2))
> powerTransform(fit, family = "bcPower") %>%
+   testTransform(lambda=c(1))

Question 2: F-tests

1. Using the tumorvol data, refit the model with the transformation as you determined in Question 1. Call this object fit2. How do the two models compare?

The R-squared value is larger in the transformed model (0.44 vs 0.33) and the p-value is more significant as well (but both are < .0001). The model parameters and interepts are vastly different since fit2 is fit on a different scale. Particularly of interest, is the significance of trttrt:genotypeCT in fit 2 as this interaction term does not appear in the first model prior to transformation. All other variables of significance are the same between the 2 models (trt and trttrt:genotypeTT).

>  # Fitting a model with transformed variable
> tumorval$vol_transformed <- (tumorval$vol)^0.33
> fit2 <- lm(vol_transformed ~ trt + genotype + trt:genotype, data = tumorval)
> summary(fit)
   
   Call:
   lm(formula = vol ~ trt + genotype + trt:genotype, data = tumorval)
   
   Residuals:
       Min      1Q  Median      3Q     Max 
   -990.25 -307.97  -73.07  272.20 1991.00 
   
   Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
   (Intercept)        1239.25      89.09  13.911  < 2e-16 ***
   trttrt             -922.10     125.99  -7.319 6.47e-12 ***
   genotypeCT          157.00     125.99   1.246 0.214209    
   genotypeTT          -96.90     154.30  -0.628 0.530750    
   trttrt:genotypeCT   230.85     178.17   1.296 0.196635    
   trttrt:genotypeTT   857.15     218.22   3.928 0.000119 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Residual standard error: 563.4 on 194 degrees of freedom
   Multiple R-squared:  0.3333, Adjusted R-squared:  0.3162 
   F-statistic:  19.4 on 5 and 194 DF,  p-value: 1.204e-15
> summary(fit2)
   
   Call:
   lm(formula = vol_transformed ~ trt + genotype + trt:genotype, 
       data = tumorval)
   
   Residuals:
       Min      1Q  Median      3Q     Max 
   -4.1448 -1.2732  0.0753  1.2197  5.2432 
   
   Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
   (Intercept)        10.1000     0.3008  33.579  < 2e-16 ***
   trttrt             -4.0546     0.4254  -9.532  < 2e-16 ***
   genotypeCT          0.5644     0.4254   1.327  0.18611    
   genotypeTT         -0.2549     0.5210  -0.489  0.62517    
   trttrt:genotypeCT   1.7031     0.6016   2.831  0.00513 ** 
   trttrt:genotypeTT   3.9325     0.7368   5.338 2.62e-07 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Residual standard error: 1.902 on 194 degrees of freedom
   Multiple R-squared:  0.4401, Adjusted R-squared:  0.4257 
   F-statistic:  30.5 on 5 and 194 DF,  p-value: < 2.2e-16
> compareCoefs(fit, fit2) #compare coefficients more efficiently 
   Calls:
   1: lm(formula = vol ~ trt + genotype + trt:genotype, data = tumorval)
   2: lm(formula = vol_transformed ~ trt + genotype + trt:genotype, data = 
     tumorval)
   
                      Model 1  Model 2
   (Intercept)       1239.250   10.100
   SE                  89.086    0.301
                                      
   trttrt            -922.100   -4.055
   SE                 125.987    0.425
                                      
   genotypeCT         157.000    0.564
   SE                 125.987    0.425
                                      
   genotypeTT         -96.900   -0.255
   SE                 154.302    0.521
                                      
   trttrt:genotypeCT  230.850    1.703
   SE                 178.173    0.602
                                      
   trttrt:genotypeTT  857.150    3.933
   SE                 218.216    0.737
   

2. Look at the QQ-plot of the residuals from the two models (with vol, and transformed volume). Did the Box-Cox transformation improve model fit? What model assumption is now satisfied?

Normal distribution of residuals is now satisfied.

> qqnorm(residuals(fit))
> qqline(residuals(fit))

> qqnorm(residuals(fit2))
> qqline(residuals(fit2))

## 2. Is the interaction between genotype and treatment significant? Explain why. (Hint: analysis of variance…) Yes, it is statistically significant for both models.

> Anova(fit)
> Anova(fit2)

3. Now add age by genotype interaction, respecting the marginality principle. Call this object fit3. Looking at the estimates of parameters for genotype:age interaction, do you expect the interaction to be significant? Why or why not?

No, I don’t expect an interaction since age is shouldn’t be related to genetic expression. This is correct when looking at Anova() as age:genotype is not significant.

> fit3 <- lm(vol_transformed ~ trt + genotype + age + trt:genotype + genotype:age, data = tumorval)
> Anova(fit3)

4. Test whether in fact the the interaction is significant.

The main effect for age and interaction effecst are not significant (all p’s > .05)

> summary(fit3)
   
   Call:
   lm(formula = vol_transformed ~ trt + genotype + age + trt:genotype + 
       genotype:age, data = tumorval)
   
   Residuals:
       Min      1Q  Median      3Q     Max 
   -4.2031 -1.2401  0.0082  1.2201  4.9700 
   
   Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       11.134349   1.169146   9.523  < 2e-16 ***
   trttrt            -3.999295   0.430598  -9.288  < 2e-16 ***
   genotypeCT         0.507342   1.675857   0.303  0.76242    
   genotypeTT        -0.109872   1.993364  -0.055  0.95610    
   age               -0.070256   0.076724  -0.916  0.36098    
   trttrt:genotypeCT  1.636813   0.606100   2.701  0.00754 ** 
   trttrt:genotypeTT  3.919738   0.743035   5.275 3.57e-07 ***
   genotypeCT:age     0.007199   0.107283   0.067  0.94657    
   genotypeTT:age    -0.007823   0.128505  -0.061  0.95152    
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Residual standard error: 1.907 on 191 degrees of freedom
   Multiple R-squared:  0.4463, Adjusted R-squared:  0.4231 
   F-statistic: 19.24 on 8 and 191 DF,  p-value: < 2.2e-16

5. Can we interpret the effect of age on tumor volume based on fit3? If not, fit a model where we can in fact interpret the effect of age, respecting the marginality principle.

> fit4 <- lm(vol_transformed ~ trt + genotype + age + trt:genotype, data = tumorval)
> summary(fit4)
   
   Call:
   lm(formula = vol_transformed ~ trt + genotype + age + trt:genotype, 
       data = tumorval)
   
   Residuals:
       Min      1Q  Median      3Q     Max 
   -4.2021 -1.2375  0.0182  1.2219  4.9636 
   
   Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       11.11622    0.75863  14.653  < 2e-16 ***
   trttrt            -4.00026    0.42578  -9.395  < 2e-16 ***
   genotypeCT         0.61796    0.42573   1.452  0.14826    
   genotypeTT        -0.22852    0.51979  -0.440  0.66069    
   age               -0.06902    0.04733  -1.458  0.14636    
   trttrt:genotypeCT  1.63674    0.60156   2.721  0.00711 ** 
   trttrt:genotypeTT  3.91578    0.73473   5.330 2.73e-07 ***
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
   
   Residual standard error: 1.897 on 193 degrees of freedom
   Multiple R-squared:  0.4462, Adjusted R-squared:  0.429 
   F-statistic: 25.92 on 6 and 193 DF,  p-value: < 2.2e-16

6. Use anova() to compare the last two models. Compare this to the results in part 4.

The last model is not statistically significant (p>.05), and is not a better fit to the data than fit3. The p-value for age:genotype is the same as the p-value comparing the 2 models.

> anova(fit3, fit4)