knitr::opts_chunk$set(echo = TRUE)
library(multcomp)
## Loading required package: mvtnorm
## Loading required package: survival
## Loading required package: TH.data
## Loading required package: MASS
##
## Attaching package: 'TH.data'
## The following object is masked from 'package:MASS':
##
## geyser
library(car)
## Loading required package: carData
Previously we considered regression models for predicting a quantitative response variable from quantitative predictor variables. But we can include nominal or ordinal factors as predictors as well. When factors are included as explanatory variables, our focus usually shifts from prediction to understanding group differences, and the methodology is referred to as analysis of variance (ANOVA). ANOVA methodology is used to analyze a wide variety of experimental and quasi-experimental designs.
In the following example, fifty patients received one of five cholesterol-reducing drug regimens (trt). Three of the treatment conditions involved the same drug administered as 20 mg once per day (1time), 10mg twice per day (2times), or 5 mg four times per day (4times). The two remaining conditions (drugD and drugE) represented competing drugs. Which drug regimen produced the greatest cholesterol reduction (response)?
We are going to use box-plots to observe differences between treatments, using ggplot2.
library(multcomp)
data(cholesterol)
cholesterol
library(ggplot2)
ggplot(cholesterol, aes(x=trt, y=response)) + geom_boxplot(fill="cornflowerblue") +
labs(x = "Tratamientos", y = "Reducción Colesterol, %")
In a one-way ANOVA, we are interested in comparing the dependent variable means of two or more groups defined by a categorical grouping factor.
library(data.table)
attach(cholesterol)
# Mean and StDev by treatment using data.table
cholTr <- data.table(cholesterol)
cholStat <- cholTr[, list(Media=mean(response), StDev=sd(response)), by=list(Treatment=trt)]
cholStat
# ANOVA
fit <- aov(response ~ trt)
summary(fit)
## Df Sum Sq Mean Sq F value Pr(>F)
## trt 4 1351.4 337.8 32.43 9.82e-13 ***
## Residuals 45 468.8 10.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
detach(cholesterol)
The results of the analysis indicate that there are significant differential effects of the treatments to reduce cholesterol, this is because the probability of a real no-difference (type I error) is very low (Pr(>F) = 9.82e-13).
But, wait! To apply the procedure above, there are a couple of requirements from the data:
First will see if the response variable has a normal distribution:
library(car)
qqPlot(lm(response ~ trt, data=cholesterol),
simulate=TRUE, main="Q-Q Plot", labels=FALSE, col = "red")
## [1] 19 38
The data falls within the 95% confidence envelope, suggesting that the normality assumption has been met fairly well.
Now we will test homogeneity of variance, using the Bartlett’s test:
bartlett.test(response ~ trt, data=cholesterol)
##
## Bartlett test of homogeneity of variances
##
## data: response by trt
## Bartlett's K-squared = 0.57975, df = 4, p-value = 0.9653
Bartlett’s test indicates that the variances in the five groups do not differ significantly (p = 0.97).
The ANOVA F test for treatment tells you that the five drug regimens are not equally effective, but it does not tell you which treatments differ from one another. You can use a multiple comparison procedure to answer this question. For example, the TukeyHSD() function provides a test of all pairwise differences between group means, as shown next.
TukeyHSD(fit)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = response ~ trt)
##
## $trt
## diff lwr upr p adj
## 2times-1time 3.44300 -0.6582817 7.544282 0.1380949
## 4times-1time 6.59281 2.4915283 10.694092 0.0003542
## drugD-1time 9.57920 5.4779183 13.680482 0.0000003
## drugE-1time 15.16555 11.0642683 19.266832 0.0000000
## 4times-2times 3.14981 -0.9514717 7.251092 0.2050382
## drugD-2times 6.13620 2.0349183 10.237482 0.0009611
## drugE-2times 11.72255 7.6212683 15.823832 0.0000000
## drugD-4times 2.98639 -1.1148917 7.087672 0.2512446
## drugE-4times 8.57274 4.4714583 12.674022 0.0000037
## drugE-drugD 5.58635 1.4850683 9.687632 0.0030633
par(las=2)
par(mar=c(5,8,4,2))
plot(TukeyHSD(fit))
For example, the mean cholesterol reductions for 1-time and 2-times are not significantly different from each other (p = 0.138), whereas the difference between 1-time and 4-times is significantly different (p < .001). In this graph, confidence intervals that include 0 indicate treatments that are not significantly different (p > 0.5).
When the hypothesis testing is about two samples, the ANOVA test is equivalent to a t-test.
#datos
drogaA <- c(8.8,8.4,7.9,8.7,9.1,9.6)
drogaB <- c(9.9,9.0,11.1,9.6,8.7,10.4,9.5)
#tabla de datos - no funciona data.frame(drogaA, drogaB)
tabla <- list(Droga_A = drogaA, Droga_B = drogaB)
as.data.frame(lapply(tabla, `length<-`, max(sapply(tabla, length))))
#hipótesis alterna: droga A - drogaB diferente de 0 (menor o mayor)
pruebat <- t.test(drogaA,drogaB, var.equal = TRUE, alternative = "two.sided")
pruebat
##
## Two Sample t-test
##
## data: drogaA and drogaB
## t = -2.4765, df = 11, p-value = 0.03076
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.8752609 -0.1104534
## sample estimates:
## mean of x mean of y
## 8.750000 9.742857