El análisis de varianza es similar al análisis de regresión y en realidad los dos pertenecen a la gran familia de los modelos lineales. Los modelos lineales se caracterizan por investigar la relación entre una variable respuesta cuantitativa y una o más variables exploratorias. Sin embargo, el análisis de varianza difiere del análisis de regresión en que en el ANOVA las variables exploratorias son cualitativas o factores. Vamos a llamar factor a una variable cualitativa que usaremos para designar a los grupos o tratamientos a comparar. Los niveles del factor serán el número de tratamientos o grupos.
Cargamops la base de datos “etchrate”"
read.table("etchrate.txt", header=T)->etchrate
head(etchrate) #muestra el encabezado
## RF run rate
## 1 160 1 575
## 2 160 2 542
## 3 160 3 530
## 4 160 4 539
## 5 160 5 570
## 6 180 1 565
str(etchrate) #observamos la caracteristica de cada variable. Note que no hay variables categóricas o factores
## 'data.frame': 20 obs. of 3 variables:
## $ RF : int 160 160 160 160 160 180 180 180 180 180 ...
## $ run : int 1 2 3 4 5 1 2 3 4 5 ...
## $ rate: int 575 542 530 539 570 565 593 590 579 610 ...
Análisis exploratorios
attach(etchrate)
grp.means<-tapply(rate,RF,mean)
grp.means
## 160 180 200 220
## 551.2 587.4 625.4 707.0
grp.sd<-tapply(rate,RF,sd)
grp.sd
## 160 180 200 220
## 20.01749 16.74216 20.52559 15.24795
grp.length<-tapply(rate,RF,length)
grp.length
## 160 180 200 220
## 5 5 5 5
grp.var<-tapply(rate,RF,var)
grp.var
## 160 180 200 220
## 400.7 280.3 421.3 232.5
# Convertimos cada variable a factor
etchrate$RF <- as.factor(etchrate$RF)
etchrate$run <- as.factor(etchrate$run)
str(etchrate) #Note que ahora nuestras variables de "RF" y "run" son factores
## 'data.frame': 20 obs. of 3 variables:
## $ RF : Factor w/ 4 levels "160","180","200",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ run : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
## $ rate: int 575 542 530 539 570 565 593 590 579 610 ...
# Corremos modelo del anova
etch.rate.aov <- aov(rate~RF,data=etchrate)
summary(etch.rate.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 3 66871 22290 66.8 2.88e-09 ***
## Residuals 16 5339 334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Notamos que existe una diferencia signifcativa en el modelo. Que se interpreta como: Almenos una de las medias es diferente en los tratamientos datos en RF
#Exploramos visualmente
boxplot(rate~RF, col=456,xlab="RF Power (W)", ylab=expression(paste("Observed Etch Rate (",ring(A),"/min)")))
# Media global
(erate.mean <- mean(etchrate$rate))
## [1] 617.75
# desviación estándar global
(erate.sd <- sd(etchrate$rate))
## [1] 61.6483
#Calcular el efecto del tratamiento
model.tables(etch.rate.aov)
## Tables of effects
##
## RF
## RF
## 160 180 200 220
## -66.55 -30.35 7.65 89.25
etch.rate.lm <- lm(rate~RF,etchrate)
par(mfrow=c(2,2))
plot(etch.rate.lm)
#Normalidad
residuals(etch.rate.lm)->residuos
shapiro.test(residuos)
##
## Shapiro-Wilk normality test
##
## data: residuos
## W = 0.93752, p-value = 0.2152
#Si mi interés es analizar cada grupo de forma independiente
shapiro.test(etchrate$rate[etchrate$RF==160])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 160]
## W = 0.87227, p-value = 0.2758
shapiro.test(etchrate$rate[etchrate$RF==180])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 180]
## W = 0.9883, p-value = 0.9734
shapiro.test(etchrate$rate[etchrate$RF==200])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 200]
## W = 0.96916, p-value = 0.8699
shapiro.test(etchrate$rate[etchrate$RF==220])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 220]
## W = 0.98108, p-value = 0.9403
#Homogenidad de Varianzas
library(car)
ncvTest(etch.rate.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2105756 Df = 1 p = 0.6463166
bartlett.test(rate~RF, etchrate)
##
## Bartlett test of homogeneity of variances
##
## data: rate by RF
## Bartlett's K-squared = 0.43349, df = 3, p-value = 0.9332
levene.test(rate~RF,etchrate)
## Warning: 'levene.test' is deprecated.
## Use 'leveneTest' instead.
## See help("Deprecated") and help("car-deprecated").
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.1959 0.8977
## 16
#ansari.test(rate~RF,etchrate), solo funciona para cuando hay únicamente dos grupos
fligner.test(rate~RF,etchrate)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: rate by RF
## Fligner-Killeen:med chi-squared = 1.0438, df = 3, p-value = 0.7907
Bartlett’s test - If the data is normally distributed, this is the best test to use. It is sensitive to data which is not non-normally distribution; it is more likely to return a “false positive” when the data is non-normal.
.Levene’s test - this is more robust to departures from normality than Bartlett’s test. It is in the car package.
.Fligner-Killeen test - this is a non-parametric test which is very robust against departures from normality.
library(Rcmdr)
## Loading required package: splines
## Loading required package: RcmdrMisc
## Loading required package: sandwich
## La interfaz R-Commander sólo funciona en sesiones interactivas
plotMeans(rate,factor(RF),error.bars=c("sd"))
TukeyHSD(etch.rate.aov)->tkn
plot(tkn)
tukey<-TukeyHSD(etch.rate.aov, conf.level=.95)
tukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = rate ~ RF, data = etchrate)
##
## $RF
## diff lwr upr p adj
## 180-160 36.2 3.145624 69.25438 0.0294279
## 200-160 74.2 41.145624 107.25438 0.0000455
## 220-160 155.8 122.745624 188.85438 0.0000000
## 200-180 38.0 4.945624 71.05438 0.0215995
## 220-180 119.6 86.545624 152.65438 0.0000001
## 220-200 81.6 48.545624 114.65438 0.0000146
plot(tukey, sub="Test de diferencias significativas de Tukey" )
otros Test, Post Hoc (cuando se quiere comprobar todas las comparaciones por pares de las posibles diferencias)
pairwise.t.test(etchrate$rate,etchrate$RF,p.adjust.method="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: etchrate$rate and etchrate$RF
##
## 160 180 200
## 180 0.038 - -
## 200 5.1e-05 0.028 -
## 220 2.2e-09 1.0e-07 1.6e-05
##
## P value adjustment method: bonferroni
pairwise.t.test(etchrate$rate,etchrate$RF,p.adjust.method="hommel")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: etchrate$rate and etchrate$RF
##
## 160 180 200
## 180 0.0064 - -
## 200 2.5e-05 0.0064 -
## 220 2.2e-09 8.5e-08 1.1e-05
##
## P value adjustment method: hommel
Utilice la base de datos del paquete “nlme”, denominado “Alfalfa”
library(nlme)
head(Alfalfa,15)
## Grouped Data: Yield ~ Date | Block/Variety
## Variety Date Block Yield
## 1 Ladak None 1 2.17
## 2 Ladak S1 1 1.58
## 3 Ladak S20 1 2.29
## 4 Ladak O7 1 2.23
## 5 Ladak None 2 1.88
## 6 Ladak S1 2 1.26
## 7 Ladak S20 2 1.60
## 8 Ladak O7 2 2.01
## 9 Ladak None 3 1.62
## 10 Ladak S1 3 1.22
## 11 Ladak S20 3 1.67
## 12 Ladak O7 3 1.82
## 13 Ladak None 4 2.34
## 14 Ladak S1 4 1.59
## 15 Ladak S20 4 1.91
#?Alfalfa revise la información de la base de datos
aov(Alfalfa$Yield ~ Alfalfa$Variety)->ano.alf
ano.alf
## Call:
## aov(formula = Alfalfa$Yield ~ Alfalfa$Variety)
##
## Terms:
## Alfalfa$Variety Residuals
## Sum of Squares 0.178019 8.943746
## Deg. of Freedom 2 69
##
## Residual standard error: 0.3600271
## Estimated effects may be unbalanced
summary(ano.alf)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety 2 0.178 0.08901 0.687 0.507
## Residuals 69 8.944 0.12962
#Medias de los tratamientos
tapply(Alfalfa$Yield , Alfalfa$Variety, mean)
## Cossack Ladak Ranger
## 1.571667 1.666250 1.552500
# Media global
(alfalfa.mean <- mean(Alfalfa$Yield))
## [1] 1.596806
# desviación estándar global
(alfalfa.sd <- sd(Alfalfa$Yield))
## [1] 0.3584349
#Calcular el efecto del tratamiento
model.tables(ano.alf)
## Tables of effects
##
## Alfalfa$Variety
## Alfalfa$Variety
## Cossack Ladak Ranger
## -0.02514 0.06944 -0.04431
boxplot(Yield ~ Variety, col=456,data=Alfalfa)
plotMeans(Alfalfa$Yield,factor(Alfalfa$Variety),error.bars=c("conf.int"))
Para estimar el tamaño mínimo de muestras para ejecutar un anova con cierta fiabilidad, se requiere conocer la media y varianza de los trtatamientos.
summary(ano.alf)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety 2 0.178 0.08901 0.687 0.507
## Residuals 69 8.944 0.12962
mu <- c(A=1.571667, B=1.666250, C=1.552500) # given group means
var(mu)
## [1] 0.00370873
#Tamaño mínimo de muestra de los tratamientos
power.anova.test(groups = 3,
between.var = var(mu),
within.var = 0.12962^2,
sig.level = 0.05,
power = 0.80)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 22.85239
## between.var = 0.00370873
## within.var = 0.01680134
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
#leer el n
#Si queremos saber el poder de nuestro análisis
tapply(Alfalfa$Yield , Alfalfa$Variety, length)
## Cossack Ladak Ranger
## 24 24 24
summary(ano.alf)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety 2 0.178 0.08901 0.687 0.507
## Residuals 69 8.944 0.12962
power.anova.test(groups = 3,
n = 24, # actual sample size
between.var = var(mu),
within.var = 0.12962^2)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 24
## between.var = 0.00370873
## within.var = 0.01680134
## sig.level = 0.05
## power = 0.8211324
##
## NOTE: n is number in each group
power.anova.test(groups = 3, n = 24, # actual sample size between.var = var(mu), within.var = 0.12962^2)
## [1] "count" "spray"
## A B C D E F
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
## [1] 44.48056
## A B C D E F
## 12 12 12 12 12 12
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Balanced one-way analysis of variance power calculation
##
## groups = 6
## n = 14.62923
## between.var = 44.48056
## within.var = 237.16
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
##
## Balanced one-way analysis of variance power calculation
##
## groups = 6
## n = 12
## between.var = 44.48056
## within.var = 237.16
## sig.level = 0.05
## power = 0.695408
##
## NOTE: n is number in each group
## 'data.frame': 72 obs. of 2 variables:
## $ count: num 10 7 20 14 14 12 10 23 17 20 ...
## $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
## count spray
## 1 10 A
## 2 7 A
## 3 20 A
## 4 14 A
## 5 14 A
## 6 12 A
## 7 10 A
## 8 23 A
## 9 17 A
## 10 20 A
## 11 14 A
## 12 13 A
## 13 11 B
## 14 17 B
## 15 21 B
## [1] 14.5
## A B C D E F
## 14.500000 15.333333 2.083333 4.916667 3.500000 16.666667
## A B C D E F
## 12 12 12 12 12 12
## F B C D E A
## 16.666667 15.333333 2.083333 4.916667 3.500000 14.500000
## [1] TRUE
## Df Sum Sq Mean Sq F value Pr(>F)
## spray 5 2669 533.8 34.7 <2e-16 ***
## Residuals 66 1015 15.4
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = count ~ spray, data = InsectSprays)
##
## $spray
## diff lwr upr p adj
## B-A 0.8333333 -3.866075 5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A 2.1666667 -2.532742 6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B 1.3333333 -3.366075 6.032742 0.9603075
## D-C 2.8333333 -1.866075 7.532742 0.4920707
## E-C 1.4166667 -3.282742 6.116075 0.9488669
## F-C 14.5833333 9.883925 19.282742 0.0000000
## E-D -1.4166667 -6.116075 3.282742 0.9488669
## F-D 11.7500000 7.050591 16.449409 0.0000000
## F-E 13.1666667 8.467258 17.866075 0.0000000
##
## Call:
## aov(formula = count ~ spray, data = InsectSprays)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.333 -1.958 -0.500 1.667 9.333
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.5000 1.1322 12.807 < 2e-16 ***
## sprayB 0.8333 1.6011 0.520 0.604
## sprayC -12.4167 1.6011 -7.755 7.27e-11 ***
## sprayD -9.5833 1.6011 -5.985 9.82e-08 ***
## sprayE -11.0000 1.6011 -6.870 2.75e-09 ***
## sprayF 2.1667 1.6011 1.353 0.181
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7036
## F-statistic: 34.7 on 5 and 66 DF, p-value: < 2.2e-16
##
## Bartlett test of homogeneity of variances
##
## data: count by spray
## Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05
##
## One-way analysis of means (not assuming equal variances)
##
## data: count and spray
## F = 36.065, num df = 5.000, denom df = 30.043, p-value = 7.999e-12
##
## Kruskal-Wallis rank sum test
##
## data: count by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10