ANOVA de una Vía

El análisis de varianza es similar al análisis de regresión y en realidad los dos pertenecen a la gran familia de los modelos lineales. Los modelos lineales se caracterizan por investigar la relación entre una variable respuesta cuantitativa y una o más variables exploratorias. Sin embargo, el análisis de varianza difiere del análisis de regresión en que en el ANOVA las variables exploratorias son cualitativas o factores. Vamos a llamar factor a una variable cualitativa que usaremos para designar a los grupos o tratamientos a comparar. Los niveles del factor serán el número de tratamientos o grupos.

Cargamops la base de datos “etchrate”"

read.table("etchrate.txt", header=T)->etchrate
head(etchrate) #muestra el encabezado

##    RF run rate
## 1 160   1  575
## 2 160   2  542
## 3 160   3  530
## 4 160   4  539
## 5 160   5  570
## 6 180   1  565

str(etchrate) #observamos la caracteristica de cada variable. Note que no hay variables categóricas o factores

## 'data.frame':    20 obs. of  3 variables:
##  $ RF  : int  160 160 160 160 160 180 180 180 180 180 ...
##  $ run : int  1 2 3 4 5 1 2 3 4 5 ...
##  $ rate: int  575 542 530 539 570 565 593 590 579 610 ...

Análisis exploratorios

attach(etchrate)
grp.means<-tapply(rate,RF,mean)
grp.means

##   160   180   200   220 
## 551.2 587.4 625.4 707.0

grp.sd<-tapply(rate,RF,sd)
grp.sd

##      160      180      200      220 
## 20.01749 16.74216 20.52559 15.24795

grp.length<-tapply(rate,RF,length)
grp.length

## 160 180 200 220 
##   5   5   5   5

grp.var<-tapply(rate,RF,var)
grp.var

##   160   180   200   220 
## 400.7 280.3 421.3 232.5

# Convertimos cada variable a factor
etchrate$RF <- as.factor(etchrate$RF)
etchrate$run <- as.factor(etchrate$run)
str(etchrate) #Note que ahora nuestras variables de "RF" y "run" son factores

## 'data.frame':    20 obs. of  3 variables:
##  $ RF  : Factor w/ 4 levels "160","180","200",..: 1 1 1 1 1 2 2 2 2 2 ...
##  $ run : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
##  $ rate: int  575 542 530 539 570 565 593 590 579 610 ...

#  Corremos modelo del anova
etch.rate.aov <- aov(rate~RF,data=etchrate)
summary(etch.rate.aov)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## RF           3  66871   22290    66.8 2.88e-09 ***
## Residuals   16   5339     334                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#Notamos que existe una diferencia signifcativa en el  modelo. Que se interpreta como: Almenos una de las medias es diferente en los tratamientos datos en RF

#Exploramos visualmente
boxplot(rate~RF, col=456,xlab="RF Power (W)", ylab=expression(paste("Observed Etch Rate (",ring(A),"/min)")))

# Media global
(erate.mean <- mean(etchrate$rate))

## [1] 617.75

# desviación estándar global
(erate.sd <- sd(etchrate$rate))

## [1] 61.6483

#Calcular el efecto del tratamiento
model.tables(etch.rate.aov)

## Tables of effects
## 
##  RF 
## RF
##    160    180    200    220 
## -66.55 -30.35   7.65  89.25

Supuestos del ANOVA

etch.rate.lm <- lm(rate~RF,etchrate)
par(mfrow=c(2,2))
plot(etch.rate.lm)

#Normalidad

residuals(etch.rate.lm)->residuos
shapiro.test(residuos)

## 
##  Shapiro-Wilk normality test
## 
## data:  residuos
## W = 0.93752, p-value = 0.2152

#Si mi interés es analizar cada grupo de forma independiente
shapiro.test(etchrate$rate[etchrate$RF==160])

## 
##  Shapiro-Wilk normality test
## 
## data:  etchrate$rate[etchrate$RF == 160]
## W = 0.87227, p-value = 0.2758

shapiro.test(etchrate$rate[etchrate$RF==180])

## 
##  Shapiro-Wilk normality test
## 
## data:  etchrate$rate[etchrate$RF == 180]
## W = 0.9883, p-value = 0.9734

shapiro.test(etchrate$rate[etchrate$RF==200])

## 
##  Shapiro-Wilk normality test
## 
## data:  etchrate$rate[etchrate$RF == 200]
## W = 0.96916, p-value = 0.8699

shapiro.test(etchrate$rate[etchrate$RF==220])

## 
##  Shapiro-Wilk normality test
## 
## data:  etchrate$rate[etchrate$RF == 220]
## W = 0.98108, p-value = 0.9403

#Homogenidad de Varianzas
library(car)
ncvTest(etch.rate.lm)

## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.2105756    Df = 1     p = 0.6463166

bartlett.test(rate~RF, etchrate)

## 
##  Bartlett test of homogeneity of variances
## 
## data:  rate by RF
## Bartlett's K-squared = 0.43349, df = 3, p-value = 0.9332

levene.test(rate~RF,etchrate)

## Warning: 'levene.test' is deprecated.
## Use 'leveneTest' instead.
## See help("Deprecated") and help("car-deprecated").

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  3  0.1959 0.8977
##       16

#ansari.test(rate~RF,etchrate), solo funciona para cuando hay únicamente dos grupos
fligner.test(rate~RF,etchrate)

## 
##  Fligner-Killeen test of homogeneity of variances
## 
## data:  rate by RF
## Fligner-Killeen:med chi-squared = 1.0438, df = 3, p-value = 0.7907

Bartlett’s test - If the data is normally distributed, this is the best test to use. It is sensitive to data which is not non-normally distribution; it is more likely to return a “false positive” when the data is non-normal.

.Levene’s test - this is more robust to departures from normality than Bartlett’s test. It is in the car package.

.Fligner-Killeen test - this is a non-parametric test which is very robust against departures from normality.

library(Rcmdr)

## Loading required package: splines

## Loading required package: RcmdrMisc

## Loading required package: sandwich

## La interfaz R-Commander sólo funciona en sesiones interactivas

plotMeans(rate,factor(RF),error.bars=c("sd"))

TukeyHSD(etch.rate.aov)->tkn
plot(tkn)

tukey<-TukeyHSD(etch.rate.aov, conf.level=.95)
tukey

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = rate ~ RF, data = etchrate)
## 
## $RF
##          diff        lwr       upr     p adj
## 180-160  36.2   3.145624  69.25438 0.0294279
## 200-160  74.2  41.145624 107.25438 0.0000455
## 220-160 155.8 122.745624 188.85438 0.0000000
## 200-180  38.0   4.945624  71.05438 0.0215995
## 220-180 119.6  86.545624 152.65438 0.0000001
## 220-200  81.6  48.545624 114.65438 0.0000146

plot(tukey,  sub="Test de diferencias significativas de Tukey" )

otros Test, Post Hoc (cuando se quiere comprobar todas las comparaciones por pares de las posibles diferencias)

pairwise.t.test(etchrate$rate,etchrate$RF,p.adjust.method="bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  etchrate$rate and etchrate$RF 
## 
##     160     180     200    
## 180 0.038   -       -      
## 200 5.1e-05 0.028   -      
## 220 2.2e-09 1.0e-07 1.6e-05
## 
## P value adjustment method: bonferroni

pairwise.t.test(etchrate$rate,etchrate$RF,p.adjust.method="hommel")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  etchrate$rate and etchrate$RF 
## 
##     160     180     200    
## 180 0.0064  -       -      
## 200 2.5e-05 0.0064  -      
## 220 2.2e-09 8.5e-08 1.1e-05
## 
## P value adjustment method: hommel

Utilice la base de datos del paquete “nlme”, denominado “Alfalfa”

library(nlme)
head(Alfalfa,15)

## Grouped Data: Yield ~ Date | Block/Variety
##    Variety Date Block Yield
## 1    Ladak None     1  2.17
## 2    Ladak   S1     1  1.58
## 3    Ladak  S20     1  2.29
## 4    Ladak   O7     1  2.23
## 5    Ladak None     2  1.88
## 6    Ladak   S1     2  1.26
## 7    Ladak  S20     2  1.60
## 8    Ladak   O7     2  2.01
## 9    Ladak None     3  1.62
## 10   Ladak   S1     3  1.22
## 11   Ladak  S20     3  1.67
## 12   Ladak   O7     3  1.82
## 13   Ladak None     4  2.34
## 14   Ladak   S1     4  1.59
## 15   Ladak  S20     4  1.91

#?Alfalfa revise la información de la base de datos

aov(Alfalfa$Yield ~ Alfalfa$Variety)->ano.alf
ano.alf

## Call:
##    aov(formula = Alfalfa$Yield ~ Alfalfa$Variety)
## 
## Terms:
##                 Alfalfa$Variety Residuals
## Sum of Squares         0.178019  8.943746
## Deg. of Freedom               2        69
## 
## Residual standard error: 0.3600271
## Estimated effects may be unbalanced

summary(ano.alf)

##                 Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety  2  0.178 0.08901   0.687  0.507
## Residuals       69  8.944 0.12962

#Medias de los tratamientos
tapply(Alfalfa$Yield , Alfalfa$Variety, mean)

##  Cossack    Ladak   Ranger 
## 1.571667 1.666250 1.552500

# Media global
(alfalfa.mean <- mean(Alfalfa$Yield))

## [1] 1.596806

# desviación estándar global
(alfalfa.sd <- sd(Alfalfa$Yield))

## [1] 0.3584349

#Calcular el efecto del tratamiento
model.tables(ano.alf)

## Tables of effects
## 
##  Alfalfa$Variety 
## Alfalfa$Variety
##  Cossack    Ladak   Ranger 
## -0.02514  0.06944 -0.04431

boxplot(Yield ~ Variety, col=456,data=Alfalfa)

plotMeans(Alfalfa$Yield,factor(Alfalfa$Variety),error.bars=c("conf.int"))

Sample size para ANOVA

Para estimar el tamaño mínimo de muestras para ejecutar un anova con cierta fiabilidad, se requiere conocer la media y varianza de los trtatamientos.

summary(ano.alf)

##                 Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety  2  0.178 0.08901   0.687  0.507
## Residuals       69  8.944 0.12962

mu <- c(A=1.571667, B=1.666250, C=1.552500) # given group means
var(mu)

## [1] 0.00370873

#Tamaño mínimo de muestra de los tratamientos
power.anova.test(groups = 3,  
                 between.var = var(mu),
                 within.var = 0.12962^2, 
                 sig.level = 0.05,   
                 power = 0.80)

## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 22.85239
##     between.var = 0.00370873
##      within.var = 0.01680134
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

#leer el n

#Si queremos saber el poder de nuestro análisis
tapply(Alfalfa$Yield , Alfalfa$Variety, length)

## Cossack   Ladak  Ranger 
##      24      24      24

summary(ano.alf)

##                 Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety  2  0.178 0.08901   0.687  0.507
## Residuals       69  8.944 0.12962

power.anova.test(groups = 3,
                 n = 24,   # actual sample size
                 between.var = var(mu), 
                 within.var = 0.12962^2)

## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 3
##               n = 24
##     between.var = 0.00370873
##      within.var = 0.01680134
##       sig.level = 0.05
##           power = 0.8211324
## 
## NOTE: n is number in each group

power.anova.test(groups = 3, n = 24, # actual sample size between.var = var(mu), within.var = 0.12962^2)

Práctica

Utilice la base de datos “InsectSprays”, ejecute un anova y calcule el tamaño mínimo de muestra, utilizando la variable count (variable independiente),spray (tratamiento)

Así mismo calcule el efecto de su resultado

## [1] "count" "spray"

##         A         B         C         D         E         F 
## 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667

## [1] 44.48056

##  A  B  C  D  E  F 
## 12 12 12 12 12 12

##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 6
##               n = 14.62923
##     between.var = 44.48056
##      within.var = 237.16
##       sig.level = 0.05
##           power = 0.8
## 
## NOTE: n is number in each group

## 
##      Balanced one-way analysis of variance power calculation 
## 
##          groups = 6
##               n = 12
##     between.var = 44.48056
##      within.var = 237.16
##       sig.level = 0.05
##           power = 0.695408
## 
## NOTE: n is number in each group

## 'data.frame':    72 obs. of  2 variables:
##  $ count: num  10 7 20 14 14 12 10 23 17 20 ...
##  $ spray: Factor w/ 6 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

##    count spray
## 1     10     A
## 2      7     A
## 3     20     A
## 4     14     A
## 5     14     A
## 6     12     A
## 7     10     A
## 8     23     A
## 9     17     A
## 10    20     A
## 11    14     A
## 12    13     A
## 13    11     B
## 14    17     B
## 15    21     B

## [1] 14.5

##         A         B         C         D         E         F 
## 14.500000 15.333333  2.083333  4.916667  3.500000 16.666667

##  A  B  C  D  E  F 
## 12 12 12 12 12 12

##         F         B         C         D         E         A 
## 16.666667 15.333333  2.083333  4.916667  3.500000 14.500000

## [1] TRUE

##             Df Sum Sq Mean Sq F value Pr(>F)    
## spray        5   2669   533.8    34.7 <2e-16 ***
## Residuals   66   1015    15.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = count ~ spray, data = InsectSprays)
## 
## $spray
##            diff        lwr       upr     p adj
## B-A   0.8333333  -3.866075  5.532742 0.9951810
## C-A -12.4166667 -17.116075 -7.717258 0.0000000
## D-A  -9.5833333 -14.282742 -4.883925 0.0000014
## E-A -11.0000000 -15.699409 -6.300591 0.0000000
## F-A   2.1666667  -2.532742  6.866075 0.7542147
## C-B -13.2500000 -17.949409 -8.550591 0.0000000
## D-B -10.4166667 -15.116075 -5.717258 0.0000002
## E-B -11.8333333 -16.532742 -7.133925 0.0000000
## F-B   1.3333333  -3.366075  6.032742 0.9603075
## D-C   2.8333333  -1.866075  7.532742 0.4920707
## E-C   1.4166667  -3.282742  6.116075 0.9488669
## F-C  14.5833333   9.883925 19.282742 0.0000000
## E-D  -1.4166667  -6.116075  3.282742 0.9488669
## F-D  11.7500000   7.050591 16.449409 0.0000000
## F-E  13.1666667   8.467258 17.866075 0.0000000

## 
## Call:
## aov(formula = count ~ spray, data = InsectSprays)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.333 -1.958 -0.500  1.667  9.333 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  14.5000     1.1322  12.807  < 2e-16 ***
## sprayB        0.8333     1.6011   0.520    0.604    
## sprayC      -12.4167     1.6011  -7.755 7.27e-11 ***
## sprayD       -9.5833     1.6011  -5.985 9.82e-08 ***
## sprayE      -11.0000     1.6011  -6.870 2.75e-09 ***
## sprayF        2.1667     1.6011   1.353    0.181    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.922 on 66 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7036 
## F-statistic:  34.7 on 5 and 66 DF,  p-value: < 2.2e-16

## 
##  Bartlett test of homogeneity of variances
## 
## data:  count by spray
## Bartlett's K-squared = 25.96, df = 5, p-value = 9.085e-05

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  count and spray
## F = 36.065, num df = 5.000, denom df = 30.043, p-value = 7.999e-12

## 
##  Kruskal-Wallis rank sum test
## 
## data:  count by spray
## Kruskal-Wallis chi-squared = 54.691, df = 5, p-value = 1.511e-10

Analisis de varianza_un factor

Oscar Ramírez-Alán