El análisis de varianza es similar al análisis de regresión y en realidad los dos pertenecen a la gran familia de los modelos lineales. Los modelos lineales se caracterizan por investigar la relación entre una variable respuesta cuantitativa y una o más variables exploratorias. Sin embargo, el análisis de varianza difiere del análisis de regresión en que en el ANOVA las variables exploratorias son cualitativas o factores. Vamos a llamar factor a una variable cualitativa que usaremos para designar a los grupos o tratamientos a comparar. Los niveles del factor serín el número de tratamientos o grupos.
Cargamos la base de datos “etchrate”
etchrate <- read.table("https://raw.githubusercontent.com/osoramirez/bases_datos/master/etchrate.csv", sep="", header=TRUE)
etchrate
## RF run rate
## 1 160 1 575
## 2 160 2 542
## 3 160 3 530
## 4 160 4 539
## 5 160 5 570
## 6 180 1 565
## 7 180 2 593
## 8 180 3 590
## 9 180 4 579
## 10 180 5 610
## 11 200 1 600
## 12 200 2 651
## 13 200 3 610
## 14 200 4 637
## 15 200 5 629
## 16 220 1 725
## 17 220 2 700
## 18 220 3 715
## 19 220 4 685
## 20 220 5 710
head(etchrate)#muestra el encabezado
## RF run rate
## 1 160 1 575
## 2 160 2 542
## 3 160 3 530
## 4 160 4 539
## 5 160 5 570
## 6 180 1 565
str(etchrate) #observamos la caracteristica de cada variable. Note que no hay variables categóricas o factores
## 'data.frame': 20 obs. of 3 variables:
## $ RF : int 160 160 160 160 160 180 180 180 180 180 ...
## $ run : int 1 2 3 4 5 1 2 3 4 5 ...
## $ rate: int 575 542 530 539 570 565 593 590 579 610 ...
análisis exploratorios
attach(etchrate)
grp.means<-tapply(rate,RF,mean)
grp.means
## 160 180 200 220
## 551.2 587.4 625.4 707.0
grp.sd<-tapply(rate,RF,sd)
grp.sd
## 160 180 200 220
## 20.01749 16.74216 20.52559 15.24795
grp.length<-tapply(rate,RF,length)
grp.length
## 160 180 200 220
## 5 5 5 5
grp.var<-tapply(rate,RF,var)
grp.var
## 160 180 200 220
## 400.7 280.3 421.3 232.5
# Convertimos cada variable a factor
etchrate$RF <- as.factor(etchrate$RF)
etchrate$run <- as.factor(etchrate$run)
str(etchrate) #Note que ahora nuestras variables de "RF" y "run" son factores
## 'data.frame': 20 obs. of 3 variables:
## $ RF : Factor w/ 4 levels "160","180","200",..: 1 1 1 1 1 2 2 2 2 2 ...
## $ run : Factor w/ 5 levels "1","2","3","4",..: 1 2 3 4 5 1 2 3 4 5 ...
## $ rate: int 575 542 530 539 570 565 593 590 579 610 ...
# Corremos modelo del anova
etch.rate.aov <- aov(rate~RF,data=etchrate)
summary(etch.rate.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 3 66871 22290 66.8 2.88e-09 ***
## Residuals 16 5339 334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###Notamos que existe una diferencia signifcativa en el modelo. Que se interpreta como: Almenos una de las medias es diferente en los tratamientos datos en RF
#Exploramos visualmente
boxplot(rate~RF, col=456,xlab="RF Power (W)", ylab=expression(paste("Observed Etch Rate (",ring(A),"/min)")))
# Media global
(erate.mean <- mean(etchrate$rate))
## [1] 617.75
# desviación estándar global
(erate.sd <- sd(etchrate$rate))
## [1] 61.6483
# Corremos modelo del anova
etch.rate.aov <- aov(rate~RF,data=etchrate)
summary(etch.rate.aov)
## Df Sum Sq Mean Sq F value Pr(>F)
## RF 3 66871 22290 66.8 2.88e-09 ***
## Residuals 16 5339 334
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#Calcular el efecto del tratamiento
model.tables(etch.rate.aov)
## Tables of effects
##
## RF
## RF
## 160 180 200 220
## -66.55 -30.35 7.65 89.25
etch.rate.lm <- lm(rate~RF,etchrate)
par(mfrow=c(2,2))
plot(etch.rate.lm)
#Normalidad
residuals(etch.rate.lm)->residuos
shapiro.test(residuos)
##
## Shapiro-Wilk normality test
##
## data: residuos
## W = 0.93752, p-value = 0.2152
#Si mi interés es analizar cada grupo de forma independiente
shapiro.test(etchrate$rate[etchrate$RF==160])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 160]
## W = 0.87227, p-value = 0.2758
shapiro.test(etchrate$rate[etchrate$RF==180])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 180]
## W = 0.9883, p-value = 0.9734
shapiro.test(etchrate$rate[etchrate$RF==200])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 200]
## W = 0.96916, p-value = 0.8699
shapiro.test(etchrate$rate[etchrate$RF==220])
##
## Shapiro-Wilk normality test
##
## data: etchrate$rate[etchrate$RF == 220]
## W = 0.98108, p-value = 0.9403
#Homogenidad de Varianzas
library(car)
## Loading required package: carData
ncvTest(etch.rate.lm)
## Non-constant Variance Score Test
## Variance formula: ~ fitted.values
## Chisquare = 0.2105756, Df = 1, p = 0.64632
bartlett.test(rate~RF, etchrate)
##
## Bartlett test of homogeneity of variances
##
## data: rate by RF
## Bartlett's K-squared = 0.43349, df = 3, p-value = 0.9332
leveneTest(rate~RF,etchrate)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 3 0.1959 0.8977
## 16
#ansari.test(rate~RF,etchrate), solo funciona para cuando hay únicamente dos grupos
fligner.test(rate~RF,etchrate)
##
## Fligner-Killeen test of homogeneity of variances
##
## data: rate by RF
## Fligner-Killeen:med chi-squared = 1.0438, df = 3, p-value = 0.7907
TukeyHSD(etch.rate.aov)->tkn
plot(tkn)
tukey<-TukeyHSD(etch.rate.aov, conf.level=.95)
tukey
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = rate ~ RF, data = etchrate)
##
## $RF
## diff lwr upr p adj
## 180-160 36.2 3.145624 69.25438 0.0294279
## 200-160 74.2 41.145624 107.25438 0.0000455
## 220-160 155.8 122.745624 188.85438 0.0000000
## 200-180 38.0 4.945624 71.05438 0.0215995
## 220-180 119.6 86.545624 152.65438 0.0000001
## 220-200 81.6 48.545624 114.65438 0.0000146
plot(tukey, sub="Test de diferencias significativas de Tukey" )
pairwise.t.test(etchrate$rate,etchrate$RF,p.adjust.method="bonferroni")
##
## Pairwise comparisons using t tests with pooled SD
##
## data: etchrate$rate and etchrate$RF
##
## 160 180 200
## 180 0.038 - -
## 200 5.1e-05 0.028 -
## 220 2.2e-09 1.0e-07 1.6e-05
##
## P value adjustment method: bonferroni
Bartlett’s test - If the data is normally distributed, this is the best test to use. It is sensitive to data which is not non-normally distribution; it is more likely to return a “false positive” when the data is non-normal.
.Levene’s test - this is more robust to departures from normality than Bartlett’s test. It is in the car package.
.Fligner-Killeen test - this is a non-parametric test which is very robust against departures from normality
library(Rcmdr)
plotMeans(rate,factor(RF),error.bars=c("sd"))
library(nlme)
head(Alfalfa,15)
## Grouped Data: Yield ~ Date | Block/Variety
## Variety Date Block Yield
## 1 Ladak None 1 2.17
## 2 Ladak S1 1 1.58
## 3 Ladak S20 1 2.29
## 4 Ladak O7 1 2.23
## 5 Ladak None 2 1.88
## 6 Ladak S1 2 1.26
## 7 Ladak S20 2 1.60
## 8 Ladak O7 2 2.01
## 9 Ladak None 3 1.62
## 10 Ladak S1 3 1.22
## 11 Ladak S20 3 1.67
## 12 Ladak O7 3 1.82
## 13 Ladak None 4 2.34
## 14 Ladak S1 4 1.59
## 15 Ladak S20 4 1.91
aov(Alfalfa$Yield ~ Alfalfa$Variety)->ano.alf
ano.alf
## Call:
## aov(formula = Alfalfa$Yield ~ Alfalfa$Variety)
##
## Terms:
## Alfalfa$Variety Residuals
## Sum of Squares 0.178019 8.943746
## Deg. of Freedom 2 69
##
## Residual standard error: 0.3600271
## Estimated effects may be unbalanced
summary(ano.alf)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety 2 0.178 0.08901 0.687 0.507
## Residuals 69 8.944 0.12962
#Medias de los tratamientos
tapply(Alfalfa$Yield , Alfalfa$Variety, mean)
## Cossack Ladak Ranger
## 1.571667 1.666250 1.552500
# Media global
(alfalfa.mean <- mean(Alfalfa$Yield))
## [1] 1.596806
# desviación estándar global
(alfalfa.sd <- sd(Alfalfa$Yield))
## [1] 0.3584349
#Calcular el efecto del tratamiento
model.tables(ano.alf)
## Tables of effects
##
## Alfalfa$Variety
## Alfalfa$Variety
## Cossack Ladak Ranger
## -0.02514 0.06944 -0.04431
boxplot(Yield ~ Variety, col=456,data=Alfalfa)
plotMeans(Alfalfa$Yield,factor(Alfalfa$Variety),error.bars=c("conf.int"))
Para estimar el tamaño mínimo de muestras para ejecutar un anova con cierta fiabilidad, se requiere conocer la media y varianza de los trtatamientos.
mu <- c(A=1.571667, B=1.666250, C=1.552500) # given group means
var(mu)
## [1] 0.00370873
#Tamaño mínimo de muestra de los tratamientos
power.anova.test(groups = 3,
between.var = var(mu),
within.var = 0.12962^2,
sig.level = 0.05,
power = 0.80)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 22.85239
## between.var = 0.00370873
## within.var = 0.01680134
## sig.level = 0.05
## power = 0.8
##
## NOTE: n is number in each group
#Si queremos saber el poder de nuestro análisis, requerimos siempre información base del anova
summary(ano.alf)
## Df Sum Sq Mean Sq F value Pr(>F)
## Alfalfa$Variety 2 0.178 0.08901 0.687 0.507
## Residuals 69 8.944 0.12962
power.anova.test(groups = 3,
n = 24, # actual sample size
between.var = var(mu),
within.var = 0.12962^2)
##
## Balanced one-way analysis of variance power calculation
##
## groups = 3
## n = 24
## between.var = 0.00370873
## within.var = 0.01680134
## sig.level = 0.05
## power = 0.8211324
##
## NOTE: n is number in each group
power.anova.test(groups = 3, n = 24, # actual sample size between.var = var(mu), within.var = 0.12962^2)
Utilice la base de datos “InsectSprays”, ejecute un anova y calcule el tamaño mínimo de muestra, utilizando la variable count (variable dependiente), spray (tratamiento). Así mismo calcule el efecto de su resultado.
Obtenga:
. Anova
. Supuestos
. Sample size
. Poder
. Boxplot