Sameer Mathur
One-way ANOVA
---
# reading data into R
resGrades.df <- read.csv(paste("AdvertisingDataV2.csv"))
# attaching data columns of the dataframe
attach(resGrades.df)
# dimension of the dataframe
dim(resGrades.df)
[1] 30000 6
# descriptive statistics
library(psych)
describe(resGrades.df)[, c(1:5, 8:9)]
vars n mean sd median min max
adType* 1 30000 2.00 0.82 2.0 1 3
pageViews 2 30000 468.06 168.16 391.0 145 929
phoneCalls 3 30000 37.71 7.97 37.0 17 77
reservations 4 30000 36.55 7.99 36.0 15 79
businessID 5 30000 15000.50 8660.40 15000.5 1 30000
restaurantType* 6 30000 1.60 0.49 2.0 1 2
# number of observations in ad type
table(adType)
adType
Curr Ads New Ads No Ads
10000 10000 10000
# descriptive statistics of reservations by ad type
library(data.table)
dt <- data.table(resGrades.df)
dt[, list(count = .N,
mean = round(mean(reservations), 3),
sd = round(mean(reservations), 3),
median = round(median(reservations), 3),
min = min(reservations),
max = max(reservations)),
by = list(adType)]
adType count mean sd median min max
1: No Ads 10000 33.960 33.960 33 15 58
2: Curr Ads 10000 34.021 34.021 33 18 59
3: New Ads 10000 41.681 41.681 40 18 79
# box plot of ad type
boxplot(reservations ~ adType, data = resGrades.df,
main = "Boxplot of Ad Type",
xlab = "Ad Type", ylab = "Reservations",
digits = 2)
# mean plot by ad type
library(gplots)
plotmeans(reservations ~ adType, data = resGrades.df,
xlab = "Ad Type", ylab = "Number of Reservations",
digits = 2, col = "black", ccol = "blue", barwidth = 2,
legends = TRUE, mean.labels = TRUE, frame = TRUE)
# one-way ANOVA
oneWayfit <- aov(reservations ~ adType, data = resGrades.df)
# summary of the ANOVA model
summary(oneWayfit)
Df Sum Sq Mean Sq F value Pr(>F)
adType 2 394228 197114 3885 <2e-16 ***
Residuals 29997 1522018 51
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value < 0.05 for adType, we can conclude that there are significant differences in number of reservations between three different ad types.
# Tukey comparison test
TukeyHSD(oneWayfit)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = reservations ~ adType, data = resGrades.df)
$adType
diff lwr upr p adj
New Ads-Curr Ads 7.6593 7.4232043 7.8953957 0.0000000
No Ads-Curr Ads -0.0608 -0.2968957 0.1752957 0.8181708
No Ads-New Ads -7.7201 -7.9561957 -7.4840043 0.0000000
# Tukey parir-wise comparisons plot
plot(TukeyHSD(oneWayfit))
# residual versus fitted plot
plot(oneWayfit, 2)
# extract the residuals
aovResiduals <- residuals(oneWayfit)
# Anderson-Darling normality test
library(nortest)
ad.test(aovResiduals)
Anderson-Darling normality test
data: aovResiduals
A = 305.49, p-value < 2.2e-16
Null hypothesis (\( H_0 \)): The data is normally distributed.
We fail to reject the null hypothesis. So, we cannot assume the normality of the data.
# residual versus fitted plot
plot(oneWayfit, 1)
# test homogeneity of variance
library(car)
leveneTest(reservations ~ adType, data = resGrades.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 134.29 < 2.2e-16 ***
29997
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Null hypothesis (\( H_0 \)): The variance of number of reservations are homogenous between groups.
We fail to reject the null hypothesis. So, we cannot assume the homogeneity of variance between the groups.
# Box-cox transformation using MASS package
library(MASS)
BCTrans <- boxcox(oneWayfit, lambda = seq(-2, 2))
# Box-Cox transformation value of lambda
BCTrans$x[which(BCTrans$y == max(BCTrans$y))]
[1] -0.1414141
# Box-Cox transformation using caret package
library(caret)
ResTrans <- BoxCoxTrans(resGrades.df$reservations)
ResTrans
Box-Cox Transformation
30000 data points used to estimate Lambda
Input data summary:
Min. 1st Qu. Median Mean 3rd Qu. Max.
15.00 31.00 36.00 36.55 41.00 79.00
Largest/Smallest: 5.27
Sample Skewness: 0.777
Estimated Lambda: -0.2
With fudge factor, Lambda = 0 will be used for transformations
The Box-cox transformation using MASS package suggests \( \lambda = -0.1414141 \) and Box-Cox transformation using caret package suggests \( \lambda = -0.2 \).
However, both \( \lambda \) have negative fraction values. It is quit tough to transform our dependent variable also interpretation become too complicated.
Hence, we use simple approach that \( \lambda = 0 \). In this case, our dependent variable will be a log transformation.
# change Box Office Collection to Log Box Office Collection
resGrades.df$logReservations <- log(resGrades.df$reservations)
# attaching data columns
attach(resGrades.df)
# first few rows of the dataframe
head(resGrades.df)
adType pageViews phoneCalls reservations businessID restaurantType
1 No Ads 643 44 39 1 chain
2 No Ads 621 41 44 2 chain
3 No Ads 581 40 38 3 chain
4 No Ads 592 35 31 4 chain
5 No Ads 648 45 46 5 chain
6 No Ads 519 37 41 6 chain
logReservations
1 3.663562
2 3.784190
3 3.637586
4 3.433987
5 3.828641
6 3.713572
# one-way ANOVA (after transformations)
oneWayTransfit <- aov(logReservations ~ adType, data = resGrades.df)
# summary of the ANOVA model
summary(oneWayTransfit)
Df Sum Sq Mean Sq F value Pr(>F)
adType 2 278.1 139.05 3845 <2e-16 ***
Residuals 29997 1084.8 0.04
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Before Log-transformation
# normal Q-Q plot before log-transformation
plot(oneWayfit, 2)
After Log-transformation
# normal Q-Q plot after log-transformation
plot(oneWayTransfit, 2)
Before Log-transformation
# Anderson-Darling normality test before log-transformation
library(nortest)
ad.test(reservations)
Anderson-Darling normality test
data: reservations
A = 222.08, p-value < 2.2e-16
After Log-transformation
# Anderson-Darling normality test after log-transformation
library(nortest)
ad.test(logReservations)
Anderson-Darling normality test
data: logReservations
A = 36.301, p-value < 2.2e-16
Before Log-transformations
# residual versus fitted plot before log-transformation
plot(oneWayfit, 1)
After Log-transformations
# residual versus fitted plot after log-transformation
plot(oneWayTransfit, 1)
Before Log-transformation
# test homogeneity of variance after log-transformation
library(car)
leveneTest(reservations ~ adType, data = resGrades.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 134.29 < 2.2e-16 ***
29997
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After Log-transformation
# test homogeneity of variance after log-transformation
library(car)
leveneTest(logReservations ~ adType, data = resGrades.df)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 2 24.465 2.419e-11 ***
29997
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After log transformation of the dependent variable number of reservations, we find homogeneity problem i.e. we didn't have equal variances between the groups.
Same for normality assumption, we fail to accept the null hypothesis. That concludes, we cannot assume the normality of the data after the transformation.
Hence, we will use non-parametric Kruskal-Wallis rank sum test in the case of non-normality in the data.
A non-parametric alternative to ANOVA is Kruskal-Wallis rank sum test, which can be used when ANOVA assumptions are not met.
# Kruskal-Wallis rank sum test
kruskal.test(logReservations ~ adType, data = resGrades.df)
Kruskal-Wallis rank sum test
data: logReservations by adType
Kruskal-Wallis chi-squared = 5856.4, df = 2, p-value < 2.2e-16
pairwise.t.test(logReservations, adType, data = resGrades.df, p.adjust.method = "BH", pool.sd = FALSE)
Pairwise comparisons using t tests with non-pooled SD
data: logReservations and adType
Curr Ads New Ads
New Ads <2e-16 -
No Ads 0.32 <2e-16
P value adjustment method: BH
The classical one-way ANOVA test requires an assumption of equal variances for all groups. In our example, the homogeneity of variance assumption turned out to be fail. Hence, we use Welch one-way test.
# anova test when variances are not same
oneway.test(logReservations ~ adType, data = resGrades.df)
One-way analysis of means (not assuming equal variances)
data: logReservations and adType
F = 3892.4, num df = 2, denom df = 19993, p-value < 2.2e-16