This is the Chapter 5 project for STAT 321, using data from https://www.kaggle.com/datasets/geoffnel/evs-one-electric-vehicle-dataset.
5.1: Overview of one-way ANOVA
In a one-way ANOVA, the categorical variable divides the dataset into
groups, and the continuous outcome variable is measured for each group.
The goal is to assess whether there are statistically significant
differences in the means of the outcome variable across the groups.
Interpretation of Results:
Regardless of whether it’s an experiment or observational study, the interpretation of the results from a one-way ANOVA typically involves:
Assessing Overall Significance: The ANOVA test provides an F-statistic and an associated p-value. A small p-value (conventionally < 0.05) indicates that at least one group mean differs significantly from the others.
Post-hoc Tests (if needed): If the overall ANOVA test is significant, post-hoc tests (e.g., Fisher’s Least Significant Difference (LSD), Tukey’s HSD (Honestly Significant Difference), Bonferroni correction) may be conducted to determine which specific group means are significantly different from each other.
5.3: Fitting the model
In the following, we provide R code for calculating group means:
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)
# Aggregate the data by group and calculate the mean of y for each group
aggregate(y ~ group, data = D, FUN = mean)
## group y
## 1 A 4.8
## 2 B 6.2
## 3 C 8.8
## 4 D 15.0
5.4: Statistical Inference Based on the ANOVA Model
To make formal inference about the parameters, we must make some
assumptions on the model:
Independence: Observations within each group are independent of each other.
Normality: The residuals (error terms $\epsilon_{ij}$) are normally distributed with mean zero within each group.
Homogeneity of variances: The variance of the residuals is the same for all groups, denoted by $\sigma^2$.
The main goal of one-way ANOVA is to test the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different. This is typically done using an F-test, where the F-statistic is calculated as the ratio of the between-group variability (representing signal) to the within-group variability (representing noise).
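The ratio of between-group to within-group variability described above can be computed directly. The following is a minimal sketch on the four-group data from Section 5.3; the variable names (`ss_between`, `ss_within`, `f_stat`, and so on) are illustrative choices, not part of any package.

```r
# Toy data from Section 5.3 (four groups, 21 observations)
D <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), c(5, 5, 5, 6))),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

n_i    <- tapply(D$y, D$group, length)  # group sizes
ybar_i <- tapply(D$y, D$group, mean)    # group means
ybar   <- mean(D$y)                     # grand mean

# Between-group (signal) and within-group (noise) sums of squares
ss_between <- sum(n_i * (ybar_i - ybar)^2)
ss_within  <- sum((D$y - ave(D$y, D$group))^2)  # ave() returns each observation's group mean

df_between <- nlevels(D$group) - 1
df_within  <- nrow(D) - nlevels(D$group)

# F-statistic: ratio of the two mean squares, with its upper-tail p-value
f_stat  <- (ss_between / df_between) / (ss_within / df_within)
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

c(SSB = ss_between, SSW = ss_within, F = f_stat, p = p_value)
```

The two sums of squares are 343.6 and 66.4, on 3 and 17 degrees of freedom, which gives an F-statistic of about 29.32.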
In R, one-way ANOVA can be performed using the aov() function.
We demonstrate how to perform a one-way ANOVA in R for an experiment.
Example 1: We use the data from Section 5.3 to perform a one-way ANOVA.
# Create the data frame D
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)
# Perform one-way ANOVA
anova_result <- aov(y ~ group, data = D)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 3 343.6 114.53 29.32 6.04e-07 ***
## Residuals 17 66.4 3.91
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table reports the same sums of squares as the hand calculation. The p-value (6.04e-07, or 0.000000604) shows that the differences among the group means of y are statistically significant at any commonly used significance level.
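Because the overall F-test is significant, a post-hoc procedure such as Tukey's HSD (mentioned in the overview) can indicate which pairs of groups differ. The sketch below uses base R's `TukeyHSD()` on the same data; the object names `fit`, `tukey`, and `sig_pairs` are illustrative. Note that `group` is stored as a factor here, since `TukeyHSD()` only handles factor terms.

```r
# Same data as Example 1, with 'group' stored as a factor
D <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), c(5, 5, 5, 6))),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

fit   <- aov(y ~ group, data = D)
tukey <- TukeyHSD(fit, conf.level = 0.95)

# One row per pair of groups; columns: diff, lwr, upr, p adj
tukey$group

# Pairs whose adjusted p-value is below 0.05
sig_pairs <- rownames(tukey$group)[tukey$group[, "p adj"] < 0.05]
sig_pairs
```

Each row is labelled by the pair it compares, for example `D-A` for the difference between the means of groups D and A (15.0 - 4.8 = 10.2).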
Example 2: Electric vehicles study
# Read the CSV file
EVdata <- read.csv("EVdata.csv")
# Print the "Brand" and "Efficiency_WhKm" columns
EVdata[, c("Brand", "Efficiency_WhKm")]
## Brand Efficiency_WhKm
## 1 Tesla 161
## 2 Volkswagen 167
## 3 Polestar 181
## 4 BMW 206
## 5 Honda 168
## 6 Lucid 180
## 7 Volkswagen 168
## 8 Peugeot 164
## 9 Tesla 153
## 10 Audi 193
## 11 Mercedes 216
## 12 Nissan 164
## 13 Hyundai 160
## 14 BMW 178
## 15 Hyundai 153
## 16 Volkswagen 175
## 17 Porsche 223
## 18 Volkswagen 166
## 19 MG 193
## 20 Mini 156
## 21 Opel 164
## 22 Tesla 171
## 23 Skoda 179
## 24 Audi 197
## 25 Tesla 167
## 26 Volkswagen 183
## 27 Volkswagen 166
## 28 Volvo 200
## 29 BMW 161
## 30 Peugeot 180
## 31 Audi 231
## 32 Kia 173
## 33 Renault 165
## 34 Tesla 267
## 35 Mazda 178
## 36 Nissan 172
## 37 Lexus 193
## 38 CUPRA 181
## 39 Renault 168
## 40 Mercedes 171
## 41 Tesla 184
## 42 Hyundai 154
## 43 Audi 228
## 44 Skoda 166
## 45 SEAT 166
## 46 Kia 175
## 47 Opel 173
## 48 Porsche 195
## 49 Lightyear 104
## 50 Aiways 188
## 51 Audi 237
## 52 Tesla 206
## 53 Opel 176
## 54 Skoda 183
## 55 Tesla 211
## 56 Honda 168
## 57 DS 180
## 58 Renault 164
## 59 Citroen 180
## 60 Tesla 188
## 61 Renault 161
## 62 Tesla 177
## 63 Nissan 198
## 64 Jaguar 232
## 65 Ford 200
## 66 Porsche 197
## 67 Nissan 200
## 68 Tesla 261
## 69 Renault 194
## 70 Ford 209
## 71 BMW 165
## 72 Skoda 193
## 73 Porsche 217
## 74 Byton 244
## 75 Sono 156
## 76 Kia 167
## 77 Audi 188
## 78 Smart 176
## 79 Ford 206
## 80 Porsche 215
## 81 Volkswagen 171
## 82 Tesla 216
## 83 Smart 167
## 84 Ford 194
## 85 Mercedes 273
## 86 Fiat 168
## 87 Tesla 256
## 88 Audi 219
## 89 Skoda 193
## 90 Skoda 181
## 91 Audi 270
## 92 Smart 176
## 93 Kia 175
## 94 Nissan 207
## 95 Fiat 168
## 96 Volkswagen 171
## 97 Kia 170
## 98 Byton 222
## 99 Nissan 191
## 100 Audi 258
## 101 Nissan 194
## 102 Nissan 232
## 103 Byton 238
We then visualize the data:
# Create a boxplot of Efficiency_WhKm by Brand using base R graphics
boxplot(Efficiency_WhKm ~ Brand, data = EVdata,
        main = "Boxplot of Efficiency by Brand",
        xlab = "Brand", ylab = "Efficiency (Wh/Km)")
It seems there are differences in Efficiency by brand. Are the
differences significant?
We perform one-way ANOVA:
# Perform ANOVA test
anova_result <- aov(Efficiency_WhKm ~ Brand, data = EVdata)
# Summarize the ANOVA results
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## Brand 32 51984 1624.5 3.058 4.9e-05 ***
## Residuals 70 37185 531.2
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The result is statistically significant: the p-value (Pr(>F) = 4.9e-05) is well below the conventional significance level of 0.05, so we reject the null hypothesis that all brands have the same mean efficiency.
How can we check whether the assumptions of the ANOVA are met?
We can check them graphically or with formal test procedures.
# graphical check of ANOVA assumptions
plot(anova_result, 1) # Identify outliers and graphically check equal variances
plot(anova_result, 2) # Graphically check normality
## Warning: not plotting observations with leverage one:
## 3, 6, 19, 20, 28, 35, 37, 38, 45, 49, 50, 57, 59, 64, 75
Here’s how you can interpret the graphs:
Graphically Check Equal Variances (Plot Type 1): This plot shows the residuals against the fitted values (the group means). If the groups have equal variances, the residuals should have a roughly equal spread across the groups. If not, the assumption of equal variances may be violated.
Graphically Check Normality (Plot Type 2): If the points in the normal Q-Q plot fall approximately along a straight line, the residuals are approximately normally distributed. Deviations from this pattern, such as marked skewness or outliers, may indicate departures from normality.
A formal test available from the “car” package for testing equal
variances is shown below:
library(car)
## Loading required package: carData
# Perform Levene's test
leveneTest(Efficiency_WhKm ~ Brand, data = EVdata)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 32 1.6892 0.03483 *
## 70
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Levene’s test gives a p-value of 0.03483, which is less than the conventional significance level of 0.05; the single asterisk next to the p-value marks a result that is significant at the 0.05 level. There is therefore evidence of heterogeneity of variances across the groups (brands) in the dataset.
We can also use the following R code to calculate group standard deviations:
aggregate(Efficiency_WhKm ~ Brand, data = EVdata, sd)
## Brand Efficiency_WhKm
## 1 Aiways NA
## 2 Audi 28.535553
## 3 BMW 20.338797
## 4 Byton 11.372481
## 5 Citroen NA
## 6 CUPRA NA
## 7 DS NA
## 8 Fiat 0.000000
## 9 Ford 6.652067
## 10 Honda 0.000000
## 11 Hyundai 3.785939
## 12 Jaguar NA
## 13 Kia 3.464102
## 14 Lexus NA
## 15 Lightyear NA
## 16 Lucid NA
## 17 Mazda NA
## 18 Mercedes 51.117512
## 19 MG NA
## 20 Mini NA
## 21 Nissan 20.885744
## 22 Opel 6.244998
## 23 Peugeot 11.313708
## 24 Polestar NA
## 25 Porsche 12.601587
## 26 Renault 13.427584
## 27 SEAT NA
## 28 Skoda 10.074721
## 29 Smart 5.196152
## 30 Sono NA
## 31 Tesla 39.075863
## 32 Volkswagen 5.792544
## 33 Volvo NA
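A common rule of thumb is that the equal variances assumption is reasonable if the largest group standard deviation is no greater than twice the smallest. Here the largest standard deviation (51.12 for Mercedes) is far more than double the smallest (0 for Fiat and Honda, or 3.46 for Kia among brands whose values are not all identical), so the equal variances assumption is violated. Brands with a single observation have no standard deviation, which is why they show NA.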
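The rule of thumb can be checked in code. The sketch below is an illustration under stated assumptions: it uses only five brands whose values are copied from the listing above (so it does not need `EVdata.csv`), and it excludes brands with a single observation or zero spread before forming the ratio, which is a choice made for this sketch rather than part of the rule itself.

```r
# Subset of the EV data, copied from the listing above (five brands only)
ev <- data.frame(
  Brand = rep(c("Audi", "Kia", "Mercedes", "Fiat", "Lightyear"), c(9, 5, 3, 2, 1)),
  Efficiency_WhKm = c(193, 197, 231, 228, 237, 188, 219, 270, 258,  # Audi
                      173, 175, 167, 175, 170,                      # Kia
                      216, 171, 273,                                # Mercedes
                      168, 168,                                     # Fiat
                      104)                                          # Lightyear
)

n_by_brand  <- tapply(ev$Efficiency_WhKm, ev$Brand, length)
sd_by_brand <- tapply(ev$Efficiency_WhKm, ev$Brand, sd)  # NA when a brand has one observation

# Keep brands with at least two observations and a non-zero spread
usable   <- !is.na(sd_by_brand) & sd_by_brand > 0
sd_ratio <- max(sd_by_brand[usable]) / min(sd_by_brand[usable])

round(sd_by_brand, 2)
sd_ratio > 2  # TRUE means the rule of thumb flags unequal variances
```

On the full dataset the same steps could be applied to `EVdata` in place of `ev`.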
The Shapiro-Wilk test can be used to check normality:
shapiro.test(anova_result$residuals)
##
## Shapiro-Wilk normality test
##
## data: anova_result$residuals
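## W = 0.87429, p-value = 7.447e-08
The p-value is far below 0.05, so we reject the null hypothesis that the residuals are normally distributed; the normality assumption also appears to be violated.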
5.5: Transformation of the response variable
The one-way ANOVA assumes that the response variable is normally distributed within each group and exhibits equal variance across groups. However, in real-world data, these assumptions may not always hold true. When these conditions are violated, one approach to address the issue is to apply transformations to the response variable to stabilize variation.
Several commonly used transformations for the response variable y include:
the natural logarithmic transformation: log(y)
the square-root transformation: $\sqrt{y}$
the reciprocal transformation: 1/y
These transformations can help to achieve approximate normality and homogeneity of variances, thereby improving the validity of the ANOVA analysis.
For each transformation, the standard deviation of each group can be calculated. If the maximum standard deviation is no greater than twice the smallest standard deviation, the homogeneity in variance assumption is satisfied.
Example: Use the following data to find a transformation.
Group A: 0.9, 1, 1.1
Group B: 9, 10, 11
Group C: 90, 100, 110
Solution.
Plot without log-transformation:
library(ggplot2)
# Create a data frame with two variables: 'group' and 'y'
D <- data.frame(
  group = rep(c("A", "B", "C"), each = 3), # "A", "B", and "C", each repeated 3 times
  y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)
)
# Convert the 'group' variable to a factor
D$group <- factor(D$group)
# Plot the data using ggplot2
ggplot(D, aes(x = group, y = y)) +
  geom_point()
The spread of the data differs greatly between groups: it grows with the group mean.
Plot after log-transformation:
ggplot(D, aes(x = group, y = log(y))) +
  geom_point()
After log-transformation, the data in different groups have the same variation. The one-way ANOVA can now be carried out on the transformed data.
aov_result = aov(log(y)~group, data = D)
summary(aov_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## group 2 31.81 15.91 1579 6.82e-09 ***
## Residuals 6 0.06 0.01
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
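The standard-deviation rule described earlier in this section can be used to compare the candidate transformations side by side. The following sketch does so for the three-group example; `sd_ratio()` is a small helper defined here for illustration, not a function from any package.

```r
# Three-group example data from this section
D <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)
)

# Ratio of the largest to the smallest group standard deviation
sd_ratio <- function(values, group) {
  s <- tapply(values, group, sd)
  max(s) / min(s)
}

ratios <- c(
  raw        = sd_ratio(D$y, D$group),
  log        = sd_ratio(log(D$y), D$group),
  sqrt       = sd_ratio(sqrt(D$y), D$group),
  reciprocal = sd_ratio(1 / D$y, D$group)
)

round(ratios, 2)
ratios <= 2  # TRUE where the rule of thumb is satisfied
```

Because each group here is the previous one multiplied by 10, the ratio is 100 on the raw scale, 10 after the square root, 100 after the reciprocal, and 1 after the logarithm, so only the log transformation satisfies the rule.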
5.7: Multiple Comparisons and Fisher’s Least Significant Difference
Fisher’s LSD is one of the oldest methods for pairwise comparisons following ANOVA. It is based on the t-distribution and is used when the assumption of equal variances between groups is met. Fisher’s LSD compares the difference between means of all possible pairs of groups and determines whether these differences are statistically significant. It tends to be more liberal (i.e., it may find more significant differences) compared to other methods.
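To make the idea concrete, the sketch below computes the LSD comparisons by hand on the small four-group data from Section 5.3, using only base R. It assumes the usual form of the threshold for groups i and j, namely the t critical value times sqrt(MSE * (1/n_i + 1/n_j)), where MSE is the residual mean square from the ANOVA; the names `pair_idx` and `lsd_table` are illustrative.

```r
# Four-group data from Section 5.3
D <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), c(5, 5, 5, 6))),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

fit    <- aov(y ~ group, data = D)
mse    <- deviance(fit) / df.residual(fit)  # residual mean square (pooled error variance)
means  <- tapply(D$y, D$group, mean)
n      <- tapply(D$y, D$group, length)
alpha  <- 0.05
t_crit <- qt(1 - alpha / 2, df.residual(fit))

# All pairs of groups, with the mean difference and LSD threshold for each pair
pair_idx <- t(combn(levels(D$group), 2))
lsd_table <- data.frame(
  pair = paste(pair_idx[, 2], pair_idx[, 1], sep = "-"),
  diff = as.numeric(means[pair_idx[, 2]] - means[pair_idx[, 1]]),
  lsd  = as.numeric(t_crit * sqrt(mse * (1 / n[pair_idx[, 2]] + 1 / n[pair_idx[, 1]])))
)

# A pair is declared different when the absolute mean difference exceeds its LSD
lsd_table$significant <- abs(lsd_table$diff) > lsd_table$lsd
lsd_table
```

No multiplicity adjustment is applied, which is why the method is more liberal than Tukey's HSD or a Bonferroni correction.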
Using the EVdata dataset from earlier, we can perform the LSD test as follows:
library(agricolae)
library(ggplot2)
# Perform ANOVA test
model <- aov(Efficiency_WhKm ~ Brand, data = EVdata)
# Conduct LSD test
out <- LSD.test(model, "Brand")
out
## $statistics
## MSerror Df Mean CV
## 531.2094 70 189.165 12.18406
##
## $parameters
## test p.ajusted name.t ntr alpha
## Fisher-LSD none Brand 33 0.05
##
## $means
## Efficiency_WhKm std r se LCL UCL Min Max
## Aiways 188.0000 NA 1 23.047980 142.03225 233.9677 188 188
## Audi 224.5556 28.535553 9 7.682660 209.23297 239.8781 188 270
## BMW 177.5000 20.338797 4 11.523990 154.51613 200.4839 161 206
## Byton 234.6667 11.372481 3 13.306758 208.12718 261.2062 222 244
## Citroen 180.0000 NA 1 23.047980 134.03225 225.9677 180 180
## CUPRA 181.0000 NA 1 23.047980 135.03225 226.9677 181 181
## DS 180.0000 NA 1 23.047980 134.03225 225.9677 180 180
## Fiat 168.0000 0.000000 2 16.297383 135.49589 200.5041 168 168
## Ford 202.2500 6.652067 4 11.523990 179.26613 225.2339 194 209
## Honda 168.0000 0.000000 2 16.297383 135.49589 200.5041 168 168
## Hyundai 155.6667 3.785939 3 13.306758 129.12718 182.2062 153 160
## Jaguar 232.0000 NA 1 23.047980 186.03225 277.9677 232 232
## Kia 172.0000 3.464102 5 10.307370 151.44260 192.5574 167 175
## Lexus 193.0000 NA 1 23.047980 147.03225 238.9677 193 193
## Lightyear 104.0000 NA 1 23.047980 58.03225 149.9677 104 104
## Lucid 180.0000 NA 1 23.047980 134.03225 225.9677 180 180
## Mazda 178.0000 NA 1 23.047980 132.03225 223.9677 178 178
## Mercedes 220.0000 51.117512 3 13.306758 193.46051 246.5395 171 273
## MG 193.0000 NA 1 23.047980 147.03225 238.9677 193 193
## Mini 156.0000 NA 1 23.047980 110.03225 201.9677 156 156
## Nissan 194.7500 20.885744 8 8.148692 178.49795 211.0021 164 232
## Opel 171.0000 6.244998 3 13.306758 144.46051 197.5395 164 176
## Peugeot 172.0000 11.313708 2 16.297383 139.49589 204.5041 164 180
## Polestar 181.0000 NA 1 23.047980 135.03225 226.9677 181 181
## Porsche 209.4000 12.601587 5 10.307370 188.84260 229.9574 195 223
## Renault 170.4000 13.427584 5 10.307370 149.84260 190.9574 161 194
## SEAT 166.0000 NA 1 23.047980 120.03225 211.9677 166 166
## Skoda 182.5000 10.074721 6 9.409299 163.73375 201.2663 166 193
## Smart 173.0000 5.196152 3 13.306758 146.46051 199.5395 167 176
## Sono 156.0000 NA 1 23.047980 110.03225 201.9677 156 156
## Tesla 201.3846 39.075863 13 6.392360 188.63546 214.1338 153 267
## Volkswagen 170.8750 5.792544 8 8.148692 154.62295 187.1271 166 183
## Volvo 200.0000 NA 1 23.047980 154.03225 245.9677 200 200
## Q25 Q50 Q75
## Aiways 188.00 188.0 188.00
## Audi 197.00 228.0 237.00
## BMW 164.00 171.5 185.00
## Byton 230.00 238.0 241.00
## Citroen 180.00 180.0 180.00
## CUPRA 181.00 181.0 181.00
## DS 180.00 180.0 180.00
## Fiat 168.00 168.0 168.00
## Ford 198.50 203.0 206.75
## Honda 168.00 168.0 168.00
## Hyundai 153.50 154.0 157.00
## Jaguar 232.00 232.0 232.00
## Kia 170.00 173.0 175.00
## Lexus 193.00 193.0 193.00
## Lightyear 104.00 104.0 104.00
## Lucid 180.00 180.0 180.00
## Mazda 178.00 178.0 178.00
## Mercedes 193.50 216.0 244.50
## MG 193.00 193.0 193.00
## Mini 156.00 156.0 156.00
## Nissan 186.25 196.0 201.75
## Opel 168.50 173.0 174.50
## Peugeot 168.00 172.0 176.00
## Polestar 181.00 181.0 181.00
## Porsche 197.00 215.0 217.00
## Renault 164.00 165.0 168.00
## SEAT 166.00 166.0 166.00
## Skoda 179.50 182.0 190.50
## Smart 171.50 176.0 176.00
## Sono 156.00 156.0 156.00
## Tesla 171.00 188.0 216.00
## Volkswagen 166.75 169.5 172.00
## Volvo 200.00 200.0 200.00
##
## $comparison
## NULL
##
## $groups
## Efficiency_WhKm groups
## Byton 234.6667 a
## Jaguar 232.0000 ab
## Audi 224.5556 ab
## Mercedes 220.0000 ab
## Porsche 209.4000 ab
## Ford 202.2500 abc
## Tesla 201.3846 bc
## Volvo 200.0000 bcd
## Nissan 194.7500 bcd
## Lexus 193.0000 bcd
## MG 193.0000 bcd
## Aiways 188.0000 bcd
## Skoda 182.5000 bcd
## CUPRA 181.0000 bcd
## Polestar 181.0000 bcd
## Citroen 180.0000 bcd
## DS 180.0000 bcd
## Lucid 180.0000 bcd
## Mazda 178.0000 bcd
## BMW 177.5000 cd
## Smart 173.0000 cd
## Kia 172.0000 cd
## Peugeot 172.0000 cd
## Opel 171.0000 cd
## Volkswagen 170.8750 d
## Renault 170.4000 d
## Fiat 168.0000 d
## Honda 168.0000 d
## SEAT 166.0000 de
## Mini 156.0000 de
## Sono 156.0000 de
## Hyundai 155.6667 de
## Lightyear 104.0000 e
##
## attr(,"class")
## [1] "group"
plot(out)
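The chart displays the mean energy consumption (Efficiency_WhKm, in Wh/km) of each brand together with the LSD grouping letters; brands that share a letter do not differ significantly at the 0.05 level. Byton has the highest mean consumption (about 235 Wh/km), followed by Jaguar, Audi, Mercedes, and Porsche, all of which share the letter a. Lightyear has by far the lowest mean (104 Wh/km) and belongs only to group e. Most of the remaining brands fall in the overlapping groups b, c, and d, so many neighbouring brands in the middle of the ranking cannot be distinguished from one another. Because a lower Wh/km value means less energy used per kilometre, the brands at the bottom of the list are the most efficient. Since the equal variances assumption was violated and many brands have only one observation, these comparisons should be read with caution.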