Project3-One-way Anova

This is the Chapter 5 project for STAT 321, using data from https://www.kaggle.com/datasets/geoffnel/evs-one-electric-vehicle-dataset.

5.1: Overview of one-way ANOVA
In a one-way ANOVA, the categorical variable divides the dataset into groups, and the continuous outcome variable is measured for each group. The goal is to assess whether there are statistically significant differences in the means of the outcome variable across the groups.

Interpretation of Results:

Regardless of whether it’s an experiment or observational study, the interpretation of the results from a one-way ANOVA typically involves:

Assessing Overall Significance: The ANOVA test provides an F-statistic and an associated p-value. A small p-value (conventionally < 0.05) indicates that at least one group mean differs significantly from the others.

Post-hoc Tests (if needed): If the overall ANOVA test is significant, post-hoc tests (e.g., Fisher’s Least Significant Difference (LSD), Tukey’s HSD (Honestly Significant Difference), Bonferroni correction) may be conducted to determine which specific group means are significantly different from each other.
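
As a hedged illustration of one of the post-hoc procedures named above, the sketch below runs Tukey's HSD with base R's TukeyHSD() on the small toy data set used in Section 5.3. The object names (toy, fit, tk) are placeholders chosen for this example and are not part of the project's own code.

```r
# Toy data: four groups of unequal size (same values as the Section 5.3 example)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), c(5, 5, 5, 6))),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

# Fit the one-way ANOVA, then compare every pair of group means
fit <- aov(y ~ group, data = toy)
tk <- TukeyHSD(fit, "group", conf.level = 0.95)

# One row per pair: 'diff' is the difference in means, 'lwr'/'upr' the
# confidence limits, and 'p adj' the p-value adjusted for multiple comparisons
tk$group
```

A pair of groups is declared different when its adjusted p-value is below the chosen significance level, or equivalently when its confidence interval excludes zero.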

5.3: Fitting the model

In the following, we provide R code for calculating group means:

D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)), 
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20) 
)

# Aggregate the data by group and calculate the mean of y for each group
aggregate(y ~ group, data = D, FUN = mean)

##   group    y
## 1     A  4.8
## 2     B  6.2
## 3     C  8.8
## 4     D 15.0

5.4: Statistical Inference Based on the ANOVA Model
To make formal inference about the parameters, we must make some assumptions on the model:

Independence: Observations within each group are independent of each other.

Normality: The residuals (error terms ϵij) are normally distributed with mean zero within each group.

Homogeneity of variances: The variance of the residuals is the same for all groups, denoted by σ².

The main goal of one-way ANOVA is to test the null hypothesis that all group means are equal against the alternative hypothesis that at least one group mean is different. This is typically done using an F-test, where the F-statistic is calculated as the ratio of the between-group variability (representing signal) to the within-group variability (representing noise).
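
For reference, the model, hypotheses, and F-statistic described above can be written out as follows. This is the standard textbook form. The symbols k (number of groups), n_i (size of group i), N (total sample size), μ_i (mean of group i), and the sample means ȳ_i and ȳ are notation assumed here, while ϵij and σ² are the ones used in the text.

```latex
y_{ij} = \mu_i + \epsilon_{ij}, \qquad
\epsilon_{ij} \sim N(0, \sigma^2) \ \text{independently}, \qquad
i = 1, \dots, k, \quad j = 1, \dots, n_i

H_0 : \mu_1 = \mu_2 = \dots = \mu_k
\qquad \text{vs.} \qquad
H_a : \text{at least one } \mu_i \text{ differs}

F = \frac{\mathrm{SSG} / (k - 1)}{\mathrm{SSE} / (N - k)}, \qquad
\mathrm{SSG} = \sum_{i=1}^{k} n_i \, (\bar{y}_i - \bar{y})^2, \qquad
\mathrm{SSE} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2
```

Under the null hypothesis and the three assumptions, F follows an F distribution with k - 1 and N - k degrees of freedom.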

In R, one-way ANOVA can be performed using the aov() function.

We demonstrate how to perform a one-way ANOVA analysis using R for an experiment.

Example 1: We use the data given before to perform a one-way ANOVA analysis.

# Create the data frame D
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)), 
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20) 
)

# Perform one-way ANOVA
anova_result <- aov(y ~ group, data = D)
summary(anova_result)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        3  343.6  114.53   29.32 6.04e-07 ***
## Residuals   17   66.4    3.91                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA table produces the same sums of squares as we did by hand. The p-value (6.04e-07 or 0.000000604) shows that differences in the mean of y for different groups are statistically significant at any commonly used significance level.
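
The sums of squares in the table can be reproduced directly from their definitions. The sketch below is one way to do the hand calculation in R. The variable names (grand_mean, SSG, SSE, F_stat, and so on) are chosen for this illustration only.

```r
# Toy data from Example 1
D <- data.frame(
  group = rep(c("A", "B", "C", "D"), c(5, 5, 5, 6)),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

grand_mean <- mean(D$y)                     # overall mean of y
group_mean <- tapply(D$y, D$group, mean)    # one mean per group
group_n    <- tapply(D$y, D$group, length)  # group sizes

# Between-group sum of squares: group means around the grand mean
SSG <- sum(group_n * (group_mean - grand_mean)^2)

# Within-group sum of squares: observations around their own group mean
# (ave() returns the group mean for every observation)
SSE <- sum((D$y - ave(D$y, D$group))^2)

k <- length(group_mean)  # number of groups
N <- nrow(D)             # total number of observations

# F-statistic and its upper-tail p-value
F_stat  <- (SSG / (k - 1)) / (SSE / (N - k))
p_value <- pf(F_stat, k - 1, N - k, lower.tail = FALSE)

c(SSG = SSG, SSE = SSE, F = F_stat, p = p_value)
```

The values of SSG, SSE, and F should agree with the "Sum Sq" and "F value" columns of the summary(anova_result) output above.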

Example 2: Electric vehicles study

# Read the CSV file
EVdata <- read.csv("EVdata.csv")

# Print the "Brand" and "Efficiency_WhKm" columns
EVdata[, c("Brand", "Efficiency_WhKm")]

##           Brand Efficiency_WhKm
## 1        Tesla              161
## 2   Volkswagen              167
## 3     Polestar              181
## 4          BMW              206
## 5        Honda              168
## 6        Lucid              180
## 7   Volkswagen              168
## 8      Peugeot              164
## 9        Tesla              153
## 10        Audi              193
## 11    Mercedes              216
## 12      Nissan              164
## 13     Hyundai              160
## 14         BMW              178
## 15     Hyundai              153
## 16  Volkswagen              175
## 17     Porsche              223
## 18  Volkswagen              166
## 19          MG              193
## 20        Mini              156
## 21        Opel              164
## 22       Tesla              171
## 23       Skoda              179
## 24        Audi              197
## 25       Tesla              167
## 26  Volkswagen              183
## 27  Volkswagen              166
## 28       Volvo              200
## 29         BMW              161
## 30     Peugeot              180
## 31        Audi              231
## 32         Kia              173
## 33     Renault              165
## 34       Tesla              267
## 35       Mazda              178
## 36      Nissan              172
## 37       Lexus              193
## 38       CUPRA              181
## 39     Renault              168
## 40    Mercedes              171
## 41       Tesla              184
## 42     Hyundai              154
## 43        Audi              228
## 44       Skoda              166
## 45        SEAT              166
## 46         Kia              175
## 47        Opel              173
## 48     Porsche              195
## 49   Lightyear              104
## 50      Aiways              188
## 51        Audi              237
## 52       Tesla              206
## 53        Opel              176
## 54       Skoda              183
## 55       Tesla              211
## 56       Honda              168
## 57          DS              180
## 58     Renault              164
## 59     Citroen              180
## 60       Tesla              188
## 61     Renault              161
## 62       Tesla              177
## 63      Nissan              198
## 64      Jaguar              232
## 65        Ford              200
## 66     Porsche              197
## 67      Nissan              200
## 68       Tesla              261
## 69     Renault              194
## 70        Ford              209
## 71         BMW              165
## 72       Skoda              193
## 73     Porsche              217
## 74       Byton              244
## 75        Sono              156
## 76         Kia              167
## 77        Audi              188
## 78       Smart              176
## 79        Ford              206
## 80     Porsche              215
## 81  Volkswagen              171
## 82       Tesla              216
## 83       Smart              167
## 84        Ford              194
## 85    Mercedes              273
## 86        Fiat              168
## 87       Tesla              256
## 88        Audi              219
## 89       Skoda              193
## 90       Skoda              181
## 91        Audi              270
## 92       Smart              176
## 93         Kia              175
## 94      Nissan              207
## 95        Fiat              168
## 96  Volkswagen              171
## 97         Kia              170
## 98       Byton              222
## 99      Nissan              191
## 100       Audi              258
## 101     Nissan              194
## 102     Nissan              232
## 103      Byton              238

We then visualize the data:

# Create a boxplot of Efficiency_WhKm by Brand using base R graphics
boxplot(Efficiency_WhKm ~ Brand, data = EVdata, 
        main = "Boxplot of Efficiency by Brand",
        xlab = "Brand", ylab = "Efficiency (Wh/Km)")

It seems there are differences in Efficiency by brand. Are the differences significant?

We perform one-way ANOVA:

# Perform ANOVA test
anova_result <- aov(Efficiency_WhKm ~ Brand, data = EVdata)

# Summarize the ANOVA results
summary(anova_result)

##             Df Sum Sq Mean Sq F value  Pr(>F)    
## Brand       32  51984  1624.5   3.058 4.9e-05 ***
## Residuals   70  37185   531.2                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results appear to be statistically significant. This is indicated by the p-value (Pr(>F)) being less than the conventional significance level of 0.05.

How can we check whether the assumptions of the ANOVA are met?

We can use graphical methods, formal test procedures, or both.

# graphical check of ANOVA assumptions
plot(anova_result, 1) # Identify outliers and graphically check equal variances

plot(anova_result, 2) # Graphically check normality

## Warning: not plotting observations with leverage one:
##   3, 6, 19, 20, 28, 35, 37, 38, 45, 49, 50, 57, 59, 64, 75

Here’s how you can interpret the graphs:

Graphically Check Equal Variances (Plot Type 1): In the residuals-versus-fitted plot, equal variances show up as a roughly equal vertical spread of residuals across the fitted values (the group means). A spread that clearly widens or narrows indicates that the assumption of equal variances may be violated.

Graphically Check Normality (Plot Type 2): If the points in the normal Q-Q plot fall approximately along a straight line, the residuals are approximately normally distributed. Systematic curvature, or points far from the line at either end, may indicate skewness or outliers, that is, departures from normality.

A formal test available from the “car” package for testing equal variances is shown below:

library(car)

## Loading required package: carData

# Perform Levene's test
leveneTest(Efficiency_WhKm ~ Brand, data = EVdata)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group 32  1.6892 0.03483 *
##       70                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Levene's test has the null hypothesis that all group variances are equal. The p-value is 0.03483, which is below the conventional 0.05 significance level (marked by a single asterisk in the output), so we reject the null hypothesis: the variance of efficiency differs significantly across brands.

There is therefore evidence of heterogeneity of variances across the groups (brands) in the dataset. We can also use the following R code to calculate the group standard deviations:

aggregate(Efficiency_WhKm ~ Brand, data = EVdata, sd)

##          Brand Efficiency_WhKm
## 1      Aiways               NA
## 2        Audi        28.535553
## 3         BMW        20.338797
## 4       Byton        11.372481
## 5     Citroen               NA
## 6       CUPRA               NA
## 7          DS               NA
## 8        Fiat         0.000000
## 9        Ford         6.652067
## 10      Honda         0.000000
## 11    Hyundai         3.785939
## 12     Jaguar               NA
## 13        Kia         3.464102
## 14      Lexus               NA
## 15  Lightyear               NA
## 16      Lucid               NA
## 17      Mazda               NA
## 18   Mercedes        51.117512
## 19         MG               NA
## 20       Mini               NA
## 21     Nissan        20.885744
## 22       Opel         6.244998
## 23    Peugeot        11.313708
## 24   Polestar               NA
## 25    Porsche        12.601587
## 26    Renault        13.427584
## 27       SEAT               NA
## 28      Skoda        10.074721
## 29      Smart         5.196152
## 30       Sono               NA
## 31      Tesla        39.075863
## 32 Volkswagen         5.792544
## 33      Volvo               NA

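As a rule of thumb, the equal-variances assumption is considered reasonable if the largest group standard deviation is no more than twice the smallest. Here the largest standard deviation (51.1, Mercedes) is far more than double the smallest nonzero one (3.46, Kia), and two brands (Fiat and Honda) have a standard deviation of 0, so the assumption is violated. Brands shown as NA have only one observation, so no standard deviation can be computed for them.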
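
The rule of thumb can be checked directly from the standard deviations reported above. In the sketch below the values are transcribed from the aggregate() output, and sd_ratio() is a hypothetical helper written for this illustration, not a function from any package.

```r
# Group standard deviations transcribed from the aggregate() output above;
# brands with a single observation (NA) are left out
brand_sd <- c(
  Audi = 28.535553, BMW = 20.338797, Byton = 11.372481, Fiat = 0,
  Ford = 6.652067, Honda = 0, Hyundai = 3.785939, Kia = 3.464102,
  Mercedes = 51.117512, Nissan = 20.885744, Opel = 6.244998,
  Peugeot = 11.313708, Porsche = 12.601587, Renault = 13.427584,
  Skoda = 10.074721, Smart = 5.196152, Tesla = 39.075863,
  Volkswagen = 5.792544
)

# Hypothetical helper: ratio of the largest to the smallest standard deviation
sd_ratio <- function(s) max(s) / min(s)

# Infinite, because two brands have a standard deviation of exactly 0
sd_ratio(brand_sd)

# Ratio among brands with a nonzero standard deviation
sd_ratio(brand_sd[brand_sd > 0])

# Rule of thumb: a ratio above 2 signals unequal variances
sd_ratio(brand_sd[brand_sd > 0]) > 2
```

Excluding the zero values is a judgment call made here so that the ratio is finite; either way the ratio is well above 2.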

The Shapiro-Wilk test can be used to check normality:

shapiro.test(anova_result$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  anova_result$residuals
## W = 0.87429, p-value = 7.447e-08

The p-value (7.447e-08) is far below 0.05, so we reject the null hypothesis that the residuals are normally distributed; the normality assumption is also violated.

5.5: Transformation of the response variable

The one-way ANOVA assumes that the response variable is normally distributed within each group and exhibits equal variance across groups. However, in real-world data, these assumptions may not always hold true. When these conditions are violated, one approach to address the issue is to apply transformations to the response variable to stabilize variation.

Several commonly used transformations for the response variable y include:

the natural logarithmic transformation: log(y)

the square-root transformation: √y

the reciprocal transformation: 1/y

These transformations can help to achieve approximate normality and homogeneity of variances, thereby improving the validity of the ANOVA analysis.

For each transformation, the standard deviation of each group can be calculated. If the maximum standard deviation is no greater than twice the smallest standard deviation, the homogeneity in variance assumption is satisfied.

Example: Use the following data to find a transformation.

Group A: 0.9, 1, 1.1
Group B: 9, 10, 11
Group C: 90, 100, 110

Solution.

Plot without log-transformation:

# Create a data frame with two variables: 'group' and 'y'
library(ggplot2)
D <- data.frame(
  group = rep(c("A", "B", "C"), each = 3),  # 'group' variable contains repetitions of "A", "B", and "C", each 3 times
  y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)  
)

# Convert the 'group' variable to a factor
D$group = factor(D$group)

# Plot the data using ggplot2
ggplot(D, aes(x=group, y=y)) +
  geom_point()

The spread differs greatly between groups: it grows with the group mean.

Plot after log-transformation:

ggplot(D, aes(x=group, y=log(y))) +
  geom_point()

After log-transformation, the data in different groups have the same variation. Now, the 1-way ANOVA can be done with the transformed data.

aov_result = aov(log(y)~group, data = D)
summary(aov_result)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## group        2  31.81   15.91    1579 6.82e-09 ***
## Residuals    6   0.06    0.01                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
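
The choice of transformation can also be checked numerically with the twice-the-smallest rule of thumb. The sketch below computes, for each candidate transformation, the ratio of the largest to the smallest group standard deviation on the example data. The names transforms and ratio_by_transform are illustrative only.

```r
# Example data: three groups whose spread grows with the mean
D <- data.frame(
  group = factor(rep(c("A", "B", "C"), each = 3)),
  y = c(0.9, 1, 1.1, 9, 10, 11, 90, 100, 110)
)

# Candidate transformations of the response
transforms <- list(
  none       = identity,
  log        = log,
  sqrt       = sqrt,
  reciprocal = function(y) 1 / y
)

# For each transformation: largest group SD divided by smallest group SD
ratio_by_transform <- sapply(transforms, function(f) {
  s <- tapply(f(D$y), D$group, sd)
  max(s) / min(s)
})

round(ratio_by_transform, 2)

# Transformations that satisfy the rule of thumb (ratio no greater than 2)
names(ratio_by_transform)[ratio_by_transform <= 2]
```

For this data only the log transformation brings the ratio below 2, which matches what the two plots show.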

5.7: Multiple Comparisons and Fisher’s Least Significant Difference

Fisher’s LSD is one of the oldest methods for pairwise comparisons following ANOVA. It is based on the t-distribution and is used when the assumption of equal variances between groups is met. Fisher’s LSD compares the difference between means of all possible pairs of groups and determines whether these differences are statistically significant. It tends to be more liberal (i.e., it may find more significant differences) compared to other methods.
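
To make the idea concrete, the sketch below computes the least significant difference by hand for the toy data from Section 5.3, using the pooled error variance (MSE) from the ANOVA and the t-distribution. The helper lsd() and the object names are hypothetical, written for this illustration. The call to base R's pairwise.t.test() with no p-value adjustment is shown as a cross-check, since it performs the same unadjusted pooled-SD comparisons.

```r
# Toy data: four groups of unequal size (same values as the Section 5.3 example)
toy <- data.frame(
  group = factor(rep(c("A", "B", "C", "D"), c(5, 5, 5, 6))),
  y = c(3, 6, 4, 4, 7, 5, 7, 8, 6, 5, 8, 8, 9, 10, 9, 13, 15, 12, 13, 17, 20)
)

fit <- aov(y ~ group, data = toy)
mse <- deviance(fit) / df.residual(fit)  # pooled error variance (MSE)

means <- tapply(toy$y, toy$group, mean)    # group means
n     <- tapply(toy$y, toy$group, length)  # group sizes

# Hypothetical helper: least significant difference for groups i and j
lsd <- function(i, j, alpha = 0.05) {
  qt(1 - alpha / 2, df.residual(fit)) * sqrt(mse * (1 / n[[i]] + 1 / n[[j]]))
}

# Two means differ significantly when |difference| exceeds the LSD
ab_differs <- abs(means[["B"]] - means[["A"]]) > lsd("A", "B")
ad_differs <- abs(means[["D"]] - means[["A"]]) > lsd("A", "D")
c(A_vs_B = ab_differs, A_vs_D = ad_differs)

# Cross-check: unadjusted pairwise t-tests with a pooled standard deviation
pw <- pairwise.t.test(toy$y, toy$group, p.adjust.method = "none")
pw$p.value
```

In this toy example groups A and B are not declared different, while A and D are. Because no adjustment is made for the number of comparisons, the chance of at least one false positive grows with the number of groups.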

Using the EV data set from earlier, we can run Fisher's LSD test as follows:

library(agricolae)
library(ggplot2)
# Perform ANOVA test
model <- aov(Efficiency_WhKm ~ Brand, data = EVdata)

# Conduct LSD test
out <- LSD.test(model, "Brand")
out

## $statistics
##    MSerror Df    Mean       CV
##   531.2094 70 189.165 12.18406
## 
## $parameters
##         test p.ajusted name.t ntr alpha
##   Fisher-LSD      none  Brand  33  0.05
## 
## $means
##             Efficiency_WhKm       std  r        se       LCL      UCL Min Max
## Aiways             188.0000        NA  1 23.047980 142.03225 233.9677 188 188
## Audi               224.5556 28.535553  9  7.682660 209.23297 239.8781 188 270
## BMW                177.5000 20.338797  4 11.523990 154.51613 200.4839 161 206
## Byton              234.6667 11.372481  3 13.306758 208.12718 261.2062 222 244
## Citroen            180.0000        NA  1 23.047980 134.03225 225.9677 180 180
## CUPRA              181.0000        NA  1 23.047980 135.03225 226.9677 181 181
## DS                 180.0000        NA  1 23.047980 134.03225 225.9677 180 180
## Fiat               168.0000  0.000000  2 16.297383 135.49589 200.5041 168 168
## Ford               202.2500  6.652067  4 11.523990 179.26613 225.2339 194 209
## Honda              168.0000  0.000000  2 16.297383 135.49589 200.5041 168 168
## Hyundai            155.6667  3.785939  3 13.306758 129.12718 182.2062 153 160
## Jaguar             232.0000        NA  1 23.047980 186.03225 277.9677 232 232
## Kia                172.0000  3.464102  5 10.307370 151.44260 192.5574 167 175
## Lexus              193.0000        NA  1 23.047980 147.03225 238.9677 193 193
## Lightyear          104.0000        NA  1 23.047980  58.03225 149.9677 104 104
## Lucid              180.0000        NA  1 23.047980 134.03225 225.9677 180 180
## Mazda              178.0000        NA  1 23.047980 132.03225 223.9677 178 178
## Mercedes           220.0000 51.117512  3 13.306758 193.46051 246.5395 171 273
## MG                 193.0000        NA  1 23.047980 147.03225 238.9677 193 193
## Mini               156.0000        NA  1 23.047980 110.03225 201.9677 156 156
## Nissan             194.7500 20.885744  8  8.148692 178.49795 211.0021 164 232
## Opel               171.0000  6.244998  3 13.306758 144.46051 197.5395 164 176
## Peugeot            172.0000 11.313708  2 16.297383 139.49589 204.5041 164 180
## Polestar           181.0000        NA  1 23.047980 135.03225 226.9677 181 181
## Porsche            209.4000 12.601587  5 10.307370 188.84260 229.9574 195 223
## Renault            170.4000 13.427584  5 10.307370 149.84260 190.9574 161 194
## SEAT               166.0000        NA  1 23.047980 120.03225 211.9677 166 166
## Skoda              182.5000 10.074721  6  9.409299 163.73375 201.2663 166 193
## Smart              173.0000  5.196152  3 13.306758 146.46051 199.5395 167 176
## Sono               156.0000        NA  1 23.047980 110.03225 201.9677 156 156
## Tesla              201.3846 39.075863 13  6.392360 188.63546 214.1338 153 267
## Volkswagen         170.8750  5.792544  8  8.148692 154.62295 187.1271 166 183
## Volvo              200.0000        NA  1 23.047980 154.03225 245.9677 200 200
##                Q25   Q50    Q75
## Aiways      188.00 188.0 188.00
## Audi        197.00 228.0 237.00
## BMW         164.00 171.5 185.00
## Byton       230.00 238.0 241.00
## Citroen     180.00 180.0 180.00
## CUPRA       181.00 181.0 181.00
## DS          180.00 180.0 180.00
## Fiat        168.00 168.0 168.00
## Ford        198.50 203.0 206.75
## Honda       168.00 168.0 168.00
## Hyundai     153.50 154.0 157.00
## Jaguar      232.00 232.0 232.00
## Kia         170.00 173.0 175.00
## Lexus       193.00 193.0 193.00
## Lightyear   104.00 104.0 104.00
## Lucid       180.00 180.0 180.00
## Mazda       178.00 178.0 178.00
## Mercedes    193.50 216.0 244.50
## MG          193.00 193.0 193.00
## Mini        156.00 156.0 156.00
## Nissan      186.25 196.0 201.75
## Opel        168.50 173.0 174.50
## Peugeot     168.00 172.0 176.00
## Polestar    181.00 181.0 181.00
## Porsche     197.00 215.0 217.00
## Renault     164.00 165.0 168.00
## SEAT        166.00 166.0 166.00
## Skoda       179.50 182.0 190.50
## Smart       171.50 176.0 176.00
## Sono        156.00 156.0 156.00
## Tesla       171.00 188.0 216.00
## Volkswagen  166.75 169.5 172.00
## Volvo       200.00 200.0 200.00
## 
## $comparison
## NULL
## 
## $groups
##             Efficiency_WhKm groups
## Byton              234.6667      a
## Jaguar             232.0000     ab
## Audi               224.5556     ab
## Mercedes           220.0000     ab
## Porsche            209.4000     ab
## Ford               202.2500    abc
## Tesla              201.3846     bc
## Volvo              200.0000    bcd
## Nissan             194.7500    bcd
## Lexus              193.0000    bcd
## MG                 193.0000    bcd
## Aiways             188.0000    bcd
## Skoda              182.5000    bcd
## CUPRA              181.0000    bcd
## Polestar           181.0000    bcd
## Citroen            180.0000    bcd
## DS                 180.0000    bcd
## Lucid              180.0000    bcd
## Mazda              178.0000    bcd
## BMW                177.5000     cd
## Smart              173.0000     cd
## Kia                172.0000     cd
## Peugeot            172.0000     cd
## Opel               171.0000     cd
## Volkswagen         170.8750      d
## Renault            170.4000      d
## Fiat               168.0000      d
## Honda              168.0000      d
## SEAT               166.0000     de
## Mini               156.0000     de
## Sono               156.0000     de
## Hyundai            155.6667     de
## Lightyear          104.0000      e
## 
## attr(,"class")
## [1] "group"

plot(out)

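The plot and the $groups table show significant differences in mean energy consumption (Efficiency_WhKm, in Wh/km; lower values mean a more efficient vehicle) between brands. Brands that share a letter are not significantly different at the 0.05 level. Byton has the highest mean consumption (234.7 Wh/km, group a) and differs significantly from every brand whose label lacks the letter a, that is, Tesla and all brands below it in the table. Lightyear has the lowest mean consumption (104 Wh/km, group e) and differs significantly from every brand except SEAT, Mini, Sono, and Hyundai (group de). Most of the remaining brands share the letters b, c, and d, so their means cannot be distinguished from one another. Because 15 of the 33 brands have only one observation and the equal-variances assumption is violated, these comparisons should be read with caution.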

Abdul

2024-04-05