A usage of ANOVA

In this blog, I am going to use ANOVA to see the average difference of multiple groups. ANOVA is useful when you have to check the average across different groups. It verifies the average difference statistically across the groups. It is a very useful tool in different areas. I have been using it to check the difference on supply chain performance based on any factor.

library(tidyverse)
library(reshape2)

# Reading SIP 
sip <- read_csv("C:/Users/hukha/Desktop/MS - Data Science/Data 698 - Final Project/sip.csv") %>% select(Pjct_Name, SIP_YR, DateComplt)

# Converting text into date
sip$DateComplt <- as.Date(sip$DateComplt, format="%m/%d/%Y")

sip <- sip %>% 
  filter(DateComplt >= '2012-07-01') %>% 
  mutate(Pjct_Name = str_trim(Pjct_Name, side=c("both")),
         Pjct_Name = gsub("at.*", "", Pjct_Name),
         Pjct_Name = gsub("and.*", "", Pjct_Name),
         Pjct_Name = gsub("&.*", "", Pjct_Name),
         Pjct_Name = gsub("Avenue", "Ave", Pjct_Name),
         Pjct_Name = gsub("Ave", "Avenue", Pjct_Name),
         Pjct_Name = str_to_upper(Pjct_Name),
         Pjct_Name = gsub("TH ", " ", Pjct_Name),
         Pjct_Name = gsub("RD ", " ", Pjct_Name),
         Pjct_Name = gsub("BLVD", "BOULEVARD", Pjct_Name)) %>% 
  rename(`ON STREET NAME` = Pjct_Name) %>% 
  select(`ON STREET NAME`) %>% 
  unique()


# Before1
before_sip <- crash_joined %>% 
  filter(`CRASH DATE` >= "2012-11-07" & `CRASH DATE` <= "2014-11-06") %>% 
  mutate(`ON STREET NAME` = str_trim(`ON STREET NAME`, side=c("both"))) %>% 
  group_by(`ON STREET NAME`) %>% 
  count() %>% 
  filter(!is.na(`ON STREET NAME`)) %>% 
  arrange(desc(n))

#After1
after_sip1 <- crash_joined %>% 
  filter(`CRASH DATE` >= "2014-11-07" & `CRASH DATE` <= "2016-11-06") %>% 
  mutate(`ON STREET NAME` = str_trim(`ON STREET NAME`, side=c("both"))) %>% 
  group_by(`ON STREET NAME`) %>% 
  count() %>% 
  filter(!is.na(`ON STREET NAME`)) %>% 
  arrange(desc(n))

#After2
after_sip2 <- crash_joined %>% 
  filter(`CRASH DATE` >= "2016-11-07" & `CRASH DATE` <= "2018-11-06") %>% 
  mutate(`ON STREET NAME` = str_trim(`ON STREET NAME`, side=c("both"))) %>% 
  group_by(`ON STREET NAME`) %>% 
  count() %>% 
  filter(!is.na(`ON STREET NAME`)) %>% 
  arrange(desc(n))

# Write
#write_csv(sip, "SIP_EXCEL.csv") # List of SIP Projects
#write_csv(before_sip, "BEFORE_SIP2.csv") 2012-2014
#write_csv(after_sip1, "AFTER_SIP1.csv") # 2014-2016
#write_csv(after_sip2, "AFTER_SIP2.csv") # 2016-2018
# These files were exported to excel for vlookup. For some reason R was not merging properly 

# Reading the file
sip_melted <- read_csv("C:/Users/hukha/Desktop/MS - Data Science/Data 698 - Final Project/SIP_EXCEL.csv") %>% 
  drop_na() %>% 
  melt(., id.vars="ON STREET NAME", variable.name = "Groups") %>% 
  arrange(`ON STREET NAME`)


# Convertint Groups into factor
sip_melted$Groups <- as.factor(sip_melted$Groups)


# Plotting the number of accidents throughout years
sip_melted %>% ggplot(., aes(x=`Groups`, y=log10(value)))+geom_boxplot(fill="steelblue") +
  theme_classic() + labs(title="Intersection Improvement Projects", x="Years", y="Number of Accidents")

# ANOVA Testing
sip_anova <- aov(log(value) ~ Groups, data=sip_melted)
summary(sip_anova)

##              Df Sum Sq Mean Sq F value  Pr(>F)   
## Groups        2     73   36.60   5.073 0.00654 **
## Residuals   594   4286    7.22                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Groups with different means
TukeyHSD(sip_anova)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = log(value) ~ Groups, data = sip_melted)
## 
## $Groups
##                             diff        lwr        upr     p adj
## Y2014-2016-Y2012-2014 -0.8119720 -1.4446855 -0.1792585 0.0075352
## Y2016-2018-Y2012-2014 -0.6454756 -1.2781891 -0.0127621 0.0443381
## Y2016-2018-Y2014-2016  0.1664964 -0.4662171  0.7992099 0.8101642

Results from graph shows that there is a difference in the number of accidents in NYC and Intersection Improvement Projects have helped reducing the number of accidents in following years. Null hypothesis is rejected that means there is difference in the average of all groups. Number of accidents dropped in 2014-2016 and then slightly increased again in 2016-2018 which you can see in the result above from Tukey’s Test.

Data 621 - Blog 3

Habib Khan

12/6/2020

A usage of ANOVA