Revised Milestone #2

# final tidy data, may still need some work
#View(stations_income_df2)
head(stations_income_df2)

## # A tibble: 6 x 29
##    ...1 `Station Name`       `Street Address` City  State ZIP   `Groups With A~`
##   <dbl> <chr>                <chr>            <chr> <chr> <chr> <chr>           
## 1     1 LADWP - Truesdale C~ 11797 Truesdale~ Sun ~ CA    91352 Private         
## 2     2 LADWP - Truesdale C~ 11797 Truesdale~ Sun ~ CA    91352 Private         
## 3     3 LADWP - Truesdale C~ 11797 Truesdale~ Sun ~ CA    91352 Private         
## 4     4 LADWP - West LA Dis~ 1394 S Sepulved~ Los ~ CA    90024 Private         
## 5     5 LADWP - West LA Dis~ 1394 S Sepulved~ Los ~ CA    90024 Private         
## 6     6 LADWP - West LA Dis~ 1394 S Sepulved~ Los ~ CA    90024 Private         
## # ... with 22 more variables: `Access Days Time` <chr>, `EV Network` <chr>,
## #   `Geocode Status` <chr>, Latitude <dbl>, Longitude <dbl>,
## #   `Date Last Confirmed` <date>, ID <dbl>, `Open Date` <date>,
## #   `EV Connector Types` <chr>, `Access Code` <chr>,
## #   `Access Detail Code` <chr>, `Facility Type` <chr>, `EV Pricing` <chr>,
## #   `EV On-Site Renewable Source` <chr>, Station.Type <chr>,
## #   Number.Stations <dbl>, household <dbl>, families <dbl>, ...

Introduction

We hope to understand some of the factors that may influence the spatial distribution of electric vehicle (EV) charging stations. Our initial stations data set includes variables such as station access (public or private), station type (Level 1, Level 2, or DC Fast), and the ZIP code, city, and state of each station. Our initial income data set includes various income estimates from the American Community Survey (ACS) by ZIP code. By joining the two data sets, we can relate income by location to EV stations. A third data set containing median income by state is joined as well. More broadly, we want to understand whether factors such as region of the country and median household income are related to the number, type, and accessibility of EV charging stations. This can help us better understand what might affect the installation of new EV infrastructure as interest in electric vehicles grows.
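The joins described above could look roughly like the sketch below. It is not the project's actual code: the table names, the column names `station_id` and `household` as used here, and all income values are made up for illustration, and it assumes ZIP codes are stored as character strings (as in the tibble printed earlier) so that leading zeros survive.

```r
library(dplyr)

# Hypothetical toy tables with made-up values; not the project's real data.
stations <- tibble(
  station_id = 1:4,
  ZIP   = c("91352", "90024", "10001", "99999"),
  State = c("CA", "CA", "NY", "TX")
)
zip_income <- tibble(
  ZIP       = c("91352", "90024", "10001"),
  household = c(61000, 98000, 87000)
)
state_income <- tibble(
  State           = c("CA", "NY"),
  med_income18_20 = c(78000, 71000)
)

# left_join() keeps every station row and fills NA where no income row matches.
stations_income <- stations %>%
  left_join(zip_income, by = "ZIP") %>%
  left_join(state_income, by = "State")

# Count how many stations found no ZIP-level or state-level income.
unmatched <- stations_income %>%
  summarise(
    no_zip_income   = sum(is.na(household)),
    no_state_income = sum(is.na(med_income18_20))
  )
```

Checking `unmatched` after a join is one way to anticipate the "Removed ... rows containing missing values" warnings that show up in later plots.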

Three questions that we want to explore with this data set

How, if at all, does median income by state relate to the number of EV stations by state?
Is there any relation between access to an on-site renewable energy source and whether a station has private or public access (or where the station is located, e.g. hotel, gas station)? Is the number of sites with an on-site renewable energy source related to state or region of the country?
Is the region a state is located in (north, south, east, west) associated with the number of charging stations, whether they are private or public, and what type of station they are (Level 1, Level 2, or DC Fast)?
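For the region question, one possible starting point is the `state.abb` and `state.region` vectors that ship with base R. This is only a sketch under an assumption: those built-in regions are Northeast, South, North Central, and West, which is not exactly the north/south/east/west split named above, so the grouping would need to be checked against what we actually intend. The `example_states` table is hypothetical.

```r
# Base R's datasets package provides state.abb (two-letter codes) and
# state.region (a factor with four regions) for the 50 states.
region_lookup <- data.frame(
  State  = state.abb,
  region = as.character(state.region),
  stringsAsFactors = FALSE
)

# Hypothetical example rows. "PR" and "ON" occur in the station data but are
# not among the 50 states, so they receive NA and need separate handling.
example_states <- data.frame(
  State = c("CA", "NY", "TX", "MA", "PR", "ON"),
  stringsAsFactors = FALSE
)
example_states$region <-
  region_lookup$region[match(example_states$State, region_lookup$State)]
```

The same lookup table could be joined onto the full station data by `State` before grouping by region.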

Graphics

# top 5 states with the most EV stations
ggplot(top_5_num_stations, aes(x = State, y = state_num_stations.y))+
  geom_col()

Variables for Graphic 1

The explanatory variable is the state (categorical) and the response variable is the number of stations by state (numeric). Since this is summarized data, geom_col is used. This graphic shows the states with the most EV stations, which we can then select to compare income and other factors and see whether they have anything in common.

# bottom 5 states with the fewest EV stations
ggplot(bottom_5_num_stations, aes(x = State, y = state_num_stations.y))+
  geom_col()

Variables for Graphic 2

The explanatory variable is the state (categorical) and the response variable is the number of stations by state (numeric). Since this is summarized data, geom_col is used. This graphic shows the states with the fewest EV stations, which we can then select to compare income and other factors and see whether they have anything in common.
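The code that builds the top-five and bottom-five tables is not shown here. One way such tables could be derived from a one-row-per-state summary is with dplyr's `slice_max()` and `slice_min()`, as in the sketch below; the `state_counts` table, its column name, and its values are all hypothetical.

```r
library(dplyr)

# Hypothetical one-row-per-state counts with made-up state codes and numbers.
state_counts <- tibble(
  State              = c("AA", "BB", "CC", "DD", "EE", "FF", "GG"),
  state_num_stations = c(50, 10, 400, 5, 120, 75, 30)
)

# Five states with the most stations, sorted from largest to smallest.
top_5 <- state_counts %>%
  slice_max(state_num_stations, n = 5)

# Five states with the fewest stations, sorted from smallest to largest.
bottom_5 <- state_counts %>%
  slice_min(state_num_stations, n = 5)
```

By default both functions keep ties, so more than five rows can come back when counts are equal. In the plot, `aes(x = reorder(State, state_num_stations), ...)` would order the bars by count instead of alphabetically.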

# median state income vs num stations by state
ggplot(stations_income_df2, aes(med_income18_20, state_num_stations.y))+
  geom_point()

## Warning: Removed 924 rows containing missing values (geom_point).

# remove the number-of-stations outlier (CA)
no_outlier <- stations_income_df2 %>%
  filter(state_num_stations.y < 40000)
ggplot(no_outlier, aes(med_income18_20, state_num_stations.y))+
  geom_point()

## Warning: Removed 924 rows containing missing values (geom_point).

Variables for Graphic 3

The explanatory variable is the median income by state (numeric) and the response variable is the number of EV stations per state (numeric). Since both variables are numeric, geom_point is used. This graphic helps us see whether there may be a relationship between median income and number of EV stations per state. With the CA outlier of over 40,000 stations removed, the plot suggests a possible positive relationship between the two variables.
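Because both variables are state-level values repeated on every station row, one way to put a number on the relationship is to reduce the data to one row per state and compute a correlation. The sketch below uses a hypothetical toy table with made-up values that were constructed to be perfectly monotonic, so it says nothing about the real data. Choosing Spearman's rank correlation is our assumption here, picked because it is less sensitive to a single extreme state than Pearson's.

```r
library(dplyr)

# Hypothetical station-level rows: state-level columns repeat on every row.
rows <- tibble(
  State              = c("AA", "BB", "BB", "CC", "CC", "CC", "DD"),
  med_income         = c(60, 70, 70, 80, 80, 80, NA),
  state_num_stations = c(1, 2, 2, 3, 3, 3, 1)
)

# Keep one row per state and drop states without an income value.
per_state <- rows %>%
  distinct(State, med_income, state_num_stations) %>%
  filter(!is.na(med_income))

# Rank-based correlation between income and station count.
r <- cor(per_state$med_income, per_state$state_num_stations,
         method = "spearman")
```

Plotting `per_state` instead of the station-level table would also draw one point per state rather than many overlapping points.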

# exploring renewable energy on site variable
groupby_renewable <- stations_income_df2 %>% 
  group_by(`EV On-Site Renewable Source`) %>% 
  mutate(`EV On-Site Renewable Source` = as.factor(`EV On-Site Renewable Source`)) %>%
  summarize(Number.Stations, State, City, `Open Date`, `Access Code`, `Facility Type`)

## `summarise()` has grouped output by 'EV On-Site Renewable Source'. You can
## override using the `.groups` argument.

# plotting number of stations per renewable source
ggplot(groupby_renewable, aes(`EV On-Site Renewable Source`, Number.Stations)) +
  geom_col() +
  ylim(0, 1000)

## Warning: Removed 147715 rows containing missing values (position_stack).

## Warning: Removed 51929 rows containing missing values (geom_col).

Variables for Graphic 4

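The explanatory variable is the type of on-site renewable source (factor) and the response variable is the number of EV stations (numeric). Since this is one categorical variable and one summarized numeric variable, geom_col is used. This graphic helps us see which on-site renewable sources are most common at stations in the US. The warnings above show that many rows were dropped from the plot (the y-axis is capped at 1,000 and some values may be missing), so the bar heights should be read with caution.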
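An alternative that avoids stacking individual rows is to collapse the data to one row per renewable source before plotting. The sketch below is an illustration only: the column name `renewable_source`, the source labels, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical station rows; NA means no on-site renewable source is recorded.
stations <- tibble(
  renewable_source = c("SOLAR", "SOLAR", NA, "WIND", NA, NA, "SOLAR"),
  Number.Stations  = c(2, 1, 4, 1, NA, 3, 2)
)

# One row per source: number of sites and total stations at those sites.
by_source <- stations %>%
  filter(!is.na(renewable_source)) %>%
  group_by(renewable_source) %>%
  summarise(
    n_sites        = n(),
    total_stations = sum(Number.Stations, na.rm = TRUE),
    .groups        = "drop"
  )
```

With a table like `by_source`, geom_col would draw exactly one bar per source, and no axis limit would be needed to keep the plot readable.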

bystate_stations_top5_df2 <- by_state_desc %>% 
  distinct(State, state_num_stations.y, Number.Stations, Station.Type) %>%
  filter(State %in% c("CA", "NY", "FL", "TX", "MA"))

bystate_stations_bottom5_df2 <- by_state_desc %>% 
  distinct(State, state_num_stations.y, Number.Stations, Station.Type) %>%
  filter(State %in% c("SD", "ND", "AK", "PR", "ON"))

Station_types_names <- c("DC Fast", "1", "2")

ggplot(bystate_stations_top5_df2, aes(Station.Type, Number.Stations))+
  geom_boxplot()+
  facet_wrap(~State, scales = "free")+ 
  scale_x_discrete(labels= Station_types_names)

## Warning: Removed 15 rows containing non-finite values (stat_boxplot).

ggplot(bystate_stations_bottom5_df2, aes(Station.Type, Number.Stations))+
  geom_boxplot()+
  facet_wrap(~State, scales = "free")+ 
  scale_x_discrete(labels= Station_types_names)

## Warning: Removed 14 rows containing non-finite values (stat_boxplot).

Variables for Graphic 5

The explanatory variables are the type of station (factor) and the state (categorical) and the response variable is the number of EV stations (numeric). Since this is one categorical variable and one numeric variable, a boxplot is used, faceted by the second categorical variable. This graphic helps us see how the type of station (DC Fast, Level 1, and Level 2) is distributed among the five states with the most stations and the five with the fewest.
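The unnamed label vector passed to scale_x_discrete() above is matched to the axis by position, so the labels are only correct if the levels of Station.Type happen to be in the same order. A more defensive option is to attach labels by name. The sketch below shows the idea with base R's `factor()`; the raw codes `"DCFast"`, `"lvl1"`, and `"lvl2"` are hypothetical, since the actual values of Station.Type are not shown here.

```r
# Hypothetical raw station-type codes, in arbitrary order.
station_type <- c("lvl2", "DCFast", "lvl1", "lvl2")

# Named lookup: names are the raw codes, values are the display labels.
type_labels <- c(DCFast = "DC Fast", lvl1 = "Level 1", lvl2 = "Level 2")

# Explicit levels plus labels tie each code to its label regardless of order.
labelled <- factor(
  station_type,
  levels = names(type_labels),
  labels = unname(type_labels)
)
```

ggplot2's discrete scales also accept a named vector for `labels`, which gives the same by-name matching directly in the plot call.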