The following document is a continuation of the project found at this R Markdown link.
For this project I am going to visualize electric vehicle (EV) charging station data for the states of North Carolina and South Carolina. My goal is to visualize the proportion of networked vs. non-networked charging stations added in the region since 2010.
Networked charging stations are those that belong to a larger corporate system (e.g., Tesla, ChargePoint), whereas non-networked charging stations tend to be owned entirely by local governments and businesses. A good fit for this project is the stacked bar chart, with every bar normalized to the same height so that its segments represent proportions. In order to create this visualization we must first load in our data and the requisite packages:
EV_Bar_Data <- read.csv("EV_Stations_Cleaned.csv", as.is=TRUE)
library(dplyr)
library(ggplot2)
library(tidyr)
colnames(EV_Bar_Data)
## [1] "X" "OBJECTID" "Station_Name"
## [4] "Street_Address" "City" "State"
## [7] "ZIP" "Charging_Outlets" "EV_Network"
## [10] "Geocode_Status" "Latitude" "Longitude"
## [13] "Date_Last_Confirmed" "ID" "EV_Connector_Types"
## [16] "Open" "Open_Year"
The first thing that I need to do is examine my data on EV charging station networks to see what kinds of values are contained within it.
I can use the unique() function with my EV network column subsetted to explore what is contained within it:
unique(EV_Bar_Data$EV_Network)
## [1] "Non-Networked" "ChargePoint Network" "EV Connect"
## [4] "OpConnect" "Tesla" "Tesla Destination"
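Beyond listing the distinct values, it can help to see how many stations fall under each one. The following is only a minimal sketch on a hypothetical toy vector (`toy_network` is made up for illustration and is not the real `EV_Bar_Data$EV_Network` column); it uses base R's table() to count each value:

```r
# Hypothetical toy vector standing in for EV_Bar_Data$EV_Network
toy_network <- c("Non-Networked", "Tesla", "ChargePoint Network",
                 "Non-Networked", "Tesla Destination", "Tesla")

# table() counts how often each distinct value occurs;
# sort(..., decreasing = TRUE) lists the most common values first
network_counts <- sort(table(toy_network), decreasing = TRUE)
print(network_counts)
```

On the real data, the same call would be applied to EV_Bar_Data$EV_Network.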
I can now see that my EV_Network column contains 6 unique values. Only one value represents non-networked stations, while the other 5 refer to specific networks. For my visualization I only want to compare the proportions of networked and non-networked stations with each other; I do not want to compare individual networks with each other.
Keeping the intent of this project in mind, I therefore need to group the 5 EV networks together into a single category called “Networked”.
There are multiple ways I could go about combining and reclassifying my data, but I prefer to use loops. Loops are easily understood by almost anyone with coding experience. Outside reviewers of my code, who may not be very familiar with base R functions or tidyverse syntax, would be able to understand a loop more easily. Replicability and cross-platform considerations are very important when working with data.
For this project I will use a single for() loop to reclassify the 5 different EV network values into a single value called “Networked”, stored in a new column:
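# Start with an empty column, then fill it in one row at a time
EV_Bar_Data$Networked <- NA
for (i in seq_along(EV_Bar_Data$EV_Network)) {
  if (EV_Bar_Data$EV_Network[i] == "Non-Networked") {
    # Keep the "Non-Networked" label unchanged
    EV_Bar_Data$Networked[i] <- EV_Bar_Data$EV_Network[i]
  } else {
    # Collapse every named network into a single category
    EV_Bar_Data$Networked[i] <- "Networked"
  }
}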
Now when I run the unique() function on the new column, I see only 2 values:
unique(EV_Bar_Data$Networked)
## [1] "Non-Networked" "Networked"
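For comparison, one of the other ways to do this reclassification is base R's vectorized ifelse(). This is only a sketch on a hypothetical toy data frame (`toy` and its rows are made up for illustration); it is meant to produce the same two labels as the loop, not to replace it:

```r
# Hypothetical toy data standing in for EV_Bar_Data
toy <- data.frame(
  EV_Network = c("Non-Networked", "Tesla", "EV Connect", "Non-Networked"),
  stringsAsFactors = FALSE
)

# ifelse() tests every element at once: rows equal to "Non-Networked"
# keep that label, all other rows are labelled "Networked"
toy$Networked <- ifelse(toy$EV_Network == "Non-Networked",
                        "Non-Networked", "Networked")

print(unique(toy$Networked))
```

The loop makes each step explicit, while ifelse() is shorter; which one reads better depends on the audience.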
Now that my EV network data is in a good state, I need to consider the time component of the visualization I am trying to make. Since I want to chart the proportion of EV stations added over time, by network status, I need to consider what temporal unit I want to use for this task.
The year that each station was opened is what I am most interested in. One of the columns in my dataframe is called “Open_Year”, and I want to see what years it contains. I only want the years 2010 through 2019, and will have to remove any others.
sort(unique(EV_Bar_Data$Open_Year))
## [1] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020
Since order is important when dealing with years, I nested the unique() function within a sort() function. What I can see is that the years 2010-2020 are represented in my data. The only year I do not want is 2020, which I can remove with dplyr's filter() function:
EV_Bar_Data <- filter(EV_Bar_Data, Open_Year != 2020)
sort(unique(EV_Bar_Data$Open_Year))
## [1] 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
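Dropping 2020 is enough here because it is the only unwanted year in this dataset. If the data could also contain years before 2010 or after 2020, a range filter would state the intent more directly. A minimal sketch, assuming a hypothetical toy data frame (`toy_years` is made up for illustration) and using dplyr's between(), which includes both endpoints:

```r
library(dplyr)

# Hypothetical toy data with years on both sides of the wanted range
toy_years <- data.frame(Open_Year = c(2008, 2010, 2015, 2019, 2020, 2021))

# Keep only rows whose Open_Year lies in 2010..2019 (inclusive)
toy_filtered <- filter(toy_years, between(Open_Year, 2010, 2019))

print(sort(unique(toy_filtered$Open_Year)))
```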
With 2020 removed and my data cleaned, I can use ggplot to create my visualization. For this bar graph, I pass position = "fill" to the geom_bar() function, which stacks the bars and scales each one to the same height so that the segments represent proportions. A continuous y scale with percent labels then displays those proportions as percentages:
ggplot(EV_Bar_Data, aes(factor(Open_Year), fill = factor(Networked))) +
  geom_bar(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  scale_fill_discrete("Network Status") +
  ggtitle("% Networked vs. Non-Networked Charging Stations Added by Year") +
  theme_bw() +
  theme(plot.background = element_blank()) +
  xlab("Year") +
  ylab("Percent")
This visualization shows that non-networked charging stations were more commonly installed in the early years of electric vehicle charging, with networked charging stations taking on more importance with time.
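To check a chart like this numerically, the within-year shares that position = "fill" displays can also be computed as a table. The following is a sketch on a hypothetical toy data frame (`toy`, its rows, and the `Proportion` column name are made up for illustration, so the numbers say nothing about the real stations):

```r
library(dplyr)

# Hypothetical toy data standing in for the cleaned EV_Bar_Data
toy <- data.frame(
  Open_Year = c(2010, 2010, 2010, 2011, 2011, 2011, 2011),
  Networked = c("Non-Networked", "Non-Networked", "Networked",
                "Networked", "Networked", "Networked", "Non-Networked"),
  stringsAsFactors = FALSE
)

toy_props <- toy %>%
  count(Open_Year, Networked) %>%       # stations per year and status
  group_by(Open_Year) %>%
  mutate(Proportion = n / sum(n)) %>%   # share of that year's stations
  ungroup()

print(toy_props)
```

Within each year the Proportion values sum to 1, which corresponds to every bar in the chart reaching 100%.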