The following dataset can be downloaded from the U.S. Department of Energy.

For this project I am going to clean and transform a subset of EV Station data for future geospatial analysis of diffusion. This project is concerned only with EV Stations in the Carolinas.


Step 1: Read in the Data


The first thing that I needed to do for this project was read in my data. In this case, I saved my file as a .csv, so I will use the function read.csv() to read in my data, setting the parameter as.is = TRUE so that character columns are kept as character strings rather than being converted to factors.

For this project I intend to use Hadley Wickham’s dplyr package for faster data manipulation. dplyr is part of the greater Tidyverse, a collection of packages that simplify some of the rote tasks commonly performed in data science. Two other Tidyverse packages I will use in this project are tidyr and ggplot2.


EV_Stations <- read.csv("EV_Stations_NCSC.csv", as.is=TRUE)

library(dplyr)
library(tidyr)
library(ggplot2)

colnames(EV_Stations)
##  [1] "OBJECTID"                "Station_Name"           
##  [3] "Street_Address"          "City"                   
##  [5] "State"                   "ZIP"                    
##  [7] "Station_Phone"           "Groups_With_Access_Code"
##  [9] "Access_Days_Time"        "EV_Level1_EVSE_Num"     
## [11] "EV_Level2_EVSE_Num"      "EV_DC_Fast_Count"       
## [13] "Charging_Outlets"        "EV_Network"             
## [15] "Geocode_Status"          "Latitude"               
## [17] "Longitude"               "Date_Last_Confirmed"    
## [19] "ID"                      "Open_Date"              
## [21] "EV_Connector_Types"      "Facility_Type"          
## [23] "EV_Pricing"


After loading in my data, I use the base R colnames() function to better understand the columns in my data frame. For this project, there are quite a few columns that are irrelevant to me. These can be quickly removed thanks to dplyr’s select() function. Here I specify which columns to remove by their index value, but I could have also removed them by name (an equivalent name-based call is sketched after the code below).


EV_Stations <- select(EV_Stations, -7:-12, -22, -23)
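
For reference, the same columns could be dropped by name rather than by position. A minimal equivalent call, using the column names listed above, would be:

# Equivalent alternative to the index-based call above (not run)
EV_Stations <- select(EV_Stations, -Station_Phone, -Groups_With_Access_Code,
                      -Access_Days_Time, -EV_Level1_EVSE_Num, -EV_Level2_EVSE_Num,
                      -EV_DC_Fast_Count, -Facility_Type, -EV_Pricing)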


Step 2: Removing Empty Values from the Data

Now that my data frame is slightly more manageable, I need to begin examining my data. For this project I am interested in the diffusion of EV stations over space and time. Therefore, the most vital column in my data frame is going to be “Open_Date”. The first thing I want to do is check this column for NA/missing values.


is.na(EV_Stations$Open_Date)
##   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [349] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [361] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [373] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [385] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [397] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [409] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [421] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [433] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [445] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [457] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [469] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [481] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [493] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [505] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [517] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [529] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [541] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [553] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [565] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [577] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [589] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [601] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [613] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [625] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [637] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [649] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [661] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [673] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [685] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [697] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [709] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [721] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [733] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [745] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [757] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [769] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [781] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [793] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [805] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [817] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [829] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [841] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [853] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [865] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [877] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [889] FALSE FALSE FALSE FALSE FALSE FALSE
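
A more compact way to run the same check (an optional alternative to printing the entire logical vector) is to count the NA values directly; a complete column returns 0:

# Optional alternative: count missing values instead of printing every element
sum(is.na(EV_Stations$Open_Date))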


Just eyeballing the data, there do not appear to be any NA values. This is good news; however, there may still be blank/empty values in my data.


unique(EV_Stations$Open_Date)
##   [1] "2010-12-15 0:00:00" "2011-03-15 0:00:00" "2011-03-01 0:00:00"
##   [4] "2010-10-15 0:00:00" "2011-02-15 0:00:00" "2012-02-01 0:00:00"
##   [7] "2011-06-15 0:00:00" "2011-07-15 0:00:00" "2011-09-15 0:00:00"
##  [10] "2011-10-01 0:00:00" "2011-07-17 0:00:00" "2011-08-15 0:00:00"
##  [13] ""                   "2011-10-03 0:00:00" "2011-09-01 0:00:00"
##  [16] "2011-05-25 0:00:00" "2011-11-14 0:00:00" "2011-08-08 0:00:00"
##  [19] "2012-02-20 0:00:00" "2012-02-03 0:00:00" "2011-04-01 0:00:00"
##  [22] "2012-02-29 0:00:00" "2012-01-01 0:00:00" "2017-05-01 0:00:00"
##  [25] "2012-02-15 0:00:00" "2012-02-04 0:00:00" "2011-08-01 0:00:00"
##  [28] "2012-01-31 0:00:00" "2011-12-01 0:00:00" "2017-07-06 0:00:00"
##  [31] "2012-01-03 0:00:00" "2012-03-01 0:00:00" "2011-12-15 0:00:00"
##  [34] "2018-01-01 0:00:00" "2012-05-25 0:00:00" "2012-04-12 0:00:00"
##  [37] "2019-11-06 0:00:00" "2012-07-05 0:00:00" "2011-08-31 0:00:00"
##  [40] "2012-05-30 0:00:00" "2011-07-21 0:00:00" "2012-06-15 0:00:00"
##  [43] "2012-10-24 0:00:00" "2012-05-26 0:00:00" "2012-10-25 0:00:00"
##  [46] "2012-08-27 0:00:00" "2012-10-09 0:00:00" "2012-05-31 0:00:00"
##  [49] "2012-12-21 0:00:00" "2018-02-01 0:00:00" "2012-07-01 0:00:00"
##  [52] "2011-12-31 0:00:00" "2012-11-30 0:00:00" "2012-06-30 0:00:00"
##  [55] "2013-08-01 0:00:00" "2013-04-01 0:00:00" "2013-06-30 0:00:00"
##  [58] "2013-07-01 0:00:00" "2013-05-01 0:00:00" "2013-11-01 0:00:00"
##  [61] "2014-02-01 0:00:00" "2014-04-01 0:00:00" "2014-03-07 0:00:00"
##  [64] "2014-05-12 0:00:00" "2012-04-01 0:00:00" "2014-07-31 0:00:00"
##  [67] "2017-11-15 0:00:00" "2014-08-11 0:00:00" "2014-09-07 0:00:00"
##  [70] "2013-06-01 0:00:00" "2013-02-01 0:00:00" "2014-10-01 0:00:00"
##  [73] "2014-10-15 0:00:00" "2011-10-31 0:00:00" "2015-01-01 0:00:00"
##  [76] "2014-09-01 0:00:00" "2014-11-01 0:00:00" "2015-06-29 0:00:00"
##  [79] "2014-02-26 0:00:00" "2015-04-18 0:00:00" "2018-06-01 0:00:00"
##  [82] "2015-09-01 0:00:00" "2016-01-01 0:00:00" "2016-03-01 0:00:00"
##  [85] "2015-01-15 0:00:00" "2016-06-01 0:00:00" "2014-08-01 0:00:00"
##  [88] "2015-06-01 0:00:00" "2016-12-15 0:00:00" "2017-01-01 0:00:00"
##  [91] "2016-05-11 0:00:00" "2017-04-01 0:00:00" "2017-06-01 0:00:00"
##  [94] "2017-06-28 0:00:00" "2017-07-01 0:00:00" "2017-05-04 0:00:00"
##  [97] "2017-03-01 0:00:00" "2013-07-06 0:00:00" "2017-10-01 0:00:00"
## [100] "2018-01-31 0:00:00" "2017-09-01 0:00:00" "2017-09-27 0:00:00"
## [103] "2017-11-23 0:00:00" "2018-03-01 0:00:00" "2016-10-01 0:00:00"
## [106] "2018-06-28 0:00:00" "2018-05-14 0:00:00" "2018-08-01 0:00:00"
## [109] "2015-12-19 0:00:00" "2013-10-31 0:00:00" "2015-03-01 0:00:00"
## [112] "2018-07-23 0:00:00" "2014-01-01 0:00:00" "2017-11-01 0:00:00"
## [115] "2013-12-15 0:00:00" "2016-12-01 0:00:00" "2015-11-01 0:00:00"
## [118] "2018-10-01 0:00:00" "2018-09-01 0:00:00" "2017-12-01 0:00:00"
## [121] "2016-02-01 0:00:00" "2015-10-01 0:00:00" "2016-08-15 0:00:00"
## [124] "2018-11-01 0:00:00" "2012-04-30 0:00:00" "2015-05-01 0:00:00"
## [127] "2016-11-01 0:00:00" "2016-05-01 0:00:00" "2015-12-01 0:00:00"
## [130] "2016-09-01 0:00:00" "2016-04-01 0:00:00" "2016-04-07 0:00:00"
## [133] "2019-03-01 0:00:00" "2015-07-01 0:00:00" "2019-01-01 0:00:00"
## [136] "2019-04-01 0:00:00" "2019-03-13 0:00:00" "2019-05-01 0:00:00"
## [139] "2019-06-01 0:00:00" "2019-07-01 0:00:00" "2012-04-02 0:00:00"
## [142] "2014-03-01 0:00:00" "2019-10-19 0:00:00" "2019-09-01 0:00:00"
## [145] "2019-09-30 0:00:00" "2019-11-19 0:00:00" "2019-05-23 0:00:00"
## [148] "2020-01-31 0:00:00" "2020-01-01 0:00:00" "2020-05-15 0:00:00"
## [151] "2019-12-31 0:00:00"


The unique() function can be used to examine every unique value in a column. By using it, I can clearly see that there are empty values in the Open_Date column, represented by "".

R does not treat empty strings as missing data. When data is missing, it is much better to work with NA values rather than empty strings.

A simple for() loop can be used to iterate through my data, changing empty values to NA. Here I tell the loop to save the output to a new column called “Open”. Once this is complete, the old Open_Date column can be removed.


EV_Stations$Open <- NA

for(i in 1:length(EV_Stations$Open_Date)){
  if(EV_Stations$Open_Date[i]!=""){
    EV_Stations$Open[i] <- EV_Stations$Open_Date[i]
  } else EV_Stations$Open[i] <- NA
}

EV_Stations <- select(EV_Stations, -c(Open_Date))
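
For reference, the same replacement could be done in a single vectorized step with dplyr’s na_if() (an equivalent alternative to the loop above, not the approach used here):

# Alternative to the loop: na_if() swaps empty strings for NA in one call
EV_Stations$Open <- na_if(EV_Stations$Open_Date, "")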


Step 3: Parsing Data and Removing NA Values


Now that all the empty values in my Open column have been replaced with proper NA values, I would like to remove these respective rows from my dataset. Since these locations do not have a known opening date, I am not all that interested in them.

Thanks to the tidyr package, removing rows with NA values is made simple by the drop_na() function, combined with the %>% pipe operator (which dplyr re-exports from the magrittr package):


EV_Stations_Cleaned <- EV_Stations %>% drop_na(Open)

nrow(EV_Stations_Cleaned)
## [1] 457
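
The same rows could also be dropped with dplyr’s filter() (an equivalent alternative):

# Equivalent alternative using filter()
EV_Stations_Cleaned <- EV_Stations %>% filter(!is.na(Open))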


This reduced the number of records in my data from 894 to 457. Though this loss may seem significant, for this research it is not that damaging. The purpose of this research is to get a general sense of diffusive processes, and 457 observations is still more than enough to achieve that goal.

With my data now much cleaner, I would like to parse my opening dates so that I have a separate column containing the year in which each station was installed. I also would like to lop the unneeded hours and minutes off the end of my dates, and remove my uncleaned dataset so that I do not mistakenly use it in the future.

I can do this quite simply using the substr() function:


remove(EV_Stations)

EV_Stations_Cleaned$Open <- substr(EV_Stations_Cleaned$Open, start = 1, stop = 10)

EV_Stations_Cleaned$Open_Year <- substr(EV_Stations_Cleaned$Open, start = 1, stop = 4)
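
If proper date objects were needed later on, the truncated strings could instead be converted to R’s Date class and the year extracted from that — a sketch of an alternative, not part of this workflow:

# Alternative (not used here): store Date objects and pull the year from them
EV_Stations_Cleaned$Open <- as.Date(EV_Stations_Cleaned$Open)
EV_Stations_Cleaned$Open_Year <- format(EV_Stations_Cleaned$Open, "%Y")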


With that, my final columns are as follows:


colnames(EV_Stations_Cleaned)
##  [1] "OBJECTID"            "Station_Name"        "Street_Address"     
##  [4] "City"                "State"               "ZIP"                
##  [7] "Charging_Outlets"    "EV_Network"          "Geocode_Status"     
## [10] "Latitude"            "Longitude"           "Date_Last_Confirmed"
## [13] "ID"                  "EV_Connector_Types"  "Open"               
## [16] "Open_Year"


Step 4: Analyzing and Exporting my Data

With my data in a state I am happy with, I can now analyze it in a Geographic Information System (GIS) of my choice. While R can be used as a GIS, I prefer using ESRI’s ArcGIS Pro.

Before I do that, though, I want to use this opportunity to quickly visualize my data. This is where ggplot2 comes in handy. With ggplot2 I can create a quick bar chart of my data to get a better idea of the number of EV charging stations that were added each year since 2010.


ggplot(EV_Stations_Cleaned, aes(x = factor(Open_Year))) +
  geom_bar(stat = "count", width = 0.7, fill = "steelblue") +
  theme_minimal() +
  coord_flip() +
  ggtitle("Number of New EV Charging Stations in the Carolinas by Year") +
  xlab("Year") + ylab("Count")


By examining this bar graph, it would appear as though 2012 was a peak year for EV charging station installation, while 2013 and 2014 were slumps. Moving forward I could attempt to see whether a certain charging network (e.g., Tesla) was responsible for the major addition of charging stations in 2012, but at the moment my main goal is to examine the spatial diffusion of these stations over time.
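
As a rough sketch of how that follow-up question might be approached (hypothetical, not run here), the 2012 records could be tallied by network:

# Hypothetical follow-up: which networks account for the 2012 spike?
EV_Stations_Cleaned %>%
  filter(Open_Year == "2012") %>%
  count(EV_Network, sort = TRUE)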

To examine that spatial diffusion, I will write my data back to a .csv file and continue working within a GIS environment.

# Write the cleaned data to a .csv; row.names = FALSE omits the row-number column
write.csv(EV_Stations_Cleaned, "EV_Stations_Cleaned.csv", row.names = FALSE)