Tidyverse Assignment

Assignment

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Data obtained from Kaggle website at “https://www.kaggle.com/datasets/utkarshx27/electric-vehicle-population-data” It is in CSV format and located on my GitHub site at “https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv”

url <- "https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv"
ev_data <- read.csv(url)

Tidyverse Package-Tidyr

For this assignment vignette we will be using the Tidyr package for R. This package has various tools for reshaping and tidying data to perform data analysis and visualization. There are some very powerful functions that can be leveraged from the package including pivot_longer, pivot_wider, splitting and combining cells, and creating, reshaping, and transforming nested data.

Data Source and Selection

The dataset selected is available on the Kaggle website and provides data on Electric Vehicles in the state of Washington. This dataset lends itself readily for tidying and transforming with the Tidyr package. There 17 variables that include county, city, year, make, model of car, electric vehicle type. There are over 135,000 observations.

Tibble Function

While the Tibble function is not contained within the Tidyr package, it is contained within the overall Tidyverse library. We could of course use the “glimpse” or “view” functions to get a look at the dataset; however, the overall size of the dataset makes viewing problematic, and the glimpse function not visually optimal. The tibble function allows for better visualization of the dataset, and also allows us to select columns for a new data frame and present it as a table.

as_tibble(ev_data)

## # A tibble: 135,038 × 17
##    VIN..1.10. County    City       State Postal.Code Model.Year Make   Model    
##    <chr>      <chr>     <chr>      <chr>       <int>      <int> <chr>  <chr>    
##  1 5YJ3E1EA0K Thurston  Tumwater   WA          98512       2019 TESLA  MODEL 3  
##  2 1N4BZ1DV4N Island    Clinton    WA          98236       2022 NISSAN LEAF     
##  3 5YJ3E1EA0L Snohomish Snohomish  WA          98290       2020 TESLA  MODEL 3  
##  4 5YJ3E1EBXL King      Seattle    WA          98134       2020 TESLA  MODEL 3  
##  5 5YJSA1CP0D Snohomish Edmonds    WA          98020       2013 TESLA  MODEL S  
##  6 WBY7Z8C5XJ Chelan    Manson     WA          98831       2018 BMW    I3       
##  7 5YJ3E1EC1L Snohomish Marysville WA          98271       2020 TESLA  MODEL 3  
##  8 WA1F2AFY7M Snohomish Edmonds    WA          98026       2021 AUDI   Q5 E     
##  9 JTDKARFPXK Thurston  Olympia    WA          98501       2019 TOYOTA PRIUS PR…
## 10 1N4AZ1CP2J Thurston  Lacey      WA          98503       2018 NISSAN LEAF     
## # ℹ 135,028 more rows
## # ℹ 9 more variables: Electric.Vehicle.Type <chr>,
## #   Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility <chr>,
## #   Electric.Range <int>, Base.MSRP <int>, Legislative.District <int>,
## #   DOL.Vehicle.ID <int>, Vehicle.Location <chr>, Electric.Utility <chr>,
## #   X2020.Census.Tract <dbl>

Variable Selection

We want to know the electric range and types of EV vehicles, and the county, model, make and year of the vehicles. So we will narrow down the dataset to include only those columns for further transformation.

ev_data_select <- ev_data |> select(County, Model.Year, Make, Model, Electric.Vehicle.Type, Electric.Range)
tibble(ev_data_select)

## # A tibble: 135,038 × 6
##    County    Model.Year Make   Model       Electric.Vehicle.Type  Electric.Range
##    <chr>          <int> <chr>  <chr>       <chr>                           <int>
##  1 Thurston        2019 TESLA  MODEL 3     Battery Electric Vehi…            220
##  2 Island          2022 NISSAN LEAF        Battery Electric Vehi…              0
##  3 Snohomish       2020 TESLA  MODEL 3     Battery Electric Vehi…            266
##  4 King            2020 TESLA  MODEL 3     Battery Electric Vehi…            322
##  5 Snohomish       2013 TESLA  MODEL S     Battery Electric Vehi…            208
##  6 Chelan          2018 BMW    I3          Plug-in Hybrid Electr…             97
##  7 Snohomish       2020 TESLA  MODEL 3     Battery Electric Vehi…            308
##  8 Snohomish       2021 AUDI   Q5 E        Plug-in Hybrid Electr…             18
##  9 Thurston        2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electr…             25
## 10 Thurston        2018 NISSAN LEAF        Battery Electric Vehi…            151
## # ℹ 135,028 more rows

Unite Function

The “unite” function allows us to collapse cells across several columns. For example, we would like to collapse the model year, make, and model into a single column variable called EV_vehicle.

ev_data_unite <- ev_data_select |> unite("EV_vehicle", Model.Year:Make:Model, sep = " ", na.rm = TRUE)
tibble(ev_data_unite)

## # A tibble: 135,038 × 4
##    County    EV_vehicle              Electric.Vehicle.Type        Electric.Range
##    <chr>     <chr>                   <chr>                                 <int>
##  1 Thurston  2019 TESLA MODEL 3      Battery Electric Vehicle (B…            220
##  2 Island    2022 NISSAN LEAF        Battery Electric Vehicle (B…              0
##  3 Snohomish 2020 TESLA MODEL 3      Battery Electric Vehicle (B…            266
##  4 King      2020 TESLA MODEL 3      Battery Electric Vehicle (B…            322
##  5 Snohomish 2013 TESLA MODEL S      Battery Electric Vehicle (B…            208
##  6 Chelan    2018 BMW I3             Plug-in Hybrid Electric Veh…             97
##  7 Snohomish 2020 TESLA MODEL 3      Battery Electric Vehicle (B…            308
##  8 Snohomish 2021 AUDI Q5 E          Plug-in Hybrid Electric Veh…             18
##  9 Thurston  2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electric Veh…             25
## 10 Thurston  2018 NISSAN LEAF        Battery Electric Vehicle (B…            151
## # ℹ 135,028 more rows

Separate Function

The “separate” function is a complement to the “unite” function and allows us to make the dataset wider or longer. For this example we will separate the Electric.Vehicle.Type column into two separate columns names “EV_type_desc” and “EV_type_abbrev”. We will also remove the associated parentheses using the base package “gsub” function by replacing the “)” with ““.

ev_data_sep <- ev_data_unite |>separate_wider_delim(Electric.Vehicle.Type, delim = "(", names = c("EV_type_desc", "EV_type_abbrev"))
ev_data_sep$EV_type_abbrev <- gsub("\\)","", ev_data_sep$EV_type_abbrev)
tibble(ev_data_sep)

## # A tibble: 135,038 × 5
##    County    EV_vehicle              EV_type_desc  EV_type_abbrev Electric.Range
##    <chr>     <chr>                   <chr>         <chr>                   <int>
##  1 Thurston  2019 TESLA MODEL 3      "Battery Ele… BEV                       220
##  2 Island    2022 NISSAN LEAF        "Battery Ele… BEV                         0
##  3 Snohomish 2020 TESLA MODEL 3      "Battery Ele… BEV                       266
##  4 King      2020 TESLA MODEL 3      "Battery Ele… BEV                       322
##  5 Snohomish 2013 TESLA MODEL S      "Battery Ele… BEV                       208
##  6 Chelan    2018 BMW I3             "Plug-in Hyb… PHEV                       97
##  7 Snohomish 2020 TESLA MODEL 3      "Battery Ele… BEV                       308
##  8 Snohomish 2021 AUDI Q5 E          "Plug-in Hyb… PHEV                       18
##  9 Thurston  2019 TOYOTA PRIUS PRIME "Plug-in Hyb… PHEV                       25
## 10 Thurston  2018 NISSAN LEAF        "Battery Ele… BEV                       151
## # ℹ 135,028 more rows

Pivot_Wider Function

The pivot_wider function allows us to expand the data frame variables to visualize the data contained in a specific column. For this example we are going to clean up the data a little before we pivot the data frame. We are going to remove any rows that have missing data in the County variable. We are then only going to look at the counties in Washington state, of which there are thirty_nine (39). And then we are going to look at the model years of 2011 through 2023 for the EV cars purchased in each of those counties.

ev_data_clean <- ev_data[-c(150,116885,124299,126013,126132,130434,131492,131587),]
ev_data_count <- ev_data_clean |> filter(State == "WA", Model.Year >= 2011 & Model.Year < 2024) |> group_by(County, Model.Year) |> summarise(count=n())

## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.

ev_data_wider <- ev_data_count |> pivot_wider(, names_from = Model.Year, values_from = count)
kable(ev_data_wider, format = "html", caption = "EV Model Year by County") |> kable_classic_2()

EV Model Year by County
County	2011	2012	2013	2014	2015	2016	2017	2018	2019	2020	2021	2022	2023
Adams	1	2	1	2	2	1	5	1	1	7	4	7	5
Asotin	NA	2	3	6	1	1	5	5	1	5	7	10	9
Benton	12	39	60	58	56	74	134	181	135	147	214	328	247
Chelan	NA	10	25	18	47	35	64	99	63	67	93	153	122
Clallam	10	22	47	31	36	26	83	91	85	53	117	120	107
Clark	48	130	316	202	292	326	597	782	582	650	1191	1688	1182
Columbia	NA	NA	1	NA	NA	1	2	1	NA	NA	1	7	2
Cowlitz	4	14	26	14	23	21	50	71	43	62	98	134	104
Douglas	2	3	18	12	8	4	16	27	25	19	32	61	34
Ferry	NA	NA	NA	2	2	NA	NA	4	2	4	4	6	1
Franklin	2	8	23	20	12	18	25	49	34	39	56	100	52
Garfield	NA	NA	NA	NA	NA	1	NA	1	NA	NA	1	NA	NA
Grant	4	13	18	21	15	16	19	46	26	38	47	94	52
Grays Harbor	4	12	25	17	38	26	36	59	40	30	69	64	56
Island	8	30	68	50	69	62	118	175	127	136	217	261	174
Jefferson	11	27	57	22	45	41	83	104	74	64	85	126	87
King	353	621	1961	1615	2347	2873	4027	7800	5591	5910	9852	15150	12702
Kitsap	38	78	186	146	190	236	331	485	347	381	614	826	599
Kittitas	2	8	16	8	14	23	31	46	36	33	78	103	57
Klickitat	4	3	9	7	4	6	11	15	24	18	33	45	38
Lewis	4	15	20	16	27	26	45	55	35	48	76	89	59
Lincoln	NA	NA	1	5	2	2	2	4	2	1	3	7	3
Mason	2	15	33	25	23	38	58	76	57	43	71	117	81
Okanogan	2	6	6	5	2	3	16	26	13	19	21	43	30
Pacific	2	3	7	3	7	5	14	22	20	13	16	38	16
Pend Oreille	NA	NA	3	NA	NA	NA	3	5	2	7	8	5	7
Pierce	61	140	375	332	389	451	687	989	799	848	1402	2096	1831
San Juan	14	26	53	50	64	65	92	70	78	55	62	100	69
Skagit	10	31	69	48	74	67	112	168	120	126	211	279	187
Skamania	NA	1	4	2	7	9	11	27	15	9	29	24	20
Snohomish	91	170	536	404	533	591	817	1485	1122	1229	2094	3317	2862
Spokane	30	51	132	108	145	166	227	323	222	256	470	768	424
Stevens	3	5	7	3	4	4	23	23	18	8	23	30	25
Thurston	47	70	197	166	187	212	467	506	445	427	578	893	654
Wahkiakum	NA	1	2	NA	2	NA	1	6	6	5	7	11	4
Walla Walla	1	7	17	11	11	19	33	41	17	30	47	79	31
Whatcom	37	66	232	126	203	180	249	345	277	258	395	543	400
Whitman	3	4	10	7	7	8	18	26	17	16	29	51	33
Yakima	5	22	25	26	22	39	46	75	54	58	123	147	93

Conclusion

While this a relatively simple dataset to work with, it gives a person the opportunity to experiment with the various functions of the the Tidyr package within the Tidyverse. Other functions available can allow the data frame to be nested using various dplyr and purr functions.