Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)
Data obtained from Kaggle website at “https://www.kaggle.com/datasets/utkarshx27/electric-vehicle-population-data” It is in CSV format and located on my GitHub site at “https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv”
url <- "https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv"
ev_data <- read.csv(url)
For this assignment vignette we will be using the Tidyr package for R. This package has various tools for reshaping and tidying data to perform data analysis and visualization. There are some very powerful functions that can be leveraged from the package including pivot_longer, pivot_wider, splitting and combining cells, and creating, reshaping, and transforming nested data.
The dataset selected is available on the Kaggle website and provides data on Electric Vehicles in the state of Washington. This dataset lends itself readily for tidying and transforming with the Tidyr package. There 17 variables that include county, city, year, make, model of car, electric vehicle type. There are over 135,000 observations.
While the Tibble function is not contained within the Tidyr package, it is contained within the overall Tidyverse library. We could of course use the “glimpse” or “view” functions to get a look at the dataset; however, the overall size of the dataset makes viewing problematic, and the glimpse function not visually optimal. The tibble function allows for better visualization of the dataset, and also allows us to select columns for a new data frame and present it as a table.
as_tibble(ev_data)
## # A tibble: 135,038 × 17
## VIN..1.10. County City State Postal.Code Model.Year Make Model
## <chr> <chr> <chr> <chr> <int> <int> <chr> <chr>
## 1 5YJ3E1EA0K Thurston Tumwater WA 98512 2019 TESLA MODEL 3
## 2 1N4BZ1DV4N Island Clinton WA 98236 2022 NISSAN LEAF
## 3 5YJ3E1EA0L Snohomish Snohomish WA 98290 2020 TESLA MODEL 3
## 4 5YJ3E1EBXL King Seattle WA 98134 2020 TESLA MODEL 3
## 5 5YJSA1CP0D Snohomish Edmonds WA 98020 2013 TESLA MODEL S
## 6 WBY7Z8C5XJ Chelan Manson WA 98831 2018 BMW I3
## 7 5YJ3E1EC1L Snohomish Marysville WA 98271 2020 TESLA MODEL 3
## 8 WA1F2AFY7M Snohomish Edmonds WA 98026 2021 AUDI Q5 E
## 9 JTDKARFPXK Thurston Olympia WA 98501 2019 TOYOTA PRIUS PR…
## 10 1N4AZ1CP2J Thurston Lacey WA 98503 2018 NISSAN LEAF
## # ℹ 135,028 more rows
## # ℹ 9 more variables: Electric.Vehicle.Type <chr>,
## # Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility <chr>,
## # Electric.Range <int>, Base.MSRP <int>, Legislative.District <int>,
## # DOL.Vehicle.ID <int>, Vehicle.Location <chr>, Electric.Utility <chr>,
## # X2020.Census.Tract <dbl>
We want to know the electric range and types of EV vehicles, and the county, model, make and year of the vehicles. So we will narrow down the dataset to include only those columns for further transformation.
ev_data_select <- ev_data |> select(County, Model.Year, Make, Model, Electric.Vehicle.Type, Electric.Range)
tibble(ev_data_select)
## # A tibble: 135,038 × 6
## County Model.Year Make Model Electric.Vehicle.Type Electric.Range
## <chr> <int> <chr> <chr> <chr> <int>
## 1 Thurston 2019 TESLA MODEL 3 Battery Electric Vehi… 220
## 2 Island 2022 NISSAN LEAF Battery Electric Vehi… 0
## 3 Snohomish 2020 TESLA MODEL 3 Battery Electric Vehi… 266
## 4 King 2020 TESLA MODEL 3 Battery Electric Vehi… 322
## 5 Snohomish 2013 TESLA MODEL S Battery Electric Vehi… 208
## 6 Chelan 2018 BMW I3 Plug-in Hybrid Electr… 97
## 7 Snohomish 2020 TESLA MODEL 3 Battery Electric Vehi… 308
## 8 Snohomish 2021 AUDI Q5 E Plug-in Hybrid Electr… 18
## 9 Thurston 2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electr… 25
## 10 Thurston 2018 NISSAN LEAF Battery Electric Vehi… 151
## # ℹ 135,028 more rows
The “unite” function allows us to collapse cells across several columns. For example, we would like to collapse the model year, make, and model into a single column variable called EV_vehicle.
ev_data_unite <- ev_data_select |> unite("EV_vehicle", Model.Year:Make:Model, sep = " ", na.rm = TRUE)
tibble(ev_data_unite)
## # A tibble: 135,038 × 4
## County EV_vehicle Electric.Vehicle.Type Electric.Range
## <chr> <chr> <chr> <int>
## 1 Thurston 2019 TESLA MODEL 3 Battery Electric Vehicle (B… 220
## 2 Island 2022 NISSAN LEAF Battery Electric Vehicle (B… 0
## 3 Snohomish 2020 TESLA MODEL 3 Battery Electric Vehicle (B… 266
## 4 King 2020 TESLA MODEL 3 Battery Electric Vehicle (B… 322
## 5 Snohomish 2013 TESLA MODEL S Battery Electric Vehicle (B… 208
## 6 Chelan 2018 BMW I3 Plug-in Hybrid Electric Veh… 97
## 7 Snohomish 2020 TESLA MODEL 3 Battery Electric Vehicle (B… 308
## 8 Snohomish 2021 AUDI Q5 E Plug-in Hybrid Electric Veh… 18
## 9 Thurston 2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electric Veh… 25
## 10 Thurston 2018 NISSAN LEAF Battery Electric Vehicle (B… 151
## # ℹ 135,028 more rows
The “separate” function is a complement to the “unite” function and allows us to make the dataset wider or longer. For this example we will separate the Electric.Vehicle.Type column into two separate columns names “EV_type_desc” and “EV_type_abbrev”. We will also remove the associated parentheses using the base package “gsub” function by replacing the “)” with ““.
ev_data_sep <- ev_data_unite |>separate_wider_delim(Electric.Vehicle.Type, delim = "(", names = c("EV_type_desc", "EV_type_abbrev"))
ev_data_sep$EV_type_abbrev <- gsub("\\)","", ev_data_sep$EV_type_abbrev)
tibble(ev_data_sep)
## # A tibble: 135,038 × 5
## County EV_vehicle EV_type_desc EV_type_abbrev Electric.Range
## <chr> <chr> <chr> <chr> <int>
## 1 Thurston 2019 TESLA MODEL 3 "Battery Ele… BEV 220
## 2 Island 2022 NISSAN LEAF "Battery Ele… BEV 0
## 3 Snohomish 2020 TESLA MODEL 3 "Battery Ele… BEV 266
## 4 King 2020 TESLA MODEL 3 "Battery Ele… BEV 322
## 5 Snohomish 2013 TESLA MODEL S "Battery Ele… BEV 208
## 6 Chelan 2018 BMW I3 "Plug-in Hyb… PHEV 97
## 7 Snohomish 2020 TESLA MODEL 3 "Battery Ele… BEV 308
## 8 Snohomish 2021 AUDI Q5 E "Plug-in Hyb… PHEV 18
## 9 Thurston 2019 TOYOTA PRIUS PRIME "Plug-in Hyb… PHEV 25
## 10 Thurston 2018 NISSAN LEAF "Battery Ele… BEV 151
## # ℹ 135,028 more rows
The pivot_wider function allows us to expand the data frame variables to visualize the data contained in a specific column. For this example we are going to clean up the data a little before we pivot the data frame. We are going to remove any rows that have missing data in the County variable. We are then only going to look at the counties in Washington state, of which there are thirty_nine (39). And then we are going to look at the model years of 2011 through 2023 for the EV cars purchased in each of those counties.
ev_data_clean <- ev_data[-c(150,116885,124299,126013,126132,130434,131492,131587),]
ev_data_count <- ev_data_clean |> filter(State == "WA", Model.Year >= 2011 & Model.Year < 2024) |> group_by(County, Model.Year) |> summarise(count=n())
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
ev_data_wider <- ev_data_count |> pivot_wider(, names_from = Model.Year, values_from = count)
kable(ev_data_wider, format = "html", caption = "EV Model Year by County") |> kable_classic_2()
County | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Adams | 1 | 2 | 1 | 2 | 2 | 1 | 5 | 1 | 1 | 7 | 4 | 7 | 5 |
Asotin | NA | 2 | 3 | 6 | 1 | 1 | 5 | 5 | 1 | 5 | 7 | 10 | 9 |
Benton | 12 | 39 | 60 | 58 | 56 | 74 | 134 | 181 | 135 | 147 | 214 | 328 | 247 |
Chelan | NA | 10 | 25 | 18 | 47 | 35 | 64 | 99 | 63 | 67 | 93 | 153 | 122 |
Clallam | 10 | 22 | 47 | 31 | 36 | 26 | 83 | 91 | 85 | 53 | 117 | 120 | 107 |
Clark | 48 | 130 | 316 | 202 | 292 | 326 | 597 | 782 | 582 | 650 | 1191 | 1688 | 1182 |
Columbia | NA | NA | 1 | NA | NA | 1 | 2 | 1 | NA | NA | 1 | 7 | 2 |
Cowlitz | 4 | 14 | 26 | 14 | 23 | 21 | 50 | 71 | 43 | 62 | 98 | 134 | 104 |
Douglas | 2 | 3 | 18 | 12 | 8 | 4 | 16 | 27 | 25 | 19 | 32 | 61 | 34 |
Ferry | NA | NA | NA | 2 | 2 | NA | NA | 4 | 2 | 4 | 4 | 6 | 1 |
Franklin | 2 | 8 | 23 | 20 | 12 | 18 | 25 | 49 | 34 | 39 | 56 | 100 | 52 |
Garfield | NA | NA | NA | NA | NA | 1 | NA | 1 | NA | NA | 1 | NA | NA |
Grant | 4 | 13 | 18 | 21 | 15 | 16 | 19 | 46 | 26 | 38 | 47 | 94 | 52 |
Grays Harbor | 4 | 12 | 25 | 17 | 38 | 26 | 36 | 59 | 40 | 30 | 69 | 64 | 56 |
Island | 8 | 30 | 68 | 50 | 69 | 62 | 118 | 175 | 127 | 136 | 217 | 261 | 174 |
Jefferson | 11 | 27 | 57 | 22 | 45 | 41 | 83 | 104 | 74 | 64 | 85 | 126 | 87 |
King | 353 | 621 | 1961 | 1615 | 2347 | 2873 | 4027 | 7800 | 5591 | 5910 | 9852 | 15150 | 12702 |
Kitsap | 38 | 78 | 186 | 146 | 190 | 236 | 331 | 485 | 347 | 381 | 614 | 826 | 599 |
Kittitas | 2 | 8 | 16 | 8 | 14 | 23 | 31 | 46 | 36 | 33 | 78 | 103 | 57 |
Klickitat | 4 | 3 | 9 | 7 | 4 | 6 | 11 | 15 | 24 | 18 | 33 | 45 | 38 |
Lewis | 4 | 15 | 20 | 16 | 27 | 26 | 45 | 55 | 35 | 48 | 76 | 89 | 59 |
Lincoln | NA | NA | 1 | 5 | 2 | 2 | 2 | 4 | 2 | 1 | 3 | 7 | 3 |
Mason | 2 | 15 | 33 | 25 | 23 | 38 | 58 | 76 | 57 | 43 | 71 | 117 | 81 |
Okanogan | 2 | 6 | 6 | 5 | 2 | 3 | 16 | 26 | 13 | 19 | 21 | 43 | 30 |
Pacific | 2 | 3 | 7 | 3 | 7 | 5 | 14 | 22 | 20 | 13 | 16 | 38 | 16 |
Pend Oreille | NA | NA | 3 | NA | NA | NA | 3 | 5 | 2 | 7 | 8 | 5 | 7 |
Pierce | 61 | 140 | 375 | 332 | 389 | 451 | 687 | 989 | 799 | 848 | 1402 | 2096 | 1831 |
San Juan | 14 | 26 | 53 | 50 | 64 | 65 | 92 | 70 | 78 | 55 | 62 | 100 | 69 |
Skagit | 10 | 31 | 69 | 48 | 74 | 67 | 112 | 168 | 120 | 126 | 211 | 279 | 187 |
Skamania | NA | 1 | 4 | 2 | 7 | 9 | 11 | 27 | 15 | 9 | 29 | 24 | 20 |
Snohomish | 91 | 170 | 536 | 404 | 533 | 591 | 817 | 1485 | 1122 | 1229 | 2094 | 3317 | 2862 |
Spokane | 30 | 51 | 132 | 108 | 145 | 166 | 227 | 323 | 222 | 256 | 470 | 768 | 424 |
Stevens | 3 | 5 | 7 | 3 | 4 | 4 | 23 | 23 | 18 | 8 | 23 | 30 | 25 |
Thurston | 47 | 70 | 197 | 166 | 187 | 212 | 467 | 506 | 445 | 427 | 578 | 893 | 654 |
Wahkiakum | NA | 1 | 2 | NA | 2 | NA | 1 | 6 | 6 | 5 | 7 | 11 | 4 |
Walla Walla | 1 | 7 | 17 | 11 | 11 | 19 | 33 | 41 | 17 | 30 | 47 | 79 | 31 |
Whatcom | 37 | 66 | 232 | 126 | 203 | 180 | 249 | 345 | 277 | 258 | 395 | 543 | 400 |
Whitman | 3 | 4 | 10 | 7 | 7 | 8 | 18 | 26 | 17 | 16 | 29 | 51 | 33 |
Yakima | 5 | 22 | 25 | 26 | 22 | 39 | 46 | 75 | 54 | 58 | 123 | 147 | 93 |
While this a relatively simple dataset to work with, it gives a person the opportunity to experiment with the various functions of the the Tidyr package within the Tidyverse. Other functions available can allow the data frame to be nested using various dplyr and purr functions.