Assignment

Your task here is to Create an Example. Using one or more TidyVerse packages, and any dataset from fivethirtyeight.com or Kaggle, create a programming sample “vignette” that demonstrates how to use one or more of the capabilities of the selected TidyVerse package with your selected dataset. (25 points)

Data obtained from Kaggle website at “https://www.kaggle.com/datasets/utkarshx27/electric-vehicle-population-data” It is in CSV format and located on my GitHub site at “https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv

url <- "https://raw.githubusercontent.com/Aconrard/DATA606/main/Tidyverse%20Project/Electric_Vehicle_Population_Data.csv"
ev_data <- read.csv(url)

Data Source and Selection

The dataset selected is available on the Kaggle website and provides data on Electric Vehicles in the state of Washington. This dataset lends itself readily for tidying and transforming with the Tidyr package. There 17 variables that include county, city, year, make, model of car, electric vehicle type. There are over 135,000 observations.

Tibble Function

While the Tibble function is not contained within the Tidyr package, it is contained within the overall Tidyverse library. We could of course use the “glimpse” or “view” functions to get a look at the dataset; however, the overall size of the dataset makes viewing problematic, and the glimpse function not visually optimal. The tibble function allows for better visualization of the dataset, and also allows us to select columns for a new data frame and present it as a table.

as_tibble(ev_data)
## # A tibble: 135,038 × 17
##    VIN..1.10. County    City       State Postal.Code Model.Year Make   Model    
##    <chr>      <chr>     <chr>      <chr>       <int>      <int> <chr>  <chr>    
##  1 5YJ3E1EA0K Thurston  Tumwater   WA          98512       2019 TESLA  MODEL 3  
##  2 1N4BZ1DV4N Island    Clinton    WA          98236       2022 NISSAN LEAF     
##  3 5YJ3E1EA0L Snohomish Snohomish  WA          98290       2020 TESLA  MODEL 3  
##  4 5YJ3E1EBXL King      Seattle    WA          98134       2020 TESLA  MODEL 3  
##  5 5YJSA1CP0D Snohomish Edmonds    WA          98020       2013 TESLA  MODEL S  
##  6 WBY7Z8C5XJ Chelan    Manson     WA          98831       2018 BMW    I3       
##  7 5YJ3E1EC1L Snohomish Marysville WA          98271       2020 TESLA  MODEL 3  
##  8 WA1F2AFY7M Snohomish Edmonds    WA          98026       2021 AUDI   Q5 E     
##  9 JTDKARFPXK Thurston  Olympia    WA          98501       2019 TOYOTA PRIUS PR…
## 10 1N4AZ1CP2J Thurston  Lacey      WA          98503       2018 NISSAN LEAF     
## # ℹ 135,028 more rows
## # ℹ 9 more variables: Electric.Vehicle.Type <chr>,
## #   Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility <chr>,
## #   Electric.Range <int>, Base.MSRP <int>, Legislative.District <int>,
## #   DOL.Vehicle.ID <int>, Vehicle.Location <chr>, Electric.Utility <chr>,
## #   X2020.Census.Tract <dbl>

Variable Selection

We want to know the electric range and types of EV vehicles, and the county, model, make and year of the vehicles. So we will narrow down the dataset to include only those columns for further transformation.

ev_data_select <- ev_data |> select(County, Model.Year, Make, Model, Electric.Vehicle.Type, Electric.Range)
tibble(ev_data_select)
## # A tibble: 135,038 × 6
##    County    Model.Year Make   Model       Electric.Vehicle.Type  Electric.Range
##    <chr>          <int> <chr>  <chr>       <chr>                           <int>
##  1 Thurston        2019 TESLA  MODEL 3     Battery Electric Vehi…            220
##  2 Island          2022 NISSAN LEAF        Battery Electric Vehi…              0
##  3 Snohomish       2020 TESLA  MODEL 3     Battery Electric Vehi…            266
##  4 King            2020 TESLA  MODEL 3     Battery Electric Vehi…            322
##  5 Snohomish       2013 TESLA  MODEL S     Battery Electric Vehi…            208
##  6 Chelan          2018 BMW    I3          Plug-in Hybrid Electr…             97
##  7 Snohomish       2020 TESLA  MODEL 3     Battery Electric Vehi…            308
##  8 Snohomish       2021 AUDI   Q5 E        Plug-in Hybrid Electr…             18
##  9 Thurston        2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electr…             25
## 10 Thurston        2018 NISSAN LEAF        Battery Electric Vehi…            151
## # ℹ 135,028 more rows

Unite Function

The “unite” function allows us to collapse cells across several columns. For example, we would like to collapse the model year, make, and model into a single column variable called EV_vehicle.

ev_data_unite <- ev_data_select |> unite("EV_vehicle", Model.Year:Make:Model, sep = " ", na.rm = TRUE)
tibble(ev_data_unite)
## # A tibble: 135,038 × 4
##    County    EV_vehicle              Electric.Vehicle.Type        Electric.Range
##    <chr>     <chr>                   <chr>                                 <int>
##  1 Thurston  2019 TESLA MODEL 3      Battery Electric Vehicle (B…            220
##  2 Island    2022 NISSAN LEAF        Battery Electric Vehicle (B…              0
##  3 Snohomish 2020 TESLA MODEL 3      Battery Electric Vehicle (B…            266
##  4 King      2020 TESLA MODEL 3      Battery Electric Vehicle (B…            322
##  5 Snohomish 2013 TESLA MODEL S      Battery Electric Vehicle (B…            208
##  6 Chelan    2018 BMW I3             Plug-in Hybrid Electric Veh…             97
##  7 Snohomish 2020 TESLA MODEL 3      Battery Electric Vehicle (B…            308
##  8 Snohomish 2021 AUDI Q5 E          Plug-in Hybrid Electric Veh…             18
##  9 Thurston  2019 TOYOTA PRIUS PRIME Plug-in Hybrid Electric Veh…             25
## 10 Thurston  2018 NISSAN LEAF        Battery Electric Vehicle (B…            151
## # ℹ 135,028 more rows

Separate Function

The “separate” function is a complement to the “unite” function and allows us to make the dataset wider or longer. For this example we will separate the Electric.Vehicle.Type column into two separate columns names “EV_type_desc” and “EV_type_abbrev”. We will also remove the associated parentheses using the base package “gsub” function by replacing the “)” with ““.

ev_data_sep <- ev_data_unite |>separate_wider_delim(Electric.Vehicle.Type, delim = "(", names = c("EV_type_desc", "EV_type_abbrev"))
ev_data_sep$EV_type_abbrev <- gsub("\\)","", ev_data_sep$EV_type_abbrev)
tibble(ev_data_sep)
## # A tibble: 135,038 × 5
##    County    EV_vehicle              EV_type_desc  EV_type_abbrev Electric.Range
##    <chr>     <chr>                   <chr>         <chr>                   <int>
##  1 Thurston  2019 TESLA MODEL 3      "Battery Ele… BEV                       220
##  2 Island    2022 NISSAN LEAF        "Battery Ele… BEV                         0
##  3 Snohomish 2020 TESLA MODEL 3      "Battery Ele… BEV                       266
##  4 King      2020 TESLA MODEL 3      "Battery Ele… BEV                       322
##  5 Snohomish 2013 TESLA MODEL S      "Battery Ele… BEV                       208
##  6 Chelan    2018 BMW I3             "Plug-in Hyb… PHEV                       97
##  7 Snohomish 2020 TESLA MODEL 3      "Battery Ele… BEV                       308
##  8 Snohomish 2021 AUDI Q5 E          "Plug-in Hyb… PHEV                       18
##  9 Thurston  2019 TOYOTA PRIUS PRIME "Plug-in Hyb… PHEV                       25
## 10 Thurston  2018 NISSAN LEAF        "Battery Ele… BEV                       151
## # ℹ 135,028 more rows

Pivot_Wider Function

The pivot_wider function allows us to expand the data frame variables to visualize the data contained in a specific column. For this example we are going to clean up the data a little before we pivot the data frame. We are going to remove any rows that have missing data in the County variable. We are then only going to look at the counties in Washington state, of which there are thirty_nine (39). And then we are going to look at the model years of 2011 through 2023 for the EV cars purchased in each of those counties.

ev_data_clean <- ev_data[-c(150,116885,124299,126013,126132,130434,131492,131587),]
ev_data_count <- ev_data_clean |> filter(State == "WA", Model.Year >= 2011 & Model.Year < 2024) |> group_by(County, Model.Year) |> summarise(count=n())
## `summarise()` has grouped output by 'County'. You can override using the
## `.groups` argument.
ev_data_wider <- ev_data_count |> pivot_wider(, names_from = Model.Year, values_from = count)
kable(ev_data_wider, format = "html", caption = "EV Model Year by County") |> kable_classic_2()
EV Model Year by County
County 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022 2023
Adams 1 2 1 2 2 1 5 1 1 7 4 7 5
Asotin NA 2 3 6 1 1 5 5 1 5 7 10 9
Benton 12 39 60 58 56 74 134 181 135 147 214 328 247
Chelan NA 10 25 18 47 35 64 99 63 67 93 153 122
Clallam 10 22 47 31 36 26 83 91 85 53 117 120 107
Clark 48 130 316 202 292 326 597 782 582 650 1191 1688 1182
Columbia NA NA 1 NA NA 1 2 1 NA NA 1 7 2
Cowlitz 4 14 26 14 23 21 50 71 43 62 98 134 104
Douglas 2 3 18 12 8 4 16 27 25 19 32 61 34
Ferry NA NA NA 2 2 NA NA 4 2 4 4 6 1
Franklin 2 8 23 20 12 18 25 49 34 39 56 100 52
Garfield NA NA NA NA NA 1 NA 1 NA NA 1 NA NA
Grant 4 13 18 21 15 16 19 46 26 38 47 94 52
Grays Harbor 4 12 25 17 38 26 36 59 40 30 69 64 56
Island 8 30 68 50 69 62 118 175 127 136 217 261 174
Jefferson 11 27 57 22 45 41 83 104 74 64 85 126 87
King 353 621 1961 1615 2347 2873 4027 7800 5591 5910 9852 15150 12702
Kitsap 38 78 186 146 190 236 331 485 347 381 614 826 599
Kittitas 2 8 16 8 14 23 31 46 36 33 78 103 57
Klickitat 4 3 9 7 4 6 11 15 24 18 33 45 38
Lewis 4 15 20 16 27 26 45 55 35 48 76 89 59
Lincoln NA NA 1 5 2 2 2 4 2 1 3 7 3
Mason 2 15 33 25 23 38 58 76 57 43 71 117 81
Okanogan 2 6 6 5 2 3 16 26 13 19 21 43 30
Pacific 2 3 7 3 7 5 14 22 20 13 16 38 16
Pend Oreille NA NA 3 NA NA NA 3 5 2 7 8 5 7
Pierce 61 140 375 332 389 451 687 989 799 848 1402 2096 1831
San Juan 14 26 53 50 64 65 92 70 78 55 62 100 69
Skagit 10 31 69 48 74 67 112 168 120 126 211 279 187
Skamania NA 1 4 2 7 9 11 27 15 9 29 24 20
Snohomish 91 170 536 404 533 591 817 1485 1122 1229 2094 3317 2862
Spokane 30 51 132 108 145 166 227 323 222 256 470 768 424
Stevens 3 5 7 3 4 4 23 23 18 8 23 30 25
Thurston 47 70 197 166 187 212 467 506 445 427 578 893 654
Wahkiakum NA 1 2 NA 2 NA 1 6 6 5 7 11 4
Walla Walla 1 7 17 11 11 19 33 41 17 30 47 79 31
Whatcom 37 66 232 126 203 180 249 345 277 258 395 543 400
Whitman 3 4 10 7 7 8 18 26 17 16 29 51 33
Yakima 5 22 25 26 22 39 46 75 54 58 123 147 93

Conclusion

While this a relatively simple dataset to work with, it gives a person the opportunity to experiment with the various functions of the the Tidyr package within the Tidyverse. Other functions available can allow the data frame to be nested using various dplyr and purr functions.