This analysis explores data from the New York State Energy Research and Development Authority (NYSERDA) on electric vehicle rebates distributed from 2017 to 2024. The dataset provides insight into consumer adoption of electric vehicles across New York, including trends in vehicle type, transaction type, and estimated environmental impact. The goal of this analysis is to identify patterns in rebate use, assess the effectiveness of the program in promoting cleaner transportation, and better understand how consumer choices align with state sustainability goals. The data was obtained directly from NYSERDA.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 150328 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data through Date, Submitted Date, Make, Model, County, EV Type, Tr...
dbl (4): ZIP, Annual GHG Emissions Reductions (MT CO2e), Annual Petroleum Re...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
make_counts <- count(ev_clean, Make, sort = TRUE)
top_makes <- slice_head(make_counts, n = 5)
top_make_names <- pull(top_makes, Make)
print(make_counts)
# A tibble: 30 × 2
Make n
<chr> <int>
1 Tesla 68052
2 Toyota 25837
3 Jeep 12075
4 Hyundai 8548
5 Chevrolet 7655
6 Ford 6002
7 BMW 4431
8 Kia 3682
9 Honda 2311
10 Volvo 2143
# ℹ 20 more rows
Get top 10 most common counties
county_counts <- count(ev_clean, County, sort = TRUE)
top_counties <- slice_head(county_counts, n = 10)
top_county_names <- pull(top_counties, County)
print(county_counts)
# A tibble: 62 × 2
County n
<chr> <int>
1 Nassau 23655
2 Suffolk 22843
3 Westchester 14597
4 Queens 13026
5 Kings 8666
6 Monroe 7855
7 Erie 6956
8 New York 5601
9 Onondaga 3964
10 Richmond 3702
# ℹ 52 more rows
Filter only top 5 makes and top 10 counties
ev_filtered <- filter(
  ev_clean,
  Make %in% top_make_names & County %in% top_county_names
)
Calculate average GHG by County and Make
ghg_avg <- summarise(
  group_by(ev_filtered, County, Make),
  Avg_GHG_Reduction = mean(Annual_GHG_Emissions_Reductions_MT_CO2e, na.rm = TRUE)
)
`summarise()` has grouped output by 'County'. You can override using the
`.groups` argument.
Plotting everything together
ggplot(ghg_avg, aes(x = County, y = Avg_GHG_Reduction, group = Make, color = Make)) +
  # linewidth replaces the size argument, which is deprecated for lines since ggplot2 3.4.0
  geom_line(linewidth = 1.2) +
  labs(
    title = "Avg GHG Emission Reduction by County and Car Make from 2017-2024",
    x = "County",
    y = "Avg GHG Reduction (Metric Tons CO2e)",
    color = "Car Make",
    caption = "Source: New York State Energy Research and Development Authority (NYSERDA)"
  ) +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "maroon", "purple", "#66a61e")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 14),
    legend.title = element_text(size = 11)
  )
To prepare the dataset for analysis, I first loaded it with read_csv(). I then tidied the column names by replacing spaces and parentheses with underscores so they are easier to reference in code. Next, I removed any rows with missing values in three key variables: Annual_GHG_Emissions_Reductions_MT_CO2e, Make, and County, so that every row used in the analysis was complete for those fields. I built frequency tables to identify the top five car makes and the top ten counties by number of rebate entries, and used these subsets to cut down on noise and focus the plot on the largest segments of the data. Lastly, I grouped the filtered data by County and Make and computed the average greenhouse gas emissions reduction for each group.
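The cleaning steps described above could look roughly like the sketch below. It is only an illustration: the rows in ev_raw are hypothetical stand-ins for what read_csv() returns, the helper clean_name() is a name invented here, and the exact renaming rule (collapse spaces and parentheses into one underscore, then drop a trailing underscore) is an assumption chosen so that the result matches the column name Annual_GHG_Emissions_Reductions_MT_CO2e used in the code.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in rows; in the real analysis this table comes from read_csv().
ev_raw <- tibble(
  `Submitted Date` = c("01/15/2020", "03/02/2021", "07/09/2022"),
  Make = c("Tesla", NA, "Jeep"),
  County = c("Nassau", "Kings", "Erie"),
  `Annual GHG Emissions Reductions (MT CO2e)` = c(3.1, 2.4, NA)
)

# Assumed renaming rule: runs of spaces/parentheses become one underscore,
# and any trailing underscore is removed.
clean_name <- function(x) {
  x <- gsub("[ ()]+", "_", x)
  gsub("_+$", "", x)
}

ev_renamed <- rename_with(ev_raw, clean_name)

# Keep only rows that are complete for the three key variables.
ev_clean <- drop_na(
  ev_renamed,
  Annual_GHG_Emissions_Reductions_MT_CO2e, Make, County
)

print(names(ev_clean))
print(ev_clean)
```

Dropping rows only on the three variables used later, rather than on every column, avoids losing records that are incomplete in fields the analysis never touches.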
The resulting line graph shows the average annual GHG emissions reduction in each county for each of the top five car makes. Counties are on the x-axis, emissions reductions are on the y-axis, and each line represents one make. The visualization reveals some interesting geographic and brand-specific patterns. For instance, Tesla consistently shows higher average GHG reductions in most counties, most likely because its lineup is fully electric. Jeep's average GHG reductions, on the other hand, are much lower and in some counties even negative. This could point to data reporting problems, the way hybrid vehicles are classified, or small efficiency gains. These differences show that the environmental effect varies by brand and call for a closer look at the models and vehicle types each manufacturer offers.
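One way to follow up on the low and negative averages would be to break each make down by vehicle type and check how often individual records are negative. The sketch below is a rough illustration only: the rows are hypothetical, the column name EV_Type assumes the "EV Type" column was renamed like the others, and the labels "BEV" and "PHEV" are assumed values rather than ones confirmed from the data.

```r
library(dplyr)

# Hypothetical rows for illustration only; not values from the NYSERDA file.
ev_example <- tibble(
  Make = c("Jeep", "Jeep", "Jeep", "Tesla", "Tesla"),
  EV_Type = c("PHEV", "PHEV", "PHEV", "BEV", "BEV"),
  Annual_GHG_Emissions_Reductions_MT_CO2e = c(-0.5, 1.0, 0.7, 3.0, 2.6)
)

# Average reduction and share of negative records per make and vehicle type.
ghg_by_type <- summarise(
  group_by(ev_example, Make, EV_Type),
  Avg_GHG_Reduction = mean(Annual_GHG_Emissions_Reductions_MT_CO2e, na.rm = TRUE),
  Share_Negative = mean(Annual_GHG_Emissions_Reductions_MT_CO2e < 0, na.rm = TRUE),
  .groups = "drop"
)

print(ghg_by_type)
```

A high share of negative records within one make or vehicle type would suggest looking at specific models, while a handful of scattered negatives would point more toward data entry issues.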
One limitation I encountered was converting the Submitted Date column into a proper date format to analyze trends over time. Due to formatting inconsistencies or parsing issues in the original data (or it could really just have been me), this step failed to produce usable results. As a result, I shifted the focus of the analysis from a temporal comparison to a geographic one, using the counties provided.
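For a future attempt at the date conversion, one possible approach is to let lubridate try more than one date layout and then count how many values still fail. This is a sketch under assumptions, not a fix verified against the real file: the example strings are hypothetical, and the two layouts tried here (month/day/year and year-month-day) are guesses, since the actual format of Submitted Date is not shown in this report.

```r
library(lubridate)

# Hypothetical examples of inconsistent date strings; the real column may differ.
submitted_raw <- c("01/15/2020", "2021-03-02", "not a date")

# Try month-day-year first, then year-month-day; unparseable values become NA.
submitted_parsed <- as_date(
  parse_date_time(submitted_raw, orders = c("mdy", "ymd"), quiet = TRUE)
)

# Count the failures before relying on the parsed column.
n_failed <- sum(is.na(submitted_parsed))

# Extract the year, which would be enough for a year-by-year trend.
submitted_year <- year(submitted_parsed)

print(submitted_parsed)
print(n_failed)
```

Checking n_failed against the number of rows shows whether the assumed layouts cover the data or whether another format still needs to be added to orders.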