Project 1

Author

Jhonathan Urquilla

This project analyzes data from the New York State Energy Research and Development Authority (NYSERDA) on electric vehicle rebates issued from 2017 to 2024. The dataset provides insight into consumer adoption of electric vehicles across New York, including trends in vehicle type, transaction type, and estimated environmental impact. The goal of this analysis is to identify patterns in rebate use over time, assess the effectiveness of the program in promoting cleaner transportation, and better understand how consumer choices align with state sustainability goals. The data was obtained directly from NYSERDA.

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(readr)
library(ggplot2)

Set directory and load data

setwd("C:/Users/ubjho/Downloads")
electricVehicle <- read_csv("Electric_Vehicle_Drive_Clean_Rebate_2017NYSERDA.csv")

Rows: 150328 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data through Date, Submitted Date, Make, Model, County, EV Type, Tr...
dbl (4): ZIP, Annual GHG Emissions Reductions (MT CO2e), Annual Petroleum Re...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Clean column names

names(electricVehicle) <- gsub(" ", "_", names(electricVehicle))
names(electricVehicle) <- gsub("[()]", "", names(electricVehicle))
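As a quick check of what these two gsub() calls do, the sketch below applies them to two of the original column names. The vector raw_names is only an illustration and is not part of the analysis.

```r
# Two column names as they appear in the raw file (used here only for illustration)
raw_names <- c("EV Type", "Annual GHG Emissions Reductions (MT CO2e)")

# Step 1: replace every space with an underscore
step1 <- gsub(" ", "_", raw_names)

# Step 2: delete opening and closing parentheses
cleaned <- gsub("[()]", "", step1)

print(cleaned)
# [1] "EV_Type"  "Annual_GHG_Emissions_Reductions_MT_CO2e"
```

Note that the parentheses are removed rather than replaced, so no extra underscores appear in the cleaned names.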

Filter out rows with missing Make, County, or GHG values

ev_clean <- filter(
  electricVehicle,
  !is.na(Make) & !is.na(County) & !is.na(Annual_GHG_Emissions_Reductions_MT_CO2e)
)
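Before dropping rows, it may help to see how many values are missing in each key column. The sketch below shows one way to do that on a small made-up table; toy and its values are hypothetical stand-ins, not rows from the rebate data.

```r
library(dplyr)
library(tibble)

# Hypothetical rows standing in for the real dataset
toy <- tibble(
  Make = c("Tesla", NA, "Jeep"),
  County = c("Nassau", "Kings", NA),
  Annual_GHG_Emissions_Reductions_MT_CO2e = c(3.0, 2.1, NA)
)

# Count missing values in every column
na_counts <- summarise(toy, across(everything(), ~ sum(is.na(.x))))

# Apply the same filter used in the analysis and compare row counts
toy_clean <- filter(
  toy,
  !is.na(Make) & !is.na(County) & !is.na(Annual_GHG_Emissions_Reductions_MT_CO2e)
)
rows_dropped <- nrow(toy) - nrow(toy_clean)
```

Running the same two steps on electricVehicle would show how much of the data the filter removes.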

Get top 5 most common makes

make_counts <- count(ev_clean, Make, sort = TRUE)
top_makes <- slice_head(make_counts, n = 5)
top_make_names <- pull(top_makes, Make)
print(make_counts)

# A tibble: 30 × 2
   Make          n
   <chr>     <int>
 1 Tesla     68052
 2 Toyota    25837
 3 Jeep      12075
 4 Hyundai    8548
 5 Chevrolet  7655
 6 Ford       6002
 7 BMW        4431
 8 Kia        3682
 9 Honda      2311
10 Volvo      2143
# ℹ 20 more rows

Get top 10 most common counties

county_counts <- count(ev_clean, County, sort = TRUE)
top_counties <- slice_head(county_counts, n = 10)
top_county_names <- pull(top_counties, County)
print(county_counts)

# A tibble: 62 × 2
   County          n
   <chr>       <int>
 1 Nassau      23655
 2 Suffolk     22843
 3 Westchester 14597
 4 Queens      13026
 5 Kings        8666
 6 Monroe       7855
 7 Erie         6956
 8 New York     5601
 9 Onondaga     3964
10 Richmond     3702
# ℹ 52 more rows

Filter only top 5 makes and top 10 counties

ev_filtered <- filter(
  ev_clean,
  Make %in% top_make_names & County %in% top_county_names
)

Calculate average GHG by County and Make

ghg_avg <- summarise(
  group_by(ev_filtered, County, Make),
  Avg_GHG_Reduction = mean(Annual_GHG_Emissions_Reductions_MT_CO2e, na.rm = TRUE),
  .groups = "drop"  # return an ungrouped tibble and silence the grouping message
)

Plotting everything together

ggplot(ghg_avg, aes(x = County, y = Avg_GHG_Reduction, group = Make, color = Make)) +
  geom_line(linewidth = 1.2) +  # `size` for lines is deprecated since ggplot2 3.4.0
  labs(
    title = "Avg GHG Emission Reduction by County and Car Make from 2017-2024",
    x = "County",
    y = "Avg GHG Reduction (Metric Tons CO2e)",
    color = "Car Make",
    caption = "Source: New York State Energy Research and Development Authority (NYSERDA)"
  ) +
  scale_color_manual(values = c("#1b9e77", "#d95f02", "maroon", "purple", "#66a61e")) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    plot.title = element_text(face = "bold", size = 14),
    legend.title = element_text(size = 11)
  )

To prepare the dataset for analysis, I first loaded it with read_csv(). I then tidied the column names by replacing spaces with underscores and removing parentheses, which makes them easier to reference in code. Next, I removed any rows with missing values in three key variables: Make, County, and Annual_GHG_Emissions_Reductions_MT_CO2e, so that every remaining row could contribute to the averages. I built frequency tables to identify the five most common car makes and the ten counties with the most rebate entries, then used these subsets to reduce noise and focus the plot on the largest segments of the data. Finally, I grouped the filtered data by County and Make and computed the average reduction in greenhouse gas emissions for each group.

The resulting line graph shows the average annual GHG emissions reduction in each of the top ten counties for each of the top five car makes. Counties are on the x-axis, emissions reductions are on the y-axis, and each line represents one make. Because counties are categories rather than an ordered scale, the lines only connect each make's points and do not indicate a trend from one county to the next. The visualization reveals some interesting geographic and brand-specific patterns. For instance, Tesla shows higher average GHG reductions in most counties, most likely because its lineup is entirely electric. Jeep's averages, by contrast, are much lower and in some counties even negative. This might point to data reporting problems, the way hybrid vehicles are classified, or small efficiency gains. These differences show that brands vary in their environmental effect and call for more research into the models and powertrain types each manufacturer offers.
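One way to follow up on the negative averages is to look at the individual rows with a negative reduction and see which makes and vehicle types they belong to. The sketch below does this on a made-up table; the rows and the "BEV" and "PHEV" labels are assumptions for illustration, while EV_Type is the cleaned form of the dataset's "EV Type" column.

```r
library(dplyr)
library(tibble)

# Hypothetical rows; values and EV_Type labels are assumed for illustration
toy <- tibble(
  Make = c("Tesla", "Jeep", "Jeep", "Toyota"),
  EV_Type = c("BEV", "PHEV", "PHEV", "PHEV"),
  Annual_GHG_Emissions_Reductions_MT_CO2e = c(3.0, -0.2, 1.1, 2.0)
)

# Keep only rows where the estimated reduction is negative
negatives <- filter(toy, Annual_GHG_Emissions_Reductions_MT_CO2e < 0)

# Count those rows by make and vehicle type
neg_by_make <- count(negatives, Make, EV_Type, sort = TRUE)
```

Applied to ev_filtered, a table like neg_by_make would show whether the negative values are concentrated in particular makes or vehicle types.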

One limitation was that I could not convert the Submitted Date column into a proper date format to analyze trends over time. Whether because of formatting inconsistencies in the original data or a parsing mistake on my part, this step did not produce usable results. As a result, I shifted the focus of the analysis from a temporal comparison to a geographic one based on county.
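For a future attempt at the date conversion, the sketch below shows one possible approach with lubridate. It assumes the dates are stored as month/day/year text, which I have not confirmed for this file, and it uses a made-up table rather than the real data. Rows that fail to parse are kept aside for inspection instead of being silently dropped.

```r
library(dplyr)
library(tibble)
library(lubridate)

# Hypothetical rows; the month/day/year format is an assumption
toy <- tibble(
  Submitted_Date = c("04/12/2017", "11/03/2019", "11/30/2019", "not a date"),
  Make = c("Tesla", "Toyota", "Tesla", "Jeep")
)

# Parse the text into Date values; strings that do not match become NA
parsed <- mutate(
  toy,
  Submitted = mdy(Submitted_Date, quiet = TRUE),
  Year = year(Submitted)
)

# Rows that failed to parse, kept for inspection
failed <- filter(parsed, is.na(Submitted))

# Rebate counts per year among the rows that parsed
per_year <- count(filter(parsed, !is.na(Year)), Year)
```

If the real column uses a different layout, another lubridate parser such as ymd() or dmy() would take the place of mdy().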