For my second project, I’ll use the “Drive Electric in New York State” dataset. It’s an interesting dataset about the purchase or lease of electric vehicles in NY. The dataset has variables like make, model, transaction type, data through date, and submitted date. In this project I will use filter, group_by, and some other functions to do my analysis and come to a conclusion. I’m going to work with make, transaction type, county, annualghgemissionsreductions_mtco2e, rebateamount_usd, and also a column that I’ve created: year. After this analysis I’ll be able to say which manufacturer has the most purchased or leased cars and how each make trends over the years. The goal of this analysis is to identify the leading manufacturers in New York’s EV market, explore trends in adoption, and understand the impact of government rebates and environmental benefits. The source of this dataset is: Drive Electric in New York State - NYSERDA (www.nyserda.ny.gov).
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Set my working directory so I can load my file.
setwd("/Users/leikarayjoseph/Desktop/Data 110")
EV_data <- read_csv("Electric_Vehicle_Drive_Clean_Rebate_2017NYSERDA.csv")
Rows: 150328 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data through Date, Submitted Date, Make, Model, County, EV Type, Tr...
dbl (4): ZIP, Annual GHG Emissions Reductions (MT CO2e), Annual Petroleum Re...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
library(ggalluvial)
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
# putting the headers in lower case
names(EV_data) <- tolower(names(EV_data))
# removing the spaces from the headers
names(EV_data) <- gsub(" ", "", names(EV_data))
head(EV_data)
# A tibble: 6 × 11
datathroughdate submitteddate make model county zip evtype transactiontype
<chr> <chr> <chr> <chr> <chr> <dbl> <chr> <chr>
1 4/30/2024 5/28/2020 Tesla Mode… <NA> 10509 BEV Purchase
2 4/30/2024 8/30/2023 Chevr… Bolt <NA> NA BEV Purchase
3 4/30/2024 11/8/2023 Jeep Gran… <NA> 13647 PHEV Lease
4 4/30/2024 4/4/2024 Toyota Priu… <NA> 12922 PHEV Purchase
5 4/30/2024 3/30/2017 Audi A3 e… Albany 12189 PHEV Purchase
6 4/30/2024 3/30/2017 Toyota Priu… Albany 12211 PHEV Purchase
# ℹ 3 more variables: `annualghgemissionsreductions(mtco2e)` <dbl>,
# `annualpetroleumreductions(gallons)` <dbl>, `rebateamount(usd)` <dbl>
In this chunk I’m going to work with the make a column for year an
## Remove missing values in my dataset
#is.na(EV_data)
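The code that builds the year column does not appear in this chunk. Below is a minimal sketch of one way it could be done. It assumes the year is taken from submitteddate, which is stored as text in month/day/year form in the head() output above. The `toy` data frame and its rows are made up for illustration and stand in for EV_data.

```r
library(dplyr)
library(lubridate)

# Toy stand-in for EV_data (hypothetical rows, same date format as the dataset)
toy <- data.frame(
  submitteddate = c("5/28/2020", "8/30/2023", "3/30/2017"),
  make = c("Tesla", "Chevrolet", "Audi")
)

# Parse the text dates as month/day/year, then pull out the year as a number
toy <- toy |>
  mutate(year = year(mdy(submitteddate)))

toy
```

The same mutate() call applied to EV_data would add a numeric year column that group_by(make, year) can use.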
Filter my dataset
# Count the variable "make" by year to see which manufacturer got more cars registered in NY.
Make_count <- EV_data |>
  group_by(make, year) |>
  count() |> # Count the rows for each manufacturer and year.
  arrange(n) # Arrange in ascending order.
Make_count
# A tibble: 173 × 3
# Groups: make, year [173]
make year n
<chr> <dbl> <int>
1 Audi 2017 1
2 Audi 2018 1
3 Cadillac 2022 1
4 Land Rover 2021 1
5 MINI 2017 1
6 Jaguar 2023 2
7 Mercedes-Benz 2017 2
8 Lincoln 2020 3
9 Porsche 2017 3
10 Porsche 2018 3
# ℹ 163 more rows
# Count the variable "make" to see which manufacturer got more cars registered in NY.
Make_count1 <- EV_data |>
  group_by(make) |>
  count() |> # Count the rows for each manufacturer.
  arrange(n) # Arrange in ascending order.
Make_count1
# A tibble: 30 × 2
# Groups: make [30]
make n
<chr> <int>
1 Land Rover 1
2 Jaguar 54
3 Alfa Romeo 90
4 Smart 94
5 Lincoln 109
6 Genesis 119
7 Mazda 127
8 Dodge 132
9 Porsche 133
10 Cadillac 152
# ℹ 20 more rows
Now let’s make a plot for Make_count
plot1 <- ggplot(data = Make_count, aes(x = year, y = n, alluvium = make, fill = make, label = make)) +
  geom_alluvium() +
  geom_flow() +
  #geom_stratum(alpha = 0.5) +
  labs(x = "Year", y = "Count", title = "Vehicles Makes over the Years",
       caption = "source: Drive Electric in New York State - NYSERDA") +
  theme_minimal()
#ggtitle("Vehicles Makes over the Years")
plot1
Filter my data and choose only some makes to analyze:
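Filter_EV <- EV_data |>
  # filter by choosing some makes with the largest count
  filter(make %in% c("Tesla", "Toyota", "Jeep", "Hyundai", "Chevrolet", "Ford", "BMW")) |>
  # keep the columns I need; the two long names get the underscore form used in the plots
  select(make, county, transactiontype,
         annualghgemissionsreductions_mtco2e = `annualghgemissionsreductions(mtco2e)`,
         rebateamount_usd = `rebateamount(usd)`,
         year)
Filter_EV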
ggplot(Filter_EV, aes(x = year, y = rebateamount_usd, color = make)) +
  geom_point(size = 4, alpha = 0.7) +
  #geom_line() +
  labs(x = "Year", y = "Rebate Amount (usd)", title = "Electric Vehicle Rebate Amount by Year",
       caption = "source: Drive Electric in New York State - NYSERDA") +
  theme_minimal(base_size = 14) +
  # Add the linear regression line
  geom_smooth(method = "lm", color = 'black', se = TRUE)
Call:
lm(formula = year ~ rebateamount_usd, data = Filter_EV)
Residuals:
Min 1Q Median 3Q Max
-5.2822 -0.4345 0.3337 0.7178 3.5655
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.023e+03 7.962e-03 254070.9 <2e-16 ***
rebateamount_usd -1.232e-03 6.916e-06 -178.1 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.561 on 132602 degrees of freedom
Multiple R-squared: 0.193, Adjusted R-squared: 0.193
F-statistic: 3.172e+04 on 1 and 132602 DF, p-value: < 2.2e-16
Linear Regression Equation
year = -0.00123(rebateamount_usd) + 2023
What I get from the correlation analysis
This analysis shows that there’s a noticeable and statistically significant negative correlation between the year and the rebate amount in USD (p-value < 2.2e-16, R-squared = 0.193). This means that as the years go by, the rebate amounts generally tend to decrease.
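The Call line in the output shows the model lm(year ~ rebateamount_usd, data = Filter_EV), so year is the response and the rebate amount is the predictor. The sketch below shows how such a fit can be run and how its coefficients map onto the regression equation. The `toy` rows are made up and are not NYSERDA values, so only the sign of the result is meant to resemble the real output.

```r
# Made-up rows standing in for Filter_EV (not real NYSERDA values)
toy <- data.frame(
  year = c(2017, 2018, 2019, 2021, 2022, 2023),
  rebateamount_usd = c(2000, 2000, 1700, 1100, 500, 500)
)

# Same formula as in the Call: year is the response, rebate amount the predictor
fit <- lm(year ~ rebateamount_usd, data = toy)

coef(fit)     # intercept and slope of: year = intercept + slope * rebateamount_usd
summary(fit)  # standard errors, t values, p-values, and R-squared

# Correlation between the two columns; its sign matches the sign of the slope
cor(toy$year, toy$rebateamount_usd)
```

To predict the rebate amount from the year instead, the formula would be flipped to rebateamount_usd ~ year, which gives different coefficients.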
Let’s continue with my analysis and make some important plots
# Count the transaction types (purchase or lease) for each make.
Count_transactiontype <- Filter_EV |>
  group_by(make, transactiontype) |>
  summarise(count = n(), .groups = "drop") # Count the transactiontype
# plot
plot2 <- ggplot(Count_transactiontype, aes(x = make, y = count, fill = transactiontype)) +
  labs(caption = "source: Drive Electric in New York State - NYSERDA") +
  geom_bar(stat = "identity", position = "dodge") +
  theme_dark()
plot2
# Making a scatterplot for each make to see the relation between rebate amount and annualghgemissionsreductions (mtco2e)
F <- ggplot(Filter_EV, aes(x = rebateamount_usd, y = annualghgemissionsreductions_mtco2e, color = make)) +
  geom_point() +
  # One panel per make
  facet_wrap(~make) +
  geom_smooth(method = "lm", color = 'black', se = TRUE) +
  labs(x = "rebateamount (usd)",
       y = "annualghgemissionsreductions (mtco2e)",
       title = "Relationship Between Rebate Amounts and Emissions Reductions",
       caption = "Source: Drive Electric in New York State - NYSERDA",
       color = "Manufacturer") +
  theme_minimal(base_size = 10) +
  scale_color_brewer(palette = "Set3") # choose the color
F
`geom_smooth()` using formula = 'y ~ x'
# Same scatterplot with all makes in one panel
ggplot(Filter_EV, aes(x = rebateamount_usd, y = annualghgemissionsreductions_mtco2e, color = make)) +
  geom_point() +
  geom_smooth(method = "lm", color = 'black', se = TRUE) +
  labs(x = "rebateamount (usd)",
       y = "annualghgemissionsreductions (mtco2e)",
       title = "Relationship Between Rebate Amounts and Emissions Reductions",
       caption = "Source: Drive Electric in New York State - NYSERDA",
       color = "Manufacturer") +
  theme_minimal(base_size = 10) +
  scale_color_brewer(palette = "Accent")
`geom_smooth()` using formula = 'y ~ x'
Final Plot
Graph <- Filter_EV |>
  # Creates bar graph with x axis and legend
  ggplot() +
  geom_bar(aes(x = transactiontype, fill = make)) +
  # Changes the names for the axis, legend, title, and caption
  labs(x = "Transaction Type",
       y = "Number of Cars",
       fill = "Manufacturer",
       title = "Transaction type and the manufacturer of the car",
       caption = "Source: Drive Electric in New York State - NYSERDA") +
  # Changes the theme to minimal
  theme_minimal(base_size = 10) +
  # Changes the color palette to "Pastel1"
  scale_fill_brewer(palette = "Pastel1")
ggplotly(Graph)
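The caption set in labs() does not show up once the plot goes through ggplotly(). One possible workaround, sketched here with made-up rows in place of Filter_EV, is to add the caption back as a plotly annotation with layout(). The position and margin values are guesses that may need adjusting for a given plot size.

```r
library(ggplot2)
library(plotly)

# Made-up rows standing in for Filter_EV
toy <- data.frame(
  transactiontype = c("Purchase", "Purchase", "Lease"),
  make = c("Tesla", "Toyota", "Jeep")
)

p <- ggplot(toy) +
  geom_bar(aes(x = transactiontype, fill = make))

# Re-add the caption as an annotation placed under the plot area
fig <- plotly::layout(
  ggplotly(p),
  margin = list(b = 80),            # extra bottom space so the text is not cut off
  annotations = list(
    text = "Source: Drive Electric in New York State - NYSERDA",
    x = 1, y = -0.2,                # position relative to the plot area
    xref = "paper", yref = "paper",
    xanchor = "right", yanchor = "top",
    showarrow = FALSE
  )
)
fig
```

Here xref and yref set to "paper" place the text relative to the plotting area rather than the data axes, so it stays in the same spot when the bars change.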
Final essay
As I wrap up this project, I truly enjoyed working with this dataset. As someone who appreciates electric vehicles, I found it fascinating to explore and learn about the trends surrounding them. Creating visualizations to make these insights accessible and impactful added an exciting dimension to the analysis.

In my regression analysis, there’s a noticeable and statistically significant negative correlation between the year and the rebate amount in USD. This means that as the years go by, the rebate amounts generally tend to decrease. Manufacturers like Tesla are leading the EV market in terms of emissions reductions. I also discovered that the majority of cars are purchased rather than leased, which could show that people prefer owning their cars. Tesla dominates both the purchase category and the lease category. While Jeep is the second most leased make in NY, it is the least purchased.

I wanted to do more with this dataset, like creating a map to get more insight and see the makes across each county. Working on this project, I had a lot of difficulties, but one thing I still didn’t figure out was how to show the caption when using plotly. As someone who cares about the environment and plans to get a car soon, I think people should keep learning about electric vehicles to find the ones that work best for them. Switching to EVs can help reduce air pollution and fight climate change. By making smart choices and supporting new ideas in the EV industry, we can help create cleaner air and a better future for everyone.