data110project2

title: "Project 2"
author: "Leika Joseph"
format: html
editor: visual

Project 2

Introduction

For my second project, I’ll use the “Drive Electric in New York State” dataset, which records purchases and leases of electric vehicles in New York State. Its variables include make, model, transaction type, data through date, and submitted date. In this project I will use filter, group_by, and some other functions to carry out my analysis and reach a conclusion. I’m going to work with make, transactiontype, county, annualghgemissionsreductions_mtco2e, rebateamount_usd, and a column that I created: year. After this analysis I’ll be able to say which manufacturer has the most purchased or leased cars and how each make trends over the years. The goal of this analysis is to identify the leading manufacturers in New York’s EV market, explore trends in adoption, and understand the impact of government rebates and environmental benefits. The source of this dataset is: Drive Electric in New York State - NYSERDA (www.nyserda.ny.gov).

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

setwd("/Users/leikarayjoseph/Desktop/Data 110") 
# Set the working directory so the CSV file can be read.
EV_data <- read_csv("Electric_Vehicle_Drive_Clean_Rebate_2017NYSERDA.csv")

Rows: 150328 Columns: 11
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): Data through Date, Submitted Date, Make, Model, County, EV Type, Tr...
dbl (4): ZIP, Annual GHG Emissions Reductions (MT CO2e), Annual Petroleum Re...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

library(ggalluvial)
library(plotly)


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

# Put the headers in lower case and remove the spaces
names(EV_data) <- tolower(names(EV_data))
names(EV_data) <- gsub(" ","",names(EV_data))
head(EV_data)

# A tibble: 6 × 11
  datathroughdate submitteddate make   model county   zip evtype transactiontype
  <chr>           <chr>         <chr>  <chr> <chr>  <dbl> <chr>  <chr>          
1 4/30/2024       5/28/2020     Tesla  Mode… <NA>   10509 BEV    Purchase       
2 4/30/2024       8/30/2023     Chevr… Bolt  <NA>      NA BEV    Purchase       
3 4/30/2024       11/8/2023     Jeep   Gran… <NA>   13647 PHEV   Lease          
4 4/30/2024       4/4/2024      Toyota Priu… <NA>   12922 PHEV   Purchase       
5 4/30/2024       3/30/2017     Audi   A3 e… Albany 12189 PHEV   Purchase       
6 4/30/2024       3/30/2017     Toyota Priu… Albany 12211 PHEV   Purchase       
# ℹ 3 more variables: `annualghgemissionsreductions(mtco2e)` <dbl>,
#   `annualpetroleumreductions(gallons)` <dbl>, `rebateamount(usd)` <dbl>

In this chunk I’m going to parse the submitted date and create a column for the year.

EV_data$Date <- as.Date(EV_data$submitteddate, format= "%m/%d/%Y")

EV_data$year <- year(EV_data$Date)

## Check for missing values in my dataset (left commented out)
#is.na(EV_data)
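
The commented-out `is.na(EV_data)` call would print a large logical matrix rather than remove anything. A minimal sketch of how missing values could be counted per column, and how complete rows could be kept, is shown here on a small made-up data frame (`toy` and its values are assumptions for illustration, not the NYSERDA data):

```r
# Toy data frame standing in for EV_data (values are made up for illustration)
toy <- data.frame(
  make   = c("Tesla", "Chevrolet", "Jeep", "Toyota"),
  county = c(NA, NA, "Albany", "Albany"),
  zip    = c(10509, NA, 13647, 12922)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
na_per_column

# Keep only the rows that have no missing value in any column
complete_rows <- toy[complete.cases(toy), ]
nrow(complete_rows)
```

Whether dropping rows is appropriate depends on the question: in this project the county column is missing for some rows, but the plots below do not use it, so keeping those rows is a reasonable choice.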

Filter my dataset

# Count the rows for each make and year to see which manufacturers have the most cars in the dataset.
Make_count <- EV_data |>
  group_by(make, year) |>
  count() |> # Number of rows for each make in each year.
  arrange(n) # Arrange in ascending order.

Make_count

# A tibble: 173 × 3
# Groups:   make, year [173]
   make           year     n
   <chr>         <dbl> <int>
 1 Audi           2017     1
 2 Audi           2018     1
 3 Cadillac       2022     1
 4 Land Rover     2021     1
 5 MINI           2017     1
 6 Jaguar         2023     2
 7 Mercedes-Benz  2017     2
 8 Lincoln        2020     3
 9 Porsche        2017     3
10 Porsche        2018     3
# ℹ 163 more rows

# Count the rows for each make (all years together) to see which manufacturers have the most cars in the dataset.
Make_count1 <- EV_data |>
  group_by(make) |>
  count() |> # Number of rows for each make.
  arrange(n) # Arrange in ascending order.

Make_count1

# A tibble: 30 × 2
# Groups:   make [30]
   make           n
   <chr>      <int>
 1 Land Rover     1
 2 Jaguar        54
 3 Alfa Romeo    90
 4 Smart         94
 5 Lincoln      109
 6 Genesis      119
 7 Mazda        127
 8 Dodge        132
 9 Porsche      133
10 Cadillac     152
# ℹ 20 more rows
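
Because the counts above are sorted in ascending order, the largest makes end up at the bottom of the table. A minimal sketch of one way to list the largest makes first, using dplyr's `count(sort = TRUE)` and `slice_max()` on a made-up data frame (`toy`, `make_totals`, and `top_makes` are hypothetical names, and the values are not from the NYSERDA data):

```r
library(dplyr)

# Toy data standing in for EV_data (makes and counts are made up)
toy <- data.frame(make = c("Tesla", "Tesla", "Tesla", "Toyota", "Toyota", "Jeep"))

# count() groups and tallies in one step; sort = TRUE puts the largest first
make_totals <- toy |>
  count(make, sort = TRUE)

# Keep the two most common makes (the project keeps seven)
top_makes <- make_totals |>
  slice_max(n, n = 2)

top_makes$make
```

The resulting vector of makes could then be passed to `filter(make %in% ...)` instead of typing the names by hand.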

Now let’s make a plot for Make_count

plot1 <- ggplot(data = Make_count,
                aes(x = year, y = n, alluvium = make, fill = make, label = make)) +
  geom_alluvium() +
  geom_flow() +
  #geom_stratum(alpha = 0.5) +
  labs(x = "Year",
       y = "Count",
       title = "Vehicle Makes over the Years",
       caption = "source: Drive Electric in New York State - NYSERDA") +
  theme_minimal()

plot1

Filter the data to keep only a few makes for analysis:

Filter_EV <- EV_data |>
  # Keep only some of the makes with the largest counts
  filter(make %in% c("Tesla", "Toyota", "Jeep", "Hyundai", "Chevrolet", "Ford", "BMW")) |>
  select(make, county, transactiontype, `annualghgemissionsreductions(mtco2e)`, `rebateamount(usd)`, year)

Filter_EV

# A tibble: 132,604 × 6
   make  county transactiontype annualghgemissionsre…¹ `rebateamount(usd)`  year
   <chr> <chr>  <chr>                            <dbl>               <dbl> <dbl>
 1 Tesla <NA>   Purchase                         3.02                 2000  2020
 2 Chev… <NA>   Purchase                         3.02                 2000  2023
 3 Jeep  <NA>   Lease                            0.089                 500  2023
 4 Toyo… <NA>   Purchase                         2.96                  500  2024
 5 Toyo… Albany Purchase                         2.96                 1100  2017
 6 Toyo… Albany Purchase                         2.96                 1100  2017
 7 Chev… Albany Purchase                         2.65                 1700  2017
 8 Chev… Albany Purchase                         2.65                 1700  2017
 9 Toyo… Albany Purchase                         2.96                 1100  2017
10 Toyo… Albany Purchase                         2.96                 1100  2017
# ℹ 132,594 more rows
# ℹ abbreviated name: ¹`annualghgemissionsreductions(mtco2e)`

Clean the variable names

names(Filter_EV) <- gsub("[(]", "_", names(Filter_EV))
names(Filter_EV) <- gsub("[)]", "", names(Filter_EV))
head(Filter_EV)

# A tibble: 6 × 6
  make      county transactiontype annualghgemissionsre…¹ rebateamount_usd  year
  <chr>     <chr>  <chr>                            <dbl>            <dbl> <dbl>
1 Tesla     <NA>   Purchase                         3.02              2000  2020
2 Chevrolet <NA>   Purchase                         3.02              2000  2023
3 Jeep      <NA>   Lease                            0.089              500  2023
4 Toyota    <NA>   Purchase                         2.96               500  2024
5 Toyota    Albany Purchase                         2.96              1100  2017
6 Toyota    Albany Purchase                         2.96              1100  2017
# ℹ abbreviated name: ¹annualghgemissionsreductions_mtco2e

names(Filter_EV)

[1] "make"                                "county"                             
[3] "transactiontype"                     "annualghgemissionsreductions_mtco2e"
[5] "rebateamount_usd"                    "year"
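
The name cleaning in this project happens in two places (lower-casing and removing spaces early on, then handling the parentheses here). A minimal sketch of how the same four steps could be wrapped in one helper and applied once, right after reading the file; `clean_header` is a hypothetical helper name, and the capitalisation of "Rebate Amount (USD)" is assumed from the cleaned name shown above:

```r
# Hypothetical helper that applies the same four steps used in this project
clean_header <- function(x) {
  x <- tolower(x)          # lower case
  x <- gsub(" ", "", x)    # drop spaces
  x <- gsub("[(]", "_", x) # "(" becomes "_"
  x <- gsub("[)]", "", x)  # drop ")"
  x
}

# Example raw headers (the second one is an assumed spelling)
raw_names <- c("Make", "Rebate Amount (USD)",
               "Annual GHG Emissions Reductions (MT CO2e)")

clean_names <- clean_header(raw_names)
clean_names
```

With names cleaned up front, later steps such as `select()` would not need backticks around column names.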

Plot of year versus rebate amount

ggplot(Filter_EV,aes(x = year, y = rebateamount_usd, color = make)) +
  geom_point(size= 4, alpha= 0.7) +
  #geom_line() +
  labs(x= "Year",
       y= "Rebate Amount (usd)",
       title = "Electric Vehicle Rebate Amount by Year",
       caption = "source: Drive Electric in New York State - NYSERDA") +
  theme_minimal(base_size = 14) +
  # Add the linear regression line
  geom_smooth(method = "lm", color= 'black', se= TRUE)

`geom_smooth()` using formula = 'y ~ x'

Correlation between Year and Rebate Amount

cor(Filter_EV$year, Filter_EV$rebateamount_usd)

[1] -0.4393679

cor.test(Filter_EV$year, Filter_EV$rebateamount_usd)


    Pearson's product-moment correlation

data:  Filter_EV$year and Filter_EV$rebateamount_usd
t = -178.11, df = 132602, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.4437010 -0.4350143
sample estimates:
       cor 
-0.4393679

Looking for the Linear Regression Equation

Eq <- lm(year ~ rebateamount_usd, data= Filter_EV)
summary(Eq)


Call:
lm(formula = year ~ rebateamount_usd, data = Filter_EV)

Residuals:
    Min      1Q  Median      3Q     Max 
-5.2822 -0.4345  0.3337  0.7178  3.5655 

Coefficients:
                   Estimate Std. Error  t value Pr(>|t|)    
(Intercept)       2.023e+03  7.962e-03 254070.9   <2e-16 ***
rebateamount_usd -1.232e-03  6.916e-06   -178.1   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.561 on 132602 degrees of freedom
Multiple R-squared:  0.193, Adjusted R-squared:  0.193 
F-statistic: 3.172e+04 on 1 and 132602 DF,  p-value: < 2.2e-16

Linear Regression Equation

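year = -0.001232(rebateamount_usd) + 2023

The model was fit as year ~ rebateamount_usd, so year is the response: every additional 1,000 USD of rebate is associated with a submitted year about 1.2 years earlier.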

What I get from the correlation analysis

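This analysis shows that there’s a moderate and statistically significant negative correlation (r = -0.44, p < 2.2e-16) between the year and the rebate amount in USD. This means that as the years go by, the rebate amounts generally tend to decrease. The R-squared of 0.193 means that year accounts for only about 19% of the variation in rebate amount.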
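
The direction of the formula in `lm()` decides which variable is the response, so the slope changes meaning while the correlation and R-squared stay the same. A minimal sketch on synthetic data (the numbers are simulated under an assumed trend of about 100 USD less per year, not taken from the NYSERDA data) showing both directions:

```r
set.seed(1)

# Synthetic data: rebate falls by about 100 USD per year, plus noise
year   <- rep(2017:2024, each = 50)
rebate <- 2000 - 100 * (year - 2017) + rnorm(length(year), sd = 200)

fit_rebate <- lm(rebate ~ year)  # rebate as the response
fit_year   <- lm(year ~ rebate)  # year as the response

slope_rebate <- coef(fit_rebate)[["year"]]  # USD per year
slope_year   <- coef(fit_year)[["rebate"]]  # years per USD

# Both fits share the same R-squared, which equals the squared correlation
r <- cor(year, rebate)
c(r_squared = r^2,
  rebate_on_year = summary(fit_rebate)$r.squared,
  year_on_rebate = summary(fit_year)$r.squared)
```

The same relationship holds for the project's output: squaring the correlation of -0.4393679 gives about 0.193, the R-squared reported by `summary(Eq)`. To express the trend in USD per year, the formula would need rebate amount on the left-hand side.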

Let’s continue the analysis with some more plots

# Count the transaction types (purchase or lease) for each make.
Count_transactiontype <- Filter_EV |>
  group_by(make, transactiontype) |>
  summarise(count = n(), .groups = "drop")

# Plot the counts as side-by-side bars
plot2 <- ggplot(Count_transactiontype, aes(x = make, y = count, fill = transactiontype)) +
  labs(caption = "source: Drive Electric in New York State - NYSERDA") +
  geom_bar(stat = "identity", position = "dodge") +
  theme_dark()
plot2

# Scatterplot for each make to see the relation between rebate amount and annualghgemissionsreductions (mtco2e)
# (the plot object is named facet_plot because F is R's shorthand for FALSE)
facet_plot <- ggplot(Filter_EV, aes(x = rebateamount_usd, y = annualghgemissionsreductions_mtco2e, color = make)) +
  geom_point() +  # One point per rebate record
  facet_wrap(~make) +
  geom_smooth(method = "lm", color = 'black', se = TRUE) +
  labs(x = "rebateamount (usd)",
       y = "annualghgemissionsreductions (mtco2e)",
       title = "Relationship Between Rebate Amounts and Emissions Reductions",
       caption = "Source: Drive Electric in New York State - NYSERDA",
       color = "Manufacturer") +
  theme_minimal(base_size = 10) +
  scale_color_brewer(palette = "Set3") # Choose the color palette
facet_plot

`geom_smooth()` using formula = 'y ~ x'

ggplot(Filter_EV, aes(x = rebateamount_usd , y= annualghgemissionsreductions_mtco2e, color = make)) +
  geom_point() +  # Scatterplot of all makes together
  geom_smooth(method = "lm", color= 'black', se= TRUE) +
  labs( x = "rebateamount (usd)", 
        y= "annualghgemissionsreductions (mtco2e)", 
        title = "Relationship Between Rebate Amounts and Emissions Reductions", 
        caption = "Source: Drive Electric in New York State - NYSERDA", 
        color= "Manufacturer") +
  theme_minimal(base_size = 10) +
  scale_color_brewer(palette = "Accent")

`geom_smooth()` using formula = 'y ~ x'

Final Plot

Graph <- Filter_EV |>
#Creates bar graph with x axis and legend
  ggplot() +
  geom_bar(aes(x = transactiontype, fill = make)) +
#Changes the names for the axis, legend, title, and caption
  labs(x = "Transaction Type",
       y = "Number of Cars",
       fill = "Manufacturer",
       title = "Transaction type and the manufacturer of the car",
       caption = "Source: Drive Electric in New York State - NYSERDA") +
#Changes the theme to minimal
  theme_minimal(base_size = 10) +
#Changes the color palette to "Pastel1"
  scale_fill_brewer(palette = "Pastel1")

ggplotly(Graph)

Final essay

As I wrap up this project, I truly enjoyed working with this dataset. As someone who appreciates electric vehicles, I found it fascinating to explore and learn about the trends surrounding them. Creating visualizations to make these insights accessible and impactful added an exciting dimension to the analysis.

In my regression analysis, there’s a moderate and statistically significant negative correlation between the year and the rebate amount in USD. This means that as the years go by, the rebate amounts generally tend to decrease. Manufacturers like Tesla are leading the EV market in terms of emissions reductions. I also discovered that the majority of cars are purchased rather than leased, which could suggest that people prefer owning their cars. Tesla dominates both the purchase category and the lease category. While Jeep is the second most leased make in NY, it is the least purchased of the makes I selected.

I wanted to do more with this dataset, like creating a map to gain more insight and see the makes across each county. Working on this project, I had a lot of difficulties, but one thing I still didn’t figure out was how to show a caption when using plotly.

As someone who cares about the environment and plans to get a car soon, I think people should keep learning about electric vehicles to find the ones that work best for them. Switching to EVs can help reduce air pollution and fight climate change. By making smart choices and supporting new ideas in the EV industry, we can help create cleaner air and a better future for everyone.
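
On the caption question raised in the essay: `ggplotly()` does not appear to carry over the `caption` set in `labs()`. One common workaround is to add the caption as a text annotation with plotly's `layout()`. This is a minimal sketch on made-up data (`toy` and its values are assumptions, and the `y` offset and bottom margin are guesses that may need adjusting for a real figure):

```r
library(ggplot2)
library(plotly)

# Toy data standing in for Filter_EV (values are made up)
toy <- data.frame(
  transactiontype = c("Purchase", "Purchase", "Lease"),
  make            = c("Tesla", "Toyota", "Jeep")
)

p <- ggplot(toy) +
  geom_bar(aes(x = transactiontype, fill = make)) +
  labs(x = "Transaction Type", y = "Number of Cars", fill = "Manufacturer")

# Add the caption as a text annotation placed below the plot area
fig <- ggplotly(p) |>
  layout(
    margin = list(b = 90),  # extra bottom margin so the text is not clipped
    annotations = list(
      text = "Source: Drive Electric in New York State - NYSERDA",
      xref = "paper", yref = "paper",  # coordinates relative to the plot area
      x = 1, y = -0.25,
      xanchor = "right", yanchor = "top",
      showarrow = FALSE
    )
  )

# Print `fig` in an interactive session or a rendered document to display it
```

Using "paper" coordinates keeps the text anchored to the plot area rather than to data values, so it stays in place when the plot is zoomed.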