Introduction

With fossil fuels becoming scarcer and alternative energy on the rise, electric vehicles have recently become a trending topic. In some parts of the United States, electric vehicles have been in use for a long time. The data used here is taken from https://catalog.data.gov/dataset/electric-vehicle-population-data.

Data Preprocessing

Import Dataset

The data comes as a CSV file, so we import it with read.csv().

# Read Data
EV_Population <- read.csv("Electric_Vehicle_Population_Data.csv")

Data Inspection

We use head() to see the contents of the data at a glance.

# Inspection data
head(EV_Population)

To see the data structure, we use glimpse() from the tidyverse library.

library(tidyverse)
# Check data
glimpse(EV_Population)
#> Rows: 135,038
#> Columns: 17
#> $ VIN..1.10.                                        <chr> "5YJ3E1EA0K", "1N4BZ…
#> $ County                                            <chr> "Thurston", "Island"…
#> $ City                                              <chr> "Tumwater", "Clinton…
#> $ State                                             <chr> "WA", "WA", "WA", "W…
#> $ Postal.Code                                       <int> 98512, 98236, 98290,…
#> $ Model.Year                                        <int> 2019, 2022, 2020, 20…
#> $ Make                                              <chr> "TESLA", "NISSAN", "…
#> $ Model                                             <chr> "MODEL 3", "LEAF", "…
#> $ Electric.Vehicle.Type                             <chr> "Battery Electric Ve…
#> $ Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility <chr> "Clean Alternative F…
#> $ Electric.Range                                    <int> 220, 0, 266, 322, 20…
#> $ Base.MSRP                                         <int> 0, 0, 0, 0, 69900, 0…
#> $ Legislative.District                              <int> 22, 10, 44, 11, 21, …
#> $ DOL.Vehicle.ID                                    <int> 242565116, 183272785…
#> $ Vehicle.Location                                  <chr> "POINT (-122.9131016…
#> $ Electric.Utility                                  <chr> "PUGET SOUND ENERGY …
#> $ X2020.Census.Tract                                <dbl> 53067010910, 5302997…

The data has 17 columns and 135,038 rows and comes from https://catalog.data.gov/dataset/electric-vehicle-population-data. The columns are explained below:

- VIN (1-10) = The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).
- County = This is the geographic region of a state that a vehicle's owner is listed to reside within. Vehicles registered in Washington state may be located in other states.
- City = The city in which the registered owner resides.
- State = This is the geographic region of the country associated with the record. These addresses may be located in other states.
- Postal Code = The 5 digit zip code in which the registered owner resides.
- Model Year = The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Make = The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Model = The model of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Electric Vehicle Type = This distinguishes the vehicle as all electric or a plug-in hybrid.
- Clean Alternative Fuel Vehicle (CAFV) Eligibility = This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session.
- Electric Range = Describes how far a vehicle can travel purely on its electric charge.
- Base MSRP = This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.
- Legislative District = The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.
- DOL Vehicle ID = Unique number assigned to each vehicle by Department of Licensing for identification purposes.
- Vehicle Location = The center of the ZIP Code for the registered vehicle.
- Electric Utility = This is the electric power retail service territories serving the address of the registered vehicle.
- 2020 Census Tract = The census tract identifier is a combination of the state, county, and census tract codes as assigned by the United States Census Bureau in the 2020 census, also known as Geographic Identifier (GEOID).
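The column names in the glimpse() output differ from the names in this list because read.csv() converts headers into syntactically valid R names by default. A minimal sketch of that behaviour, using a made-up one-row CSV passed through the `text` argument (the row values are illustrative only, and `csv_text`, `default_names` and `raw_names` are names chosen for this example):

```r
# Hypothetical one-row CSV that mimics three of the dataset's headers
csv_text <- "VIN (1-10),Model Year,2020 Census Tract\n5YJ3E1EA0K,2019,53067010910"

# Default: check.names = TRUE rewrites the headers with make.names()
default_names <- names(read.csv(text = csv_text))
print(default_names)

# check.names = FALSE keeps the headers exactly as written in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))
print(raw_names)
```

With the default, spaces and parentheses become dots and a name that starts with a digit gets an `X` prefix, which is why the code in this analysis refers to `VIN..1.10.` and `X2020.Census.Tract`.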

Cleaning Data

For this case we do not use all columns. We keep only the columns relevant to the questions we want to answer.

# Delete the columns that are not needed
EV_Population_Clean <-  select(EV_Population,
       -State,
       -Postal.Code,
       -Legislative.District,
       -Vehicle.Location,
       -X2020.Census.Tract,
       -Electric.Vehicle.Type,
       -Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility )

Check for missing values in all columns

# Find missing values
colSums(is.na(EV_Population_Clean))
#>       VIN..1.10.           County             City       Model.Year 
#>                0                0                0                0 
#>             Make            Model   Electric.Range        Base.MSRP 
#>                0                0                1                1 
#>   DOL.Vehicle.ID Electric.Utility 
#>                0                0

In this data, Electric.Range and Base.MSRP each have one missing value. We fill them with 0.

# Treat missing values by replacing them with 0
EV_Population_Clean[is.na(EV_Population_Clean)] <- 0
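To make the effect of this step concrete, here is a small base R sketch on a made-up data frame (`toy` and its values are hypothetical, not rows from the dataset). It counts the missing values per column, fills them with 0, and compares the mean after filling with the mean that simply skips the missing value:

```r
# Toy data frame with one missing value per column (values are made up)
toy <- data.frame(Electric.Range = c(220, NA, 266),
                  Base.MSRP      = c(0, 69900, NA))

# Count missing values per column, as done for EV_Population_Clean
na_counts <- colSums(is.na(toy))

# Replace every NA in the data frame with 0
toy_filled <- toy
toy_filled[is.na(toy_filled)] <- 0

# Filling with 0 pulls the mean down compared with skipping the NA
mean_skip_na <- mean(toy$Electric.Range, na.rm = TRUE)  # (220 + 266) / 2 = 243
mean_filled  <- mean(toy_filled$Electric.Range)         # (220 + 0 + 266) / 3 = 162
```

In the toy example the difference is large because one value in three is missing. With a single missing value per column in 135,038 rows, the effect on the real data is negligible.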

Change Data Type

In this case we need to change some data types. Converting columns to the correct data type saves memory and enables data manipulation with functions designed for each type.

# Change data types
EV_Population_Clean <- mutate(EV_Population_Clean,
                              County = as.factor(County),
                              City = as.factor(City),
                              Make = as.factor(Make),
                              Model = as.factor(Model))
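The memory argument can be checked on a small scale. The sketch below builds a made-up character vector with many repeats of three values (it is not the real Make column) and compares its size with the factor version. The exact byte counts depend on the platform, so only the comparison is meaningful:

```r
# Made-up character column with many repeats of a few values
set.seed(1)
make_chr <- sample(c("TESLA", "NISSAN", "KIA"), size = 1e5, replace = TRUE)
make_fct <- as.factor(make_chr)

# A factor stores integer codes plus one copy of each distinct label
print(levels(make_fct))
size_chr <- object.size(make_chr)
size_fct <- object.size(make_fct)
print(c(character = size_chr, factor = size_fct))

# Functions such as table() and summary() report counts per level
counts <- table(make_fct)
```

This is also why summary() can list the most frequent counties, cities, makes and models once those columns are factors.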

Exploratory Data

We use summary() to extract the basic statistical information of each column in our EV_Population_Clean data frame.

# Check summary data
summary(EV_Population_Clean)
#>   VIN..1.10.              County             City         Model.Year  
#>  Length:135038      King     :70842   Seattle  :23489   Min.   :1997  
#>  Class :character   Snohomish:15258   Bellevue : 6960   1st Qu.:2018  
#>  Mode  :character   Pierce   :10410   Redmond  : 4965   Median :2021  
#>                     Clark    : 7997   Vancouver: 4819   Mean   :2020  
#>                     Thurston : 4851   Kirkland : 4201   3rd Qu.:2022  
#>                     Kitsap   : 4461   Bothell  : 4196   Max.   :2024  
#>                     (Other)  :21219   (Other)  :86408                 
#>         Make           Model       Electric.Range     Base.MSRP     
#>  TESLA    :61808   MODEL 3:25837   Min.   :  0.00   Min.   :     0  
#>  NISSAN   :13150   MODEL Y:23577   1st Qu.:  0.00   1st Qu.:     0  
#>  CHEVROLET:11437   LEAF   :13020   Median : 21.00   Median :     0  
#>  FORD     : 6897   MODEL S: 7473   Mean   : 74.59   Mean   :  1448  
#>  BMW      : 5895   BOLT EV: 5419   3rd Qu.:150.00   3rd Qu.:     0  
#>  KIA      : 5491   VOLT   : 4881   Max.   :337.00   Max.   :845000  
#>  (Other)  :30360   (Other):54831                                    
#>  DOL.Vehicle.ID      Electric.Utility  
#>  Min.   :     4385   Length:135038     
#>  1st Qu.:160630473   Class :character  
#>  Median :205956344   Mode  :character  
#>  Mean   :206343192                     
#>  3rd Qu.:230888831                     
#>  Max.   :479254772                     
#> 

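From the data summary, the mean electric range is 74.59 and the mean base MSRP is 1448. The quartiles show that many records hold a 0 in these columns: the first quartile of the electric range is 0, and the third quartile of the base MSRP is 0, so at least a quarter of the records have a range of 0 and at least three quarters have a base MSRP of 0.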

Distribution Analysis

The summary shows that King County and the city of Seattle have the highest counts of electric vehicles. We can use bar charts to look at the distribution by county and by city.

# Aggregate vehicle counts by county and keep the top 10
EV_County <- EV_Population_Clean %>% 
            group_by(County) %>% 
            summarise(count_make = n()) %>% 
            arrange(desc(count_make)) %>% 
            head(10)
# Library for plot
library(ggplot2)
# Make plot from EV_County
ggplot(data = EV_County, mapping = aes(x = County, y = count_make)) +
  geom_col()

# Aggregate vehicle counts by city and keep the top 10
EV_city <- EV_Population_Clean %>%  
            group_by(City) %>% 
            summarise(count_mk = n()) %>% 
            arrange(desc(count_mk)) %>% 
            head(10)
# Make plot from EV_city
ggplot(data = EV_city, mapping = aes(x = City, y = count_mk)) +
  geom_col()

From the bar chart above, the city of Seattle has the largest population of electric vehicles.
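The count-and-keep-the-top step can also be sketched in base R, which makes the ordering explicit. The `city` vector and the `top` data frame below are made up for illustration. reorder() is a base R function that sorts factor levels by a second variable, and a plot that maps such a factor to the x axis follows that level order:

```r
# Made-up city column; the real column is EV_Population_Clean$City
city <- c("Seattle", "Bellevue", "Seattle", "Redmond", "Seattle", "Bellevue")

# Count rows per city, sort descending, keep the top 2
city_counts <- sort(table(city), decreasing = TRUE)
top_cities  <- head(city_counts, 2)
print(top_cities)

# Order the factor levels by descending count instead of alphabetically
top <- data.frame(City = c("Bellevue", "Redmond", "Seattle"),
                  count_mk = c(2, 1, 3))
ordered_city <- reorder(top$City, -top$count_mk)
print(levels(ordered_city))
```

In the ggplot2 calls, the same idea would be written as aes(x = reorder(City, -count_mk), y = count_mk), so that the bars run from the largest count to the smallest.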

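# Aggregate the mean range and mean price per make
# The rounding wraps mean(): in mean(x, round(2)) the 2 is taken as `trim`,
# which makes mean() return the median. The results printed further down came
# from that earlier form, so they can change when this code is rerun.
EV_mean_er_msrp <- EV_Population_Clean %>%
        filter(Base.MSRP > 0) %>%
        group_by(Make) %>%
        summarise(avg_range = round(mean(Electric.Range), 2),
                  avg_price = round(mean(Base.MSRP), 2)) %>%
        arrange(-avg_price)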
# Make plot data from EV_mean_er_msrp
ggplot(data = EV_mean_er_msrp, mapping = aes(x = avg_price, y = avg_range)) +
  geom_point()

From the scatter plot above, the highest average electric range per make is around 200 and the highest average base MSRP per make is around 100,000.
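One detail of mean() matters for this aggregation: its second positional argument is `trim`, not a number of digits. A value such as round(2) evaluates to 2, and any `trim` of 0.5 or more makes mean() return the median. A short sketch on a made-up skewed sample (`x` is hypothetical) shows the difference:

```r
# Made-up skewed sample: the mean and the median differ clearly
x <- c(1, 2, 3, 4, 100)

# round(2) evaluates to 2 and is matched to the `trim` argument;
# any trim >= 0.5 makes mean() return the median
mean_with_trim <- mean(x, round(2))

# Rounding the result of mean() instead gives the rounded arithmetic mean
mean_rounded <- round(mean(x), 2)

print(c(trimmed = mean_with_trim, median = median(x), rounded_mean = mean_rounded))
```

For skewed columns such as Base.MSRP, the two forms can give quite different per-make values.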

Explanatory Data

Correlation

We can check the correlation between the average electric range and the average base MSRP.

# Check correlation
cor(EV_mean_er_msrp$avg_range, EV_mean_er_msrp$avg_price)
#> [1] -0.008804457

The correlation between the average electric range and the average base MSRP is negative but very close to zero (about -0.009), so there is practically no linear relationship between the two.
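For reference, cor() computes the Pearson correlation by default, which is the covariance divided by the product of the two standard deviations. The sketch below checks this on made-up per-make averages (the numbers are illustrative and not taken from the dataset):

```r
# Made-up per-make averages (not taken from the dataset)
avg_price <- c(30000, 45000, 60000, 80000, 110000)
avg_range <- c(90, 40, 150, 60, 120)

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual  <- cov(avg_range, avg_price) / (sd(avg_range) * sd(avg_price))
r_builtin <- cor(avg_range, avg_price)

print(c(manual = r_manual, builtin = r_builtin))
```

If a significance test is wanted as well, cor.test() from the stats package reports a p-value and a confidence interval for the same coefficient.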

Probability Mass Function

# Probability of each make, in percent
prop.table(table(EV_Population_Clean$Make))*100
#> 
#>                 AUDI       AZURE DYNAMICS              BENTLEY 
#>          2.019431567          0.005924258          0.002221597 
#>                  BMW             CADILLAC            CHEVROLET 
#>          4.365437877          0.088863875          8.469467853 
#>             CHRYSLER                 FIAT               FISKER 
#>          1.656570743          0.597609562          0.011107984 
#>                 FORD              GENESIS                HONDA 
#>          5.107451236          0.049615664          0.582798916 
#>              HYUNDAI               JAGUAR                 JEEP 
#>          1.773574846          0.164398169          1.936491950 
#>                  KIA           LAND ROVER                LEXUS 
#>          4.066262830          0.031842889          0.059242584 
#>              LINCOLN                LUCID                MAZDA 
#>          0.156992846          0.104415054          0.008886388 
#>        MERCEDES-BENZ                 MINI           MITSUBISHI 
#>          0.575393593          0.549474963          0.553918156 
#>               NISSAN             POLESTAR              PORSCHE 
#>          9.737999674          0.487270250          0.704986744 
#>               RIVIAN                SMART               SUBARU 
#>          1.337401324          0.205127446          0.201424784 
#>                TESLA                TH!NK               TOYOTA 
#>         45.770820065          0.002962129          3.616019195 
#>           VOLKSWAGEN                VOLVO WHEEGO ELECTRIC CARS 
#>          2.611116871          2.385254521          0.002221597

From the table above, TESLA has the highest probability, at about 45.8%.
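The table is built in two steps: table() counts the rows per make, and prop.table() divides each count by the total. A minimal sketch on a made-up vector of four vehicles (`make` is hypothetical):

```r
# Made-up Make column with four vehicles
make <- c("TESLA", "TESLA", "NISSAN", "KIA")

# Counts per make, then each count divided by the total, in percent
share_pct <- prop.table(table(make)) * 100
print(share_pct)

# Name of the make with the largest share
top_make <- names(which.max(share_pct))
print(top_make)
```

Because the shares are relative frequencies, they always add up to 100.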

Confidence Interval

Confidence Interval for Electric Range

If someone wants to buy an electric vehicle, what electric range is recommended?

# Mean
mean_range <- mean(EV_Population_Clean$Electric.Range)
mean_range
#> [1] 74.59141
# Standard deviation
sd_range <- sd(EV_Population_Clean$Electric.Range)
sd_range
#> [1] 98.74396
# Sample size n: rows with both a positive electric range and a positive base MSRP
n_range_msrp <- EV_Population_Clean %>% filter(Electric.Range > 0 & Base.MSRP > 0)
dim(n_range_msrp)
#> [1] 3425   10

SE = sd / sqrt(n)

SE <- sd_range / sqrt(nrow(n_range_msrp))
SE
#> [1] 1.687253

With a 95% confidence level, alpha (the error rate) is 5%. The critical value is Z_alpha_half = qnorm(alpha/2, lower.tail = FALSE):

Z_alpha_half <- qnorm(0.05/2, lower.tail = FALSE)
Z_alpha_half
#> [1] 1.959964

Lower limit

lower_limit <- mean_range - Z_alpha_half * SE
lower_limit
#> [1] 71.28446

Upper limit

upper_limit <- mean_range + Z_alpha_half * SE
upper_limit
#> [1] 77.89837

Using the 95% confidence interval, the average electric range is estimated to lie between 71.28 and 77.90.
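The steps above can be wrapped in one small function. This is only a sketch: `ci_mean_summary` is a hypothetical helper name, and it uses the same normal approximation as the calculation above. Fed with the mean, standard deviation and n printed in this section, it gives the same limits:

```r
# Normal-approximation confidence interval for a mean from summary statistics.
# ci_mean_summary is a hypothetical helper name, not part of any package.
ci_mean_summary <- function(mean_x, sd_x, n, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z     <- qnorm(alpha / 2, lower.tail = FALSE)  # critical value
  se    <- sd_x / sqrt(n)                        # standard error of the mean
  c(lower = mean_x - z * se, upper = mean_x + z * se)
}

# Mean, standard deviation and n as printed above for Electric.Range
ci_range <- ci_mean_summary(mean_x = 74.59141, sd_x = 98.74396, n = 3425)
print(round(ci_range, 2))
```

The interval is easiest to interpret when the mean, the standard deviation and n all describe the same set of rows. Here n counts only the rows kept by the filter, while the mean and standard deviation were computed over all rows of EV_Population_Clean.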

Confidence Interval for Base MSRP

If someone wants to buy an electric vehicle, what is the confidence interval for the price, based on the base MSRP?

# Mean base MSRP
mean_msrp <- mean(EV_Population_Clean$Base.MSRP)
mean_msrp
#> [1] 1448.397
# Standard deviation
sd_msrp <- sd(EV_Population_Clean$Base.MSRP)
sd_msrp
#> [1] 9683.623

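SE = sd / sqrt(n), this time with the standard deviation of the base MSRP (sd_msrp) rather than sd_range

SE_msrp <- sd_msrp / sqrt(3425)

With sd_msrp = 9683.623 and n = 3425, SE_msrp is about 165.47.

With a 95% confidence level, alpha (the error rate) is 5%. The critical value is Z_alpha_half = qnorm(alpha/2, lower.tail = FALSE):

Z_alpha_half <- qnorm(0.05/2, lower.tail = FALSE)
Z_alpha_half
#> [1] 1.959964

Lower limit

lower_limit_msrp <- mean_msrp - Z_alpha_half * SE_msrp

Upper limit

upper_limit_msrp <- mean_msrp + Z_alpha_half * SE_msrp

From the values printed above, the margin is about 1.959964 * 165.47 = 324.3, so the limits work out to roughly 1448.397 - 324.3 = 1124 and 1448.397 + 324.3 = 1773. Using the 95% confidence interval, the average base MSRP is estimated to lie between about 1124 and 1773.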

Conclusion

  • King County is the county with the most electric vehicles
  • Seattle is the city with the most electric vehicles
  • TESLA is the most common make overall, at about 45.8% of the electric vehicles in the data