With fossil fuels becoming scarcer and alternative energy on the rise, electric vehicles have recently become a trending topic. In some parts of the United States, electric vehicles have been in use for a long time. The data is taken from https://catalog.data.gov/dataset/electric-vehicle-population-data.
We can import the data; the file type we use is CSV.
# Read data
EV_Population <- read.csv("Electric_Vehicle_Population_Data.csv")
We use head() to see the contents of the data at a glance.
# Inspect data
head(EV_Population)
To see the data structure we use the tidyverse library.
library(tidyverse)
# Check data
glimpse(EV_Population)
#> Rows: 135,038
#> Columns: 17
#> $ VIN..1.10. <chr> "5YJ3E1EA0K", "1N4BZ…
#> $ County <chr> "Thurston", "Island"…
#> $ City <chr> "Tumwater", "Clinton…
#> $ State <chr> "WA", "WA", "WA", "W…
#> $ Postal.Code <int> 98512, 98236, 98290,…
#> $ Model.Year <int> 2019, 2022, 2020, 20…
#> $ Make <chr> "TESLA", "NISSAN", "…
#> $ Model <chr> "MODEL 3", "LEAF", "…
#> $ Electric.Vehicle.Type <chr> "Battery Electric Ve…
#> $ Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility <chr> "Clean Alternative F…
#> $ Electric.Range <int> 220, 0, 266, 322, 20…
#> $ Base.MSRP <int> 0, 0, 0, 0, 69900, 0…
#> $ Legislative.District <int> 22, 10, 44, 11, 21, …
#> $ DOL.Vehicle.ID <int> 242565116, 183272785…
#> $ Vehicle.Location <chr> "POINT (-122.9131016…
#> $ Electric.Utility <chr> "PUGET SOUND ENERGY …
#> $ X2020.Census.Tract <dbl> 53067010910, 5302997…
This data has 17 columns and 135,038 rows. The data comes from https://catalog.data.gov/dataset/electric-vehicle-population-data. The columns are explained below:
- VIN (1-10) = The 1st 10 characters of each vehicle's Vehicle Identification Number (VIN).
- County = This is the geographic region of a state that a vehicle's owner is listed to reside within. Vehicles registered in Washington state may be located in other states.
- City = The city in which the registered owner resides.
- State = This is the geographic region of the country associated with the record. These addresses may be located in other states.
- Postal Code = The 5 digit zip code in which the registered owner resides.
- Model Year = The model year of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Make = The manufacturer of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Model = The model of the vehicle, determined by decoding the Vehicle Identification Number (VIN).
- Electric Vehicle Type = This distinguishes the vehicle as all electric or a plug-in hybrid.
- Clean Alternative Fuel Vehicle (CAFV) Eligibility = This categorizes vehicle as Clean Alternative Fuel Vehicles (CAFVs) based on the fuel requirement and electric-only range requirement in House Bill 2042 as passed in the 2019 legislative session.
- Electric Range = Describes how far a vehicle can travel purely on its electric charge.
- Base MSRP = This is the lowest Manufacturer's Suggested Retail Price (MSRP) for any trim level of the model in question.
- Legislative District = The specific section of Washington State that the vehicle's owner resides in, as represented in the state legislature.
- DOL Vehicle ID = Unique number assigned to each vehicle by Department of Licensing for identification purposes.
- Vehicle Location = The center of the ZIP Code for the registered vehicle.
- Electric Utility = This is the electric power retail service territory serving the address of the registered vehicle.
- 2020 Census Tract = The census tract identifier is a combination of the state, county, and census tract codes as assigned by the United States Census Bureau in the 2020 census, also known as Geographic Identifier (GEOID).
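The column names in the glimpse() output differ from the labels in this list because read.csv() by default passes the header through make.names() (its check.names argument defaults to TRUE). A minimal sketch of that conversion, using three of the labels above; the vector name raw_labels is only illustrative:

```r
# Labels as they appear in the column list above
raw_labels <- c("VIN (1-10)", "Postal Code", "2020 Census Tract")

# make.names() replaces characters that are not valid in an R name with "."
# and prefixes names that start with a digit with "X"
r_names <- make.names(raw_labels)
print(r_names)
#> [1] "VIN..1.10."         "Postal.Code"        "X2020.Census.Tract"
```

Passing check.names = FALSE to read.csv() would keep the original labels, at the cost of needing backticks to refer to them.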
For this case we do not use all the columns; we keep only the columns relevant to the questions we want to answer.
# Drop unused columns
EV_Population_Clean <- select(EV_Population,
                              -State,
                              -Postal.Code,
                              -Legislative.District,
                              -Vehicle.Location,
                              -X2020.Census.Tract,
                              -Electric.Vehicle.Type,
                              -Clean.Alternative.Fuel.Vehicle..CAFV..Eligibility)
Check for missing values across all the data.
# Find missing values
colSums(is.na(EV_Population_Clean))
#>       VIN..1.10.           County             City       Model.Year
#>                0                0                0                0
#>             Make            Model   Electric.Range        Base.MSRP
#>                0                0                1                1
#>   DOL.Vehicle.ID Electric.Utility
#>                0                0
In this data, Electric.Range and Base.MSRP each have one missing value. We can fill them with 0.
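Before applying this to the real data, the check-and-fill pattern can be tried on a toy data frame. This is only a sketch: the data frame toy and its values are made up for the example and are not taken from the dataset.

```r
# Toy data frame with one missing value in each numeric column
toy <- data.frame(
  Make = c("TESLA", "NISSAN", "KIA"),
  Electric.Range = c(220, NA, 239),
  Base.MSRP = c(0, 0, NA)
)

# Count missing values per column
na_before <- colSums(is.na(toy))
print(na_before)

# Replace every NA in the data frame with 0
toy[is.na(toy)] <- 0

# Confirm that no missing values remain
na_after <- colSums(is.na(toy))
print(na_after)
```

One thing to keep in mind with this approach is that a filled-in 0 can no longer be told apart from a 0 that was recorded in the source.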
# Treat missing values
EV_Population_Clean[is.na(EV_Population_Clean)] <- 0
In this case we must also change the data types. Converting columns to the correct data type saves memory and enables data manipulation with functions designed for each data type.
# Change data types
EV_Population_Clean <- mutate(EV_Population_Clean,
                              County = as.factor(County),
                              City = as.factor(City),
                              Make = as.factor(Make),
                              Model = as.factor(Model))
We use summary() to extract basic statistical information for each column in our EV_Population_Clean data frame.
# Check summary data
summary(EV_Population_Clean)
#>     VIN..1.10.          County             City          Model.Year
#>  Length:135038      King     :70842   Seattle  :23489   Min.   :1997
#>  Class :character   Snohomish:15258   Bellevue : 6960   1st Qu.:2018
#>  Mode  :character   Pierce   :10410   Redmond  : 4965   Median :2021
#>                     Clark    : 7997   Vancouver: 4819   Mean   :2020
#>                     Thurston : 4851   Kirkland : 4201   3rd Qu.:2022
#>                     Kitsap   : 4461   Bothell  : 4196   Max.   :2024
#>                     (Other)  :21219   (Other)  :86408
#>       Make             Model       Electric.Range     Base.MSRP
#>  TESLA    :61808   MODEL 3:25837   Min.   :  0.00   Min.   :     0
#>  NISSAN   :13150   MODEL Y:23577   1st Qu.:  0.00   1st Qu.:     0
#>  CHEVROLET:11437   LEAF   :13020   Median : 21.00   Median :     0
#>  FORD     : 6897   MODEL S: 7473   Mean   : 74.59   Mean   :  1448
#>  BMW      : 5895   BOLT EV: 5419   3rd Qu.:150.00   3rd Qu.:     0
#>  KIA      : 5491   VOLT   : 4881   Max.   :337.00   Max.   :845000
#>  (Other)  :30360   (Other):54831
#>   DOL.Vehicle.ID     Electric.Utility
#>  Min.   :     4385   Length:135038
#>  1st Qu.:160630473   Class :character
#>  Median :205956344   Mode  :character
#>  Mean   :206343192
#>  3rd Qu.:230888831
#>  Max.   :479254772
#>
From the data summary we can see that the mean electric range is 74.59 and the mean base MSRP is 1448.
We can also see that Seattle is the city with the most electric vehicles. We can use bar charts to look at the distribution by county and by city.
# Aggregate data by county
EV_County <- EV_Population_Clean %>%
  group_by(County) %>%
  summarise(count_make = n()) %>%
  arrange(desc(count_make)) %>%
  head(10)
# Library for plotting
library(ggplot2)
# Plot EV_County
ggplot(data = EV_County, mapping = aes(x = County, y = count_make)) +
  geom_col()
# Aggregate data by city
EV_city <- EV_Population_Clean %>%
  group_by(City) %>%
  summarise(count_mk = n()) %>%
  arrange(desc(count_mk)) %>%
  head(10)
# Plot EV_city
ggplot(data = EV_city, mapping = aes(x = City, y = count_mk)) +
  geom_col()
From the bar chart above, the city of Seattle has the largest population of electric vehicles.
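The same kind of top-N count can be cross-checked without dplyr. A minimal base R sketch, assuming a small made-up vector of city names in place of EV_Population_Clean$City:

```r
# Made-up city values standing in for the City column
cities <- c("Seattle", "Bellevue", "Seattle", "Redmond", "Seattle", "Bellevue")

# table() counts each city; sort() puts the largest counts first
city_counts <- sort(table(cities), decreasing = TRUE)

# Keep the two most frequent cities
top_cities <- head(city_counts, 2)
print(top_cities)
```

On the real data, replacing cities with EV_Population_Clean$City and 2 with 10 should give the same ranking as EV_city.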
# Aggregate mean range and mean price by make
# round() wraps mean(): the second positional argument of mean() is trim, not a digit count
EV_mean_er_msrp <- EV_Population_Clean %>%
  filter(Base.MSRP > 0) %>%
  group_by(Make) %>%
  summarise(avg_range = round(mean(Electric.Range), 2),
            avg_price = round(mean(Base.MSRP), 2)) %>%
  arrange(desc(avg_price))
# Plot EV_mean_er_msrp
ggplot(data = EV_mean_er_msrp, mapping = aes(x = avg_price, y = avg_range)) +
  geom_point()
From the scatter plot above, we can conclude that the highest average electric range per make is about 200 and the highest average base MSRP is around 100,000.
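When rounding averages like these, it helps to remember that the second positional argument of mean() is trim, not a number of digits, and that a trim of 0.5 or more makes mean() return the median. A small sketch with made-up numbers showing the difference:

```r
# Made-up values with one large outlier
x <- c(1, 2, 3, 100)

# round(2) evaluates to 2 and is matched to trim; trim >= 0.5 returns the median
as_trim <- mean(x, round(2))

# Rounding the mean itself to two decimal places
rounded_mean <- round(mean(x), 2)

print(as_trim)
#> [1] 2.5
print(rounded_mean)
#> [1] 26.5
```

The two results only agree when the data are roughly symmetric, so the placement of round() matters for skewed columns such as prices.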
We can check the correlation between the average electric range and the average base MSRP.
# Check correlation
cor(EV_mean_er_msrp$avg_range, EV_mean_er_msrp$avg_price)
#> [1] -0.008804457
The correlation between the average electric range and the average base MSRP is negative but very close to zero, so the linear relationship is very weak.
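A coefficient this close to zero is hard to interpret from its sign alone. As a hedged sketch, cor.test() from base R's stats package reports a p-value and a confidence interval alongside the estimate. The vectors below are simulated for illustration and are not the per-make averages from the dataset:

```r
# Simulated stand-ins for avg_price and avg_range (30 hypothetical makes)
set.seed(42)
avg_price_sim <- runif(30, min = 30000, max = 110000)
avg_range_sim <- rnorm(30, mean = 100, sd = 50)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
result <- cor.test(avg_range_sim, avg_price_sim)
print(result$estimate)
print(result$p.value)
print(result$conf.int)
```

If the confidence interval spans zero, the data do not support either a positive or a negative linear relationship.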
# Share of each make (percent)
prop.table(table(EV_Population_Clean$Make)) * 100
#>
#> AUDI AZURE DYNAMICS BENTLEY
#> 2.019431567 0.005924258 0.002221597
#> BMW CADILLAC CHEVROLET
#> 4.365437877 0.088863875 8.469467853
#> CHRYSLER FIAT FISKER
#> 1.656570743 0.597609562 0.011107984
#> FORD GENESIS HONDA
#> 5.107451236 0.049615664 0.582798916
#> HYUNDAI JAGUAR JEEP
#> 1.773574846 0.164398169 1.936491950
#> KIA LAND ROVER LEXUS
#> 4.066262830 0.031842889 0.059242584
#> LINCOLN LUCID MAZDA
#> 0.156992846 0.104415054 0.008886388
#> MERCEDES-BENZ MINI MITSUBISHI
#> 0.575393593 0.549474963 0.553918156
#> NISSAN POLESTAR PORSCHE
#> 9.737999674 0.487270250 0.704986744
#> RIVIAN SMART SUBARU
#> 1.337401324 0.205127446 0.201424784
#> TESLA TH!NK TOYOTA
#> 45.770820065 0.002962129 3.616019195
#> VOLKSWAGEN VOLVO WHEEGO ELECTRIC CARS
#> 2.611116871 2.385254521 0.002221597
From the table above we can see that TESLA has the largest share, at about 45.8%.
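To read off the largest shares without scanning the whole table, the percentages can be sorted and rounded. A minimal sketch, using a made-up vector of makes in place of EV_Population_Clean$Make:

```r
# Made-up makes standing in for the Make column
makes <- c("TESLA", "TESLA", "NISSAN", "TESLA", "KIA", "NISSAN", "TESLA", "TESLA")

# Share of each make in percent
share_pct <- prop.table(table(makes)) * 100

# Largest share first, rounded to one decimal place
share_sorted <- round(sort(share_pct, decreasing = TRUE), 1)
print(share_sorted)
```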
Suppose someone wants to buy an electric vehicle: what electric range is recommended?
# Mean
mean_range <- mean(EV_Population_Clean$Electric.Range)
mean_range
#> [1] 74.59141
# Standard deviation
sd_range <- sd(EV_Population_Clean$Electric.Range)
sd_range
#> [1] 98.74396
# Rows with a non-zero electric range and base MSRP
n_range_msrp <- EV_Population_Clean %>% filter(Electric.Range > 0 & Base.MSRP > 0)
dim(n_range_msrp)
#> [1] 3425   10
Standard error: SE = sd / sqrt(n)
SE <- sd_range / sqrt(3425)
SE
#> [1] 1.687253
Confidence level 95%, so alpha (the error rate) is 5%.
alpha <- 0.05
Z_alpha_half <- qnorm(alpha / 2, lower.tail = FALSE)
Z_alpha_half
#> [1] 1.959964
Lower limit
lower_limit <- mean_range - Z_alpha_half * SE
lower_limit
#> [1] 71.28446
Upper limit
upper_limit <- mean_range + Z_alpha_half * SE
upper_limit
#> [1] 77.89837
Using the 95% confidence interval, we get an average electric range between about 71 and 78.
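Written out as a formula, the interval is the sample mean plus or minus the critical value times the standard error. Plugging in the values printed above (with n = 3425, as used in the SE step):

```latex
\bar{x} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}
  = 74.59141 \pm 1.959964 \times \frac{98.74396}{\sqrt{3425}}
  = 74.59141 \pm 1.959964 \times 1.687253
  \approx 74.59 \pm 3.31
  \;\Rightarrow\; (71.28,\ 77.90)
```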
Suppose someone wants to buy an electric vehicle: what is the confidence interval for the base MSRP?
# Mean base MSRP
mean_msrp <- mean(EV_Population_Clean$Base.MSRP)
mean_msrp
#> [1] 1448.397
# Standard deviation
sd_msrp <- sd(EV_Population_Clean$Base.MSRP)
sd_msrp
#> [1] 9683.623
Standard error: SE = sd / sqrt(n)
# Standard error of the base MSRP, from sd_msrp
SE_msrp <- sd_msrp / sqrt(3425)
SE_msrp
# approximately 165.47 (9683.623 / sqrt(3425))
Confidence level 95%, so alpha (the error rate) is 5%.
alpha <- 0.05
Z_alpha_half <- qnorm(alpha / 2, lower.tail = FALSE)
Z_alpha_half
#> [1] 1.959964
Lower limit
lower_limit_msrp <- mean_msrp - Z_alpha_half * SE_msrp
lower_limit_msrp
# approximately 1124.09
Upper limit
upper_limit_msrp <- mean_msrp + Z_alpha_half * SE_msrp
upper_limit_msrp
# approximately 1772.70
Using the 95% confidence interval, we get an average base MSRP between about 1124 and 1773.
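The two interval calculations follow the same steps, so they can be wrapped in one helper. The function below is a minimal sketch, not the code used above: the name mean_ci and its arguments are hypothetical, and it takes the mean, standard deviation, and n from the same vector, which is an assumption that differs from combining full-data statistics with the filtered n.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean
mean_ci <- function(x, conf_level = 0.95) {
  x <- x[!is.na(x)]                      # drop missing values
  n <- length(x)
  se <- sd(x) / sqrt(n)                  # standard error of the mean
  z <- qnorm((1 - conf_level) / 2, lower.tail = FALSE)
  c(lower = mean(x) - z * se,
    mean  = mean(x),
    upper = mean(x) + z * se)
}

# Made-up electric ranges, for illustration only
toy_range <- c(220, 266, 322, 208, 150, 84)
ci <- mean_ci(toy_range)
print(ci)
```

On the real data this could be called as mean_ci(n_range_msrp$Base.MSRP), under the assumption that rows with a zero MSRP represent prices that were not recorded and should be left out.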