There are two main goals in this project: - What variables have a noticeable impact on the actual count or percentage of EVs - What can the percentage or count of EVs be used to predict about states?
library(ggplot2)
library(tidyverse)
library(caret)
library(dplyr)
library(broom)
library(pROC)
The data that is being used in this project has a mix of variables that have varying relation to electric vehicles. These variables were collected in all 50 states each year from 2016 to 2023. However, depending on the state, small bits of data are missing in the earlier years, so most of the graphs use a version of this dataset with the missing values taken out.
evdataclean <- na.omit(evdata)
yeardata <- split(evdata, evdata$year)
statedata <- split(evdata, evdata$state)
The first area of inquiry was to see whether a high level of education has an impact on the percentage of EVs out of total cars in a state. For relevance, this graph only uses the data from 2023.
ggplot(yeardata[["2023"]], aes(Pbachelors, Pelectric)) +
geom_point() +
labs(x = "Percentage of People with a Bachelor's or Higher",
y = "Percentage of EVs in State",
title = "Percentage of Bachelors vs Percentage of EVs in States (2023)") +
theme(plot.title = element_text(hjust = 0.5))
Because there seemed to be a moderate correlation between the two variables, we can run a linear regression model to determine if high education has any statistical significance.
set.seed(123)
trainindex1 <- createDataPartition(evdata$Pelectric, p = 0.8, list = FALSE, times = 1)
evdataTrain1 <- evdataclean[ trainindex1,]
evdataTest1 <- evdataclean[-trainindex1,]
model1 <- lm(Pelectric ~ Pbachelors, data = evdataTrain1)
summary(model1)
Call:
lm(formula = Pelectric ~ Pbachelors, data = evdataTrain1)
Residuals:
Min 1Q Median 3Q Max
-0.56474 -0.31654 -0.13190 0.06994 2.70545
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.797042 0.292007 -2.730 0.00737 **
Pbachelors 0.043613 0.008931 4.883 3.49e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5129 on 112 degrees of freedom
(214 observations deleted due to missingness)
Multiple R-squared: 0.1755, Adjusted R-squared: 0.1682
F-statistic: 23.85 on 1 and 112 DF, p-value: 3.494e-06
The R-squared value seems to hover around 0.17, meaning that around 17% of the variation in the EV percentage can be predicted using the percentage of people with a bachelor’s or higher.
Because the p-value is also far under 0.05, it is likely that the percentage of people with a bachelor’s or higher has some sort of relation with the EV percentage in a state.
evdataTest1 <- evdataTest1 %>%
mutate(PredictedPelec = predict(model1, newdata = evdataTest1))
ggplot(evdataTest1, aes(PredictedPelec, Pelectric)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Predicted Percentage of EVs in State",
y = "Percentage of EVs in State",
title = "Predicted vs Actual Percentage of EVs in States (2023)") +
scale_y_continuous(labels = scales::label_number())
We can also turn this into a graph that compares the actual percentage of EVs in a state with the predicted percentage of EVs. In this case, the red line would be the target, where we want the predicted percentage to be exactly equal to the actual percentage.
summary_table1 <- evdataTest1 %>%
select(PredictedPelec, Pelectric) %>%
summary()
print(summary_table1)
PredictedPelec Pelectric
Min. :0.1816 Min. :0.0500
1st Qu.:0.3949 1st Qu.:0.2550
Median :0.5270 Median :0.3900
Mean :0.5534 Mean :0.5634
3rd Qu.:0.6625 3rd Qu.:0.6150
Max. :1.0508 Max. :2.5000
We can also make a summary table comparing the predicted percentage and the actual percentage. Similar to what we could see on the graph, the model isn’t particularly accurate, especially with the minimum and maximum values, but the the quartiles and mean show some promising accuracy.
Another variable that would strike as potentially significant towards the EV percentage is the price of gas, especially if it has increased over time.
avg_G_value <- evdata %>%
group_by(year) %>%
summarize(avg_value = mean(Gprice))
ggplot(avg_G_value, aes(year, avg_value)) +
geom_col(fill = "orange") +
labs(x = "Year", y = "Nationwide Average Gas Price",
title = "Nationwide Average Gas Price by Year") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_x_continuous(breaks = seq(2018, 2023, by = 1))
Warning: Removed 2 rows containing missing values or values outside the
scale range (`geom_col()`).
Nationwide, gas prices did seem to increase from 2016 from 2023. We can run another linear regression model to see if the price of gas has actually been a factor in EV registrations.
set.seed(123)
trainindex2 <- createDataPartition(evdata$Nregistration, p = 0.8, list = FALSE, times = 1)
evdataTrain2 <- evdataclean[ trainindex,]
evdataTest2 <- evdataclean[-trainindex,]
model2 <- lm(Nregistration ~ Gprice, data = evdataTrain)
summary(model)
Call:
lm(formula = Nregistration ~ Gprice, data = evdataTrain)
Residuals:
Min 1Q Median 3Q Max
-329471 -39989 -892 26358 845726
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -556166 79698 -6.978 2.16e-10 ***
Gprice 204405 26457 7.726 4.92e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 124900 on 113 degrees of freedom
(213 observations deleted due to missingness)
Multiple R-squared: 0.3456, Adjusted R-squared: 0.3399
F-statistic: 59.69 on 1 and 113 DF, p-value: 4.918e-12
This time, the R-squared value hovers around 0.34, which is about twice the value when using the percentage of bachelor’s degrees as a factor. This suggests that the gas price is a more significant factor in EV registrations.
Once again, we can turn this into a graph where the red line would be representative of a perfect model.
evdataTest2 <- evdataTest2 %>%
mutate(PredictedNreg = predict(model, newdata = evdataTest2))
ggplot(evdataTest2, aes(Nregistration, PredictedNreg)) +
geom_point() +
geom_abline(slope = 1, intercept = 0, color = "red") +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
labs(x = "Annual EV Registrations", y = "Predicted Annual EV Registrations",
title = "Predicted vs Actual Annual EV Registrations in States") +
scale_y_continuous(labels = scales::label_number())
While there is still some variance in this model, it does seem to be much more centered around the goal than the model that uses the percentage of bachelor’s degrees.
summary_table2 <- evdataTest2 %>%
select(Nregistration, PredictedNreg) %>%
summary()
print(summary_table2)
Nregistration PredictedNreg
Min. : 500 Min. :-57827
1st Qu.: 5575 1st Qu.: 9320
Median : 18100 Median : 38755
Mean : 33724 Mean : 53869
3rd Qu.: 60650 3rd Qu.: 63692
Max. :152100 Max. :308672
The summary table was also innacurate with minimum and maximum values, but showed some accuracy with quartiles.
Moving toward what EV data can be used to predict, a common example is a political party.
ggplot(evdataclean, aes(party, Pelectric, fill = party)) +
geom_boxplot(width = 0.1) +
geom_violin(alpha = 0.3) +
labs(x = "State Political Party", y = "Percentage of EVs",
title = "Percentage of EVs in States by Political Party") +
theme(plot.title = element_text(hjust = 0.5), aspect.ratio = 0.75, legend.position = "none") +
scale_x_discrete(expand = expansion(mult = c(0.5, 0.5))) +
scale_fill_manual(values = c("Democratic" = "#4255ff", "Republican" = "#ff413b"))
It seems that the majority of Republican states are concentrated at less than 0.5% of EVs, while Democratic states seem to have a higher percentage and are more spread out. Looking at this, we can run a logistic regression model to determine if there is a threshold that can accurately predict a state’s political party.
set.seed(123)
trainIndex <- createDataPartition(evdataclean$party, p = 0.7, list = FALSE)
traindata <- evdataclean[trainIndex, ]
testdata <- evdataclean[-trainIndex, ]
logregmodel <- glm(party ~ Pelectric, data = traindata, family = binomial)
summary(logregmodel)
Call:
glm(formula = party ~ Pelectric, family = binomial, data = traindata)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.7104 0.4265 4.01 6.07e-05 ***
Pelectric -3.4406 0.8115 -4.24 2.24e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 145.48 on 104 degrees of freedom
Residual deviance: 111.70 on 103 degrees of freedom
AIC: 115.7
Number of Fisher Scoring iterations: 5
tidy(logregmodel, exponentiate = TRUE, conf.int = TRUE)
The p-value is far under 0.05, so it is unlikely that the percentage of electric cars in a state is unrelated to that state’s political party.
To check this, we can create an ROC Curve with the model.
testdata$predictedprob <- predict(logregmodel, newdata = testdata, type = "response")
rocurve <- roc(testdata$party, testdata$predictedprob)
Setting levels: control = Democratic, case = Republican
Setting direction: controls < cases
plot(rocurve, col = "blue", main = "ROC Curve")
We can find the point on this curve that demonstrates the most effective threshold and use that threshold in a confusion matrix.
bestcoord <- coords(rocurve, "best", ret = c("threshold", "sensitivity", "specificity", "accuracy"),
best.method = "closest.topleft")
print(bestcoord)
testdata$predictedparty <- ifelse(testdata$predictedprob < 0.5574024, "Democratic", "Republican")
testdata$predictedparty <- factor(testdata$predictedparty, levels = c("Democratic", "Republican"))
confusionMatrix(testdata$predictedparty, testdata$party)
Confusion Matrix and Statistics
Reference
Prediction Democratic Republican
Democratic 17 5
Republican 6 16
Accuracy : 0.75
95% CI : (0.5966, 0.8681)
No Information Rate : 0.5227
P-Value [Acc > NIR] : 0.001705
Kappa : 0.5
Mcnemar's Test P-Value : 1.000000
Sensitivity : 0.7391
Specificity : 0.7619
Pos Pred Value : 0.7727
Neg Pred Value : 0.7273
Prevalence : 0.5227
Detection Rate : 0.3864
Detection Prevalence : 0.5000
Balanced Accuracy : 0.7505
'Positive' Class : Democratic