The main question at hand is: how can we analyze the drivers of climate change, and which remedies might have the largest impact on reducing global CO2 emissions? In particular, how can we target specific industries based on their contribution to CO2 emissions? We dove into the Energy sector and the transition to sustainable energy sources, particularly the role that electrification and electric vehicles may play in reducing these emissions.
A note up front: our work plan does not necessarily follow the flow of our presentation. We wanted the report to read clearly and be easy to understand, so the order of the steps below differs from the order of the presentation.
Our work plan was as follows:
1. Data Acquisition and Preparation:
- Collect diverse data sources including EIA reports, CO2 emissions data, and EV sales figures.
- Preprocess textual data using NLP techniques to extract key insights and topics for analysis.
2. Analysis of CO2 Emissions Trends:
- Analyze historical CO2 emissions trends in both developed and emerging economies.
- Visualize data to identify patterns and variations over time, correlating with socio-economic factors.
3. Evaluation of EV Impact on Emissions Reduction:
- Assess the impact of EV sales trends on CO2 emissions reduction: is there any visible trend?
4. Clustering and Categorization of Electric Vehicles:
- Employ clustering (k-means), visualized with PCA, to categorize EVs based on their attributes.
- Evaluate the characteristics of each EV cluster to understand performance metrics.
5. Synthesis and Interpretation of Findings (we analyzed our findings at each step of the report):
- Synthesize insights from the CO2 emissions analysis, the EV impact assessment, and the EV clustering.
6. Documentation and Reporting:
- Document methodology, data sources, and findings comprehensively.
- Present findings through visualizations and presentations for transparency and knowledge dissemination.
Climate change is a heavily debated topic in modern society. Here we attempt to look at the role specific industries play in contributing to climate change. This will allow us to select the key contributing industries and tailor solutions specifically to them, in hopes of slowing the rate at which the Earth is warming.
We started by web scraping data from “climatedata.imf.org” in order to find information on greenhouse gas emissions by country and industry in million metric tons of CO2 from 1970 to 2022. After doing some cleaning on the data we were left with the following:
Code
library(tidyverse)
library(readr)

# Retrieving data on emissions by country
url2 <- "https://opendata.arcgis.com/datasets/72e94bc71f4441d29710a9bea4d35f1d_0.csv"
destfile <- "National_GreenhouseGas_Emissions_Country"
curl::curl_download(url2, destfile)

# Cleaning and converting the dataset to a long dataframe
National_GreenhouseGas_Emissions_Country1 <- read_csv(destfile) %>%
  dplyr::select(-ObjectId, -ISO2, -ISO3, -Source, -CTS_Name, -CTS_Full_Descriptor,
                -Indicator, -Scale, -CTS_Code, -F2023, -F2024, -F2025, -F2026,
                -F2027, -F2028, -F2029, -F2030, -Unit) %>%
  dplyr::rename("Gas Type" = Gas_Type)

colnames(National_GreenhouseGas_Emissions_Country1) <- gsub("F", "", colnames(National_GreenhouseGas_Emissions_Country1))
National_GreenhouseGas_Emissions_Country1$Country <- gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", National_GreenhouseGas_Emissions_Country1$Country)
National_GreenhouseGas_Emissions_Country1$Industry <- gsub("^\\S*\\s", "", National_GreenhouseGas_Emissions_Country1$Industry)

Cleaned_Emissions <- National_GreenhouseGas_Emissions_Country1 %>%
  tidyr::pivot_longer(cols = -c('Country', 'Gas Type', 'Industry'),
                      names_to = "Year",
                      values_to = "Emissions",
                      values_drop_na = FALSE) %>%
  dplyr::mutate(Combined_Industry = case_when(
    Industry %in% c("Energy", "Energy Industries") ~ "Energy",
    Industry %in% c("Transport", "Road Transportation", "Domestic Navigation",
                    "Other Transportation", "Domestic Aviation", "Railways",
                    "CO2 Transport and Storage", "Fuel Combustion Activities") ~ "Transportation",
    Industry %in% c("Manufacturing Industries and Construction",
                    "Other Product Manufacture and Use") ~ "Manufacturing Industry",
    Industry %in% c("Industrial Processes and Product Use",
                    "Other Industrial Processes") ~ "Industrial Processes",
    Industry %in% c("Other", "Other (Not specified elsewhere)",
                    "Non-energy Products from Fuels and Solvent Use",
                    "Fugitive Emissions from Fuels",
                    "Product Uses as Substitutes for ODS", "Applicable") ~ "Other",
    Industry %in% "Buildings and other Sectors" ~ "Buildings and other Sectors",
    Industry %in% "Chemical Industry" ~ "Chemical Industry",
    Industry %in% "Land-use, land-use change and forestry" ~ "Land-use, land-use change and forestry",
    Industry %in% "Mineral Industry" ~ "Mineral Industry",
    Industry %in% "Metal Industry" ~ "Metal Industry",
    Industry %in% "Electronics Industry" ~ "Electronics Industry",
    Industry %in% "Agriculture" ~ "Agriculture",
    Industry %in% "Waste" ~ "Waste")) %>%
  dplyr::group_by(Country, Combined_Industry, Year) %>%
  dplyr::summarize(Total_Emissions = sum(Emissions, na.rm = TRUE), .groups = "keep") %>%
  dplyr::rename("Industry" = Combined_Industry) %>%
  dplyr::select(Year, everything()) %>%
  dplyr::filter(Total_Emissions != 0) %>%
  dplyr::arrange(Year)

library(ggplot2)
library(plotly)

Cleaned_Emission_graph2 <- Cleaned_Emissions %>%
  dplyr::filter(Country %in% c("United States", "China", "India", "Australia and New Zealand",
                               "Canada", "United Kingdom", "France", "Mexico")) %>%
  dplyr::filter(Industry != "Land-use, land-use change and forestry") %>%
  ggplot(aes(x = Country, y = Total_Emissions, fill = Industry)) +
  geom_bar(stat = "identity") +
  labs(title = "Total Emissions by Country and Industry") +
  ylab("Million Metric Tons of CO2") +
  scale_y_continuous(labels = scales::number_format()) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Cleaned_Emission_graph2
The resulting dataset contained emissions for 235 different countries, broken down by industry and by the type of gas producing the emissions, from 1970 to 2022. We aggregated the different gas types into one metric, "Total Emissions", to simplify the analysis, since our goal is to target specific industries. For visual purposes we selected a few developed countries and noticed that China, India, and the United States have significantly higher emissions, likely due to the size of their economies and populations.
We then proceeded to scrape more data from “climatedata.imf.org” where we found data on surface temperature changes from 1970 to 2022 by country. This allowed us to then compare these changes in surface temperature to the emissions data above.
Code
# Retrieving data on temperature change by country
library(rvest)
library(RSelenium)

url <- "https://climatedata.imf.org/datasets/4063314923d74187be9596f10d034914_0/explore"
rD <- rsDriver(browser = 'firefox', chromever = NULL, verbose = FALSE)
remDr <- rD[["client"]]
Sys.sleep(2)
remDr$navigate(url)
Sys.sleep(8)

# Scroll the infinite-scroll table so that all rows are loaded
for (i in 1:10){
  remDr$executeScript("document.querySelector('.infinite-scroll-container').scrollTop += 5000;")
  Sys.sleep(2)
}
html <- remDr$getPageSource()
remDr$close()

# Cleaning the data and converting it into a long data frame
surface_temp_data <- rvest::read_html(html[[1]]) %>%
  rvest::html_elements(css = "table") %>%
  rvest::html_table() %>%
  as.data.frame() %>%
  dplyr::select(-CTS.Full.Descriptor, -CTS.Code, -Source, -Indicator, -ISO2,
                -CTS.Name, -ISO3, -X1961, -X1962, -X1963, -X1964, -X1965,
                -X1966, -X1967, -X1968, -X1969, -Unit)

colnames(surface_temp_data) <- gsub("X", "", colnames(surface_temp_data))

surface_temp_data <- surface_temp_data %>%
  tidyr::pivot_longer(cols = -c("Country"),
                      names_to = "Year",
                      values_to = "Temp Change",
                      values_drop_na = FALSE) %>%
  dplyr::select(Year, everything())

surface_temp_data$Country <- gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", surface_temp_data$Country)

surface_temp_data_graph <- surface_temp_data %>%
  dplyr::filter(Country %in% c("United States", "India", "Australia and New Zealand",
                               "Canada", "United Kingdom", "France", "Mexico")) %>%
  ggplot(aes(x = Year, y = `Temp Change`, color = Country)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(title = "Temperature Change by Country 1970 - 2022",
       x = "Year",
       y = "Temperature Change (Degrees C)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

surface_temp_data_graph
After joining our two data sets we grouped by country, industry, and year, and transformed the data from a long data frame into a wide one with one column of emissions per industry. We noticed a significant upward trend in temperature change from 1970 to 2022 and decided to investigate this further.
Code
# Joining the two data sets together
cleaned_data <- Cleaned_Emissions %>%
  tidyr::pivot_wider(names_from = Industry, values_from = Total_Emissions) %>%
  dplyr::right_join(surface_temp_data, by = c("Year", "Country")) %>%
  tidyr::drop_na() %>%
  distinct(Country, Year, .keep_all = TRUE)

cleaned_data$Year <- as.Date(paste0(cleaned_data$Year, "-01-01"))

cleaned_data_graph <- Cleaned_Emissions %>%
  dplyr::right_join(surface_temp_data, by = c("Year", "Country")) %>%
  dplyr::filter(Country %in% c("United States", "India", "Australia and New Zealand",
                               "Canada", "United Kingdom", "France", "Mexico")) %>%
  ggplot(aes(x = Year, y = `Temp Change`, color = Country, size = Total_Emissions)) +
  geom_point() +
  labs(title = "Temp Change by Country, Industry, and Emission") +
  ylab("Temperature Change (Degrees C)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

cleaned_data_graph
Above is another visual representation that allowed us to easily see this trend, with the emissions generated by each country encoded in the size of each data point. To quantify the relationship, we ran a linear regression to see how well our emissions data explain the change in temperature. Before fitting, we split the data into training and test sets and normalized the predictors, as shown below.
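The following preprocessing step, taken from the report's source code, produces the train_baked and test_baked data used by the models in the rest of this section.
Code
# Feature engineering for the regression models
library(rsample)
cleaned_data_split <- cleaned_data %>%
  rsample::initial_split(prop = 0.8, strata = Country)
training_data <- training(cleaned_data_split)
testing_data  <- testing(cleaned_data_split)

library(recipes)
# Drop identifiers and normalize all numeric predictors on the training set
recipe_pipeline_train <- recipes::recipe(`Temp Change` ~ ., data = training_data) %>%
  recipes::step_rm(Year) %>%
  recipes::step_rm(Country) %>%
  recipes::step_normalize(all_numeric()) %>%
  recipes::prep()
train_baked <- recipes::bake(recipe_pipeline_train, training_data)

# Same preprocessing applied to the test set
recipe_pipeline_test <- recipes::recipe(`Temp Change` ~ ., data = testing_data) %>%
  recipes::step_rm(Year) %>%
  recipes::step_rm(Country) %>%
  recipes::step_normalize(all_numeric()) %>%
  recipes::prep()
test_baked <- recipes::bake(recipe_pipeline_test, testing_data)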
Code
# Linear model
library(parsnip)
model_lm <- parsnip::linear_reg(mode = "regression") %>%
  parsnip::set_engine("lm") %>%
  parsnip::fit(`Temp Change` ~ ., data = train_baked)
summary(model_lm$fit)
Call:
stats::lm(formula = `Temp Change` ~ ., data = data)
Residuals:
Min 1Q Median 3Q Max
-2.7177 -0.6852 -0.0301 0.5788 3.8480
Coefficients:
                                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                               6.070e-16  3.520e-02   0.000 1.000000    
Agriculture                              -1.655e+00  1.267e+00  -1.306 0.191818    
`Buildings and other Sectors`            -4.768e-01  4.666e-01  -1.022 0.307162    
`Chemical Industry`                      -4.932e-01  4.947e-01  -0.997 0.319050    
`Electronics Industry`                    1.839e-01  1.398e-01   1.315 0.188884    
Energy                                   -1.141e+01  3.718e+00  -3.069 0.002232 ** 
`Industrial Processes`                    1.887e+00  1.186e+00   1.591 0.112092    
`Manufacturing Industry`                 -2.592e+00  7.294e-01  -3.553 0.000405 ***
`Metal Industry`                          3.875e-01  2.695e-01   1.438 0.150964    
`Mineral Industry`                       -3.496e-01  9.565e-01  -0.366 0.714830    
Other                                     1.794e+01  6.497e+00   2.761 0.005910 ** 
Transportation                           -4.602e-01  1.641e+00  -0.281 0.779160    
Waste                                    -2.970e+00  8.073e-01  -3.679 0.000252 ***
`Land-use, land-use change and forestry` -3.479e-01  1.568e-01  -2.219 0.026834 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.9466 on 709 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1039
F-statistic: 7.442 on 13 and 709 DF, p-value: 7.585e-14
We noticed that certain industries, namely Energy, the Manufacturing Industry, Waste, and the "Other" category, appear statistically significant in our model. However, the adjusted R-squared is quite low at roughly 10%. This signifies that there are other factors contributing to changes in surface temperature that we have not accounted for in our model, and further research into those factors is needed. We did, however, want to push the data we already collected further, so we applied the feature engineering above and used the fitted model to predict how surface temperatures may respond to changes in emissions from these industries; the prediction step and its metrics are shown below.
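The prediction code below is taken from the report's source and produces the metrics table that follows.
Code
# Running predictions of the linear model on the held-out test set
library(yardstick)
predictions_lm <- model_lm %>%
  stats::predict(new_data = test_baked) %>%
  dplyr::bind_cols(`Temp Change` = test_baked %>% ungroup() %>% dplyr::select("Temp Change")) %>%
  yardstick::metrics(truth = `Temp Change`, estimate = .pred) %>%
  dplyr::arrange(.metric)
predictions_lm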
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 0.750
2 rmse standard 0.967
3 rsq standard 0.0660
After analyzing these predictions we found that the model has little ability to form accurate predictions of surface temperature: the MAE and RMSE are relatively high and the test-set R-squared is only about 0.07.
We then performed a Bayesian regression to see whether it would fit our data better and give a better representation of how these industries affect changes in surface temperature, and whether predictions from this model would yield better results. The model specification is shown below.
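The Bayesian model below is reproduced from the report's source; it fits the same formula with rstanarm through parsnip and then evaluates test-set predictions, producing the metrics table that follows.
Code
# Bayesian regression with rstanarm via parsnip
library(rstanarm)
options(mc.cores = parallel::detectCores())
model_bayes <- parsnip::linear_reg(mode = "regression") %>%
  parsnip::set_engine("stan",
                      prior_intercept = rstanarm::normal(),
                      prior = rstanarm::student_t(df = 1),
                      iter = 10000,
                      seed = 123) %>%
  parsnip::fit(`Temp Change` ~ ., data = train_baked)

# Coefficient estimates with intervals
library(broom.mixed)
broom.mixed::tidy(model_bayes, conf.int = TRUE, conf.level = 0.2) %>%
  dplyr::mutate(dplyr::across(where(is.numeric), ~round(.x, 5)))

# Test-set predictions and metrics
out_bayes <- model_bayes %>%
  stats::predict(new_data = test_baked) %>%
  dplyr::bind_cols(`Temp Change` = test_baked %>% ungroup() %>% dplyr::select(`Temp Change`)) %>%
  yardstick::metrics(truth = `Temp Change`, estimate = .pred) %>%
  dplyr::arrange(.metric)
out_bayes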
# A tibble: 3 × 3
.metric .estimator .estimate
<chr> <chr> <dbl>
1 mae standard 0.736
2 rmse standard 0.964
3 rsq standard 0.0673
This gave a very similar output to our linear regression, suggesting that our data does indeed fit a linear model best. We did notice that the intervals around the estimates are generally wide, given that the range of temperature changes in the original data is quite small.
Plotting these estimates along with their confidence interval in a whisker plot gave us some more insight into which industries have the largest confidence intervals. We noticed that the Industrial Processes, Energy, Transportation, and Waste industries had the widest confidence intervals.
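The whisker plot is produced by the following code, taken from the report's source.
Code
# Whisker plot of the Bayesian coefficient estimates and their intervals
broom.mixed::tidy(model_bayes, conf.int = TRUE) %>%
  ggplot2::ggplot(aes(x = term)) +
  geom_point(aes(y = estimate)) +
  geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 1.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))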
We then went back to the original data to investigate these 4 industries further. After graphing them on a scatter plot (the code is shown below) we noticed that the industries with the largest emissions were in fact Transportation and Energy. Without identifying further factors that may be connected to changes in surface temperature, we relied on the empirical evidence that emissions do have an impact on surface temperature changes. From this we dived deeper into both the Energy and Transportation industries and looked at potential remedies that may reduce emissions from both sectors, namely the transition towards electrification and electric vehicles.
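Both chunks below are taken from the report's source. The first produces the emissions-by-industry scatter plot referenced above.
Code
# Scatter plot of total emissions by industry for the largest emitters
Cleaned_Emissions_graph <- Cleaned_Emissions %>%
  dplyr::filter(Country %in% c("China", "United States", "India", "World")) %>%
  dplyr::filter(Industry != "Other") %>%
  ungroup()
Cleaned_Emissions_graph$Year <- as.Date(paste0(Cleaned_Emissions_graph$Year, "-01-01"))

Emissions_graph <- Cleaned_Emissions_graph %>%
  plotly::plot_ly(x = ~Industry, y = ~Total_Emissions, type = "scatter")
Emissions_graph

The second chunk is the link-collection step that builds href_elements_html, the list of EIA analysis pages consumed by the scraping function below.
Code
# Collecting links to EIA "Total Energy" analysis pages
library(tm)
library(SnowballC)
library(topicmodels)
library(tidytext)
library(stringr)

url <- "https://www.eia.gov/totalenergy/reports.php"
remDr$open()
remDr$navigate(url)
Sys.sleep(2)
html_content_data <- remDr$getPageSource()[[1]]
reading_html <- read_html(html_content_data)

href_elements_html <- reading_html %>%
  html_nodes("a.ico.html") %>%
  html_attr("href")
href_elements_html <- paste0("https://www.eia.gov/", href_elements_html)
href_elements_html <- href_elements_html[-c(1, 2, 3, 54:59)]   # index selection as in the source
href_elements_html <- href_elements_html[!grepl("pdf", href_elements_html)]
remDr$close()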
Code
remDr$open()

# Function to extract paragraphs from a webpage
extract_paragraphs <- function(url) {
  remDr$navigate(url)
  page_source <- remDr$getPageSource()[[1]]
  page <- read_html(page_source)
  paragraphs <- html_nodes(page, "p")
  html_text(paragraphs)
}

# List of links
links <- href_elements_html

# Extract paragraphs from each link
paragraphs_list <- lapply(links, extract_paragraphs)
remDr$close()

# Basic text normalization before the cleaning phase
paragraphs_list <- gsub("(united states|United states|United States)", "US", paragraphs_list, ignore.case = TRUE)
paragraphs_list <- text_cleaned <- gsub("(energy consumption)", "EC", paragraphs_list, ignore.case = TRUE)
paragraphs_list <- text_cleaned <- gsub("\\\\r\\\\n\\\\t", " ", paragraphs_list, ignore.case = TRUE)
paragraphs_list <- gsub("(\\\\| tttttttt | \n\ | \ |)", "", paragraphs_list)
pattern <- "(t{3,})"
paragraphs_list <- gsub(pattern, "", paragraphs_list)
paragraphs_list_for_cleaning <- tolower(paragraphs_list)
Code
# Cleaning phase: remove stop words, numbers, punctuation, and extra whitespace
corpus <- VCorpus(VectorSource(paragraphs_list_for_cleaning))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords())
corpus <- tm_map(corpus, stemDocument)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
cleaned_text <- sapply(corpus, as.character)
Sys.sleep(2)

# The two blocks of site-navigation text removed below were repeated on every page.
# Because word frequency is the bedrock of the NLP model, leaving them in would distort
# the topics, so we strip them out. Note that all words are stemmed to their root form
# (e.g. included / including / include all become "includ"), which gives a fair
# representation of where the words fall.
cleaned_text <- gsub("cmenu crude oil gasolin heat oil diesel propan liquid includ biofuel natur gas liquid explor reserv storag import export product price sale sale revenu price power plant fuel use stock generat trade demand emissionsn energi use home commerci build manufactur transport reserv product price employ product distribut stock import export includ hydropow solar wind geotherm biomass ethanol uranium fuel nuclear reactor generat spent fuel comprehens data summari comparison analysi project integr across energi sourc month year energi forecast analysi energi topic financi analysi congression reportsn", " ", cleaned_text)
Sys.sleep(2)
cleaned_text <- gsub("financi market analysi financi data major energi compani greenhous gas data voluntari report electr power plant emiss map tool resourc relat energi disrupt infrastructur state energi inform includ overview rank data analys map energi sourc topic includ forecast map intern energi inform includ overview rank data analys region energi inform includ dashboard map data analys tool custom search view specif data set studi detail document access timeseri data free open data avail api excel addin bulk file widget come test product still develop let us know think form use collect energi data includ descript link survey instruct addit inform sign email subcript receiv messag specif product subscrib feed updat product includ today energi what new short time articl graphic energi fact issu trend lesson plan scienc fair experi field trip teacher guid career corner report request congress otherwis deem import", " ", cleaned_text)
Sys.sleep(2)

# Drop very long tokens (16+ characters), which are mostly concatenation artifacts
pattern <- "\\b\\w{16,}\\b"
cleaned_text <- gsub(pattern, "", cleaned_text)
Sys.sleep(2)
cleaned_text[-c(2, 6, 8, 10, 41, 42, 43, 44, 45, 46, 48)]
str_count(cleaned_text)

# Remove repetitive abbreviations ("eia", "ieo"), which are meaningless for the analysis
cleaned_text <- gsub("eia | ieo", "", cleaned_text)
Code
# Build the document-term matrix and fit the LDA topic model
dtm <- DocumentTermMatrix(Corpus(VectorSource(cleaned_text)))
dtm <- removeSparseTerms(dtm, 0.999)
datasett <- as.data.frame(as.matrix(dtm))
filtered_matrix <- datasett[rowSums(datasett != 0) > 0, ]

# Running our topic model
ap_topic_model <- topicmodels::LDA(filtered_matrix, k = 18, control = list(seed = 321))
Sys.sleep(2)

AP_topics <- tidytext::tidy(ap_topic_model, matrix = "beta")
ap_top_terms <- AP_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  ungroup() %>%
  arrange(topic, -beta)
Sys.sleep(2)

first_plot <- ap_top_terms %>%
  mutate(term = reorder(term, beta)) %>%
  mutate(topic = paste("Topic #", topic)) %>%
  ggplot(aes(term, beta, fill = factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, scales = "free") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, size = 18),
        axis.text.y = element_text(size = 5),
        axis.text.x = element_text(size = 5)) +
  labs(title = "Most relevant terms grouped together",
       caption = "Top terms by topic (betas)") +
  ylab("") +
  xlab("") +
  coord_flip()

# AEO - the full form is Annual Energy Outlook
Energy is the cornerstone of modern civilization, driving economic prosperity, technological innovation, and societal well-being. Despite its paramount importance, energy remains a sensitive topic often overshadowed by geopolitical tensions, environmental concerns, and social inequalities. Historical incidents, such as the oil crises of the 1970s and nuclear disasters like Chernobyl and Fukushima, underscore the complexities and risks associated with energy production and consumption. As we confront the challenges of climate change and strive for a sustainable future, it is imperative to recognize the centrality of energy in shaping our world and engage in open, informed dialogue to navigate the complexities and opportunities it presents.
With that in mind, we web scraped the EIA website to understand what its analyses say about the energy sector in the USA. These analyses are generally long, and a single analysis covers many different areas, so it is difficult to get a general idea of what each one is about. We therefore used natural language processing (NLP) to reduce the text to root words and group them into topic brackets using topic modelling; after cleaning and editing the text we generated the chart below. All words are in their root (stemmed) form, meaning that for the purposes of the NLP analysis words such as included/including/include all become "includ". This lets us fairly compare and categorize words into related topics.
Code
first_plot
In the chart above we used a Latent Dirichlet Allocation (LDA) model to find a better fit for each word across all the articles we web scraped. It is a mixed-membership model in which a single word can be shared by one or many topics. We chose the number of topics and obtained a list of words associated with one or more topics. (For context, AEO stands for Annual Energy Outlook.)
How did we do it?
We cleaned the data and converted it into a document-term matrix, which is a matrix representation of a corpus (collection of texts): the rows are the paragraphs/articles and the columns are the individual words/phrases. In the graph, the horizontal axis is the beta value: the probability of each word belonging to each topic.
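As a minimal illustration of the document-term matrix idea (toy sentences, not from our corpus), consider the following sketch:
Code
# Toy example: two tiny "documents" and their document-term matrix
library(tm)
toy_docs <- c("electric cars reduce emissions",
              "energy demand and electric power")
toy_dtm <- DocumentTermMatrix(Corpus(VectorSource(toy_docs)))
inspect(toy_dtm)  # rows = documents, columns = terms, cells = term counts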
We see that Topic 3 relates to the electric sector and fuel, so let's dive deeper and explore that specific topic.
From the chart, the topic modeling analysis reveals several key insights into the energy sector of the USA. Contrary to the usual mass-media framing, the analyses say something rather different: the focus is more on policy change and on the increase in demand for energy in the form of electricity.
Overall, the LDA model’s findings highlight the complexity and multifaceted nature of the energy sector in the USA, encompassing production, economic implications, resource management, and policy impact.
Speaking of energy, CO2 emissions are the main concern driving the shift towards sustainable energy, whether framed as a political agenda or otherwise. From here onwards we will measure the progress of countries that advocate for decarbonization.
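The maps below start from a fresh, wide-format copy of the emissions data; this step, from the report's source code, reloads the table used by the code that follows.
Code
# Reloading the emissions dataset, keeping the raw yearly columns in wide format
National_GreenhouseGas_Emissions_Country <- read_csv(destfile) %>%
  dplyr::select(-ObjectId, -ISO2, -ISO3, -Source, -CTS_Name, -CTS_Full_Descriptor,
                -Indicator, -Scale, -CTS_Code, -F2023, -F2024, -F2025, -F2026,
                -F2027, -F2028, -F2029, -F2030)
colnames(National_GreenhouseGas_Emissions_Country) <- gsub("F", "", colnames(National_GreenhouseGas_Emissions_Country))
National_GreenhouseGas_Emissions_Country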
Code
# We filtered out China and India because their emissions are so large they would distort the scale.
National_GreenhouseGas_Emissions_Country <- National_GreenhouseGas_Emissions_Country %>%
  mutate(Country = gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", Country)) %>%
  as.data.frame() %>%
  filter(!Country %in% c("China", "India")) %>%
  select(-c(2, 3, 4))
Code
# Aggregate 2020 emissions by country
filteredd <- National_GreenhouseGas_Emissions_Country %>%
  select("Country", "2020")
filteredd <- filteredd %>%
  rename(Emission = "2020")
filteredd <- aggregate(Emission ~ Country, data = filteredd, FUN = sum) %>%
  rename(region = Country)

# Get world map data
world_map <- map_data("world")
merged_data <- left_join(world_map, filteredd, by = "region")

# Plot the map with emissions
map2020 <- merged_data %>%
  ggplot(aes(x = long, y = lat, group = group, text = region)) +
  geom_polygon(aes(fill = Emission), color = "Black") +
  scale_fill_gradient(name = "Emission 2020", low = "yellow", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())
map2020 <- ggplotly(map2020)
Code
# Aggregate 2008 emissions by country
filteredd1 <- National_GreenhouseGas_Emissions_Country %>%
  select("Country", "2008")
filteredd1 <- filteredd1 %>%
  rename(Emission = "2008")
filteredd1 <- aggregate(Emission ~ Country, data = filteredd1, FUN = sum) %>%
  rename(region = Country)

# Get world map data
world_map1 <- map_data("world")
merged_data1 <- left_join(world_map, filteredd1, by = "region")

# Plot the map with emissions
map2008 <- merged_data1 %>%
  ggplot(aes(x = long, y = lat, group = group, text = region)) +
  geom_polygon(aes(fill = Emission), color = "Black") +
  scale_fill_gradient(name = "Emission", low = "yellow", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())
map2008 <- ggplotly(map2008)
Code
subplot(map2020, map2008, nrows = 2) %>%
  layout(title = "World emission in Million metric tons of CO2 equivalent (2020 on top, 2008 on bottom)")
We purposefully excluded India and China: their emissions are so high that other countries would not even seem close, which would greatly skew our scale. Interestingly, the emissions of the USA appear to be reported under 'Advanced economies' rather than as a separate country, so we assumed its emissions are aggregated with other economies (perhaps with Congo or central Africa, which we note with some irony, since the USA tends to shift the blame for its mess elsewhere) and did not include it.
We can see that most developed economies' emissions decreased, whereas emerging economies' emissions have increased over time. Now let's look at a different aspect of CO2 emissions. Before moving on, please play around with the graph below; it looks like it carries less information, but we deliberately simplified it for a clearer result. The code that assembles the underlying data follows.
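The data behind the plot, final_dataa, combines yearly emission totals for a set of EV-selling economies with global EV sales scraped from Wikipedia. The steps below are lightly adapted from the report's source code (the aggregation is condensed into data.frame() calls, but the logic is the same).
Code
# Total emissions for the EV-selling economies, 2017-2021
Emission2017_2021 <- National_GreenhouseGas_Emissions_Country %>%
  select(`Country`, `2017`, `2018`, `2019`, `2020`, `2021`) %>%
  filter(Country %in% c("China", "United States", "Germany", "France", "United Kingdom",
                        "Norway", "Netherlands", "Sweden", "Japan", "Canada"))
Emission2017_2021 <- Emission2017_2021[-1]
column_aggregates <- data.frame(
  Emission = sapply(Emission2017_2021, function(x) sum(x, na.rm = TRUE)),
  Year     = c("2017", "2018", "2019", "2020", "2021")
)

# Scraping EV sales by country (2017-2021) from Wikipedia
linkkwiki <- "https://en.wikipedia.org/wiki/Electric_car_use_by_country"
remDr$open()
remDr$navigate(linkkwiki)
Sys.sleep(1)
reading_html_wiki <- read_html(remDr$getPageSource()[[1]])
remDr$close()

table <- html_nodes(reading_html_wiki, "table.wikitable") %>%
  html_table(fill = TRUE)

EV_sales2 <- table[[3]] %>%
  as.data.frame() %>%
  select(1, 3, 5, 7, 9, 11)
EV_sales2 <- EV_sales2[-1, ]
EV_sales2 <- EV_sales2 %>%
  lapply(function(x) gsub("\\[.*?\\]", "", x)) %>%
  as.data.frame()
EV_sales2$Country <- gsub("\\(.*?\\)", "", EV_sales2$Country)
EV_sales2 <- EV_sales2[1:(nrow(EV_sales2) - 6), ]
EV_sales2 <- EV_sales2 %>%
  mutate(across(everything(), ~ gsub(",", "", .))) %>%
  mutate(across(2:6, as.numeric))
EV_sales2 <- EV_sales2[-1]

# Yearly global totals of EV sales, joined with the emission aggregates
EV_sales <- data.frame(
  evsales = sapply(EV_sales2, function(x) sum(x, na.rm = TRUE)),
  Year    = c("2021", "2020", "2019", "2018", "2017")
) %>%
  arrange(Year)

final_dataa <- left_join(EV_sales, column_aggregates, by = "Year")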
Code
plot <- final_dataa %>%
  ggplot(aes(x = evsales, y = Emission, col = factor(Year))) +
  geom_point() +
  labs(title = "Global EV sales and Emission", col = "Year")
plotly_plottt <- ggplotly(plot)
Code
plotly_plottt
For this part we took the emissions of the countries for which we gathered EV sales. This way we could see whether EV adoption over time has actually correlated with a decrease in their emissions.
EV sales in these economies move in the opposite direction to CO2 emissions, though not dramatically, and the only time emissions decreased significantly was during the COVID-19 lockdown in 2020. Given that we had only 5 data points, we did not formally compute a correlation; a visual conveys the message better (though a one-line check is sketched below).
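For completeness, a minimal check we did not include in the original analysis: the Pearson correlation between yearly EV sales and the aggregated emissions over the five years.
Code
# Quick sanity check on the 5-point relationship (not part of the original analysis)
cor(final_dataa$evsales, final_dataa$Emission)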
Therefore, from both graphs (the maps and the scatter of points) we can conclude that developed economies are indeed heading towards decarbonization; however, the progress is slow and gradual.
Discussions of the energy sector inevitably involve fossil fuels (such as oil and gas). When considering EVs, the crucial shift is the transition from fossil fuel-based energy sources to cleaner alternatives. EVs primarily rely on electricity, which can be generated from renewable sources like solar, wind, or hydroelectric power. This transition impacts both the energy sector and the EV industry.
In the universe of energy, electric cars aren’t just a new vehicle option; they’re a game-changer. By tapping into the power grid instead of the gas station, these cars are reshaping our approach to transportation and energy consumption. They’re not just driving us forward; they’re propelling us toward a more sustainable future.
Despite their growing popularity, there is still much we don't know about electric cars. While they hold promise for reducing emissions and dependence on fossil fuels, their widespread adoption raises a host of unanswered questions. How will they impact the electricity grid? What are the long-term environmental implications of their production and disposal? These are all worthwhile questions; the more important question, however, is how to understand these cars, because they are not the same as combustion engine cars; not even close.
Within the realm of electric cars, a significant challenge lies in the absence of direct comparisons to traditional combustion engine vehicles. Unlike their gasoline-powered counterparts, electric cars possess unique characteristics that defy conventional metrics of comparison. Factors such as range, charging time, and efficiency take on new dimensions in the context of electric propulsion, requiring novel methodologies for evaluation.
From the graph above we can see that electric vehicles are different beasts when it comes to understanding their nature. If you double-click the bubble next to a car model in the legend, the other vehicles are removed, and you can then select the models you want to compare. All of the EVs shown are models that are on the market or will reach the market in the coming years; so far we have ~345 models in this graph.
So why not categorize them based on their attributes? We used attributes such as usable battery capacity, 0-100 acceleration, top speed, range, efficiency, charging speed, price, and drive-train to cluster the models. This way we get a better understanding of which model falls where.
Code
Sys.sleep(1)

# Scale the numeric EV attributes before clustering
clustered <- car_results %>%
  replace(is.na(.), 0) %>%
  select(-c("Model", "Variant"))
clustered <- clustered %>% scale()

if (!require(factoextra)) {
  install.packages("factoextra")
  install.packages("gridExtra")
}
library(factoextra)
library(cluster)
library(gridExtra)

# Calculate the clustering indices
alphaa   <- fviz_nbclust(clustered, kmeans, method = "wss") + theme(plot.title = element_blank())
charliee <- fviz_nbclust(clustered, kmeans, method = "silhouette") + theme(plot.title = element_blank())
betaa    <- fviz_nbclust(clustered, kmeans, method = "gap_stat") + theme(plot.title = element_blank())

# Arrange the plots next to each other
suppressMessages({
  k <- grid.arrange(alphaa, charliee, betaa, ncol = 3)
})
Code
k
TableGrob (1 x 3) "arrange": 3 grobs
z cells name grob
1 1 (1-1,1-1) arrange gtable[layout]
2 2 (1-1,2-2) arrange gtable[layout]
3 3 (1-1,3-3) arrange gtable[layout]
Looking at the analysis, we can cluster our list of electric cars into 3 main categories.
Code
fviz_cluster(kmeans(clustered, centers = 3, iter.max = 150, nstart = 150), data = clustered)
We can see from the graph above that the first principal component is on the vertical axis and the second principal component is on the horizontal axis. Since the first PC explains more than half of the variance in our data, we divide the data primarily along it; the result is three clusters. (A quick check of the variance explained is sketched below.)
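As a sanity check (a small sketch we added, assuming the scaled attribute matrix clustered from the chunk above, not part of the original output), we can inspect how much variance the first two principal components explain:
Code
# Proportion of variance explained by the first two principal components
pca_res <- prcomp(clustered)
summary(pca_res)$importance["Proportion of Variance", 1:2]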
Both Range and Efficiency are vital considerations for electric car buyers, and they are relatively independent of each other. While Range focuses on the total distance a car can travel on a single charge, Efficiency provides insight into how efficiently the car utilizes its battery capacity to cover that distance. These two variables offer complementary information for evaluating the practicality, cost-effectiveness, and environmental footprint of electric vehicles.
Efficiency in electric cars is expressed as energy consumption per unit distance travelled; in our case it is measured in watt-hours per kilometre (Wh/km). For example, an efficiency figure of 286 Wh/km means the car consumes 286 watt-hours of energy to travel 1 km.
Lower Wh/km values (i.e., higher efficiency) mean the car can travel longer distances on the same battery capacity, which is desirable for maximizing range and reducing the operating costs of electric vehicles; a small worked example follows below. Efficiency is influenced by various factors such as vehicle weight, aerodynamics, tire rolling resistance, driving behavior, and environmental conditions.
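A minimal sketch with hypothetical numbers (not taken from our dataset) showing how range follows from battery capacity and consumption:
Code
# Estimated range = usable battery capacity / energy consumption
battery_wh  <- 77 * 1000                                                # hypothetical 77 kWh battery, in Wh
consumption <- c(efficient = 160, average = 200, inefficient = 286)     # Wh/km (illustrative values)
round(battery_wh / consumption)                                         # approximate range in km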
The range of an electric car refers to the distance it can travel on a single charge of its battery. It is a critical specification because it directly impacts usability and practicality for everyday driving and long-distance travel. For instance, a professor who lives in Calgary and needs to travel to Edmonton (300 km) would need an electric car whose range exceeds that distance. The best car for such a person would be the Lucid Air Grand Touring, priced at 132,000 EUR; the catch is the trade-off between range and efficiency, since in our data highly efficient cars tend to have lower range and vice versa. So the optimal models are those in cluster 1; a person who can afford more luxurious cars should go for cluster 2, and someone looking for affordability for cluster 3. This shows how clustering makes decisions easier by letting us choose based on the constraints we have.
Hence, our findings can be summarized in the following points, which answer the questions we asked at the beginning:
a. Electrification and the adoption of electric vehicles can significantly impact CO2 emissions. By shifting from fossil fuel-based energy sources to cleaner alternatives and promoting the use of electric vehicles, there is potential for substantial reductions in emissions.
b. Understanding consumer attitudes and behaviors towards electric vehicles is crucial for analyzing their uptake and identifying barriers to adoption. Factors such as vehicle cost, range anxiety, charging infrastructure availability (given charging speed and range), and perceived environmental benefits influence consumer decisions; these factors can inform strategies to promote electric vehicle adoption.
Electric cars may be a crucial part of reducing our emissions and slowing climate change in the near future. This will require serious thought about how to lay out the infrastructure so we can smoothly transition into a world with more electrification. Research into other renewable sources of energy may yet overtake electrification before we can create a sustainable, viable infrastructure to support the demand these vehicles require. Today, rolling blackouts are already becoming more common in places like California as consumers put too much load on the existing power grid. Car manufacturers like Toyota are putting substantial focus into researching hydrogen-powered vehicles as an alternative to EVs. Some speculators say the EV is simply a transition vehicle while we search for a more sustainable method of power generation, much like ATRAC was a short-lived transition towards CDs and eventually digital media.
Source Code
---title: "Climate Change and the Push Towards Electrificaion"author: "Aftikhar Mominzada and Carter Vereschagin"format: html: code-fold: true code-tools: trueeditor: visualself-contained: true---**The main question at hand is**: How can we analyze the inputs towards climate change and what possible remedies might have the largest impact on reducing our global CO2 emissions? Namely, how can we target specific industries, particularly focusing on the impact they will have on CO2 emissions. We dived into the Energy sector and the transition to sustainable energy sources, particularly the role electrification and electric vehicles may play in reducing these emissions.\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_Therefore we came up with a work plan: heads up that our work plan is not necessarily following the flow of our presentation. It is because we wanted to write a report that makes sense and is easy to understand; therefore the steps which we took in our work plan is not the same as the flow of our presentation.[We came up with the following work plan:]{.underline}1. Data Acquisition and Preparation:- Collect diverse data sources including EIA reports, CO2 emissions data, and EV sales figures.- Preprocess textual data using NLP techniques to extract key insights and topics for analysis.2. Analysis of CO2 Emissions Trends:- Analyze historical CO2 emissions trends in both developed and emerging economies.- Visualize data to identify patterns and variations over time, correlating with socio-economic factors.3. Evaluation of EV Impact on Emissions Reduction:- Assess the impact of EV sales trends on CO2 emissions reduction: would there be any trend?.4. Clustering and Categorization of Electric Vehicles:- Employ clustering algorithms like PCA to categorize EVs based on attributes.- Evaluate characteristics of each EV cluster to understand performance metrics.5. Synthesis and Interpretation of Findings (in each step in our report we analyzed our findings):- Synthesize insights from CO2 emissions analysis, EV impact assessment, and EV clustering.6. Documentation and Reporting:- Document methodology, data sources, and findings comprehensively.- Present findings through visualizations and presentations for transparency and knowledge dissemination.\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_\_Climate change is a heavily debated topic in our modern society. Here we will attempt to look at the role specific industries play in contributing to climate change. This will allow us to be able to select key industries that contribute towards climate change, giving us the opportunity to cater solutions specifically to these industries in hopes of slowing the rate at which we are warming the Earth.We started by web scraping data from "climatedata.imf.org" in order to find information on greenhouse gas emissions by country and industry in million metric tons of CO2 from 1970 to 2022. 
After doing some cleaning on the data we were left with the following:```{r, message = FALSE, warning = FALSE}library(tidyverse)library(readr)# Retrieving Data on Emissions by Countryurl2 = "https://opendata.arcgis.com/datasets/72e94bc71f4441d29710a9bea4d35f1d_0.csv"destfile = "National_GreenhouseGas_Emissions_Country"curl::curl_download(url2, destfile)# Cleaning and converting dataset to a long dataframeNational_GreenhouseGas_Emissions_Country1 <- read_csv(destfile) %>% dplyr::select(-ObjectId, -ISO2, -ISO3, -Source, -CTS_Name, -CTS_Full_Descriptor, -Indicator, -Scale, -CTS_Code, -F2023, -F2024, -F2025, -F2026, -F2027, -F2028, -F2029, -F2030, -Unit) %>% dplyr::rename("Gas Type" = Gas_Type)colnames(National_GreenhouseGas_Emissions_Country1) <- gsub("F","",colnames(National_GreenhouseGas_Emissions_Country1))National_GreenhouseGas_Emissions_Country1$Country <- gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", National_GreenhouseGas_Emissions_Country1$Country)National_GreenhouseGas_Emissions_Country1$Industry <- gsub("^\\S*\\s", "", National_GreenhouseGas_Emissions_Country1$Industry)Cleaned_Emissions <- National_GreenhouseGas_Emissions_Country1 %>% tidyr::pivot_longer(cols = -c('Country', 'Gas Type', 'Industry'), names_to = "Year", values_to = "Emissions", values_drop_na = FALSE) %>% dplyr::mutate(Combined_Industry = case_when( Industry %in% c("Energy", "Energy Industries") ~ "Energy", Industry %in% c("Transport", "Road Transportation", "Domestic Navigation", "Other Transportation", "Domestic Aviation", "Railways", "Other Transportation", "CO2 Transport and Storage", "Fuel Combustion Activities") ~ "Transportation", Industry %in% c("Manufacturing Industries and Construction", "Other Product Manufacture and Use") ~ "Manufacturing Industry", Industry %in% c("Industrial Processes and Product Use", "Other Industrial Processes") ~ "Industrial Processes", Industry %in% c("Other", "Other (Not specified elsewhere)", "Non-energy Products from Fuels and Solvent Use", "Fugitive Emissions from Fuels", "Product Uses as Substitutes for ODS", "Applicable") ~ "Other", Industry %in% "Buildings and other Sectors" ~ "Buildings and other Sectors", Industry %in% "Chemical Industry" ~ "Chemical Industry", Industry %in% "Land-use, land-use change and forestry" ~ "Land-use, land-use change and forestry", Industry %in% "Mineral Industry" ~ "Mineral Industry", Industry %in% "Metal Industry" ~ "Metal Industry", Industry %in% "Electronics Industry" ~ "Electronics Industry", Industry %in% "Agriculture" ~ "Agriculture", Industry %in% "Waste" ~ "Waste")) %>% dplyr::group_by(Country, Combined_Industry, Year) %>% dplyr::summarize(Total_Emissions = sum(Emissions, na.rm = TRUE), .groups = "keep") %>% dplyr::rename("Industry" = Combined_Industry) %>% dplyr::select(Year, everything()) %>% dplyr::filter(Total_Emissions != 0) %>% dplyr::arrange(Year)library(ggplot2)library(plotly)Cleaned_Emission_graph2 <- Cleaned_Emissions %>% dplyr::filter(Country %in% c("United States", "China", "India", "Australia and New Zealand", "Canada", "United Kingdom", "France", "Mexico")) %>% dplyr::filter(Industry != "Land-use, land-use change and forestry") %>% ggplot( aes( x = Country, y = Total_Emissions, fill = Industry )) + geom_bar(stat = "identity") + labs( title = "Total Emissions by Country and Industry", ) + ylab("Million Metric Tons of CO2") + scale_y_continuous(labels = scales::number_format()) + theme(axis.text.x = element_text(angle = 45, hjust = 1))Cleaned_Emission_graph2```This contained emissions data on 235 different countries around 
the world based on industry and also the type of gas producing these emissions from 1970 to 2022. We aggregated all the different types of gases into one metric, "Total Emissions" to simplify our analysis as we are trying to target specific industries. For visual purposes, we selected a few developed countries and noticed that China, India, and the United States have significantly higher global emissions, likely due to the size of their economies and population.We then proceeded to scrape more data from "climatedata.imf.org" where we found data on surface temperature changes from 1970 to 2022 by country. This allowed us to then compare these changes in surface temperature to the emissions data above.```{r, message = FALSE, warning = FALSE}# Retrieving Data on Temperature change by countrylibrary(rvest)library(RSelenium)url = "https://climatedata.imf.org/datasets/4063314923d74187be9596f10d034914_0/explore"rD <- rsDriver(browser = 'firefox', chromever = NULL, verbose = FALSE)remDr <- rD[["client"]]Sys.sleep(2)remDr$navigate(url)Sys.sleep(8)for (i in 1:10){ remDr$executeScript("document.querySelector('.infinite-scroll-container').scrollTop += 5000;") Sys.sleep(2)}html <- remDr$getPageSource()remDr$close()# Cleaning the data and converting it into a long data framesurface_temp_data <- rvest::read_html(html[[1]]) %>% rvest::html_elements(css = "table") %>% rvest::html_table() %>% as.data.frame() %>% dplyr::select(-CTS.Full.Descriptor, -CTS.Code, -Source, -Indicator, -ISO2, ISO3, -CTS.Name, -ISO3, -X1961, -X1962, -X1963, -X1964, -X1965, -X1966, -X1967, -X1968, -X1969, -Unit)colnames(surface_temp_data) <- gsub("X","",colnames(surface_temp_data))surface_temp_data <- surface_temp_data %>% tidyr::pivot_longer(cols = -c("Country"), names_to = "Year", values_to = "Temp Change", values_drop_na = FALSE) %>% dplyr::select(Year, everything())surface_temp_data$Country <- gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", surface_temp_data$Country)surface_temp_data_graph <- surface_temp_data %>% dplyr::filter(Country %in% c("United States", "India", "Australia and New Zealand", "Canada", "United Kingdom", "France", "Mexico")) %>% ggplot( aes( x = Year, y = `Temp Change`, color = Country )) + geom_bar(stat = "identity", position = "dodge") + labs( title = "Temperature Change by Country 1970 - 2022", x = "Year", y = "Temperature Change (Degrees C)" ) + theme(axis.text.x = element_text(angle = 90, hjust = 1))surface_temp_data_graph```After joining our two data sets together we then grouped by country, industry, and year. This allowed us to transform our data from a long data frame into a wide one illustrating the emissions produced by different industries shown on the graph below. 
We noticed a significant trend upwards in temperature change from 1970 to 2022 and decided to investigate this further.```{r, message = FALSE, warning = FALSE}# Joining the two data sets togethercleaned_data <- Cleaned_Emissions %>% tidyr::pivot_wider(names_from = Industry, values_from = Total_Emissions) %>% dplyr::right_join(surface_temp_data, by = c("Year", "Country")) %>% tidyr::drop_na() %>% distinct(Country, Year, .keep_all = TRUE)cleaned_data$Year <- as.Date(paste0(cleaned_data$Year, "-01-01"))cleaned_data_graph <- Cleaned_Emissions %>% dplyr::right_join(surface_temp_data, by = c("Year", "Country")) %>% dplyr::filter(Country %in% c("United States", "India", "Australia and New Zealand", "Canada", "United Kingdom", "France", "Mexico")) %>% ggplot(aes( x = Year, y = `Temp Change`, color = Country, size = Total_Emissions )) + geom_point() + labs( title = "Temp Change by Country, Industry, and Emission", ) + ylab("Temperature Change (Degrees C)") + theme(axis.text.x = element_text(angle = 90, hjust = 1))cleaned_data_graph```Above is another visual representation we created that allowed us to easily see this trend and also incorporated the emissions generated by each country in the respective size of each data point. To try and capture more of this trend we decided to run a linear regression analysis to see how well our emissions data captured this change in temperature.```{r, message = FALSE, warning = FALSE}# Feature engineering our linear regression modellibrary(rsample)cleaned_data_split <- cleaned_data %>% rsample::initial_split(prop = 0.8, strata = Country)training_data <- training(cleaned_data_split)testing_data <- testing(cleaned_data_split)library(recipes)recipe_pipeline_train <- recipes::recipe(`Temp Change` ~ ., data = training_data) %>% recipes::step_rm(Year) %>% recipes::step_rm(Country) %>% recipes::step_normalize(all_numeric()) %>% recipes::prep()train_baked <- recipes::bake(recipe_pipeline_train, training_data)recipe_pipeline_test <- recipes::recipe(`Temp Change` ~ ., data = testing_data) %>% recipes::step_rm(Year) %>% recipes::step_rm(Country) %>% recipes::step_normalize(all_numeric()) %>% recipes::prep()test_baked <- recipes::bake(recipe_pipeline_test, testing_data)``````{r, message = FALSE, warning = FALSE}# Linear Modellibrary(parsnip)model_lm <- parsnip::linear_reg(mode = "regression") %>% parsnip::set_engine("lm") %>% parsnip::fit(`Temp Change` ~ ., data = train_baked)summary(model_lm$fit)```We noticed that certain industries, namely the Manufacturing Industry as well as Industrial Processes seem to be statistically significant in our model. However, we also noticed that our adjusted R-squared is particularly low at 6.37%. This signifies that there are other factors contributing to the changes in surface temperature that we have not accounted for in our model and from here we should try and do some more research into what other factors may be causing these changes in surface temperature. 
We did however want to try and expand further on the data we already collected, in doing so we attempted to perform some machine learning and feature engineering in order to try and predict how the surface temperatures may react to changes in emissions from these industries.```{r, message = FALSE, warning = FALSE}# Running predictionslibrary(yardstick)predictions_lm <- model_lm %>% stats::predict(new_data = test_baked) %>% dplyr::bind_cols(`Temp Change` = test_baked %>% ungroup() %>% dplyr::select("Temp Change")) %>% yardstick::metrics(truth = `Temp Change`, estimate = .pred) %>% dplyr::arrange(.metric)predictions_lm```After performing some predictions and further analysis on these predictions we found that they had little significance in their ability to form accurate predictions of surface temperature with relatively high MAE and RMSE metrics.We then looked at performing a Bayesian regression in order to see if that would fit our data better and give us a better representation on how these industries affect changes in surface temperature, including seeing if predictions from this model would yield better results.```{r, message = FALSE, warning = FALSE}# Bayesian regressionlibrary(rstanarm)options(mc.cores = parallel::detectCores())model_bayes <- parsnip::linear_reg(mode = "regression") %>% parsnip::set_engine("stan", prior_intercept = rstanarm::normal(), prior = rstanarm::student_t(df = 1), iter = 10000, seed = 123) %>% parsnip::fit(`Temp Change` ~ ., data = train_baked)# Resultslibrary(broom.mixed)broom.mixed::tidy( model_bayes, conf.int = TRUE, conf.level = 0.2) %>% dplyr::mutate(dplyr::across(where(is.numeric), ~round(.x,5)))out_bayes <- model_bayes %>% stats::predict(new_data = test_baked) %>% dplyr::bind_cols(`Temp Change` = test_baked %>% ungroup() %>% dplyr::select(`Temp Change`)) %>% yardstick::metrics(truth = `Temp Change`, estimate = .pred) %>% dplyr::arrange(.metric)out_bayes```This proved to give us a similar output as our linear regression siggesting that our data does indeed fit a linear model the best. We did notice that the confidence intervals between estimates generally have large ranges given our range of temperature changes in the original data is quite small.Plotting these estimates along with their confidence interval in a whisker plot gave us some more insight into which industries have the largest confidence intervals. We noticed that the Industrial Processes, Energy, Transportation, and Waste industries had the widest confidence intervals.```{r, message = FALSE, warning = FALSE}broom.mixed::tidy(model_bayes, conf.int = T) %>% ggplot2::ggplot(aes(x = term)) + geom_point(aes(y = estimate)) + geom_errorbar(aes(ymin = conf.low, ymax = conf.high), width = 1.5) + theme(axis.text.x = element_text(angle = 45, hjust = 1))```We then decided to go back to the original data to see try and investigate these 4 industries further. After graphing it on a scatter plot we noticed that the industries with the largest emissions were in fact the Transportation and Energy industries. Without finding more factors that may be connected to changes in surface temperature we relied on empirical evidence that suggests emissions do in fact have an impact on surface temperature changes. 
From this we dived deeper into both the Energy and Transportation industries where we looked at potential remedies that may reduce emissions from both of these sectors, namely the transition towards electrification and electric vehicles.```{r, message = FALSE, warning = FALSE}Cleaned_Emissions_graph <- Cleaned_Emissions %>% dplyr::filter(Country %in% c("China", "United States", "India", "World")) %>% dplyr::filter(Industry != "Other") %>% ungroup()Cleaned_Emissions_graph$Year <- as.Date(paste0(Cleaned_Emissions_graph$Year, "-01-01"))Emissions_graph <- Cleaned_Emissions_graph %>% plotly::plot_ly( x = ~ Industry, y = ~ Total_Emissions, type = "scatter")Emissions_graph``````{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}library(xml2)library(jsonlite)library(readxl)library(stringr)library(factoextra)library(gridExtra)library(tm)library(SnowballC)library(purrr)library(furrr)library(topicmodels)library(tidytext)#install.packages("factoextra")#install.packages("maps")#install.packages("topicmodels")#install.packages("tidytext")url <- "https://www.eia.gov/totalenergy/reports.php"# Start Selenium serverremDr$open()remDr$navigate(url)Sys.sleep(2)html_content_data <- remDr$getPageSource()[[1]]Sys.sleep(2)reading_html <- read_html(html_content_data)Sys.sleep(2)href_elements_html <- reading_html %>% html_nodes("a.ico.html") %>% html_attr("href")Sys.sleep(2)href_elements_html <- paste0("https://www.eia.gov/", href_elements_html)href_elements_html <- href_elements_html[-c(1,2,3, 54:59)]Sys.sleep(2)href_elements_html <- href_elements_html[!grepl(c("pdf"), href_elements_html)]remDr$close()``````{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}remDr$open()# Function to extract paragraphs from a webpageextract_paragraphs <- function(url) { remDr$navigate(url) page_source <- remDr$getPageSource()[[1]] page <- read_html(page_source) paragraphs <- html_nodes(page, "p") html_text(paragraphs)}# List of linkslinks <- href_elements_html# Extract paragraphs from each linkparagraphs_list <- lapply(links, extract_paragraphs)remDr$close()paragraphs_list <- gsub("(united states|United states|United States)", "US", paragraphs_list, ignore.case = TRUE)paragraphs_list <- text_cleaned <- gsub("(energy consumption)", "EC", paragraphs_list, ignore.case = TRUE)paragraphs_list <- text_cleaned <- gsub("\\\\r\\\\n\\\\t", " ", paragraphs_list, ignore.case = TRUE) paragraphs_list <- gsub("(\\\\| tttttttt | \n\ | \ |)", "", paragraphs_list)pattern <- "(t{3,})"paragraphs_list <- gsub(pattern, "", paragraphs_list)paragraphs_list_for_cleaning <- tolower(paragraphs_list)``````{r message=FALSE, warning=FALSE, error=FALSE , results='hide' }#cleaning phase - in this phase we remove the stop words, the numbers, puncuation, and extra space in the sentences.corpus <- VCorpus(VectorSource(paragraphs_list_for_cleaning))corpus <- tm_map(corpus, removePunctuation)corpus <- tm_map(corpus, removeWords, stopwords())corpus <- tm_map(corpus, stemDocument)corpus <- tm_map(corpus, stripWhitespace)corpus <- tm_map(corpus, removeNumbers)cleaned_text <- sapply(corpus, as.character)Sys.sleep(2)# the two below paragraphs were repeated everywhere, so this would cause a problem in my NLP model because the number of occurance is very the bedrock of NLP so I have to get rid of these paragraphs. 
Also all the words are in its root format (ie included, including include - all of them might be as includ)# this way we could get a good representation of where the words fall.cleaned_text <- gsub(c("cmenu crude oil gasolin heat oil diesel propan liquid includ biofuel natur gas liquid explor reserv storag import export product price sale sale revenu price power plant fuel use stock generat trade demand emissionsn energi use home commerci build manufactur transport reserv product price employ product distribut stock import export includ hydropow solar wind geotherm biomass ethanol uranium fuel nuclear reactor generat spent fuel comprehens data summari comparison analysi project integr across energi sourc month year energi forecast analysi energi topic financi analysi congression reportsn"), " ",cleaned_text)Sys.sleep(2)cleaned_text <- gsub(c("financi market analysi financi data major energi compani greenhous gas data voluntari report electr power plant emiss map tool resourc relat energi disrupt infrastructur state energi inform includ overview rank data analys map energi sourc topic includ forecast map intern energi inform includ overview rank data analys region energi inform includ dashboard map data analys tool custom search view specif data set studi detail document access timeseri data free open data avail api excel addin bulk file widget come test product still develop let us know think form use collect energi data includ descript link survey instruct addit inform sign email subcript receiv messag specif product subscrib feed updat product includ today energi what new short time articl graphic energi fact issu trend lesson plan scienc fair experi field trip teacher guid career corner report request congress otherwis deem import"), " ",cleaned_text)Sys.sleep(2)pattern <- "\\b\\w{16,}\\b"cleaned_text <- gsub(pattern, "", cleaned_text)Sys.sleep(2)cleaned_text[-c(2, 6, 8, 10, 41, 42, 43, 44, 45, 46, 48)]str_count(cleaned_text)#here I got rid of the repeatitive words and short formats both of these are meaningless in the context of our analysis.cleaned_text <- gsub(c("eia | ieo"), "", cleaned_text)``````{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}dtm <- DocumentTermMatrix(Corpus(VectorSource(cleaned_text)))dtm <- removeSparseTerms(dtm, 0.999)datasett <- as.data.frame(as.matrix(dtm))filtered_matrix <- datasett[rowSums(datasett != 0) > 0, ]ap_topic_model <- topicmodels::LDA(filtered_matrix, k = 18, control = list(seed = 321))#Running our topic modelSys.sleep(2)AP_topics <- tidytext::tidy(ap_topic_model, matrix = "beta")ap_top_terms <- AP_topics %>% group_by(topic) %>% top_n(10, beta) %>% ungroup() %>% arrange(topic, -beta)Sys.sleep(2)first_plot <- ap_top_terms %>% mutate(term = reorder(term, beta)) %>% mutate(topic = paste("Topic #", topic)) %>% ggplot(aes(term, beta, fill = factor(topic))) + geom_col(show.legend = FALSE)+ facet_wrap(~ topic, scales = "free")+ theme_minimal()+ theme(plot.title = element_text(hjust = 0.5, size=18), axis.text.y = element_text(size = 5), axis.text.x = element_text(size = 5))+ labs(title = "Most relevent terms grouped together", caption = "Top terms by topic (betas)")+ ylab("")+ xlab("")+ coord_flip()# AEO - the full form is annual energy outlook ```Energy is the cornerstone of modern civilization, driving economic prosperity, technological innovation, and societal well-being. Despite its paramount importance, energy remains a sensitive topic often overshadowed by geopolitical tensions, environmental concerns, and social inequalities. 
Historical incidents, such as the oil crises of the 1970s and nuclear disasters like Chernobyl and Fukushima, underscore the complexities and risks associated with energy production and consumption. As we confront the challenges of climate change and strive for a sustainable future, it is imperative to recognize the centrality of energy in shaping our world and engage in open, informed dialogue to navigate the complexities and opportunities it presents.With that in mind we web scraped the EIA website to understand the prospects of what does the analysis say about energy sector in USA. Generally, these analysis are long and one analysis talks about different areas, hence difficult to get a general idea about what it is all about. Therefore, we used natural language processing (NLP) to break them into root words, and put them in different topic brackets. We used topic modelling in this case and after cleaning and editing the analysis we generated this chart. All the words are in its root form meaning that for the purpose of NLP analysis we changed the words ( included/including/include = includ) to its root form. This way we would be able to fairly compare and catagorize the words to related topic```{r message=FALSE, warning=FALSE, error=FALSE}first_plot```In the above chart we used Latent Dirichlet Allocation (LDA) model to create a better fit for each word across all articles we web scrapped. It is a mixed membership model where one word could be shared with one or many topics. As a result we chose the number of topics and have a list of words associated with one or more topics. (for the context AEO stands for annual energy outlook.)How did we do it?We cleaned the data and changed it to document term matrix: which is a representation of a corpus (collection of texts) in a matrix format. the rows are the paragraphs/articles, and the columns are each words/phrase. Looking at the graph, the horizontal axis is the Beta: the probability of each word being related to each topicwe see that topic 3 has something related to electric sector and fuel, lets dive deep and explore that specific topicFrom the chart description, it seems that the topic modeling analysis has revealed several key insights into the energy sector of the USA. 
From the chart, the topic modeling analysis reveals several key insights into the energy sector of the USA. Contrary to what mass-media coverage might suggest, the analyses say something rather different: the focus is more on policy change and on the growing demand for energy in the form of electricity.

Overall, the LDA model's findings highlight the complexity and multifaceted nature of the energy sector in the USA, encompassing production, economic implications, resource management, and policy impact.

Speaking of energy, CO2 emissions are the main concern driving the shift towards sustainable energy, whether framed as a political agenda or otherwise; from here onwards we will measure the progress of countries that advocate for decarbonization.

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
National_GreenhouseGas_Emissions_Country <- read_csv(destfile) %>%
  dplyr::select(-ObjectId, -ISO2, -ISO3, -Source, -CTS_Name, -CTS_Full_Descriptor,
                -Indicator, -Scale, -CTS_Code,
                -F2023, -F2024, -F2025, -F2026, -F2027, -F2028, -F2029, -F2030)
colnames(National_GreenhouseGas_Emissions_Country) <- gsub("F", "", colnames(National_GreenhouseGas_Emissions_Country))
National_GreenhouseGas_Emissions_Country
```

```{r Emission, message=FALSE, warning=FALSE, error=FALSE, results='hide'}
# Filter out China and India; their emissions are so large they would dominate the colour scale.
National_GreenhouseGas_Emissions_Country <- National_GreenhouseGas_Emissions_Country %>%
  mutate(Country = gsub(",.*| Rep\\. of| Rep\\.|Arab Rep\\. of", "", Country)) %>%
  as.data.frame() %>%
  filter(!Country %in% c("China", "India")) %>%
  select(-c(2, 3, 4))
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
filteredd <- National_GreenhouseGas_Emissions_Country %>% select("Country", "2020")
filteredd <- filteredd %>% rename(Emission = "2020")
filteredd <- aggregate(Emission ~ Country, data = filteredd, FUN = sum) %>% rename(region = Country)

# Get world map data
world_map <- map_data("world")
merged_data <- left_join(world_map, filteredd, by = "region")

# Plot the map with emissions
map2020 <- merged_data %>%
  ggplot(aes(x = long, y = lat, group = group, text = region)) +
  geom_polygon(aes(fill = Emission), color = "Black") +
  scale_fill_gradient(name = "Emission 2020", low = "yellow", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())
map2020 <- ggplotly(map2020)
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
filteredd1 <- National_GreenhouseGas_Emissions_Country %>% select("Country", "2008")
filteredd1 <- filteredd1 %>% rename(Emission = "2008")
filteredd1 <- aggregate(Emission ~ Country, data = filteredd1, FUN = sum) %>% rename(region = Country)

# Get world map data
world_map1 <- map_data("world")
merged_data1 <- left_join(world_map1, filteredd1, by = "region")

# Plot the map with emissions
map2008 <- merged_data1 %>%
  ggplot(aes(x = long, y = lat, group = group, text = region)) +
  geom_polygon(aes(fill = Emission), color = "Black") +
  scale_fill_gradient(name = "Emission", low = "yellow", high = "red", na.value = "grey50") +
  theme(axis.text.x = element_blank(),
        axis.text.y = element_blank())
map2008 <- ggplotly(map2008)
```

```{r message=FALSE, warning=FALSE, error=FALSE}
subplot(map2020, map2008, nrows = 2) %>%
  layout(title = "World emission in million metric tons of CO2 equivalent (2020 top, 2008 bottom)")
```

We purposefully excluded India and China: their emissions are so high that other countries would not even seem close, so including them would greatly skew our colour scale.
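A minimal sketch, assuming the `filteredd` and `world_map` objects from the chunks above, of how to list the country names that fail to match the map's region names (and therefore render grey on the choropleth):

```{r message=FALSE, warning=FALSE, error=FALSE}
# Hedged sketch: emissions rows whose country name has no matching region in
# map_data("world"); these appear grey on the maps above.
dplyr::anti_join(filteredd, dplyr::distinct(world_map, region), by = "region")
```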
Interestingly, the United States is not reported under its own name in this dataset; its emissions appear to be aggregated under an "Advanced Economies" entry together with other economies, so we did not include it.

We can see that most developed economies' emissions decreased, whereas emerging economies' emissions increased over time. Now let's look at a different aspect of CO2 emissions. Before moving on, please play around with the graph below; it looks like it carries less information, but we deliberately simplified it for a clearer result.

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
Emission2017_2021 <- National_GreenhouseGas_Emissions_Country %>%
  select(`Country`, `2017`, `2018`, `2019`, `2020`, `2021`) %>%
  filter(Country %in% c("China", "United States", "Germany", "France", "United Kingdom",
                        "Norway", "Netherlands", "Sweden", "Japan", "Canada"))

# Drop the Country column and sum each year's emissions across these countries.
Emission2017_2021 <- Emission2017_2021[-1]
column_aggregates <- sapply(Emission2017_2021, FUN = function(x) sum(x, na.rm = TRUE)) %>%
  as.data.frame()
column_aggregates <- column_aggregates %>% mutate(Year = c("2017", "2018", "2019", "2020", "2021"))
column_aggregates <- column_aggregates %>% rename(Emission = ".")
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
# Scrape yearly EV sales by country from Wikipedia with RSelenium.
linkkwiki <- "https://en.wikipedia.org/wiki/Electric_car_use_by_country"
remDr$open()
Sys.sleep(1)
remDr$navigate(linkkwiki)
Sys.sleep(1)
html_content_data_wiki <- remDr$getPageSource()
html_content <- html_content_data_wiki[[1]]
reading_html_wiki <- read_html(html_content)
remDr$close()

table <- html_nodes(reading_html_wiki, "table.wikitable") %>% html_table(fill = TRUE)
Sys.sleep(1)

EV_sales2 <- table[[3]] %>% as.data.frame() %>% select(1, 3, 5, 7, 9, 11)
EV_sales2 <- EV_sales2[-1, ]

# Strip footnote markers and parenthetical notes, then clean the numbers.
EV_sales2 <- EV_sales2 %>% lapply(function(x) gsub("\\[.*?\\]", "", x)) %>% as.data.frame()
EV_sales2$Country <- gsub("\\(.*?\\)", "", EV_sales2$Country) %>% na.omit()
EV_sales2 <- EV_sales2[1:(nrow(EV_sales2) - 6), ]
EV_sales2 <- EV_sales2 %>% mutate(across(everything(), ~ gsub(",", "", .)))
EV_sales2 <- EV_sales2 %>% mutate(across(2:6, as.numeric))
Sys.sleep(1)

# Sum sales across countries for each year.
EV_sales2 <- EV_sales2[-1]
EV_sales2 <- sapply(EV_sales2, FUN = function(x) sum(x, na.rm = TRUE)) %>% as.data.frame()
Sys.sleep(1)
EV_sales <- EV_sales2 %>% mutate(Year = c("2021", "2020", "2019", "2018", "2017")) %>% arrange(Year)
EV_sales <- EV_sales %>% rename(evsales = ".")
Sys.sleep(1)

final_dataa <- left_join(EV_sales, column_aggregates, by = "Year")
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
plot <- final_dataa %>%
  ggplot(aes(x = evsales, y = Emission, col = factor(Year))) +
  geom_point() +
  labs(title = "Global EV sales and Emission", col = "Year")
plotly_plottt <- ggplotly(plot)
```

```{r message=FALSE, warning=FALSE, error=FALSE}
plotly_plottt
```

For this part we took the emissions of the countries whose EV sales we gathered, so we could see whether EV adoption over time actually correlates with a decrease in their emissions. EV sales in these economies move in the opposite direction to CO2 emissions, though not dramatically, and the only time emissions dropped sharply was during the COVID-19 lockdowns in 2020; a quick correlation check is sketched below.
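As a minimal sketch (indicative at best, given only five yearly points, and assuming the `final_dataa` data frame built above), the correlation between aggregate EV sales and aggregate emissions can be computed directly:

```{r message=FALSE, warning=FALSE, error=FALSE}
# Hedged sketch: Pearson correlation between yearly EV sales and emissions,
# 2017-2021, using the final_dataa object created in the chunk above.
cor(final_dataa$evsales, final_dataa$Emission, use = "complete.obs")
```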
With only five data points we relied on the visual rather than a formal statistic to convey the message. From both graphs (the maps and the scatter plot) we can conclude that developed economies are indeed heading towards decarbonization, but the progress is slow and gradual.

Discussions of the energy sector inevitably involve fossil fuels such as oil and gas. When considering EVs, the crucial shift is the transition from fossil fuel-based energy sources to cleaner alternatives. EVs primarily rely on electricity, which can be generated from renewable sources like solar, wind, or hydroelectric power, so this transition affects both the energy sector and the EV industry.

In the universe of energy, electric cars aren't just a new vehicle option; they're a game-changer. By tapping into the power grid instead of the gas station, these cars are reshaping our approach to transportation and energy consumption. They're not just driving us forward; they're propelling us toward a more sustainable future.

Despite their growing popularity, there is still much we don't know about electric cars. While they hold promise for reducing emissions and dependence on fossil fuels, their widespread adoption raises a host of unanswered questions. How will they impact the electricity grid? What are the long-term environmental implications of their production and disposal? These are all worthwhile questions, but the more important one is how to understand these cars in the first place, because they are not the same as combustion-engine cars; not even close.

Within the realm of electric cars, a significant challenge lies in the absence of direct comparisons to traditional combustion-engine vehicles. Unlike their gasoline-powered counterparts, electric cars possess unique characteristics that defy conventional metrics of comparison.
Factors such as range, charging time, and efficiency take on new dimensions in the context of electric propulsion, requiring novel methodologies for evaluation.

```{r}
# Load required libraries
if (!require("rvest")) install.packages("rvest")
if (!require("dplyr")) install.packages("dplyr")
if (!require("foreach")) install.packages("foreach")
if (!require("doParallel")) install.packages("doParallel")
library(rvest)
library(dplyr)
library(foreach)
library(doParallel)

# Helper function: remove non-numeric characters (except comma and dot),
# replace comma with dot, and convert to numeric.
parse_numeric <- function(text) {
  num <- gsub("[^0-9,\\.]", "", text)
  num <- gsub(",", ".", num)
  as.numeric(num)
}

# URL to scrape
url <- "https://ev-database.org/"

# Read HTML content
page <- read_html(url)

# Extract all car blocks with class 'list-item'
car_nodes <- page %>% html_nodes(".list-item")

# Convert each node to a character string for serialization across workers
car_html_list <- lapply(car_nodes, as.character)

# Set up parallel backend using available cores
numCores <- parallel::detectCores() - 1  # leave one core free
cl <- makeCluster(numCores)
registerDoParallel(cl)

# Process each car block in parallel using foreach
car_results <- foreach(i = seq_along(car_html_list), .combine = rbind,
                       .packages = c("rvest", "dplyr", "xml2")) %dopar% {
  # Rebuild the HTML from the character string
  car_block <- read_html(car_html_list[[i]])

  # Basic information
  model_name    <- car_block %>% html_node(".title span[class]") %>% html_text(trim = TRUE)
  model_variant <- car_block %>% html_node(".title .model") %>% html_text(trim = TRUE)

  # Additional attributes
  efficiency_text    <- car_block %>% html_node("span.efficiency") %>% html_text(trim = TRUE)
  weight_text        <- car_block %>% html_node("span.weight_p") %>% html_text(trim = TRUE)
  acceleration_text  <- car_block %>% html_node("span.acceleration_p") %>% html_text(trim = TRUE)
  longdistance_text  <- car_block %>% html_node("span.long_distance_total") %>% html_text(trim = TRUE)
  price_text         <- car_block %>% html_node("span.country_uk") %>% html_text(trim = TRUE)
  battery_text       <- car_block %>% html_node("span.battery_p") %>% html_text(trim = TRUE)
  fastcharge_text    <- car_block %>% html_node("span.fastcharge_speed_print") %>% html_text(trim = TRUE)
  priceperrange_text <- car_block %>% html_node("span.priceperrange_p") %>% html_text(trim = TRUE)
  seats_text         <- car_block %>% html_node("i.seats-5 + span") %>% html_text(trim = TRUE)

  # Convert extracted texts to numeric values where appropriate
  efficiency      <- if (!is.na(efficiency_text)) parse_numeric(efficiency_text) else NA
  weight          <- if (!is.na(weight_text)) parse_numeric(weight_text) else NA
  acceleration    <- if (!is.na(acceleration_text)) parse_numeric(acceleration_text) else NA
  long_distance   <- if (!is.na(longdistance_text)) parse_numeric(longdistance_text) else NA
  price           <- if (!is.na(price_text)) parse_numeric(price_text) else NA
  battery         <- if (!is.na(battery_text)) parse_numeric(battery_text) else NA
  fastcharge      <- if (!is.na(fastcharge_text)) parse_numeric(fastcharge_text) else NA
  price_per_range <- if (!is.na(priceperrange_text)) parse_numeric(priceperrange_text) else NA
  seats           <- if (!is.na(seats_text)) parse_numeric(seats_text) else NA

  # Return a data frame row for this car
  data.frame(
    Model = model_name,
    Variant = model_variant,
    Efficiency = efficiency,
    Weight = weight,
    Acceleration = acceleration,
    Range = long_distance,
    Price = price,
    Battery = battery,
    FastChargeSpeed = fastcharge,
    PricePerRange = price_per_range,
    Seats = seats,
    stringsAsFactors = FALSE
  )
}

# Stop the parallel cluster
stopCluster(cl)

# Rename columns to include units for clarity
colnames(car_results) <- c("Model", "Variant", "Efficiency (Wh/km)", "Weight (kg)",
                           "Acceleration (sec)", "Range (km)", "Price (£)", "Battery (kWh)",
                           "FastChargeSpeed (kW)", "PricePerRange (€/km)", "Seats")
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
plot_car <- plot_ly(
  data = car_results,
  x = ~`Acceleration (sec)`,
  y = ~`Price (£)`,
  text = ~paste(`Model`, `Variant`, sep = " - "),
  size = ~`FastChargeSpeed (kW)`,
  color = ~`Model`,
  colors = "Set1",
  type = "scatter",
  mode = "markers",
  marker = list(symbol = "circle"),
  hoverinfo = "text+x+y+size"
) %>%
  layout(
    title = "Electric Car Model Comparison (Bubble size = Fast Charge Speed in kW)",
    xaxis = list(title = "Acceleration (sec)"),
    yaxis = list(title = "Price (£)"),
    legend = list(title = list(text = "Car Model"))
  )
```

```{r message=FALSE, warning=FALSE, error=FALSE}
plot_car
```

The graph above shows that electric vehicles are quite different beasts when it comes to understanding their nature. If you double-click the bubble next to a car model in the legend, the other vehicles are removed and you can then select the models you want to compare. All of these EVs are models that are already on the market or will reach it in the coming years; so far there are roughly 345 models in this graph.

So why not categorize them based on their attributes? We used the numeric attributes we scraped, such as usable battery capacity, 0-100 km/h acceleration, range, efficiency, fast-charge speed, weight, price, price per range, and seat count, to cluster the models. This gives a better understanding of which model falls where.

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
Sys.sleep(1)
# Drop the identifier columns, replace missing values with 0, and standardize.
clustered <- car_results %>%
  replace(is.na(.), 0) %>%
  select(-c("Model", "Variant"))
clustered <- clustered %>% scale()

if (!require(factoextra)) {
  install.packages("factoextra")
  install.packages("gridExtra")
}
library(factoextra)
library(cluster)
library(gridExtra)

# Calculate the clustering indices (elbow, silhouette, gap statistic)
alphaa   <- fviz_nbclust(clustered, kmeans, method = "wss") + theme(plot.title = element_blank())
charliee <- fviz_nbclust(clustered, kmeans, method = "silhouette") + theme(plot.title = element_blank())
betaa    <- fviz_nbclust(clustered, kmeans, method = "gap_stat") + theme(plot.title = element_blank())

# Arrange the plots next to each other
suppressMessages({
  k <- grid.arrange(alphaa, charliee, betaa, ncol = 3)
})
```

```{r message=FALSE, warning=FALSE, error=FALSE}
k
```

Based on these indices, we can cluster our list of electric cars into three main categories.

```{r message=FALSE, warning=FALSE, error=FALSE}
fviz_cluster(kmeans(clustered, centers = 3, iter.max = 150, nstart = 150), data = clustered)
```

In the plot above the first principal component is on the horizontal axis and the second principal component is on the vertical axis; a quick check of how much variance each component captures is sketched below.
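A minimal sketch, assuming the scaled attribute matrix `clustered` from the chunk above, of how to inspect the variance captured by the leading principal components:

```{r message=FALSE, warning=FALSE, error=FALSE}
# Hedged sketch: proportion of variance explained by the first few principal
# components of the scaled EV attribute matrix.
pca <- prcomp(clustered)
summary(pca)$importance[, 1:3]
```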
Since the first principal component explains more than half of the variance in our data, the split along it drives the grouping into three main categories; the result is three clusters.

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
Clusters <- kmeans(clustered, centers = 3, iter.max = 150, nstart = 150)
car_results <- car_results %>% mutate(cluster = Clusters$cluster)
```

```{r message=FALSE, warning=FALSE, error=FALSE, results='hide'}
plott_V <- car_results %>%
  plot_ly(
    x = ~`Range (km)`,
    y = ~`Efficiency (Wh/km)`,
    color = ~factor(cluster),
    size = ~as.numeric(`Price (£)`),
    text = ~paste("Model:", Model, "<br>",
                  "Variant:", Variant, "<br>",
                  "Battery (kWh):", `Battery (kWh)`, "<br>",
                  "Efficiency (Wh/km):", `Efficiency (Wh/km)`, "<br>",
                  "Acceleration (0-100 km/h, sec):", `Acceleration (sec)`, "<br>",
                  "Weight (kg):", `Weight (kg)`, "<br>",
                  "FastCharge Speed (kW):", `FastChargeSpeed (kW)`, "<br>",
                  "Price in £:", `Price (£)`, "<br>",
                  "Price Per Range (€/km):", `PricePerRange (€/km)`, "<br>",
                  "Seats:", Seats)
  ) %>%
  add_markers() %>%
  layout(
    title = "Electric Vehicle Data: Clusters, Price, and Efficiency",
    xaxis = list(title = "Range (km)"),
    yaxis = list(title = "Efficiency (Wh/km)"),
    coloraxis = list(title = "Cluster"),
    sizeaxis = list(title = "Price (£)")
  )
```

```{r message=FALSE, warning=FALSE, error=FALSE}
plott_V
```

Both range and efficiency are vital considerations for electric car buyers, and they are relatively independent of each other. Range is the total distance a car can travel on a single charge, while efficiency shows how well the car uses its battery capacity to cover that distance. Together they offer complementary information for evaluating the practicality, cost-effectiveness, and environmental footprint of electric vehicles.

Efficiency in electric cars is expressed as energy consumption per unit distance traveled; in our data it is measured in watt-hours per kilometer. For example, an efficiency of 286 Wh/km means the car consumes 286 watt-hours of energy to travel 1 km. Lower Wh/km values (i.e., higher efficiency) mean the car can travel farther on the same battery capacity, which is desirable for maximizing range and reducing operating costs. As an illustrative calculation, a car with a 90 kWh usable battery and a consumption of 180 Wh/km can cover roughly 90,000 / 180 = 500 km. Efficiency is influenced by factors such as vehicle weight, aerodynamics, tire rolling resistance, driving behavior, and environmental conditions.

The range of an electric car is the distance it can travel on a single charge of its battery. It is a critical specification because it directly determines usability for everyday driving and long-distance travel. For instance, a professor who lives in Calgary and needs to travel to Edmonton (about 300 km) needs an electric car whose range comfortably exceeds that distance. In our data the best fit for such a person would be the Lucid Air Grand Touring, priced at about 132,000 EUR; the catch is the trade-off between range and efficiency. In this dataset the most efficient cars tend to have shorter ranges, and vice versa. The balanced models sit in cluster 1; a buyer who can afford more luxurious cars should look at cluster 2, and a buyer looking for affordability at cluster 3. This is how clustering makes decisions easier: it lets us choose based on the constraints we have.

Hence, our findings can be summarized in the following points, which also answer the questions we asked at the beginning:

a\. Electrification and the adoption of electric vehicles can significantly impact CO2 emissions.
By shifting from fossil fuel-based energy sources to cleaner alternatives and promoting the use of electric vehicles, there is potential for substantial reductions in emissions.

b\. Understanding consumer attitudes and behaviors towards electric vehicles is crucial for analyzing uptake and identifying barriers to adoption. Factors such as vehicle cost, range anxiety, the availability of charging infrastructure (relative to charging speed and range), and perceived environmental benefits influence consumer decisions; these factors can inform strategies to promote electric vehicle adoption.

Electric cars may be a crucial part of reducing our emissions and slowing climate change in the near future. This may require serious thought about how to lay out the infrastructure so we can transition smoothly into a world with more electrification. Research into other renewable sources of energy may yet overtake electrification before we can build a sustainable, viable infrastructure platform to support the demand these vehicles create. Already, rolling blackouts are becoming more common in places like California as consumers put increasing load on the existing power grid. Car manufacturers such as Toyota are investing heavily in hydrogen-powered vehicles as an alternative to EVs. Some speculate that the EV is simply a transition vehicle while we search for a more sustainable method of power generation, much as the 8-track was a short-lived transition towards CDs and, eventually, digital media.