Assignment 1: Prediction of Global Renewable Energy Producers and their Economic Development

Loading Datasets

Two data sets are used: the Renewable Energy data, downloaded from https://www.kaggle.com/datasets/anishvijay/global-renewable-energy-and-indicators-dataset, and the Economic Indicators data, accessed from https://www.kaggle.com/datasets/prasad22/global-economy-indicators.

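# Load the packages used throughout the analysis
library(ggplot2); library(dplyr); library(tidyr)   # plotting and data wrangling
library(corrplot); library(caret)                  # correlation plot, model training
library(randomForest); library(e1071)              # randomForest() and svm()
# Load datasets
Economic_Indicators <- read.csv("~/Data622/Global_Economy.csv")
Renewable_Energy <- read.csv("~/Data622/Renewable_energy.csv")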
# Check structure of the dataset
#str(Renewable_Energy)
# Summary statistics of the dataset
#summary(Renewable_Energy)

Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) uses summary statistics and visualization techniques to explore a data set before modeling. This assignment explores the relationship between global economic indicators and renewable energy production. It uses machine learning models trained on historical data to predict renewable energy production from key variables such as GDP, inflation rates, and carbon emissions.

Renewable Energy Production by Country

# Bar plot for country-wise comparison
ggplot(Renewable_Energy, aes(x = reorder(Country, -Production..GWh.),
                 y = Production..GWh.)) +
  geom_bar(stat = "identity") +
  labs(title = "Global Top Renewable Energy Producers",
       x = "Country", y = "Total Renewable Energy Production(GWh)") +
  theme_minimal() +
  coord_flip()

The bar chart ranks countries by total renewable energy production. In this data set, France is the largest producer of renewable energy.
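Because the data set holds several rows per country, the bar heights above are stacked totals. As a minimal sketch of how such totals could be checked numerically, the following uses a toy data frame with hypothetical countries and values (not the Kaggle data) and base R's `aggregate()`:

```r
# Toy data with hypothetical values: several rows per country
toy <- data.frame(
  Country = c("A", "B", "A", "C", "B", "A"),
  Production = c(10, 40, 25, 30, 5, 20)
)

# Sum production within each country
totals <- aggregate(Production ~ Country, data = toy, FUN = sum)

# Order from largest to smallest total
totals <- totals[order(-totals$Production), ]
print(totals)

# The first row is the top producer in the toy data
top_country <- totals$Country[1]
```

Applied to the real data, the same pattern would use `Production..GWh. ~ Country` on `Renewable_Energy`.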

Renewable Energy Production by Energy_Type

# Check for missing values
missing_values <- colSums(is.na(Renewable_Energy))
# Remove rows with NA values
Renewable_Energy <- Renewable_Energy %>% drop_na()
# Bar plot of energy production by energy type
ggplot(Renewable_Energy, aes(x = reorder(Energy.Type, -Production..GWh.), 
                             y = Production..GWh., fill = Energy.Type)) +
  geom_bar(stat = "identity") +
  labs(title = "Energy Production by Energy Type", 
       x = "Energy Type", 
       y = "Energy Production (GWh)") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3") +  
  coord_flip()

Correlation Between Variables

# Filter the columns using column indices
filtered_data <- Renewable_Energy[, c(9, 10, 11, 13, 17, 18, 19, 20, 21, 22, 23)]
# View the filtered data
head(filtered_data)

##   Energy.Consumption Energy.Exports Energy.Imports Renewable.Energy.Jobs
## 1           369654.6      93087.198       13413.18                756878
## 2           771781.6       1752.536       78493.98                945074
## 3           342707.2      65146.592       41114.87                588423
## 4           498839.6      50257.591       42010.48                 11049
## 5           819064.4      63101.396       17912.96                225191
## 6           363527.7      81737.600       79121.27                816485
##   Average.Annual.Temperature Annual.Rainfall Solar.Irradiance Wind.Speed
## 1                   21.95603        157.0135         203.0386   5.838375
## 2                   24.63326       1336.0032         111.1331  14.872917
## 3                   13.50193        357.5171         159.9964   3.473210
## 4                    1.09090       2271.3120         297.1636  13.575549
## 5                   30.64102       2285.0160         133.2694  14.964739
## 6                   37.73961       2457.0280         220.4361   6.664297
##   Hydro.Potential Geothermal.Potential Biomass.Availability
## 1        95.59797             2.516637             30.26800
## 2        71.73461            88.703683             36.03695
## 3        84.02229             6.611658             54.59595
## 4        53.17173            39.856384             45.27613
## 5        46.42386            98.648207             93.72980
## 6        17.62276            41.178033             78.75851

# Calculate correlation matrix for numeric variables
correlation_matrix <- cor(filtered_data %>% select_if(is.numeric), use = "complete.obs")
# Plot the correlation matrix with numbers shown
corrplot(correlation_matrix, method = "circle", type = "lower", diag = FALSE,
         number.cex = 0.75,        
         addCoef.col = "blue", 
         tl.cex = 0.75,            
         tl.col = "black",      
         tl.srt = 90)

Energy Production by Country and Energy Type

# Bar plot to compare energy production by country and energy type
ggplot(Renewable_Energy, aes(x = reorder(Country, -Production..GWh.), y = Production..GWh., fill = Energy.Type)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Energy.Type) +
  labs(title = "Energy Production by Country and Energy Type", x = "Country", y = "Energy Production (GWh)") +
  coord_flip() +
  theme_minimal()

1. Are the columns of your data correlated?

The Renewable_Energy data set has 2500 observations and 56 variables. Based on the correlation coefficients, Solar.Irradiance and Hydro.Potential are the predictors most strongly correlated with renewable energy production.
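One way to make this kind of statement checkable is to rank predictors by their absolute correlation with the target. The sketch below assumes simulated data with hypothetical variable names (`x1`, `x2`, `x3`, `y`). It illustrates the technique only and does not reproduce the results of this report.

```r
set.seed(1)
n <- 200
# Hypothetical predictors and target; names and values are illustrative only
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 3 * toy$x1 + 0.5 * toy$x2 + rnorm(n)

# Correlation of every predictor with the target (returns a 3 x 1 matrix)
target_cor <- cor(toy[, c("x1", "x2", "x3")], toy$y)[, 1]

# Rank predictors by absolute correlation, strongest first
ranked <- sort(abs(target_cor), decreasing = TRUE)
print(round(ranked, 3))
```

For the real data, the target column `Production..GWh.` would need to be included alongside the predictors, since the correlation is measured against it.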

2. Are there labels in your data? Did that impact your choice of algorithm?

Yes. The goal is to predict renewable energy production, so the label is the continuous dependent variable Production..GWh., and the features include GDP, Population, Annual.Rainfall, Wind.Speed, Solar.Irradiance, Hydro.Potential, Geothermal.Potential, Biomass.Availability, and Average.Annual.Temperature. Because the label is continuous, this is a supervised regression problem, so I chose Random Forest regression, which can also rank feature importance by how much each feature reduces the prediction error.

3. What are the pros and cons of each algorithm you selected?

The Random Forest algorithm reduces the risk of overfitting by averaging many decision trees. It can build regression models for complex numerical data sets, is relatively robust to outliers, and handles non-linear relationships and large data sets. On the other hand, a larger number of trees makes the algorithm slower.

Support Vector Machine (SVM) is suitable for high-dimensional data sets. It works well with different types of data and offers a unique solution. However, processing time increases when handling large data sets with many features. SVM models are also difficult to interpret, which is a drawback when interpretability is important for business decisions.

4. How your choice of algorithm relates to the datasets (was your choice of algorithm impacted by the datasets you chose)?

The data set contains a combination of numeric and categorical variables, with complex non-linear relationships between features. Random Forest regression is therefore a good choice for predicting the numeric target variable Production..GWh. from the features (predictor variables) in the data set. It uses multiple decision trees, and each tree considers a random subset of features at each split. This reduces the risk of overfitting and makes it robust for data sets with many variables.
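The random feature subset can be made concrete. For regression, the randomForest package defaults to trying one third of the predictors at each split (rounded down, at least one). With the 11 predictors used in the model formulas of this report, that gives 3, which matches the "No. of variables tried at each split" line in the model summary. The sketch below only illustrates the sampling step. It is not the package's internal code.

```r
# The 11 predictors used in the model formulas in this report
predictors <- c("Energy.Consumption", "Energy.Exports", "Energy.Imports",
                "Renewable.Energy.Jobs", "Annual.Rainfall", "Wind.Speed",
                "Solar.Irradiance", "Hydro.Potential", "Geothermal.Potential",
                "Biomass.Availability", "Average.Annual.Temperature")

# Default mtry for regression: one third of the predictors, at least 1
mtry <- max(floor(length(predictors) / 3), 1)
print(mtry)  # 3

# Illustration of a single split: draw mtry candidate features at random
set.seed(123)
candidates <- sample(predictors, mtry)
print(candidates)
```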

5. Which result will you trust if you need to make a business decision?

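I would trust the results that balance accuracy and generalizability: RMSE (Root Mean Squared Error) and R-squared measured on held-out test data, together with cross-validation results, are the most trustworthy basis for a business decision. In this analysis both models have test R-squared values below zero (-0.0191 for Random Forest and -0.498 for SVM), meaning they predict worse than simply using the mean, so neither is accurate enough to support a business decision on its own.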

6. Do you think an analysis could be prone to errors when using too much data, or when using the least amount possible?

Yes. A data set with numerous features and a complex structure can cause a model to overfit, which reduces its predictive power on new data. A very small data set can also lead to poor results, because the model may not capture the underlying patterns in the data and may produce inaccurate predictions.

7. How does the analysis between data sets compare?

The Renewable Energy data set covers the production, growth, and capacity of energy sources such as solar, wind, hydro, geothermal, and biomass. The Economic Indicators data set comprises Gross Domestic Product (GDP), employment rate, investment, inflation rates, and government spending. Comparing the two, France, a country with strong economic growth, is the largest producer of renewable energy in this data. An increase in GDP may correlate with higher renewable energy investment, which in turn could drive down carbon emissions and support economic sustainability. Economies that integrate renewable energy are better positioned to transition to a green economy and to improve both environmental and economic resilience in the long run. When applying two machine learning models and statistical forecasting methods, the challenges include ensuring that the models are trained on accurate, representative labels and features, and capturing the non-linear interactions between variables.

Essay

The world has made great progress in the production and consumption of renewable energy, which in turn is powering economic growth and social development around the globe. Renewable energy is energy from sources that are naturally replenished, such as solar, wind, hydro, biomass, wave, tidal, and geothermal energy. Modern economic growth depends heavily on energy consumption. Energy consumption plays a critical role in generating GDP growth, particularly through industrial development in production processes that require a great deal of energy, and the progress of a nation’s economy relies on these processes. Energy is therefore considered the backbone of modern industrial development.

This assignment focuses on predicting renewable energy production in support of sustainable economic growth, taking into account the growth of trade openness and of manufactured products and services. Using renewable energy both to achieve economic development and to reduce CO2 emissions has become an increasing trend, so investing in renewable energy sources is a sensible decision as energy consumption rises to fuel economic growth. With the rapid development of global industrialization, it has been recognized that excessive consumption of fossil fuels harms the environment, increasing health risks and the threat of global climate change. Because renewable energy is sustainable and causes little environmental pollution, it has attracted growing attention. The data sets used here were downloaded from www.kaggle.com/datasets, and in them France is the highest producer of renewable energy. The development of renewable energy systems can support household incomes and employment opportunities, which contribute to the growth of a nation’s GDP.

Energy forecasting plays a vital role in the development, management, and policy making of energy systems, and various machine learning models have been employed in renewable energy prediction. The economic development of a nation is often marred by environmental degradation, because increased industrial and urban activity raises CO2 emissions. Once a country begins the transition from an agricultural economy to an industrial economy, industrial production increases and, as a result, CO2 emissions also increase. As industrialization raises average household income, heavy industrial production tends to be phased out and replaced with high-tech products, which tend to use less energy and can be powered by renewable energy sources. This also leads to the development of the service sector, whose carbon footprint is much smaller than that of the industrial sector. This development is also accompanied by a decrease in the population growth rate, and consumers tend to become more sensitive to the use of cleaner energy sources by both the industrial and service sectors, pushing governments to formulate environmentally friendly regulations. Predicting renewable energy production is therefore an important input to growing a nation’s economy, and this study analyzes two machine learning models, Random Forest and Support Vector Regression (SVR), for predicting renewable energy production.

Building Random Forest model

The available data is split into training and testing sets to build and evaluate our machine learning models.

set.seed(123) # Set seed for reproducibility
trainIndex <- createDataPartition(Renewable_Energy$Production..GWh., p = 0.8, list = FALSE)
trainData <- Renewable_Energy[trainIndex,]
testData <- Renewable_Energy[-trainIndex,]

# Check for missing values
sum(is.na(Renewable_Energy))

## [1] 0

# Drop any rows with missing values (none remain here; no imputation is performed)
Renewable_Energy <- na.omit(Renewable_Energy)

# Build the Random Forest model-regression problem
model_rf <- randomForest(formula = Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature, 
                         data = trainData, 
                         importance = TRUE, 
                         ntree = 500)  
# View the model summary
print(model_rf)

## 
## Call:
##  randomForest(formula = Production..GWh. ~ Energy.Consumption +      Energy.Exports + Energy.Imports + Renewable.Energy.Jobs +      Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential +      Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,      data = trainData, importance = TRUE, ntree = 500) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 811319182
##                     % Var explained: -1.49

# Evaluate feature importance
importance(model_rf)

##                               %IncMSE IncNodePurity
## Energy.Consumption         -0.5423079  136050459272
## Energy.Exports             -1.6032436  128923149582
## Energy.Imports             -1.2974667  127198390764
## Renewable.Energy.Jobs       5.5468875  140265467557
## Annual.Rainfall            -0.3191355  136132647761
## Wind.Speed                  0.2910684  134705116329
## Solar.Irradiance            1.2051620  135875412728
## Hydro.Potential             3.3238776  138807081140
## Geothermal.Potential        0.6769814  133743684973
## Biomass.Availability        0.4512120  137090217659
## Average.Annual.Temperature  4.9725821  152125318439

varImpPlot(model_rf)

# Make Predictions on the test data
rf_pred <- predict(model_rf, newdata = testData)

# Model Evaluation (RMSE and R-squared)
# Calculate RMSE
rmse_rf <- sqrt(mean((rf_pred - testData$Production..GWh.)^2))
print(paste("RMSE:", round(rmse_rf, 2)))

## [1] "RMSE: 28851.92"

# Calculate R-squared
SST <- sum((testData$Production..GWh. - mean(testData$Production..GWh.))^2)  # Total sum of squares
SSE <- sum((rf_pred - testData$Production..GWh.)^2)  # Sum of squared errors
r_squared_rf <- 1 - (SSE / SST)
print(paste("R-squared:", round(r_squared_rf, 4)))

## [1] "R-squared: -0.0191"
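A negative R-squared means the model's squared error is larger than that of a baseline that always predicts the mean of the observed values. The sketch below shows this with toy numbers. The helper functions `rmse()` and `r_squared()` and all values are hypothetical and only mirror the formulas used above.

```r
# Hypothetical helper functions mirroring the formulas above
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
r_squared <- function(actual, predicted) {
  1 - sum((predicted - actual)^2) / sum((actual - mean(actual))^2)
}

# Toy values for illustration only
actual <- c(10, 20, 30, 40, 50)
baseline <- rep(mean(actual), length(actual))  # always predict the mean
poor_model <- c(50, 10, 45, 15, 30)            # predictions unrelated to the target

print(r_squared(actual, baseline))    # exactly 0 for the mean baseline
print(r_squared(actual, poor_model))  # negative: worse than the baseline
print(rmse(actual, baseline))         # baseline RMSE to compare against
```

Reporting the baseline RMSE next to the model RMSE makes it easier to judge whether the model adds anything beyond the mean.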

# Cross-Validation 
train_control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Train the model with cross-validation
rf_cv_model <- train(Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
                     data = trainData,
                     method = "rf",
                     trControl = train_control,
                     importance = TRUE)

# View RMSE and R-squared from cross-validation
rf_cv_rmse <- rf_cv_model$results$RMSE[which.min(rf_cv_model$results$RMSE)]
rf_cv_r_squared <- rf_cv_model$results$Rsquared[which.min(rf_cv_model$results$RMSE)]
print(paste("Cross-Validated RMSE:", round(rf_cv_rmse, 2)))

## [1] "Cross-Validated RMSE: 28431.47"

print(paste("Cross-Validated R-squared:", round(rf_cv_r_squared, 4)))

## [1] "Cross-Validated R-squared: 0.0037"
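For readers unfamiliar with what 10-fold cross-validation does, the following is a minimal base R sketch. It assumes simulated data and uses `lm()` as a stand-in model, so it illustrates the resampling scheme and not the Random Forest fit above. Note also that caret's default summary reports R-squared as the squared correlation between observed and predicted values, which cannot be negative, so it is not directly comparable to the 1 - SSE/SST value computed on the test set.

```r
set.seed(123)
n <- 100
# Hypothetical data: y depends linearly on x plus noise
toy <- data.frame(x = runif(n))
toy$y <- 2 * toy$x + rnorm(n, sd = 0.1)

k <- 10
# Randomly assign each row to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: train on the other k - 1 folds, score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  train <- toy[fold_id != i, ]
  held_out <- toy[fold_id == i, ]
  fit <- lm(y ~ x, data = train)            # stand-in model
  pred <- predict(fit, newdata = held_out)
  sqrt(mean((pred - held_out$y)^2))
})

# Cross-validated RMSE is the average over the folds
cv_rmse <- mean(fold_rmse)
print(round(cv_rmse, 3))
```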

# Create a data frame with actual and predicted values
prediction_results <- data.frame(index = 1:length(testData$Production..GWh.), 
                                 actual = testData$Production..GWh., 
                                 predicted = rf_pred)

# Plot the actual vs predicted values using geom_line
ggplot(prediction_results, aes(x = index)) +
  geom_line(aes(y = actual, color = "Actual")) +  
  geom_line(aes(y = predicted, color = "Predicted")) + 
  labs(title = "Actual vs Predicted ",
       x = "Sample Number",
       y = "Production (GWh)") +
  scale_color_manual("", 
                     breaks = c("Actual", "Predicted"),
                     values = c("Actual" = "blue", "Predicted" = "red")) +  
  theme_minimal()

Building Support Vector Machine (SVM) Model

# Build the Support Vector Machine (SVM) model for regression
svm_model <- svm(formula = Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature, 
                 data = trainData, 
                 type = "eps-regression",  
                 kernel = "radial",        
                 cost = 10,                
                 epsilon = 0.1)            

# Make Predictions on the test data
svm_pred <- predict(svm_model, newdata = testData)

# Model Evaluation (RMSE and R-squared)
# Calculate RMSE
rmse_svm <- sqrt(mean((svm_pred - testData$Production..GWh.)^2))
print(paste("RMSE:", round(rmse_svm, 2)))

## [1] "RMSE: 34979.75"

# Calculate R-squared
SST <- sum((testData$Production..GWh. - mean(testData$Production..GWh.))^2)  # Total sum of squares
SSE <- sum((svm_pred - testData$Production..GWh.)^2)  # Sum of squared errors
r_squared_svm <- 1 - (SSE / SST)
print(paste("R-squared:", round(r_squared_svm, 4)))

## [1] "R-squared: -0.498"

# Cross-Validation 
train_control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

# Train the SVM model with cross-validation
svm_cv_model <- train(Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
                      data = trainData,
                      method = "svmRadial",  # Use radial kernel
                      trControl = train_control,
                      preProcess = c("center", "scale"))  # Centering and scaling the data for SVM

# View RMSE and R-squared from cross-validation
svm_cv_rmse <- svm_cv_model$results$RMSE[which.min(svm_cv_model$results$RMSE)]
svm_cv_r_squared <- svm_cv_model$results$Rsquared[which.min(svm_cv_model$results$RMSE)]
print(paste("Cross-Validated RMSE:", round(svm_cv_rmse, 2)))

## [1] "Cross-Validated RMSE: 28570.9"

print(paste("Cross-Validated R-squared:", round(svm_cv_r_squared, 4)))

## [1] "Cross-Validated R-squared: 0.0082"

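Although this is a regression task, we can convert actual production into a binary label (above or below the median) and plot the receiver operating characteristic (ROC) curve, which shows the True Positive Rate against the False Positive Rate for the SVM predictions.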

library(ROCR)
# Convert continuous actual production values into a binary label (1 = above the median)
threshold <- median(testData$Production..GWh.)  
binary_actual <- ifelse(testData$Production..GWh. > threshold, 1, 0)  

# Generate the prediction object
pred <- prediction(svm_pred, binary_actual)
# Calculate TPR and FPR for ROC curve
roc <- performance(pred, measure = "tpr", x.measure = "fpr")

# Plot the ROC curve
plot(roc, main = "ROC Curve for SVM Model", col = "blue", lwd = 2)
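The ROC curve is often summarized by the area under it (AUC), where 0.5 corresponds to random ranking and 1 to perfect ranking. As a sketch, AUC can be computed from ranks with the Mann-Whitney statistic. The helper `auc_rank()` and the toy values below are hypothetical and not part of ROCR.

```r
# Hypothetical helper: AUC from the rank-sum (Mann-Whitney) statistic
auc_rank <- function(scores, labels) {
  r <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy values for illustration only
labels <- c(0, 0, 1, 1)
scores <- c(0.1, 0.4, 0.35, 0.8)
print(auc_rank(scores, labels))  # 0.75
```

With the objects above, the call would be `auc_rank(svm_pred, binary_actual)`.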

Analyzing Global Economic Indicator (GDP) in 2021

# Filter the data for the year 2021
data_2021 <- Economic_Indicators %>%
  filter(Year == 2021) %>%
  group_by(Country) # Group by country if you want to perform further operations

# View the filtered data for the year 2021
head(data_2021)

## # A tibble: 6 × 26
## # Groups:   Country [6]
##   CountryID Country     Year AMA.exchange.rate IMF.based.exchange.r…¹ Population
##       <int> <chr>      <int>             <dbl>                  <dbl>      <int>
## 1         4 " Afghani…  2021            82.5                   82.5     40099462
## 2         8 " Albania…  2021           104.                   104.       2854710
## 3        12 " Algeria…  2021           135.                   135.      44177969
## 4        20 " Andorra…  2021             0.845                  0.845      79034
## 5        24 " Angola "  2021           631.                   631.      34503774
## 6        28 " Antigua…  2021             2.7                    2.7        93219
## # ℹ abbreviated name: ¹IMF.based.exchange.rate
## # ℹ 20 more variables: Currency <chr>, Per.capita.GNI <int>,
## #   X.Agriculture..hunting..forestry..fishing..ISIC.A.B.. <dbl>,
## #   Changes.in.inventories <dbl>, Construction..ISIC.F. <dbl>,
## #   Exports.of.goods.and.services <dbl>, Final.consumption.expenditure <dbl>,
## #   General.government.final.consumption.expenditure <dbl>,
## #   Gross.capital.formation <dbl>, …

data_2021_sorted <- data_2021 %>%
  arrange(desc(Gross.Domestic.Product..GDP.))
# Filter the columns using column indices
Sorted_Data <- data_2021_sorted[, c(2, 3, 26)]
head(Sorted_Data)

## # A tibble: 6 × 3
## # Groups:   Country [6]
##   Country             Year Gross.Domestic.Product..GDP.
##   <chr>              <int>                        <dbl>
## 1 " United States "   2021                      2.33e13
## 2 " China "           2021                      1.77e13
## 3 " Japan "           2021                      4.94e12
## 4 " Germany "         2021                      4.26e12
## 5 " India "           2021                      3.20e12
## 6 " United Kingdom "  2021                      3.13e12
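The country names in this output are padded with spaces (for example " United States "), so an exact-match join against another table would find no matches until the names are trimmed. The sketch below assumes two toy tables with hypothetical country names and values, and uses base R's `trimws()` and `merge()`.

```r
# Toy tables with hypothetical values; note the padded names in `economy`
economy <- data.frame(Country = c(" Alpha ", " Beta ", " Gamma "),
                      GDP = c(3.0, 5.0, 3.2))
energy <- data.frame(Country = c("Alpha", "Gamma", "Delta"),
                     Production = c(100, 80, 90))

# Without trimming, no country names match
untrimmed_matches <- nrow(merge(economy, energy, by = "Country"))
print(untrimmed_matches)  # 0

# Strip leading and trailing whitespace, then join on the cleaned names
economy$Country <- trimws(economy$Country)
joined <- merge(economy, energy, by = "Country")
print(joined)
```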

World Map Showing Global Renewable Energy Production

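# Load the world map data (map_data() needs the maps package to be installed)
world_map <- map_data("world")

# Total production per country, so each map region joins to a single value
production_by_country <- Renewable_Energy %>%
  group_by(Country) %>%
  summarise(Production..GWh. = sum(Production..GWh.), .groups = "drop")

# Merge the world map data with renewable energy production data
world_production <- left_join(world_map, production_by_country, by = c("region" = "Country"))

# Plot the world map
ggplot(world_production, aes(x = long, y = lat, group = group, fill = Production..GWh.)) +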
  geom_polygon(color = "black") +
  scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "gray90", name = "Production (GWh)") +
  theme_minimal() +
  labs(title = "Renewable Energy Production by Country (GWh)", 
       x = "", y = "") +
  theme(legend.position = "right")

Conclusions

The prediction of renewable energy production and economic growth depends significantly on economic indicators such as Gross Domestic Product (GDP), investment in renewable energy infrastructure, employment in the renewable energy sector, and government policies. The largest producers of renewable energy, such as China, the U.S., Canada, Australia, Brazil, Russia, India, Japan, France, and Germany, have achieved positive effects, including job creation, reduced energy costs, and long-term sustainability, through a combination of natural resources, strong governmental policies, economic development, and technological advancements. The link between renewable energy and economic growth is symbiotic: economic growth supports renewable energy investment, and renewable energy in turn enhances overall economic stability.