Two data sets are used. The Renewable Energy data is downloaded from https://www.kaggle.com/datasets/anishvijay/global-renewable-energy-and-indicators-dataset, and the Economic Indicators data is accessed from https://www.kaggle.com/datasets/prasad22/global-economy-indicators.
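# Load the packages used throughout the analysis
library(tidyverse)    # ggplot2, dplyr, tidyr
library(corrplot)     # correlation plot
library(caret)        # data partitioning and cross-validation
library(randomForest) # random forest regression
library(e1071)        # support vector machines
# Load datasets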
Economic_Indicators <- read.csv("~/Data622/Global_Economy.csv")
Renewable_Energy <- read.csv("~/Data622/Renewable_energy.csv")
# Check structure of the dataset
#str(Renewable_Energy)
# Summary statistics of the dataset
#summary(Renewable_Energy)
Exploratory Data Analysis (EDA) uses visualization and summary techniques to explore a data set. This assignment explores the relationship between global economic indicators and renewable energy production. It trains machine learning models on historical data to predict renewable energy production from key global economic variables such as GDP, inflation rates, and carbon emissions.
# Bar plot for country-wise comparison
ggplot(Renewable_Energy, aes(x = reorder(Country, -Production..GWh.),
y = Production..GWh.)) +
geom_bar(stat = "identity") +
labs(title = "Global Top Renewable Energy Producers",
x = "Country", y = "Total Renewable Energy Production(GWh)") +
theme_minimal() +
coord_flip()
The bar chart ranks countries by renewable energy production. Among the largest producers, France is the highest producer of renewable energy in this data set.
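To check a ranking like this numerically, production can be summed per country before plotting. The sketch below uses a small made-up data frame (the country labels and values are illustrative, not taken from the Kaggle file) with the same column names as Renewable_Energy.

```r
# Toy data with the same column names as Renewable_Energy (values are made up)
toy_energy <- data.frame(
  Country = c("A", "B", "A", "C", "B"),
  Production..GWh. = c(100, 250, 50, 120, 80)
)

# Sum production per country, then sort from largest to smallest
country_totals <- aggregate(Production..GWh. ~ Country, data = toy_energy, FUN = sum)
country_totals <- country_totals[order(-country_totals$Production..GWh.), ]
print(country_totals)
```

Applying the same two steps to Renewable_Energy would give the totals that the stacked bars represent.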
# Check for missing values
missing_values <- colSums(is.na(Renewable_Energy))
# Remove rows with NA values
Renewable_Energy <- Renewable_Energy %>% drop_na()
# Bar plot of energy production by energy type
ggplot(Renewable_Energy, aes(x = reorder(Energy.Type, -Production..GWh.),
y = Production..GWh., fill = Energy.Type)) +
geom_bar(stat = "identity") +
labs(title = "Energy Production by Energy Type",
x = "Energy Type",
y = "Energy Production (GWh)") +
theme_minimal() +
scale_fill_brewer(palette = "Set3") +
coord_flip()
# Filter the columns using column indices
filtered_data <- Renewable_Energy[, c(9, 10, 11, 13, 17, 18, 19, 20, 21, 22, 23)]
# View the filtered data
head(filtered_data)
## Energy.Consumption Energy.Exports Energy.Imports Renewable.Energy.Jobs
## 1 369654.6 93087.198 13413.18 756878
## 2 771781.6 1752.536 78493.98 945074
## 3 342707.2 65146.592 41114.87 588423
## 4 498839.6 50257.591 42010.48 11049
## 5 819064.4 63101.396 17912.96 225191
## 6 363527.7 81737.600 79121.27 816485
## Average.Annual.Temperature Annual.Rainfall Solar.Irradiance Wind.Speed
## 1 21.95603 157.0135 203.0386 5.838375
## 2 24.63326 1336.0032 111.1331 14.872917
## 3 13.50193 357.5171 159.9964 3.473210
## 4 1.09090 2271.3120 297.1636 13.575549
## 5 30.64102 2285.0160 133.2694 14.964739
## 6 37.73961 2457.0280 220.4361 6.664297
## Hydro.Potential Geothermal.Potential Biomass.Availability
## 1 95.59797 2.516637 30.26800
## 2 71.73461 88.703683 36.03695
## 3 84.02229 6.611658 54.59595
## 4 53.17173 39.856384 45.27613
## 5 46.42386 98.648207 93.72980
## 6 17.62276 41.178033 78.75851
# Calculate correlation matrix for numeric variables
correlation_matrix <- cor(filtered_data %>% select_if(is.numeric), use = "complete.obs")
# Plot the correlation matrix with numbers shown
corrplot(correlation_matrix, method = "circle", type = "lower", diag = FALSE,
number.cex = 0.75,
addCoef.col = "blue",
tl.cex = 0.75,
tl.col = "black",
tl.srt = 90)
# Bar plot to compare energy production by country and energy type
ggplot(Renewable_Energy, aes(x = reorder(Country, -Production..GWh.), y = Production..GWh., fill = Energy.Type)) +
geom_bar(stat = "identity") +
facet_wrap(~Energy.Type) +
labs(title = "Energy Production by Country and Energy Type", x = "Country", y = "Energy Production (GWh)") +
coord_flip() +
theme_minimal()
The goal is to predict renewable energy production. The label, Production..GWh., is a continuous dependent variable, and the candidate features include GDP, Population, Annual.Rainfall, Wind.Speed, Solar.Irradiance, Hydro.Potential, Geothermal.Potential, Biomass.Availability, and Average.Annual.Temperature. This makes the task a regression problem, for which Random Forest is well suited. Random Forest can also rank feature importance by how much each feature reduces the prediction error.
The Random Forest algorithm reduces the risk of overfitting by averaging many decision trees. It can build regression models for complex numerical data sets, is relatively robust to outliers, and handles non-linear relationships and large data sets. On the other hand, a larger number of trees makes the algorithm slower.
Support Vector Machine (SVM) is suitable for high-dimensional data sets. It works well with different types of data, and because its training problem is convex it yields a unique solution. However, training time grows quickly on large data sets with many features. SVM models are also difficult to interpret, which is a drawback when interpretability matters for business decisions.
The data set contains a combination of numeric and categorical variables, with potentially complex non-linear relationships between features. Random Forest regression is therefore a good choice for predicting the numeric target variable Production..GWh. from the features (predictor variables) in the data set. It builds multiple decision trees, and each tree considers a random subset of features at each split. This reduces the risk of overfitting and makes the model robust for data sets with many variables.
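As a small illustration of the random feature subset, the sketch below computes the number of candidate features per split that the randomForest package uses by default for regression, max(floor(p / 3), 1), for the eleven predictors in this report's model formula. The `candidates` draw is only an illustration of one split, not the package's internal code.

```r
# The eleven predictors used in the model formula of this report
predictors <- c("Energy.Consumption", "Energy.Exports", "Energy.Imports",
                "Renewable.Energy.Jobs", "Annual.Rainfall", "Wind.Speed",
                "Solar.Irradiance", "Hydro.Potential", "Geothermal.Potential",
                "Biomass.Availability", "Average.Annual.Temperature")

# Default number of features tried at each split for regression
mtry_default <- max(floor(length(predictors) / 3), 1)
print(mtry_default)

# One random candidate set, as might be drawn at a single split (illustrative)
set.seed(123)
candidates <- sample(predictors, mtry_default)
print(candidates)
```

With eleven predictors this gives 3, which agrees with the "No. of variables tried at each split" line in the model summary.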
RMSE (Root Mean Squared Error), R-squared, and cross-validation results are used to evaluate the machine learning models. Models that balance accuracy and generalizability on these measures are more trustworthy for making business decisions.
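For reference, the two error measures can be written as follows. The notation is introduced here only for illustration: $y_i$ are the observed production values, $\hat{y}_i$ the model predictions, $\bar{y}$ the mean of the observed values, and $n$ the number of test samples.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
R^2 = 1 - \frac{\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}
```

These match the SSE and SST quantities computed in the evaluation code of this report.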
A data set with numerous features and a complex structure can cause a model to overfit, which reduces its predictive power on new data. Smaller data sets can lead to underfitting, because a model may not capture the underlying patterns in the data and may produce inaccurate predictions.
The Renewable Energy data set includes the production, growth, and capacity of energy sources such as solar, wind, hydro, geothermal, and biomass. The Economic Indicators data set comprises Gross Domestic Product (GDP), employment rate, investment, inflation rates, and government spending. As noted above, France, a country with strong economic growth, is the largest producer of renewable energy in the data. An increase in GDP may correlate with higher renewable energy investment, which in turn could drive down carbon emissions and support economic sustainability. Economies that integrate renewable energy are better positioned to transition to a green economy and to improve both environmental and economic resilience in the long run. With two machine learning models and statistical forecasting methods, the challenges include ensuring that the models are trained on an accurate, representative label and features, and capturing the non-linear interactions between variables.
The world has seen a great deal of progress in the production and consumption of renewable energy, which in turn is powering economic growth and social development around the globe. Renewable energy refers to energy from sources that are naturally replenished, such as solar energy, wind power, hydro power, biomass energy, waves, tides, and geothermal energy. Modern economic growth depends mainly on energy consumption. Energy consumption plays a critical role in generating GDP growth, particularly through industrial production processes that require a great deal of energy, and the progress of a nation’s economy relies heavily on these processes. Energy is therefore considered the backbone of modern industrial development.
This assignment focuses on predicting renewable energy production in support of sustainable economic growth, considering the growth of trade openness and of manufactured products and services. Using renewable energy both to achieve economic development and to reduce CO2 emissions has become an increasing trend, so investment in renewable energy sources is a sensible response to the rising energy consumption that fuels economic growth. With the rapid development of global industrialization, it has been recognized that excessive consumption of fossil fuels has an adverse impact on the environment, increasing health risks and the threat of global climate change. Because renewable energy is sustainable and causes little environmental pollution, it has attracted growing attention. The data set used here was downloaded from www.kaggle.com/datasets, and in it France is the highest producer of renewable energy. The development of renewable energy systems can raise household incomes and create employment opportunities, which contribute to the growth of a nation’s GDP.
Energy forecasting plays a vital role in the development, management, and policy making of energy systems, and various machine learning models have been employed in renewable energy prediction. The economic development of a nation is often accompanied by environmental degradation, because increased industrial and urban activity raises CO2 emissions. When an economy transitions from agriculture to industry, industrial production increases and, as a result, CO2 emissions also increase. As industrialization raises average household income, heavy industrial production tends to be phased out and replaced with high-tech production, which tends to use less energy and can be powered by renewable energy sources. This also leads to the development of the service sector, whose carbon footprint is much smaller than that of the industrial sector. These developments are accompanied by a decrease in the population growth rate, and consumers tend to become more sensitive to the use of cleaner energy sources by both the industrial and service sectors, pushing governments to formulate environmentally friendly regulations. The prediction of renewable energy production is therefore an important tool for growing a nation’s economy, and this study analyzes two machine learning models, Random Forest and Support Vector Regression (SVR), for predicting renewable energy production.
The available data is split into training and testing sets to build and evaluate our machine learning models.
set.seed(123) # Set seed for reproducibility
trainIndex <- createDataPartition(Renewable_Energy$Production..GWh., p = 0.8, list = FALSE)
trainData <- Renewable_Energy[trainIndex,]
testData <- Renewable_Energy[-trainIndex,]
# Check for missing values
sum(is.na(Renewable_Energy))
## [1] 0
# Drop any rows with missing values (none remain, so the data is unchanged)
Renewable_Energy <- na.omit(Renewable_Energy)
# Build the Random Forest model-regression problem
model_rf <- randomForest(formula = Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
data = trainData,
importance = TRUE,
ntree = 500)
# View the model summary
print(model_rf)
##
## Call:
## randomForest(formula = Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature, data = trainData, importance = TRUE, ntree = 500)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 811319182
## % Var explained: -1.49
# Evaluate feature importance
importance(model_rf)
## %IncMSE IncNodePurity
## Energy.Consumption -0.5423079 136050459272
## Energy.Exports -1.6032436 128923149582
## Energy.Imports -1.2974667 127198390764
## Renewable.Energy.Jobs 5.5468875 140265467557
## Annual.Rainfall -0.3191355 136132647761
## Wind.Speed 0.2910684 134705116329
## Solar.Irradiance 1.2051620 135875412728
## Hydro.Potential 3.3238776 138807081140
## Geothermal.Potential 0.6769814 133743684973
## Biomass.Availability 0.4512120 137090217659
## Average.Annual.Temperature 4.9725821 152125318439
varImpPlot(model_rf)
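The %IncMSE column is a permutation-based measure: a feature's values are shuffled and the resulting change in prediction error is recorded. The sketch below illustrates the idea on simulated data. It swaps in a linear model (`lm`) and in-sample error instead of a random forest with out-of-bag error, and the variable names `signal` and `noise` are made up, so it is a simplified illustration, not a reproduction of what randomForest computes.

```r
set.seed(1)
n <- 200
toy <- data.frame(signal = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$signal + rnorm(n)   # only `signal` drives the response

fit <- lm(y ~ signal + noise, data = toy)

# Mean squared error of the fitted model on a data frame
mse <- function(d) mean((predict(fit, newdata = d) - d$y)^2)
base_mse <- mse(toy)

# Increase in MSE after shuffling one feature at a time
perm_increase <- sapply(c("signal", "noise"), function(v) {
  shuffled <- toy
  shuffled[[v]] <- sample(shuffled[[v]])
  mse(shuffled) - base_mse
})
print(perm_increase)
```

Shuffling an informative feature raises the error sharply, while shuffling an uninformative one changes it very little. Values near zero or below zero, like several in the importance table above, indicate that shuffling the feature did not make predictions worse.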
# Make Predictions on the test data
rf_pred <- predict(model_rf, newdata = testData)
# Model Evaluation (RMSE and R-squared)
# Calculate RMSE
rmse_rf <- sqrt(mean((rf_pred - testData$Production..GWh.)^2))
print(paste("RMSE:", round(rmse_rf, 2)))
## [1] "RMSE: 28851.92"
# Calculate R-squared
SST <- sum((testData$Production..GWh. - mean(testData$Production..GWh.))^2) # Total sum of squares
SSE <- sum((rf_pred - testData$Production..GWh.)^2) # Sum of squared errors
r_squared_rf <- 1 - (SSE / SST)
print(paste("R-squared:", round(r_squared_rf, 4)))
## [1] "R-squared: -0.0191"
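A negative R-squared can be surprising, so the sketch below reproduces the same calculation on made-up numbers. The helper names `rmse` and `r_squared` are hypothetical wrappers around the formulas used above.

```r
# Helper functions mirroring the RMSE and R-squared calculations in this report
rmse <- function(actual, predicted) sqrt(mean((predicted - actual)^2))
r_squared <- function(actual, predicted) {
  1 - sum((predicted - actual)^2) / sum((actual - mean(actual))^2)
}

actual <- c(10, 20, 30, 40)                       # made-up observed values
mean_pred <- rep(mean(actual), length(actual))    # baseline: always predict the mean
bad_pred <- rev(actual)                           # deliberately poor predictions

print(r_squared(actual, mean_pred))  # 0
print(r_squared(actual, bad_pred))   # -3
print(rmse(actual, mean_pred))
```

Under this definition, a model that always predicts the mean scores 0, and a model with larger squared error than that baseline scores below 0. A slightly negative test R-squared therefore means the predictions are no better than the mean of the test values.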
# Cross-Validation
train_control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the model with cross-validation
rf_cv_model <- train(Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
data = trainData,
method = "rf",
trControl = train_control,
importance = TRUE)
# View RMSE and R-squared from cross-validation
rf_cv_rmse <- rf_cv_model$results$RMSE[which.min(rf_cv_model$results$RMSE)]
rf_cv_r_squared <- rf_cv_model$results$Rsquared[which.min(rf_cv_model$results$RMSE)]
print(paste("Cross-Validated RMSE:", round(rf_cv_rmse, 2)))
## [1] "Cross-Validated RMSE: 28431.47"
print(paste("Cross-Validated R-squared:", round(rf_cv_r_squared, 4)))
## [1] "Cross-Validated R-squared: 0.0037"
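For readers unfamiliar with k-fold cross-validation, the sketch below shows the mechanics in base R on simulated data. It uses a linear model and simple random fold assignment as stand-ins, so it is only an approximation of what caret does internally. The names `toy`, `fold_id`, and `fold_rmse` are illustrative.

```r
set.seed(123)
n <- 50
k <- 10
toy <- data.frame(x = rnorm(n))
toy$y <- 2 * toy$x + rnorm(n)

# Assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, evaluate on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = toy[fold_id != i, ])
  pred <- predict(fit, newdata = toy[fold_id == i, ])
  sqrt(mean((pred - toy$y[fold_id == i])^2))
})

print(mean(fold_rmse))  # cross-validated RMSE
```

One point worth keeping in mind when comparing numbers: to my understanding, caret's default Rsquared is the squared correlation between observed and predicted values, which cannot be negative, whereas the manual 1 - SSE/SST calculation used on the test set can be.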
# Create a data frame with actual and predicted values
prediction_results <- data.frame(index = 1:length(testData$Production..GWh.),
actual = testData$Production..GWh.,
predicted = rf_pred)
# Plot the actual vs predicted values using geom_line
ggplot(prediction_results, aes(x = index)) +
geom_line(aes(y = actual, color = "Actual")) +
geom_line(aes(y = predicted, color = "Predicted")) +
labs(title = "Actual vs Predicted ",
x = "Sample Number",
y = "Production (GWh)") +
scale_color_manual("",
breaks = c("Actual", "Predicted"),
values = c("Actual" = "blue", "Predicted" = "red")) +
theme_minimal()
# Build the Support Vector Machine (SVM) model for regression
svm_model <- svm(formula = Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
data = trainData,
type = "eps-regression",
kernel = "radial",
cost = 10,
epsilon = 0.1)
# Make Predictions on the test data
svm_pred <- predict(svm_model, newdata = testData)
# Model Evaluation (RMSE and R-squared)
# Calculate RMSE
rmse_svm <- sqrt(mean((svm_pred - testData$Production..GWh.)^2))
print(paste("RMSE:", round(rmse_svm, 2)))
## [1] "RMSE: 34979.75"
# Calculate R-squared
SST <- sum((testData$Production..GWh. - mean(testData$Production..GWh.))^2) # Total sum of squares
SSE <- sum((svm_pred - testData$Production..GWh.)^2) # Sum of squared errors
r_squared_svm <- 1 - (SSE / SST)
print(paste("R-squared:", round(r_squared_svm, 4)))
## [1] "R-squared: -0.498"
# Cross-Validation
train_control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the SVM model with cross-validation
svm_cv_model <- train(Production..GWh. ~ Energy.Consumption + Energy.Exports + Energy.Imports + Renewable.Energy.Jobs + Annual.Rainfall + Wind.Speed + Solar.Irradiance + Hydro.Potential + Geothermal.Potential + Biomass.Availability + Average.Annual.Temperature,
data = trainData,
method = "svmRadial", # Use radial kernel
trControl = train_control,
preProcess = c("center", "scale")) # Centering and scaling the data for SVM
# View RMSE and R-squared from cross-validation
svm_cv_rmse <- svm_cv_model$results$RMSE[which.min(svm_cv_model$results$RMSE)]
svm_cv_r_squared <- svm_cv_model$results$Rsquared[which.min(svm_cv_model$results$RMSE)]
print(paste("Cross-Validated RMSE:", round(svm_cv_rmse, 2)))
## [1] "Cross-Validated RMSE: 28570.9"
print(paste("Cross-Validated R-squared:", round(svm_cv_r_squared, 4)))
## [1] "Cross-Validated R-squared: 0.0082"
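The `preProcess = c("center", "scale")` option used above standardizes each predictor. A minimal base-R sketch of that transformation, on a made-up data frame that borrows two column names from the data set, looks like this:

```r
# Illustrative values only; column names borrowed from the data set
toy_features <- data.frame(
  Wind.Speed = c(5.8, 14.9, 3.5, 13.6),
  Annual.Rainfall = c(157, 1336, 358, 2271)
)

# Subtract each column's mean and divide by its standard deviation
scaled <- scale(toy_features, center = TRUE, scale = TRUE)

print(round(colMeans(scaled), 10))  # each column now has mean 0
print(apply(scaled, 2, sd))         # and standard deviation 1
```

Standardizing matters for a radial-kernel SVM because the kernel depends on distances between observations, so a feature measured in thousands would otherwise dominate one measured in single digits.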
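Because the ROC curve is defined for binary outcomes, the continuous production values are first converted to a binary label at the median. The receiver operating characteristic (ROC) curve then plots the True Positive Rate against the False Positive Rate.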
library(ROCR)
# Convert continuous actual production values into a binary
threshold <- median(testData$Production..GWh.)
binary_actual <- ifelse(testData$Production..GWh. > threshold, 1, 0)
# Generate the prediction object
pred <- prediction(svm_pred, binary_actual)
# Calculate TPR and FPR for ROC curve
roc <- performance(pred, measure = "tpr", x.measure = "fpr")
# Plot the ROC curve
plot(roc, main = "ROC Curve for SVM Model", col = "blue", lwd = 2)
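A ROC curve is often summarized by the area under it (AUC). The sketch below computes AUC in base R from the rank-sum (Mann-Whitney) identity on made-up scores and labels. The function name `auc_from_ranks` is hypothetical, and the data are not from this report.

```r
# AUC via the rank-sum identity: the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative case
auc_from_ranks <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  (sum(rank(scores)[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

toy_scores <- c(0.1, 0.4, 0.35, 0.8)   # made-up model outputs
toy_labels <- c(0, 0, 1, 1)            # made-up binary outcomes

print(auc_from_ranks(toy_scores, toy_labels))  # 0.75
```

An AUC close to 0.5 corresponds to a curve near the diagonal, meaning the scores separate the two classes no better than chance.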
# Filter the data for the year 2021
data_2021 <- Economic_Indicators %>%
filter(Year == 2021) %>%
group_by(Country) # Group by country if you want to perform further operations
# View the filtered data for the year 2021
head(data_2021)
## # A tibble: 6 × 26
## # Groups: Country [6]
## CountryID Country Year AMA.exchange.rate IMF.based.exchange.r…¹ Population
## <int> <chr> <int> <dbl> <dbl> <int>
## 1 4 " Afghani… 2021 82.5 82.5 40099462
## 2 8 " Albania… 2021 104. 104. 2854710
## 3 12 " Algeria… 2021 135. 135. 44177969
## 4 20 " Andorra… 2021 0.845 0.845 79034
## 5 24 " Angola " 2021 631. 631. 34503774
## 6 28 " Antigua… 2021 2.7 2.7 93219
## # ℹ abbreviated name: ¹IMF.based.exchange.rate
## # ℹ 20 more variables: Currency <chr>, Per.capita.GNI <int>,
## # X.Agriculture..hunting..forestry..fishing..ISIC.A.B.. <dbl>,
## # Changes.in.inventories <dbl>, Construction..ISIC.F. <dbl>,
## # Exports.of.goods.and.services <dbl>, Final.consumption.expenditure <dbl>,
## # General.government.final.consumption.expenditure <dbl>,
## # Gross.capital.formation <dbl>, …
data_2021_sorted <- data_2021 %>%
arrange(desc(Gross.Domestic.Product..GDP.))
# Filter the columns using column indices
Sorted_Data <- data_2021_sorted[, c(2, 3, 26)]
head(Sorted_Data)
## # A tibble: 6 × 3
## # Groups: Country [6]
## Country Year Gross.Domestic.Product..GDP.
## <chr> <int> <dbl>
## 1 " United States " 2021 2.33e13
## 2 " China " 2021 1.77e13
## 3 " Japan " 2021 4.94e12
## 4 " Germany " 2021 4.26e12
## 5 " India " 2021 3.20e12
## 6 " United Kingdom " 2021 3.13e12
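The Country values printed above carry leading and trailing spaces, which would prevent exact matches if this table were joined to the renewable energy data by country name. A minimal sketch of the clean-up, using three of the names shown above:

```r
raw_names <- c(" United States ", " China ", " Japan ")

# Remove leading and trailing whitespace
clean_names <- trimws(raw_names)
print(clean_names)

print(raw_names[1] == "United States")    # FALSE: padded name does not match
print(clean_names[1] == "United States")  # TRUE after trimming
```

Applied to the full table, this would be a single call along the lines of `trimws(Economic_Indicators$Country)`.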
# Load the world map data
world_map <- map_data("world")
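# Total production per country, so each map region joins to a single value
production_by_country <- Renewable_Energy %>%
group_by(Country) %>%
summarise(Production..GWh. = sum(Production..GWh.))
# Merge the world map data with renewable energy production data
map_data <- left_join(world_map, production_by_country, by = c("region" = "Country"))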
# Plot the world map
ggplot(map_data, aes(x = long, y = lat, group = group, fill = Production..GWh.)) +
geom_polygon(color = "black") +
scale_fill_gradient(low = "lightblue", high = "darkblue", na.value = "gray90", name = "Production (GWh)") +
theme_minimal() +
labs(title = "Renewable Energy Production by Country (GWh)",
x = "", y = "") +
theme(legend.position = "right")
The prediction of renewable energy production and economic growth depends significantly on economic indicators such as Gross Domestic Product (GDP), investment in renewable energy infrastructure, employment in the renewable energy sector, and government policies. The largest producers of renewable energy, such as China, the U.S., Canada, Australia, Brazil, Russia, India, Japan, France, and Germany, have achieved positive effects including job creation, reduced energy costs, and long-term sustainability through a combination of natural resources, strong governmental policies, economic development, and technological advancements. The link between renewable energy and economic growth is symbiotic: economic growth supports renewable energy investment, which in turn enhances overall economic stability.