Yaswitha_Kalapala_Stat_631_Final_Project

GLOBAL RENEWABLE ENERGY ANALYSIS

The primary goal of this project is to analyze and predict global renewable energy production using statistical techniques. The dataset, consisting of 240 observations and 7 variables, records renewable energy production by year and country, detailing total production and contributions from solar, wind, hydro, and other sources. Initial data exploration reveals trends over the years, identifies leading countries in production, and shows correlations between different energy types. The next steps involve data cleaning, exploratory data analysis, feature engineering, model building, and evaluation using metrics like RMSE, MAE, and R-squared, ultimately providing insights and recommendations based on the findings.

Load the Required Libraries

suppressPackageStartupMessages({
  library(ggplot2)
  library(dplyr)
  library(gridExtra)
  library(corrplot)
  library(lubridate)
  library(caTools)
  library(RColorBrewer)
  library(caret)
  library(readr)
  library(scales)
  library(viridis)
  library(GGally)
  library(hrbrthemes)
  library(tidyr)
})

Load the Dataset

energy_data <- read_csv('global_renewable_energy_production.csv')

## Rows: 240 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (6): Year, SolarEnergy, WindEnergy, HydroEnergy, OtherRenewableEnergy, T...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
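
The column-specification message above can be avoided by declaring the column types when reading the file. The following is a minimal sketch, assuming the seven columns reported in the output above; the inline two-row CSV and the names csv_text, energy_spec, and demo_data are made-up stand-ins so the snippet runs without the project file.

```r
library(readr)

# Hypothetical two-row stand-in for global_renewable_energy_production.csv
csv_text <- paste(
  "Year,Country,SolarEnergy,WindEnergy,HydroEnergy,OtherRenewableEnergy,TotalRenewableEnergy",
  "2000,A,100,200,300,50,650",
  "2001,A,110,210,310,60,690",
  sep = "\n"
)

# Explicit column types: readr does not have to guess, so no message is printed
energy_spec <- cols(
  Year = col_double(),
  Country = col_character(),
  SolarEnergy = col_double(),
  WindEnergy = col_double(),
  HydroEnergy = col_double(),
  OtherRenewableEnergy = col_double(),
  TotalRenewableEnergy = col_double()
)

# I() marks the string as literal data rather than a file path
demo_data <- read_csv(I(csv_text), col_types = energy_spec)
```

With the real file, the same col_types argument would be passed alongside the file name.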

Print the first six rows of the data

head(energy_data)

## # A tibble: 6 × 7
##    Year Country SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
##   <dbl> <chr>         <dbl>      <dbl>       <dbl>                <dbl>
## 1  2000 USA            437.      1436.       1544.                 319.
## 2  2001 USA            240.       403.        399.                 440.
## 3  2002 USA            641.      1120.        335.                 486.
## 4  2003 USA            849.       476.        609.                 133.
## 5  2004 USA            374.       882.       1034.                 181.
## 6  2005 USA            651.       381.        797.                 215.
## # ℹ 1 more variable: TotalRenewableEnergy <dbl>

Displaying Column names in the Dataset

colnames(energy_data)

## [1] "Year"                 "Country"              "SolarEnergy"         
## [4] "WindEnergy"           "HydroEnergy"          "OtherRenewableEnergy"
## [7] "TotalRenewableEnergy"

Display Data types

str(energy_data)

## spc_tbl_ [240 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Year                : num [1:240] 2000 2001 2002 2003 2004 ...
##  $ Country             : chr [1:240] "USA" "USA" "USA" "USA" ...
##  $ SolarEnergy         : num [1:240] 437 240 641 849 374 ...
##  $ WindEnergy          : num [1:240] 1436 403 1120 476 882 ...
##  $ HydroEnergy         : num [1:240] 1544 399 335 609 1034 ...
##  $ OtherRenewableEnergy: num [1:240] 319 440 486 133 181 ...
##  $ TotalRenewableEnergy: num [1:240] 3737 1482 2583 2067 2471 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Year = col_double(),
##   ..   Country = col_character(),
##   ..   SolarEnergy = col_double(),
##   ..   WindEnergy = col_double(),
##   ..   HydroEnergy = col_double(),
##   ..   OtherRenewableEnergy = col_double(),
##   ..   TotalRenewableEnergy = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

Displaying Data dimensions and summary for the dataset

# Display data dimensions
dim(energy_data)

## [1] 240   7

# Display Summary statistics
summary(energy_data)

##       Year        Country           SolarEnergy      WindEnergy    
##  Min.   :2000   Length:240         Min.   :104.6   Min.   : 206.0  
##  1st Qu.:2006   Class :character   1st Qu.:284.7   1st Qu.: 523.6  
##  Median :2012   Mode  :character   Median :533.4   Median : 882.0  
##  Mean   :2012                      Mean   :528.5   Mean   : 857.1  
##  3rd Qu.:2017                      3rd Qu.:766.7   3rd Qu.:1160.2  
##  Max.   :2023                      Max.   :997.0   Max.   :1487.1  
##   HydroEnergy     OtherRenewableEnergy TotalRenewableEnergy
##  Min.   : 320.7   Min.   : 54.88       Min.   : 910.4      
##  1st Qu.: 593.8   1st Qu.:176.32       1st Qu.:2250.8      
##  Median :1046.4   Median :291.40       Median :2815.5      
##  Mean   :1076.6   Mean   :287.13       Mean   :2749.4      
##  3rd Qu.:1495.2   3rd Qu.:405.48       3rd Qu.:3217.2      
##  Max.   :1983.9   Max.   :499.87       Max.   :4628.2

Checking for missing values in the dataset and Data Cleaning

# Checking missing values
colSums(is.na(energy_data))

##                 Year              Country          SolarEnergy 
##                    0                    0                    0 
##           WindEnergy          HydroEnergy OtherRenewableEnergy 
##                    0                    0                    0 
## TotalRenewableEnergy 
##                    0

# Remove rows with NA values
energy_data <- na.omit(energy_data)

Distribution Analysis of Numerical Variables

This code draws histograms of Year and of each type of renewable energy production using the ggplot2 and gridExtra packages. It uses the energy_data dataset, which includes the variables Year, SolarEnergy, WindEnergy, HydroEnergy, and OtherRenewableEnergy. Each variable gets its own histogram, built with geom_histogram() and given a distinct color and title. Finally, grid.arrange() from the gridExtra package places the five histograms in a 3x2 grid so they can be viewed together.

## Histogram plot for Distribution of Year
year_hist <- ggplot(energy_data, aes(x = Year)) + 
  geom_histogram(binwidth = 2, fill = "#AFEEEE", color = "white") +
  ggtitle("Distribution of Year") +
  theme_minimal()

## Histogram plot for Distribution of Solar Energy
## (bins = 20 is used for the energy variables: binwidth = 2 is too narrow for values in the hundreds)
solar_hist <- ggplot(energy_data, aes(x = SolarEnergy)) + 
  geom_histogram(bins = 20, fill = "#FFFFCC", color = "red") +
  ggtitle("Distribution of Solar Energy") +
  theme_minimal()

## Histogram plot for Distribution of Wind Energy
wind_hist <- ggplot(energy_data, aes(x = WindEnergy)) + 
  geom_histogram(bins = 20, fill = "#99CCFF", color = "blue") +
  ggtitle("Distribution of Wind Energy") +
  theme_minimal()

## Histogram plot for Distribution of Hydro Energy
hydro_hist <- ggplot(energy_data, aes(x = HydroEnergy)) + 
  geom_histogram(bins = 20, fill = "#99FF99", color = "green") +
  ggtitle("Distribution of Hydro Energy") +
  theme_minimal()

## Histogram plot for Distribution of Other Renewable Energy
other_renewable_hist <- ggplot(energy_data, aes(x = OtherRenewableEnergy)) + 
  geom_histogram(bins = 20, fill = "#FFCC99", color = "orange") +
  ggtitle("Distribution of Other Renewable Energy") +
  theme_minimal()

grid.arrange(year_hist, solar_hist, wind_hist, hydro_hist, other_renewable_hist, nrow = 3, ncol = 2)

Defining the list of countries in the Dataset.

countries <- c('USA', 'China', 'India', 'Germany', 'UK', 'France', 'Brazil', 'Canada', 'Australia', 'Japan')

Stacked Area Plot of Renewable Energy Sources Over Time

This code reshapes the renewable energy production dataset into long format, which is the layout ggplot2 expects, and sums production across countries for each year and energy type. It then draws a stacked area chart of the production trends for each renewable energy type (Solar, Wind, Hydro, Other) over time. Because geom_area() stacks the areas, the chart shows the relative contribution of each energy source in every year.

# Converting the data into long format for ggplot
long_data <- energy_data %>%
  pivot_longer(cols = c(SolarEnergy, WindEnergy, HydroEnergy, OtherRenewableEnergy),
               names_to = "EnergyType", values_to = "EnergyProduction")

# Summing across countries so each Year/EnergyType pair has a single value to stack
yearly_long_data <- long_data %>%
  group_by(Year, EnergyType) %>%
  summarise(EnergyProduction = sum(EnergyProduction), .groups = "drop")

# Plotting stacked area chart
ggplot(yearly_long_data, aes(x = Year, y = EnergyProduction, fill = EnergyType)) +
  geom_area() +
  labs(title = "Stacked Area Plot of Renewable Energy Sources Over Time",
       x = "Year", y = "Energy Production (GWh)", fill = "Energy Source") +
  theme_minimal()
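
The wide-to-long reshaping described above can be illustrated on a toy table. This is a minimal sketch with made-up values; only the column names mirror the dataset, and wide_demo, long_demo, and yearly_demo are hypothetical names.

```r
library(tidyr)
library(dplyr)

# Made-up values: two countries, two years, two energy columns
wide_demo <- tibble(
  Year = c(2000, 2000, 2001, 2001),
  Country = c("A", "B", "A", "B"),
  SolarEnergy = c(10, 20, 30, 40),
  WindEnergy = c(1, 2, 3, 4)
)

# After pivoting there is one row per Year x Country x EnergyType
long_demo <- wide_demo %>%
  pivot_longer(cols = c(SolarEnergy, WindEnergy),
               names_to = "EnergyType", values_to = "EnergyProduction")

# Summing across countries leaves one value per Year x EnergyType,
# which is the shape a stacked area chart expects
yearly_demo <- long_demo %>%
  group_by(Year, EnergyType) %>%
  summarise(EnergyProduction = sum(EnergyProduction), .groups = "drop")
```

Four wide rows with two energy columns become eight long rows, and the yearly summary collapses them to four.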

Bivariate Analysis

This code creates scatter plots to visualize each type of renewable energy production over time. It uses the ggplot2 package to plot Solar Energy, Wind Energy, Hydro Energy, and Other Renewable Energy against Year, giving each plot its own color and title. It then uses the gridExtra package to arrange the four scatter plots in a 2x2 grid, which makes it easy to compare trends in renewable energy production across the years.

# Function to create a scatter plot
create_scatter_plot <- function(data, x_var, y_var, color, title) {
  # .data[[...]] looks up columns by name; it replaces the deprecated aes_string()
  ggplot(data, aes(x = .data[[x_var]], y = .data[[y_var]])) +
    geom_point(color = color, alpha = 0.5) +
    labs(title = title, x = x_var, y = y_var) +
    theme_minimal()
}

# Creating scatter plots for each energy type vs Year
solar_year_scatt <- create_scatter_plot(energy_data, "Year", "SolarEnergy", "orange", "Solar Energy vs Year")
wind_year_scatt  <- create_scatter_plot(energy_data, "Year", "WindEnergy", "blue", "Wind Energy vs Year")
hydro_year_scatt <- create_scatter_plot(energy_data, "Year", "HydroEnergy", "green", "Hydro Energy vs Year")
other_renewable_year_scatt <- create_scatter_plot(energy_data, "Year", "OtherRenewableEnergy", "red", "Other Renewable Energy vs Year")

# Arranging the scatter plots in a grid layout
grid.arrange(solar_year_scatt, wind_year_scatt, hydro_year_scatt, other_renewable_year_scatt, nrow = 2, ncol = 2)

Correlation Matrix

This code calculates and visualizes the correlation matrix for the numerical columns in the energy_data dataset. It uses the dplyr package to select only the numeric columns and then computes their correlations. The corrplot package is used to create a visually appealing, color-coded matrix. This visualization focuses on the upper triangle of the matrix and includes the correlation values. Additionally, it orders the variables using hierarchical clustering to make the relationships between them easier to understand.

# Computing correlation matrix
cor_matrix <- cor(select_if(energy_data, is.numeric))

# Visualization of the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper", 
         tl.col = "black", tl.srt = 45, addCoef.col = "black", 
         diag = FALSE, order = "hclust")
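
Each cell of the matrix above is a Pearson correlation coefficient, which is what cor() computes by default. As a rough illustration of that quantity, the sketch below computes it by hand for two made-up vectors and compares the result with cor(); the names x, y, r_manual, and r_builtin are illustrative only.

```r
# Made-up vectors, used only to illustrate the computation
x <- c(2, 4, 6, 8, 10)
y <- c(1, 3, 2, 5, 4)

# Pearson correlation: covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Same quantity from the built-in function
r_builtin <- cor(x, y)
```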

Multivariate Analysis

This code creates a scatter plot to show the relationship between Wind Energy and Solar Energy production using the ggplot2 package. Wind Energy is on the x-axis and Solar Energy is on the y-axis, with each data point shown as a blue dot. The dots are slightly transparent to make overlapping points easier to see. The plot includes a clear title and labeled axes, and uses a simple, clean theme.

# Display scatter plot for Solar Energy vs. Wind Energy
ggplot(energy_data, aes(x = WindEnergy, y = SolarEnergy)) +
  geom_point(alpha = 0.7, color = 'blue') +
  labs(title = "Wind Energy vs Solar Energy",
       x = "Wind Energy",
       y = "Solar Energy") +
  theme_minimal()

Visualise the trend of total renewable energy production over the years

This code creates a line plot to show total renewable energy production over the years for different countries using the ggplot2 package. The Year is on the x-axis, and Total Renewable Energy production is on the y-axis. Different colors represent different countries. Lines connect each country’s data points, which are also marked with points. The plot has a clear title and labeled axes for easy understanding, and it uses a minimal theme for a clean, simple look.

ggplot(energy_data, aes(x = Year, y = TotalRenewableEnergy, color = Country, group = Country)) +
  geom_line() +
  geom_point() +
  labs(title = "Annual Trends in Total Renewable Energy Production", x = "Year", y = "Total Renewable Energy (GWh)") +
  theme_minimal()

Displaying the top five countries in renewable energy production

top_countries <- energy_data %>% 
  group_by(Country) %>% 
  summarise(TotalRenewableEnergy = sum(TotalRenewableEnergy)) %>% 
  top_n(5, TotalRenewableEnergy)

# plotting the bar chart
ggplot(top_countries, aes(x = reorder(Country, TotalRenewableEnergy), y = TotalRenewableEnergy, fill = Country)) +
  geom_bar(stat = 'identity') +
  labs(title = "Top Countries in Renewable Energy Production", x = "Country", y = "Total Renewable Energy (GWh)") +
  theme_minimal() +
  coord_flip()

Time series plot for total renewable energy production

This code calculates the total renewable energy production for each year and then creates a line plot to visualize this data. Using the dplyr package, it groups the data by year and sums up the renewable energy production for each year. The plot function is then used to create a line chart showing the total renewable energy production over the years, with labeled axes for Year and Total Renewable Energy (in GWh) and a title. This method clearly shows the trends in renewable energy production over time.

# total renewable energy production by year
total_renewable_per_year <- energy_data %>%
  group_by(Year) %>%
  summarise(TotalRenewableEnergy = sum(TotalRenewableEnergy))

# Plotting the total renewable energy production over the years
plot(total_renewable_per_year$Year, total_renewable_per_year$TotalRenewableEnergy, type = "l",
     main = "Total Renewable Energy Production Over the Years", xlab = "Year", ylab = "Total Renewable Energy (GWh)")

Data Standardization

# Scaling numerical columns
numerical_columns <- c("Year", "SolarEnergy", "WindEnergy", "HydroEnergy", "OtherRenewableEnergy")
energy_data[numerical_columns] <- scale(energy_data[numerical_columns])
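
For reference, scale() subtracts the column mean and divides by the sample standard deviation. The sketch below uses made-up values to show that computation, and how the centering and scaling constants can be kept in case the data later needs to be converted back to original units; the variable names are illustrative and not part of the analysis above.

```r
# Made-up production values, for illustration only
x <- c(100, 250, 400, 550, 700)

# scale() returns a one-column matrix with the centering and scaling
# constants stored as attributes
z <- scale(x)
center <- attr(z, "scaled:center")  # sample mean
spread <- attr(z, "scaled:scale")   # sample standard deviation

# The manual z-score matches scale()
z_manual <- (x - mean(x)) / sd(x)

# Reversing the transformation recovers the original units
x_back <- as.numeric(z) * spread + center
```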

Data Splitting

# Splitting the data into a 70% training set and a 30% testing set
set.seed(123) 
split <- sample.split(energy_data$SolarEnergy, SplitRatio = 0.7)
train_data <- subset(energy_data, split == TRUE)
test_data <- subset(energy_data, split == FALSE)

Model Selection

# Training control for cross-validation
cv_control <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation

# Training the base model using all the predictors
base_model <- train(SolarEnergy ~ Year + WindEnergy + HydroEnergy + OtherRenewableEnergy, 
                    data = train_data, method = "lm", trControl = cv_control)

# Training the reduced model using selected predictors
reduced_model <- train(SolarEnergy ~ Year + WindEnergy, 
                       data = train_data, method = "lm", trControl = cv_control)

# Comparing the models using re-sampling
model_results <- resamples(list(base = base_model, reduced = reduced_model))

# Results summary
summary(model_results)

## 
## Call:
## summary.resamples(object = model_results)
## 
## Models: base, reduced 
## Number of resamples: 5 
## 
## MAE 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## base    0.8023692 0.8370956 0.8714827 0.8558750 0.8804779 0.8879497    0
## reduced 0.7836376 0.8026387 0.8598168 0.8588634 0.9084546 0.9397693    0
## 
## RMSE 
##              Min.   1st Qu.   Median      Mean  3rd Qu.     Max. NA's
## base    0.9403469 0.9703772 1.030423 1.0020965 1.033388 1.035947    0
## reduced 0.8974503 0.9304814 1.017295 0.9966804 1.022259 1.115915    0
## 
## Rsquared 
##                 Min.     1st Qu.     Median       Mean    3rd Qu.       Max.
## base    0.0008274734 0.007486526 0.01164203 0.03225177 0.04693860 0.09436421
## reduced 0.0194664725 0.020213854 0.02532524 0.06896130 0.03534438 0.24445656
##         NA's
## base       0
## reduced    0

# Model Results by Dot Plots
dotplot(model_results)
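
To make the resampling step more concrete, the following sketch runs 5-fold cross-validation by hand in base R on simulated data. It is meant only to illustrate the idea behind trainControl(method = "cv", number = 5); caret's own fold assignment may differ in detail, and the data and names (sim, fold_id, fold_rmse, cv_rmse) are made up.

```r
set.seed(1)

# Simulated data, for illustration only: y depends weakly on x1 and not on x2
n <- 100
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- 0.5 * sim$x1 + rnorm(n)

# Assign each row to one of k folds at random
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other k - 1 folds, score on the held-out fold
fold_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x1 + x2, data = sim[fold_id != i, ])
  pred <- predict(fit, newdata = sim[fold_id == i, ])
  sqrt(mean((sim$y[fold_id == i] - pred)^2))
})

# The cross-validated RMSE is the average over the folds
cv_rmse <- mean(fold_rmse)
```

The per-fold values correspond to the spread (Min., Median, Max.) that summary(model_results) reports for each model.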

Fitting Linear Regression Model

# Fit a linear regression model to predict solar energy using all predictors
lm_model_all <- lm(SolarEnergy ~ Year + WindEnergy + HydroEnergy + OtherRenewableEnergy, data = train_data)

# Display summary of the linear regression model
summary(lm_model_all)

## 
## Call:
## lm(formula = SolarEnergy ~ Year + WindEnergy + HydroEnergy + 
##     OtherRenewableEnergy, data = train_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.61504 -0.88663 -0.05796  0.74688  1.91378 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)          -0.03360    0.07690  -0.437   0.6628  
## Year                 -0.17594    0.07739  -2.273   0.0243 *
## WindEnergy           -0.06781    0.07510  -0.903   0.3679  
## HydroEnergy          -0.11547    0.07693  -1.501   0.1353  
## OtherRenewableEnergy  0.01835    0.07943   0.231   0.8176  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9938 on 163 degrees of freedom
## Multiple R-squared:  0.04963,    Adjusted R-squared:  0.02631 
## F-statistic: 2.128 on 4 and 163 DF,  p-value: 0.07963

Compare and contrast logistic regression with multinomial logistic regression

library(nnet)
energy_data$EnergyCategory <- cut(energy_data$TotalRenewableEnergy, breaks = 3, labels = c("Low", "Medium", "High"))
model_multinomial <- multinom(EnergyCategory ~ SolarEnergy + WindEnergy + HydroEnergy + OtherRenewableEnergy, data=energy_data)

## # weights:  18 (10 variable)
## initial  value 263.666949 
## iter  10 value 55.844733
## iter  20 value 5.398074
## iter  30 value 2.508198
## iter  40 value 2.302491
## iter  50 value 1.987449
## iter  60 value 1.921587
## iter  70 value 1.833264
## iter  80 value 1.780439
## iter  90 value 1.652863
## iter 100 value 1.607954
## final  value 1.607954 
## stopped after 100 iterations

summary(model_multinomial)

## Call:
## multinom(formula = EnergyCategory ~ SolarEnergy + WindEnergy + 
##     HydroEnergy + OtherRenewableEnergy, data = energy_data)
## 
## Coefficients:
##        (Intercept) SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
## Medium    56.50451    27.58755   35.91137     47.1109             15.68476
## High     -20.73800    63.05888   84.40721    107.6599             32.58926
## 
## Std. Errors:
##        (Intercept) SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
## Medium    33.65876    17.05575   21.50945    28.22440             10.42032
## High      53.68716    26.07117   34.11858    43.30532             14.25991
## 
## Residual Deviance: 3.215908 
## AIC: 23.21591
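
A note on how the categories above are formed: cut() with breaks = 3 divides the observed range into three equal-width intervals, which is not the same as three equally sized groups. The sketch below contrasts the two on made-up values; the tercile-based alternative is shown only for comparison and is not what the analysis above uses.

```r
# Made-up totals, for illustration only (one large value stretches the range)
totals <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 30)

# Equal-width bins over the range 1..30: nearly all values land in "Low"
by_width <- cut(totals, breaks = 3, labels = c("Low", "Medium", "High"))

# Equal-count alternative: break at the empirical terciles
tercile_breaks <- quantile(totals, probs = c(0, 1/3, 2/3, 1))
by_count <- cut(totals, breaks = tercile_breaks,
                labels = c("Low", "Medium", "High"), include.lowest = TRUE)

table(by_width)
table(by_count)
```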

Check for multicollinearity

# Load the car package
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

# Check for multicollinearity
vif_values <- vif(lm_model_all)
print(vif_values)

##                 Year           WindEnergy          HydroEnergy 
##             1.015265             1.001553             1.011438 
## OtherRenewableEnergy 
##             1.019783
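
For a model with only single-coefficient terms, the variance inflation factor of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others; values close to 1, as above, indicate little collinearity. The sketch below computes this by hand in base R on simulated predictors. The data and the name manual_vif are made up for illustration.

```r
set.seed(2)

# Simulated predictors, for illustration only; x3 is built to correlate with x1
n <- 200
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$x3 <- 0.8 * sim$x1 + rnorm(n, sd = 0.6)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
manual_vif <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
```

In this toy setup x1 and x3 should show clearly larger values than x2, which is unrelated to both.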

Residual Analysis

# Distribution of residuals
resid_hist <- ggplot(lm_model_all, aes(x = .resid)) + 
  geom_histogram(binwidth = 0.1, fill = "gray") +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Count") +
  theme_ipsum()

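# Residuals vs. fitted values (checks for non-constant variance and non-linearity)
# A loess smoother is used because a straight-line fit through OLS residuals
# is always flat at zero and therefore cannot reveal any pattern
resid_fitted <- ggplot(lm_model_all, aes(x = .fitted, y = .resid)) + 
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "red") +
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme_ipsum()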

# Arrange the plots
grid.arrange(resid_hist, resid_fitted, nrow = 2, ncol = 1, top = "Regression Assumptions")

## `geom_smooth()` using formula = 'y ~ x'

Predictions on Test data

# Predict solar energy on the test set
predictions <- predict(lm_model_all, newdata = test_data)

RMSE Calculation

# Calculate RMSE
rmse <- sqrt(mean((test_data$SolarEnergy - predictions)^2))
print(paste("Root Mean Squared Error (RMSE):", round(rmse, 2)))

## [1] "Root Mean Squared Error (RMSE): 1.05"
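
The project overview also mentions MAE and R-squared as evaluation metrics. A minimal sketch of how all three could be computed from actual and predicted values is given below; the helper names (rmse_fn, mae_fn, r2_fn) are hypothetical and the numbers are made up, not taken from the model above.

```r
# Helper functions (hypothetical names) for common regression metrics
rmse_fn <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae_fn <- function(actual, predicted) mean(abs(actual - predicted))
r2_fn <- function(actual, predicted) {
  # 1 minus the ratio of residual to total sum of squares
  1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
}

# Made-up values, for illustration only
actual <- c(3, 5, 7, 9)
predicted <- c(2, 5, 8, 12)

metrics <- c(RMSE = rmse_fn(actual, predicted),
             MAE = mae_fn(actual, predicted),
             R2 = r2_fn(actual, predicted))
```

Applied to the analysis, the two arguments would be test_data$SolarEnergy and predictions.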

MAPE Calculation

# Calculate MAPE
mape <- mean(abs((test_data$SolarEnergy - predictions) / test_data$SolarEnergy)) * 100
print(paste("Mean Absolute Percentage Error (MAPE):", round(mape, 2)))

## [1] "Mean Absolute Percentage Error (MAPE): 110.53"
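
One caveat on the MAPE above: SolarEnergy was standardized earlier, so its values are centered on zero, and dividing by values at or near zero inflates a percentage error. The sketch below illustrates the effect on made-up numbers and shows one possible remedy, converting back to original units first. It assumes the mean and standard deviation were saved before scaling, which the code above does not do; all names and values here are hypothetical.

```r
# Made-up original-scale values, for illustration only
solar_raw <- c(200, 400, 600, 800, 1000)

# Standardize, keeping the constants needed to undo the transformation
solar_mean <- mean(solar_raw)
solar_sd <- sd(solar_raw)
solar_z <- (solar_raw - solar_mean) / solar_sd

# Hypothetical predictions on the standardized scale
pred_z <- solar_z + c(0.1, -0.1, 0.1, -0.1, 0.1)

# MAPE on the standardized scale blows up here because one
# standardized actual value is exactly zero
mape_z <- mean(abs((solar_z - pred_z) / solar_z)) * 100

# Back-transform both vectors, then compute MAPE in original units
actual_raw <- solar_z * solar_sd + solar_mean
pred_raw <- pred_z * solar_sd + solar_mean
mape_raw <- mean(abs((actual_raw - pred_raw) / actual_raw)) * 100
```

RMSE on the standardized scale remains interpretable (in standard-deviation units), so the issue is specific to percentage-based metrics.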