Yaswitha_Kalapala_Stat_631_Final_Project
GLOBAL RENEWABLE ENERGY ANALYSIS
The primary goal of this project is to analyze and predict global renewable energy production using statistical techniques. The dataset, consisting of 240 observations and 7 variables, records renewable energy production by year and country, detailing total production and contributions from solar, wind, hydro, and other sources. Initial data exploration reveals trends over the years, identifies leading countries in production, and shows correlations between different energy types. The next steps involve data cleaning, exploratory data analysis, feature engineering, model building, and evaluation using metrics like RMSE, MAE, and R-squared, ultimately providing insights and recommendations based on the findings.
Load the Required Libraries
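The chunk that loads the libraries is not shown in the rendered output. A minimal sketch, assuming a package list inferred from functions and messages that appear elsewhere in this report (for example, theme_ipsum() suggests hrbrthemes and the column-specification message suggests readr), might look like this. The nnet and car packages are left out because the report loads them later.

```r
# Packages inferred from functions used later in this report (an assumption,
# since the original chunk is hidden in the rendered output).
required_pkgs <- c("readr", "dplyr", "tidyr", "ggplot2", "gridExtra",
                   "corrplot", "caret", "hrbrthemes")

# Check which packages are installed so the script can report gaps clearly.
is_installed <- vapply(required_pkgs, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required_pkgs[!is_installed]
if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
}

# Attach every package that is available.
for (pkg in required_pkgs[is_installed]) {
  suppressPackageStartupMessages(library(pkg, character.only = TRUE))
}
```

Checking with requireNamespace() first means a missing package produces one readable message rather than stopping the script at the first library() call.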
Load the Dataset
## Rows: 240 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): Country
## dbl (6): Year, SolarEnergy, WindEnergy, HydroEnergy, OtherRenewableEnergy, T...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Print the first six rows of the data
## # A tibble: 6 × 7
## Year Country SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 2000 USA 437. 1436. 1544. 319.
## 2 2001 USA 240. 403. 399. 440.
## 3 2002 USA 641. 1120. 335. 486.
## 4 2003 USA 849. 476. 609. 133.
## 5 2004 USA 374. 882. 1034. 181.
## 6 2005 USA 651. 381. 797. 215.
## # ℹ 1 more variable: TotalRenewableEnergy <dbl>
Displaying Column names in the Dataset
## [1] "Year" "Country" "SolarEnergy"
## [4] "WindEnergy" "HydroEnergy" "OtherRenewableEnergy"
## [7] "TotalRenewableEnergy"
Display Data types
## spc_tbl_ [240 × 7] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Year : num [1:240] 2000 2001 2002 2003 2004 ...
## $ Country : chr [1:240] "USA" "USA" "USA" "USA" ...
## $ SolarEnergy : num [1:240] 437 240 641 849 374 ...
## $ WindEnergy : num [1:240] 1436 403 1120 476 882 ...
## $ HydroEnergy : num [1:240] 1544 399 335 609 1034 ...
## $ OtherRenewableEnergy: num [1:240] 319 440 486 133 181 ...
## $ TotalRenewableEnergy: num [1:240] 3737 1482 2583 2067 2471 ...
## - attr(*, "spec")=
## .. cols(
## .. Year = col_double(),
## .. Country = col_character(),
## .. SolarEnergy = col_double(),
## .. WindEnergy = col_double(),
## .. HydroEnergy = col_double(),
## .. OtherRenewableEnergy = col_double(),
## .. TotalRenewableEnergy = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Displaying Data dimensions and summary for the dataset
## [1] 240 7
## Year Country SolarEnergy WindEnergy
## Min. :2000 Length:240 Min. :104.6 Min. : 206.0
## 1st Qu.:2006 Class :character 1st Qu.:284.7 1st Qu.: 523.6
## Median :2012 Mode :character Median :533.4 Median : 882.0
## Mean :2012 Mean :528.5 Mean : 857.1
## 3rd Qu.:2017 3rd Qu.:766.7 3rd Qu.:1160.2
## Max. :2023 Max. :997.0 Max. :1487.1
## HydroEnergy OtherRenewableEnergy TotalRenewableEnergy
## Min. : 320.7 Min. : 54.88 Min. : 910.4
## 1st Qu.: 593.8 1st Qu.:176.32 1st Qu.:2250.8
## Median :1046.4 Median :291.40 Median :2815.5
## Mean :1076.6 Mean :287.13 Mean :2749.4
## 3rd Qu.:1495.2 3rd Qu.:405.48 3rd Qu.:3217.2
## Max. :1983.9 Max. :499.87 Max. :4628.2
Checking for missing values in the dataset and Data Cleaning
## Year Country SolarEnergy
## 0 0 0
## WindEnergy HydroEnergy OtherRenewableEnergy
## 0 0 0
## TotalRenewableEnergy
## 0
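The code behind this check is not shown in the rendered output. Per-column counts like those above are what colSums(is.na(...)) produces, so a minimal sketch, using a small illustrative data frame with hypothetical gaps rather than the project data, could be:

```r
# Small illustrative data frame (hypothetical values, not the project data).
demo_data <- data.frame(
  Year = c(2000, 2001, 2002),
  Country = c("USA", "USA", NA),
  SolarEnergy = c(437, NA, 641),
  stringsAsFactors = FALSE
)

# Count missing values in each column.
na_counts <- colSums(is.na(demo_data))
print(na_counts)

# If any cleaning were needed, one simple option is to keep complete rows only.
clean_data <- na.omit(demo_data)
print(nrow(clean_data))
```

In the project data every count is zero, so no rows would need to be dropped.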
Distribution Analysis of Numerical Variables
This code displays histograms of Year and of each type of renewable energy production (SolarEnergy, WindEnergy, HydroEnergy, and OtherRenewableEnergy) in the energy_data dataset, using the ggplot2 and gridExtra packages. Each histogram is drawn with geom_histogram(), with its own fill color and title. Finally, grid.arrange() from the gridExtra package arranges the five histograms in a grid of 3 rows and 2 columns, so all the distributions can be seen together.
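Because the energy variables span several hundred units (SolarEnergy ranges from about 105 to 997 in the summary above), the bin width strongly affects how readable each histogram is. One common heuristic, not necessarily the one used in this project, is the Freedman-Diaconis rule, 2 * IQR / n^(1/3). The sketch below applies it to the summary statistics reported above for SolarEnergy; fd_binwidth() is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: Freedman-Diaconis bin width, 2 * IQR / n^(1/3).
fd_binwidth <- function(x) {
  x <- x[!is.na(x)]
  2 * IQR(x) / length(x)^(1 / 3)
}

# Worked example from the reported summary statistics for SolarEnergy.
solar_iqr <- 766.7 - 284.7   # 3rd quartile minus 1st quartile
n_obs <- 240                 # number of observations in the dataset
solar_width <- 2 * solar_iqr / n_obs^(1 / 3)
print(round(solar_width))

# Usage on a simple vector; the result could be passed to
# geom_histogram(binwidth = ...).
print(fd_binwidth(1:8))
```

For SolarEnergy the rule suggests a width of roughly 155, which gives a sense of the scale at which the shape of the distribution becomes visible.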
## Histogram plot for Distribution of Year
year_hist <- ggplot(energy_data, aes(x = Year)) +
  geom_histogram(binwidth = 2, fill = "#AFEEEE", color = "white") +
  ggtitle("Distribution of Year") +
  theme_minimal()
## Histogram plot for Distribution of Solar Energy
solar_hist <- ggplot(energy_data, aes(x = SolarEnergy)) +
  geom_histogram(binwidth = 2, fill = "#FFFFCC", color = "red") +
  ggtitle("Distribution of Solar Energy") +
  theme_minimal()
## Histogram plot for Distribution of Wind Energy
wind_hist <- ggplot(energy_data, aes(x = WindEnergy)) +
  geom_histogram(binwidth = 2, fill = "#99CCFF", color = "blue") +
  ggtitle("Distribution of Wind Energy") +
  theme_minimal()
## Histogram plot for Distribution of Hydro Energy
hydro_hist <- ggplot(energy_data, aes(x = HydroEnergy)) +
  geom_histogram(binwidth = 2, fill = "#99FF99", color = "green") +
  ggtitle("Distribution of Hydro Energy") +
  theme_minimal()
## Histogram plot for Distribution of Other Renewable Energy
other_renewable_hist <- ggplot(energy_data, aes(x = OtherRenewableEnergy)) +
  geom_histogram(binwidth = 2, fill = "#FFCC99", color = "orange") +
  ggtitle("Distribution of Other Renewable Energy") +
  theme_minimal()
## Arranging the five histograms in a grid of 3 rows and 2 columns
grid.arrange(year_hist, solar_hist, wind_hist, hydro_hist, other_renewable_hist, nrow = 3, ncol = 2)
Defining the list of countries in the Dataset.
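The chunk that lists the countries is not shown in the rendered output. A minimal sketch, assuming it simply extracts the distinct values of the Country column, is given below. Only "USA" appears in the output above; the other country name is a hypothetical placeholder.

```r
# Illustrative data only: "USA" appears in the report; "CountryB" is a
# hypothetical placeholder, not a value from the dataset.
demo_data <- data.frame(
  Year = c(2000, 2001, 2000, 2001),
  Country = c("USA", "USA", "CountryB", "CountryB"),
  stringsAsFactors = FALSE
)

# Distinct countries, in order of first appearance.
countries <- unique(demo_data$Country)
print(countries)

# Number of observations per country.
print(table(demo_data$Country))
```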
Stacked Area Plot of Renewable Energy Sources Over Time
This code transforms the renewable energy production dataset into a long format, making it compatible for visualization with ggplot2. It then generates a stacked area chart to illustrate the production trends of various renewable energy types (Solar, Wind, Hydro, Other) over time. By using geom_area(), the areas are stacked, offering a clear and concise depiction of the relative contributions of each energy source throughout the years.
# Converting the data into long format for ggplot
long_data <- energy_data %>%
  pivot_longer(cols = c(SolarEnergy, WindEnergy, HydroEnergy, OtherRenewableEnergy),
               names_to = "EnergyType", values_to = "EnergyProduction") %>%
  # Summing across countries so each energy type has one value per year;
  # geom_area() expects a single y value per x within each group
  group_by(Year, EnergyType) %>%
  summarise(EnergyProduction = sum(EnergyProduction), .groups = "drop")
# Plotting stacked area chart
ggplot(long_data, aes(x = Year, y = EnergyProduction, fill = EnergyType)) +
  geom_area() +
  labs(title = "Stacked Area Plot of Renewable Energy Sources Over Time",
       x = "Year", y = "Energy Production (GWh)", fill = "Energy Source") +
  theme_minimal()
Bivariate Analysis
This code defines a helper function that creates scatter plots and uses it to visualize each type of renewable energy production over time. It uses the ggplot2 package to plot Solar Energy, Wind Energy, Hydro Energy, and Other Renewable Energy against Year, with each plot having its own color and title. It then uses the gridExtra package to arrange these scatter plots into a 2x2 grid, making it easy to compare trends in renewable energy production over the years.
# Function to display scatter plots
create_scatter_plot <- function(data, x_var, y_var, color, title) {
  # .data[[...]] replaces the deprecated aes_string() for column names passed as strings
  ggplot(data, aes(x = .data[[x_var]], y = .data[[y_var]])) +
    geom_point(color = color, alpha = 0.5) +
    labs(title = title, x = x_var, y = y_var) +
    theme_minimal()
}
# Displaying scatter plots for each energy type vs Year
solar_year_scatt <- create_scatter_plot(energy_data, "Year", "SolarEnergy", "orange", "Solar Energy vs Year")
wind_year_scatt <- create_scatter_plot(energy_data, "Year", "WindEnergy", "blue", "Wind Energy vs Year")
hydro_year_scatt <- create_scatter_plot(energy_data, "Year", "HydroEnergy", "green", "Hydro Energy vs Year")
other_renewable_year_scatt <- create_scatter_plot(energy_data, "Year", "OtherRenewableEnergy", "red", "Other Renewable Energy vs Year")
# Arranging the scatter plots in a grid layout
grid.arrange(solar_year_scatt, wind_year_scatt, hydro_year_scatt, other_renewable_year_scatt, nrow = 2, ncol = 2)
Correlation Matrix
This code calculates and visualizes the correlation matrix for the numerical columns in the energy_data dataset. It uses the dplyr package to select only the numeric columns and then computes their correlations. The corrplot package is used to create a visually appealing, color-coded matrix. This visualization focuses on the upper triangle of the matrix and includes the correlation values. Additionally, it orders the variables using hierarchical clustering to make the relationships between them easier to understand.
# Computing correlation matrix
cor_matrix <- cor(select_if(energy_data, is.numeric))
# Visualization of the correlation matrix
corrplot(cor_matrix, method = "color", type = "upper",
         tl.col = "black", tl.srt = 45, addCoef.col = "black",
         diag = FALSE, order = "hclust")
Multivariate Analysis
This code creates a scatter plot to show the relationship between Wind Energy and Solar Energy production using the ggplot2 package. Wind Energy is on the x-axis and Solar Energy is on the y-axis, with each data point shown as a blue dot. The dots are slightly transparent to make overlapping points easier to see. The plot includes a clear title and labeled axes, and uses a simple, clean theme.
# Display scatter plot for Solar Energy vs. Wind Energy
ggplot(energy_data, aes(x = WindEnergy, y = SolarEnergy)) +
  geom_point(alpha = 0.7, color = 'blue') +
  labs(title = "Wind Energy vs Solar Energy",
       x = "Wind Energy",
       y = "Solar Energy") +
  theme_minimal()
Visualise the trend of total renewable energy production over the years
This code creates a line plot to show total renewable energy production over the years for different countries using the ggplot2 package. The Year is on the x-axis, and Total Renewable Energy production is on the y-axis. Different colors represent different countries. Lines connect each country’s data points, which are also marked with points. The plot has a clear title and labeled axes for easy understanding, and it uses a minimal theme for a clean, simple look.
ggplot(energy_data, aes(x = Year, y = TotalRenewableEnergy, color = Country, group = Country)) +
  geom_line() +
  geom_point() +
  labs(title = "Annual Trends in Total Renewable Energy Production", x = "Year", y = "Total Renewable Energy (GWh)") +
  theme_minimal()
Displaying the Top countries in renewable energy production
# Total production per country, keeping the five largest producers
top_countries <- energy_data %>%
  group_by(Country) %>%
  summarise(TotalRenewableEnergy = sum(TotalRenewableEnergy)) %>%
  top_n(5, TotalRenewableEnergy)
# Plotting the bar chart
ggplot(top_countries, aes(x = reorder(Country, TotalRenewableEnergy), y = TotalRenewableEnergy, fill = Country)) +
  geom_bar(stat = 'identity') +
  labs(title = "Top Countries in Renewable Energy Production", x = "Country", y = "Total Renewable Energy (GWh)") +
  theme_minimal() +
  coord_flip()
Time series plot for total renewable energy production
This code calculates the total renewable energy production for each year and then creates a line plot to visualize this data. Using the dplyr package, it groups the data by year and sums up the renewable energy production for each year. The plot function is then used to create a line chart showing the total renewable energy production over the years, with labeled axes for Year and Total Renewable Energy (in GWh) and a title. This method clearly shows the trends in renewable energy production over time.
# total renewable energy production by year
total_renewable_per_year <- energy_data %>%
  group_by(Year) %>%
  summarise(TotalRenewableEnergy = sum(TotalRenewableEnergy))
# Plotting the total renewable energy production over the years
plot(total_renewable_per_year$Year, total_renewable_per_year$TotalRenewableEnergy, type = "l",
     main = "Total Renewable Energy Production Over the Years", xlab = "Year", ylab = "Total Renewable Energy (GWh)")
Data Standardization
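The standardization chunk is not shown in the rendered output. The regression output later in this report has coefficients and residuals on roughly a unit scale, which suggests the numeric columns were centered and scaled. A minimal sketch under that assumption, using base R's scale() on a few rows taken from the head() output (values as printed, so rounded), could be:

```r
# A few rows as printed in the head() output (rounded values).
demo_data <- data.frame(
  Year = c(2000, 2001, 2002, 2003),
  Country = c("USA", "USA", "USA", "USA"),
  SolarEnergy = c(437, 240, 641, 849),
  WindEnergy = c(1436, 403, 1120, 476),
  stringsAsFactors = FALSE
)

# Standardize only the numeric columns: subtract the mean, divide by the sd.
numeric_cols <- vapply(demo_data, is.numeric, logical(1))
scaled_data <- demo_data
scaled_data[numeric_cols] <- scale(demo_data[numeric_cols])

# After scaling, each numeric column has mean 0 and standard deviation 1.
print(round(colMeans(scaled_data[numeric_cols]), 10))
print(apply(scaled_data[numeric_cols], 2, sd))
```

Whether the original analysis scaled the full dataset or only the training split cannot be determined from the output; scaling with training-set statistics only is the usual way to avoid leaking information from the test set.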
Data Splitting
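The splitting chunk is not shown either. The linear model summary later in this report shows 163 residual degrees of freedom with 5 estimated coefficients, which implies 168 training rows, or 70% of the 240 observations. A minimal sketch of a 70/30 random split in base R follows; the seed and the use of sample() are assumptions, and the original may have used a different function such as caret's createDataPartition().

```r
set.seed(123)  # arbitrary seed, assumed here only for reproducibility

# Placeholder data with the same number of rows as the project dataset.
n_total <- 240
demo_data <- data.frame(id = seq_len(n_total), y = rnorm(n_total))

# Draw 70% of the row indices at random for training; the rest form the test set.
train_idx <- sample(seq_len(n_total), size = round(0.7 * n_total))
train_data <- demo_data[train_idx, ]
test_data <- demo_data[-train_idx, ]

print(c(train = nrow(train_data), test = nrow(test_data)))
```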
Model Selection
# Training control for cross-validation
cv_control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
# Training the base model using all the predictors
base_model <- train(SolarEnergy ~ Year + WindEnergy + HydroEnergy + OtherRenewableEnergy,
                    data = train_data, method = "lm", trControl = cv_control)
# Training the reduced model using selected predictors
reduced_model <- train(SolarEnergy ~ Year + WindEnergy,
                       data = train_data, method = "lm", trControl = cv_control)
# Comparing the models using re-sampling
model_results <- resamples(list(base = base_model, reduced = reduced_model))
# Results summary
summary(model_results)
##
## Call:
## summary.resamples(object = model_results)
##
## Models: base, reduced
## Number of resamples: 5
##
## MAE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## base 0.8023692 0.8370956 0.8714827 0.8558750 0.8804779 0.8879497 0
## reduced 0.7836376 0.8026387 0.8598168 0.8588634 0.9084546 0.9397693 0
##
## RMSE
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## base 0.9403469 0.9703772 1.030423 1.0020965 1.033388 1.035947 0
## reduced 0.8974503 0.9304814 1.017295 0.9966804 1.022259 1.115915 0
##
## Rsquared
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## base 0.0008274734 0.007486526 0.01164203 0.03225177 0.04693860 0.09436421
## reduced 0.0194664725 0.020213854 0.02532524 0.06896130 0.03534438 0.24445656
## NA's
## base 0
## reduced 0
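The three metrics in the resampling summary can be reproduced by hand for any vector of observed and predicted values. The sketch below uses hypothetical helper functions and toy numbers. It assumes R-squared is the squared correlation between observed and predicted values, which is how caret's default regression summary is documented to compute it.

```r
# Hypothetical helpers for the three metrics reported above.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))  # root mean squared error
mae  <- function(obs, pred) mean(abs(obs - pred))       # mean absolute error
rsq  <- function(obs, pred) cor(obs, pred)^2            # squared correlation

# Toy values, not taken from the project data.
obs  <- c(1, 2, 3, 4)
pred <- c(1.5, 1.5, 3.5, 3.5)

print(c(RMSE = rmse(obs, pred), MAE = mae(obs, pred), Rsquared = rsq(obs, pred)))
```

If the response was standardized, as the scale of the results suggests, an RMSE close to 1 with an R-squared near 0 means that neither model predicts SolarEnergy much better than its mean does.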
Fitting Linear Regression Model
# Fit a linear regression model to predict solar energy using all predictors
lm_model_all <- lm(SolarEnergy ~ Year + WindEnergy + HydroEnergy + OtherRenewableEnergy, data = train_data)
# Display summary of the linear regression model
summary(lm_model_all)
##
## Call:
## lm(formula = SolarEnergy ~ Year + WindEnergy + HydroEnergy +
## OtherRenewableEnergy, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.61504 -0.88663 -0.05796 0.74688 1.91378
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.03360 0.07690 -0.437 0.6628
## Year -0.17594 0.07739 -2.273 0.0243 *
## WindEnergy -0.06781 0.07510 -0.903 0.3679
## HydroEnergy -0.11547 0.07693 -1.501 0.1353
## OtherRenewableEnergy 0.01835 0.07943 0.231 0.8176
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9938 on 163 degrees of freedom
## Multiple R-squared: 0.04963, Adjusted R-squared: 0.02631
## F-statistic: 2.128 on 4 and 163 DF, p-value: 0.07963
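The last two lines of the summary are linked by simple formulas, so they can be checked from the reported R-squared alone. The sketch below uses two hypothetical helper functions, with n = 168 (163 residual degrees of freedom plus 5 coefficients) and p = 4 predictors.

```r
# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)
adjusted_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F-statistic: (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_statistic <- function(r2, n, p) (r2 / p) / ((1 - r2) / (n - p - 1))

n_train <- 163 + 5   # residual degrees of freedom plus estimated coefficients
r2 <- 0.04963        # Multiple R-squared as reported in the summary

adj <- adjusted_r2(r2, n = n_train, p = 4)
f_val <- f_statistic(r2, n = n_train, p = 4)
print(round(c(adjusted_r2 = adj, f_statistic = f_val), 5))
```

Both values agree with the summary (0.02631 and 2.128) up to rounding. The model explains about 5% of the variance in SolarEnergy, and the overall F-test is not significant at the 5% level (p = 0.07963).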
Compare and contrast logistic regression with multinomial logistic regression
library(nnet)
energy_data$EnergyCategory <- cut(energy_data$TotalRenewableEnergy, breaks = 3, labels = c("Low", "Medium", "High"))
model_multinomial <- multinom(EnergyCategory ~ SolarEnergy + WindEnergy + HydroEnergy + OtherRenewableEnergy, data = energy_data)
## # weights: 18 (10 variable)
## initial value 263.666949
## iter 10 value 55.844733
## iter 20 value 5.398074
## iter 30 value 2.508198
## iter 40 value 2.302491
## iter 50 value 1.987449
## iter 60 value 1.921587
## iter 70 value 1.833264
## iter 80 value 1.780439
## iter 90 value 1.652863
## iter 100 value 1.607954
## final value 1.607954
## stopped after 100 iterations
## Call:
## multinom(formula = EnergyCategory ~ SolarEnergy + WindEnergy +
## HydroEnergy + OtherRenewableEnergy, data = energy_data)
##
## Coefficients:
## (Intercept) SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
## Medium 56.50451 27.58755 35.91137 47.1109 15.68476
## High -20.73800 63.05888 84.40721 107.6599 32.58926
##
## Std. Errors:
## (Intercept) SolarEnergy WindEnergy HydroEnergy OtherRenewableEnergy
## Medium 33.65876 17.05575 21.50945 28.22440 10.42032
## High 53.68716 26.07117 34.11858 43.30532 14.25991
##
## Residual Deviance: 3.215908
## AIC: 23.21591
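The response for this model comes from cut() with breaks = 3, which divides the observed range into three equal-width intervals rather than three equally sized groups. A minimal sketch on a toy vector shows this, followed by the approximate cut points implied by the reported minimum and maximum of TotalRenewableEnergy (taken from the earlier summary output on the original scale, so rounded):

```r
# Toy vector: equal-width binning over the range 0 to 9.
total <- c(0, 1, 2, 3, 4, 5, 6, 7, 8, 9)
category <- cut(total, breaks = 3, labels = c("Low", "Medium", "High"))
print(table(category))

# Approximate interior cut points implied by the reported range of
# TotalRenewableEnergy (min 910.4, max 4628.2).
edges <- seq(910.4, 4628.2, length.out = 4)
print(round(edges[2:3], 1))
```

The first rows of the data suggest that TotalRenewableEnergy is the sum of the four energy columns (for example, 240 + 403 + 399 + 440 = 1482 for the second row). If so, the categories are an almost deterministic function of the predictors. That would be consistent with the very large coefficients, the near-zero residual deviance, and the optimizer stopping at the 100-iteration limit instead of converging.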
Check for multicollinearity
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## Year WindEnergy HydroEnergy
## 1.015265 1.001553 1.011438
## OtherRenewableEnergy
## 1.019783
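The code for this check is not shown, but the loading message for car suggests the values above come from its vif() function applied to the linear model. To make the quantity concrete, the sketch below computes variance inflation factors by hand as 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The data are simulated and vif_manual() is a hypothetical helper.

```r
set.seed(1)  # arbitrary seed for the simulated example

# Simulated predictors: x3 is deliberately correlated with x1.
n <- 100
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$x3 <- demo$x1 + rnorm(n, sd = 0.5)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2) for each predictor column.
vif_manual <- function(data) {
  vapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    fit <- lm(reformulate(others, response = v), data = data)
    1 / (1 - summary(fit)$r.squared)
  }, numeric(1))
}

vifs <- vif_manual(demo)
print(round(vifs, 2))
```

A VIF of 1 means a predictor is uncorrelated with the others. The values reported above, all between 1.00 and 1.02, indicate that multicollinearity is not a concern in this model.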
Distribution of Residual Analysis
# Distribution of residuals
resid_hist <- ggplot(lm_model_all, aes(x = .resid)) +
  geom_histogram(binwidth = 0.1, fill = "gray") +
  labs(title = "Histogram of Residuals", x = "Residuals", y = "Count") +
  theme_ipsum()
# Variance of residuals
resid_fitted <- ggplot(lm_model_all, aes(x = .fitted, y = .resid)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(title = "Residuals vs. Fitted Values", x = "Fitted Values", y = "Residuals") +
  theme_ipsum()
# Arrange the plots
grid.arrange(resid_hist, resid_fitted, nrow = 2, ncol = 1, top = "Regression Assumptions")
## `geom_smooth()` using formula = 'y ~ x'