RMT HomeworkR

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.

#calling library WDI

library(WDI)

#retrieving data from the world bank with factor "NY.GDP.PCAP.KD" = GDP per capita (2000-2023)
my_data <- WDI(country = "all", indicator = "NY.GDP.PCAP.KD", start = 2000, end = 2023, extra = TRUE, cache = NULL)

#Calling library tidyverse
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

#calling library dplyr to start filtering the data
library(dplyr)

#Install and load WDI package (making sure)
#install.packages("WDI")
library(WDI)

#Retrieving World Bank data for specified indicators
my_data <- WDI(country = "all", 
                  indicator = c("NY.GDP.PCAP.KD", "EG.FEC.RNEW.ZS", "EG.USE.PCAP.KG.OE"), 
                  start = 2000, end = 2023, 
                  extra = TRUE)

#View the first few rows of the data
head(my_data)

##       country iso2c iso3c year status lastupdated NY.GDP.PCAP.KD EG.FEC.RNEW.ZS
## 1 Afghanistan    AF   AFG 2020         2024-06-28       529.1449           18.2
## 2 Afghanistan    AF   AFG 2021         2024-06-28       407.6165           20.0
## 3 Afghanistan    AF   AFG 2012         2024-06-28       570.6761           15.4
## 4 Afghanistan    AF   AFG 2022         2024-06-28       372.6159           20.0
## 5 Afghanistan    AF   AFG 2015         2024-06-28       566.8811           17.7
## 6 Afghanistan    AF   AFG 2004         2024-06-28       338.7394           44.2
##   EG.USE.PCAP.KG.OE     region capital longitude latitude     income lending
## 1                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA
## 2                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA
## 3                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA
## 4                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA
## 5                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA
## 6                NA South Asia   Kabul   69.1761  34.5228 Low income     IDA

#filtering data that "capital column is either blank or missing
library(dplyr)

my_data <- my_data %>%
  filter(capital != "")

#renaming columns
library(dplyr) 

my_data <- my_data %>%
  rename(
    `GDP per Capita` = NY.GDP.PCAP.KD,
    `Renewable Energy Use` = EG.FEC.RNEW.ZS,
    `Energy Use per Capita` = EG.USE.PCAP.KG.OE
  )

#filtering out rows or columns with (NA) values or ("") blank 
my_data <- my_data %>%
  filter(!is.na(`GDP per Capita`) & `GDP per Capita` != "" &
         !is.na(`Renewable Energy Use`) & `Renewable Energy Use` != "")

#After examining the data I am continuing the work with the data in between 2017-2021
library(dplyr)

my_data <- my_data %>%
  filter(year >= 2017 & year <= 2021)

#summerizing data and grouping the data by "country", "region" and "income". Summary statistics for each group 
library(dplyr)

my_data1 <- my_data %>%
  group_by(country, region, income) %>%
  summarise(
    `Mean GDP per Capita` = mean(`GDP per Capita`, na.rm = TRUE),
    `Min GDP per Capita` = min(`GDP per Capita`, na.rm = TRUE),
    `Max GDP per Capita` = max(`GDP per Capita`, na.rm = TRUE),
    `Median GDP per Capita` = median(`GDP per Capita`, na.rm = TRUE),
    `Mean Renewable Energy Use` = mean(`Renewable Energy Use`, na.rm = TRUE),
    `Min Renewable Energy Use` = min(`Renewable Energy Use`, na.rm = TRUE),
    `Max Renewable Energy Use` = max(`Renewable Energy Use`, na.rm = TRUE),
    `Median Renewable Energy Use` = median(`Renewable Energy Use`, na.rm = TRUE)
  )

## `summarise()` has grouped output by 'country', 'region'. You can override using
## the `.groups` argument.

#Creating a new category based on "Mean Renewable Energy Use".
my_data1 <- my_data1 %>%
  mutate(
    Consumption_Category = case_when(
      `Mean Renewable Energy Use` < 10 ~ "Low",
      `Mean Renewable Energy Use` >= 10 & `Mean Renewable Energy Use` < 30 ~ "Medium",
      `Mean Renewable Energy Use` >= 30 & `Mean Renewable Energy Use` < 50 ~ "High",
      TRUE ~ "Very High"
    )
  )

# Calculating the frequesncy distribution of "Consumption_Category", counting how many observations fall into each category of "Consumtion_Category"
library(dplyr)

category_distribution <- my_data1 %>%
  group_by(Consumption_Category) %>%
  summarise(Count = n())

print(category_distribution)

## # A tibble: 4 × 2
##   Consumption_Category Count
##   <chr>                <int>
## 1 High                    29
## 2 Low                     61
## 3 Medium                  62
## 4 Very High               46

#continuing with parametric test of “my_data1”

#Checking the normality by group
library(ggplot2)
ggplot(my_data1, aes(sample = `Mean Renewable Energy Use`, color = income)) +
  geom_qq() +
  geom_qq_line() +
  facet_wrap(~ income) +
  labs(title = "Q-Q Plots for Mean Renewable Energy Use by Income Level")

#Checking the homogeneity of variances
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

leveneTest(`Mean Renewable Energy Use` ~ income, data = my_data1)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value  Pr(>F)  
## group   3   3.809 0.01103 *
##       194                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

#The results from Levene’s Test indicate that the assumption of homogeneity of variances is violated (p-value = 0.01103, which is less than the typical alpha level of 0.05). This result suggests that the variances of Mean Renewable Energy Use across the different income categories are not equal. Given this violation of assumption of homogenity for ANOVA, we continue with non-parametric Kruskal-wallis test.

# Kruskal-Wallis Test
kruskal_test <- kruskal.test(`Mean Renewable Energy Use` ~ income, data = my_data1)
print(kruskal_test)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Mean Renewable Energy Use by income
## Kruskal-Wallis chi-squared = 66.809, df = 3, p-value = 2.057e-14

#Interpretation of Kruskal-Wallis rank sum test. The p-value is much less than 0.05, suggesting strong evidence against the null hypothesis. This means we can reject the null hypothesis that there is no difference in median renewable energy consumption across the different income levels.Since the test is significant, it implies that the economic classification by income of countries has a statistically significant association with the consumption of renewable energy.

#Since the Kruskal-Wallis test indicates differences but does not tell us which specific groups differ from each other, we will conduct post-hoc tests. Using pairwise comparisons with adjustments for multiple testing.

#Installing the 'FSA' package
if (!require(FSA)) install.packages("FSA", dependencies = TRUE)

## Loading required package: FSA

## Registered S3 methods overwritten by 'FSA':
##   method       from
##   confint.boot car 
##   hist.boot    car

## ## FSA v0.9.5. See citation('FSA') if used in publication.
## ## Run fishR() for related website and fishR('IFAR') for related book.

## 
## Attaching package: 'FSA'

## The following object is masked from 'package:car':
## 
##     bootCase

library(FSA)

# Dunn's test for multiple comparisons
dunn_test <- dunnTest(`Mean Renewable Energy Use` ~ income, data = my_data1, method="bonferroni")

## Warning: income was coerced to a factor.

print(dunn_test)

## Dunn (1964) Kruskal-Wallis multiple comparison

##   p-values adjusted with the Bonferroni method.

##                                  Comparison         Z      P.unadj        P.adj
## 1                  High income - Low income -7.235717 4.630760e-13 2.778456e-12
## 2         High income - Lower middle income -5.139629 2.752811e-07 1.651687e-06
## 3          Low income - Lower middle income  3.063458 2.187948e-03 1.312769e-02
## 4         High income - Upper middle income -1.232164 2.178877e-01 1.000000e+00
## 5          Low income - Upper middle income  6.044713 1.496763e-09 8.980578e-09
## 6 Lower middle income - Upper middle income  3.694095 2.206717e-04 1.324030e-03

#Based on the statistical analyses performed, including the Kruskal-Wallis test and subsequent Dunn’s post-hoc tests, we can conclude that there is a statistically significant association between the economic classification by income of countries and the consumption of renewable energy. The results show significant differences in renewable energy use among various income levels, particularly between high-income and low-income countries, as well as between upper-middle-income and lower-income groups. These findings allow us to reject the null hypothesis (H0) that there is no correlation between economic classification by income and renewable energy consumption. Instead, we accept the alternative hypothesis (H1) that there is a correlation, indicating that economic factors significantly influence renewable energy consumption patterns. This conclusion underscores the need for refined energy policies that consider economic disparities to effectively promote renewable energy adoption across different income levels.

#continuing with regression I will again specify my research question: How do income level and region impact renewable energy consumption across countries? I will be continuing with Multiple Linear Regression.

# Set CRAN mirror before installing packages
options(repos = c(CRAN = "https://cran.rstudio.com/"))

# Install the necessary packages
install.packages(c("ggplot2", "car", "MASS"))

## Warning: packages 'ggplot2', 'car' are in use and will not be installed

## Installing package into 'C:/Users/Hp Victus/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## Warning: package 'MASS' is not available for this version of R
## 
## A version of this package for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

#Installing necessary packages
install.packages(c("ggplot2", "car", "statsmodels", "MASS"))

## Warning: packages 'ggplot2', 'car' are in use and will not be installed

## Installing packages into 'C:/Users/Hp Victus/AppData/Local/R/win-library/4.3'
## (as 'lib' is unspecified)

## Warning: packages 'statsmodels', 'MASS' are not available for this version of R
## 
## Versions of these packages for your version of R might be available elsewhere,
## see the ideas at
## https://cran.r-project.org/doc/manuals/r-patched/R-admin.html#Installing-packages

#Loading the required libraries
library(ggplot2)
library(car)
library(MASS)

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

#Descriptive statistics for the relevant variables
summary(my_data1[, c("Mean Renewable Energy Use", "income", "region", "Mean GDP per Capita")])

##  Mean Renewable Energy Use    income             region         
##  Min.   : 0.000            Length:198         Length:198        
##  1st Qu.: 6.525            Class :character   Class :character  
##  Median :21.640            Mode  :character   Mode  :character  
##  Mean   :29.575                                                 
##  3rd Qu.:47.350                                                 
##  Max.   :96.380                                                 
##  Mean GDP per Capita
##  Min.   :   270     
##  1st Qu.:  2173     
##  Median :  6268     
##  Mean   : 15351     
##  3rd Qu.: 18933     
##  Max.   :107182

#Checking for missing values
colSums(is.na(my_data1[, c("Mean Renewable Energy Use", "income", "region", "Mean GDP per Capita")]))

## Mean Renewable Energy Use                    income                    region 
##                         0                         0                         0 
##       Mean GDP per Capita 
##                         0

#categorical variables set as factors
my_data1$income <- as.factor(my_data1$income)
my_data1$region <- as.factor(my_data1$region)

#Fit the linear regression model
model <- lm(`Mean Renewable Energy Use` ~ income + region + `Mean GDP per Capita`, data = my_data1)

#Print the summary of the model
summary(model)

## 
## Call:
## lm(formula = `Mean Renewable Energy Use` ~ income + region + 
##     `Mean GDP per Capita`, data = my_data1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -41.112 -10.637  -2.337  11.297  61.108 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       9.846e+00  5.183e+00   1.899  0.05905 .  
## incomeLow income                  3.951e+01  6.558e+00   6.024 8.86e-09 ***
## incomeLower middle income         2.047e+01  5.041e+00   4.060 7.20e-05 ***
## incomeUpper middle income         1.072e+00  4.352e+00   0.246  0.80562    
## regionEurope & Central Asia       1.005e+01  4.279e+00   2.348  0.01993 *  
## regionLatin America & Caribbean   7.289e+00  4.417e+00   1.650  0.10060    
## regionMiddle East & North Africa -1.741e+01  5.244e+00  -3.321  0.00108 ** 
## regionNorth America               4.861e-01  1.168e+01   0.042  0.96685    
## regionSouth Asia                  1.073e+01  7.303e+00   1.470  0.14332    
## regionSub-Saharan Africa          2.606e+01  4.858e+00   5.364 2.39e-07 ***
## `Mean GDP per Capita`             1.508e-05  1.016e-04   0.148  0.88218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 17.99 on 187 degrees of freedom
## Multiple R-squared:  0.6052, Adjusted R-squared:  0.5841 
## F-statistic: 28.67 on 10 and 187 DF,  p-value: < 2.2e-16

#Checking liniarity
# Plot residuals vs fitted values
plot(model$fitted.values, resid(model), 
     xlab = "Fitted Values", 
     ylab = "Residuals", 
     main = "Residuals vs Fitted Values")
abline(h = 0, col = "red")

# Q-Q plot to check normality of residuals
qqnorm(resid(model))
qqline(resid(model), col = "red")

#Levene's Test for Homogeneity of Variance
leveneTest(resid(model) ~ income, data = my_data1)

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   3  1.6036 0.1899
##       194

#The p-value from Levene’s Test is 0.1899, which is greater than 0.05. This suggests that there is no evidence to reject the null hypothesis of equal variances. Therefore, we can assume that the variances of Mean Renewable Energy Use across the income groups are homogeneous, meeting one of the key assumptions for linear regression.

# Calculate VIF for multicollinearity check
vif(model)

##                           GVIF Df GVIF^(1/(2*Df))
## income                4.350680  3        1.277692
## region                2.711144  6        1.086666
## `Mean GDP per Capita` 2.695896  1        1.641918

#ncome GVIF = 4.350680:1.277 ; Region GVIF = 2.711144:1.086 ; Mean GDP per Capita VIF = 2.695896: 1.641 ; With these results we conclude that all values are below concering levels as for “GVIF” the threshold was square root of 5 = 2.236. This indicates no multicoliniarity issues.

# Diagnostic plots for the linear model to visualize the findings
par(mfrow = c(2, 2))  # Set plot layout for 4 plots
plot(model)

#interpreting descriptive statistics: #Interpretation of the Multiple Linear Regression Results: R-squared = 0.605: This means that approximately 60.5% of the variance in Mean Renewable Energy Use is explained by the model. #Income Levels: Low income (p < 0.001): The estimate (39.51) indicates that low-income countries have significantly higher renewable energy use compared to the reference group (high-income countries). Lower middle income (p < 0.001): These countries have, on average, 20.47 units more renewable energy use compared to high-income countries. Upper middle income is not explained because it is not significant. #Regions: Europe & Central Asia (p = 0.0199): Countries in this region tend to use 10.05 more units of renewable energy on average compared to the reference region. Middle East & North Africa (p = 0.0011): These countries have significantly lower renewable energy use (by 17.41 units) compared to the reference region. Sub-Saharan Africa (p < 0.001): This region has significantly higher renewable energy use, with an estimate of 26.06 units more than the reference region. (reference region being East Asia & Pacific)

#Mean GDP per Capita (p = 0.882): The coefficient is very small and not statistically significant, indicating that GDP per capita does not have a meaningful linear effect on renewable energy use in this model.

#Final conclusion and answer to the research question: The regression analysis shows a clear correlation between income levels, regions, and renewable energy consumption. Specifically, lower-income countries and certain regions (Sub-Saharan Africa, Europe & Central Asia) tend to consume more renewable energy compared to wealthier nations and other regions. These findings suggest that the adoption of renewable energy is influenced by a combination of economic classification and geographic factors, rather than GDP per capita alone.

RMT HomeworkR

Burim Fusha

2024-09-04

R Markdown

Including Plots