library(tidyverse)
library(janitor)

Question

Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Data description

The 2022 Environmental Performance Index (EPI) ranks 180 countries on 40 performance indicators covering climate change, environmental public health, and ecosystem vitality, the EPI constitutes the world’s leading analysis of country-level sustainability trends.

Country rankings are grounded in the best available data from international organizations and research centers around the world that have been carefully analyzed by Yale and Columbia researchers.

Data was gathered from: Environmental Performance Index Site

# load data
df <- read_csv("/Users/justinwilliams/Documents/CUNY SPS/605/discussions/data/epi2022results05302022.csv") %>% 
  clean_names()
## Rows: 180 Columns: 279
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): iso, country, region
## dbl (276): code, EPI.new, HLT.new, AIR.new, HAD.new, PMD.new, OZD.new, NOE.n...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Target variable - epi_new

Dichotomous variable - region = Global West as boolean

Interaction variable - n2o_nox = n2o * nox

Quadratic variable - air2 = air^2

Drop columns

I want to drop any columns besides those with new suffix.

# drop old
df <- df %>% 
  select(-matches("_change|_old|_rnk"))

Ok, now 59 columns remain.

Rename columns

I want to get rid of the suffix _new from the names of the remaining variables.

# rename
df <- df %>% 
    rename_at(vars(ends_with("_new")), ~ str_remove(., "_new"))

Missing Values

Probably will have to drop columns with missing values.

# create missing value list
(na_cols <- df %>%
  select_if(~ any(is.na(.))) %>% 
  names())
##  [1] "ocp" "mpa" "shi" "spi" "ecs" "tcl" "grl" "wtl" "fsh" "fss" "rms" "ftd"
## [13] "spu" "wrs" "wwt" "fga" "lcb"

Ok, 17 columns have missing values, so I will drop them.

# drop cols with missing
df <- df %>% 
  select(-all_of(na_cols))

Ok, now 42 columns left.

Create Quadratic

df <- df %>% 
  mutate(air2 = sqrt(air))

Create Boolean

# create bool
df <- df %>% 
  mutate(global_west = if_else(region == "Global West",1,0))

Create interaction variable

As \(N_2O\) has the longest average lifetime in the atmosphere, I will multiply \(N_2O\) growth rate from the climate change mitigation issue category by \(NO_x\) exposure from air quality. \(NO_x\) is the common term for the more reactive nitrogen oxides, and includes \(N_2O\) underneath its umbrella.

df <- df %>% 
  mutate(n2o_nox = nda * noe)

Ok, now we are ready to begin modeling.

Create Model

# create model
lm1 <- lm(epi ~ global_west + n2o_nox + hlt + air + had + pmd + ozd +
     noe + soe + coe + voe + h2o + usd + uwd + hmt + pbd + wmg +
     msw + rec + eco + bdh + tbn + tbg + par + bhv + acd + sda +
     nxa + agr + snm + pcc + cch + cda + cha + nda + bca + ghn +
     gib + ghp, data = df)

summary(lm1)
## 
## Call:
## lm(formula = epi ~ global_west + n2o_nox + hlt + air + had + 
##     pmd + ozd + noe + soe + coe + voe + h2o + usd + uwd + hmt + 
##     pbd + wmg + msw + rec + eco + bdh + tbn + tbg + par + bhv + 
##     acd + sda + nxa + agr + snm + pcc + cch + cda + cha + nda + 
##     bca + ghn + gib + ghp, data = df)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.063407 -0.019354  0.000721  0.021210  0.056145 
## 
## Coefficients: (2 not defined because of singularities)
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.722e-03  2.799e-02   0.133 0.894410    
## global_west  6.401e-03  1.411e-02   0.454 0.650825    
## n2o_nox      7.443e-07  4.410e-06   0.169 0.866218    
## hlt         -6.406e-02  7.686e-02  -0.833 0.405991    
## air          1.588e-01  7.935e-02   2.001 0.047251 *  
## had         -5.020e-03  3.184e-02  -0.158 0.874967    
## pmd         -6.535e-03  3.940e-02  -0.166 0.868507    
## ozd         -3.660e-04  3.759e-03  -0.097 0.922570    
## noe         -5.982e-04  3.806e-03  -0.157 0.875339    
## soe         -3.814e-04  1.518e-03  -0.251 0.801966    
## coe         -4.859e-04  1.558e-03  -0.312 0.755572    
## voe          6.776e-05  1.517e-03   0.045 0.964436    
## h2o          3.049e-02  7.523e-02   0.405 0.685880    
## usd          1.440e-02  3.001e-02   0.480 0.632079    
## uwd          2.079e-02  4.508e-02   0.461 0.645338    
## hmt          2.635e-02  7.693e-03   3.425 0.000803 ***
## pbd                 NA         NA      NA       NA    
## wmg          2.722e-02  7.613e-03   3.575 0.000479 ***
## msw         -3.575e-04  5.580e-04  -0.641 0.522825    
## rec         -6.196e-06  3.259e-04  -0.019 0.984860    
## eco          4.187e-01  5.877e-04 712.328  < 2e-16 ***
## bdh          1.021e-03  5.143e-04   1.985 0.049083 *  
## tbn          2.183e-04  2.493e-04   0.876 0.382556    
## tbg         -4.838e-04  2.455e-04  -1.971 0.050678 .  
## par          1.523e-05  1.698e-04   0.090 0.928674    
## bhv          3.470e-05  2.634e-04   0.132 0.895403    
## acd          1.979e-02  7.542e-02   0.262 0.793343    
## sda         -9.957e-03  3.770e-02  -0.264 0.792086    
## nxa         -9.711e-03  3.772e-02  -0.257 0.797206    
## agr         -2.254e-05  2.756e-04  -0.082 0.934935    
## snm         -1.452e-04  2.654e-04  -0.547 0.585162    
## pcc          3.832e-01  1.884e-03 203.415  < 2e-16 ***
## cch                 NA         NA      NA       NA    
## cda         -1.473e-03  7.023e-04  -2.098 0.037711 *  
## cha         -1.510e-04  2.196e-04  -0.688 0.492653    
## nda         -1.496e-04  2.105e-04  -0.710 0.478610    
## bca          5.211e-05  1.458e-04   0.358 0.721246    
## ghn         -1.365e-03  7.185e-04  -1.900 0.059438 .  
## gib         -8.933e-05  1.584e-04  -0.564 0.573581    
## ghp         -5.667e-05  1.746e-04  -0.325 0.745895    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03018 on 142 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 8.035e+05 on 37 and 142 DF,  p-value: < 2.2e-16

Well we have a 1 for \(R2\), which makes sense since epi target was derived from these scores, lets remove everything except the significant ones and see what happens.

# model significant
lm2 <- lm(epi ~ air + hmt + wmg + eco + bdh + tbg + pcc + ghn, data = df)

summary(lm2)
## 
## Call:
## lm(formula = epi ~ air + hmt + wmg + eco + bdh + tbg + pcc + 
##     ghn, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17460 -0.36264 -0.02662  0.32088  1.49993 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.333960   0.182299  -1.832 0.068701 .  
## air          0.135429   0.003696  36.642  < 2e-16 ***
## hmt          0.026144   0.003340   7.827 4.98e-13 ***
## wmg          0.044159   0.003415  12.932  < 2e-16 ***
## eco          0.439839   0.007648  57.511  < 2e-16 ***
## bdh         -0.023882   0.006019  -3.968 0.000107 ***
## tbg          0.009444   0.003176   2.973 0.003370 ** 
## pcc          0.403857   0.005173  78.075  < 2e-16 ***
## ghn         -0.018413   0.002690  -6.845 1.31e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.539 on 171 degrees of freedom
## Multiple R-squared:  0.9982, Adjusted R-squared:  0.9981 
## F-statistic: 1.163e+04 on 8 and 171 DF,  p-value: < 2.2e-16

Ok, 99% ofepi is explained by these 8 variables.

Let’s take a look at the residuals.

plot(lm2$fitted.values, lm2$residuals, xlab="Fitted Values", ylab="Residuals", main="Residuals vs. Fitted")
abline(h=0)

An the Q-Q plot

qqnorm(lm2$residuals)
qqline(lm2$residuals)

Histogram of residuals

hist(lm2$residuals)

The residuals distribution looks pretty normals too me, a few outliers are observed in the diagnostic plots.