library(tidyverse)
library(janitor)
Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
The 2022 Environmental Performance Index (EPI) ranks 180 countries on 40 performance indicators covering climate change, environmental public health, and ecosystem vitality, the EPI constitutes the world’s leading analysis of country-level sustainability trends.
Country rankings are grounded in the best available data from international organizations and research centers around the world that have been carefully analyzed by Yale and Columbia researchers.
Data was gathered from: Environmental Performance Index Site
# load data
df <- read_csv("/Users/justinwilliams/Documents/CUNY SPS/605/discussions/data/epi2022results05302022.csv") %>%
clean_names()
## Rows: 180 Columns: 279
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): iso, country, region
## dbl (276): code, EPI.new, HLT.new, AIR.new, HAD.new, PMD.new, OZD.new, NOE.n...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Target variable - epi_new
Dichotomous variable - region
= Global
West as boolean
Interaction variable - n2o_nox
= n2o *
nox
Quadratic variable - air2
= air^2
I want to drop any columns besides those with new
suffix.
# drop old
df <- df %>%
select(-matches("_change|_old|_rnk"))
Ok, now 59
columns remain.
I want to get rid of the suffix _new
from the names of
the remaining variables.
# rename
df <- df %>%
rename_at(vars(ends_with("_new")), ~ str_remove(., "_new"))
Probably will have to drop columns with missing values.
# create missing value list
(na_cols <- df %>%
select_if(~ any(is.na(.))) %>%
names())
## [1] "ocp" "mpa" "shi" "spi" "ecs" "tcl" "grl" "wtl" "fsh" "fss" "rms" "ftd"
## [13] "spu" "wrs" "wwt" "fga" "lcb"
Ok, 17
columns have missing values, so I will drop
them.
# drop cols with missing
df <- df %>%
select(-all_of(na_cols))
Ok, now 42
columns left.
df <- df %>%
mutate(air2 = sqrt(air))
# create bool
df <- df %>%
mutate(global_west = if_else(region == "Global West",1,0))
As \(N_2O\) has the longest average lifetime in the atmosphere, I will multiply \(N_2O\) growth rate from the climate change mitigation issue category by \(NO_x\) exposure from air quality. \(NO_x\) is the common term for the more reactive nitrogen oxides, and includes \(N_2O\) underneath its umbrella.
df <- df %>%
mutate(n2o_nox = nda * noe)
Ok, now we are ready to begin modeling.
# create model
lm1 <- lm(epi ~ global_west + n2o_nox + hlt + air + had + pmd + ozd +
noe + soe + coe + voe + h2o + usd + uwd + hmt + pbd + wmg +
msw + rec + eco + bdh + tbn + tbg + par + bhv + acd + sda +
nxa + agr + snm + pcc + cch + cda + cha + nda + bca + ghn +
gib + ghp, data = df)
summary(lm1)
##
## Call:
## lm(formula = epi ~ global_west + n2o_nox + hlt + air + had +
## pmd + ozd + noe + soe + coe + voe + h2o + usd + uwd + hmt +
## pbd + wmg + msw + rec + eco + bdh + tbn + tbg + par + bhv +
## acd + sda + nxa + agr + snm + pcc + cch + cda + cha + nda +
## bca + ghn + gib + ghp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063407 -0.019354 0.000721 0.021210 0.056145
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.722e-03 2.799e-02 0.133 0.894410
## global_west 6.401e-03 1.411e-02 0.454 0.650825
## n2o_nox 7.443e-07 4.410e-06 0.169 0.866218
## hlt -6.406e-02 7.686e-02 -0.833 0.405991
## air 1.588e-01 7.935e-02 2.001 0.047251 *
## had -5.020e-03 3.184e-02 -0.158 0.874967
## pmd -6.535e-03 3.940e-02 -0.166 0.868507
## ozd -3.660e-04 3.759e-03 -0.097 0.922570
## noe -5.982e-04 3.806e-03 -0.157 0.875339
## soe -3.814e-04 1.518e-03 -0.251 0.801966
## coe -4.859e-04 1.558e-03 -0.312 0.755572
## voe 6.776e-05 1.517e-03 0.045 0.964436
## h2o 3.049e-02 7.523e-02 0.405 0.685880
## usd 1.440e-02 3.001e-02 0.480 0.632079
## uwd 2.079e-02 4.508e-02 0.461 0.645338
## hmt 2.635e-02 7.693e-03 3.425 0.000803 ***
## pbd NA NA NA NA
## wmg 2.722e-02 7.613e-03 3.575 0.000479 ***
## msw -3.575e-04 5.580e-04 -0.641 0.522825
## rec -6.196e-06 3.259e-04 -0.019 0.984860
## eco 4.187e-01 5.877e-04 712.328 < 2e-16 ***
## bdh 1.021e-03 5.143e-04 1.985 0.049083 *
## tbn 2.183e-04 2.493e-04 0.876 0.382556
## tbg -4.838e-04 2.455e-04 -1.971 0.050678 .
## par 1.523e-05 1.698e-04 0.090 0.928674
## bhv 3.470e-05 2.634e-04 0.132 0.895403
## acd 1.979e-02 7.542e-02 0.262 0.793343
## sda -9.957e-03 3.770e-02 -0.264 0.792086
## nxa -9.711e-03 3.772e-02 -0.257 0.797206
## agr -2.254e-05 2.756e-04 -0.082 0.934935
## snm -1.452e-04 2.654e-04 -0.547 0.585162
## pcc 3.832e-01 1.884e-03 203.415 < 2e-16 ***
## cch NA NA NA NA
## cda -1.473e-03 7.023e-04 -2.098 0.037711 *
## cha -1.510e-04 2.196e-04 -0.688 0.492653
## nda -1.496e-04 2.105e-04 -0.710 0.478610
## bca 5.211e-05 1.458e-04 0.358 0.721246
## ghn -1.365e-03 7.185e-04 -1.900 0.059438 .
## gib -8.933e-05 1.584e-04 -0.564 0.573581
## ghp -5.667e-05 1.746e-04 -0.325 0.745895
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.03018 on 142 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 8.035e+05 on 37 and 142 DF, p-value: < 2.2e-16
Well we have a 1 for \(R2\), which
makes sense since epi
target was derived from these scores,
lets remove everything except the significant ones and see what
happens.
# model significant
lm2 <- lm(epi ~ air + hmt + wmg + eco + bdh + tbg + pcc + ghn, data = df)
summary(lm2)
##
## Call:
## lm(formula = epi ~ air + hmt + wmg + eco + bdh + tbg + pcc +
## ghn, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17460 -0.36264 -0.02662 0.32088 1.49993
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.333960 0.182299 -1.832 0.068701 .
## air 0.135429 0.003696 36.642 < 2e-16 ***
## hmt 0.026144 0.003340 7.827 4.98e-13 ***
## wmg 0.044159 0.003415 12.932 < 2e-16 ***
## eco 0.439839 0.007648 57.511 < 2e-16 ***
## bdh -0.023882 0.006019 -3.968 0.000107 ***
## tbg 0.009444 0.003176 2.973 0.003370 **
## pcc 0.403857 0.005173 78.075 < 2e-16 ***
## ghn -0.018413 0.002690 -6.845 1.31e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.539 on 171 degrees of freedom
## Multiple R-squared: 0.9982, Adjusted R-squared: 0.9981
## F-statistic: 1.163e+04 on 8 and 171 DF, p-value: < 2.2e-16
Ok, 99% ofepi
is explained by these 8 variables.
Let’s take a look at the residuals.
plot(lm2$fitted.values, lm2$residuals, xlab="Fitted Values", ylab="Residuals", main="Residuals vs. Fitted")
abline(h=0)
An the Q-Q plot
qqnorm(lm2$residuals)
qqline(lm2$residuals)
Histogram of residuals
hist(lm2$residuals)
The residuals distribution looks pretty normals too me, a few outliers are observed in the diagnostic plots.