This is a subset of a project that I worked on as part of a group.
Climate is an essential aspect of life on Earth, influencing vegetation, wildlife, and humans in different ways. It affects ecosystems, biodiversity, and economic growth. Life has thrived in certain climatic zones, while in harsher climates humans, animals, and plants alike have had to adapt.
In this post, I will walk through the analysis that my colleagues and I did, specifically relating to happiness and climate.
But before that, credit where credit is due -
Thank you Rahul, Maxim, Max & Ryan
First, let’s import all the necessary packages.
library(countrycode)
library(readxl)
library(wpp2019)
library(wbstats)
library(dplyr)
library(rvest)
library(ggplot2)
library(ggfortify)
library(car)
library(pedometrics)
library(plotly)
Multiple data sources of interest were identified, such as: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page), the [Global Residence Index website](https://globalresidenceindex.com/), the [World Health Organization website](https://www.who.int/data/gho/data/indicators), and the [wbstats World Bank Data R package](https://www.rdocumentation.org/packages/wbstats/versions/0.2).
To gather the data from these sources we used three different techniques: downloading files directly, scraping web pages (with rvest), and querying APIs through R packages (such as wbstats). For the analysis shown here, the climate data was downloaded from the Global Residence Index website as an Excel file, and the happiness data was downloaded from Kaggle as a CSV file.
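As a minimal sketch of the web-scraping technique using the rvest package loaded above (the Wikipedia page and the table position here are illustrative assumptions, not the exact sources we scraped), a country-level table could be pulled like this:
# Sketch only: scrape the first HTML table from an illustrative Wikipedia page
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_average_annual_precipitation")
precip <- page %>% html_nodes("table") %>% .[[1]] %>% html_table(fill = TRUE)
head(precip)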
Once the data was downloaded and read into data frames, we noticed that country names could differ between the various sources. To address this, we used an R package called countrycode. This package recognizes the various country-name aliases and assigns a three-letter ISO code that can then be used as a common join key across the data frames.
The data is then stored in the weather and happiness data frames, as shown in the code below.
src <- "https://globalresidenceindex.com/wp-content/uploads/2017/10/STC-Climate-18.xlsx"
lcl <- basename(src)
download.file(url = src, destfile = lcl,mode='wb')
weather <- as.data.frame(read_excel(lcl))
c.code<-countrycode(weather[,2],"country.name","iso3c")
weather<-cbind(weather,c.code)
happiness<-read.csv(file='hapind2020.csv')
hapdata<-happiness[c(1:3,7)]
c.code<-countrycode(hapdata[,1],"country.name","iso3c")
hapdata<-cbind(hapdata,c.code)
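As a quick illustration of the alias handling (a sanity check, not part of the original pipeline), different spellings of the same country all resolve to one ISO3 code:
# Different aliases of the same country map to a single three-letter code
countrycode(c("United States", "United States of America", "USA"),
            origin = "country.name", destination = "iso3c")
# Expected: "USA" "USA" "USA"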
After identifying the parameters that could influence human well-being, we generated data subsets for each of these factors. The first step was to extract the corresponding climate data; from the weather data set we pulled the following columns:
| Predictor | Column in `weather` |
|---|---|
| Index | 4 |
| Days above 32 °C | 6 |
| Annual Precipitation (mm) | 10 |
| Precipitation Days per Year | 11 |
| Heating Degree Days | 13 |
| Sunshine Hours | 15 |
Once that was done, we combined the climate data (predictor variables) with the happiness data (response variable) into one data frame for easier analysis.
# Select the country code plus the six climate predictors from the weather data
variables <- weather[, c(16, 4, 6, 10, 11, 13, 15)]
# Join the happiness score onto the climate predictors by country code,
# keeping only complete cases
data.hap <- right_join(hapdata[, c(5, 3)], variables, by = "c.code")
data.hap <- data.hap[complete.cases(data.hap), ]
colnames(data.hap) <- c("c.code", "Response", "Index", "Days32",
                        "Precipitation.Annual", "Precipitation.Days",
                        "Heating.Days", "Sunshine.Hours")
Now let’s do some analysis on this. First, let’s build a linear regression model that includes all the predictor variables. We can also do a visual check of the pairwise relationships.
# Building linear regression model with all predictor variables (except country code)
hap <- lm(Response ~ . -c.code, data = data.hap)
summary(hap)
##
## Call:
## lm(formula = Response ~ . - c.code, data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20186 -0.63308 -0.03538 0.65263 1.72746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.945e+00 1.260e+00 3.130 0.002090 **
## Index 1.160e+00 1.287e+00 0.901 0.368808
## Days32 1.078e-03 3.188e-03 0.338 0.735567
## Precipitation.Annual -1.957e-04 2.335e-04 -0.838 0.403283
## Precipitation.Days 7.508e-03 1.896e-03 3.961 0.000114 ***
## Heating.Days 1.698e-04 1.035e-04 1.640 0.103044
## Sunshine.Hours 7.003e-05 1.814e-04 0.386 0.700010
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8804 on 154 degrees of freedom
## Multiple R-squared: 0.2299, Adjusted R-squared: 0.1999
## F-statistic: 7.663 on 6 and 154 DF, p-value: 3.255e-07
# Visual check of the pairwise relationships between the variables and the response
plot(data.hap, pch = 15, col = "blue")
With an adjusted R² of about 0.20, the linear model is not a good fit. The next step is to try some higher-order terms.
# After checking for model accuracy we have decided to include higher order terms
# in order to generate a better fit of the model
hap <- lm(Response ~ Precipitation.Days + poly(Heating.Days, 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
summary(hap)
##
## Call:
## lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
## 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.54046 -0.01811 0.59978 1.57403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
## Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
## poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
## poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
## poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
## poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
## poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
## poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
## poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
## poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
## poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
## poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
## poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
## poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8538 on 147 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
## F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
We can now see that the adjusted R² has improved slightly, to 0.2476. Let’s now run some of the classical regression diagnostics: normality of the residuals, heteroskedasticity, multicollinearity, and outliers.
t1 <- list(family = "Arial, sans-serif", size = 13, color = "black")
t2 <- list(family = "Arial, sans-serif", size = 12, color = "black")
# Check normality of errors
fig7 <- plot_ly(x = ~hap$residuals, type = "histogram",
                marker = list(color = "aquamarine"))
fig7 <- fig7 %>% layout(
  title = "Histogram of Residuals - Happiness Scores",
  yaxis = list(title = "Frequency", titlefont = t1, tickfont = t2,
               showgrid = F, showline = T),
  xaxis = list(title = "Residuals - Happiness Scores", titlefont = t1,
               tickfont = t2, showgrid = F, showline = T))
fig7
autoplot(hap, which = 2, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Q-Q Plot for Happiness Scores")
# Check for Heteroskedasticity
for (i in 2:length(data.hap)){
col.names <- colnames(data.hap)
p <- qplot(data.hap[[i]], hap$residuals,color=I("aquamarine")) +
xlab(col.names[i]) + ylab('Residuals') +
labs(title="Happiness Scores Residual vs. Predictor") + theme_bw()
print(p)
Sys.sleep(2)
}
# Checking for Multicollinearity
vif(hap)
## GVIF Df GVIF^(1/(2*Df))
## Precipitation.Days 1.833112 1 1.353925
## poly(Heating.Days, 4) 4.873237 4 1.218926
## poly(Index, 4) 4.138210 4 1.194267
## poly(Precipitation.Annual, 4) 4.922032 4 1.220445
hap <- stepVIF(hap, threshold = 5, verbose = TRUE)
## all predictor variables have a VIF lower than the threshold
summary(hap)
##
## Call:
## lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
## 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.54046 -0.01811 0.59978 1.57403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
## Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
## poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
## poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
## poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
## poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
## poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
## poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
## poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
## poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
## poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
## poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
## poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
## poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8538 on 147 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
## F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
# Checking for outliers
p <- hap$rank - 1 # Calculate number of predictors in the model
n <- nrow(data.hap) # Number of observations in the dataset
autoplot(hap, which = 4, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Cook's Distance - Happiness")
hap.cooks <- as.vector(cooks.distance(hap))
hap.outl1 <- which(hap.cooks > 1)
hap.outl1 # No such outliers exist
## integer(0)
hap.outl2 <- which(hap.cooks > qf(0.50, p + 1, n - p - 1))
hap.outl2 # No such outliers exist
## integer(0)
From our regression results, we see only a limited relationship between the selected climate predictors and the happiness response. The results generally improved when higher-order terms were introduced, but an adjusted R² of about 0.25 is not indicative of a strong relationship. To build a more robust happiness predictor, we would likely have to look at factors outside of climate.
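For example, socio-economic indicators could be brought in through the wbstats package loaded earlier. A minimal sketch, assuming the wb() interface from the wbstats 0.2 release linked above and the World Bank GDP-per-capita indicator NY.GDP.PCAP.CD as an illustrative extra predictor, might look like this:
# Sketch only: fetch GDP per capita (current US$) and join it by country code
# (column names assumed from the wbstats 0.2 return format)
gdp <- wb(indicator = "NY.GDP.PCAP.CD", startdate = 2019, enddate = 2019)
gdp <- gdp[, c("iso3c", "value")]
colnames(gdp) <- c("c.code", "GDP.per.capita")
data.ext <- left_join(data.hap, gdp, by = "c.code")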
This exercise is really meant to showcase the workflow for evaluating the relationship between different parameters.