This is a subset of a project that I worked on as part of a group.
Climate is an essential aspect of life on Earth, influencing vegetation, wildlife, and humans in different ways. It affects ecosystems, biodiversity, and economic growth. Life has thrived in certain climatic zones, while in harsher climates humans, animals, and plants alike have had to adapt.
In this post, I will walk through the analysis that my colleagues and I did, specifically relating to happiness and climate.
But before that, credit where credit is due -
Thank you Rahul, Maxim, Max & Ryan
First, let’s import all the necessary packages.
library(countrycode)
library(readxl)
library(wpp2019)
library(wbstats)
library(dplyr)
library(rvest)
library(ggplot2)
library(ggfortify)
library(car)
library(pedometrics)
library(plotly)
Multiple data sources of interest were identified, such as: [Wikipedia](https://en.wikipedia.org/wiki/Main_Page), the [Global Residence Index website](https://globalresidenceindex.com/), the [World Health Organization website](https://www.who.int/data/gho/data/indicators), and the [wbstats World Bank Data R package](https://www.rdocumentation.org/packages/wbstats/versions/0.2).
To gather the data from these sources we used three different techniques: downloading files directly, scraping web pages (with rvest), and querying APIs through R packages (such as wbstats). For the analysis shown here, the climate data was downloaded from the Global Residence Index website as an Excel file, and the happiness data was downloaded from Kaggle as a CSV file.
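As a minimal sketch of the web-scraping technique using the rvest package loaded above (the Wikipedia page and the table position here are illustrative assumptions, not the exact sources we scraped), a country-level table could be pulled like this:
# Sketch only: scrape the first HTML table from an illustrative Wikipedia page
page <- read_html("https://en.wikipedia.org/wiki/List_of_countries_by_average_annual_precipitation")
precip <- page %>% html_nodes("table") %>% .[[1]] %>% html_table(fill = TRUE)
head(precip)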
Once the data was downloaded and read into data frames, we noticed that country names could differ between the various sources. To address this, we used an R package called countrycode. This package recognizes the various country-name aliases and assigns a three-letter ISO code that can then be used as a common join key across the data frames.
The data is then stored in the weather and happiness data frames, as shown in the code below.
src <- "https://globalresidenceindex.com/wp-content/uploads/2017/10/STC-Climate-18.xlsx"
lcl <- basename(src)
download.file(url = src, destfile = lcl,mode='wb')
weather <- as.data.frame(read_excel(lcl))
c.code<-countrycode(weather[,2],"country.name","iso3c")
weather<-cbind(weather,c.code)
happiness<-read.csv(file='hapind2020.csv')
hapdata<-happiness[c(1:3,7)]
c.code<-countrycode(hapdata[,1],"country.name","iso3c")
hapdata<-cbind(hapdata,c.code)
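As a quick illustration of the alias handling (a sanity check, not part of the original pipeline), different spellings of the same country all resolve to one ISO3 code:
# Different aliases of the same country map to a single three-letter code
countrycode(c("United States", "United States of America", "USA"),
            origin = "country.name", destination = "iso3c")
# Expected: "USA" "USA" "USA"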
After identifying the parameters that could influence human well-being, we generated data subsets for each of these factors. The first step was to extract the corresponding climate data; from the weather data set we pulled the following columns:
| Predictor | Column in `weather` |
|---|---|
| Index | 4 |
| Days above 32 °C | 6 |
| Annual Precipitation (mm) | 10 |
| Precipitation Days per Year | 11 |
| Heating Degree Days | 13 |
| Sunshine Hours | 15 |
Once that was done, we combined the climate data (predictor variables) with the happiness data (response variable) into one data frame for easier analysis.
# Select the country code plus the six climate predictors from the weather data
variables <- weather[, c(16, 4, 6, 10, 11, 13, 15)]
# Join the happiness score onto the climate predictors by country code,
# keeping only complete cases
data.hap <- right_join(hapdata[, c(5, 3)], variables, by = "c.code")
data.hap <- data.hap[complete.cases(data.hap), ]
colnames(data.hap) <- c("c.code", "Response", "Index", "Days32",
                        "Precipitation.Annual", "Precipitation.Days",
                        "Heating.Days", "Sunshine.Hours")
Now let’s do some analysis on this. First, let’s build a linear regression model that includes all the predictor variables. We can also do a visual check of the pairwise relationships.
# Building linear regression model with all predictor variables (except country code)
hap <- lm(Response ~ . -c.code, data = data.hap)
summary(hap)
##
## Call:
## lm(formula = Response ~ . - c.code, data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.20186 -0.63308 -0.03538 0.65263 1.72746
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.945e+00 1.260e+00 3.130 0.002090 **
## Index 1.160e+00 1.287e+00 0.901 0.368808
## Days32 1.078e-03 3.188e-03 0.338 0.735567
## Precipitation.Annual -1.957e-04 2.335e-04 -0.838 0.403283
## Precipitation.Days 7.508e-03 1.896e-03 3.961 0.000114 ***
## Heating.Days 1.698e-04 1.035e-04 1.640 0.103044
## Sunshine.Hours 7.003e-05 1.814e-04 0.386 0.700010
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8804 on 154 degrees of freedom
## Multiple R-squared: 0.2299, Adjusted R-squared: 0.1999
## F-statistic: 7.663 on 6 and 154 DF, p-value: 3.255e-07
# Visual check of the pairwise relationships between the variables and the response
plot(data.hap, pch = 15, col = "blue")
With an adjusted R² of about 0.20, the linear model is not a good fit. The next step is to try some higher-order terms.
# After checking for model accuracy we have decided to include higher order terms
# in order to generate a better fit of the model
hap <- lm(Response ~ Precipitation.Days + poly(Heating.Days, 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
summary(hap)
##
## Call:
## lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
## 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.54046 -0.01811 0.59978 1.57403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
## Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
## poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
## poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
## poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
## poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
## poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
## poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
## poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
## poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
## poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
## poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
## poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
## poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8538 on 147 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
## F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
We can now see that the adjusted R² has improved slightly, to 0.2476. Let’s now run some of the classical regression diagnostics: normality of the residuals, heteroskedasticity, multicollinearity, and outliers.
t1 <- list(family = "Arial, sans-serif", size = 13, color = "black")
t2 <- list(family = "Arial, sans-serif", size = 12, color = "black")
# Check normality of errors
fig7 <- plot_ly(x = ~hap$residuals, type = "histogram",
                marker = list(color = "aquamarine"))
fig7 <- fig7 %>% layout(
  title = "Histogram of Residuals - Happiness Scores",
  yaxis = list(title = "Frequency", titlefont = t1, tickfont = t2,
               showgrid = F, showline = T),
  xaxis = list(title = "Residuals - Happiness Scores", titlefont = t1,
               tickfont = t2, showgrid = F, showline = T))
fig7
autoplot(hap, which = 2, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Q-Q Plot for Happiness Scores")
# Check for Heteroskedasticity
for (i in 2:length(data.hap)){
col.names <- colnames(data.hap)
p <- qplot(data.hap[[i]], hap$residuals,color=I("aquamarine")) +
xlab(col.names[i]) + ylab('Residuals') +
labs(title="Happiness Scores Residual vs. Predictor") + theme_bw()
print(p)
Sys.sleep(2)
}
# Checking for Multicollinearity
vif(hap)
## GVIF Df GVIF^(1/(2*Df))
## Precipitation.Days 1.833112 1 1.353925
## poly(Heating.Days, 4) 4.873237 4 1.218926
## poly(Index, 4) 4.138210 4 1.194267
## poly(Precipitation.Annual, 4) 4.922032 4 1.220445
hap <- stepVIF(hap, threshold = 5, verbose = TRUE)
## all predictor variables have a VIF lower than the threshold
summary(hap)
##
## Call:
## lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
## 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.22987 -0.54046 -0.01811 0.59978 1.57403
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
## Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
## poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
## poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
## poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
## poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
## poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
## poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
## poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
## poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
## poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
## poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
## poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
## poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8538 on 147 degrees of freedom
## Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
## F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
# Checking for outliers
p <- hap$rank - 1 # Calculate number of predictors in the model
n <- nrow(data.hap) # Number of observations in the dataset
autoplot(hap, which = 4, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Cook's Distance - Happiness")
hap.cooks <- as.vector(cooks.distance(hap))
hap.outl1 <- which(hap.cooks > 1)
hap.outl1 # No such outliers exist
## integer(0)
hap.outl2 <- which(hap.cooks > qf(0.50, p + 1, n - p - 1))
hap.outl2 # No such outliers exist
## integer(0)
From our regression results, we see only a limited relationship between the selected climate predictors and the happiness response. The results generally improved when higher-order terms were introduced, but an adjusted R² of about 0.25 is not indicative of a strong relationship. To build a more robust happiness predictor, we would likely have to look at factors outside of climate.
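For example, socio-economic indicators could be brought in through the wbstats package loaded earlier. A minimal sketch, assuming the wb() interface from the wbstats 0.2 release linked above and the World Bank GDP-per-capita indicator NY.GDP.PCAP.CD as an illustrative extra predictor, might look like this:
# Sketch only: fetch GDP per capita (current US$) and join it by country code
# (column names assumed from the wbstats 0.2 return format)
gdp <- wb(indicator = "NY.GDP.PCAP.CD", startdate = 2019, enddate = 2019)
gdp <- gdp[, c("iso3c", "value")]
colnames(gdp) <- c("c.code", "GDP.per.capita")
data.ext <- left_join(data.hap, gdp, by = "c.code")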
This exercise is really meant to showcase the workflow for evaluating the relationship between different parameters.