This is an subset of a project that I worked on as a group.
Climate is an essential aspect of life on earth that influences vegetation, wildlife and humans differently. It affects the ecosystem, biodiversity and economic growth. Life has thrived in specific climatic zones, while in other harsher climates – humans, animals, and plants alike have had to adapt.
In this project, I will go through the analysis that my colleagues & I did specifically relating to Happiness and Climate.
But before that, credit where credit is due -
Thank you Rahul, Maxim, Max & Ryan
First, let’s import all the necessary packages.
library(countrycode)
library(readxl)
library(wpp2019)
library(wbstats)
library(dplyr)
library(rvest)
library(ggplot2)
library(ggfortify)
library(car)
library(pedometrics)
library(plotly)
Multiple data sources of interest were identified, such as: [Wikipedia] (https://en.wikipedia.org/wiki/Main_Page);[Global Residence Website] (https://globalresidenceindex.com/); [World Health Organization Website] (https://www.who.int/data/gho/data/indicators); [World Bank Data R package] (https://www.rdocumentation.org/packages/wbstats/versions/0.2)
To gather the data from these sources we used three different techniques, namely:
Climate data was downloaded from Global Residence Website as Excel files. Happiness data was downloaded from Kaggle.
Once the data was downloaded and read into a dataframe, we noticed the country names between the various source could be potentially different. To address this difference, we used an R package known as countrycode. This package detects the various country aliases and assign as three-letter code which can then commonly used between the various dataframes.
The data is then stored in the weather and health dataframes as shown in the code below.
src <- "https://globalresidenceindex.com/wp-content/uploads/2017/10/STC-Climate-18.xlsx"
lcl <- basename(src)
download.file(url = src, destfile = lcl,mode='wb')
weather <- as.data.frame(read_excel(lcl))
c.code<-countrycode(weather[,2],"country.name","iso3c")
weather<-cbind(weather,c.code)
happiness<-read.csv(file='hapind2020.csv')
hapdata<-happiness[c(1:3,7)]
c.code<-countrycode(hapdata[,1],"country.name","iso3c")
hapdata<-cbind(hapdata,c.code)
variables <- weather[,c(16,4,6,10,11,13,15)]
data.hap <- right_join(hapdata[,c(5,3)], variables, by ="c.code")
data.hap <- data.hap[complete.cases(data.hap),]
colnames(data.hap) <- c("c.code", "Response", "Index", "Days32",
"Precipitation.Annual", "Precipitation.Days",
"Heating.Days", "Sunshine.Hours")
Now let’s do some analysis on this.
# Happiness ###################################################################
hap <- lm(Response ~ . -c.code, data = data.hap)
summary(hap)
Call:
lm(formula = Response ~ . - c.code, data = data.hap)
Residuals:
Min 1Q Median 3Q Max
-2.20186 -0.63308 -0.03538 0.65263 1.72746
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.945e+00 1.260e+00 3.130 0.002090 **
Index 1.160e+00 1.287e+00 0.901 0.368808
Days32 1.078e-03 3.188e-03 0.338 0.735567
Precipitation.Annual -1.957e-04 2.335e-04 -0.838 0.403283
Precipitation.Days 7.508e-03 1.896e-03 3.961 0.000114 ***
Heating.Days 1.698e-04 1.035e-04 1.640 0.103044
Sunshine.Hours 7.003e-05 1.814e-04 0.386 0.700010
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8804 on 154 degrees of freedom
Multiple R-squared: 0.2299, Adjusted R-squared: 0.1999
F-statistic: 7.663 on 6 and 154 DF, p-value: 3.255e-07
# Check for relationships of variables to response variable (normalized value)
plot(data.hap, pch = 15, col = "blue")
# After checking for model accuracy we have decided to include higher order terms
# in order to generate a better fit of the model
hap <- lm(Response ~ Precipitation.Days + poly(Heating.Days, 4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
summary(hap)
Call:
lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
Residuals:
Min 1Q Median 3Q Max
-2.22987 -0.54046 -0.01811 0.59978 1.57403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8538 on 147 degrees of freedom
Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
t1 <- list(family = "Arial, sans-serif", size = 13, color = "black")
t2 <- list(family = "Arial, sans-serif", size = 12, color = "black")
# Check normality of errors
fig7 <- plot_ly(x=~hap$residuals, type = "histogram",marker=list(color="aquamarine"))
fig7 <- fig7 %>% layout(title="Histogram of Residuals - Happiness Scores",
yaxis=list(title="Frequency",titlefont=t1,tickfont=t2,showgrid=F,showline=T),
xaxis=list(title="Residuals - Happiness Scores",titlefont=t1,tickfont=t2,showgrid=F,showline=T))
fig7
`arrange_()` is deprecated as of dplyr 0.7.0.
Please use `arrange()` instead.
See vignette('programming') for more help
This warning is displayed once every 8 hours.
Call `lifecycle::last_warnings()` to see where this warning was generated.
autoplot(hap, which = 2, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Q-Q Plot for Happiness Scores")
# Check for Heteroskedasticity
for (i in 2:length(data.hap)){
col.names <- colnames(data.hap)
p <- qplot(data.hap[[i]], hap$residuals,color=I("aquamarine")) +
xlab(col.names[i]) + ylab('Residuals') +
labs(title="Happiness Scores Residual vs. Predictor") + theme_bw()
print(p)
Sys.sleep(2)
}
# Checking for Multicollinearity
vif(hap)
GVIF Df GVIF^(1/(2*Df))
Precipitation.Days 1.833112 1 1.353925
poly(Heating.Days, 4) 4.873237 4 1.218926
poly(Index, 4) 4.138210 4 1.194267
poly(Precipitation.Annual, 4) 4.922032 4 1.220445
hap <- stepVIF(hap, threshold = 5, verbose = TRUE)
all predictor variables have a VIF lower than the threshold
summary(hap)
Call:
lm(formula = Response ~ Precipitation.Days + poly(Heating.Days,
4) + poly(Index, 4) + poly(Precipitation.Annual, 4), data = data.hap)
Residuals:
Min 1Q Median 3Q Max
-2.22987 -0.54046 -0.01811 0.59978 1.57403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.052087 0.203470 24.830 < 2e-16 ***
Precipitation.Days 0.006627 0.001597 4.151 5.59e-05 ***
poly(Heating.Days, 4)1 5.669478 1.403040 4.041 8.55e-05 ***
poly(Heating.Days, 4)2 -0.294329 0.992172 -0.297 0.76715
poly(Heating.Days, 4)3 -2.299425 0.935847 -2.457 0.01517 *
poly(Heating.Days, 4)4 1.236485 0.929496 1.330 0.18549
poly(Index, 4)1 3.216450 1.304482 2.466 0.01483 *
poly(Index, 4)2 1.485948 1.031719 1.440 0.15192
poly(Index, 4)3 -0.977148 0.902139 -1.083 0.28052
poly(Index, 4)4 -0.869950 0.901323 -0.965 0.33603
poly(Precipitation.Annual, 4)1 -1.166140 1.333884 -0.874 0.38341
poly(Precipitation.Annual, 4)2 0.504325 1.041043 0.484 0.62879
poly(Precipitation.Annual, 4)3 -3.063452 1.012060 -3.027 0.00292 **
poly(Precipitation.Annual, 4)4 0.578307 0.885782 0.653 0.51486
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8538 on 147 degrees of freedom
Multiple R-squared: 0.3087, Adjusted R-squared: 0.2476
F-statistic: 5.051 on 13 and 147 DF, p-value: 2.454e-07
# Checking for outliers
p <- hap$rank - 1 # Calculate number of predictors in the model
n <- nrow(data.hap) # Number of observations in the dataset
autoplot(hap, which = 4, ncol = 1,colour="aquamarine") + theme_bw() + labs(title="Cook's Distance - Happiness")
hap.cooks <- as.vector(cooks.distance(hap))
hap.outl1 <- which(hap.cooks > 1)
hap.outl1 # No such outliers exist
integer(0)
hap.outl2 <- which(hap.cooks > qf(0.50, p + 1, n - p - 1))
hap.outl2 # No such outliers exist
integer(0)