The url of the data is : https://en.wikipedia.org/wiki/World_Happiness_Report
Reading the data from the workspace and finish setting up the data
setwd("C:/workspace") #The data is in a folder called "workspace" in the Cdrive.
worldhappiness=read.csv("worldhappiness2018.csv",header=T)
colnames(worldhappiness) #There was an incoding error and read 'rank' as something else
## [1] "癤풰ank" "Region" "country" "score"
## [5] "gdp" "socialsupport" "healthylife" "freedom"
## [9] "generosity" "corruption"
colnames(worldhappiness)[1]="rank"
attach(worldhappiness) #for directly using column names
Now checking if there’s soemthing else to check by using str function
str(worldhappiness) ;head(worldhappiness) ;tail(worldhappiness)
## 'data.frame': 156 obs. of 10 variables:
## $ rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Region : Factor w/ 5 levels "afr","ame","asi",..: 4 4 4 4 4 4 2 5 4 5 ...
## $ country : Factor w/ 156 levels "Afghanistan",..: 45 106 38 58 134 100 26 101 133 7 ...
## $ score : num 7.63 7.59 7.55 7.5 7.49 ...
## $ gdp : num 1.3 1.46 1.35 1.34 1.42 ...
## $ socialsupport: num 1.59 1.58 1.59 1.64 1.55 ...
## $ healthylife : num 0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
## $ freedom : num 0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
## $ generosity : num 0.192 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
## $ corruption : Factor w/ 111 levels "0","0.001","0.006",..: 107 103 108 78 104 99 98 106 105 100 ...
## rank Region country score gdp socialsupport healthylife freedom
## 1 1 eur Finland 7.632 1.305 1.592 0.874 0.681
## 2 2 eur Norway 7.594 1.456 1.582 0.861 0.686
## 3 3 eur Denmark 7.555 1.351 1.590 0.868 0.683
## 4 4 eur Iceland 7.495 1.343 1.644 0.914 0.677
## 5 5 eur Swizerland 7.487 1.420 1.549 0.927 0.660
## 6 6 eur Netherlands 7.441 1.361 1.488 0.878 0.638
## generosity corruption
## 1 0.192 0.393
## 2 0.286 0.34
## 3 0.284 0.408
## 4 0.353 0.138
## 5 0.256 0.357
## 6 0.333 0.295
## rank Region country score gdp socialsupport
## 151 151 afr Rwanda 3.408 0.332 0.896
## 152 152 afr Yemen 3.355 0.442 1.073
## 153 153 afr Tanzania 3.303 0.455 0.991
## 154 154 afr South Sudan 3.254 0.337 0.608
## 155 155 afr Central African Republic 3.083 0.024 0.000
## 156 156 afr Burundi 2.905 0.091 0.627
## healthylife freedom generosity corruption
## 151 0.400 0.636 0.200 0.444
## 152 0.343 0.244 0.083 0.064
## 153 0.381 0.481 0.270 0.097
## 154 0.177 0.112 0.224 0.106
## 155 0.010 0.305 0.218 0.038
## 156 0.145 0.065 0.149 0.076
class(corruption)
## [1] "factor"
You will be able to see that the corruption score is a factor which is problematic. Change it into a numeric vector.
corruption=as.numeric(as.character(corruption)) #generates NA
## Warning: 강제형변환에 의해 생성된 NA 입니다
which(is.na(corruption)) #20
## [1] 20
We have to note that the 20th value for corruption is NA(Arab Emirates). There was originally no data from the wikipedia.
Now let’s use simple stats to grasp the situation
cor(score,gdp)
## [1] 0.8021239
cor(score,socialsupport)
## [1] 0.7457602
cor(score,healthylife)
## [1] 0.7758136
cor(score,freedom)
## [1] 0.5442799
cor(score,generosity)#NA
## [1] 0.1345187
cor(score[-20],corruption[-20]) #without the data of Arab Emirates.
## [1] 0.4052915
Correlation Coefficient of score with GDP, Healthy life expectancy and social support was critical (over 74%) Correlation of score with freedom of life choices and perception of corruption was mild(about 40~60%) In fact, correlation of score with generosity was insignificant(about 13%).
Now, let’s watch about continents
tapply(score,Region,median) #Oceania>America>Europe>Asia>Africa
## afr ame asi eur oce
## 4.3530 6.1795 5.2235 5.9465 7.2980
tapply(gdp,Region,median) #Europe>America
## afr ame asi eur oce
## 0.4485 0.9710 0.9415 1.2050 1.3040
tapply(socialsupport,Region,median) #Europe>America
## afr ame asi eur oce
## 0.9705 1.4385 1.2615 1.4765 1.5870
tapply(healthylife,Region,median) #Europe>America
## afr ame asi eur oce
## 0.2935 0.6745 0.6320 0.8550 0.8930
tapply(freedom,Region,median) #America>Asia>Europe
## afr ame asi eur oce
## 0.4210 0.5515 0.5320 0.4565 0.6580
tapply(generosity,Region,median) #Oceania>Asia>Africa>Europe>America
## afr ame asi eur oce
## 0.1690 0.1355 0.2065 0.1545 0.3630
tapply(corruption[-20],Region[-20],median) #Oceania>Asia>Africa>America>Europe
## afr ame asi eur oce
## 0.0785 0.0780 0.1050 0.0750 0.3455
Oceanian countries account for all the categories, from total scores to 6 independent variables. We can jump into a conclusion that Oceanian countries are good to live.
table(Region)
## Region
## afr ame asi eur oce
## 46 24 38 46 2
There are only 2 Oceanian Countries, Australia and New Zealand. Small samples have high possibilities of extreme results.
First, a scatter plot of gdp and score. Differentiated color by continents.
library(ggplot2)
qplot(gdp,score,col=Region,size=I(3),xlim=c(-0.3,2.3))+geom_text(aes(label=country),col="black",vjust=-1.5,size=I(3))
Now, adding complexity. A scatter plot of socialsupport and score, differentiated color by continents and a size difference by the gdp.
ggplot(data=worldhappiness,aes(x=socialsupport,y=score,size=gdp,col=Region))+geom_point()+geom_text(aes(label=country),vjust=-1.5,col="black",size=I(3))
Or if you don’t want to see too many letters and just want to see some outliers, use a function called ‘annotate’.
ggplot(data=worldhappiness,aes(x=socialsupport,y=score,size=gdp,col=Region))+geom_point()+annotate("text",x=0.8,y=6.9,label="United Arab Emirates")+annotate("text",x=0.62,y=3,label="Burundi")+annotate("text",x=1.42,y=4.2,label="Ukraine")
worldhappiness=worldhappiness[-20,] #for comfort, reduce NA
reg=lm(score~gdp+socialsupport+healthylife+freedom+corruption+generosity)
summary(reg)
##
## Call:
## lm(formula = score ~ gdp + socialsupport + healthylife + freedom +
## corruption + generosity)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.73152 -0.28631 0.00222 0.35921 1.08520
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.8240 0.1978 9.223 2.72e-16 ***
## gdp 0.9013 0.2424 3.719 0.000283 ***
## socialsupport 1.1151 0.2117 5.267 4.81e-07 ***
## healthylife 0.9672 0.3425 2.824 0.005402 **
## freedom 1.3992 0.3186 4.392 2.13e-05 ***
## corruption 0.7307 0.5275 1.385 0.168075
## generosity 0.5177 0.4715 1.098 0.273929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5218 on 148 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.7905, Adjusted R-squared: 0.782
## F-statistic: 93.06 on 6 and 148 DF, p-value: < 2.2e-16
step(reg,direction="both")
## Start: AIC=-194.83
## score ~ gdp + socialsupport + healthylife + freedom + corruption +
## generosity
##
## Df Sum of Sq RSS AIC
## - generosity 1 0.3283 40.619 -195.58
## - corruption 1 0.5224 40.813 -194.84
## <none> 40.290 -194.83
## - healthylife 1 2.1706 42.461 -188.70
## - gdp 1 3.7648 44.055 -182.99
## - freedom 1 5.2514 45.542 -177.84
## - socialsupport 1 7.5510 47.841 -170.21
##
## Step: AIC=-195.58
## score ~ gdp + socialsupport + healthylife + freedom + corruption
##
## Df Sum of Sq RSS AIC
## <none> 40.619 -195.58
## + generosity 1 0.3283 40.290 -194.83
## - corruption 1 0.8663 41.485 -194.31
## - healthylife 1 2.2077 42.826 -189.37
## - gdp 1 3.5592 44.178 -184.56
## - freedom 1 6.0256 46.644 -176.14
## - socialsupport 1 7.5507 48.169 -171.15
##
## Call:
## lm(formula = score ~ gdp + socialsupport + healthylife + freedom +
## corruption)
##
## Coefficients:
## (Intercept) gdp socialsupport healthylife freedom
## 1.8895 0.8705 1.1151 0.9752 1.4688
## corruption
## 0.9000
Now let’s see correlations between independent variables to see if there’s multicollinearity in Multiple Linear Regression. One option is to use ggpairs function in GGally function.
library(GGally)
ggpairs(worldhappiness[,5:9])
Or to draw a plot of all six independent variables including Corruption,base plotting system is also enough. This code is from page 114 of “R Graphics Cookbook by Winston Chang”
panel.hist <- function(x, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(usr[1:2], 0, 1.5) )
h <- hist(x, plot = FALSE)
breaks <- h$breaks; nB <- length(breaks)
y <- h$counts; y <- y/max(y)
rect(breaks[-nB], 0, breaks[-1], y, ...)
}
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
usr <- par("usr"); on.exit(par(usr))
par(usr = c(0, 1, 0, 1))
r <- abs(cor(x, y))
txt <- format(c(r, 0.123456789), digits = digits)[1]
txt <- paste0(prefix, txt)
if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
text(0.5, 0.5, txt, cex = cex.cor * r)
}
pairs(worldhappiness[,5:10],upper.panel=panel.cor,diag.panel=panel.hist,lower.panel=panel.smooth)
cor(gdp,socialsupport)=0.728 cor(gdp,healthylife)=0.866 cor(socialsupport,healthylife)=0.675 It is assumable that there is a severe multicollinearity between gdp and healthylife. Let’s calculate the VIF(Variance Inflation Factor). One important thing to note is not to load a package “VIF” but “car”.
model=lm(score~gdp+socialsupport+healthylife+freedom+generosity)
library(car)
## Loading required package: carData
vif(model)
## gdp socialsupport healthylife freedom generosity
## 3.774128 2.075623 3.747210 1.359489 1.117754
Since none of these variables exceed the VIF 4, let’s pass.
plot(model)