I am going to use the world happiness report. You can read descriptions about the data in https://worldhappiness.report/ed/2018/.

The url of the data is : https://en.wikipedia.org/wiki/World_Happiness_Report

1. Reading + Setting up the data for use

Reading the data from the workspace and finish setting up the data

setwd("C:/workspace") #The data is in a folder called "workspace" in the Cdrive.
worldhappiness=read.csv("worldhappiness2018.csv",header=T)
colnames(worldhappiness) #There was an incoding error and read 'rank' as something else
##  [1] "癤풰ank"       "Region"        "country"       "score"        
##  [5] "gdp"           "socialsupport" "healthylife"   "freedom"      
##  [9] "generosity"    "corruption"
colnames(worldhappiness)[1]="rank" 
attach(worldhappiness) #for directly using column names

Now checking if there’s soemthing else to check by using str function

str(worldhappiness) ;head(worldhappiness) ;tail(worldhappiness)
## 'data.frame':    156 obs. of  10 variables:
##  $ rank         : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Region       : Factor w/ 5 levels "afr","ame","asi",..: 4 4 4 4 4 4 2 5 4 5 ...
##  $ country      : Factor w/ 156 levels "Afghanistan",..: 45 106 38 58 134 100 26 101 133 7 ...
##  $ score        : num  7.63 7.59 7.55 7.5 7.49 ...
##  $ gdp          : num  1.3 1.46 1.35 1.34 1.42 ...
##  $ socialsupport: num  1.59 1.58 1.59 1.64 1.55 ...
##  $ healthylife  : num  0.874 0.861 0.868 0.914 0.927 0.878 0.896 0.876 0.913 0.91 ...
##  $ freedom      : num  0.681 0.686 0.683 0.677 0.66 0.638 0.653 0.669 0.659 0.647 ...
##  $ generosity   : num  0.192 0.286 0.284 0.353 0.256 0.333 0.321 0.365 0.285 0.361 ...
##  $ corruption   : Factor w/ 111 levels "0","0.001","0.006",..: 107 103 108 78 104 99 98 106 105 100 ...
##   rank Region     country score   gdp socialsupport healthylife freedom
## 1    1    eur     Finland 7.632 1.305         1.592       0.874   0.681
## 2    2    eur      Norway 7.594 1.456         1.582       0.861   0.686
## 3    3    eur     Denmark 7.555 1.351         1.590       0.868   0.683
## 4    4    eur     Iceland 7.495 1.343         1.644       0.914   0.677
## 5    5    eur  Swizerland 7.487 1.420         1.549       0.927   0.660
## 6    6    eur Netherlands 7.441 1.361         1.488       0.878   0.638
##   generosity corruption
## 1      0.192      0.393
## 2      0.286       0.34
## 3      0.284      0.408
## 4      0.353      0.138
## 5      0.256      0.357
## 6      0.333      0.295
##     rank Region                  country score   gdp socialsupport
## 151  151    afr                   Rwanda 3.408 0.332         0.896
## 152  152    afr                    Yemen 3.355 0.442         1.073
## 153  153    afr                 Tanzania 3.303 0.455         0.991
## 154  154    afr              South Sudan 3.254 0.337         0.608
## 155  155    afr Central African Republic 3.083 0.024         0.000
## 156  156    afr                  Burundi 2.905 0.091         0.627
##     healthylife freedom generosity corruption
## 151       0.400   0.636      0.200      0.444
## 152       0.343   0.244      0.083      0.064
## 153       0.381   0.481      0.270      0.097
## 154       0.177   0.112      0.224      0.106
## 155       0.010   0.305      0.218      0.038
## 156       0.145   0.065      0.149      0.076
class(corruption)
## [1] "factor"

You will be able to see that the corruption score is a factor which is problematic. Change it into a numeric vector.

corruption=as.numeric(as.character(corruption)) #generates NA
## Warning: 강제형변환에 의해 생성된 NA 입니다
which(is.na(corruption)) #20
## [1] 20

We have to note that the 20th value for corruption is NA(Arab Emirates). There was originally no data from the wikipedia.

2.Simple statistics

Now let’s use simple stats to grasp the situation

cor(score,gdp)
## [1] 0.8021239
cor(score,socialsupport)
## [1] 0.7457602
cor(score,healthylife)
## [1] 0.7758136
cor(score,freedom)
## [1] 0.5442799
cor(score,generosity)#NA
## [1] 0.1345187
cor(score[-20],corruption[-20]) #without the data of Arab Emirates.
## [1] 0.4052915

Correlation Coefficient of score with GDP, Healthy life expectancy and social support was critical (over 74%) Correlation of score with freedom of life choices and perception of corruption was mild(about 40~60%) In fact, correlation of score with generosity was insignificant(about 13%).

Now, let’s watch about continents

tapply(score,Region,median) #Oceania>America>Europe>Asia>Africa
##    afr    ame    asi    eur    oce 
## 4.3530 6.1795 5.2235 5.9465 7.2980
tapply(gdp,Region,median) #Europe>America
##    afr    ame    asi    eur    oce 
## 0.4485 0.9710 0.9415 1.2050 1.3040
tapply(socialsupport,Region,median) #Europe>America
##    afr    ame    asi    eur    oce 
## 0.9705 1.4385 1.2615 1.4765 1.5870
tapply(healthylife,Region,median) #Europe>America
##    afr    ame    asi    eur    oce 
## 0.2935 0.6745 0.6320 0.8550 0.8930
tapply(freedom,Region,median) #America>Asia>Europe
##    afr    ame    asi    eur    oce 
## 0.4210 0.5515 0.5320 0.4565 0.6580
tapply(generosity,Region,median) #Oceania>Asia>Africa>Europe>America
##    afr    ame    asi    eur    oce 
## 0.1690 0.1355 0.2065 0.1545 0.3630
tapply(corruption[-20],Region[-20],median) #Oceania>Asia>Africa>America>Europe
##    afr    ame    asi    eur    oce 
## 0.0785 0.0780 0.1050 0.0750 0.3455

Oceanian countries account for all the categories, from total scores to 6 independent variables. We can jump into a conclusion that Oceanian countries are good to live.

However, one secret to this conclusion is:

table(Region)
## Region
## afr ame asi eur oce 
##  46  24  38  46   2

There are only 2 Oceanian Countries, Australia and New Zealand. Small samples have high possibilities of extreme results.

3. Plotting

First, a scatter plot of gdp and score. Differentiated color by continents.

library(ggplot2)
qplot(gdp,score,col=Region,size=I(3),xlim=c(-0.3,2.3))+geom_text(aes(label=country),col="black",vjust=-1.5,size=I(3))

Now, adding complexity. A scatter plot of socialsupport and score, differentiated color by continents and a size difference by the gdp.

ggplot(data=worldhappiness,aes(x=socialsupport,y=score,size=gdp,col=Region))+geom_point()+geom_text(aes(label=country),vjust=-1.5,col="black",size=I(3))

Or if you don’t want to see too many letters and just want to see some outliers, use a function called ‘annotate’.

ggplot(data=worldhappiness,aes(x=socialsupport,y=score,size=gdp,col=Region))+geom_point()+annotate("text",x=0.8,y=6.9,label="United Arab Emirates")+annotate("text",x=0.62,y=3,label="Burundi")+annotate("text",x=1.42,y=4.2,label="Ukraine")

4. Regression

worldhappiness=worldhappiness[-20,] #for comfort, reduce NA
reg=lm(score~gdp+socialsupport+healthylife+freedom+corruption+generosity)
summary(reg)
## 
## Call:
## lm(formula = score ~ gdp + socialsupport + healthylife + freedom + 
##     corruption + generosity)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.73152 -0.28631  0.00222  0.35921  1.08520 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.8240     0.1978   9.223 2.72e-16 ***
## gdp             0.9013     0.2424   3.719 0.000283 ***
## socialsupport   1.1151     0.2117   5.267 4.81e-07 ***
## healthylife     0.9672     0.3425   2.824 0.005402 ** 
## freedom         1.3992     0.3186   4.392 2.13e-05 ***
## corruption      0.7307     0.5275   1.385 0.168075    
## generosity      0.5177     0.4715   1.098 0.273929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5218 on 148 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7905, Adjusted R-squared:  0.782 
## F-statistic: 93.06 on 6 and 148 DF,  p-value: < 2.2e-16
step(reg,direction="both")
## Start:  AIC=-194.83
## score ~ gdp + socialsupport + healthylife + freedom + corruption + 
##     generosity
## 
##                 Df Sum of Sq    RSS     AIC
## - generosity     1    0.3283 40.619 -195.58
## - corruption     1    0.5224 40.813 -194.84
## <none>                       40.290 -194.83
## - healthylife    1    2.1706 42.461 -188.70
## - gdp            1    3.7648 44.055 -182.99
## - freedom        1    5.2514 45.542 -177.84
## - socialsupport  1    7.5510 47.841 -170.21
## 
## Step:  AIC=-195.58
## score ~ gdp + socialsupport + healthylife + freedom + corruption
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       40.619 -195.58
## + generosity     1    0.3283 40.290 -194.83
## - corruption     1    0.8663 41.485 -194.31
## - healthylife    1    2.2077 42.826 -189.37
## - gdp            1    3.5592 44.178 -184.56
## - freedom        1    6.0256 46.644 -176.14
## - socialsupport  1    7.5507 48.169 -171.15
## 
## Call:
## lm(formula = score ~ gdp + socialsupport + healthylife + freedom + 
##     corruption)
## 
## Coefficients:
##   (Intercept)            gdp  socialsupport    healthylife        freedom  
##        1.8895         0.8705         1.1151         0.9752         1.4688  
##    corruption  
##        0.9000

It shows that Happiness is represented by GDP, Social support, Healthy Life expectation, Freedom of choosing life and generosity. The value generosity has a p value of 0.0997, which is bigger than other variables(***)

5. Multicollinearity

Now let’s see correlations between independent variables to see if there’s multicollinearity in Multiple Linear Regression. One option is to use ggpairs function in GGally function.

library(GGally)
ggpairs(worldhappiness[,5:9])

Or to draw a plot of all six independent variables including Corruption,base plotting system is also enough. This code is from page 114 of “R Graphics Cookbook by Winston Chang”

panel.hist <- function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5) )
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nB <- length(breaks)
  y <- h$counts; y <- y/max(y)
  rect(breaks[-nB], 0, breaks[-1], y, ...)
}

panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(0, 1, 0, 1))
  r <- abs(cor(x, y))
  txt <- format(c(r, 0.123456789), digits = digits)[1]
  txt <- paste0(prefix, txt)
  if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
  text(0.5, 0.5, txt, cex = cex.cor * r)
}

pairs(worldhappiness[,5:10],upper.panel=panel.cor,diag.panel=panel.hist,lower.panel=panel.smooth)

cor(gdp,socialsupport)=0.728 cor(gdp,healthylife)=0.866 cor(socialsupport,healthylife)=0.675 It is assumable that there is a severe multicollinearity between gdp and healthylife. Let’s calculate the VIF(Variance Inflation Factor). One important thing to note is not to load a package “VIF” but “car”.

model=lm(score~gdp+socialsupport+healthylife+freedom+generosity)
library(car) 
## Loading required package: carData
vif(model)
##           gdp socialsupport   healthylife       freedom    generosity 
##      3.774128      2.075623      3.747210      1.359489      1.117754

Since none of these variables exceed the VIF 4, let’s pass.

6. Linearity, Normality, Equal Variance and Outliers

plot(model)

1) Residual vs Fitted : Linearity. X coordinate= estimated value = yhat, y coordinate = residual. If the red line is similar as a straight line, it’s linear.

2) Normal Q-Q plot : Normality. Points have to stay near the red dotted line. In this case, you can either regard it as normal or a little bit left-skewed.

3) Scale - Location : Equal Variance. Dots are quite randomly pointed.

4) Residuals vs Leverage : Outliers. Although 20th(Czech), 130th(Chad), 146th(Malawi) data stand out, all are within the cook’s distance.