library(ggplot2)
library(ggfortify)
library(knitr)

Introduction

The two datasets shared on Kaggle that I’m going to use are:

**Human Development UN Human Development Index Index](https://www.kaggle.com/robbies/humandevelopmentindex)**

Sustainable Development Solutions Network’s World Happiness Report from 2015

The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criteria for government policy.

So, the code below reads in the data sources and joins them together by country name. In here, we use dplyr and join the two datasets for comparison.

library(readr)
library(dplyr)

Read in data files from X2015 and human_development datasets

human_development <- read.csv("~/Documents/Rstudio/datasets/human_development.csv", stringsAsFactors = F)
X2015 <- read.csv("~/Documents/Rstudio/datasets/2015.csv", stringsAsFactors = F)

view data

glimpse(human_development)
## Observations: 195
## Variables: 8
## $ HDI.Rank                               <int> 1, 2, 3, 4, 5, 6, 6, 8,...
## $ Country                                <chr> "Norway", "Australia", ...
## $ Human.Development.Index..HDI.          <dbl> 0.944, 0.935, 0.930, 0....
## $ Life.Expectancy.at.Birth               <dbl> 81.6, 82.4, 83.0, 80.2,...
## $ Expected.Years.of.Education            <dbl> 17.5, 20.2, 15.8, 18.7,...
## $ Mean.Years.of.Education                <dbl> 12.6, 13.0, 12.8, 12.7,...
## $ Gross.National.Income..GNI..per.Capita <chr> "64,992", "42,261", "56...
## $ GNI.per.Capita.Rank.Minus.HDI.Rank     <int> 5, 17, 6, 11, 9, 11, 16...
glimpse(X2015)
## Observations: 158
## Variables: 12
## $ Country                       <chr> "Switzerland", "Iceland", "Denma...
## $ Region                        <chr> "Western Europe", "Western Europ...
## $ Happiness.Rank                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ Happiness.Score               <dbl> 7.587, 7.561, 7.527, 7.522, 7.42...
## $ Standard.Error                <dbl> 0.03411, 0.04884, 0.03328, 0.038...
## $ Economy..GDP.per.Capita.      <dbl> 1.39651, 1.30232, 1.32548, 1.459...
## $ Family                        <dbl> 1.34951, 1.40223, 1.36058, 1.330...
## $ Health..Life.Expectancy.      <dbl> 0.94143, 0.94784, 0.87464, 0.885...
## $ Freedom                       <dbl> 0.66557, 0.62877, 0.64938, 0.669...
## $ Trust..Government.Corruption. <dbl> 0.41978, 0.14145, 0.48357, 0.365...
## $ Generosity                    <dbl> 0.29678, 0.43630, 0.34139, 0.346...
## $ Dystopia.Residual             <dbl> 2.51738, 2.70201, 2.49204, 2.465...

create HDI vs Hap

HDIvsHap <- human_development %>%
  left_join(X2015 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Country,HDI.Rank,Happiness.Rank, Happiness.Score,Human.Development.Index..HDI., Mean.Years.of.Education,Dystopia.Residual)

view the summary of our newly joined dataset

summary(HDIvsHap)
##                 Country       HDI.Rank      Happiness.Rank  
##  Afghanistan        :  1   Min.   :  1.00   Min.   :  1.00  
##  Albania            :  1   1st Qu.: 47.75   1st Qu.: 36.25  
##  Algeria            :  1   Median : 94.00   Median : 79.50  
##  Andorra            :  1   Mean   : 94.31   Mean   : 78.19  
##  Angola             :  1   3rd Qu.:141.25   3rd Qu.:118.75  
##  Antigua and Barbuda:  1   Max.   :188.00   Max.   :158.00  
##  (Other)            :189   NA's   :7        NA's   :57      
##  Happiness.Score Human.Development.Index..HDI. Mean.Years.of.Education
##  Min.   :2.839   Min.   :0.3480                Min.   : 1.400         
##  1st Qu.:4.526   1st Qu.:0.5770                1st Qu.: 5.550         
##  Median :5.232   Median :0.7210                Median : 8.400         
##  Mean   :5.416   Mean   :0.6918                Mean   : 8.079         
##  3rd Qu.:6.322   3rd Qu.:0.8000                3rd Qu.:10.600         
##  Max.   :7.587   Max.   :0.9440                Max.   :13.100         
##  NA's   :57                                                           
##  Dystopia.Residual
##  Min.   :0.6704   
##  1st Qu.:1.8090   
##  Median :2.0954   
##  Mean   :2.1122   
##  3rd Qu.:2.4624   
##  Max.   :3.6021   
##  NA's   :57

We observed some NA’s so we have to remove NA’s

HDIvsHap <-na.omit(HDIvsHap)

again, we check our new dataset

summary(HDIvsHap)
##         Country       HDI.Rank      Happiness.Rank   Happiness.Score
##  Afghanistan:  1   Min.   :  1.00   Min.   :  1.00   Min.   :2.839  
##  Albania    :  1   1st Qu.: 40.25   1st Qu.: 36.25   1st Qu.:4.526  
##  Algeria    :  1   Median : 85.00   Median : 79.50   Median :5.232  
##  Angola     :  1   Mean   : 90.50   Mean   : 78.19   Mean   :5.416  
##  Argentina  :  1   3rd Qu.:144.50   3rd Qu.:118.75   3rd Qu.:6.322  
##  Armenia    :  1   Max.   :188.00   Max.   :158.00   Max.   :7.587  
##  (Other)    :132                                                    
##  Human.Development.Index..HDI. Mean.Years.of.Education Dystopia.Residual
##  Min.   :0.3480                Min.   : 1.400          Min.   :0.6704   
##  1st Qu.:0.5497                1st Qu.: 5.675          1st Qu.:1.8090   
##  Median :0.7330                Median : 8.500          Median :2.0954   
##  Mean   :0.7007                Mean   : 8.177          Mean   :2.1122   
##  3rd Qu.:0.8357                3rd Qu.:10.900          3rd Qu.:2.4624   
##  Max.   :0.9440                Max.   :13.100          Max.   :3.6021   
## 

Now that we already have our dataset, we can take a closer look at it. We can use formattable.

library(formattable)

HDIvsHap %>%
  head(39) %>%
  formattable((list(
    Country = color_bar("yellow"),
    HDI.Rank = color_bar("lightgreen"),
    Happiness.Rank = color_bar ("light pink"),
    Happiness.Score = color_bar("deepskyblue"),
    Human.Development.Index..HDI = color_bar("yellow"),
    Mean.Years.of.Education = color_bar("deepskyblue"),
    Dystopia.Residual = color_bar("deepskyblue"))))
Country HDI.Rank Happiness.Rank Happiness.Score Human.Development.Index..HDI. Mean.Years.of.Education Dystopia.Residual
1 Norway 1 4 7.522 0.944 12.6 2.46531
2 Australia 2 10 7.284 0.935 13.0 2.26646
3 Switzerland 3 1 7.587 0.930 12.8 2.51738
4 Denmark 4 3 7.527 0.923 12.7 2.49204
5 Netherlands 5 7 7.378 0.922 11.9 2.46570
6 Germany 6 26 6.750 0.916 13.1 2.11569
7 Ireland 6 18 6.940 0.916 12.2 1.97570
8 United States 8 15 7.119 0.915 12.9 2.51011
9 Canada 9 5 7.427 0.913 13.0 2.45176
10 New Zealand 9 9 7.286 0.913 12.5 2.26425
11 Singapore 11 24 6.798 0.912 10.6 1.88501
14 Sweden 14 8 7.364 0.907 12.1 2.37119
15 United Kingdom 14 21 6.867 0.907 13.1 1.96994
16 Iceland 16 2 7.561 0.899 10.6 2.70201
18 Israel 18 11 7.278 0.894 12.5 3.08854
19 Luxembourg 19 17 6.946 0.892 11.7 1.96961
20 Japan 20 46 5.987 0.891 11.5 1.68435
21 Belgium 21 19 6.937 0.890 11.3 2.41484
22 France 22 29 6.575 0.888 11.1 2.21126
23 Austria 23 13 7.200 0.885 10.8 2.53320
24 Finland 24 6 7.406 0.883 10.3 2.61955
25 Slovenia 25 55 5.848 0.880 11.9 1.61583
26 Spain 26 36 6.329 0.876 9.6 2.12367
27 Italy 27 50 5.948 0.873 10.1 2.02518
28 Czech Republic 28 31 6.505 0.870 12.3 2.67782
29 Greece 29 102 4.857 0.865 10.3 1.80101
30 Estonia 30 73 5.429 0.861 12.5 1.58782
32 Cyprus 32 67 5.689 0.850 11.6 1.88931
33 Qatar 32 28 6.611 0.850 9.1 1.55674
35 Slovakia 35 45 5.995 0.844 12.2 2.24639
36 Poland 36 60 5.791 0.843 11.8 1.86565
37 Lithuania 37 56 5.833 0.839 12.4 2.44649
38 Malta 37 37 6.302 0.839 10.3 1.64880
39 Saudi Arabia 39 35 6.411 0.837 8.7 2.43872
40 Argentina 40 30 6.574 0.836 9.8 2.83600
41 United Arab Emirates 41 20 6.901 0.835 9.5 2.24743
42 Chile 42 27 6.670 0.832 9.8 2.67585
43 Portugal 43 88 5.102 0.830 8.2 1.26462
44 Hungary 44 104 4.800 0.828 11.6 1.24074

Well, looks like our table contains several surprises. We could see Australia as the 4th HDI rank but she only ranks 10th in Happiness ranking. Check also Germany, were she ranks 6th in HDI but ranks 26th in happiness. Iceland, which ranks 2nd in happiness, drops down to 16th in HDI. Even Japan, which ranks 20th in HDI but almost double down on happiness ranking at 46th. Over-all , the table tells us that having high HDI does not translate to happy people in a country. From above observation, I have to admire countries like Norway, Australia, Switzerland, Denmark, the Netherlands, New Zealand and Canada for having well-rounded policies in each of their individual countries which results in a better work-life balance for their citizens.

**Will happiness increase or decrease if people both have longer lives and spend long years at school? In here we will try to answer our question and run some simple linear regression models

Lets make another data and select only columns that we are going to use for our linear regression

my.data <- human_development %>%
  left_join(X2015 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Happiness.Score,Life.Expectancy.at.Birth, Expected.Years.of.Education, Dystopia.Residual)

its very important to always check our joined dataset. I dont know..I guess this has become a habit of mine.

glimpse(my.data)
## Observations: 195
## Variables: 4
## $ Happiness.Score             <dbl> 7.522, 7.284, 7.587, 7.527, 7.378,...
## $ Life.Expectancy.at.Birth    <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9...
## $ Expected.Years.of.Education <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5...
## $ Dystopia.Residual           <dbl> 2.46531, 2.26646, 2.51738, 2.49204...

we will change the column names so it will look better before we plot our new dataset to corrplot

colnames(my.data) <- c("happiness", "lifeSpan", "education","dystopiaResidual")

again, check if our columns are in order

summary(my.data)
##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 4.10   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:65.75   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :73.10   Median :13.10   Median :2.0954  
##  Mean   :5.416   Mean   :71.07   Mean   :12.86   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:76.80   3rd Qu.:14.90   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :84.00   Max.   :20.20   Max.   :3.6021  
##  NA's   :57                                      NA's   :57

remove NA’s

Happiness <-na.omit(my.data)

summary

summary(Happiness)
##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 5.40   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:64.65   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :74.10   Median :13.50   Median :2.0954  
##  Mean   :5.416   Mean   :71.36   Mean   :13.10   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:77.55   3rd Qu.:15.28   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :83.50   Max.   :20.20   Max.   :3.6021
library(corrplot)
## corrplot 0.82 loaded
#plot matrix of all variables
plot(Happiness, pch=16 , col ="blue", main = "Matrix Scatterplot of Happiness , LifeSpan , 
     Education and Dystopia Residual")

The matrix plot above allows us to vizualise the relationship among all variables in one single image.For example, we can see how happiness and LifeSpan are related (see first column, second row top to bottom graph).

Another interesting example is the relationship between LifeSpan and education (third column left to right second row top to bottom graph). Here we can see that as LifeSpan increases, average years spent in schooling also increases! Whew! talking about stoping school at at early age.

Also from the matrix plot, note how dystopiaResidual seems to have a similar pattern relative to happiness when plotted against education (fourth column left to right second row top to bottom graph).

Now, we will try to run simple linear regression on our dataset and create some models

create formula

fmla <- happiness ~ lifeSpan + education

fit the model

Happiness_model <- lm(fmla, data = Happiness)

print Happiness_model and call summary()

Happiness_model
## 
## Call:
## lm(formula = fmla, data = Happiness)
## 
## Coefficients:
## (Intercept)     lifeSpan    education  
##    -0.88554      0.06396      0.13266
summary(Happiness_model)
## 
## Call:
## lm(formula = fmla, data = Happiness)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7665 -0.5195  0.0839  0.4873  1.5826 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.88554    0.55906  -1.584 0.115539    
## lifeSpan     0.06396    0.01176   5.437 2.46e-07 ***
## education    0.13266    0.03462   3.832 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7396 on 135 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.5978 
## F-statistic: 102.8 on 2 and 135 DF,  p-value: < 2.2e-16

we have a negative coefficient, so our results tells us that a decrease in happiness means an increase in both LifeSpan and education. In simple terms, as people tend to live longer and spend longer years in schooling , it would also decrease their chances of being happy. :(

plot(Happiness_model)

predict

Happiness$prediction <- predict(Happiness_model)

ggplot

ggplot(Happiness, aes(x = prediction, y = happiness)) + 
  geom_point() +
  geom_abline(color = "blue")

Plot a correlation graph

happinesscor = cor(Happiness[1:4])
corrplot(happinesscor, method = "number")

graphically evaluate Happiness_model

Make predictions from the model

Happiness$predictions <- predict(Happiness_model)

Fill in the blanks to plot predictions (on x-axis) versus the happiness

ggplot(Happiness, aes(x = predictions, y = happiness)) + 
  geom_point() + 
  geom_abline()

Calculate residuals

Happiness$residuals <- Happiness$predictions - Happiness$happiness

Fill in the blanks to plot the residuals vs. predictions

ggplot(Happiness, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

Load the package WVPlots

library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(Happiness, "predictions", "happiness", "Happiness model")

#comment:A relative gini coefficient close to one shows that the model correctly sorts situations from lower ones.

view summary

summary(Happiness)
##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 5.40   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:64.65   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :74.10   Median :13.50   Median :2.0954  
##  Mean   :5.416   Mean   :71.36   Mean   :13.10   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:77.55   3rd Qu.:15.28   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :83.50   Max.   :20.20   Max.   :3.6021  
##    prediction     predictions      residuals      
##  Min.   :3.312   Min.   :3.312   Min.   :-1.5826  
##  1st Qu.:4.656   1st Qu.:4.656   1st Qu.:-0.4873  
##  Median :5.631   Median :5.631   Median :-0.0839  
##  Mean   :5.416   Mean   :5.416   Mean   : 0.0000  
##  3rd Qu.:6.035   3rd Qu.:6.035   3rd Qu.: 0.5195  
##  Max.   :7.064   Max.   :7.064   Max.   : 1.7665

For convenience put the residuals in the variable res

res <- Happiness$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))
## [1] 0.7315398
# Calculate the standard deviation of happiness and print it
(sd_Happiness <- sd(Happiness$happiness))
## [1] 1.166301

Comment:An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.

This is my first R project and I just started using it. The models still neeeds more improvement. (Even the graphs!) I hope you all enjoy this post and feel free to comment below. I would like to hear from more experience R users so you could help me improve my skills. Thanks!