library(ggplot2)
library(ggfortify)
library(knitr)

Introduction

The two datasets shared on Kaggle that I’m going to use are:

**Human Development UN Human Development Index Index](https://www.kaggle.com/robbies/humandevelopmentindex)**

Sustainable Development Solutions Network’s World Happiness Report from 2015

The World Happiness Report is a landmark survey of the state of global happiness. The World Happiness Report 2016 Update, which ranks 156 countries by their happiness levels, was released in Rome in advance of UN World Happiness Day, March 20th. The reports review the state of happiness in the world today and show how the new science of happiness explains personal and national variations in happiness. They reflect a new worldwide demand for more attention to happiness as a criteria for government policy.

So, the code below reads in the data sources and joins them together by country name. In here, we use dplyr and join the two datasets for comparison.

library(readr)
library(dplyr)

Read in data files from X2015 and human_development datasets

human_development <- read.csv("~/Documents/Rstudio/datasets/human_development.csv", stringsAsFactors = F)

X2015 <- read.csv("~/Documents/Rstudio/datasets/2015.csv", stringsAsFactors = F)

view data

glimpse(human_development)

## Observations: 195
## Variables: 8
## $ HDI.Rank                               <int> 1, 2, 3, 4, 5, 6, 6, 8,...
## $ Country                                <chr> "Norway", "Australia", ...
## $ Human.Development.Index..HDI.          <dbl> 0.944, 0.935, 0.930, 0....
## $ Life.Expectancy.at.Birth               <dbl> 81.6, 82.4, 83.0, 80.2,...
## $ Expected.Years.of.Education            <dbl> 17.5, 20.2, 15.8, 18.7,...
## $ Mean.Years.of.Education                <dbl> 12.6, 13.0, 12.8, 12.7,...
## $ Gross.National.Income..GNI..per.Capita <chr> "64,992", "42,261", "56...
## $ GNI.per.Capita.Rank.Minus.HDI.Rank     <int> 5, 17, 6, 11, 9, 11, 16...

glimpse(X2015)

## Observations: 158
## Variables: 12
## $ Country                       <chr> "Switzerland", "Iceland", "Denma...
## $ Region                        <chr> "Western Europe", "Western Europ...
## $ Happiness.Rank                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ Happiness.Score               <dbl> 7.587, 7.561, 7.527, 7.522, 7.42...
## $ Standard.Error                <dbl> 0.03411, 0.04884, 0.03328, 0.038...
## $ Economy..GDP.per.Capita.      <dbl> 1.39651, 1.30232, 1.32548, 1.459...
## $ Family                        <dbl> 1.34951, 1.40223, 1.36058, 1.330...
## $ Health..Life.Expectancy.      <dbl> 0.94143, 0.94784, 0.87464, 0.885...
## $ Freedom                       <dbl> 0.66557, 0.62877, 0.64938, 0.669...
## $ Trust..Government.Corruption. <dbl> 0.41978, 0.14145, 0.48357, 0.365...
## $ Generosity                    <dbl> 0.29678, 0.43630, 0.34139, 0.346...
## $ Dystopia.Residual             <dbl> 2.51738, 2.70201, 2.49204, 2.465...

create HDI vs Hap

HDIvsHap <- human_development %>%
  left_join(X2015 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Country,HDI.Rank,Happiness.Rank, Happiness.Score,Human.Development.Index..HDI., Mean.Years.of.Education,Dystopia.Residual)

view the summary of our newly joined dataset

summary(HDIvsHap)

##                 Country       HDI.Rank      Happiness.Rank  
##  Afghanistan        :  1   Min.   :  1.00   Min.   :  1.00  
##  Albania            :  1   1st Qu.: 47.75   1st Qu.: 36.25  
##  Algeria            :  1   Median : 94.00   Median : 79.50  
##  Andorra            :  1   Mean   : 94.31   Mean   : 78.19  
##  Angola             :  1   3rd Qu.:141.25   3rd Qu.:118.75  
##  Antigua and Barbuda:  1   Max.   :188.00   Max.   :158.00  
##  (Other)            :189   NA's   :7        NA's   :57      
##  Happiness.Score Human.Development.Index..HDI. Mean.Years.of.Education
##  Min.   :2.839   Min.   :0.3480                Min.   : 1.400         
##  1st Qu.:4.526   1st Qu.:0.5770                1st Qu.: 5.550         
##  Median :5.232   Median :0.7210                Median : 8.400         
##  Mean   :5.416   Mean   :0.6918                Mean   : 8.079         
##  3rd Qu.:6.322   3rd Qu.:0.8000                3rd Qu.:10.600         
##  Max.   :7.587   Max.   :0.9440                Max.   :13.100         
##  NA's   :57                                                           
##  Dystopia.Residual
##  Min.   :0.6704   
##  1st Qu.:1.8090   
##  Median :2.0954   
##  Mean   :2.1122   
##  3rd Qu.:2.4624   
##  Max.   :3.6021   
##  NA's   :57

We observed some NA’s so we have to remove NA’s

HDIvsHap <-na.omit(HDIvsHap)

again, we check our new dataset

summary(HDIvsHap)

##         Country       HDI.Rank      Happiness.Rank   Happiness.Score
##  Afghanistan:  1   Min.   :  1.00   Min.   :  1.00   Min.   :2.839  
##  Albania    :  1   1st Qu.: 40.25   1st Qu.: 36.25   1st Qu.:4.526  
##  Algeria    :  1   Median : 85.00   Median : 79.50   Median :5.232  
##  Angola     :  1   Mean   : 90.50   Mean   : 78.19   Mean   :5.416  
##  Argentina  :  1   3rd Qu.:144.50   3rd Qu.:118.75   3rd Qu.:6.322  
##  Armenia    :  1   Max.   :188.00   Max.   :158.00   Max.   :7.587  
##  (Other)    :132                                                    
##  Human.Development.Index..HDI. Mean.Years.of.Education Dystopia.Residual
##  Min.   :0.3480                Min.   : 1.400          Min.   :0.6704   
##  1st Qu.:0.5497                1st Qu.: 5.675          1st Qu.:1.8090   
##  Median :0.7330                Median : 8.500          Median :2.0954   
##  Mean   :0.7007                Mean   : 8.177          Mean   :2.1122   
##  3rd Qu.:0.8357                3rd Qu.:10.900          3rd Qu.:2.4624   
##  Max.   :0.9440                Max.   :13.100          Max.   :3.6021   
##

Now that we already have our dataset, we can take a closer look at it. We can use formattable.

library(formattable)

HDIvsHap %>%
  head(39) %>%
  formattable((list(
    Country = color_bar("yellow"),
    HDI.Rank = color_bar("lightgreen"),
    Happiness.Rank = color_bar ("light pink"),
    Happiness.Score = color_bar("deepskyblue"),
    Human.Development.Index..HDI = color_bar("yellow"),
    Mean.Years.of.Education = color_bar("deepskyblue"),
    Dystopia.Residual = color_bar("deepskyblue"))))

	Country	HDI.Rank	Happiness.Rank	Happiness.Score	Human.Development.Index..HDI.	Mean.Years.of.Education	Dystopia.Residual
1	Norway	1	4	7.522	0.944	12.6	2.46531
2	Australia	2	10	7.284	0.935	13.0	2.26646
3	Switzerland	3	1	7.587	0.930	12.8	2.51738
4	Denmark	4	3	7.527	0.923	12.7	2.49204
5	Netherlands	5	7	7.378	0.922	11.9	2.46570
6	Germany	6	26	6.750	0.916	13.1	2.11569
7	Ireland	6	18	6.940	0.916	12.2	1.97570
8	United States	8	15	7.119	0.915	12.9	2.51011
9	Canada	9	5	7.427	0.913	13.0	2.45176
10	New Zealand	9	9	7.286	0.913	12.5	2.26425
11	Singapore	11	24	6.798	0.912	10.6	1.88501
14	Sweden	14	8	7.364	0.907	12.1	2.37119
15	United Kingdom	14	21	6.867	0.907	13.1	1.96994
16	Iceland	16	2	7.561	0.899	10.6	2.70201
18	Israel	18	11	7.278	0.894	12.5	3.08854
19	Luxembourg	19	17	6.946	0.892	11.7	1.96961
20	Japan	20	46	5.987	0.891	11.5	1.68435
21	Belgium	21	19	6.937	0.890	11.3	2.41484
22	France	22	29	6.575	0.888	11.1	2.21126
23	Austria	23	13	7.200	0.885	10.8	2.53320
24	Finland	24	6	7.406	0.883	10.3	2.61955
25	Slovenia	25	55	5.848	0.880	11.9	1.61583
26	Spain	26	36	6.329	0.876	9.6	2.12367
27	Italy	27	50	5.948	0.873	10.1	2.02518
28	Czech Republic	28	31	6.505	0.870	12.3	2.67782
29	Greece	29	102	4.857	0.865	10.3	1.80101
30	Estonia	30	73	5.429	0.861	12.5	1.58782
32	Cyprus	32	67	5.689	0.850	11.6	1.88931
33	Qatar	32	28	6.611	0.850	9.1	1.55674
35	Slovakia	35	45	5.995	0.844	12.2	2.24639
36	Poland	36	60	5.791	0.843	11.8	1.86565
37	Lithuania	37	56	5.833	0.839	12.4	2.44649
38	Malta	37	37	6.302	0.839	10.3	1.64880
39	Saudi Arabia	39	35	6.411	0.837	8.7	2.43872
40	Argentina	40	30	6.574	0.836	9.8	2.83600
41	United Arab Emirates	41	20	6.901	0.835	9.5	2.24743
42	Chile	42	27	6.670	0.832	9.8	2.67585
43	Portugal	43	88	5.102	0.830	8.2	1.26462
44	Hungary	44	104	4.800	0.828	11.6	1.24074

Well, looks like our table contains several surprises. We could see Australia as the 4th HDI rank but she only ranks 10th in Happiness ranking. Check also Germany, were she ranks 6th in HDI but ranks 26th in happiness. Iceland, which ranks 2nd in happiness, drops down to 16th in HDI. Even Japan, which ranks 20th in HDI but almost double down on happiness ranking at 46th. Over-all , the table tells us that having high HDI does not translate to happy people in a country. From above observation, I have to admire countries like Norway, Australia, Switzerland, Denmark, the Netherlands, New Zealand and Canada for having well-rounded policies in each of their individual countries which results in a better work-life balance for their citizens.

**Will happiness increase or decrease if people both have longer lives and spend long years at school? In here we will try to answer our question and run some simple linear regression models

Lets make another data and select only columns that we are going to use for our linear regression

my.data <- human_development %>%
  left_join(X2015 , by = "Country") %>%
  mutate(Country = factor(Country)) %>%
  select(Happiness.Score,Life.Expectancy.at.Birth, Expected.Years.of.Education, Dystopia.Residual)

its very important to always check our joined dataset. I dont know..I guess this has become a habit of mine.

glimpse(my.data)

## Observations: 195
## Variables: 4
## $ Happiness.Score             <dbl> 7.522, 7.284, 7.587, 7.527, 7.378,...
## $ Life.Expectancy.at.Birth    <dbl> 81.6, 82.4, 83.0, 80.2, 81.6, 80.9...
## $ Expected.Years.of.Education <dbl> 17.5, 20.2, 15.8, 18.7, 17.9, 16.5...
## $ Dystopia.Residual           <dbl> 2.46531, 2.26646, 2.51738, 2.49204...

we will change the column names so it will look better before we plot our new dataset to corrplot

colnames(my.data) <- c("happiness", "lifeSpan", "education","dystopiaResidual")

again, check if our columns are in order

summary(my.data)

##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 4.10   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:65.75   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :73.10   Median :13.10   Median :2.0954  
##  Mean   :5.416   Mean   :71.07   Mean   :12.86   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:76.80   3rd Qu.:14.90   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :84.00   Max.   :20.20   Max.   :3.6021  
##  NA's   :57                                      NA's   :57

remove NA’s

Happiness <-na.omit(my.data)

summary

summary(Happiness)

##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 5.40   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:64.65   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :74.10   Median :13.50   Median :2.0954  
##  Mean   :5.416   Mean   :71.36   Mean   :13.10   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:77.55   3rd Qu.:15.28   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :83.50   Max.   :20.20   Max.   :3.6021

library(corrplot)

## corrplot 0.82 loaded

#plot matrix of all variables
plot(Happiness, pch=16 , col ="blue", main = "Matrix Scatterplot of Happiness , LifeSpan , 
     Education and Dystopia Residual")

The matrix plot above allows us to vizualise the relationship among all variables in one single image.For example, we can see how happiness and LifeSpan are related (see first column, second row top to bottom graph).

Another interesting example is the relationship between LifeSpan and education (third column left to right second row top to bottom graph). Here we can see that as LifeSpan increases, average years spent in schooling also increases! Whew! talking about stoping school at at early age.

Also from the matrix plot, note how dystopiaResidual seems to have a similar pattern relative to happiness when plotted against education (fourth column left to right second row top to bottom graph).

Now, we will try to run simple linear regression on our dataset and create some models

create formula

fmla <- happiness ~ lifeSpan + education

fit the model

Happiness_model <- lm(fmla, data = Happiness)

print Happiness_model and call summary()

Happiness_model

## 
## Call:
## lm(formula = fmla, data = Happiness)
## 
## Coefficients:
## (Intercept)     lifeSpan    education  
##    -0.88554      0.06396      0.13266

summary(Happiness_model)

## 
## Call:
## lm(formula = fmla, data = Happiness)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7665 -0.5195  0.0839  0.4873  1.5826 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.88554    0.55906  -1.584 0.115539    
## lifeSpan     0.06396    0.01176   5.437 2.46e-07 ***
## education    0.13266    0.03462   3.832 0.000194 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7396 on 135 degrees of freedom
## Multiple R-squared:  0.6037, Adjusted R-squared:  0.5978 
## F-statistic: 102.8 on 2 and 135 DF,  p-value: < 2.2e-16

we have a negative coefficient, so our results tells us that a decrease in happiness means an increase in both LifeSpan and education. In simple terms, as people tend to live longer and spend longer years in schooling , it would also decrease their chances of being happy. :(

plot(Happiness_model)

predict

Happiness$prediction <- predict(Happiness_model)

ggplot

ggplot(Happiness, aes(x = prediction, y = happiness)) + 
  geom_point() +
  geom_abline(color = "blue")

Plot a correlation graph

happinesscor = cor(Happiness[1:4])
corrplot(happinesscor, method = "number")

graphically evaluate Happiness_model

Make predictions from the model

Happiness$predictions <- predict(Happiness_model)

Fill in the blanks to plot predictions (on x-axis) versus the happiness

ggplot(Happiness, aes(x = predictions, y = happiness)) + 
  geom_point() + 
  geom_abline()

Calculate residuals

Happiness$residuals <- Happiness$predictions - Happiness$happiness

Fill in the blanks to plot the residuals vs. predictions

ggplot(Happiness, aes(x = predictions, y = residuals)) + 
  geom_pointrange(aes(ymin = 0, ymax = residuals)) + 
  geom_hline(yintercept = 0, linetype = 3) + 
  ggtitle("residuals vs. linear model prediction")

Load the package WVPlots

library(WVPlots)

# Plot the Gain Curve
GainCurvePlot(Happiness, "predictions", "happiness", "Happiness model")

#comment:A relative gini coefficient close to one shows that the model correctly sorts situations from lower ones.

view summary

summary(Happiness)

##    happiness        lifeSpan       education     dystopiaResidual
##  Min.   :2.839   Min.   :49.00   Min.   : 5.40   Min.   :0.6704  
##  1st Qu.:4.526   1st Qu.:64.65   1st Qu.:11.10   1st Qu.:1.8090  
##  Median :5.232   Median :74.10   Median :13.50   Median :2.0954  
##  Mean   :5.416   Mean   :71.36   Mean   :13.10   Mean   :2.1122  
##  3rd Qu.:6.322   3rd Qu.:77.55   3rd Qu.:15.28   3rd Qu.:2.4624  
##  Max.   :7.587   Max.   :83.50   Max.   :20.20   Max.   :3.6021  
##    prediction     predictions      residuals      
##  Min.   :3.312   Min.   :3.312   Min.   :-1.5826  
##  1st Qu.:4.656   1st Qu.:4.656   1st Qu.:-0.4873  
##  Median :5.631   Median :5.631   Median :-0.0839  
##  Mean   :5.416   Mean   :5.416   Mean   : 0.0000  
##  3rd Qu.:6.035   3rd Qu.:6.035   3rd Qu.: 0.5195  
##  Max.   :7.064   Max.   :7.064   Max.   : 1.7665

For convenience put the residuals in the variable res

res <- Happiness$residuals

# Calculate RMSE, assign it to the variable rmse and print it
(rmse <- sqrt(mean(res^2)))

## [1] 0.7315398

# Calculate the standard deviation of happiness and print it
(sd_Happiness <- sd(Happiness$happiness))

## [1] 1.166301

Comment:An RMSE much smaller than the outcome’s standard deviation suggests a model that predicts well.

This is my first R project and I just started using it. The models still neeeds more improvement. (Even the graphs!) I hope you all enjoy this post and feel free to comment below. I would like to hear from more experience R users so you could help me improve my skills. Thanks!

HDI versus Happiness (2015)

Brennon Borbon

July 24, 2017

Introduction

Lets make another data and select only columns that we are going to use for our linear regression