Final Project

Saayed Alam
December 6, 2018

Introduction

According to a new report by the Centers of Disease Control and Prevention (CDC), the suicide rate is the highest it's been in decades in United States. Which led to a question, what makes a country happy? Without a question the most common reason people commit suicide is because they are unhappy. The purpose of this project is to determine which factors are most important to live a happier life. The finding could help people and countries focus on the factors needed to attain a higher level of happiness or at the least push them towards the right direction.

Getting the data

The dataset comes Sustainable Development Solutions Network, an initiative by United Nations to promote sustainable development around the world. The survey from the year 2012, ranks 155 countries by their happiness levels. The ranking of the countries are based on seven factors - family, life expectancy, economy, generosity, trust in government, freedom and dystopia residuals. The sum of these seven factors equal to the happiness score. The higher the happiness score, the lower the happiness rank. Dystopia is an imaginary country that has the world's least happy people. It is a benchmark for other countries to show how far they are from being the poorest country pertaining to happiness level.

library(tidyverse)
library(ggthemes)
library(corrplot)
library(RColorBrewer)
library(rvest )
library(caTools)
library(Hmisc)
library(car)

Exploring the data

happiness_rank <- read.csv("https://raw.githubusercontent.com/saayedalam/Data/master/happiness_rank_2017.csv")

glimpse(happiness_rank)

## Observations: 155
## Variables: 12
## $ Country                       <fct> Norway, Denmark, Iceland, Switze...
## $ Happiness.Rank                <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1...
## $ Happiness.Score               <dbl> 7.537, 7.522, 7.504, 7.494, 7.46...
## $ Whisker.high                  <dbl> 7.594445, 7.581728, 7.622030, 7....
## $ Whisker.low                   <dbl> 7.479556, 7.462272, 7.385970, 7....
## $ Economy..GDP.per.Capita.      <dbl> 1.616463, 1.482383, 1.480633, 1....
## $ Family                        <dbl> 1.533524, 1.551122, 1.610574, 1....
## $ Health..Life.Expectancy.      <dbl> 0.7966665, 0.7925655, 0.8335521,...
## $ Freedom                       <dbl> 0.6354226, 0.6260067, 0.6271626,...
## $ Generosity                    <dbl> 0.36201224, 0.35528049, 0.475540...
## $ Trust..Government.Corruption. <dbl> 0.31596383, 0.40077007, 0.153526...
## $ Dystopia.Residual             <dbl> 2.277027, 2.313707, 2.322715, 2....

Tidying the data

The dataset has twelve variables. For readability, some of the columns are renamed. Also, for the purpose of this project, Economy factor will be omitted. Later, the economy data i.e. GDP per capita will be imported from a different data source. Also, whisker high and whisker low variables are removed since they will not be needed for statistical and visual analysis.

happiness_rank <- happiness_rank %>%
  dplyr::rename(Life.Expectancy = Health..Life.Expectancy.
         ,Trust = Trust..Government.Corruption.) %>%
  select(-Whisker.high, -Whisker.low, -'Economy..GDP.per.Capita.')

Transforming the data

For the scope of this project, a new variable called continent is created based on 155 countries from the dataset. The type of continent variable is changed to factor to discover whether there are different trends in regards to which of the seven factors play a significant role in gaining higher happiness level.

happiness_rank <- happiness_rank %>%
  mutate(Continent = case_when( 
  Country %in% c("Israel", "United Arab Emirates", "Singapore", "Thailand", "Taiwan Province of China", "Qatar", "Saudi Arabia", "Kuwait", "Bahrain", "Malaysia", "Uzbekistan", "Japan", "South Korea", "Turkmenistan", "Kazakhstan", "Turkey", "Hong Kong S.A.R., China", "Philippines", "Jordan", "China", "Pakistan", "Indonesia", "Azerbaijan", "Lebanon", "Vietnam", "Tajikistan", "Bhutan", "Kyrgyzstan", "Nepal", "Mongolia", "Palestinian Territories", "Iran", "Bangladesh", "Myanmar", "Iraq", "Sri Lanka", "Armenia", "India", "Georgia", "Cambodia", "Afghanistan", "Yemen", "Syria") ~ "Asia",
  Country %in%  c("Norway", "Denmark", "Iceland", "Switzerland", "Finland", "Netherlands", "Sweden", "Austria", "Ireland", "Germany", "Belgium", "Luxembourg", "United Kingdom", "Czech Republic", "Malta", "France", "Spain", "Slovakia", "Poland", "Italy", "Russia", "Lithuania", "Latvia", "Moldova", "Romania", "Slovenia", "North Cyprus", "Cyprus", "Estonia", "Belarus", "Serbia", "Hungary", "Croatia", "Kosovo", "Montenegro", "Greece", "Portugal", "Bosnia and Herzegovina", "Macedonia", "Bulgaria", "Albania", "Ukraine") ~ "Europe", 
  Country %in%  c("Canada", "Costa Rica", "United States", "Mexico", "Panama","Trinidad and Tobago", "El Salvador", "Belize", "Guatemala", "Jamaica", "Nicaragua", "Dominican Republic", "Honduras", "Haiti") ~ "North America", 
  Country %in%  c("Chile", "Brazil", "Argentina", "Uruguay", "Colombia", "Ecuador", "Bolivia", "Peru", "Paraguay", "Venezuela") ~ "South America",
  Country %in%  c("New Zealand", "Australia") ~ "Australia",
  TRUE ~ "Africa")) %>%
  mutate(Continent = as.factor(Continent)) %>%
  select(Country, Continent, everything()) 

glimpse(happiness_rank)

## Observations: 155
## Variables: 10
## $ Country           <fct> Norway, Denmark, Iceland, Switzerland, Finla...
## $ Continent         <fct> Europe, Europe, Europe, Europe, Europe, Euro...
## $ Happiness.Rank    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Happiness.Score   <dbl> 7.537, 7.522, 7.504, 7.494, 7.469, 7.377, 7....
## $ Family            <dbl> 1.533524, 1.551122, 1.610574, 1.516912, 1.54...
## $ Life.Expectancy   <dbl> 0.7966665, 0.7925655, 0.8335521, 0.8581313, ...
## $ Freedom           <dbl> 0.6354226, 0.6260067, 0.6271626, 0.6200706, ...
## $ Generosity        <dbl> 0.36201224, 0.35528049, 0.47554022, 0.290549...
## $ Trust             <dbl> 0.31596383, 0.40077007, 0.15352656, 0.367007...
## $ Dystopia.Residual <dbl> 2.277027, 2.313707, 2.322715, 2.276716, 2.43...

Visualizing the data

The survey says happiness score and happiness rank are inversely proportional. To verify the claim, the linearity is examined.

happiness_rank %>%
  ggplot(aes(Happiness.Score, Happiness.Rank)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "", x = "Happiness Score", y = "Happiness Rank") +
  theme_pander()

Happiness Score by Continent

A scatter plot is applied to visualize the distribution of the happiness in different continents.

happiness_rank %>%
  ggplot(aes(Continent, Happiness.Score, color = Continent)) +
  geom_point() +
  theme_fivethirtyeight() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5, vjust = 0.3)) +
  labs(title = "Happiness Score by Continent", 
       x = " ",
       y = "Happiness Score")

Descriptive statistical analysis by Continent

A descriptive statistical analysis of the six factors and happiness score using the Hmisc package.

happiness_rank %>%
  select(-Happiness.Rank, -Country, -Continent) %>%
  Hmisc::describe()

## . 
## 
##  7  Variables      155  Observations
## ---------------------------------------------------------------------------
## Happiness.Score 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      151        1    5.354    1.301    3.574    3.800 
##      .25      .50      .75      .90      .95 
##    4.506    5.279    6.102    6.927    7.293 
## 
## lowest : 2.693 2.905 3.349 3.462 3.471, highest: 7.469 7.494 7.504 7.522 7.537
## ---------------------------------------------------------------------------
## Family 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1    1.189   0.3106   0.6213   0.7814 
##      .25      .50      .75      .90      .95 
##   1.0426   1.2539   1.4143   1.4856   1.5215 
## 
## lowest : 0.0000000 0.3961026 0.4318825 0.4352998 0.5125688
## highest: 1.5481951 1.5489691 1.5511216 1.5582311 1.6105740
## ---------------------------------------------------------------------------
## Life.Expectancy 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1   0.5513   0.2677   0.1118   0.1925 
##      .25      .50      .75      .90      .95 
##   0.3699   0.6060   0.7230   0.8273   0.8448 
## 
## lowest : 0.000000000 0.005564754 0.018772686 0.041134715 0.048642170
## highest: 0.888960600 0.900214076 0.913475871 0.943062425 0.949492395
## ---------------------------------------------------------------------------
## Freedom 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1   0.4088   0.1691   0.1179   0.2007 
##      .25      .50      .75      .90      .95 
##   0.3037   0.4375   0.5166   0.5874   0.6133 
## 
## lowest : 0.00000000 0.01499586 0.03036986 0.05990075 0.08153944
## highest: 0.62600672 0.62716264 0.63337582 0.63542259 0.65824866
## ---------------------------------------------------------------------------
## Generosity 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1   0.2469   0.1482  0.05149  0.08534 
##      .25      .50      .75      .90      .95 
##  0.15411  0.23154  0.32376  0.42829  0.48970 
## 
## lowest : 0.00000000 0.01016466 0.02880684 0.03220996 0.04378538
## highest: 0.50000513 0.57212311 0.57473058 0.61170459 0.83807516
## ---------------------------------------------------------------------------
## Trust 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1   0.1231   0.1047  0.02072  0.03213 
##      .25      .50      .75      .90      .95 
##  0.05727  0.08985  0.15330  0.28256  0.33724 
## 
## lowest : 0.000000000 0.004387901 0.008964816 0.010091286 0.011051531
## highest: 0.384398729 0.400770068 0.439299256 0.455220014 0.464307785
## ---------------------------------------------------------------------------
## Dystopia.Residual 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##      155        0      155        1     1.85   0.5526    1.056    1.316 
##      .25      .50      .75      .90      .95 
##    1.591    1.833    2.145    2.488    2.731 
## 
## lowest : 0.3779137 0.4193892 0.5400612 0.5546331 0.6211305
## highest: 2.8078084 2.8371549 2.8938911 2.8986392 3.1174846
## ---------------------------------------------------------------------------

Correlation of six factors to Happiness Score

Which factor plays the most significant role in contributing to happiness score?

happiness_corr <- cor(happiness_rank[c(4:10)])
corrplot(happiness_corr, method = "pie", type = "upper", order = "FPC",
         col = brewer.pal(n = 7, name = "GnBu"),
         tl.col = "black", cl.align = "r", cl.ratio = 0.3)

Scatter plot

Scatter plot of the top 2 correlated factors with happiness score by continents. Australia is disregarded since it has only two data points.

happiness_rank %>%
  filter(Continent != "Australia") %>%
  ggplot(aes(Life.Expectancy, Happiness.Score)) +
  geom_point(aes(color = Continent), size = 3, alpha = 0.8) +
  geom_smooth(aes(color = Continent, fill = Continent), method = "lm", fullrange = TRUE) +
  facet_wrap(~ Continent) + 
  theme_fivethirtyeight() +
  ggtitle("Life Expectancy")

happiness_rank %>%
  filter(Continent != "Australia") %>%
  ggplot(aes(Family, Happiness.Score)) +
  geom_point(aes(color = Continent), size = 3, alpha = 0.8) +
  geom_smooth(aes(color = Continent, fill = Continent), method = "lm", fullrange = TRUE) +
  facet_wrap(~ Continent) + 
  theme_fivethirtyeight() +
  ggtitle("Family")

The correlation plot above shows Life Expectancy and Family play a major role in a continent's happiness level.
A study published in the Proceedings of the Academy of National Sciences and reported by CNN, states happiness is linked to longer life. This indicates two things - longer life eventually leads to higher level of happiness or happiness leads to longer life.

Getting GDP per capita data

What about economy? To examine this question, a list of countries by GDP per capita published by World Bank in year 2017 is scrapped from Wikipedia.

wiki_gdp_capita <- "https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(PPP)_per_capita"
wiki_gdp_capita <- read_html(wiki_gdp_capita)
gdp_capita <- wiki_gdp_capita %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/div/table[1]/tbody/tr[2]/td[2]/table') %>%
  html_table()

gdp_capita <- gdp_capita[[1]] #getting the data frame from the list

Tidying and Transforming the new data

As with any harvested data, a thorough clean up is necessary.

#renaming the column names and removing rank
gdp_capita <- gdp_capita %>% 
  dplyr::rename(Country = 'Country/Territory', 
                USD = 'Int$') %>%
  select(-Rank) 

#removing commas from dollar amount  
gdp_capita$USD <- str_replace_all(gdp_capita$USD, "[:punct:]", '') %>%
  as.numeric()

#all rows in happiness rank that do no have a match in gdp per capita
anti_join(happiness_rank, gdp_capita) %>%
  select(Country)

##                     Country
## 1  Taiwan Province of China
## 2              North Cyprus
## 3   Hong Kong S.A.R., China
## 4                 Venezuela
## 5                   Somalia
## 6   Palestinian Territories
## 7       Congo (Brazzaville)
## 8          Congo (Kinshasa)
## 9               Ivory Coast
## 10                    Yemen
## 11              South Sudan
## 12                    Syria

happiness_rank_gdp <- happiness_rank

#changing the value of the rows which are spelled differently
happiness_rank_gdp$country <- recode(happiness_rank_gdp$Country, "'North Cyprus' = 'Cyprus'")
happiness_rank_gdp$Country <- recode(happiness_rank_gdp$Country, "'Hong Kong S.A.R., China' = 'Hong Kong'")
happiness_rank_gdp$Country <- recode(happiness_rank_gdp$Country, "'Congo (Kinshasa)' = 'Congo, Rep.'")
happiness_rank_gdp$Country <- recode(happiness_rank_gdp$Country, "'Congo (Brazzaville)' = 'Congo, Dem. Rep.'")
happiness_rank_gdp$Country <- recode(happiness_rank_gdp$Country, "'South Sudan' = 'Sudan'")

#joining both datasets and delete the duplicate column of Country
happiness_rank_gdp <- left_join(happiness_rank_gdp, gdp_capita)

#standarizing the USD column and change it the name of the column to Economy
happiness_rank_gdp <- happiness_rank_gdp %>%
  mutate(Economy = percent_rank(USD)) %>%
  select(-11, -12) %>%
  drop_na()

#eight rows were removed since gdp capita do not have data for them
glimpse(happiness_rank_gdp)

## Observations: 147
## Variables: 11
## $ Country           <chr> "Norway", "Denmark", "Iceland", "Switzerland...
## $ Continent         <fct> Europe, Europe, Europe, Europe, Europe, Euro...
## $ Happiness.Rank    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 1...
## $ Happiness.Score   <dbl> 7.537, 7.522, 7.504, 7.494, 7.469, 7.377, 7....
## $ Family            <dbl> 1.533524, 1.551122, 1.610574, 1.516912, 1.54...
## $ Life.Expectancy   <dbl> 0.7966665, 0.7925655, 0.8335521, 0.8581313, ...
## $ Freedom           <dbl> 0.6354226, 0.6260067, 0.6271626, 0.6200706, ...
## $ Generosity        <dbl> 0.36201224, 0.35528049, 0.47554022, 0.290549...
## $ Trust             <dbl> 0.31596383, 0.40077007, 0.15352656, 0.367007...
## $ Dystopia.Residual <dbl> 2.277027, 2.313707, 2.322715, 2.276716, 2.43...
## $ Economy           <dbl> 0.9452055, 0.8972603, 0.9246575, 0.9589041, ...

Visualizing the new data

There is a correlation between economy and happiness score. And it's higher than life expectancy and family factors.

happiness_rank_gdp %>%
  filter(Continent != "Australia") %>%
  ggplot(aes(Economy, Happiness.Score)) +
  geom_point(aes(color = Continent), size = 3, alpha = 0.8) +
  geom_smooth(aes(color = Continent, fill = Continent), method = "lm", fullrange = TRUE) +
  facet_wrap(~ Continent) + 
  theme_fivethirtyeight() +
  ggtitle("Economy")

happiness_gdp_corr <- cor(happiness_rank_gdp[c(4:11)])
corrplot(happiness_gdp_corr, method = "pie", type = "upper", order = "FPC",
         col = brewer.pal(n = 7, name = "OrRd"),
         tl.col = "black", cl.align = "r", cl.ratio = 0.3)

Prediction

In this section, multiple linear regression, a machine learning algorithm will be implemented to see if it's effective at predicting happiness score based on the seven factors on future unseen data.

#splitting the dataset into train set and test set using caTools package
set.seed(123)
dataset <- happiness_rank_gdp[4:11]
split <- sample.split(dataset$Happiness.Score, SplitRatio = 0.8)
train_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

#fitting the model to the train set
m_linear_regression <- lm(formula = Happiness.Score ~ ., data = train_set)
summary(m_linear_regression)

## 
## Call:
## lm(formula = Happiness.Score ~ ., data = train_set)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.51571 -0.05793  0.00568  0.05979  0.43585 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        0.23756    0.05865   4.050 9.60e-05 ***
## Family             1.06804    0.04847  22.034  < 2e-16 ***
## Life.Expectancy    1.14576    0.07331  15.629  < 2e-16 ***
## Freedom            1.14461    0.08669  13.204  < 2e-16 ***
## Generosity         0.68511    0.07844   8.734 3.25e-14 ***
## Trust              0.95842    0.11675   8.209 4.92e-13 ***
## Dystopia.Residual  0.99407    0.01934  51.400  < 2e-16 ***
## Economy            1.25387    0.06634  18.899  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1034 on 109 degrees of freedom
## Multiple R-squared:  0.9918, Adjusted R-squared:  0.9913 
## F-statistic:  1881 on 7 and 109 DF,  p-value: < 2.2e-16

Since the happiness score is the sum of the independent variables, linear correlation is very strong. As a result, multiple linear regression will predict happiness score with high accuracy given these factors. To verify, prediction is made based on the test data set.

predict_test <- predict(m_linear_regression, newdata = test_set)
predict_actual <- as.data.frame(cbind(Prediction = predict_test, Actual = test_set$Happiness.Score))

predict_actual %>%
  ggplot(aes(Actual, Prediction)) +
  geom_point() +
  geom_abline() +
  theme_pander() +
  theme(legend.position = "none", plot.title = element_text(hjust = 0.5, vjust = 0.3)) +
  labs(title = "Multiple Linear Regression",
       x = "Actual Happiness Score",
       y = "Predicted Happiness Score")