Preparation / Data Clean

For the first step, we load necessary packages and check if there’s an null value in the dataset. There’s no NA value - the data is pretty clean and ready for use. We have the same data for 2015 and 2016, for a comparison purpose, we import both.

library(ggplot2)
library(ggmap)
library(ggthemes)
library(tidyverse)
library(GGally)
library(MASS)
library(randomForest)

# read data
WH_15 <- read.csv('~/2015.csv',header = TRUE)
WH_16 <- read.csv('~/2016.csv',header = TRUE)

# no missing value
summary(is.na(WH_15))

##     Year          Country          Region        Happiness.Rank 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:158       FALSE:158       FALSE:158       FALSE:158      
##  Happiness.Score Economy..GDP.per.Capita.   Family       
##  Mode :logical   Mode :logical            Mode :logical  
##  FALSE:158       FALSE:158                FALSE:158      
##  Health..Life.Expectancy.  Freedom        Trust..Government.Corruption.
##  Mode :logical            Mode :logical   Mode :logical                
##  FALSE:158                FALSE:158       FALSE:158                    
##  Generosity      Dystopia.Residual
##  Mode :logical   Mode :logical    
##  FALSE:158       FALSE:158

summary(is.na(WH_16))

##     Year          Country          Region        Happiness.Rank 
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical  
##  FALSE:157       FALSE:157       FALSE:157       FALSE:157      
##  Happiness.Score Economy..GDP.per.Capita.   Family       
##  Mode :logical   Mode :logical            Mode :logical  
##  FALSE:157       FALSE:157                FALSE:157      
##  Health..Life.Expectancy.  Freedom        Trust..Government.Corruption.
##  Mode :logical            Mode :logical   Mode :logical                
##  FALSE:157                FALSE:157       FALSE:157                    
##  Generosity      Dystopia.Residual
##  Mode :logical   Mode :logical    
##  FALSE:157       FALSE:157

# summary view of both data
summary(WH_15)

##       Year             Country                                Region  
##  Min.   :2015   Afghanistan:  1   Sub-Saharan Africa             :40  
##  1st Qu.:2015   Albania    :  1   Central and Eastern Europe     :29  
##  Median :2015   Algeria    :  1   Latin America and Caribbean    :22  
##  Mean   :2015   Angola     :  1   Western Europe                 :21  
##  3rd Qu.:2015   Argentina  :  1   Middle East and Northern Africa:20  
##  Max.   :2015   Armenia    :  1   Southeastern Asia              : 9  
##                 (Other)    :152   (Other)                        :17  
##  Happiness.Rank   Happiness.Score Economy..GDP.per.Capita.
##  Min.   :  1.00   Min.   :2.839   Min.   :0.0000          
##  1st Qu.: 40.25   1st Qu.:4.526   1st Qu.:0.5458          
##  Median : 79.50   Median :5.232   Median :0.9102          
##  Mean   : 79.49   Mean   :5.376   Mean   :0.8461          
##  3rd Qu.:118.75   3rd Qu.:6.244   3rd Qu.:1.1584          
##  Max.   :158.00   Max.   :7.587   Max.   :1.6904          
##                                                           
##      Family       Health..Life.Expectancy.    Freedom      
##  Min.   :0.0000   Min.   :0.0000           Min.   :0.0000  
##  1st Qu.:0.8568   1st Qu.:0.4392           1st Qu.:0.3283  
##  Median :1.0295   Median :0.6967           Median :0.4355  
##  Mean   :0.9910   Mean   :0.6303           Mean   :0.4286  
##  3rd Qu.:1.2144   3rd Qu.:0.8110           3rd Qu.:0.5491  
##  Max.   :1.4022   Max.   :1.0252           Max.   :0.6697  
##                                                            
##  Trust..Government.Corruption.   Generosity     Dystopia.Residual
##  Min.   :0.00000               Min.   :0.0000   Min.   :0.3286   
##  1st Qu.:0.06168               1st Qu.:0.1506   1st Qu.:1.7594   
##  Median :0.10722               Median :0.2161   Median :2.0954   
##  Mean   :0.14342               Mean   :0.2373   Mean   :2.0990   
##  3rd Qu.:0.18025               3rd Qu.:0.3099   3rd Qu.:2.4624   
##  Max.   :0.55191               Max.   :0.7959   Max.   :3.6021   
##

summary(WH_16)

##       Year             Country                                Region  
##  Min.   :2016   Afghanistan:  1   Sub-Saharan Africa             :38  
##  1st Qu.:2016   Albania    :  1   Central and Eastern Europe     :29  
##  Median :2016   Algeria    :  1   Latin America and Caribbean    :24  
##  Mean   :2016   Angola     :  1   Western Europe                 :21  
##  3rd Qu.:2016   Argentina  :  1   Middle East and Northern Africa:19  
##  Max.   :2016   Armenia    :  1   Southeastern Asia              : 9  
##                 (Other)    :151   (Other)                        :17  
##  Happiness.Rank   Happiness.Score Economy..GDP.per.Capita.
##  Min.   :  1.00   Min.   :2.905   Min.   :0.0000          
##  1st Qu.: 40.00   1st Qu.:4.404   1st Qu.:0.6702          
##  Median : 79.00   Median :5.314   Median :1.0278          
##  Mean   : 78.98   Mean   :5.382   Mean   :0.9539          
##  3rd Qu.:118.00   3rd Qu.:6.269   3rd Qu.:1.2796          
##  Max.   :157.00   Max.   :7.526   Max.   :1.8243          
##                                                           
##      Family       Health..Life.Expectancy.    Freedom      
##  Min.   :0.0000   Min.   :0.0000           Min.   :0.0000  
##  1st Qu.:0.6418   1st Qu.:0.3829           1st Qu.:0.2575  
##  Median :0.8414   Median :0.5966           Median :0.3975  
##  Mean   :0.7936   Mean   :0.5576           Mean   :0.3710  
##  3rd Qu.:1.0215   3rd Qu.:0.7299           3rd Qu.:0.4845  
##  Max.   :1.1833   Max.   :0.9528           Max.   :0.6085  
##                                                            
##  Trust..Government.Corruption.   Generosity     Dystopia.Residual
##  Min.   :0.00000               Min.   :0.0000   Min.   :0.8179   
##  1st Qu.:0.06126               1st Qu.:0.1546   1st Qu.:2.0317   
##  Median :0.10547               Median :0.2225   Median :2.2907   
##  Mean   :0.13762               Mean   :0.2426   Mean   :2.3258   
##  3rd Qu.:0.17554               3rd Qu.:0.3119   3rd Qu.:2.6646   
##  Max.   :0.50521               Max.   :0.8197   Max.   :3.8377   
##

Analysis for WH 15 data

Let’s take a look and 15 data first.

Top and Bottom 15 Happiness Score countries

As there are too many countries included in the dataset, we are more interested in the top / bottom countries. We take TOP and Bottom 15 countries and graph their Happiness out. We also using different colors to label the region these countries belongs to.

WH_15_top_15 <- WH_15 %>%
  arrange(desc(Happiness.Score)) %>%
  head(n = 15)

WH_15_bt_15 <- WH_15 %>%
  arrange(Happiness.Score) %>%
  head(n = 15)

WH_15_top_15$Country <- factor(WH_15_top_15$Country, levels = WH_15_top_15$Country[order(WH_15_top_15$Happiness.Score)])
WH_15_bt_15$Country <- factor(WH_15_bt_15$Country, levels = WH_15_bt_15$Country[order(WH_15_bt_15$Happiness.Score)])

ths <- ggplot(data = WH_15_top_15, mapping = aes(x = Country, y = Happiness.Score, label = Happiness.Score))
ths + geom_point(aes(color = Region),size = 2) +
  geom_text(aes(color = Region), size = 3,hjust = -0.2) +
  theme(legend.position="bottom") +
  coord_flip() +
  ggtitle("Country with top 15 Happiness Score")

bhs <- ggplot(data = WH_15_bt_15, mapping = aes(x = Country, y = Happiness.Score, label = Happiness.Score))
bhs + geom_point(aes(color = Region),size = 2) +
  geom_text(aes(color = Region), size = 3,hjust = -0.2) +
  theme(legend.position="bottom") +
  coord_flip() +
  ggtitle("Country with bottom 15 Happiness Score")

There are tons of interesting insights:

For the top 15 countries, the top 8 countries are all from western Europe except for Canada from North America
Israel is the only representative of Middle East and Northern Africa in the top list
United States is only the 15th in the top score ranking
For the bottom 15 countries, 12 out of 15 are from Sub-Saharan Africa
Cambodia is the only representative in the bottom list for Southeastern Asia, while Afghanistan is the only one from Southern Asia and Syria the only one from Middle East and northern Africa

Happiness Score by regions

WH_15_Rg <- WH_15 %>%
  group_by(Region) %>%
  summarise(mean_score = mean(Happiness.Score)) %>%
  arrange(desc(mean_score))

WH_15_Rg$Region <- factor(WH_15_Rg$Region, levels = WH_15_Rg$Region[order(WH_15_Rg$mean_score)])

rwhs_15 <- ggplot(data = WH_15_Rg, aes(Region)) 
rwhs_15 + geom_bar(aes(weight = mean_score, alpha = mean_score, fill = mean_score), fill = 'dark green') +
  coord_flip() +
  theme(legend.position="bottom") +
  ggtitle("Mean Score by Region") +
  labs(y = "mean_score")

rwhs_15_2 <- ggplot(data = WH_15, aes(x = Region, y = Happiness.Score))
rwhs_15_2 + geom_boxplot(fill = "white", colour = "#3366FF") +
  coord_flip() +
  ggtitle("Score Distribution by Region")

By aggregate by Region mean happiness score, we can see the top performer is Australia and New Zealand, then NA, then Western EU. The bottom ones rea Sub0Saharan Africa, Southern Asia and Southeastern Asia.

By looking at the score distribution, what’s interesting here is Western EU has most of the top guys, but as the score is pretty spred out, the average only ranks 3rd. Middle East and Northern Africa ranks only 6th when comparing mean, but it spreads out and has some high performers. So what are the bottom performers of Western EU and top performers of middle east?

WH_15[,c("Country","Region","Happiness.Score")]%>%
arrange(Happiness.Score) %>%
filter(Region == "Western Europe") %>%
head(n = 5)

##        Country         Region Happiness.Score
## 1       Greece Western Europe           4.857
## 2     Portugal Western Europe           5.102
## 3       Cyprus Western Europe           5.689
## 4 North Cyprus Western Europe           5.695
## 5        Italy Western Europe           5.948

WH_15[,c("Country","Region","Happiness.Score")]%>%
arrange(desc(Happiness.Score)) %>%
filter(Region == "Middle East and Northern Africa") %>%
head(n = 5)

##                Country                          Region Happiness.Score
## 1               Israel Middle East and Northern Africa           7.278
## 2 United Arab Emirates Middle East and Northern Africa           6.901
## 3                 Oman Middle East and Northern Africa           6.853
## 4                Qatar Middle East and Northern Africa           6.611
## 5         Saudi Arabia Middle East and Northern Africa           6.411

We also conducted an ANOVA test, which shows the difference across regions is significant.

#anova for diff across region
region_anova <- aov(Happiness.Score ~ Region, WH_15)
summary(region_anova)

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Region        9 123.68  13.743   24.76 <2e-16 ***
## Residuals   148  82.15   0.555                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Comparison between 15 AND 16 data

YOY Country

16 data is pretty consistant with 15 data, so we don’t show 16 data results separately. but what will be pretty fun is to see if the there’s a move in the top / bottom performers, if so, what’s the direction of moving? We did it by first of all, combining two years’ data and caculate the difference between happiness scores YOY (year16 - year15). Then here we show the differences (blue is negative move, while red is positive):

WH_15_16_Mg <- merge(WH_15, WH_16, by = "Country", all = TRUE)
WH_15_16_Mg$diff <- round((WH_15_16_Mg$Happiness.Score.x - WH_15_16_Mg$Happiness.Score.y),4)

WH_15_16_top_15 <- WH_15_16_Mg %>%
  arrange(desc(Happiness.Score.x)) %>%
  head(n = 15)


WH_15_16_top_15$Country <- factor(WH_15_16_top_15$Country, levels = WH_15_16_top_15$Country[order(WH_15_16_top_15$Happiness.Score.x)])


yoy_tp <- ggplot(data = WH_15_16_top_15, aes(x = Country, y = diff, label = diff))
yoy_tp + geom_point(aes(colour = diff < 0)) +
  theme(legend.position="bottom") +
  geom_text(aes(color = diff < 0), size = 3,hjust = -0.2 ) + 
  coord_flip() +
  ggtitle("Country with top 15 Happiness Score YOY Change")

WH_15_16_bt_15 <- WH_15_16_Mg %>%
  arrange(Happiness.Score.x) %>%
  head(n = 15)


WH_15_16_bt_15$Country <- factor(WH_15_16_bt_15$Country, levels = WH_15_16_bt_15$Country[order(WH_15_16_bt_15$Happiness.Score.x)])


yoy_bt <- ggplot(data = WH_15_16_bt_15, aes(x = Country, y = diff, label = diff))
yoy_bt + geom_point(aes(colour = diff < 0)) +
  theme(legend.position="bottom") +
  geom_text(aes(color = diff < 0), size = 3,hjust = -0.2 ) + 
  coord_flip() +
  ggtitle("Country with Bottom 15 Happiness Score YOY Change")

Some interesting insights here are: - 12 out of 15 top performers of 2015 still see an improvment of happiness score in 2016, with Mexico increased the most (+0.409) - Unfortunately, 9 out of 15 bottom performers of 2015 see decrease in 2016, with Togo (which is the worst in 2015) decreased the most (-0.464) - Central African Republic disappeared in 2016. It might due to the lack of data, or political turmoil

YOY Region

So how about the YOY change of region?

WH_15_16_Rg <- WH_15_16_Mg %>%
  group_by(Region.x) %>%
  summarise(mean_diff = round(mean(diff,na.rm = TRUE),4)) %>%
  arrange(desc(mean_diff)) %>%
  filter(!is.na(Region.x)) 


WH_15_16_Rg$Region.x <- factor(WH_15_16_Rg$Region.x, levels = WH_15_16_Rg$Region.x[order(WH_15_16_Rg$mean_diff)])

rwhs_yoy <- ggplot(data = WH_15_16_Rg, aes(x = Region.x, y = mean_diff, label = mean_diff)) 
rwhs_yoy + geom_point(aes(colour = mean_diff < 0)) +
  theme(legend.position="bottom") +
  geom_text(aes(color = mean_diff < 0), size = 3,hjust = -0.2 ) + 
  coord_flip() +
  ggtitle("Region YOY Score Change") +
  labs(x = "Region", y = "Mean_Diff_Score")

Averagely, Latin American and Caribbean increase the most of +0.078, while Middle East and Orth Africa decreased the most of -0.0553.

Correlation and Modeling

Correlation

So with an overview of the scores, we are interested in - what are the factors that positively / negatively influencing the happiness score? Before modeling, let’s take a look of the overall correlation:

ggpairs(WH_15, columns = 5:12)

There are several pretty clear linear relationships between the score variable and other variables. all variables are positively correlationted with happiness score.

Modeling

Utilizing RandomForest and taking all variables into the model to explain Happiness Score - Which explains 89.74% of the score.

wh_rf <- randomForest(Happiness.Score ~., data = WH_15[,5:12])
wh_rf

## 
## Call:
##  randomForest(formula = Happiness.Score ~ ., data = WH_15[, 5:12]) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##           Mean of squared residuals: 0.1350795
##                     % Var explained: 89.63

ANLY_510_90_World_Happiness_Score_Analysis_Team_3

Team_3

June 15, 2017