For the first step, we load necessary packages and check if there’s an null value in the dataset. There’s no NA value - the data is pretty clean and ready for use. We have the same data for 2015 and 2016, for a comparison purpose, we import both.
library(ggplot2)
library(ggmap)
library(ggthemes)
library(tidyverse)
library(GGally)
library(MASS)
library(randomForest)
# read data
WH_15 <- read.csv('~/2015.csv',header = TRUE)
WH_16 <- read.csv('~/2016.csv',header = TRUE)
# no missing value
summary(is.na(WH_15))
## Year Country Region Happiness.Rank
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:158 FALSE:158 FALSE:158 FALSE:158
## Happiness.Score Economy..GDP.per.Capita. Family
## Mode :logical Mode :logical Mode :logical
## FALSE:158 FALSE:158 FALSE:158
## Health..Life.Expectancy. Freedom Trust..Government.Corruption.
## Mode :logical Mode :logical Mode :logical
## FALSE:158 FALSE:158 FALSE:158
## Generosity Dystopia.Residual
## Mode :logical Mode :logical
## FALSE:158 FALSE:158
summary(is.na(WH_16))
## Year Country Region Happiness.Rank
## Mode :logical Mode :logical Mode :logical Mode :logical
## FALSE:157 FALSE:157 FALSE:157 FALSE:157
## Happiness.Score Economy..GDP.per.Capita. Family
## Mode :logical Mode :logical Mode :logical
## FALSE:157 FALSE:157 FALSE:157
## Health..Life.Expectancy. Freedom Trust..Government.Corruption.
## Mode :logical Mode :logical Mode :logical
## FALSE:157 FALSE:157 FALSE:157
## Generosity Dystopia.Residual
## Mode :logical Mode :logical
## FALSE:157 FALSE:157
# summary view of both data
summary(WH_15)
## Year Country Region
## Min. :2015 Afghanistan: 1 Sub-Saharan Africa :40
## 1st Qu.:2015 Albania : 1 Central and Eastern Europe :29
## Median :2015 Algeria : 1 Latin America and Caribbean :22
## Mean :2015 Angola : 1 Western Europe :21
## 3rd Qu.:2015 Argentina : 1 Middle East and Northern Africa:20
## Max. :2015 Armenia : 1 Southeastern Asia : 9
## (Other) :152 (Other) :17
## Happiness.Rank Happiness.Score Economy..GDP.per.Capita.
## Min. : 1.00 Min. :2.839 Min. :0.0000
## 1st Qu.: 40.25 1st Qu.:4.526 1st Qu.:0.5458
## Median : 79.50 Median :5.232 Median :0.9102
## Mean : 79.49 Mean :5.376 Mean :0.8461
## 3rd Qu.:118.75 3rd Qu.:6.244 3rd Qu.:1.1584
## Max. :158.00 Max. :7.587 Max. :1.6904
##
## Family Health..Life.Expectancy. Freedom
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.8568 1st Qu.:0.4392 1st Qu.:0.3283
## Median :1.0295 Median :0.6967 Median :0.4355
## Mean :0.9910 Mean :0.6303 Mean :0.4286
## 3rd Qu.:1.2144 3rd Qu.:0.8110 3rd Qu.:0.5491
## Max. :1.4022 Max. :1.0252 Max. :0.6697
##
## Trust..Government.Corruption. Generosity Dystopia.Residual
## Min. :0.00000 Min. :0.0000 Min. :0.3286
## 1st Qu.:0.06168 1st Qu.:0.1506 1st Qu.:1.7594
## Median :0.10722 Median :0.2161 Median :2.0954
## Mean :0.14342 Mean :0.2373 Mean :2.0990
## 3rd Qu.:0.18025 3rd Qu.:0.3099 3rd Qu.:2.4624
## Max. :0.55191 Max. :0.7959 Max. :3.6021
##
summary(WH_16)
## Year Country Region
## Min. :2016 Afghanistan: 1 Sub-Saharan Africa :38
## 1st Qu.:2016 Albania : 1 Central and Eastern Europe :29
## Median :2016 Algeria : 1 Latin America and Caribbean :24
## Mean :2016 Angola : 1 Western Europe :21
## 3rd Qu.:2016 Argentina : 1 Middle East and Northern Africa:19
## Max. :2016 Armenia : 1 Southeastern Asia : 9
## (Other) :151 (Other) :17
## Happiness.Rank Happiness.Score Economy..GDP.per.Capita.
## Min. : 1.00 Min. :2.905 Min. :0.0000
## 1st Qu.: 40.00 1st Qu.:4.404 1st Qu.:0.6702
## Median : 79.00 Median :5.314 Median :1.0278
## Mean : 78.98 Mean :5.382 Mean :0.9539
## 3rd Qu.:118.00 3rd Qu.:6.269 3rd Qu.:1.2796
## Max. :157.00 Max. :7.526 Max. :1.8243
##
## Family Health..Life.Expectancy. Freedom
## Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.6418 1st Qu.:0.3829 1st Qu.:0.2575
## Median :0.8414 Median :0.5966 Median :0.3975
## Mean :0.7936 Mean :0.5576 Mean :0.3710
## 3rd Qu.:1.0215 3rd Qu.:0.7299 3rd Qu.:0.4845
## Max. :1.1833 Max. :0.9528 Max. :0.6085
##
## Trust..Government.Corruption. Generosity Dystopia.Residual
## Min. :0.00000 Min. :0.0000 Min. :0.8179
## 1st Qu.:0.06126 1st Qu.:0.1546 1st Qu.:2.0317
## Median :0.10547 Median :0.2225 Median :2.2907
## Mean :0.13762 Mean :0.2426 Mean :2.3258
## 3rd Qu.:0.17554 3rd Qu.:0.3119 3rd Qu.:2.6646
## Max. :0.50521 Max. :0.8197 Max. :3.8377
##
Let’s take a look and 15 data first.
As there are too many countries included in the dataset, we are more interested in the top / bottom countries. We take TOP and Bottom 15 countries and graph their Happiness out. We also using different colors to label the region these countries belongs to.
WH_15_top_15 <- WH_15 %>%
arrange(desc(Happiness.Score)) %>%
head(n = 15)
WH_15_bt_15 <- WH_15 %>%
arrange(Happiness.Score) %>%
head(n = 15)
WH_15_top_15$Country <- factor(WH_15_top_15$Country, levels = WH_15_top_15$Country[order(WH_15_top_15$Happiness.Score)])
WH_15_bt_15$Country <- factor(WH_15_bt_15$Country, levels = WH_15_bt_15$Country[order(WH_15_bt_15$Happiness.Score)])
ths <- ggplot(data = WH_15_top_15, mapping = aes(x = Country, y = Happiness.Score, label = Happiness.Score))
ths + geom_point(aes(color = Region),size = 2) +
geom_text(aes(color = Region), size = 3,hjust = -0.2) +
theme(legend.position="bottom") +
coord_flip() +
ggtitle("Country with top 15 Happiness Score")
bhs <- ggplot(data = WH_15_bt_15, mapping = aes(x = Country, y = Happiness.Score, label = Happiness.Score))
bhs + geom_point(aes(color = Region),size = 2) +
geom_text(aes(color = Region), size = 3,hjust = -0.2) +
theme(legend.position="bottom") +
coord_flip() +
ggtitle("Country with bottom 15 Happiness Score")
There are tons of interesting insights:
United States is only the 15th in the top score ranking
Cambodia is the only representative in the bottom list for Southeastern Asia, while Afghanistan is the only one from Southern Asia and Syria the only one from Middle East and northern Africa
WH_15_Rg <- WH_15 %>%
group_by(Region) %>%
summarise(mean_score = mean(Happiness.Score)) %>%
arrange(desc(mean_score))
WH_15_Rg$Region <- factor(WH_15_Rg$Region, levels = WH_15_Rg$Region[order(WH_15_Rg$mean_score)])
rwhs_15 <- ggplot(data = WH_15_Rg, aes(Region))
rwhs_15 + geom_bar(aes(weight = mean_score, alpha = mean_score, fill = mean_score), fill = 'dark green') +
coord_flip() +
theme(legend.position="bottom") +
ggtitle("Mean Score by Region") +
labs(y = "mean_score")
rwhs_15_2 <- ggplot(data = WH_15, aes(x = Region, y = Happiness.Score))
rwhs_15_2 + geom_boxplot(fill = "white", colour = "#3366FF") +
coord_flip() +
ggtitle("Score Distribution by Region")
By aggregate by Region mean happiness score, we can see the top performer is Australia and New Zealand, then NA, then Western EU. The bottom ones rea Sub0Saharan Africa, Southern Asia and Southeastern Asia.
By looking at the score distribution, what’s interesting here is Western EU has most of the top guys, but as the score is pretty spred out, the average only ranks 3rd. Middle East and Northern Africa ranks only 6th when comparing mean, but it spreads out and has some high performers. So what are the bottom performers of Western EU and top performers of middle east?
WH_15[,c("Country","Region","Happiness.Score")]%>%
arrange(Happiness.Score) %>%
filter(Region == "Western Europe") %>%
head(n = 5)
## Country Region Happiness.Score
## 1 Greece Western Europe 4.857
## 2 Portugal Western Europe 5.102
## 3 Cyprus Western Europe 5.689
## 4 North Cyprus Western Europe 5.695
## 5 Italy Western Europe 5.948
WH_15[,c("Country","Region","Happiness.Score")]%>%
arrange(desc(Happiness.Score)) %>%
filter(Region == "Middle East and Northern Africa") %>%
head(n = 5)
## Country Region Happiness.Score
## 1 Israel Middle East and Northern Africa 7.278
## 2 United Arab Emirates Middle East and Northern Africa 6.901
## 3 Oman Middle East and Northern Africa 6.853
## 4 Qatar Middle East and Northern Africa 6.611
## 5 Saudi Arabia Middle East and Northern Africa 6.411
We also conducted an ANOVA test, which shows the difference across regions is significant.
#anova for diff across region
region_anova <- aov(Happiness.Score ~ Region, WH_15)
summary(region_anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## Region 9 123.68 13.743 24.76 <2e-16 ***
## Residuals 148 82.15 0.555
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
16 data is pretty consistant with 15 data, so we don’t show 16 data results separately. but what will be pretty fun is to see if the there’s a move in the top / bottom performers, if so, what’s the direction of moving? We did it by first of all, combining two years’ data and caculate the difference between happiness scores YOY (year16 - year15). Then here we show the differences (blue is negative move, while red is positive):
WH_15_16_Mg <- merge(WH_15, WH_16, by = "Country", all = TRUE)
WH_15_16_Mg$diff <- round((WH_15_16_Mg$Happiness.Score.x - WH_15_16_Mg$Happiness.Score.y),4)
WH_15_16_top_15 <- WH_15_16_Mg %>%
arrange(desc(Happiness.Score.x)) %>%
head(n = 15)
WH_15_16_top_15$Country <- factor(WH_15_16_top_15$Country, levels = WH_15_16_top_15$Country[order(WH_15_16_top_15$Happiness.Score.x)])
yoy_tp <- ggplot(data = WH_15_16_top_15, aes(x = Country, y = diff, label = diff))
yoy_tp + geom_point(aes(colour = diff < 0)) +
theme(legend.position="bottom") +
geom_text(aes(color = diff < 0), size = 3,hjust = -0.2 ) +
coord_flip() +
ggtitle("Country with top 15 Happiness Score YOY Change")
WH_15_16_bt_15 <- WH_15_16_Mg %>%
arrange(Happiness.Score.x) %>%
head(n = 15)
WH_15_16_bt_15$Country <- factor(WH_15_16_bt_15$Country, levels = WH_15_16_bt_15$Country[order(WH_15_16_bt_15$Happiness.Score.x)])
yoy_bt <- ggplot(data = WH_15_16_bt_15, aes(x = Country, y = diff, label = diff))
yoy_bt + geom_point(aes(colour = diff < 0)) +
theme(legend.position="bottom") +
geom_text(aes(color = diff < 0), size = 3,hjust = -0.2 ) +
coord_flip() +
ggtitle("Country with Bottom 15 Happiness Score YOY Change")
Some interesting insights here are: - 12 out of 15 top performers of 2015 still see an improvment of happiness score in 2016, with Mexico increased the most (+0.409) - Unfortunately, 9 out of 15 bottom performers of 2015 see decrease in 2016, with Togo (which is the worst in 2015) decreased the most (-0.464) - Central African Republic disappeared in 2016. It might due to the lack of data, or political turmoil
So how about the YOY change of region?
WH_15_16_Rg <- WH_15_16_Mg %>%
group_by(Region.x) %>%
summarise(mean_diff = round(mean(diff,na.rm = TRUE),4)) %>%
arrange(desc(mean_diff)) %>%
filter(!is.na(Region.x))
WH_15_16_Rg$Region.x <- factor(WH_15_16_Rg$Region.x, levels = WH_15_16_Rg$Region.x[order(WH_15_16_Rg$mean_diff)])
rwhs_yoy <- ggplot(data = WH_15_16_Rg, aes(x = Region.x, y = mean_diff, label = mean_diff))
rwhs_yoy + geom_point(aes(colour = mean_diff < 0)) +
theme(legend.position="bottom") +
geom_text(aes(color = mean_diff < 0), size = 3,hjust = -0.2 ) +
coord_flip() +
ggtitle("Region YOY Score Change") +
labs(x = "Region", y = "Mean_Diff_Score")
Averagely, Latin American and Caribbean increase the most of +0.078, while Middle East and Orth Africa decreased the most of -0.0553.
So with an overview of the scores, we are interested in - what are the factors that positively / negatively influencing the happiness score? Before modeling, let’s take a look of the overall correlation:
ggpairs(WH_15, columns = 5:12)
There are several pretty clear linear relationships between the score variable and other variables. all variables are positively correlationted with happiness score.
Utilizing RandomForest and taking all variables into the model to explain Happiness Score - Which explains 89.74% of the score.
wh_rf <- randomForest(Happiness.Score ~., data = WH_15[,5:12])
wh_rf
##
## Call:
## randomForest(formula = Happiness.Score ~ ., data = WH_15[, 5:12])
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 2
##
## Mean of squared residuals: 0.1350795
## % Var explained: 89.63