1. Overview
1.1 Purpose of research
This RPubs document aims to understand the happiness scores of countries around the world and to identify which factors contribute to happiness.
2. Major Data and Design Challenges
2.1 Lack of Map data for geospatial plots
When I first downloaded the data, I found that there wasn't any field I could use to plot map data.
2.2 Datasets use different naming conventions across years
After checking each data file, I realized that the columns are named differently in different years, which would become a problem when I try to join them into a single dataset.
2.3 Datasets are not consistent in number of columns
Furthermore, the order of the columns was different and some years have additional columns. Again, this would be a problem when I try to combine the datasets, as some years would not have data for certain fields.
3. How to Overcome the Challenges
3.1 Source for country map dataset
After searching the internet, I managed to find a way to create geospatial maps by joining ISO3166 country codes to each country. After that, the hcmap function from the highcharter library is able to visualize the map data based on the country codes.
3.2 Standardize column names across every dataset
To standardize the column names, I edited the data files manually and entered the standardized column names.
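For reference, the same renaming could also be scripted in R. The sketch below is only an illustration of the idea, not what I actually ran: the raw header names in name_map and the helper standardize_names are hypothetical, and only the standardized names (Country, HappinessScore) come from the code used later in this document.

```r
# Hypothetical raw header variants mapped to the standardized names.
# The raw names on the left are illustrative assumptions.
name_map <- c(
  "Country.name"      = "Country",
  "Country.or.region" = "Country",
  "Happiness.Score"   = "HappinessScore",
  "Ladder.score"      = "HappinessScore",
  "Score"             = "HappinessScore"
)

# Rename every column that appears in the lookup; leave the rest untouched
standardize_names <- function(df, map = name_map) {
  hits <- names(df) %in% names(map)
  names(df)[hits] <- unname(map[names(df)[hits]])
  df
}

# Toy example with made-up values
raw <- data.frame(Country.or.region = c("A", "B"), Score = c(7.1, 4.2))
clean <- standardize_names(raw)
names(clean)
```

A lookup like this would have to be extended with every header variant that actually occurs in the yearly files.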
3.3 Standardize the columns to be used for visualization
Again, I edited the data files directly and removed the additional columns that other years do not have. I edited the files directly because I felt it would be faster than scripting the changes in R.
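If the column trimming were scripted instead, one possible approach is to keep only the columns that every year has in common before stacking the years. This is a minimal sketch on made-up toy data; the data frames y2019 and y2020 and the column ExtraColumn are assumptions for illustration only.

```r
# Toy per-year data frames with made-up values; 'ExtraColumn' stands in for
# a field that only some years provide.
y2019 <- data.frame(Country = c("A", "B"), HappinessScore = c(7.0, 4.5), Year = 2019)
y2020 <- data.frame(Country = c("A", "B"), HappinessScore = c(7.2, 4.4), Year = 2020,
                    ExtraColumn = c(0.1, 0.2))
yearly <- list(y2019, y2020)

# Columns present in every year
common_cols <- Reduce(intersect, lapply(yearly, names))

# Keep only the shared columns (in one fixed order) and stack the years
combined <- do.call(rbind, lapply(yearly, function(df) df[, common_cols, drop = FALSE]))
combined
```

Selecting by name also takes care of the differing column order, since every year is reordered to common_cols before binding.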
3.4 Sketch of proposed design
4. Step by Step Data Visualization
4.1 Import Libraries
packages = c('tidyverse', 'dplyr', 'plotly', 'maps', 'heatmaply', 'highcharter', 'ggcorrplot', 'DT')
for (p in packages) {
  # install the package if it is missing, then attach it
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
4.2 Import Data
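# plyr is only used for ldply(). It is called with plyr:: instead of being
# attached, because library(plyr) after dplyr masks rename() and summarise().
temp <- list.files(pattern = "\\.csv$") # pattern is a regular expression, not a glob
## combine all data
data <- plyr::ldply(temp, read.csv, header = TRUE)
## save combined data
write.csv(data, "data/combined.csv", row.names = FALSE)
data <- read_csv("data/combined.csv")
datatable(head(data))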
4.3 Get ISO3166 country codes
# get iso3166 country codes
dat <- iso3166
dat <- rename(dat, "iso-a3" = a3)
datatable(head(dat))
4.4 Data Preprocessing
After trying to run the code, I realized that certain countries did not have exactly the same name as in the ISO3166 country code data, so I had some trouble joining the data. To solve this, I manually picked out the countries that had different names and corrected them.
## Manual changing country names to match country names in iso3166
data$Country[data$Country == 'South Korea'] <- 'Republic of Korea'
data$Country[data$Country == 'United Kingdom'] <- 'United Kingdom of Great Britain and Northern Ireland'
data$Country[data$Country == 'Venezuela (Bolivarian Republic of)'] <-"Venezuela, Bolivarian Republic of"
data$Country[data$Country == 'Bolivia'] <-"Bolivia, Plurinational State of"
data$Country[data$Country == 'Russia'] <- 'Russian Federation'
data$Country[data$Country == 'Taiwan Province of China'] <- 'Taiwan'
data$Country[data$Country == 'Moldova, the Republic of'] <- 'Moldova, Republic of'
data$Country[data$Country == 'Bolivia (Plurinational State of)'] <- 'Bolivia, Plurinational State of'
data$Country[data$Country == 'North Cyprus'] <- 'Cyprus'
data$Country[data$Country == 'Hong Kong S.A.R. of China'] <- 'Hong Kong'
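# NOTE: 'Congo (Brazzaville)' is the Republic of the Congo, not the Democratic Republic; the target name below should be double-checked against the ISOname column.
data$Country[data$Country == 'Congo (Brazzaville)'] <- 'Democratic Republic of the Congo'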
data$Country[data$Country == 'North Macedonia'] <- 'the former Yugoslav Republic of Macedonia'
data$Country[data$Country == 'Laos'] <- "Lao People's Democratic Republic"
data$Country[data$Country == 'Iran'] <- 'Iran, Islamic Republic of'
data$Country[data$Country == 'Palestinian Territories'] <- 'Palestine, State of'
data$Country[data$Country == 'Tanzania'] <- 'Tanzania, United Republic of'
data$GovernmentCorruption <- as.double(data$GovernmentCorruption)
data_joined <- left_join(data,dat,by= c("Country"="ISOname" ))
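To find out which country names still fail to match, the names in the happiness data can be compared against the ISO names before joining. The sketch below uses two small made-up tables, happiness and iso, as stand-ins for data and dat; the values are illustrative only.

```r
# Toy stand-ins for `data` and `dat`; values are made up
happiness <- data.frame(Country = c("Finland", "Russia", "Laos"),
                        HappinessScore = c(7.8, 5.5, 4.9))
iso <- data.frame(ISOname = c("Finland", "Russian Federation",
                              "Lao People's Democratic Republic"),
                  a3 = c("FIN", "RUS", "LAO"))

# Country names with no counterpart in the ISO table
unmatched <- setdiff(unique(happiness$Country), iso$ISOname)
unmatched
```

With the real objects, the equivalent check would be setdiff(unique(data$Country), dat$ISOname); any name it returns would end up with a missing country code after the left join.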
4.5 Visualization 1: Box plot of Happiness score by year
The first visualization I wanted to do was a box plot showing the happiness score by year. This way I can see what the distribution looks like each year and look for potential patterns or insights.
boxplot_by_year <- data %>% plot_ly(y = ~HappinessScore,x= ~Year, type = "box",boxpoints = "all")
boxplot_by_year <- boxplot_by_year %>% layout(title="Happiness Index over the years")
boxplot_by_year
4.6 Visualization 2: Map
The second visualization that came to mind was a world map showing the Happiness Score of each country. This lets me see which countries have the highest scores and whether scores are similar within regions.
#get 2020 data
data2020 <- data_joined %>%
filter(Year == 2020)
happiness_map <- hcmap(
map = "custom/world-highres3", # high resolution world map
data = data2020, # name of dataset
joinBy = "iso-a3",
name = "Happiness Score",
value = "HappinessScore",
nullColor = "#DADADA",
tooltip = list(valueDecimals = 2)
) %>%
hc_mapNavigation(enabled = FALSE) %>%
hc_title(text = "Happiness Score by Countries 2020") # title
happiness_map
4.7 Visualization 3: Top 10 Happiness score by year
Next, I was interested to see how the top 10 countries in terms of happiness score fared over the years.
top_10_happiness <- data %>%
select(Country,HappinessRank) %>%
group_by(Country) %>%
summarise(HappinessRank = mean(HappinessRank)) %>%
arrange(HappinessRank)
top_10_happiness <- top_10_happiness[1:10,]
top_10_happiness_by_year_data <- subset(data, (Country %in% top_10_happiness$Country))
top_10_happiness_by_year <- plot_ly(top_10_happiness_by_year_data, x = ~Year, y = ~HappinessScore,
  mode = "lines", # plotly's scatter mode for line charts is "lines"
  color = ~Country,
  type = "scatter")
top_10_happiness_by_year <- top_10_happiness_by_year %>% layout(title = 'Top 10 Happiness score by year')
top_10_happiness_by_year
4.8 Visualization 4: Bottom 10 Happiness score by year
This is followed by the bottom 10, to see if there are any patterns in the visualization.
data2021 <- data %>%
filter(Year == 2021)
btm_10_happiness <- data %>%
select(Country,HappinessRank) %>%
group_by(Country) %>%
summarise(HappinessRank = mean(HappinessRank)) %>%
arrange(desc(HappinessRank))
btm_10_happiness <- btm_10_happiness %>%
filter(Country %in% data2021$Country)
btm_10_happiness <- btm_10_happiness[1:10,]
btm_10_happiness_by_year_data <- subset(data, ((Country %in% btm_10_happiness$Country)))
btm_10_happiness_by_year <- plot_ly(btm_10_happiness_by_year_data, x = ~Year, y = ~HappinessScore,
  mode = "lines", # plotly's scatter mode for line charts is "lines"
  color = ~Country,
  type = "scatter")
btm_10_happiness_by_year <- btm_10_happiness_by_year %>% layout(title = 'Bottom 10 Happiness score by year')
btm_10_happiness_by_year
4.9 Visualization 5: Correlation plot
I was interested to see which factors contribute more to the happiness score, and I felt that a correlation plot was the best way to do it.
corr_data <- data %>%
select(4:10) %>%
drop_na()
corr <- round(cor(corr_data),1)
# Compute a matrix of correlation p-values
p.mat <- cor_pmat(corr_data)
# Visualize the lower triangle of the correlation matrix
# Cross out the non-significant coefficients
corr.plot <- ggcorrplot(
corr, hc.order = TRUE, type = "lower", outline.col = "white",
p.mat = p.mat
)
corr_plot <- ggplotly(corr.plot) %>% layout(title="Correlation Plot")
corr_plot
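To read the factor ranking off as numbers rather than from the plot, the correlations with the happiness score can be pulled out of the correlation matrix and sorted. This is a minimal sketch on made-up toy values; the data frame toy is an assumption and does not reproduce the real results.

```r
# Toy data with made-up values, using a few of the column names from this document
toy <- data.frame(
  HappinessScore = c(3, 4, 5, 6, 7),
  SocialSupport  = c(0.5, 0.6, 0.7, 0.8, 0.9),
  Freedom        = c(0.3, 0.5, 0.4, 0.7, 0.8),
  Generosity     = c(0.2, 0.1, 0.3, 0.1, 0.2)
)

# Correlation of every factor with HappinessScore, strongest first
cors <- cor(toy)[, "HappinessScore"]
ranked <- sort(cors[names(cors) != "HappinessScore"], decreasing = TRUE)
round(ranked, 2)
```

On the real data, the same two lines could be applied to corr_data instead of toy.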
4.10 Visualization 6: Scatter plot of Happiness against Freedom and Social Support
Lastly, based on the two factors with the highest correlation, I wanted to see if there was any observable pattern across the years.
Happiness_by_freedom_support <- data %>%
  plot_ly(x = ~SocialSupport, y = ~Freedom, text = ~Country, type = 'scatter', mode = 'markers', frame = ~Year,
    marker = list(size = ~HappinessScore, opacity = 0.5, sizemode = 'diameter'))
Happiness_by_freedom_support <- Happiness_by_freedom_support %>%
layout(title = 'Happiness index by Freedom and Social support score',
xaxis = list(showgrid = FALSE),
yaxis = list(showgrid = FALSE),
showlegend = FALSE)
Happiness_by_freedom_support <- Happiness_by_freedom_support %>%
animation_opts(
2500
)
Happiness_by_freedom_support
5. Final Visualizations
Happiness Index By Year
The gap in happiness scores between countries seems to have widened over the years.
Happiness Index distribution across the world 2020
Countries in the West generally have higher happiness scores.
Top 10 Countries by Happiness Score and Year
Even within the top 10 countries, the happiness gap seemed to be widening over the years.
Bottom 10 Countries by Happiness Score and Year
Whereas for the bottom 10 countries, the happiness score gap seemed to be narrowing.
Correlation plot between factors
From the plot, we can see that Social Support and Freedom correlate positively with the Happiness Score. The other factors show little or no correlation with the score. It is surprising to see that Generosity does not appear to be related to the Happiness Score at all.
Happiness against Freedom and Social Support
From the chart, we can see that countries with high Freedom and Social Support scores tend to have the highest Happiness Score. Countries that score high on only one of the two, either Freedom or Social Support, also seem to have a higher Happiness Score, although not as high.
Another insight is that the Freedom score saw a large increase from 2019 to 2020.
Press play to see the coordinates change every year.
6. Additional Insights Gathered
Happiness Index distribution across the world 2020
Based on the chart, countries in the West generally have higher happiness scores, with the exception of Australia, which has a high happiness score despite being in the East.
Happiness Index By Year
From this chart, we can see that the overall spread of happiness scores has been widening over the years. At the same time, the interquartile range has been shrinking as more and more countries converge towards an average happiness score of 5.5.
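The overall range and the interquartile range measure different things, so they can be checked separately per year. The sketch below does this on made-up toy scores; the data frame scores is an assumption for illustration and does not reproduce the real figures.

```r
# Toy scores with made-up values for two years
scores <- data.frame(
  Year = rep(c(2019, 2020), each = 5),
  HappinessScore = c(3, 4, 5, 6, 7, 4, 4.5, 5, 5.5, 6)
)

# Interquartile range (middle 50% of countries) per year
iqr_by_year <- tapply(scores$HappinessScore, scores$Year, IQR)

# Full range (max minus min) per year
range_by_year <- tapply(scores$HappinessScore, scores$Year,
                        function(x) diff(range(x)))

data.frame(Year  = names(iqr_by_year),
           IQR   = as.numeric(iqr_by_year),
           Range = as.numeric(range_by_year))
```

Running the same tapply() calls on data$HappinessScore and data$Year would give the actual per-year figures behind the box plot.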
Top 10 countries vs Bottom 10 countries in happiness score
From the chart, we can see that the top 10 countries seem to have diverging happiness scores, while the bottom 10 countries seem to be converging. One possible explanation is that even for a country with a high happiness score, that score may be difficult to maintain.