1. Abstract

This project explores datasets from the World Happiness Report from 2017 to 2020 to identify and visualize trends and associations in happiness scores. We examine the correlation between GDP per capita and happiness score, the distribution of happiness across regions of the world, and the contributions of life expectancy and social support to happiness, and we visualize happiness scores across the globe using treemaps and choropleth maps. All data analysis and visualization is done in R.

2. Introduction

Many factors contribute to whether an individual is happy, and even more to whether a country as a whole is. This project uses R to visualize some of these factors and to draw conclusions about why some countries are happier than others. The factors explored are GDP per capita, life expectancy, and social support.

3. Methodology

Data source: World Happiness Report datasets (2017-2020), taken from Wikipedia.

Tools: R programming language with base packages, readxl, ggplot2, dplyr, tidyr, stringr, sqldf, rvest, treemap, maps, and RColorBrewer.

Data preprocessing: scraping, merging, and cleaning the datasets.

Data visualization: bar plots, scatter plots, box plots, bubble plots, and choropleth maps created with ggplot2; treemaps created with the treemap package; world map data from the maps package; color palettes from RColorBrewer.

4. Results

4.1 Bar Plot of Happiest Countries

The bar plot displays the ten happiest countries according to the 2020 data. The happiness scores of all ten countries fall within a small margin of each other: the largest difference, between Finland and Luxembourg, is 0.526, which is still relatively small. Most of these countries are in Europe.

4.2 Scatter Plot of Happiness Score Over Time (Six Countries)

This scatter plot shows how the happiness scores of China, India, Indonesia, Japan, Russia, and the United States changed from 2017 to 2020. We added lines to make the changes over time easier to follow. While the scores of most of these countries are relatively stable, the scores of India and Russia decreased the most over the period.
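
The size of each country's change can be checked directly from the scores listed in Appendix 7.3. The following is a minimal sketch in base R; the names `score_2017`, `score_2020`, and `change` are introduced here for illustration and are not part of the original analysis code.

```r
# Happiness scores copied from the vectors in Appendix 7.3
countries  <- c("China", "India", "USA", "Indonesia", "Japan", "Russia")
score_2017 <- c(5.273, 4.315, 6.993, 5.262, 5.920, 5.963)
score_2020 <- c(5.124, 3.573, 6.940, 5.286, 5.871, 5.546)

# Change from 2017 to 2020; negative values mean the score fell
change <- setNames(round(score_2020 - score_2017, 3), countries)

# Sort so the largest decreases come first
sort(change)
```

India (-0.742) and Russia (-0.417) show the largest drops, and Indonesia is the only one of the six whose score rose.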

4.3 Scatter Plot of Happiness Over Time (Top ten)

Like the previous scatter plot, this one displays happiness scores over time, but the selected countries are the ten happiest countries in each year from 2017 to 2020. We did not include lines here because we wanted to emphasize which countries have consistently ranked among the happiest. Closer inspection of the scatter plot shows that Denmark, Finland, Iceland, the Netherlands, New Zealand, Norway, Sweden, and Switzerland appear among the top ten every year.
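
One way to confirm which countries appear in the top ten every year is to count how many distinct years each country occurs in. The sketch below assumes a long data frame with `Year` and `Country` columns, like the one built in Appendix 7.4; the helper `always_in_top()` and the placeholder country labels are hypothetical.

```r
library(dplyr)

# Hypothetical helper: return the countries present in every year of `df`
always_in_top <- function(df) {
  n_years <- n_distinct(df$Year)
  df %>%
    distinct(Year, Country) %>%
    count(Country, name = "years_present") %>%
    filter(years_present == n_years) %>%
    pull(Country)
}

# Placeholder input (labels A-D are not real countries)
toy <- data.frame(
  Year    = c("2017", "2017", "2017", "2018", "2018", "2018"),
  Country = c("A", "B", "C", "A", "B", "D"),
  stringsAsFactors = FALSE
)

always_in_top(toy)
```

With the real data, the input would be the combined long table of yearly top-ten lists, and the result would be the set of countries that never left the top ten.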

4.4 Scatter Plot of GDP vs. Happiness Score

This scatter plot visualizes the relationship between GDP per capita and happiness score from 2017 to 2020. Each point represents a country, and the point’s color indicates the year of the data. Each year had a slightly different regression line, but the differences were so small that we display a single regression line for all four years of data. The graph suggests a strong positive linear relationship between the two variables; that is, they are highly correlated.
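
The strength of this relationship can be quantified with a Pearson correlation coefficient and a simple linear regression. The following sketch assumes a data frame with `year`, `gdp`, and `score` columns (hypothetical names); the values are simulated placeholders, not the report's data.

```r
library(dplyr)

set.seed(1)  # simulated placeholder data only
sim <- data.frame(
  year = rep(2017:2020, each = 50),
  gdp  = runif(200, min = 0, max = 2)
)
sim$score <- 3.5 + 2 * sim$gdp + rnorm(200, sd = 0.6)

# Pearson correlation for each year separately
sim %>%
  group_by(year) %>%
  summarise(r = cor(gdp, score), .groups = "drop")

# Pooled correlation and the single regression line across all years
pooled_r   <- cor(sim$gdp, sim$score)
pooled_fit <- lm(score ~ gdp, data = sim)
coef(pooled_fit)  # intercept and slope of the pooled line
```

Comparing the per-year coefficients with the pooled one is a quick check on whether a single regression line is a fair summary of all four years.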

4.5 Box Plot Happiness Distribution by Region

The box plot displays the distribution of 2020 happiness scores by continent, showing the median, quartiles, and outliers. The two countries in the Unassigned category are Russia and Egypt; they were not reassigned because they are transcontinental countries. Oceania contains only New Zealand and Australia, whose scores are 7.300 and 7.223, respectively, so there is almost no spread in the data, which is why the box for that region appears as a line.
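
The collapsed Oceania box follows directly from the interquartile range of its two scores. A quick check in base R, using only the two values quoted above:

```r
# The two Oceania scores quoted in the text
oceania <- c(7.300, 7.223)

# Quartiles and interquartile range (the height of the box)
quantile(oceania, probs = c(0.25, 0.5, 0.75))
IQR(oceania)
```

The interquartile range is about 0.04 points, which is barely visible on an axis that spans several points.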

4.6 Bubble Plot Healthy Life Expectancy vs. Social Support on Happiness Score

This bubble plot displays the dual impact of healthy life expectancy and social support on the happiness scores of different countries in the 2020 data. Each bubble represents a single country. The graph shows a roughly linear relationship: as life expectancy increases, social support increases and happiness scores rise. However, social support seems to be the limiting factor, because no bubble exceeds a social support value of approximately 1.5. We removed a bubble centered at (0,0) that resulted from missing values in the 2020 dataset: a healthy life expectancy of 0 does not make sense and conflicts with life expectancy records for this country (Central African Republic).
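
An alternative to dropping the row is to recode the impossible zeros as missing values, so that the country stays in the data frame but is skipped when plotting. This is a sketch only: the column names `life_exp` and `support`, the country labels, and the values are placeholders, not taken from the dataset.

```r
library(dplyr)

# Placeholder rows; "Z" stands in for a country with missing measurements
toy <- data.frame(
  country  = c("X", "Y", "Z"),
  life_exp = c(0.9, 0.7, 0),
  support  = c(1.4, 1.1, 0),
  stringsAsFactors = FALSE
)

# na_if() turns a sentinel value (here 0) into NA
cleaned <- toy %>%
  mutate(
    life_exp = na_if(life_exp, 0),
    support  = na_if(support, 0)
  )

cleaned
```

ggplot2 drops rows with missing aesthetics (with a warning), so the (0,0) bubble would disappear from the plot without the record being deleted.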

4.7 Treemap Overall Numerical Ranking in 2020

This treemap shows all 153 countries included in the 2020 dataset. Tile size reflects each country's happiness score, and color indicates its overall rank.

4.8 Choropleth Map Happiness Score by Region

The choropleth map visualizes happiness scores by country. Together, the countries form a broader picture of happiness by world region. Areas in gray are countries with no available data (at least in 2020). The happiest countries are concentrated in North America, Europe, and Oceania.

5. Conclusion

This project explored data from the World Happiness Report from 2017 to 2020 to begin to answer which countries are the happiest and to show a few factors that may contribute. We showed how happiness scores relate to GDP per capita, life expectancy, and social support, and how they vary across countries and regions of the world.

6. References

Wikipedia datasets: [https://en.wikipedia.org/wiki/World_Happiness_Report]

Life expectancy records (Central African Republic): [https://data.who.int/countries/140]

7. Appendices

7.1 Set up Code

knitr::opts_chunk$set(echo = TRUE)
library(readxl)
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(RColorBrewer)
library(sqldf)
library(rvest)
library(treemap)
library(maps)

7.2 Reading Data, Prepping Data, and Bar plot

Happiness_project_datacomplete <- "Happiness_project.xlsx"

read_xlsx(Happiness_project_datacomplete)

sheet_names <- excel_sheets(Happiness_project_datacomplete)

Happiness_project_data2020 <- read_excel(Happiness_project_datacomplete, sheet = 1)

Happiness_project_data2020_top_ten <- Happiness_project_data2020 %>%
  select (1:3) %>%
  slice (1:10)

names(Happiness_project_data2020_top_ten)[names(Happiness_project_data2020_top_ten) == "Country or region"] <- "Country"

ggplot(Happiness_project_data2020_top_ten, aes(x = reorder(Country,-Score), y = Score)) +
  geom_col(fill = c("lightblue")) +
  labs(title = "Top Ten Happiest Countries in 2020",
       x = "Country",
       y = "Happiness Score") +
  scale_y_continuous(limits = c(0, 10)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.3 Dataframe, Data Wrangling, Scatterplot 4.2

countries <- c("China", "India", "USA", "Indonesia", "Japan", "Russia")

scores_2020 <- c(5.124, 3.573, 6.940, 5.286, 5.871, 5.546)
scores_2019 <- c(5.191, 4.015, 6.892, 5.192, 5.886, 5.648)
scores_2018 <- c(5.246, 4.190, 6.886, 5.093, 5.915, 5.810)
scores_2017 <- c(5.273, 4.315, 6.993, 5.262, 5.920, 5.963)

six_countries_score <- data.frame(Countries = countries, Scores2020 = scores_2020, Scores2019 = scores_2019, Scores2018 = scores_2018, Scores2017 = scores_2017)

six_countries_score_long <- pivot_longer(six_countries_score, cols = starts_with("Scores"), names_to = "Year", values_to = "Scores")

# Order the levels chronologically so the x-axis runs from 2017 to 2020
six_countries_score_long$Year <- factor(six_countries_score_long$Year, levels = c("Scores2017", "Scores2018", "Scores2019", "Scores2020"), labels = c("2017", "2018", "2019", "2020"))

six_countries_score_long <- six_countries_score_long %>%
  arrange(Countries, Year)

ggplot(six_countries_score_long, aes(x = Year, y = Scores, color = Countries, group = Countries)) +
  geom_point(size = 4, alpha = 0.6) +  # no jitter, so points sit on the lines and show the exact scores
  geom_line(alpha = 0.7) +
  labs(title = "Happiness Scores of Six Countries from 2017-2020", 
       x = "Year", 
       y = "Scores") +
  theme_minimal()

7.4 Data Reading, Prepping, and Wrangling, Scatterplot 4.3

Happiness_project_data2019 <- read_excel(Happiness_project_datacomplete, sheet = 2)

Happiness_project_data2019_top_ten <- Happiness_project_data2019 %>%
  select (1:3) %>%
  slice (1:10)

names(Happiness_project_data2019_top_ten)[names(Happiness_project_data2019_top_ten) == "Country or region"] <- "Country"

Happiness_project_data2018 <- read_excel(Happiness_project_datacomplete, sheet = 3)

Happiness_project_data2018_top_ten <- Happiness_project_data2018 %>%
  select (1:3) %>%
  slice (1:10)

names(Happiness_project_data2018_top_ten)[names(Happiness_project_data2018_top_ten) == "Country or region"] <- "Country"

Happiness_project_data2017 <- read_excel(Happiness_project_datacomplete, sheet = 4)

Happiness_project_data2017_top_ten <- Happiness_project_data2017 %>%
  select (1, 3, 4) %>%
  slice (1:10)

Happiness_project_data2017_top_ten <- Happiness_project_data2017_top_ten %>%
  mutate(Score = as.numeric(Score))

names(Happiness_project_data2017_top_ten)[names(Happiness_project_data2017_top_ten) == "Overall Rank"] <- "Overall rank"

combined1_Happiness_project_data_top_ten <- merge(Happiness_project_data2020_top_ten, Happiness_project_data2019_top_ten, by = "Overall rank")

combined2_Happiness_project_data_top_ten <- merge(combined1_Happiness_project_data_top_ten, Happiness_project_data2018_top_ten, by = "Overall rank")

finalcombined_Happiness_project_data_top_ten <- merge(combined2_Happiness_project_data_top_ten, Happiness_project_data2017_top_ten, by = "Overall rank")

colnames(finalcombined_Happiness_project_data_top_ten) <- c("Overall_Rank", "Countries_2020", "Scores_2020", "Countries_2019", "Scores_2019", "Countries_2018", "Scores_2018", "Countries_2017", "Scores_2017")

countries_long <- finalcombined_Happiness_project_data_top_ten %>%
  select(Overall_Rank, starts_with("Countries")) %>%
  pivot_longer(
    cols = starts_with("Countries"),
    names_to = c("CountryType", "Year"),
    names_sep = "_",
    values_to = "Country"
  ) %>%
  mutate(Country = str_trim(Country))

scores_long <- finalcombined_Happiness_project_data_top_ten %>%
  select(Overall_Rank, starts_with("Scores")) %>%
  pivot_longer(
    cols = starts_with("Scores"),
    names_to = c("Metric", "Year"),
    names_sep = "_",
    values_to = "Score"
  )

finalcombined_Happiness_project_data_top_ten_long <- countries_long %>%
  full_join(scores_long, by = c("Overall_Rank", "Year")) %>%
  select(Overall_Rank, Year, Country, Score)

colors <- brewer.pal(n = 12, name = "Set3")

ggplot(finalcombined_Happiness_project_data_top_ten_long, aes(x = Year, y = Score, color = Country)) +
  geom_point(position = position_jitter(width = 0.2, height = 0.2), size = 4, alpha = 0.9) +
  scale_color_manual(values = colors) +  # Use the colors from the palette
  labs(title = "Top Ten Happiest Countries from 2017-2020", 
       x = "Year", 
       y = "Score") +
  theme_minimal() +
  theme(legend.position = "right")

7.5 Data Prepping, Wrangling, and Merging, Scatterplot 4.4

gdp_2020 <- Happiness_project_data2020[, c('Country or region', 'GDP per capita')]

happiness_2020 <- Happiness_project_data2020[, c('Country or region', 'Score')]

gdp_2019 <- Happiness_project_data2019[, c('Country or region', 'GDP per capita')]

happiness_2019 <- Happiness_project_data2019[, c('Country or region', 'Score')]

gdp_2018 <- Happiness_project_data2018[, c('Country or region', 'GDP per capita')]

happiness_2018 <- Happiness_project_data2018[, c('Country or region', 'Score')]

names(Happiness_project_data2017)[names(Happiness_project_data2017) == "Country"] <- "Country or region"

Happiness_project_data2017$`GDP per capita` <- as.numeric(Happiness_project_data2017$`GDP per capita`)

Happiness_project_data2017$`Score` <- as.numeric(Happiness_project_data2017$`Score`)

gdp_2017 <- Happiness_project_data2017[, c('Country or region', 'GDP per capita')]

happiness_2017 <- Happiness_project_data2017[, c('Country or region', 'Score')]

merged_data_question4 <- sqldf("
  SELECT
    gdp2017.`Country or region`,
    2017 AS year,
    gdp2017.`GDP per capita`,
    happiness2017.`Score`
  FROM
    gdp_2017 gdp2017
  JOIN
    happiness_2017 happiness2017 ON gdp2017.`Country or region` = happiness2017.`Country or region`
  
  UNION ALL
  
  SELECT
    gdp2018.`Country or region`,
    2018 AS year,
    gdp2018.`GDP per capita`,
    happiness2018.`Score`
  FROM
    gdp_2018 gdp2018
  JOIN
    happiness_2018 happiness2018 ON gdp2018.`Country or region` = happiness2018.`Country or region`
  
  UNION ALL
  
  SELECT
    gdp2019.`Country or region`,
    2019 AS year,
    gdp2019.`GDP per capita`,
    happiness2019.`Score`
  FROM
    gdp_2019 gdp2019
  JOIN
    happiness_2019 happiness2019 ON gdp2019.`Country or region` = happiness2019.`Country or region`
  
  UNION ALL
  
  SELECT
    gdp2020.`Country or region`,
    2020 AS year,
    gdp2020.`GDP per capita`,
    happiness2020.`Score`
  FROM
    gdp_2020 gdp2020
  JOIN
    happiness_2020 happiness2020 ON gdp2020.`Country or region` = happiness2020.`Country or region`
")

merged_data_question4 <- merged_data_question4 %>%
  filter(!is.na(`GDP per capita`), !is.na(`Score`), is.finite(`GDP per capita`), is.finite(`Score`))

ggplot(merged_data_question4, aes(x = `GDP per capita`, y = `Score`, color = as.factor(year))) +
  geom_point(size = 3, alpha = 0.7) +  
  geom_smooth(method = "lm", se = FALSE, color = "black") +  # One line of best fit
  labs(title = "GDP per Capita vs. Happiness Score 2017-2020",
       x = "GDP per Capita",
       y = "Happiness Score",
       color = "Year") +
  theme_minimal()
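
Because each yearly table already holds both the GDP and the score columns, the same stacked data frame could also be assembled without SQL. The following sketch uses dplyr's bind_rows() with its `.id` argument; the `stack_years()` and `make_toy()` helpers and the placeholder rows are hypothetical and only illustrate the shape of the result.

```r
library(dplyr)

# Hypothetical helper: stack a named list of yearly tables and keep four columns
stack_years <- function(tables) {
  bind_rows(tables, .id = "year") %>%
    mutate(year = as.integer(year)) %>%
    select(`Country or region`, year, `GDP per capita`, Score)
}

# Placeholder tables with made-up values
make_toy <- function(gdp, score) {
  data.frame(
    `Country or region` = c("A", "B"),
    `GDP per capita`    = gdp,
    Score               = score,
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

stacked <- stack_years(list(
  "2019" = make_toy(c(1.0, 0.5), c(6.0, 4.5)),
  "2020" = make_toy(c(1.1, 0.6), c(6.1, 4.4))
))

stacked
```

In the actual analysis, the list would hold the four yearly data frames, each named by its year.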

7.6 Data Scraping, Wrangling and Boxplot

scrape = function(path){
  
  url <- "https://en.wikipedia.org/wiki/World_Happiness_Report"
 
  web_page <- read_html(url)
  
  happiness_table <- web_page %>%
    html_node(xpath = path) %>%
    html_table(fill = TRUE)
  
  return(happiness_table)
}

t2020 = scrape('//*[@id="mw-content-text"]/div[1]/div[40]/table/tbody/tr[2]/td/table')

t2020 = rename(t2020, country = `Country or region`)

t2020$year = 2020

# GNI2014 ships with the treemap package and supplies a continent for each country
data(GNI2014)

data_with_continental = merge(t2020, GNI2014, by = "country", all.x = TRUE)
data_with_continental = data_with_continental[c("country", "Score", "GDP per capita", "year", "continent")]

na_countries <- data_with_continental %>%
  filter(is.na(continent))

continent_mapping <- data.frame(
  country = c("Congo (Brazzaville)", "Congo (Kinshasa)", "Eswatini", "Gambia", "Hong Kong", "Iran", "Ivory Coast", "Kyrgyzstan", "Laos", "North Cyprus", "North Macedonia", "Palestine", "Slovakia", "South Korea", "Taiwan", "Venezuela", "Yemen"),
  continent = c("Africa", "Africa", "Africa", "Africa", "Asia", "Asia", "Africa", "Asia", "Asia", "Europe", "Europe", "Asia", "Europe", "Asia", "Asia", "South America", "Asia")          
)

data_with_continental$continent <- as.character(data_with_continental$continent)

data_with_continental <- data_with_continental %>%
  mutate(continent = case_when(
    country == "Maldives" ~ "Asia",
    country == "Mauritius" ~ "Africa",
    TRUE ~ continent
  ))

data_with_continental <- data_with_continental %>%
  left_join(continent_mapping, by = "country", suffix = c("", "_map")) %>%
  mutate(continent = ifelse(is.na(continent) | continent == "", continent_map, continent)) %>%
  select(-continent_map)

data_with_continental$continent[is.na(data_with_continental$continent)] <- "Unassigned"

data_with_continental$continent <- factor(data_with_continental$continent)

ggplot(data_with_continental, aes(x = continent, y = Score)) +
  geom_boxplot() +
  labs(title = "Distribution of Happiness Scores by Continent (2020)",
       x = "Continent",
       y = "Happiness Score") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

7.7 Data Cleaning and Bubbleplot

bubble_plotdata <- Happiness_project_data2020[, c('Healthy life expectancy', 'Social support', 'Score')]

bubble_plotdata$`Healthy life expectancy` <- (bubble_plotdata$`Healthy life expectancy` * 100)

bubble_plotdata <- bubble_plotdata %>%
  filter(!(`Healthy life expectancy` == 0 & `Social support` == 0))

ggplot(bubble_plotdata, aes(x = `Healthy life expectancy`, y = `Social support`, 
                             size = Score, color = Score)) +
  geom_point(alpha = 0.7) +  
  scale_size_continuous(range = c(2, 12), name = "Happiness Score") +  
  scale_color_gradient(low = "blue", high = "red", name = "Happiness Score") +  
  labs(title = "Impact of Healthy Life Expectancy and Social Support on Happiness in 2020",
       subtitle = "Bubble size and color indicate happiness score",
       x = "Healthy Life Expectancy (years)",
       y = "Social Support") +
  scale_y_continuous(limits = c(0, 2)) + 
  scale_x_continuous(breaks = seq(0, max(bubble_plotdata$`Healthy life expectancy`), by = 10)) +
  theme_minimal() +  
  guides(size = guide_legend(title = "Happiness Score"), 
         color = guide_legend(title = "Happiness Score")) + 
  theme(legend.title = element_text(size = 10),
        legend.text = element_text(size = 8))

7.8 Treemap

7.8.1 Create Treemap PNG

treemap_data <- Happiness_project_data2020[, c('Country or region', 'Overall rank', 'Score')]

png("treemap.png", width = 1200, height = 800)
treemap(treemap_data,
        index = c("Country or region", "Overall rank"),
        vSize = "Score",
        vColor = "Overall rank",
        type = "value", 
        fontsize.labels = 18,
        fontsize.legend = 10,  # treemap() sizes legend text with fontsize.legend
        title = "Happiness Project 2020 Countries based on Overall rank"
)
dev.off()

7.8.2 Display PNG

knitr::include_graphics("treemap.png")

7.9 Data Wrangling and Choropleth Map

world_map <- map_data("world")

unique_world_countries <- unique(world_map$region)
unique_data_countries <- unique(data_with_continental$country)

missing_countries <- setdiff(unique_data_countries, unique_world_countries)

data_with_continental <- data_with_continental %>%
  # Rename countries to match the region names used by map_data("world")
  mutate(country = recode(country,
                          "United States" = "USA",
                          "United Kingdom" = "UK",
                          "Congo (Brazzaville)" = "Republic of Congo",
                          "Congo (Kinshasa)" = "Democratic Republic of Congo",
                          "Eswatini" = "Swaziland")) %>%
  # The map stores Trinidad and Tobago as two separate regions
  bind_rows(data.frame(country = c("Trinidad", "Tobago"),
                       Score = c(6.192, 6.192),
                       continent = c("North America", "North America")))

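merged_data_choropleth <- merge(world_map, data_with_continental, by.x = "region", by.y = "country", all.x = TRUE)

# merge() does not preserve row order, so restore the polygon drawing order
merged_data_choropleth <- merged_data_choropleth[order(merged_data_choropleth$group, merged_data_choropleth$order), ]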

ggplot(merged_data_choropleth, aes(x = long, y = lat, group = group, fill = Score)) +
  geom_polygon() +
  coord_fixed(ratio = 1.3) +
  scale_fill_gradient(low = "red", high = "green", name = "Happiness Score") +
  xlab("Longitude") +  
  ylab("Latitude") +
  ggtitle("Choropleth Map of Happiness Scores by Country 2020") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))