World Happiness Report

Author

D Devkota

Introduction

This data set, drawn from the Gallup World Poll and published by the UN Sustainable Development Solutions Network (World Happiness Report 2016), covers countries, their Happiness scores, and related factors such as Economy, Health, Freedom, Corruption, and Generosity. This analysis explores how these variables relate to a country’s overall happiness.

Source: World Happiness Report 2016 (https://www.worldhappiness.report/ed/2016/)

Loading libraries and dataset

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.1.6
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

happiness <- readr::read_csv("World Happiness Report.csv")

Rows: 153 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country, Region
dbl (10): Happiness Rank, Happiness Score, Economy, Family, Health, Freedom,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning the data

names(happiness)

 [1] "Country"          "Happiness Rank"   "Happiness Score"  "Economy"         
 [5] "Family"           "Health"           "Freedom"          "Generosity"      
 [9] "Corruption"       "Dystopia"         "Job Satisfaction" "Region"

happiness_clean <- happiness |>
  filter(!is.na(`Happiness Score`),
         !is.na(`Economy`),
         !is.na(`Health`),
         !is.na(`Region`))
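
Before filtering, it can help to count the missing values in each column so the number of dropped rows is known. The sketch below uses a small made-up tibble (the values are illustrative, not taken from the report; only the column names mirror the real data) and shows tidyr::drop_na() as an equivalent alternative to the chained !is.na() conditions.

```r
library(dplyr)
library(tidyr)

# Toy data with made-up values; only the column names mirror the real dataset.
toy <- tibble(
  Country = c("A", "B", "C", "D"),
  `Happiness Score` = c(7.1, NA, 5.3, 4.8),
  Economy = c(1.4, 1.1, NA, 0.6),
  Health = c(0.8, 0.7, 0.6, 0.4),
  Region = c("Western Europe", "Africa", "Asia-Pacific", NA)
)

# Count missing values per column before filtering.
na_counts <- colSums(is.na(toy))

# Same pattern as the filter above: keep rows with no NA in the key columns.
toy_filtered <- toy |>
  filter(!is.na(`Happiness Score`),
         !is.na(Economy),
         !is.na(Health),
         !is.na(Region))

# Equivalent result with drop_na() restricted to the same columns.
toy_dropped <- toy |>
  drop_na(`Happiness Score`, Economy, Health, Region)

# Number of rows removed by the filter.
rows_removed <- nrow(toy) - nrow(toy_filtered)
```

Comparing nrow() before and after the filter on the real data would show whether any countries were actually removed.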

Backward elimination

model2 <- lm(`Happiness Score` ~ `Economy` + `Freedom`, data = happiness_clean)
summary(model2)


Call:
lm(formula = `Happiness Score` ~ Economy + Freedom, data = happiness_clean)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.07379 -0.36771  0.07261  0.35958  1.36065 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.5461     0.1508  16.886  < 2e-16 ***
Economy       1.8632     0.1199  15.541  < 2e-16 ***
Freedom       2.3814     0.3355   7.097 4.72e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5781 on 150 degrees of freedom
Multiple R-squared:  0.744, Adjusted R-squared:  0.7406 
F-statistic:   218 on 2 and 150 DF,  p-value: < 2.2e-16
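
The output above is the final two-predictor model; the elimination steps themselves are not shown. As a rough sketch of how a backward elimination could be run in R, the block below fits a full model on simulated data and lets stats::step() remove terms one at a time. Note that step() selects by AIC rather than by p-values, so it is a stand-in and not necessarily the procedure used here. The lowercase variable names and all simulated values are assumptions made for illustration.

```r
# Simulated data (not the report data): score depends on economy and freedom,
# while generosity is pure noise by construction.
set.seed(42)
n <- 100
sim <- data.frame(
  economy    = runif(n, 0, 1.8),
  freedom    = runif(n, 0, 0.6),
  generosity = runif(n, 0, 0.8)
)
sim$score <- 2.5 + 1.9 * sim$economy + 2.4 * sim$freedom + rnorm(n, sd = 0.5)

# Start from the full model with every candidate predictor.
full_model <- lm(score ~ economy + freedom + generosity, data = sim)

# Backward elimination by AIC: at each step, drop the term whose removal
# lowers AIC the most, and stop when no removal improves AIC.
reduced_model <- step(full_model, direction = "backward", trace = 0)

# Predictors that survived the elimination.
kept_terms <- attr(terms(reduced_model), "term.labels")
kept_terms
```

On the real data, the full model would list all candidate columns (for example Economy, Family, Health, Freedom, Generosity, Corruption), and summary() of the reduced model could then be compared with the output above.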

Creating boxplot

p1 <- happiness_clean |>
  ggplot(aes(x = Region, y = `Happiness Score`, fill = Region)) +
  geom_boxplot() +
  labs(
    x = "Region",
    y = "Happiness Score",
    title = "Happiness Score Distribution by Region",
    fill = "Region",
    caption = "Source: World Happiness Report (2016)"
  ) +
  scale_fill_manual(values = c(
    "Western Europe" = "blue",
    "North America"  = "red",
    "Asia-Pacific"   = "green",
    "Latin America"  = "orange",
    "Eastern Europe" = "purple",
    "Africa"         = "brown"
  )) +
  theme_minimal()

p1

Brief Essay

I cleaned the data set by using filter() with !is.na() to drop rows with missing values in the key variables: Happiness Score, Economy, Health, and Region. This ensures that every row used in the analysis has complete values for those variables.

The box plot shows the distribution of happiness scores across regions. Each region is represented by one box. Western Europe and North America have the highest happiness scores, while Africa and Eastern Europe have the lowest. Giving each region its own color makes the regional differences visible immediately.

I wanted to add jittered points with geom_jitter() to show individual country scores, but the plot became overcrowded. Adding a time series was not possible because my dataset contains only one year.
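
One possible way to reduce the overcrowding from jittered points is to make them small and semi-transparent and to hide the boxplot's own outlier markers so no point is drawn twice. The sketch below tries this on simulated scores for three hypothetical regions; the region names, values, and the specific width, alpha, and size settings are assumptions to be tuned, not results from the report.

```r
library(ggplot2)

# Simulated scores for three hypothetical regions (not the report data).
set.seed(1)
toy <- data.frame(
  Region = rep(c("Region A", "Region B", "Region C"), each = 20),
  Score  = c(rnorm(20, 6.5, 0.5), rnorm(20, 5.5, 0.7), rnorm(20, 4.5, 0.6))
)

p_jitter <- ggplot(toy, aes(x = Region, y = Score)) +
  # Hide the boxplot's outlier markers; the jitter layer already draws every point.
  geom_boxplot(outlier.shape = NA) +
  # Jitter horizontally only, so the plotted scores stay at their true values.
  geom_jitter(width = 0.15, height = 0, alpha = 0.5, size = 1.2) +
  theme_minimal()

p_jitter
```

Setting height = 0 matters here: vertical jitter would shift the plotted happiness scores away from their actual values.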