This data set, drawn from the Gallup World Poll and published by the UN Sustainable Development Solutions Network in the World Happiness Report 2016, lists countries with their Happiness Score and related factors such as Economy, Health, Freedom, Corruption and Generosity. This analysis explores how these variables relate to a country's overall happiness.
Source: World Happiness Report 2016 (https://www.worldhappiness.report/ed/2016/)
Loading libraries and dataset
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.2.0 ✔ readr 2.1.6
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.2 ✔ tibble 3.3.1
✔ lubridate 1.9.5 ✔ tidyr 1.3.2
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 153 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country, Region
dbl (10): Happiness Rank, Happiness Score, Economy, Family, Health, Freedom,...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
model2 <- lm(`Happiness Score` ~ Economy + Freedom, data = happiness_clean)
summary(model2)
Call:
lm(formula = `Happiness Score` ~ Economy + Freedom, data = happiness_clean)

Residuals:
     Min       1Q   Median       3Q      Max
-2.07379 -0.36771  0.07261  0.35958  1.36065

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   2.5461     0.1508  16.886  < 2e-16 ***
Economy       1.8632     0.1199  15.541  < 2e-16 ***
Freedom       2.3814     0.3355   7.097 4.72e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5781 on 150 degrees of freedom
Multiple R-squared:  0.744,  Adjusted R-squared:  0.7406
F-statistic:   218 on 2 and 150 DF,  p-value: < 2.2e-16
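To make the coefficients above concrete, here is a small worked example that plugs values into the fitted equation Happiness Score = 2.5461 + 1.8632 × Economy + 2.3814 × Freedom. The coefficients are copied from the output; the Economy and Freedom inputs are hypothetical values chosen only for illustration and do not describe any real country.

```r
# Coefficients copied from the summary(model2) output
b_intercept <- 2.5461
b_economy   <- 1.8632
b_freedom   <- 2.3814

# Hypothetical inputs (assumed for illustration, not taken from the data)
economy_example <- 1.0
freedom_example <- 0.4

# Predicted Happiness Score from the fitted linear equation
predicted_score <- b_intercept +
  b_economy * economy_example +
  b_freedom * freedom_example

predicted_score
# 5.36186
```

With the fitted model object available, predict(model2, newdata = ...) on a data frame holding Economy and Freedom columns should give the same kind of result without copying coefficients by hand.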
Creating boxplot
p1 <- happiness_clean |>
  ggplot(aes(x = Region, y = `Happiness Score`, fill = Region)) +
  geom_boxplot() +
  labs(
    x = "Region",
    y = "Happiness Score",
    title = "Happiness Score Distribution by Region",
    fill = "Region",
    caption = "Source: World Happiness Report(2016)"
  ) +
  scale_fill_manual(values = c(
    "Western Europe" = "blue",
    "North America" = "red",
    "Asia-Pacific" = "green",
    "Latin America" = "orange",
    "Eastern Europe" = "purple",
    "Africa" = "brown"
  )) +
  theme_minimal()

p1
Brief Essay
I cleaned the data set by removing rows with missing values, using the filter() function with !is.na() on key variables such as Happiness Score and Economy. This ensured that the analysis used only complete observations for those variables.
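The cleaning step described here can be sketched as follows. This is a minimal illustration, not the original code: the toy data frame happiness_raw and its values are made up, and only the two variables named in the essay are filtered.

```r
library(dplyr)

# Toy stand-in for the raw data; names and values are hypothetical
happiness_raw <- tibble(
  Country = c("A", "B", "C", "D"),
  `Happiness Score` = c(7.1, NA, 5.3, 4.2),
  Economy = c(1.4, 1.1, NA, 0.6)
)

# Keep only rows where both key variables are present
happiness_clean <- happiness_raw |>
  filter(!is.na(`Happiness Score`), !is.na(Economy))

nrow(happiness_clean)
# 2
```

Listing the conditions separately inside filter() keeps a row only when every condition holds, so a country missing either value is dropped.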
The box plot shows the distribution of happiness scores across regions. Each region is represented by a box: Western Europe and North America have the highest happiness scores, while Africa and Eastern Europe have the lowest. Giving each region its own color makes the regional differences visible immediately.
I wanted to add jittered points with geom_jitter() to show individual country scores, but the plot became overcrowded. Adding a time series was not possible because my dataset contains only one year.
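One possible way to reduce the overcrowding from jittered points is to make them small and semi-transparent and to hide the box plot's own outlier markers so that no country is drawn twice. The sketch below uses a made-up toy data frame (toy_scores, with hypothetical regions and scores), and the width, size and alpha values are assumptions that would need tuning on the real data.

```r
library(ggplot2)

# Toy data for illustration only; regions and scores are hypothetical
toy_scores <- data.frame(
  Region = rep(c("Region A", "Region B"), each = 3),
  Score = c(7.2, 6.9, 7.5, 4.1, 4.8, 3.9)
)

p2 <- ggplot(toy_scores, aes(x = Region, y = Score, fill = Region)) +
  # Hide box plot outliers, since the jitter layer already draws every point
  geom_boxplot(outlier.shape = NA, alpha = 0.6) +
  # Small, semi-transparent points; height = 0 keeps the scores unchanged
  geom_jitter(width = 0.15, height = 0, size = 1, alpha = 0.5,
              show.legend = FALSE) +
  theme_minimal()

p2
```

Setting height = 0 matters here: jitter is applied only horizontally, so each point still sits at its true score.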