I am using a data set from the Wellbeing Research Centre, Gallup, the UN Sustainable Development Solutions Network, and their Editorial Board. It reports a “happiness score” for 153 countries in different regions. Other variables include levels of social support, generosity, GDP per capita, and freedom to make life choices in those countries. For my plots, I decided to look only at the correlation between happiness and life expectancy. I was hoping to find out whether people in happier countries tend to live longer.
First, I load the packages I need for data wrangling and plotting.
library(tidyverse)
library(plotly)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.5.2
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.1.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
Rows: 153 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Next, I clean up the data by simplifying the variable names (lowercase, with underscores instead of spaces), then summarize it.
# A tibble: 6 × 20
country_name regional_indicator ladder_score standard_error_of_ladder_score
<chr> <chr> <dbl> <dbl>
1 Finland Western Europe 7.81 0.0312
2 Denmark Western Europe 7.65 0.0335
3 Switzerland Western Europe 7.56 0.0350
4 Iceland Western Europe 7.50 0.0596
5 Norway Western Europe 7.49 0.0348
6 Netherlands Western Europe 7.45 0.0278
# ℹ 16 more variables: upperwhisker <dbl>, lowerwhisker <dbl>,
# logged_gdp_per_capita <dbl>, social_support <dbl>,
# healthy_life_expectancy <dbl>, freedom_to_make_life_choices <dbl>,
# generosity <dbl>, perceptions_of_corruption <dbl>,
# ladder_score_in_dystopia <dbl>, `explained_by:_log_gdp_per_capita` <dbl>,
# `explained_by:_social_support` <dbl>,
# `explained_by:_healthy_life_expectancy` <dbl>, …
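The cleaning code is not shown in the rendered output. The column names above (for example `explained_by:_log_gdp_per_capita`, which keeps its colon) are consistent with lowercasing the names and replacing spaces with underscores, so a minimal sketch of that step might look like the following. The `toy` data frame and its values are made up for illustration; they are an assumption, not the real data file.

```r
# Toy stand-in for the raw data: column names mimic the original headers,
# values are illustrative only (assumption, not taken from the real file).
toy <- data.frame(
  `Country name` = c("A", "B"),
  `Ladder score` = c(7.8, 7.6),
  `Explained by: Log GDP per capita` = c(1.3, 1.2),
  check.names = FALSE  # keep spaces and colons in the names as-is
)

# Lowercase every name and replace each space with an underscore.
names(toy) <- tolower(gsub(" ", "_", names(toy)))

names(toy)
```

Punctuation such as `:` survives this approach, which is why some cleaned names still need backticks when used in code.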
summary(happiness)
country_name regional_indicator ladder_score
Length:153 Length:153 Min. :2.567
Class :character Class :character 1st Qu.:4.724
Mode :character Mode :character Median :5.515
Mean :5.473
3rd Qu.:6.228
Max. :7.809
standard_error_of_ladder_score upperwhisker lowerwhisker
Min. :0.02590 Min. :2.628 Min. :2.506
1st Qu.:0.04070 1st Qu.:4.826 1st Qu.:4.603
Median :0.05061 Median :5.608 Median :5.431
Mean :0.05354 Mean :5.578 Mean :5.368
3rd Qu.:0.06068 3rd Qu.:6.364 3rd Qu.:6.139
Max. :0.12059 Max. :7.870 Max. :7.748
logged_gdp_per_capita social_support healthy_life_expectancy
Min. : 6.493 Min. :0.3195 Min. :45.20
1st Qu.: 8.351 1st Qu.:0.7372 1st Qu.:58.96
Median : 9.456 Median :0.8292 Median :66.31
Mean : 9.296 Mean :0.8087 Mean :64.45
3rd Qu.:10.265 3rd Qu.:0.9067 3rd Qu.:69.29
Max. :11.451 Max. :0.9747 Max. :76.80
freedom_to_make_life_choices generosity perceptions_of_corruption
Min. :0.3966 Min. :-0.30091 Min. :0.1098
1st Qu.:0.7148 1st Qu.:-0.12701 1st Qu.:0.6830
Median :0.7998 Median :-0.03366 Median :0.7831
Mean :0.7834 Mean :-0.01457 Mean :0.7331
3rd Qu.:0.8777 3rd Qu.: 0.08543 3rd Qu.:0.8492
Max. :0.9750 Max. : 0.56066 Max. :0.9356
ladder_score_in_dystopia explained_by:_log_gdp_per_capita
Min. :1.972 Min. :0.0000
1st Qu.:1.972 1st Qu.:0.5759
Median :1.972 Median :0.9185
Mean :1.972 Mean :0.8688
3rd Qu.:1.972 3rd Qu.:1.1692
Max. :1.972 Max. :1.5367
explained_by:_social_support explained_by:_healthy_life_expectancy
Min. :0.0000 Min. :0.0000
1st Qu.:0.9867 1st Qu.:0.4954
Median :1.2040 Median :0.7598
Mean :1.1556 Mean :0.6929
3rd Qu.:1.3871 3rd Qu.:0.8672
Max. :1.5476 Max. :1.1378
explained_by:_freedom_to_make_life_choices explained_by:_generosity
Min. :0.0000 Min. :0.0000
1st Qu.:0.3815 1st Qu.:0.1150
Median :0.4833 Median :0.1767
Mean :0.4636 Mean :0.1894
3rd Qu.:0.5767 3rd Qu.:0.2555
Max. :0.6933 Max. :0.5698
explained_by:_perceptions_of_corruption dystopia_+_residual
Min. :0.00000 Min. :0.2572
1st Qu.:0.05580 1st Qu.:1.6299
Median :0.09844 Median :2.0463
Mean :0.13072 Mean :1.9723
3rd Qu.:0.16306 3rd Qu.:2.3503
Max. :0.53316 Max. :3.4408
Next, I plot the two variables I want to check for correlation: happiness and life expectancy.
p1 <- ggplot(happiness, aes(x = ladder_score, y = healthy_life_expectancy)) +
  labs(title = "Correlation Between Life Expectancy and Happiness",
       caption = "Source: ",
       x = "Happiness Level", y = "Life Expectancy") +
  theme_minimal(base_size = 12)
p1 + geom_point()
Moving both axes to start at 0
p2 <- p1 + geom_point() + xlim(0, 10) + ylim(0, 77)
p2
Linear regression with confidence interval
p3 <- p2 + geom_smooth(method = 'lm', formula = y ~ x) # lm = linear model
p3
Fitting a linear model to get the coefficients for the regression equation.
fit1 <- lm(healthy_life_expectancy ~ ladder_score, data = happiness) # lm(y ~ x)
summary(fit1)
Call:
lm(formula = healthy_life_expectancy ~ ladder_score, data = happiness)
Residuals:
Min 1Q Median 3Q Max
-13.7689 -2.4393 -0.0655 2.6068 12.1445
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.6923 1.8388 20.50 <2e-16 ***
ladder_score 4.8880 0.3293 14.85 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 4.515 on 151 degrees of freedom
Multiple R-squared: 0.5934, Adjusted R-squared: 0.5907
F-statistic: 220.4 on 1 and 151 DF, p-value: < 2.2e-16
Using the output above, the fitted line is Life Expectancy = 4.89 × (happiness score) + 37.69. This means that each one-point increase in a country's happiness score is associated with an increase in healthy life expectancy of about 4.89 years. The p-value for “ladder_score” (which is the happiness score) has three asterisks, meaning it is statistically significant at the 0.001 level. The adjusted R-squared shows that around 59% of the variation in life expectancy is explained by the model, which leaves about 41% unexplained. I still think above 50% is a pretty good number for a single predictor. I could add more variables to help explain why life expectancy increases, but for now I only want to see how strong the correlation is between happiness and life expectancy.
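As a quick worked example, the coefficients can be plugged in by hand. The sketch below copies the rounded estimates from the regression output; `predict_life` is a hypothetical helper name introduced only for illustration.

```r
# Coefficients copied from the summary(fit1) output above.
intercept <- 37.6923
slope <- 4.8880

# Hypothetical helper: predicted healthy life expectancy for a ladder score.
predict_life <- function(score) intercept + slope * score

predict_life(5)                    # 62.1323
predict_life(6) - predict_life(5)  # 4.888, i.e. the slope

# In a one-predictor regression, R-squared is the squared correlation,
# so the correlation is its square root (positive, since the slope is positive).
r <- sqrt(0.5934)
round(r, 2)                        # 0.77
```

With the real model object, `predict(fit1, newdata = data.frame(ladder_score = 5))` should give the same kind of prediction without retyping the coefficients.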
Next, I group by region and find the mean life expectancy and mean happiness score for each region. (The column `sum_life` holds the mean life expectancy, despite its name.)
# A tibble: 10 × 3
regional_indicator sum_life avg_happy
<chr> <dbl> <dbl>
1 Central and Eastern Europe 68.1 5.88
2 Commonwealth of Independent States 64.7 5.36
3 East Asia 71.1 5.71
4 Latin America and Caribbean 66.7 5.98
5 Middle East and North Africa 65.3 5.23
6 North America and ANZ 72.2 7.17
7 South Asia 62.4 4.48
8 Southeast Asia 64.7 5.38
9 Sub-Saharan Africa 55.1 4.38
10 Western Europe 72.9 6.90
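The grouping code is not shown in the rendered output. A minimal sketch that produces a table of the same shape, assuming dplyr and using the column names from the table above, could look like this. The `toy` data and its values are illustrative only, not the real data set.

```r
library(dplyr)

# Toy stand-in for `happiness`; values are illustrative only (assumption).
toy <- data.frame(
  regional_indicator = c("Region A", "Region A", "Region B"),
  healthy_life_expectancy = c(70, 72, 60),
  ladder_score = c(7.0, 6.0, 4.5)
)

# One row per region, with the mean of each variable.
grouped_toy <- toy %>%
  group_by(regional_indicator) %>%
  summarize(
    sum_life  = mean(healthy_life_expectancy),  # a mean, despite the name
    avg_happy = mean(ladder_score)
  )

grouped_toy
```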
I plotted the grouped data as a scatter plot with one point per region to show the trend across regions.
I decided this was too plain, so I changed it in the next plot.
p1 <- ggplot(grouped_happy, aes(x = avg_happy, y = sum_life, group = regional_indicator, color = regional_indicator)) +
  labs(title = "Correlation Between Life Expectancy and Happiness",
       caption = "Source: ",
       x = "Happiness Level", y = "Life Expectancy",
       color = "Region") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1")
p1 + geom_point(size = 5)
Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
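The two warnings above appear to be related: `Set1` defines at most 9 colors but there are 10 regions, so the tenth region most likely gets no color and its point is dropped. A quick check of the palette sizes, assuming the RColorBrewer package is installed (the warning text suggests ggplot2 is calling it):

```r
library(RColorBrewer)

n_regions <- 10  # number of regions in the grouped table

# Maximum number of colors each qualitative palette defines.
max_set1 <- brewer.pal.info["Set1", "maxcolors"]  # 9
max_set3 <- brewer.pal.info["Set3", "maxcolors"]  # 12

max_set1 >= n_regions  # FALSE: one region is left without a color
max_set3 >= n_regions  # TRUE: enough colors for every region
```

A palette with at least 10 colors, such as `Set3`, avoids both warnings.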
I plotted a new graph (this is my final visualization; please grade this one).
This time I decided it would be more aesthetically pleasing not to collapse each region into a single dot. Instead, I kept one dot per country and used color alone to identify the region.
p1 <- ggplot(happiness, aes(x = ladder_score, y = healthy_life_expectancy, group = regional_indicator, color = regional_indicator)) +
  labs(title = "Correlation Between Happiness Level and Life Expectancy \n by Country in Region",
       caption = "Source: Wellbeing Research Centre, Gallup, \n the UN Sustainable Development Solutions Network, and their Editorial Board ",
       x = "Happiness Level", y = "Life Expectancy",
       color = "Region") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set3")
p1 + geom_point(size = 2.5)
Ending thoughts and essay
For my final visualization, I first simplified the variable names by removing capital letters and replacing spaces with underscores. The data set had no missing values (NAs), so I did not need to remove any rows. I then grouped by region so the plot was not crowded with country names and was easier to read. The visualization is meant to show the correlation between happiness level and life expectancy, and also which regions have the happiest and longest-lived countries. In the final plot I did not collapse each region into a single dot; I showed region only by color because I wanted to show how the countries within each region vary from one another.

I had no prior assumption, and I was not too surprised by the result: most countries in a region sit close together. However, one country in the Latin America and Caribbean region strays from its cluster. I could use the data to look at that region specifically, to see which country it is and why it differs.

The happiest countries are in Western Europe (the top six all belong to it), although by regional average North America and ANZ scores slightly higher (7.17 versus 6.90). Western Europe also has the highest average life expectancy (72.9 years), which fits the positive correlation in the graph. Sub-Saharan Africa has the lowest average life expectancy and happiness level. It would be interesting to compare other variables across those two regions to see which ones also factor into life expectancy.

Overall, I think I was able to show the data I wanted to show. However, I would have really liked an interactive version where I could see which country each dot belongs to, and perhaps include even more variables.
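One possible route to that interactive version is `ggplotly()` from the plotly package, which is already loaded at the top. The sketch below is only an outline under stated assumptions: the `toy` data frame and its values are made up, and the `text` aesthetic is used here purely to feed the hover tooltip.

```r
library(ggplot2)
library(plotly)

# Toy stand-in for `happiness`; names and values are illustrative only.
toy <- data.frame(
  country_name = c("Country A", "Country B", "Country C"),
  regional_indicator = c("Region 1", "Region 1", "Region 2"),
  ladder_score = c(7.5, 6.8, 4.4),
  healthy_life_expectancy = c(72, 70, 55)
)

# `text` is not drawn by geom_point(); ggplotly() picks it up for the tooltip.
p <- ggplot(toy, aes(x = ladder_score, y = healthy_life_expectancy,
                     color = regional_indicator, text = country_name)) +
  geom_point(size = 2.5) +
  theme_minimal(base_size = 12)

# Convert the static plot; hovering a dot shows the country and both values.
interactive_plot <- ggplotly(p, tooltip = c("text", "x", "y"))
interactive_plot
```

Swapping `toy` for `happiness` should give one hoverable dot per country, which would also make it easy to identify the Latin America and Caribbean outlier.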