Happiness Project 1

Author

E Lott

Intro

I am using a data set from the Wellbeing Research Centre, Gallup, the UN Sustainable Development Solutions Network, and their Editorial Board. This data set gives a “happiness score” for 153 countries across 10 regions. Other variables include levels of social support, generosity, GDP per capita, and freedom to make life choices in those countries. For my plots, I decided to look only at the correlation between happiness and healthy life expectancy. I was hoping to find out whether countries with happier people also tend to have a longer life expectancy.

First, I load the packages I need for data wrangling and plotting

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.5.2
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.1.0     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

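library(RColorBrewer)  # color palettes for the plots
library(ggplot2)       # already attached by tidyverse; loading it again is harmless
library(plotly)        # interactive plots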


Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Reading the CSV file into R

setwd("C:/Users/Erika/OneDrive/Desktop/DATA 110")
happiness <- read_csv("happiness2020.csv")

Rows: 153 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning up the data by making the variable names lowercase and replacing spaces with underscores. Then summarizing.

names(happiness) <- tolower(names(happiness))
names(happiness) <- gsub(" ","_",names(happiness))
head(happiness)

# A tibble: 6 × 20
  country_name regional_indicator ladder_score standard_error_of_ladder_score
  <chr>        <chr>                     <dbl>                          <dbl>
1 Finland      Western Europe             7.81                         0.0312
2 Denmark      Western Europe             7.65                         0.0335
3 Switzerland  Western Europe             7.56                         0.0350
4 Iceland      Western Europe             7.50                         0.0596
5 Norway       Western Europe             7.49                         0.0348
6 Netherlands  Western Europe             7.45                         0.0278
# ℹ 16 more variables: upperwhisker <dbl>, lowerwhisker <dbl>,
#   logged_gdp_per_capita <dbl>, social_support <dbl>,
#   healthy_life_expectancy <dbl>, freedom_to_make_life_choices <dbl>,
#   generosity <dbl>, perceptions_of_corruption <dbl>,
#   ladder_score_in_dystopia <dbl>, `explained_by:_log_gdp_per_capita` <dbl>,
#   `explained_by:_social_support` <dbl>,
#   `explained_by:_healthy_life_expectancy` <dbl>, …
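
The two cleanup lines above still leave a few names, such as explained_by:_log_gdp_per_capita and dystopia_+_residual, that need backticks because of the colon and the plus sign. A minimal sketch of a stricter cleanup is shown here. The helper simplify_names is a hypothetical name, not part of the original analysis, and the raw names are reconstructed from the cleaned ones in the output above, so their original capitalization is an assumption.

```r
# Hypothetical helper: lowercase the names, then collapse every run of
# non-alphanumeric characters into a single underscore.
simplify_names <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z0-9]+", "_", x)
  gsub("^_+|_+$", "", x)  # drop any leading or trailing underscore
}

# Raw names reconstructed from the output above (capitalization assumed)
raw_names <- c("Country name",
               "Explained by: Log GDP per capita",
               "Dystopia + residual")
simplify_names(raw_names)
```

With the real data, this could be applied as names(happiness) <- simplify_names(names(happiness)) in place of the two lines above.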

summary(happiness)

 country_name       regional_indicator  ladder_score  
 Length:153         Length:153         Min.   :2.567  
 Class :character   Class :character   1st Qu.:4.724  
 Mode  :character   Mode  :character   Median :5.515  
                                       Mean   :5.473  
                                       3rd Qu.:6.228  
                                       Max.   :7.809  
 standard_error_of_ladder_score  upperwhisker    lowerwhisker  
 Min.   :0.02590                Min.   :2.628   Min.   :2.506  
 1st Qu.:0.04070                1st Qu.:4.826   1st Qu.:4.603  
 Median :0.05061                Median :5.608   Median :5.431  
 Mean   :0.05354                Mean   :5.578   Mean   :5.368  
 3rd Qu.:0.06068                3rd Qu.:6.364   3rd Qu.:6.139  
 Max.   :0.12059                Max.   :7.870   Max.   :7.748  
 logged_gdp_per_capita social_support   healthy_life_expectancy
 Min.   : 6.493        Min.   :0.3195   Min.   :45.20          
 1st Qu.: 8.351        1st Qu.:0.7372   1st Qu.:58.96          
 Median : 9.456        Median :0.8292   Median :66.31          
 Mean   : 9.296        Mean   :0.8087   Mean   :64.45          
 3rd Qu.:10.265        3rd Qu.:0.9067   3rd Qu.:69.29          
 Max.   :11.451        Max.   :0.9747   Max.   :76.80          
 freedom_to_make_life_choices   generosity       perceptions_of_corruption
 Min.   :0.3966               Min.   :-0.30091   Min.   :0.1098           
 1st Qu.:0.7148               1st Qu.:-0.12701   1st Qu.:0.6830           
 Median :0.7998               Median :-0.03366   Median :0.7831           
 Mean   :0.7834               Mean   :-0.01457   Mean   :0.7331           
 3rd Qu.:0.8777               3rd Qu.: 0.08543   3rd Qu.:0.8492           
 Max.   :0.9750               Max.   : 0.56066   Max.   :0.9356           
 ladder_score_in_dystopia explained_by:_log_gdp_per_capita
 Min.   :1.972            Min.   :0.0000                  
 1st Qu.:1.972            1st Qu.:0.5759                  
 Median :1.972            Median :0.9185                  
 Mean   :1.972            Mean   :0.8688                  
 3rd Qu.:1.972            3rd Qu.:1.1692                  
 Max.   :1.972            Max.   :1.5367                  
 explained_by:_social_support explained_by:_healthy_life_expectancy
 Min.   :0.0000               Min.   :0.0000                       
 1st Qu.:0.9867               1st Qu.:0.4954                       
 Median :1.2040               Median :0.7598                       
 Mean   :1.1556               Mean   :0.6929                       
 3rd Qu.:1.3871               3rd Qu.:0.8672                       
 Max.   :1.5476               Max.   :1.1378                       
 explained_by:_freedom_to_make_life_choices explained_by:_generosity
 Min.   :0.0000                             Min.   :0.0000          
 1st Qu.:0.3815                             1st Qu.:0.1150          
 Median :0.4833                             Median :0.1767          
 Mean   :0.4636                             Mean   :0.1894          
 3rd Qu.:0.5767                             3rd Qu.:0.2555          
 Max.   :0.6933                             Max.   :0.5698          
 explained_by:_perceptions_of_corruption dystopia_+_residual
 Min.   :0.00000                         Min.   :0.2572     
 1st Qu.:0.05580                         1st Qu.:1.6299     
 Median :0.09844                         Median :2.0463     
 Mean   :0.13072                         Mean   :1.9723     
 3rd Qu.:0.16306                         3rd Qu.:2.3503     
 Max.   :0.53316                         Max.   :3.4408

Plotting the variables I want to check for a possible correlation

For me, these were happiness and life expectancy.

p1 <- ggplot(happiness, aes(x = ladder_score, y = healthy_life_expectancy)) +
  labs(title = "Correlation Between Life Expectancy and Happiness",
       caption = "Source: Wellbeing Research Centre, Gallup, \n the UN Sustainable Development Solutions Network, and their Editorial Board",
       x = "Happiness Level",
       y = "Life Expectancy") +
  theme_minimal(base_size = 12)
p1 + geom_point()

Setting both axes to start at 0

p2 <- p1 + geom_point() + xlim(0, 10) + ylim(0, 77)
p2

Linear regression with confidence interval

p3 <- p2 + geom_smooth(method = "lm", formula = y ~ x)  # lm = linear model
p3

Finding the correlation coefficient

cor(happiness$ladder_score, happiness$healthy_life_expectancy)

[1] 0.7703163

Fitting a linear model to get the coefficients for a regression equation.

fit1 <- lm(healthy_life_expectancy ~ ladder_score, data = happiness)  #lm(y ~ x)
summary(fit1)


Call:
lm(formula = healthy_life_expectancy ~ ladder_score, data = happiness)

Residuals:
     Min       1Q   Median       3Q      Max 
-13.7689  -2.4393  -0.0655   2.6068  12.1445 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)   37.6923     1.8388   20.50   <2e-16 ***
ladder_score   4.8880     0.3293   14.85   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.515 on 151 degrees of freedom
Multiple R-squared:  0.5934,    Adjusted R-squared:  0.5907 
F-statistic: 220.4 on 1 and 151 DF,  p-value: < 2.2e-16

Using the model output, the regression equation is Life Expectancy = 4.89(happiness score) + 37.69. This means that each one-point increase in a country's happiness score is associated with an increase of about 4.89 years in healthy life expectancy. The p-value for “ladder_score” (which is the happiness score) is marked with 3 asterisks, meaning it is below 0.001, so the relationship is statistically significant. The adjusted R-squared of 0.59 shows that about 59% of the variation in life expectancy is explained by the model, which leaves about 41% unexplained. Still, I think explaining more than half of the variation with a single variable is a pretty good result. I could add more variables to help explain why life expectancy would increase, but for now I only want to see how strong the correlation is between happiness and life expectancy.
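
As a quick check of this interpretation, the fitted line can be written as a small function using the rounded coefficients from the summary output. This is only a sketch: predict_life is a hypothetical name, and the example scores are the minimum, median, and maximum ladder scores reported by summary(happiness).

```r
# Coefficients copied from the summary(fit1) output
intercept <- 37.6923
slope     <- 4.8880

# Hypothetical helper: predicted healthy life expectancy for a ladder score
predict_life <- function(score) intercept + slope * score

# Minimum, median, and maximum ladder scores from summary(happiness)
predict_life(c(2.567, 5.515, 7.809))

# A one-point increase in score always adds the slope (about 4.89 years)
predict_life(6) - predict_life(5)

# With a single predictor, R-squared is the squared correlation
r <- 0.7703163
round(r^2, 4)  # 0.5934, matching the Multiple R-squared in the output
```

With the fitted model itself, the same predictions should come from predict(fit1, newdata = data.frame(ladder_score = 6)), up to rounding of the coefficients.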

Next, I will group by region and find the mean life expectancy and happiness score for each region

grouped_happy <- happiness |>
  group_by(regional_indicator) |>
  summarize(sum_life = mean(healthy_life_expectancy, na.rm = TRUE),  # regional mean, despite the "sum_" name
            avg_happy = mean(ladder_score, na.rm = TRUE))
grouped_happy

# A tibble: 10 × 3
   regional_indicator                 sum_life avg_happy
   <chr>                                 <dbl>     <dbl>
 1 Central and Eastern Europe             68.1      5.88
 2 Commonwealth of Independent States     64.7      5.36
 3 East Asia                              71.1      5.71
 4 Latin America and Caribbean            66.7      5.98
 5 Middle East and North Africa           65.3      5.23
 6 North America and ANZ                  72.2      7.17
 7 South Asia                             62.4      4.48
 8 Southeast Asia                         64.7      5.38
 9 Sub-Saharan Africa                     55.1      4.38
10 Western Europe                         72.9      6.90
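
To see the ranking more directly, the regional means can be sorted by happiness. The sketch below retypes the ten rows printed above so that it runs on its own; regional_means and ranked are names introduced here for illustration. With the real data, grouped_happy could be piped straight into arrange().

```r
library(dplyr)

# Regional means retyped from the table above
regional_means <- data.frame(
  regional_indicator = c("Central and Eastern Europe",
                         "Commonwealth of Independent States",
                         "East Asia",
                         "Latin America and Caribbean",
                         "Middle East and North Africa",
                         "North America and ANZ",
                         "South Asia",
                         "Southeast Asia",
                         "Sub-Saharan Africa",
                         "Western Europe"),
  sum_life  = c(68.1, 64.7, 71.1, 66.7, 65.3, 72.2, 62.4, 64.7, 55.1, 72.9),
  avg_happy = c(5.88, 5.36, 5.71, 5.98, 5.23, 7.17, 4.48, 5.38, 4.38, 6.90)
)

# Sort from the happiest region to the least happy one
ranked <- regional_means |>
  arrange(desc(avg_happy))
ranked
```

Sorted this way, North America and ANZ (7.17) comes first and Western Europe (6.90) second, while Western Europe has the highest mean life expectancy (72.9).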

I plotted the grouped data as a scatter plot, with one point per region

I decided that this was too plain, so I changed it in the next plot.

p1 <- ggplot(grouped_happy, aes(x = avg_happy, y = sum_life, group = regional_indicator, color = regional_indicator)) +
  labs(title = "Correlation Between Life Expectancy and Happiness",
       caption = "Source: Wellbeing Research Centre, Gallup, \n the UN Sustainable Development Solutions Network, and their Editorial Board",
       x = "Happiness Level",
       y = "Life Expectancy",
       color = "Region") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set1")
p1 + geom_point(size = 5)

Warning in RColorBrewer::brewer.pal(n, pal): n too large, allowed maximum for palette Set1 is 9
Returning the palette you asked for with that many colors

Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
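
The first warning helps explain the second one. Set1 defines only 9 colors, but there are 10 regions, so one region most likely received no color and its point was dropped from the plot. A short check of palette sizes, using the brewer.pal.info table that ships with RColorBrewer (n_regions and qual are names introduced here for illustration):

```r
library(RColorBrewer)

n_regions <- 10  # number of rows in grouped_happy

# Maximum number of colors each qualitative palette can supply
qual <- brewer.pal.info[brewer.pal.info$category == "qual", ]
qual[, "maxcolors", drop = FALSE]

# Palettes large enough to give every region its own color
big_enough <- rownames(qual)[qual$maxcolors >= n_regions]
big_enough
```

Set3, which the final plot uses, is one of the qualitative palettes with enough colors for all 10 regions.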

I plotted a new graph (*this is my final visualization; please grade this one)

This time I decided that it would be more aesthetically pleasing not to collapse each region into a single dot. Instead, I kept one dot per country and identified the regions only by color.

p1 <- ggplot(happiness, aes(x = ladder_score, y = healthy_life_expectancy, group = regional_indicator, color = regional_indicator)) +
  labs(title = "Correlation Between Happiness Level and Life Expectancy \n by Country in Region",
       caption = "Source: Wellbeing Research Centre, Gallup, \n the UN Sustainable Development Solutions Network, and their Editorial Board",
       x = "Happiness Level",
       y = "Life Expectancy",
       color = "Region") +
  theme_minimal(base_size = 12) +
  scale_color_brewer(palette = "Set3")
p1 + geom_point(size = 2.5)

Ending thoughts and essay

For my ending visualization, I made sure to simplify the variable names by getting rid of capital letters and replacing the spaces with underscores. I did not have any N/A’s in my data set, so I didn’t need to remove any. Afterwards, I grouped by region so the plot was not crowded by all the country names and was easier to read.

My visualization is supposed to show the correlation between happiness level and life expectancy. I also wanted to show which regions had the happiest countries and the highest life expectancy. I did not collapse each region into one dot and showed the regions only by color because I wanted to show how the countries within each region vary from each other. I was not too surprised by the result (I also did not have a prior assumption): most of the countries in each region are close together. However, it seems like one country in the Latin America and Caribbean region strayed from its cluster. I could use the data to look at that region specifically to see which country it is and why it may be different.

Western Europe includes the happiest country, Finland (7.81), and has the highest average life expectancy (72.9), although North America and ANZ has the highest regional average happiness (7.17 versus 6.90 for Western Europe). This fits the positive correlation the graph shows: regions with higher happiness also tend to have higher life expectancy. Sub-Saharan Africa has the lowest average life expectancy (55.1) and the lowest average happiness level (4.38). It would be interesting to take those two regions and compare other variables to see which others factor into life expectancy.

For this visualization, I think I was able to show the data I wanted to show. However, I would’ve really liked to have an interactive one where I could see which country each dot belonged to and perhaps provide even more variables.
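
An interactive version like the one described could be built with the plotly package, which is already loaded at the top. The following is a minimal sketch, not a finished plot: interactive_scatter is a hypothetical helper, and it assumes a data frame like the cleaned happiness data, with country_name, regional_indicator, ladder_score, and healthy_life_expectancy columns.

```r
library(ggplot2)
library(plotly)

# Hypothetical helper: scatter plot with the country name in the hover tooltip
interactive_scatter <- function(df) {
  p <- ggplot(df, aes(x = ladder_score,
                      y = healthy_life_expectancy,
                      color = regional_indicator,
                      text = country_name)) +  # "text" is only read by ggplotly()
    geom_point(size = 2.5) +
    labs(x = "Happiness Level",
         y = "Life Expectancy",
         color = "Region") +
    theme_minimal(base_size = 12)
  # Convert the static ggplot into an interactive plotly widget
  ggplotly(p, tooltip = c("text", "x", "y"))
}

# Run only when the data frame from the analysis is available
if (exists("happiness")) {
  interactive_scatter(happiness)
}
```

Hovering over a dot should then show the country name along with its happiness level and life expectancy. More variables could be added to the tooltip by pasting them into the text aesthetic.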