Project Proposal

Author

Yoseph Habtu

Section 1: Introduction

Research Question:

How does GDP per capita impact happiness across different regions?

Data Collection Method:

The dataset used for this project was downloaded from Kaggle and is derived from the 2023 World Happiness Report. The report draws primarily on the Gallup World Poll, an annual survey that gathers information on various aspects of well-being and happiness from people around the world.

Variables:

The dataset contains a total of 21 variables and 137 observations. The variables that will potentially be used for this project are as follows (please note that this list shows the names of the variables before cleaning):

  • Country name = The name of the country included in the dataset.

  • Regional indicator = The regional grouping or indicator to which the country belongs.

  • Happiness score = A numerical score representing the level of happiness reported by individuals in the country.

  • Logged GDP per capita = The logarithm of the country’s gross domestic product (GDP) per capita, a measure of economic prosperity.

  • Social support = A measure of the perceived social support and social connections available to individuals in the country.

  • Healthy life expectancy = The average number of years a person is expected to live in good health in the country.

  • Freedom to make life choices = The degree to which individuals perceive they have the freedom to make life choices and decisions.

  • Generosity = The level of generosity or altruistic behaviors reported by individuals in the country.

  • Perceptions of corruption = The perception of corruption within the country, based on surveys and assessments.

  • Explained by: Log GDP per capita = The contribution of the logged GDP per capita variable in explaining the happiness score.

  • Explained by: Social support = The contribution of the social support variable in explaining the happiness score.

  • Explained by: Healthy life expectancy = The contribution of the healthy life expectancy variable in explaining the happiness score.

  • Explained by: Freedom to make life choices = The contribution of the freedom to make life choices variable in explaining the happiness score.

  • Explained by: Generosity = The contribution of the generosity variable in explaining the happiness score.

  • Explained by: Perceptions of corruption = The contribution of the perceptions of corruption variable in explaining the happiness score.

The variables listed above might not all be used in this project, but they are the ones I plan to explore while conducting statistical analyses and creating visualizations to answer my proposed research question. To see the definitions for all the variables included in the dataset, go to the README text file in the data folder in this project’s directory.

Section 2: Data

Loading tidyverse and reading in the data

library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.3     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
whr <- read_csv("https://raw.githubusercontent.com/yosephhabtu/whr2023/main/whr2023.csv")
Rows: 137 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): Country name, iso alpha, Regional indicator
dbl (18): Happiness score, Standard error of ladder score, upperwhisker, low...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Cleaning variable names (lowercasing them and replacing spaces with underscores)

names(whr) <- tolower(names(whr))
names(whr) <- gsub(" ", "_", names(whr))
head(whr)
# A tibble: 6 × 21
  country_name iso_alpha regional_indicator                 happiness_score
  <chr>        <chr>     <chr>                                        <dbl>
1 Afghanistan  AFG       South Asia                                    1.86
2 Albania      ALB       Central and Eastern Europe                    5.28
3 Algeria      DZA       Middle East and North Africa                  5.33
4 Argentina    ARG       Latin America and Caribbean                   6.02
5 Armenia      ARM       Commonwealth of Independent States            5.34
6 Australia    AUS       North America and ANZ                         7.10
# ℹ 17 more variables: standard_error_of_ladder_score <dbl>,
#   upperwhisker <dbl>, lowerwhisker <dbl>, logged_gdp_per_capita <dbl>,
#   social_support <dbl>, healthy_life_expectancy <dbl>,
#   freedom_to_make_life_choices <dbl>, generosity <dbl>,
#   perceptions_of_corruption <dbl>, ladder_score_in_dystopia <dbl>,
#   `explained_by:_log_gdp_per_capita` <dbl>,
#   `explained_by:_social_support` <dbl>, …
glimpse(whr)
Rows: 137
Columns: 21
$ country_name                                 <chr> "Afghanistan", "Albania",…
$ iso_alpha                                    <chr> "AFG", "ALB", "DZA", "ARG…
$ regional_indicator                           <chr> "South Asia", "Central an…
$ happiness_score                              <dbl> 1.859, 5.277, 5.329, 6.02…
$ standard_error_of_ladder_score               <dbl> 0.033, 0.066, 0.062, 0.06…
$ upperwhisker                                 <dbl> 1.923, 5.406, 5.451, 6.14…
$ lowerwhisker                                 <dbl> 1.795, 5.148, 5.207, 5.90…
$ logged_gdp_per_capita                        <dbl> 7.324, 9.567, 9.300, 9.95…
$ social_support                               <dbl> 0.341, 0.718, 0.855, 0.89…
$ healthy_life_expectancy                      <dbl> 54.712, 69.150, 66.549, 6…
$ freedom_to_make_life_choices                 <dbl> 0.382, 0.794, 0.571, 0.82…
$ generosity                                   <dbl> -0.081, -0.007, -0.117, -…
$ perceptions_of_corruption                    <dbl> 0.847, 0.878, 0.717, 0.81…
$ ladder_score_in_dystopia                     <dbl> 1.778, 1.778, 1.778, 1.77…
$ `explained_by:_log_gdp_per_capita`           <dbl> 0.645, 1.449, 1.353, 1.59…
$ `explained_by:_social_support`               <dbl> 0.000, 0.951, 1.298, 1.38…
$ `explained_by:_healthy_life_expectancy`      <dbl> 0.087, 0.480, 0.409, 0.42…
$ `explained_by:_freedom_to_make_life_choices` <dbl> 0.000, 0.549, 0.252, 0.58…
$ `explained_by:_generosity`                   <dbl> 0.093, 0.133, 0.073, 0.08…
$ `explained_by:_perceptions_of_corruption`    <dbl> 0.059, 0.037, 0.152, 0.08…
$ `dystopia_+_residual`                        <dbl> 0.976, 1.678, 1.791, 1.86…

* This glimpse info, along with the raw CSV URL and data dictionary for the variables, can also be found in the README file in the data folder.
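The cleaned names above still contain `:` and `+`, which is why several columns have to be wrapped in backticks. As a possible follow-up (not part of the cleaning actually performed above), the sketch below shows one way to collapse every non-alphanumeric character into an underscore using base R. The three raw names are a hand-copied subset of the header, with capitalisation assumed, and `raw_names` and `clean_names` are illustrative variable names.

```r
# A subset of the raw column names (capitalisation assumed for illustration).
raw_names <- c(
  "Country name",
  "Explained by: Log GDP per capita",
  "Dystopia + residual"
)

clean_names <- tolower(raw_names)
# Collapse every run of characters that are not a-z or 0-9 into one underscore.
clean_names <- gsub("[^a-z0-9]+", "_", clean_names)
# Drop any leading or trailing underscore left behind.
clean_names <- gsub("^_|_$", "", clean_names)

print(clean_names)
```

Names cleaned this way are syntactic, so columns can be referenced as `whr$explained_by_log_gdp_per_capita` without backticks.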

Section 3: Data Analysis Plan

Outcome Variable (Y): Happiness score

Predictor Variable (X): Logged GDP per capita (this variable will be the main focus, but others will also be explored)

Comparison Groups:

  • Regions/countries

  • Predictor variables other than GDP (social support, life expectancy, freedom, generosity, and corruption)

Methods + Use:

  • Side-by-side linear regression models (scatterplots)

    • To explore the nature and strength of the relationship between happiness scores and logged GDP per capita, and to compare it with the relationships between happiness scores and the other predictor variables
  • Grouped Bar Charts

    • To compare mean happiness scores and logged GDP per capita across different regions
  • Boxplots and/or bar charts comparing summary statistics of the predictor variables for the 5 happiest countries vs the 5 least happy countries

    • To visualize patterns that help show the most defining characteristics of happy/unhappy countries
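The comparisons planned above can be sketched on a small invented data frame. This is a minimal illustration rather than the final analysis: `toy` and all of its values are made up, only two predictors are included, and base R is used so the block runs without extra packages.

```r
# Toy stand-in for `whr`; every value below is invented for illustration.
toy <- data.frame(
  country_name = paste0("Country_", LETTERS[1:12]),
  happiness_score = c(7.8, 7.5, 7.1, 6.9, 6.3, 5.9, 5.4, 5.0, 4.4, 3.9, 3.2, 1.9),
  logged_gdp_per_capita = c(10.8, 11.0, 10.7, 10.9, 10.1, 9.8, 9.5, 9.0, 8.4, 8.0, 7.6, 7.3),
  social_support = c(0.97, 0.95, 0.94, 0.93, 0.90, 0.85, 0.86, 0.78, 0.70, 0.65, 0.60, 0.34)
)

# Correlation of each candidate predictor with the outcome.
predictors <- c("logged_gdp_per_capita", "social_support")
predictor_cors <- sapply(predictors, function(p) cor(toy[[p]], toy$happiness_score))
print(round(predictor_cors, 3))

# Rank countries by happiness, then take the five happiest and five least happy.
ranked <- toy[order(toy$happiness_score, decreasing = TRUE), ]
top5 <- head(ranked, 5)
bottom5 <- tail(ranked, 5)

# Mean of each predictor within the two groups.
group_means <- rbind(
  top5 = colMeans(top5[predictors]),
  bottom5 = colMeans(bottom5[predictors])
)
print(round(group_means, 2))
```

On the real data, the same steps would be applied to `whr` with the full set of predictor columns, and the resulting tables would feed the side-by-side plots.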

Preliminary Exploratory Data Analysis

Summary statistics for happiness scores and logged GDP per capita:

summary(whr$happiness_score)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.859   4.724   5.684   5.540   6.334   7.804 
summary(whr$logged_gdp_per_capita)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  5.527   8.591   9.567   9.450  10.540  11.660 

Histograms of happiness scores and logged GDP per capita:

hist1 <- whr |> ggplot(aes(x = happiness_score)) +
  geom_histogram(binwidth = 0.3, color = "black", fill = "yellow2") +
  labs(
    x = "Happiness Score (0-10)",
    y = "Count", 
    title = "Happiness Score Frequency Distribution"
  ) +
  theme_minimal()

hist2 <- whr |> ggplot(aes(x = logged_gdp_per_capita)) +
  geom_histogram(binwidth = 0.3, color = "black", fill = "green3") +
  labs(
    x = "GDP per Capita (Log Scale)",
    y = "Count", 
    title = "GDP per Capita Frequency Distribution"
  ) +
  theme_minimal()

library(gridExtra)

Attaching package: 'gridExtra'
The following object is masked from 'package:dplyr':

    combine
grid.arrange(hist1, hist2)

Use/Insights:

These histograms show the distribution of values for the happiness score and logged GDP per capita variables. The two distributions have similar shapes, and both appear mildly left-skewed, which is consistent with the median exceeding the mean in each of the summaries above. Similar shapes alone do not show that the two variables are related, because a histogram ignores how values are paired within each country, but they do motivate examining the two variables jointly.
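A caveat worth keeping in mind when reading the histograms: two variables can have identical distributions and still be positively related, negatively related, or unrelated, because a histogram discards the pairing of values. The short sketch below illustrates this with invented numbers (`x`, `y_same_order`, and `y_reversed` are hypothetical).

```r
# Invented values with a mild left skew.
x <- c(2, 4, 5, 6, 6, 7, 7, 7, 8, 8)

y_same_order <- x       # same distribution, paired in the same order
y_reversed   <- rev(x)  # same distribution, paired in reverse order

# Both versions of y would produce exactly the same histogram as x.
print(identical(sort(x), sort(y_reversed)))

# The pairing, not the shape, decides the sign of the correlation.
print(cor(x, y_same_order))
print(cor(x, y_reversed))
```

This is why the scatterplot and correlation coefficient later in this section, rather than the histograms, are used to assess the relationship.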

Boxplots of happiness scores and logged GDP per capita across regions:

box1 <- whr |> ggplot(aes(x = regional_indicator, y = happiness_score)) +
  geom_boxplot(fill = "lightblue") +
  coord_flip() +
  labs(
    x = "Region",
    y = "Happiness Score (0-10)",
    title = "Happiness Scores Across Regions"
  ) + 
  theme_minimal()

box2 <- whr |> ggplot(aes(x = regional_indicator, y = logged_gdp_per_capita)) +
  geom_boxplot(fill = "red3") +
  coord_flip() +
  labs(
    x = "Region",
    y = "GDP per Capita (Log Scale)",
    title = "GDP per Capita Across Regions"
  ) + 
  theme_minimal()

grid.arrange(box1, box2)

Use/Insights:

These boxplots summarize the distributions of happiness scores and logged GDP per capita and also allow the distributions to be compared across regions. As with the histograms, the two plots show a similar pattern across the regions, apart from a few outliers, which may suggest a positive relationship between happiness and GDP per capita.
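The visual comparison of the two boxplot panels could be backed up numerically by computing the median of each variable per region and checking whether the regions rank in the same order on both. The sketch below does this on invented data; the region labels and all values in `toy` are hypothetical.

```r
# Toy stand-in for `whr`; regions and values are invented for illustration.
toy <- data.frame(
  regional_indicator = rep(c("Region A", "Region B", "Region C"), each = 3),
  happiness_score = c(7.1, 7.4, 6.9, 5.2, 5.6, 5.0, 4.1, 3.8, 4.4),
  logged_gdp_per_capita = c(10.9, 10.7, 10.8, 9.4, 9.6, 9.1, 8.0, 7.9, 8.3)
)

# Median of both variables within each region (the centre line of each boxplot).
regional_medians <- aggregate(
  cbind(happiness_score, logged_gdp_per_capita) ~ regional_indicator,
  data = toy,
  FUN = median
)
print(regional_medians)

# Spearman rank correlation: do regions rank the same way on both variables?
rank_agreement <- cor(
  regional_medians$happiness_score,
  regional_medians$logged_gdp_per_capita,
  method = "spearman"
)
print(rank_agreement)
```

A rank correlation near 1 across regions would support the impression that regions with higher GDP per capita also tend to report higher happiness.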

Scatterplot of Happiness Scores vs GDP with a Linear Regression Line:

whr |> ggplot(aes(x = logged_gdp_per_capita, y = happiness_score)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color = "red") +
  labs(
    x = "GDP per Capita (Log Scale)",
    y = "Happiness Score (0-10)",
    title = "Happiness vs GDP of Countries (2023)"
  ) +
  theme_minimal()
`geom_smooth()` using formula = 'y ~ x'

Use/Insights: The scatterplot helps visualize the relationship between two continuous variables: happiness scores and logged GDP per capita. The points on the plot represent individual observations (countries), and the linear regression line indicates the general trend in the data. The upward slope of the line shows that the relationship is positive. The strength of the relationship depends on how tightly the points cluster around the line rather than on the slope itself, and it is quantified by the correlation coefficient calculated below, which indicates a strong positive correlation between happiness and GDP.

Correlation coefficient between logged GDP per capita and happiness scores:

cor(whr$logged_gdp_per_capita, whr$happiness_score)
[1] 0.7843673
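To give a sense of the uncertainty around this estimate, an approximate 95% confidence interval can be computed from the reported correlation and sample size using the Fisher z-transformation. This is a rough sketch that assumes the usual conditions for the approximation (independent observations, roughly bivariate normal data); the variable names are illustrative.

```r
r <- 0.7843673  # correlation reported above
n <- 137        # number of countries in the dataset

# Fisher z-transformation and its approximate standard error.
z <- atanh(r)
se_z <- 1 / sqrt(n - 3)

# 95% interval on the z scale, transformed back to the correlation scale.
z_crit <- qnorm(0.975)
ci_lower <- tanh(z - z_crit * se_z)
ci_upper <- tanh(z + z_crit * se_z)

print(round(c(lower = ci_lower, upper = ci_upper), 2))
```

On the actual data, `cor.test(whr$logged_gdp_per_capita, whr$happiness_score)` reports a comparable interval along with a significance test.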

Summary statistics for linear regression model:

summary(lm(happiness_score ~ logged_gdp_per_capita, data = whr))

Call:
lm(formula = happiness_score ~ logged_gdp_per_capita, data = whr)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1687 -0.3183  0.0216  0.3951  2.5764 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)           -1.45868    0.48018  -3.038  0.00286 ** 
logged_gdp_per_capita  0.74060    0.05041  14.692  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7097 on 135 degrees of freedom
Multiple R-squared:  0.6152,    Adjusted R-squared:  0.6124 
F-statistic: 215.9 on 1 and 135 DF,  p-value: < 2.2e-16
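To make the coefficients easier to read, the sketch below plugs the estimates from the summary above into the fitted line. It assumes that logged GDP per capita uses natural logarithms, which the output itself does not state; `predict_happiness` is a hypothetical helper written for this illustration.

```r
# Coefficient estimates copied from the regression summary above.
intercept <- -1.45868
slope <- 0.74060

# Fitted happiness score for a given logged GDP per capita.
predict_happiness <- function(log_gdp) {
  intercept + slope * log_gdp
}

# Fitted value at the median logged GDP per capita reported earlier (9.567).
fitted_at_median <- predict_happiness(9.567)
print(round(fitted_at_median, 3))

# Assumption: natural logs. Doubling GDP per capita then adds log(2) to the
# predictor, so the expected change in happiness is slope * log(2).
change_per_doubling <- slope * log(2)
print(round(change_per_doubling, 3))
```

Under the natural-log assumption, a doubling of GDP per capita is associated with an increase of roughly half a point on the 0-10 happiness scale.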

Additional Insights: The regression summary above corroborates the strong positive linear relationship between happiness scores and logged GDP per capita. The slope is positive and highly significant (p < 2e-16), and the adjusted R-squared value of 0.6124 means that about 61% of the variation in happiness scores is explained by logged GDP per capita alone.
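For a regression with a single predictor, R-squared is simply the square of the correlation coefficient, and the adjusted value applies a small penalty for the number of predictors. The sketch below recomputes both from numbers already reported in this section, as a consistency check rather than a new result.

```r
r <- 0.7843673  # correlation between logged GDP per capita and happiness
n <- 137        # number of observations (countries)
p <- 1          # number of predictors in the model

# With one predictor, R-squared equals the squared correlation.
r_squared <- r^2

# Adjusted R-squared penalizes for the number of predictors.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

print(round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4))
```

Rounded to four decimals, these reproduce the 0.6152 and 0.6124 shown in the model summary.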