The World Happiness Report measures global happiness through surveys collected by the Gallup World Poll. Answers are collected based on a Cantril ladder life question. Participants are asked to think of lifestyles on a ladder, with their ideal and best possible life being at a 10, or the top of the ladder, and their worst possible life at a 0, or the very bottom of the ladder. They are then asked to rate their own lives on this scale along with seven main observed factors that may contribute to their responses. These seven factors are: economic production, social support, life expectancy, freedom, perceptions of corruption, generosity, and positive and negative affect. This dataset features data collected from 2013-2018. For the purpose of this project, I decided to focus on the most recent data collected, 2018.
Happiness records are significant in recognizing the emotional and mental wellbeing of groups of people, and can, therefore, be used to make informed decisions on a governmental and organizational level to help improve quality of life for the general population. These reports can also be used to assess international progression. With this dataset, we can look at world happiness and use a variety of factors, such as social support and finances, to evaluate their potential impact on happiness. Kaggle Link for Dataset: https://www.kaggle.com/datasets/unsdsn/world-happiness/
This data was collected in the form of surveys through a poll, mkaing it an observational study.
Dependent Variable: Happiness score (Cantril Ladder Life question score)
Independent Variable: seven explanatory variables - Per Capita Income (Log GDP per capita) - Social support - Healthy Life expectancy (Healthy life expectancy at birth) - Freedom of Choice (Freedom to make life choices) - Generosity - Perceived corruption (Perceptions of corruption) - Positive and Negative affect
What factors are significant predictors of people’s happiness?
This null hypothesis assumes that there is no relationship or predictive power between the independent variables (the seven explanatory variables) and the dependent variable (happiness score).
This alternative hypothesis suggests that there is a relationship between at least one of the explanatory variables and the happiness score.
library(readr)
library(ggplot2)
url1 <- "https://raw.githubusercontent.com/renidak01/DATA607_FINALPROJECT/main/world-happiness-report.csv"
df1 <- read.csv(url1)
cat("Variables in df1:\n") #df1 variables
## Variables in df1:
print(names(df1))
## [1] "Country.name" "year"
## [3] "Life.Ladder" "Log.GDP.per.capita"
## [5] "Social.support" "Healthy.life.expectancy.at.birth"
## [7] "Freedom.to.make.life.choices" "Generosity"
## [9] "Perceptions.of.corruption" "Positive.affect"
## [11] "Negative.affect"
I began by reading the csv file for the data collected in the world happiness report. I wanted to take a look at the variable names so that I could start changing some of them, and so that I could see what variables I was working with and what needed to be kept or changed.
nrow(df1)
## [1] 1949
Here we see that there are 1949 observations in this dataset. We will need to filter out all of the years except for the year which we will be working with.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df1 <- df1 %>%
rename(Country = Country.name,
Year = year,
`Healthy.Life.Expectancy` = Healthy.life.expectancy.at.birth,
`Freedom.of.Choice` = Freedom.to.make.life.choices,
`Perceived.Corruption` = Perceptions.of.corruption,
'Per.Capita.Income'=Log.GDP.per.capita,
'Social.Support'=Social.support,
'Negative.Affect'=Negative.affect,
'Positive.Affect'=Positive.affect)
print(names(df1))
## [1] "Country" "Year"
## [3] "Life.Ladder" "Per.Capita.Income"
## [5] "Social.Support" "Healthy.Life.Expectancy"
## [7] "Freedom.of.Choice" "Generosity"
## [9] "Perceived.Corruption" "Positive.Affect"
## [11] "Negative.Affect"
I started by renaming the variables to make them easier to work with. I did this using dplyr’s pipe operator, for the purpose of efficiency and to keep the data more concise.
print(summary(df1))
## Country Year Life.Ladder Per.Capita.Income
## Length:1949 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.640 1st Qu.: 8.464
## Mode :character Median :2013 Median :5.386 Median : 9.460
## Mean :2013 Mean :5.467 Mean : 9.368
## 3rd Qu.:2017 3rd Qu.:6.283 3rd Qu.:10.353
## Max. :2020 Max. :8.019 Max. :11.648
## NA's :36
## Social.Support Healthy.Life.Expectancy Freedom.of.Choice Generosity
## Min. :0.2900 Min. :32.30 Min. :0.2580 Min. :-0.3350
## 1st Qu.:0.7498 1st Qu.:58.69 1st Qu.:0.6470 1st Qu.:-0.1130
## Median :0.8355 Median :65.20 Median :0.7630 Median :-0.0255
## Mean :0.8126 Mean :63.36 Mean :0.7426 Mean : 0.0001
## 3rd Qu.:0.9050 3rd Qu.:68.59 3rd Qu.:0.8560 3rd Qu.: 0.0910
## Max. :0.9870 Max. :77.10 Max. :0.9850 Max. : 0.6980
## NA's :13 NA's :55 NA's :32 NA's :89
## Perceived.Corruption Positive.Affect Negative.Affect
## Min. :0.0350 Min. :0.3220 Min. :0.0830
## 1st Qu.:0.6900 1st Qu.:0.6255 1st Qu.:0.2060
## Median :0.8020 Median :0.7220 Median :0.2580
## Mean :0.7471 Mean :0.7100 Mean :0.2685
## 3rd Qu.:0.8720 3rd Qu.:0.7990 3rd Qu.:0.3200
## Max. :0.9830 Max. :0.9440 Max. :0.7050
## NA's :110 NA's :22 NA's :16
Here I noticed that there is some missing data, so I decided to filter those out before analyzing the summary statistics.
library(dplyr)
df1 <- df1 %>%
filter(!is.na(Per.Capita.Income),
!is.na(Social.Support),
!is.na(Healthy.Life.Expectancy),
!is.na(Freedom.of.Choice),
!is.na(Generosity),
!is.na(Perceived.Corruption),
!is.na(Positive.Affect),
!is.na(Negative.Affect )
)
print(summary(df1))
## Country Year Life.Ladder Per.Capita.Income
## Length:1708 Min. :2005 Min. :2.375 Min. : 6.635
## Class :character 1st Qu.:2010 1st Qu.:4.595 1st Qu.: 8.394
## Mode :character Median :2013 Median :5.364 Median : 9.457
## Mean :2013 Mean :5.447 Mean : 9.322
## 3rd Qu.:2017 3rd Qu.:6.259 3rd Qu.:10.272
## Max. :2020 Max. :7.971 Max. :11.648
## Social.Support Healthy.Life.Expectancy Freedom.of.Choice
## Min. :0.2900 Min. :32.30 Min. :0.2580
## 1st Qu.:0.7410 1st Qu.:58.17 1st Qu.:0.6440
## Median :0.8350 Median :65.10 Median :0.7575
## Mean :0.8103 Mean :63.23 Mean :0.7394
## 3rd Qu.:0.9080 3rd Qu.:68.69 3rd Qu.:0.8520
## Max. :0.9870 Max. :77.10 Max. :0.9850
## Generosity Perceived.Corruption Positive.Affect Negative.Affect
## Min. :-0.3350000 Min. :0.035 Min. :0.3220 Min. :0.0940
## 1st Qu.:-0.1112500 1st Qu.:0.697 1st Qu.:0.6230 1st Qu.:0.2080
## Median :-0.0255000 Median :0.806 Median :0.7220 Median :0.2590
## Mean :-0.0006376 Mean :0.751 Mean :0.7095 Mean :0.2694
## 3rd Qu.: 0.0890000 3rd Qu.:0.875 3rd Qu.:0.8013 3rd Qu.:0.3192
## Max. : 0.6890000 Max. :0.983 Max. :0.9440 Max. :0.7050
df_2018 <- df1 %>%
filter(Year == 2018)
print(summary(df_2018))
## Country Year Life.Ladder Per.Capita.Income
## Length:126 Min. :2018 Min. :2.694 Min. : 6.635
## Class :character 1st Qu.:2018 1st Qu.:4.840 1st Qu.: 8.524
## Mode :character Median :2018 Median :5.492 Median : 9.563
## Mean :2018 Mean :5.543 Mean : 9.398
## 3rd Qu.:2018 3rd Qu.:6.247 3rd Qu.:10.361
## Max. :2018 Max. :7.858 Max. :11.645
## Social.Support Healthy.Life.Expectancy Freedom.of.Choice Generosity
## Min. :0.4850 Min. :48.20 Min. :0.3740 Min. :-0.33500
## 1st Qu.:0.7382 1st Qu.:58.75 1st Qu.:0.7190 1st Qu.:-0.13625
## Median :0.8410 Median :66.25 Median :0.7955 Median :-0.04650
## Mean :0.8095 Mean :64.52 Mean :0.7847 Mean :-0.02354
## 3rd Qu.:0.9097 3rd Qu.:69.53 3rd Qu.:0.8758 3rd Qu.: 0.07200
## Max. :0.9660 Max. :76.80 Max. :0.9700 Max. : 0.51200
## Perceived.Corruption Positive.Affect Negative.Affect
## Min. :0.0970 Min. :0.4240 Min. :0.1070
## 1st Qu.:0.6910 1st Qu.:0.6422 1st Qu.:0.2180
## Median :0.7955 Median :0.7325 Median :0.2825
## Mean :0.7327 Mean :0.7120 Mean :0.2934
## 3rd Qu.:0.8542 3rd Qu.:0.7930 3rd Qu.:0.3585
## Max. :0.9520 Max. :0.8840 Max. :0.5440
I have also only used the data from the year 2018, and have filtered out the rest. Here we can see some summary statistics. This is not too useful to us yet, as these are global statistics, meaning that they are not specific to a given country. This can, however, give us an idea about the emotional and psychological state of the world in general. For example, we see that the global average life expectancy is about 64.52 years old, with the minimum average life expectancy being 48.20 years old, and the maximum average life expectancy being 76.80. Because these are global statistics, it is unclear which country these extremes belong to.
head(df_2018[, !names(df_2018) %in% "Year"])
## Country Life.Ladder Per.Capita.Income Social.Support
## 1 Afghanistan 2.694 7.692 0.508
## 2 Albania 5.004 9.518 0.684
## 3 Algeria 5.043 9.348 0.799
## 4 Argentina 5.793 10.032 0.900
## 5 Armenia 5.062 9.451 0.814
## 6 Australia 7.177 10.811 0.940
## Healthy.Life.Expectancy Freedom.of.Choice Generosity Perceived.Corruption
## 1 52.6 0.374 -0.094 0.928
## 2 68.7 0.824 0.009 0.899
## 3 65.9 0.583 -0.146 0.759
## 4 68.8 0.846 -0.211 0.855
## 5 66.9 0.808 -0.163 0.677
## 6 73.6 0.916 0.146 0.405
## Positive.Affect Negative.Affect
## 1 0.424 0.405
## 2 0.713 0.319
## 3 0.591 0.293
## 4 0.820 0.321
## 5 0.581 0.455
## 6 0.759 0.187
I decided to display the first few rows from the dataframe of this dataset as well. I removed the year column to avoid redundancy and save space, as we have already established that we are only working with 2018 data.
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Life_Ladder = mean(Life.Ladder, na.rm = TRUE)) %>%
arrange(desc(Average_Life_Ladder)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Life_Ladder), y = Average_Life_Ladder)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Life Ladder Score") +
ggtitle("Average Life Ladder Score by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_ll_score <- average_scores %>% arrange(Average_Life_Ladder) %>% head(1)
highest_ll_score <- average_scores %>% arrange(desc(Average_Life_Ladder)) %>% head(1)
print(lowest_ll_score)
## # A tibble: 1 × 2
## Country Average_Life_Ladder
## <chr> <dbl>
## 1 Afghanistan 2.69
print(highest_ll_score)
## # A tibble: 1 × 2
## Country Average_Life_Ladder
## <chr> <dbl>
## 1 Finland 7.86
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Log.GDP.per.capita = mean(Per.Capita.Income, na.rm = TRUE)) %>%
arrange(desc(Average_Log.GDP.per.capita)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Log.GDP.per.capita), y = Average_Log.GDP.per.capita)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Per Capita Income") +
ggtitle("Average Per Capita Income by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_gdp <- average_scores %>% arrange(Average_Log.GDP.per.capita) %>% head(1)
highest_gdp <- average_scores %>% arrange(desc(Average_Log.GDP.per.capita)) %>% head(1)
print(lowest_gdp)
## # A tibble: 1 × 2
## Country Average_Log.GDP.per.capita
## <chr> <dbl>
## 1 Burundi 6.64
print(highest_gdp)
## # A tibble: 1 × 2
## Country Average_Log.GDP.per.capita
## <chr> <dbl>
## 1 Luxembourg 11.6
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Healthy.Life.Expectancy = mean(Healthy.Life.Expectancy, na.rm = TRUE)) %>%
arrange(desc(Average_Healthy.Life.Expectancy)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Healthy.Life.Expectancy), y = Average_Healthy.Life.Expectancy)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Healthy Life Expectancy") +
ggtitle("Average Healthy Life Expectancy by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_le <- average_scores %>% arrange(Average_Healthy.Life.Expectancy) %>% head(1)
highest_le <- average_scores %>% arrange(desc(Average_Healthy.Life.Expectancy)) %>% head(1)
print(lowest_le)
## # A tibble: 1 × 2
## Country Average_Healthy.Life.Expectancy
## <chr> <dbl>
## 1 Chad 48.2
print(highest_le)
## # A tibble: 1 × 2
## Country Average_Healthy.Life.Expectancy
## <chr> <dbl>
## 1 Singapore 76.8
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Freedom.of.Choice = mean(Freedom.of.Choice, na.rm = TRUE)) %>%
arrange(desc(Average_Freedom.of.Choice)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Freedom.of.Choice), y = Average_Freedom.of.Choice)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Freedom of Choice Score") +
ggtitle("Average Freedom of Choice Score by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_foc <- average_scores %>% arrange(Average_Freedom.of.Choice) %>% head(1)
highest_foc <- average_scores %>% arrange(desc(Average_Freedom.of.Choice)) %>% head(1)
print(lowest_foc)
## # A tibble: 1 × 2
## Country Average_Freedom.of.Choice
## <chr> <dbl>
## 1 Afghanistan 0.374
print(highest_foc)
## # A tibble: 1 × 2
## Country Average_Freedom.of.Choice
## <chr> <dbl>
## 1 Uzbekistan 0.97
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Generosity = mean( Generosity, na.rm = TRUE)) %>%
arrange(desc(Average_Generosity)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Generosity), y = Average_Generosity)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Generosity") +
ggtitle("Average Generosity by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_gen <- average_scores %>% arrange(Average_Generosity) %>% head(1)
highest_gen <- average_scores %>% arrange(desc(Average_Generosity)) %>% head(1)
print(lowest_gen)
## # A tibble: 1 × 2
## Country Average_Generosity
## <chr> <dbl>
## 1 Greece -0.335
print(highest_gen)
## # A tibble: 1 × 2
## Country Average_Generosity
## <chr> <dbl>
## 1 Indonesia 0.512
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Perceived.Corruption = mean(Perceived.Corruption, na.rm = TRUE)) %>%
arrange(desc(Average_Perceived.Corruption)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Perceived.Corruption), y = Average_Perceived.Corruption)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Perceived Corruption") +
ggtitle("Average Perceived Corruption by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_percor <- average_scores %>% arrange(Average_Perceived.Corruption) %>% head(1)
highest_percor <- average_scores %>% arrange(desc(Average_Perceived.Corruption)) %>% head(1)
print(lowest_percor)
## # A tibble: 1 × 2
## Country Average_Perceived.Corruption
## <chr> <dbl>
## 1 Singapore 0.097
print(highest_percor)
## # A tibble: 1 × 2
## Country Average_Perceived.Corruption
## <chr> <dbl>
## 1 Bulgaria 0.952
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Positive.Affect = mean(Positive.Affect, na.rm = TRUE)) %>%
arrange(desc(Average_Positive.Affect)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Positive.Affect), y = Average_Positive.Affect)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Positive Affect") +
ggtitle("Average Positive Affect by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_posaff <- average_scores %>% arrange(Average_Positive.Affect) %>% head(1)
highest_posaff <- average_scores %>% arrange(desc(Average_Positive.Affect)) %>% head(1)
print(lowest_posaff)
## # A tibble: 1 × 2
## Country Average_Positive.Affect
## <chr> <dbl>
## 1 Afghanistan 0.424
print(highest_posaff)
## # A tibble: 1 × 2
## Country Average_Positive.Affect
## <chr> <dbl>
## 1 Panama 0.884
average_scores <- df_2018 %>%
group_by(Country) %>%
summarize(Average_Negative.Affect = mean(Negative.Affect, na.rm = TRUE)) %>%
arrange(desc(Average_Negative.Affect)) # Arrange in descending order of scores
ggplot(average_scores, aes(x = reorder(Country, Average_Negative.Affect), y = Average_Negative.Affect)) +
geom_bar(stat = "identity", fill = "skyblue") +
labs(x = "Country Name", y = "Average Negative Affect") +
ggtitle("Average Negative Affect by Country") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
lowest_negaff <- average_scores %>% arrange(Average_Negative.Affect) %>% head(1)
highest_negaff <- average_scores %>% arrange(desc(Average_Negative.Affect)) %>% head(1)
print(lowest_negaff)
## # A tibble: 1 × 2
## Country Average_Negative.Affect
## <chr> <dbl>
## 1 Singapore 0.107
print(highest_negaff)
## # A tibble: 1 × 2
## Country Average_Negative.Affect
## <chr> <dbl>
## 1 Chad 0.544
Then I will do some calculations to find which variable has the strongest impact on happiness. Out of curiosity, I would also like to know which one has the least
model <- lm(Life.Ladder ~ Per.Capita.Income + Social.Support + Healthy.Life.Expectancy +
Freedom.of.Choice + Generosity + Perceived.Corruption + Positive.Affect + Negative.Affect,
data = df_2018)
summary(model)
##
## Call:
## lm(formula = Life.Ladder ~ Per.Capita.Income + Social.Support +
## Healthy.Life.Expectancy + Freedom.of.Choice + Generosity +
## Perceived.Corruption + Positive.Affect + Negative.Affect,
## data = df_2018)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8769 -0.3034 0.0334 0.2829 1.5914
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.10526 0.88191 -4.655 8.61e-06 ***
## Per.Capita.Income 0.39879 0.10340 3.857 0.000188 ***
## Social.Support 2.84310 0.80558 3.529 0.000597 ***
## Healthy.Life.Expectancy 0.02788 0.01441 1.934 0.055506 .
## Freedom.of.Choice 0.67916 0.59814 1.135 0.258509
## Generosity 0.25813 0.35168 0.734 0.464414
## Perceived.Corruption -0.93360 0.31117 -3.000 0.003297 **
## Positive.Affect 1.50665 0.60656 2.484 0.014410 *
## Negative.Affect 3.01529 0.74832 4.029 9.98e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5341 on 117 degrees of freedom
## Multiple R-squared: 0.7759, Adjusted R-squared: 0.7605
## F-statistic: 50.62 on 8 and 117 DF, p-value: < 2.2e-16
Overall, based on the coefficients’ significance levels and the model’s R-squared value, it seems that Per.Capita.Income, Social.Support, Healthy.Life.Expectancy, Perceived.Corruption, and Positive and Negative Affect are statistically significant predictors of the Life Ladder score in this model. I was surprised to see that healthy life expectancy did not have as significant of an impact as I would have thought.Generosity and Freedom.of.Choice on the other hand, do not appear to have a statistically significant impact. It looks like economic production however, was the most significant factor contribution to overall happiness.
Because of the multiple r-squared value, 0.7605, approximately 76.05% of the variance in Life Ladder scores is explained by the independent variables.
The positive and negative affect are indicative of emotional and mental state of the general population. I would argue that those with a lower positive affect, and especially those with a higher negative affect, are countries which require improved conditions for a higher quality of life, and mental health resources. This type of information can be used for advocacy.
It would be interesting to compare this data with that of another year. Comparing data across time and then correlating to important events in history can help to also determine what impacts happiness and how. It would also be useful to have more recent data, to keep the data relevant. Another limitation could be from the data collection method. The sample which was surveyed has to be an accurate reflection of the population, and that may or may not have been the case for every country.
To conclude, it looks like economic production is the most significant contributing factor to happiness. There is a reason people say that money buys happiness! This type of research, along with the analysis performed, can be used to improve quality of life all over the world. In the future, I would use this data set to compare what makes people the unhappiest, and work from there.