── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(RColorBrewer)
library(ggfortify)
Introduction
In this project I will be using 8 variables from the World Happiness Report 2020 by the University of Oxford. This data set measures happiness by country using a number of different indicators. I plan to explore the relationships between these indicators through a linear regression model and a heat map.
The first two variables, ‘Country name’ and ‘Regional indicator’, are self-explanatory, while ‘Logged GDP per capita’ refers to the natural logarithm of each country’s GDP per capita. Social support refers to the percentage of people in a country who said yes to the binary question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them?” Healthy life expectancy refers to the average number of years a person in each country can expect to live in good health. Freedom to make life choices refers to the percentage of people in a country who answered “satisfied” to the binary question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” Generosity refers to “the residual of regressing the national average of GWP responses to the question, ‘Have you donated money to a charity in the past month?’ on GDP per capita.” Lastly, Perceptions of corruption refers to the percentage of people in a country who said yes to the binary question “Is corruption widespread throughout the government?” However, “Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure: ‘Is corruption widespread within businesses or not?’”
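Because generosity is defined as a regression residual, it helps to see what that implies. The sketch below uses invented toy numbers (the names `gdp`, `donated`, and `toy_generosity` are hypothetical, not columns from the report) to show that residuals from an ordinary least squares fit are uncorrelated with the predictor they were regressed on. This is only an illustration of the definition, not the report's actual calculation.

```r
# Hypothetical toy data: every number here is invented for illustration only.
set.seed(1)
gdp <- runif(50, min = 7, max = 11)                 # stand-in for GDP per capita
donated <- 0.1 + 0.03 * gdp + rnorm(50, sd = 0.1)   # stand-in for share who donated

# Regress the donation share on GDP and keep the residuals,
# which is how a residual-style generosity score is built.
toy_fit <- lm(donated ~ gdp)
toy_generosity <- resid(toy_fit)

# OLS residuals are uncorrelated with the predictor (up to rounding error).
cor(toy_generosity, gdp)
```

If the real generosity score is built this way, a weak relationship between generosity and GDP per capita would be expected by construction.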
Rows: 153 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(happiness)
# A tibble: 6 × 20
`Country name` `Regional indicator` `Ladder score` Standard error of ladder …¹
<chr> <chr> <dbl> <dbl>
1 Finland Western Europe 7.81 0.0312
2 Denmark Western Europe 7.65 0.0335
3 Switzerland Western Europe 7.56 0.0350
4 Iceland Western Europe 7.50 0.0596
5 Norway Western Europe 7.49 0.0348
6 Netherlands Western Europe 7.45 0.0278
# ℹ abbreviated name: ¹`Standard error of ladder score`
# ℹ 16 more variables: upperwhisker <dbl>, lowerwhisker <dbl>,
# `Logged GDP per capita` <dbl>, `Social support` <dbl>,
# `Healthy life expectancy` <dbl>, `Freedom to make life choices` <dbl>,
# Generosity <dbl>, `Perceptions of corruption` <dbl>,
# `Ladder score in Dystopia` <dbl>, `Explained by: Log GDP per capita` <dbl>,
# `Explained by: Social support` <dbl>, …
Selecting the 8 variables I am going to use
happiness2 <- happiness |>
  relocate(`Ladder score`:lowerwhisker) |>
  select(`Country name`:`Perceptions of corruption`)
head(happiness2)
# A tibble: 6 × 8
`Country name` `Regional indicator` `Logged GDP per capita` `Social support`
<chr> <chr> <dbl> <dbl>
1 Finland Western Europe 10.6 0.954
2 Denmark Western Europe 10.8 0.956
3 Switzerland Western Europe 11.0 0.943
4 Iceland Western Europe 10.8 0.975
5 Norway Western Europe 11.1 0.952
6 Netherlands Western Europe 10.8 0.939
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
# `Freedom to make life choices` <dbl>, Generosity <dbl>,
# `Perceptions of corruption` <dbl>
# A tibble: 6 × 8
country_name regional_indicator logged_gdp_per_capita social_support
<chr> <chr> <dbl> <dbl>
1 Finland Western Europe 10.6 0.954
2 Denmark Western Europe 10.8 0.956
3 Switzerland Western Europe 11.0 0.943
4 Iceland Western Europe 10.8 0.975
5 Norway Western Europe 11.1 0.952
6 Netherlands Western Europe 10.8 0.939
# ℹ 4 more variables: healthy_life_expectancy <dbl>,
# freedom_to_make_life_choices <dbl>, generosity <dbl>,
# perceptions_of_corruption <dbl>
Scatter plot to precede linear regression
In my linear regression model I am going to explore whether we can predict generosity from logged GDP per capita.
In the scatter plot of logged GDP per capita and generosity I can see a very weak negative linear relationship, if any, between the two variables. Now, using linear regression, I can see exactly what the R-squared value and p-value are.
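A minimal sketch of how a scatter plot like this could be drawn with ggplot2. The data frame `toy` and its values are invented stand-ins so the example runs on its own; in the report the data frame would be `happiness2`.

```r
library(ggplot2)

# Hypothetical stand-in rows; the generosity values are invented.
toy <- data.frame(
  logged_gdp_per_capita = c(7.5, 8.2, 9.0, 9.6, 10.3, 11.0),
  generosity = c(0.10, -0.05, 0.20, -0.10, 0.05, -0.02)
)

# Points plus a straight least-squares line, to eyeball the linear trend.
scatter <- ggplot(toy, aes(x = logged_gdp_per_capita, y = generosity)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  labs(x = "Logged GDP per capita", y = "Generosity")

scatter
```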
fit1 <- lm(generosity ~ logged_gdp_per_capita, data = happiness2)
summary(fit1)
Call:
lm(formula = generosity ~ logged_gdp_per_capita, data = happiness2)
Residuals:
Min 1Q Median 3Q Max
-0.27382 -0.11229 -0.02591 0.08851 0.56603
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.12448 0.09569 1.301 0.195
logged_gdp_per_capita -0.01496 0.01021 -1.465 0.145
Residual standard error: 0.1512 on 151 degrees of freedom
Multiple R-squared: 0.01402, Adjusted R-squared: 0.007489
F-statistic: 2.147 on 1 and 151 DF, p-value: 0.1449
As expected, the R-squared value is very low (about 0.014), meaning that almost none of the variation in generosity is explained by this linear model. In addition, the p-value of 0.1449 is above the usual 0.05 threshold, so the slope is not statistically significant. So far a linear model does not seem appropriate for these two variables.
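As a quick check, the correlation coefficient r can be recovered from the numbers in the summary output. The values below are copied by hand from that output, and the variable names are only illustrative.

```r
# Values copied from the summary(fit1) output.
r_squared <- 0.01402
slope <- -0.01496
t_value <- -1.465
resid_df <- 151

# In a one-predictor regression, r has the sign of the slope
# and a magnitude of sqrt(R-squared).
r <- sign(slope) * sqrt(r_squared)
r              # roughly -0.118

# With one predictor, the squared t statistic matches the F statistic
# up to rounding (F = 2.147 in the output).
t_value^2

# Two-sided p-value for the slope from the t distribution.
p_value <- 2 * pt(-abs(t_value), df = resid_df)
p_value        # roughly 0.145
```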
Regression model
generosity = -0.01496(logged GDP per capita) + 0.12448
This is the equation of the model. Next I will confirm my suspicions with the diagnostic plots.
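To see what the equation predicts, it can be written as a small function. The name `predict_generosity` is hypothetical, and the coefficients are the rounded estimates from the summary output.

```r
# Fitted line using the rounded coefficients from the model summary.
predict_generosity <- function(logged_gdp) {
  0.12448 - 0.01496 * logged_gdp
}

# Predicted generosity for a country with logged GDP per capita of 10.6.
predict_generosity(10.6)   # about -0.034
```

With the fitted model in memory, `predict(fit1, newdata = data.frame(logged_gdp_per_capita = 10.6))` should give the same answer up to rounding.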
Diagnostic plots
autoplot(fit1, 1:4, nrow=2, ncol=2)
The residuals vs. fitted plot suggests that a linear model is not a great fit because the blue line is not very straight, even though the points are fairly balanced and random. The normal Q-Q plot also suggests that a linear model is not very good because the points stray quite far from the line at both ends. Overall the diagnostic plots confirm that a linear model is probably not appropriate for logged GDP per capita and generosity. Now I will make a heat map.
Taking random countries from each region for the heat map
## I am going to plot 3 countries from each region in my heat map so 30 in total
happiness3 <- happiness2 |>
  group_by(regional_indicator) |>
  slice_sample(n = 3)
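Because `slice_sample()` draws at random, a different set of countries appears each time the report is rendered. A minimal sketch, on an invented toy table, of how fixing the seed would make the draw repeatable. The seed value and the toy data are arbitrary assumptions.

```r
library(dplyr)

# Hypothetical toy table standing in for happiness2.
toy <- data.frame(
  country_name = paste("Country", 1:12),
  regional_indicator = rep(c("Region A", "Region B", "Region C"), each = 4)
)

set.seed(2020)  # arbitrary seed; any fixed number makes the draw repeatable
toy_sample <- toy |>
  group_by(regional_indicator) |>
  slice_sample(n = 3) |>
  ungroup()       # drop the grouping so later steps see a plain table

# Three rows per region.
count(toy_sample, regional_indicator)
```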
Creating the heat map
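In this graph the paler colors (light yellow) represent higher values while the deeper reds represent lower values, because heat.colors() runs from red to pale yellow. Each column is scaled separately, so the colors compare countries within one measure.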
happiness4 <- happiness3[order(happiness3$healthy_life_expectancy), ] ## ordering based on life expectancy
row.names(happiness4) <- happiness4$country_name
Warning: Setting row names on a tibble is deprecated.
## putting the country name into the row name
happiness5 <- happiness4[, 3:8] ## selecting the numerical values from the table
#### I tried everything I could but every time I ended up with numbers instead of country name. I did the exact same thing as the demo qmd but it still did not work :(##
happy_matrix <- data.matrix(happiness5) ## creating the matrix
happiness_heatmap <- heatmap(happy_matrix,
  Rowv = NA, Colv = NA,
  col = heat.colors(30),
  scale = "column",
  margins = c(20, 3),
  xlab = "Measure of Happiness",
  ylab = "Country",
  labCol = c("GDP per capita", "Social Support", "Life Expectancy", "Freedom to make life choices", "Generosity", "Perceptions of Corruption"),
  main = "Global Happiness Index 2020: Source: World Happiness report 2020 by the University of Oxford")
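The row labels most likely come out as numbers because tibbles do not keep row names, so they are gone by the time the matrix is built. One possible fix, sketched here on an invented toy tibble (the names `toy` and `toy_matrix` are hypothetical), is to build the matrix first and then set the row names on the matrix itself.

```r
# Hypothetical toy tibble standing in for happiness4; the numbers are invented.
toy <- tibble::tibble(
  country_name = c("A-land", "B-land", "C-land"),
  regional_indicator = c("R1", "R1", "R2"),
  logged_gdp_per_capita = c(8.1, 9.4, 10.7),
  healthy_life_expectancy = c(58, 66, 72)
)

# Build the numeric matrix first, then label the matrix rows directly.
toy_matrix <- data.matrix(toy[, 3:4])
rownames(toy_matrix) <- toy$country_name

# The second margin value leaves room on the right for the row labels.
heatmap(toy_matrix,
  Rowv = NA, Colv = NA,
  col = heat.colors(30),
  scale = "column",
  margins = c(10, 8)
)
```

Passing `labRow = toy$country_name` to `heatmap()` is another option that avoids row names altogether.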
Conclusion
In order to tidy the data set I had to convert all the letters in the variable names to lowercase. I also had to replace the spaces within each variable name with underscores. Later I relabeled the variables in order to make them look nicer on my graph.
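A minimal sketch of that renaming step in base R, applied to a few of the original column names. The vectors `raw_names` and `clean_names` are illustrative; the report's actual code for this step is not shown.

```r
# A few column names copied from the raw data.
raw_names <- c(
  "Country name", "Regional indicator",
  "Logged GDP per capita", "Perceptions of corruption"
)

# Lowercase everything, then swap each space for an underscore.
clean_names <- gsub(" ", "_", tolower(raw_names))
clean_names
```

Applied to the whole table, the same idea would be `names(happiness2) <- gsub(" ", "_", tolower(names(happiness2)))`.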
This visualization is very intriguing. First of all, the graph shows that life expectancy has a generally positive relationship with GDP per capita, which makes sense. Next, life expectancy also appears to have a positive relationship with social support, which also makes sense. Additionally, there is a clear negative relationship between life expectancy and perceptions of corruption, which also makes sense. However, in this graph it is much harder to find a relationship between life expectancy and freedom to make life choices, which I find surprising. Furthermore, the positive relationship between life expectancy and social support is not as clean as the one with GDP per capita, which I also find interesting. Lastly, there seems to be no relationship between generosity and life expectancy, which is not as surprising considering the results of the linear regression model.
I would have liked to have the countries appear on the y axis of my heat map, and I really don’t know why it wasn’t working. I honestly struggled to make this graph look nice, so if I had more time I would have continued to experiment to make the heat map look nicer.