Data 110 Project 1

Author

Andrew George

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.4     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(RColorBrewer)
library(ggfortify)

Introduction

In this project I will use 8 variables from the World Happiness Report 2020 by the University of Oxford. This data set seeks to measure happiness in each country using a number of different indicators. I plan to explore the relationships between these indicators through a linear regression model and a heat map.

The first two variables, ‘Country name’ and ‘Regional indicator’, are self-explanatory. ‘Logged GDP per capita’ is the natural logarithm of each country's GDP per capita. Social support is the share of people in a country who answered yes to the binary question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them?” Healthy life expectancy is the average healthy life expectancy for each country. Freedom to make life choices is the share of people in a country who answered “satisfied” to the binary question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?” Generosity is “the residual of regressing the national average of GWP responses to the question, ‘Have you donated money to a charity in the past month?’ on GDP per capita.” Lastly, Perceptions of corruption is the share of people in a country who answered yes to the binary question “Is corruption widespread throughout the government?” However, “Where data for government corruption are missing, the perception of business corruption is used as the overall corruption-perception measure: ‘Is corruption widespread within businesses or not?’”
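To make the generosity definition more concrete, here is a minimal sketch of what "the residual of regressing" one variable on another looks like. The data frame `toy` and all of its values are made up for illustration and are not taken from the report.

```r
# Hypothetical values, NOT from the World Happiness Report
toy <- data.frame(
  logged_gdp = c(7.5, 8.5, 9.5, 10.5, 11.0),
  donated    = c(0.20, 0.25, 0.22, 0.40, 0.45)  # share who donated last month
)

# Regress the donation rate on logged GDP per capita
fit_toy <- lm(donated ~ logged_gdp, data = toy)

# The residual is the part of the donation rate that GDP does not explain
toy$generosity <- residuals(fit_toy)
toy
```

A positive residual means a country donates more than its GDP would predict, and a negative one means it donates less. Residuals from a regression with an intercept average out to zero, which is why a generosity score built this way can be negative.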

Reading in the data set

setwd("C:/Users/andre/Downloads/Data 110")
happiness <- read_csv("happiness2020.csv")

Rows: 153 Columns: 20
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (2): Country name, Regional indicator
dbl (18): Ladder score, Standard error of ladder score, upperwhisker, lowerw...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

head(happiness)

# A tibble: 6 × 20
  `Country name` `Regional indicator` `Ladder score` Standard error of ladder …¹
  <chr>          <chr>                         <dbl>                       <dbl>
1 Finland        Western Europe                 7.81                      0.0312
2 Denmark        Western Europe                 7.65                      0.0335
3 Switzerland    Western Europe                 7.56                      0.0350
4 Iceland        Western Europe                 7.50                      0.0596
5 Norway         Western Europe                 7.49                      0.0348
6 Netherlands    Western Europe                 7.45                      0.0278
# ℹ abbreviated name: ¹`Standard error of ladder score`
# ℹ 16 more variables: upperwhisker <dbl>, lowerwhisker <dbl>,
#   `Logged GDP per capita` <dbl>, `Social support` <dbl>,
#   `Healthy life expectancy` <dbl>, `Freedom to make life choices` <dbl>,
#   Generosity <dbl>, `Perceptions of corruption` <dbl>,
#   `Ladder score in Dystopia` <dbl>, `Explained by: Log GDP per capita` <dbl>,
#   `Explained by: Social support` <dbl>, …

Selecting the 8 variables I am going to use

happiness2 <- happiness |>
  # move the ladder score columns to the front so the 8 columns I want sit next to each other
  relocate(`Ladder score`:lowerwhisker) |>
  select(`Country name`:`Perceptions of corruption`)
head(happiness2)

# A tibble: 6 × 8
  `Country name` `Regional indicator` `Logged GDP per capita` `Social support`
  <chr>          <chr>                                  <dbl>            <dbl>
1 Finland        Western Europe                          10.6            0.954
2 Denmark        Western Europe                          10.8            0.956
3 Switzerland    Western Europe                          11.0            0.943
4 Iceland        Western Europe                          10.8            0.975
5 Norway         Western Europe                          11.1            0.952
6 Netherlands    Western Europe                          10.8            0.939
# ℹ 4 more variables: `Healthy life expectancy` <dbl>,
#   `Freedom to make life choices` <dbl>, Generosity <dbl>,
#   `Perceptions of corruption` <dbl>

Cleaning the data

names(happiness2) <- tolower(names(happiness2))
## lowering names
names(happiness2) <- gsub(" ","_",names(happiness2)) 
## replacing spaces with underscores
head(happiness2)

# A tibble: 6 × 8
  country_name regional_indicator logged_gdp_per_capita social_support
  <chr>        <chr>                              <dbl>          <dbl>
1 Finland      Western Europe                      10.6          0.954
2 Denmark      Western Europe                      10.8          0.956
3 Switzerland  Western Europe                      11.0          0.943
4 Iceland      Western Europe                      10.8          0.975
5 Norway       Western Europe                      11.1          0.952
6 Netherlands  Western Europe                      10.8          0.939
# ℹ 4 more variables: healthy_life_expectancy <dbl>,
#   freedom_to_make_life_choices <dbl>, generosity <dbl>,
#   perceptions_of_corruption <dbl>
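The two cleaning steps above can also be wrapped into one small function. This is only a sketch: `clean_names_simple` is a hypothetical helper name, not something from the original analysis, and the example column names are a subset chosen for illustration.

```r
# Hypothetical helper: lowercase the names, then swap spaces for underscores
clean_names_simple <- function(x) {
  gsub(" ", "_", tolower(x))
}

# A few of the original column names, used here only as sample input
cols <- c("Country name", "Logged GDP per capita", "Generosity")
clean_names_simple(cols)
```

Assuming dplyr is loaded, the same helper could be applied inside a pipe with `rename_with(clean_names_simple)`, which avoids assigning to `names()` twice.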

Scatter plot to precede linear regression

In my linear regression model I am going to explore whether generosity can be predicted from logged GDP per capita.

ggplot(happiness2, aes(logged_gdp_per_capita, generosity)) +
  geom_point() +
  geom_smooth(method = "lm")

`geom_smooth()` using formula = 'y ~ x'

The scatter plot of logged GDP per capita and generosity shows, at most, a very weak negative linear relationship between the two variables. Next, I will use linear regression to get the exact r and p values.

Linear regression

cor(happiness2$logged_gdp_per_capita, happiness2$generosity)

[1] -0.1183994

fit1 <- lm(generosity ~ logged_gdp_per_capita, data = happiness2)
summary(fit1)


Call:
lm(formula = generosity ~ logged_gdp_per_capita, data = happiness2)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27382 -0.11229 -0.02591  0.08851  0.56603 

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)            0.12448    0.09569   1.301    0.195
logged_gdp_per_capita -0.01496    0.01021  -1.465    0.145

Residual standard error: 0.1512 on 151 degrees of freedom
Multiple R-squared:  0.01402,   Adjusted R-squared:  0.007489 
F-statistic: 2.147 on 1 and 151 DF,  p-value: 0.1449

As expected, the R-squared value is very low (about 0.014), meaning the linear model explains only about 1.4% of the variation in generosity. The p-value of 0.1449 is also above the usual 0.05 threshold, so the relationship is not statistically significant. So far, a linear model does not seem appropriate for these data.
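As a quick check on how the numbers in the output above fit together, the sketch below recomputes R-squared, the t value and the p-value by hand. The inputs are the rounded values printed above, so the results should match the summary only up to rounding.

```r
# Values copied from the cor() and summary(fit1) output above
r        <- -0.1183994  # correlation between the two variables
estimate <- -0.01496    # slope for logged_gdp_per_capita
std_err  <-  0.01021    # standard error of the slope
df_resid <-  151        # residual degrees of freedom

# With a single predictor, R-squared is the squared correlation
r_squared <- r^2

# The t value is the estimate divided by its standard error
t_value <- estimate / std_err

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

round(c(r_squared = r_squared, t_value = t_value, p_value = p_value), 4)
```

With only one predictor, the p-value for the slope is the same as the p-value of the F-statistic, which is why 0.145 and 0.1449 both appear in the summary.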

Regression model

generosity = -0.01496(logged GDP per capita) + 0.12448

This is the equation of the model. Next I will confirm my suspicions with the diagnostic plots.
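To show how the equation would be used, here is a small sketch that plugs in two logged GDP values. `predict_generosity` is a hypothetical helper written for illustration, and the inputs 8 and 10 are arbitrary example values on the logged scale used in the data set.

```r
# Coefficients copied from the summary(fit1) output
intercept <- 0.12448
slope     <- -0.01496

# Hypothetical helper: apply the fitted line to a logged GDP value
predict_generosity <- function(logged_gdp) {
  intercept + slope * logged_gdp
}

predict_generosity(c(8, 10))
```

With the fitted model object available, `predict(fit1, newdata = data.frame(logged_gdp_per_capita = c(8, 10)))` should give essentially the same answers. The predictions barely move between the two inputs, which fits the very weak relationship seen in the scatter plot.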

Diagnostic plots

autoplot(fit1, 1:4, nrow=2, ncol=2)

The residuals vs fitted plot suggests a linear model is not a great fit: the blue line is not very straight, even though the points are fairly balanced and random. The normal Q-Q plot points the same way, because the dots stray quite far from the line at both ends. Overall, the diagnostic plots confirm that a linear model is probably not appropriate for logged GDP per capita and generosity. Now I will make a heat map.

Taking random countries from each region for the heat map

## I am going to plot 3 countries from each region in my heat map so 30 in total
happiness3 <- happiness2 |>
  group_by(regional_indicator) |>
  slice_sample(n = 3)
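Because slice_sample() draws rows at random, each render of this document can pick different countries. A minimal sketch of one way to make the draw repeatable is shown below. The toy data frame and the seed value 110 are assumptions for illustration, and the added ungroup() step is a suggestion rather than part of the original code.

```r
library(dplyr)

# Toy data standing in for happiness2 (made-up regions and countries)
toy <- data.frame(
  regional_indicator = rep(c("Region A", "Region B"), each = 5),
  country_name       = paste0("country_", 1:10)
)

# Arbitrary seed; any fixed integer makes the random draw repeatable
set.seed(110)

sampled <- toy |>
  group_by(regional_indicator) |>
  slice_sample(n = 3) |>
  ungroup()  # drop the grouping so later steps see one plain table

sampled
```

Running the block twice with the same seed should return the same rows, which makes the heat map and any comments about specific countries easier to keep in sync.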

Creating the heat map

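In this graph the color scale comes from heat.colors(): the darker red cells represent lower values while the paler yellow cells represent higher values. Because scale = "column" is used, each column is colored relative to its own range.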

happiness4 <- happiness3[order(happiness3$healthy_life_expectancy),]
## ordering based on life expectancy
row.names(happiness4) <- happiness4$country_name

Warning: Setting row names on a tibble is deprecated.

## putting the country name into the row name
happiness5 <- happiness4[,3:8]
## selecting the numerical values from the table
##
## Note: every attempt still gave row numbers instead of country names on the plot, even though I followed the demo qmd.
##
happy_matrix <- data.matrix(happiness5)
## creating the matrix
happiness_heatmap <- heatmap(happy_matrix, 
                       Rowv=NA, 
                       Colv=NA, 
                       col = heat.colors(30), 
                       scale="column", 
                       margins=c(20,3),
                       xlab = "Measure of Happiness",
                       ylab = "Country",
                       labCol = c("GDP per capita","Social Support","Life Expectancy","Freedom to make life choices", "Generosity", "Perceptions of Corruption"),
                       main = "Global Happiness Index 2020: Source: World Happiness report 2020 by the University of Oxford")
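One likely reason the rows are labelled with numbers is that tibbles do not keep row names when they are subset, so the names set on happiness4 are gone by the time the matrix is built. The sketch below shows the idea on a plain data frame. The toy data and its values are made up, and this is a possible fix under that assumption, not a confirmed diagnosis.

```r
# Toy data standing in for happiness4 (made-up countries and values)
toy <- data.frame(
  country_name   = c("Country A", "Country B", "Country C"),
  social_support = c(0.90, 0.75, 0.60),
  generosity     = c(0.10, -0.05, 0.02)
)

# A plain data frame keeps its row names when columns are selected
rownames(toy) <- toy$country_name
toy_matrix <- data.matrix(toy[, c("social_support", "generosity")])
rownames(toy_matrix)

# heatmap() labels rows from the matrix row names;
# labRow can also be passed explicitly, as done here
heatmap(toy_matrix,
        Rowv = NA,
        Colv = NA,
        col = heat.colors(30),
        scale = "column",
        labRow = rownames(toy_matrix))
```

In the code above, this would mean either calling as.data.frame() on happiness4 before setting the row names, or keeping the tibble and adding labRow = happiness4$country_name to the heatmap() call.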

Conclusion

To tidy the data set I converted all the variable names to lowercase and replaced the spaces in each name with underscores. Later, I relabelled the variables on the heat map so they would look nicer on the graph.
This visualization is very intriguing. First, the graph shows that life expectancy has a generally positive relationship with GDP per capita, which makes sense. Life expectancy also appears to have a positive relationship with social support, which also makes sense, although the pattern is not clean enough to call it a strong one, which I find interesting. There is also a clear negative relationship between life expectancy and perceptions of corruption, which makes sense too. However, it is much harder to find a relationship between life expectancy and freedom to make life choices, which I find surprising. Lastly, there seems to be no relationship between generosity and life expectancy, which is less surprising given the results of the linear regression model.
I would have liked the countries to appear on the y axis of my heat map, but I could not work out why it wasn't working. I honestly struggled to make this graph look nice, so if I had more time I would have kept experimenting to improve the heat map's appearance.