HW4.R
Data analysis for HW4, Summer 2022.
Potential research questions and operations.
Ken Docekal
Last modified - July 2022

I Reading-In and Descriptive Statistics

library(readr)
library(tidyverse)
library(dplyr)
library(hexbin)
setwd("/Users/kendocekal/Documents/DACSS/hw4")
happy <- read.csv("/Users/kendocekal/Documents/DACSS/hw4/2019.csv")
view(happy)

Mean, Median, and Standard Deviation - shown for all variables

summarize_all(happy, mean, na.rm = TRUE)
## Warning in mean.default(Country.or.region, na.rm = TRUE): argument is not
## numeric or logical: returning NA
##   Overall.rank Country.or.region    Score GDP.per.capita Social.support
## 1         78.5                NA 5.407096      0.9051474       1.208814
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1               0.7252436                    0.3925705  0.1848462
##   Perceptions.of.corruption
## 1                 0.1106026
summarize_all(happy, median, na.rm = TRUE)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
##   Overall.rank Country.or.region  Score GDP.per.capita Social.support
## 1         78.5                NA 5.3795           0.96         1.2715
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                   0.789                        0.417     0.1775
##   Perceptions.of.corruption
## 1                    0.0855
summarize_all(happy, sd, na.rm = TRUE)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
##   Overall.rank Country.or.region   Score GDP.per.capita Social.support
## 1     45.17743                NA 1.11312      0.3983895      0.2991914
##   Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                0.242124                    0.1432895 0.09525444
##   Perceptions.of.corruption
## 1                0.09453784
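
The warnings above come from applying numeric summaries to the character column Country.or.region. One way to avoid them is to summarise only the numeric columns. The sketch below assumes dplyr 1.0 or later and uses a small made-up data frame (`toy`, hypothetical) in place of `happy` so that it runs on its own.

```r
library(dplyr)

# Hypothetical stand-in for `happy`; the values are made up for illustration.
toy <- data.frame(
  Country.or.region = c("A", "B", "C", "D"),
  Score = c(7.5, 6.0, 5.0, 3.5),
  GDP.per.capita = c(1.4, 1.1, 0.9, 0.4)
)

# Summarise only the numeric columns, so the character column is skipped
# instead of being coerced to NA with a warning.
toy_summary <- toy %>%
  summarise(across(
    where(is.numeric),
    list(
      mean   = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd     = ~ sd(.x, na.rm = TRUE)
    )
  ))

# One row, with columns named <variable>_<statistic>, e.g. Score_mean.
toy_summary
```

Replacing `toy` with `happy` should give the same statistics as above in a single call, without the NA column for the country name.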

Selecting Relevant Variables

# Keep the three variables of interest, sort by Score (highest first), show the top 20
select(happy, Score, GDP.per.capita, Perceptions.of.corruption) %>%
  arrange(desc(Score)) %>%
  slice(1:20)
##    Score GDP.per.capita Perceptions.of.corruption
## 1  7.769          1.340                     0.393
## 2  7.600          1.383                     0.410
## 3  7.554          1.488                     0.341
## 4  7.494          1.380                     0.118
## 5  7.488          1.396                     0.298
## 6  7.480          1.452                     0.343
## 7  7.343          1.387                     0.373
## 8  7.307          1.303                     0.380
## 9  7.278          1.365                     0.308
## 10 7.246          1.376                     0.226
## 11 7.228          1.372                     0.290
## 12 7.167          1.034                     0.093
## 13 7.139          1.276                     0.082
## 14 7.090          1.609                     0.316
## 15 7.054          1.333                     0.278
## 16 7.021          1.499                     0.310
## 17 6.985          1.373                     0.265
## 18 6.923          1.356                     0.210
## 19 6.892          1.433                     0.128
## 20 6.852          1.269                     0.036

II Examining distributions - univariate

The visualizations below show the distribution of each variable across all countries, as well as where scores are concentrated.

Happiness Score shows a roughly symmetric distribution with two peaks near the center.

ggplot(data = happy) +
  geom_histogram(mapping = aes(x = Score), binwidth = 0.5)

GDP per capita shows some skew, with most countries in the upper half of the range.

ggplot(data = happy) +
  geom_histogram(mapping = aes(x = GDP.per.capita), binwidth = 0.5)

Perceptions of Corruption shows a strong concentration at the lower end of the histogram.

ggplot(data = happy) +
  geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.5)

Because most counts are concentrated below 0.25, a narrower bin width lets us zoom in for greater detail.

ggplot(data = happy) +
  geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.1)

A much narrower bin width, with the y-axis limited to low counts, provides an even closer look and shows the extent of the skew and the concentration of countries with lower corruption perception scores.

ggplot(happy) + 
  geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.001) +
  coord_cartesian(ylim = c(0, 4))
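
The concentration at the low end can also be checked numerically rather than read off the histograms. The sketch below uses only base R and a made-up vector (`corruption`, hypothetical) standing in for happy$Perceptions.of.corruption; the 0.25 cutoff is the one mentioned above.

```r
# Hypothetical corruption-perception values, not taken from 2019.csv.
corruption <- c(0.03, 0.05, 0.08, 0.09, 0.11, 0.12, 0.15, 0.30, 0.39, 0.41)

# Proportion of observations below the 0.25 cutoff.
share_below <- mean(corruption < 0.25)

# Quartiles give a compact view of where the bulk of the values sits.
quartiles <- quantile(corruption, probs = c(0.25, 0.5, 0.75))

# A mean above the median is a simple hint of right skew.
right_skew_hint <- mean(corruption) > median(corruption)

share_below
quartiles
right_skew_hint
```

On the real data the same comparison can be read from the summary statistics in section I, where the mean of Perceptions.of.corruption (0.1106) is above its median (0.0855).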

This set of visualizations is limited because it tells us nothing about the relationships between the variables in the dataset. We also lack information on other differences between countries that may correlate with the results, such as location or level of economic development.

III Examining relationships - bivariate

We can look at the relationships between variables with the following visualizations.

A scatter plot of happiness score against GDP per capita shows a roughly linear, positive relationship.

ggplot(data = happy) +
  geom_count(mapping = aes(x = Score, y = GDP.per.capita))

Hexagonal binning improves on this by showing counts for each region of the plot, which makes the concentration of countries visible at the same time.

ggplot(data = happy) +
  geom_hex(mapping = aes(x = Score, y = GDP.per.capita))

The graph of corruption perception against happiness score shows a much weaker positive relationship, with some outliers as well as a concentration of countries with happiness scores around 5.

ggplot(data = happy) +
  geom_hex(mapping = aes(x = Score, y = Perceptions.of.corruption))
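
The visual impression of a stronger relationship with GDP than with corruption perception can be backed by a correlation coefficient. The sketch below uses base R `cor()` on made-up vectors (`score`, `gdp`, `corruption`, all hypothetical); with the real data, the corresponding columns of `happy` would take their place.

```r
# Hypothetical values for five countries, for illustration only.
score      <- c(3, 4, 5, 6, 7)
gdp        <- c(0.4, 0.7, 0.9, 1.2, 1.4)
corruption <- c(0.10, 0.05, 0.12, 0.08, 0.35)

# Pearson correlation; use = "complete.obs" drops rows with missing values.
r_gdp        <- cor(score, gdp, use = "complete.obs")
r_corruption <- cor(score, corruption, use = "complete.obs")

# A larger coefficient means a tighter linear relationship.
c(gdp = r_gdp, corruption = r_corruption)
```

Pearson correlation only measures linear association, so it complements the plots rather than replacing them.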

Grouping countries into happiness-score bins and plotting GDP per capita for each bin as a boxplot shows skewness and outliers. The middle bins are more skewed, while the highest bins show much less variation in GDP, although with some outliers.

ggplot(data = happy, mapping = aes(x = Score, y = GDP.per.capita)) + 
  geom_boxplot(mapping = aes(group = cut_width(Score, 1)))
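
To see which countries end up in which box, it can help to inspect the bins that `cut_width()` creates. The sketch below uses made-up vectors (`score` and `gdp`, hypothetical) and assumes ggplot2 is installed. With the default arguments the bins should be centered on whole numbers, which is worth confirming by printing the factor levels.

```r
library(ggplot2)

# Hypothetical happiness scores and GDP values, for illustration only.
score <- c(3.0, 3.4, 4.2, 5.1, 5.3, 6.8, 7.4)
gdp   <- c(0.3, 0.5, 0.8, 0.9, 1.0, 1.3, 1.4)

# cut_width() assigns each score to a bin of width 1, the same grouping
# that is passed to geom_boxplot() above.
bins <- cut_width(score, 1)

# Number of observations per bin; boxes built from few points are unstable.
bin_counts <- table(bins)

# Median GDP per bin, which corresponds to the center line of each box.
# Bins without observations come back as NA.
bin_medians <- tapply(gdp, bins, median)

levels(bins)
bin_counts
bin_medians
```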

Comparing corruption perception and happiness scores reveals similar ranges of perception among most low- and middle-happiness countries (with some extreme outliers) but substantially higher perception scores among the happiest countries.

ggplot(data = happy, mapping = aes(x = Score, y = Perceptions.of.corruption)) + 
  geom_boxplot(mapping = aes(group = cut_width(Score, 1)))

When comparing corruption perception and GDP, perception levels remain relatively similar until the highest GDP group, which shows a substantial increase, though with a wide range.

ggplot(data = happy, mapping = aes(x = GDP.per.capita, y = Perceptions.of.corruption)) + 
  geom_boxplot(mapping = aes(group = cut_width(GDP.per.capita, .25)))

Limitations of this set of visualizations include the lack of information about significance levels, which would give a better understanding of how reliable the results are. The metrics used, such as GDP per capita, may also mislead a naive reader, as the values do not represent actual measured GDP but rather GDP as expressed in the survey’s metric. This makes information like the projected increase in happiness score per unit increase in raw GDP difficult to calculate.
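
One possible way to add significance information is a correlation test, together with a simple linear fit for the slope. The sketch below is base R on made-up vectors (`score` and `gdp`, hypothetical). Note that any slope obtained this way is expressed in the survey's GDP metric, not in raw GDP, so the limitation described above still applies.

```r
# Hypothetical values for five countries, for illustration only.
score <- c(3, 4, 5, 6, 7)
gdp   <- c(0.4, 0.7, 0.9, 1.2, 1.4)

# Pearson correlation test: estimate, p-value and 95% confidence interval.
gdp_test <- cor.test(score, gdp)
gdp_test$estimate
gdp_test$p.value
gdp_test$conf.int

# Simple linear regression; the slope is the change in Score per one-unit
# change in the survey's GDP measure.
fit   <- lm(score ~ gdp)
slope <- coef(fit)[["gdp"]]
slope
```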

Visualizations for the final project can be improved by closer analysis of variable subgroups, such as grouping countries into GDP ranges and examining how other variables interact differently within each subgroup. Broadening the dataset to include other potentially relevant variables, such as a media freedom score or national Gini coefficient, may also result in more robust and informative visuals.
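
A minimal sketch of the subgroup idea, assuming dplyr is available: countries are split into three equally sized GDP groups with `ntile()` and summarised per group. The data frame `toy` and the column `gdp_group` are hypothetical; the choice of three groups is arbitrary.

```r
library(dplyr)

# Hypothetical stand-in for `happy`; the values are made up for illustration.
toy <- data.frame(
  Score = c(3.2, 4.1, 5.0, 5.6, 6.9, 7.5),
  GDP.per.capita = c(0.3, 0.6, 0.9, 1.0, 1.3, 1.5),
  Perceptions.of.corruption = c(0.10, 0.06, 0.09, 0.07, 0.25, 0.38)
)

# Split countries into low (1), middle (2) and high (3) GDP groups of
# roughly equal size, then summarise each group.
by_gdp <- toy %>%
  mutate(gdp_group = ntile(GDP.per.capita, 3)) %>%
  group_by(gdp_group) %>%
  summarise(
    n = n(),
    median_score = median(Score),
    median_corruption = median(Perceptions.of.corruption)
  )

by_gdp
```

The same `gdp_group` column could be passed to `facet_wrap()` to draw one panel per subgroup in the plots above.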