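# tidyverse attaches readr, dplyr, and ggplot2, so they need not be loaded separately
library(tidyverse)
# hexbin is required by geom_hex()
library(hexbin)
setwd("/Users/kendocekal/Documents/DACSS/hw4")
# Load the 2019 happiness data
happy <- read.csv("/Users/kendocekal/Documents/DACSS/hw4/2019.csv")
view(happy)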
Mean, Median, and Standard Deviation - shown for all variables
summarize_all(happy, mean, na.rm = TRUE)
## Warning in mean.default(Country.or.region, na.rm = TRUE): argument is not
## numeric or logical: returning NA
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 78.5 NA 5.407096 0.9051474 1.208814
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.7252436 0.3925705 0.1848462
## Perceptions.of.corruption
## 1 0.1106026
summarize_all(happy, median, na.rm = TRUE)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 78.5 NA 5.3795 0.96 1.2715
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.789 0.417 0.1775
## Perceptions.of.corruption
## 1 0.0855
summarize_all(happy, sd, na.rm = TRUE)
## Warning in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm =
## na.rm): NAs introduced by coercion
## Overall.rank Country.or.region Score GDP.per.capita Social.support
## 1 45.17743 NA 1.11312 0.3983895 0.2991914
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1 0.242124 0.1432895 0.09525444
## Perceptions.of.corruption
## 1 0.09453784
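The warnings above appear because Country.or.region is a character column, so mean, median, and sd return NA for it. The following is a minimal sketch of one way to restrict the summary to numeric columns using dplyr's across() and where(). The data frame `toy` is a hypothetical stand-in built from the first three rows of the top-20 table shown later in this document; the country labels "A", "B", and "C" are placeholders, not values from the dataset.

```r
library(dplyr)

# Hypothetical stand-in for `happy`: three rows copied from the top-20 table,
# with placeholder country labels.
toy <- data.frame(
  Country.or.region = c("A", "B", "C"),
  Score = c(7.769, 7.600, 7.554),
  GDP.per.capita = c(1.340, 1.383, 1.488),
  Perceptions.of.corruption = c(0.393, 0.410, 0.341)
)

# Summarise numeric columns only, so the character column is skipped
# instead of being passed to mean(), median(), and sd().
numeric_summary <- toy %>%
  summarise(across(
    where(is.numeric),
    list(
      mean = ~ mean(.x, na.rm = TRUE),
      median = ~ median(.x, na.rm = TRUE),
      sd = ~ sd(.x, na.rm = TRUE)
    )
  ))

print(numeric_summary)
```

Each result column is named after the variable and the statistic (for example `Score_mean`), so all three statistics come back from a single call.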
Selecting Relevant Variables
happy %>%
  select(Score, GDP.per.capita, Perceptions.of.corruption) %>%
  arrange(desc(Score)) %>%
  slice(1:20)
## Score GDP.per.capita Perceptions.of.corruption
## 1 7.769 1.340 0.393
## 2 7.600 1.383 0.410
## 3 7.554 1.488 0.341
## 4 7.494 1.380 0.118
## 5 7.488 1.396 0.298
## 6 7.480 1.452 0.343
## 7 7.343 1.387 0.373
## 8 7.307 1.303 0.380
## 9 7.278 1.365 0.308
## 10 7.246 1.376 0.226
## 11 7.228 1.372 0.290
## 12 7.167 1.034 0.093
## 13 7.139 1.276 0.082
## 14 7.090 1.609 0.316
## 15 7.054 1.333 0.278
## 16 7.021 1.499 0.310
## 17 6.985 1.373 0.265
## 18 6.923 1.356 0.210
## 19 6.892 1.433 0.128
## 20 6.852 1.269 0.036
The visualizations below show the distribution of each variable across all countries, as well as where scores are concentrated.
Happiness scores show a roughly symmetric distribution with two peaks near the center
ggplot(data = happy) +
geom_histogram(mapping = aes(x = Score), binwidth = 0.5)
GDP per capita is somewhat left-skewed, with most countries in the upper half of the range
ggplot(data = happy) +
geom_histogram(mapping = aes(x = GDP.per.capita), binwidth = 0.5)
Perceptions of corruption show a heavy concentration at the lower end of the histogram
ggplot(data = happy) +
geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.5)
Because most of the counts fall below 0.25, a narrower bin width lets us zoom in for greater detail
ggplot(data = happy) +
geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.1)
A much narrower bin width, with the y-axis limited to small counts, gives an even closer look and shows the extent of the skew and the concentration of countries with lower corruption perception scores
ggplot(happy) +
geom_histogram(mapping = aes(x = Perceptions.of.corruption), binwidth = 0.001) +
coord_cartesian(ylim = c(0, 4))
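The bin widths above (0.5, 0.1, 0.001) were chosen by eye. As a hedged alternative, the sketch below computes a data-driven bin width with the Freedman-Diaconis rule, which is a swapped-in technique and not something the original analysis used. The vector `corruption` holds only the 20 Perceptions.of.corruption values from the top-20 table above, so the result illustrates the calculation and is not a recommendation for the full dataset.

```r
# Perceptions.of.corruption for the 20 top-ranked countries,
# copied from the table above (not the full dataset).
corruption <- c(0.393, 0.410, 0.341, 0.118, 0.298, 0.343, 0.373, 0.380,
                0.308, 0.226, 0.290, 0.093, 0.082, 0.316, 0.278, 0.310,
                0.265, 0.210, 0.128, 0.036)

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(corruption) / length(corruption)^(1 / 3)

# Count observations per bin at that width, without drawing a plot
breaks <- seq(0, max(corruption) + fd_binwidth, by = fd_binwidth)
bin_counts <- table(cut(corruption, breaks = breaks, include.lowest = TRUE))

print(fd_binwidth)
print(bin_counts)
```

The resulting value could be passed as the `binwidth` argument of geom_histogram() in place of a hand-picked number.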
This set of visualizations is limited because each plot shows a single variable and tells us nothing about the relationships between variables in the dataset. We also lack information on other differences between countries that may correlate with the results, such as location or level of economic development.
We can look at the relationships between variables with the following visualizations
A scatter plot of happiness score against GDP per capita shows a relatively linear relationship
ggplot(data = happy) +
geom_count(mapping = aes(x = Score, y = GDP.per.capita))
This can be improved with hexagonal binning, which shades each cell by the number of countries it contains and so shows where scores are concentrated
ggplot(data = happy) +
geom_hex(mapping = aes(x = Score, y = GDP.per.capita))
The graph of corruption perception against happiness score shows a much weaker positive relationship, with some outliers as well as a concentration of countries with happiness scores around 5
ggplot(data = happy) +
geom_hex(mapping = aes(x = Score, y = Perceptions.of.corruption))
Grouping countries into score bands and plotting GDP per capita for each band as a boxplot shows skewness and outliers: the middle happiness bands are more skewed, while the highest bands show much less variation in GDP, though still with some outliers
ggplot(data = happy, mapping = aes(x = Score, y = GDP.per.capita)) +
geom_boxplot(mapping = aes(group = cut_width(Score, 1)))
Comparing corruption perception and happiness scores reveals similar ranges of perception among most low- and middle-happiness countries (with some extreme outliers) but substantially higher values among higher-happiness countries
ggplot(data = happy, mapping = aes(x = Score, y = Perceptions.of.corruption)) +
geom_boxplot(mapping = aes(group = cut_width(Score, 1)))
When comparing corruption perception and GDP, perception levels remain relatively similar until the highest GDP group, which shows a substantial increase, though with a wide range
ggplot(data = happy, mapping = aes(x = GDP.per.capita, y = Perceptions.of.corruption)) +
geom_boxplot(mapping = aes(group = cut_width(GDP.per.capita, .25)))
Limitations of this set of visualizations include the lack of information about significance levels, which would give a better understanding of how reliable the results are. The metrics used, such as GDP per capita, may also mislead a naive reader, because the values are not actual GDP figures but GDP as measured by the survey's metric. This makes quantities such as the projected increase in happiness score per unit increase in raw GDP difficult to calculate.
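As one possible way to attach a significance level to a relationship such as happiness score versus GDP per capita, the sketch below uses base R's cor.test(), which reports a Pearson correlation together with a p-value and confidence interval. The vectors `score` and `gdp` contain only the 20 top-ranked rows from the table earlier in this document, so the estimate is range-restricted and should not be read as the result for all countries.

```r
# Score and GDP.per.capita for the 20 top-ranked countries,
# copied from the table earlier in this document (not the full dataset).
score <- c(7.769, 7.600, 7.554, 7.494, 7.488, 7.480, 7.343, 7.307, 7.278, 7.246,
           7.228, 7.167, 7.139, 7.090, 7.054, 7.021, 6.985, 6.923, 6.892, 6.852)
gdp <- c(1.340, 1.383, 1.488, 1.380, 1.396, 1.452, 1.387, 1.303, 1.365, 1.376,
         1.372, 1.034, 1.276, 1.609, 1.333, 1.499, 1.373, 1.356, 1.433, 1.269)

# Pearson correlation with a two-sided test of the null hypothesis r = 0
result <- cor.test(score, gdp, method = "pearson")

print(result$estimate)  # sample correlation coefficient
print(result$p.value)   # significance level of the test
print(result$conf.int)  # 95% confidence interval for the correlation
```

With the full data frame, the same call would be `cor.test(happy$Score, happy$GDP.per.capita)`.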
Visualizations for the final project can be improved by closer analysis of variable subgroups, for example by grouping countries into GDP ranges and examining how other variables interact differently within each subgroup. Broadening the dataset to include other potentially relevant variables, such as a media freedom score or the national Gini coefficient, may also result in more robust and informative visuals.
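The subgroup idea can be sketched in base R by cutting GDP per capita into bands and summarising another variable within each band. The band edges (1.00 to 1.75 in steps of 0.25) are an assumption chosen for illustration, and the data frame `top20` contains only the 20 top-ranked rows from the table earlier in this document, so the bands are unevenly filled.

```r
# GDP.per.capita and Perceptions.of.corruption for the 20 top-ranked countries,
# copied from the table earlier in this document (not the full dataset).
top20 <- data.frame(
  GDP.per.capita = c(1.340, 1.383, 1.488, 1.380, 1.396, 1.452, 1.387, 1.303,
                     1.365, 1.376, 1.372, 1.034, 1.276, 1.609, 1.333, 1.499,
                     1.373, 1.356, 1.433, 1.269),
  Perceptions.of.corruption = c(0.393, 0.410, 0.341, 0.118, 0.298, 0.343, 0.373,
                                0.380, 0.308, 0.226, 0.290, 0.093, 0.082, 0.316,
                                0.278, 0.310, 0.265, 0.210, 0.128, 0.036)
)

# Assign each country to a GDP band; the band edges are illustrative assumptions.
top20$gdp_band <- cut(top20$GDP.per.capita,
                      breaks = seq(1.00, 1.75, by = 0.25),
                      include.lowest = TRUE)

# Number of countries per band, and median corruption perception per band
band_sizes <- table(top20$gdp_band)
band_medians <- aggregate(Perceptions.of.corruption ~ gdp_band,
                          data = top20, FUN = median)

print(band_sizes)
print(band_medians)
```

Checking the band sizes first is useful because a band with only one or two countries gives a summary that says little about the subgroup.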