Felicity Draper, s3742570
Last updated: 14 October, 2020
RPubs Link:
- This study investigates whether happiness levels are higher in Western Europe than in Eastern Europe.
- A two-sample t-test is used to determine whether there is a statistically significant difference between the mean scores of the two regions.
# Import and Preprocess Data
library(dplyr)   # inner_join(), filter(), group_by(), summarise(), %>%
library(stringr) # str_trim(), str_to_title()
library(lattice) # histogram()
library(car)     # qqPlot(), leveneTest()
happy <- read.csv("2018.csv")
country_regions <- read.csv("countries_of_the_world.csv")
happy_score <- happy[, 2:3] # Subset for required columns
regions <- country_regions[, 1:2]
colnames(happy_score) <- c("Country", "Score") # Name columns
regions$Region <- str_trim(regions$Region, side = "right") # Trim trailing whitespace so strings match
regions$Country <- str_trim(regions$Country, side = "right")
happy_region <- inner_join(happy_score, regions, by = "Country") # Merge into one data set
happy_region$Region <- str_to_title(happy_region$Region) # Fix all caps
str(happy_region) # print and check
## 'data.frame': 144 obs. of 3 variables:
## $ Country: chr "Finland" "Norway" "Denmark" "Iceland" ...
## $ Score : num 7.63 7.59 7.55 7.5 7.49 ...
## $ Region : chr "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
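Because an inner join silently drops rows that have no match, it can help to list the countries that fail to pair up before merging. The following is a minimal sketch using base R only; the toy data frames `happy_toy` and `regions_toy` and their values are made up for illustration and are not taken from the real files.

```r
# Toy stand-ins for happy_score and regions; names and values are hypothetical.
happy_toy <- data.frame(
  Country = c("Finland", "Norway", "Atlantis"),
  Score   = c(7.6, 7.5, 5.0),
  stringsAsFactors = FALSE
)
regions_toy <- data.frame(
  Country = c("Finland ", "Norway ", "Denmark "),  # trailing spaces, like those trimmed above
  Region  = c("WESTERN EUROPE", "WESTERN EUROPE", "WESTERN EUROPE"),
  stringsAsFactors = FALSE
)

# Strip surrounding whitespace so the join keys are comparable
# (base-R analogue of str_trim)
regions_toy$Country <- trimws(regions_toy$Country)

# Countries that have a score but no region: an inner join would drop these
unmatched <- setdiff(happy_toy$Country, regions_toy$Country)
print(unmatched)
```

With dplyr loaded, `anti_join(happy_score, regions, by = "Country")` would return the same unmatched rows as a data frame.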
# Filter for only East and West Europe
east<- happy_region %>% filter(Region == "Eastern Europe")
west<- happy_region %>% filter(Region == "Western Europe")
east_west<- rbind(east, west)
head(east_west)
# Box plot to show median, IQR and possible outliers
east_west %>% boxplot(Score ~ Region, data = ., main = "Box Plot of Happiness Score by Region",
                      ylab = "Region", xlab = "Happiness Score", horizontal = TRUE, col = "skyblue")
# Side-by-side histogram to show distribution, using lattice package
east_west %>% histogram(~ Score | Region, col = "skyblue", data = ., xlab = "Happiness Score", breaks = 7)
# Happiness Level Statistics by Region (East or West Europe)
east_west %>% group_by(Region) %>% summarise(Min = min(Score,na.rm = TRUE),
Q1 = quantile(Score,probs = .25,na.rm = TRUE),
Median = median(Score, na.rm = TRUE),
Q3 = quantile(Score,probs = .75,na.rm = TRUE),
Max = max(Score,na.rm = TRUE),
Mean = mean(Score, na.rm = TRUE),
SD = sd(Score, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Score))) -> table1
knitr::kable(table1)

| Region | Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|---|
| Eastern Europe | 4.586 | 5.253 | 5.620 | 6.0355 | 6.711 | 5.631182 | 0.6188673 | 11 | 0 |
| Western Europe | 5.358 | 6.558 | 6.977 | 7.4640 | 7.632 | 6.885263 | 0.6974815 | 19 | 0 |
Hypothesis Statement
Null Hypothesis: \[H_0: \mu_1 = \mu_2 \] (There is no difference in means)
Alternative Hypothesis: \[H_A: \mu_1 < \mu_2 \] (The true difference in means is less than 0)
Here \(\mu_1\) is the mean happiness score in Eastern Europe and \(\mu_2\) is the mean happiness score in Western Europe, so the Alternative Hypothesis states that the true mean happiness score in Eastern Europe is lower than that in Western Europe.
# Testing for normality using Q-Q Plot
east$Score %>% qqPlot(dist = "norm", main = "Eastern Europe Normality Test") # Eastern Europe Countries
## [1] 1 11
west$Score %>% qqPlot(dist = "norm", main = "Western Europe Normality Test") # Western Europe Countries
## [1] 19 18
# Testing for homogeneity of variance using Levene's test
leveneTest(Score ~ Region, data = east_west)
The p-value is 0.898. As p > 0.05, we fail to reject the null hypothesis of equal variances, so it is reasonable to assume equal variance.
Based on these results a two-sample t-test assuming equal variance has been performed.
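To make the equal-variance check less of a black box, the sketch below reproduces the idea behind Levene's test in base R: take each score's absolute deviation from its group median (the default centre in `car::leveneTest`) and run a one-way ANOVA on those deviations. The toy data frame and its values are invented for illustration and are not the study data.

```r
# Toy data; group labels and scores are hypothetical.
toy <- data.frame(
  Region = rep(c("A", "B"), each = 5),
  Score  = c(4.6, 5.3, 5.6, 6.0, 6.7, 5.4, 6.6, 7.0, 7.5, 7.6)
)

# Absolute deviation of each score from its own group's median
toy$abs_dev <- abs(toy$Score - ave(toy$Score, toy$Region, FUN = median))

# One-way ANOVA on the deviations; the p-value of its F test is the Levene p-value
levene_fit <- anova(lm(abs_dev ~ Region, data = toy))
levene_p <- levene_fit[["Pr(>F)"]][1]
print(levene_p)
```

A large p-value here means the spread of scores around the group medians is similar in both groups, which is the condition the pooled-variance t-test relies on.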
# Two-sample t-test
t.test(
Score ~ Region,
data = east_west,
var.equal = TRUE,
alternative = "less"
)
##
## Two Sample t-test
##
## data: Score by Region
## t = -4.937, df = 28, p-value = 1.647e-05
## alternative hypothesis: true difference in means is less than 0
## 95 percent confidence interval:
## -Inf -0.821965
## sample estimates:
## mean in group Eastern Europe mean in group Western Europe
## 5.631182 6.885263
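As a cross-check, the t statistic and the upper confidence bound can be recomputed by hand from the summary statistics in the table above, using the pooled-variance formula. This is a sketch of the arithmetic under the equal-variance assumption, not the internals of `t.test`; the variable names are chosen here for illustration.

```r
# Summary statistics copied from the table of happiness scores by region
n_east <- 11; mean_east <- 5.631182; sd_east <- 0.6188673
n_west <- 19; mean_west <- 6.885263; sd_west <- 0.6974815

# Degrees of freedom for the pooled test
df <- n_east + n_west - 2

# Pooled standard deviation and standard error of the difference in means
sp <- sqrt(((n_east - 1) * sd_east^2 + (n_west - 1) * sd_west^2) / df)
se <- sp * sqrt(1 / n_east + 1 / n_west)

# t statistic for (Eastern mean - Western mean)
t_stat <- (mean_east - mean_west) / se

# Upper bound of the one-sided 95% confidence interval
ci_upper <- (mean_east - mean_west) + qt(0.95, df) * se

print(c(t = t_stat, df = df, ci_upper = ci_upper))
```

The values should agree with the `t.test` output above (t = -4.937, df = 28, upper bound -0.821965) up to rounding.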
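The p-value is 1.647e-05, which is below 0.001 and therefore below the 0.05 significance level, so the decision is to reject H0.
The t statistic is -4.937. For a one-tailed test at the 0.05 level with 28 degrees of freedom, the critical t value is -1.701, as returned by R: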
## [1] -1.701131
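The call that produced this critical value is not shown in the output. Assuming a one-tailed (lower-tail) test at a significance level of 0.05 with 28 degrees of freedom, it was presumably equivalent to the `qt` call below; the matching p-value can be obtained with `pt`.

```r
alpha <- 0.05   # significance level (assumed; matches the 0.05 threshold used in the text)
df    <- 28     # degrees of freedom reported by t.test

# Lower-tail critical value for the one-sided test
t_crit <- qt(alpha, df = df)
print(t_crit)

# Lower-tail p-value for the observed t statistic
p_value <- pt(-4.937, df = df)
print(p_value)
```

The observed statistic of -4.937 lies well below `t_crit`, which is the same conclusion the p-value gives.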
As the t statistic is more extreme than the critical t value, the decision is to reject H0.
Both the p-value and the critical-value comparison lead to rejecting the null hypothesis. The one-sided 95% CI for the difference in means (Eastern minus Western) is reported as [-Inf, -0.822]. The sample mean difference is 5.631 - 6.885 = -1.254, and because the interval lies entirely below 0, the null hypothesis of no difference is rejected.
Limitations include:
- The survey relies on self-reporting, so respondents may over- or underestimate their happiness compared to others.
- The head of the household may not report in a way that reflects other members.
- The questionnaire was developed by people within Western European culture and may reflect values that do not accurately describe happiness for Eastern Europeans.
Future investigation into the factors creating this difference in happiness levels would be interesting.
A mean difference of -1.25 points (Eastern minus Western) was found between Happiness Scores in Eastern and Western Europe.
The results of the two-sample t-test found this difference to be statistically significant.
This indicates that the population in Western Europe reported higher levels of overall happiness than the population in Eastern Europe, based on the 2018 Gallup World Poll
“World Happiness Report”, Sustainable Development Solutions Network, 2018, accessed from https://www.kaggle.com/unsdsn/world-happiness
“Countries of the World”, Fernando Lasso, 2018, accessed from https://www.kaggle.com/fernandol/countries-of-the-world
Map of Europe, Google Maps, 2020, accessed from https://www.google.com/maps/place/Europe/