library(tidyverse)
library(readxl)
library(GGally)
library(corrplot)
library(broom)
library(patchwork)
library(glmnet)
library(ggplot2)
library(infer)
happiness_data <- read_excel("C:/Users/Khandker/Documents/Final Project/World_happiness_data_region.xlsx")
str(happiness_data)
## tibble [1,969 × 14] (S3: tbl_df/tbl/data.frame)
## $ Year : num [1:1969] 2024 2023 2022 2021 2020 ...
## $ Rank : num [1:1969] 1 143 137 146 150 153 154 145 141 154 ...
## $ Country name : chr [1:1969] "Finland" "Afghanistan" "Afghanistan" "Afghanistan" ...
## $ Ladder score : num [1:1969] 7.74 1.72 1.86 2.4 2.52 ...
## $ upperwhisker : num [1:1969] 7.81 1.77 1.92 2.47 2.6 ...
## $ lowerwhisker : num [1:1969] 7.66 1.67 1.79 2.34 2.45 ...
## $ Explained by: Log GDP per capita : num [1:1969] 1.749 0.628 0.645 0.758 0.37 ...
## $ Explained by: Social support : num [1:1969] 1.78 0 0 0 0 ...
## $ Explained by: Healthy life expectancy : num [1:1969] 0.824 0.242 0.087 0.289 0.126 ...
## $ Explained by: Freedom to make life choices: num [1:1969] 0.986 0 0 0 0 0 NA NA NA NA ...
## $ Explained by: Generosity : num [1:1969] 0.11 0.091 0.093 0.089 0.122 ...
## $ Explained by: Perceptions of corruption : num [1:1969] 0.502 0.088 0.059 0.005 0.01 ...
## $ Dystopia + residual : num [1:1969] 1.782 0.672 0.976 1.263 1.895 ...
## $ Region : chr [1:1969] "Western Europe" "Asia-Pacific" "Asia-Pacific" "Asia-Pacific" ...
colnames(happiness_data)
## [1] "Year"
## [2] "Rank"
## [3] "Country name"
## [4] "Ladder score"
## [5] "upperwhisker"
## [6] "lowerwhisker"
## [7] "Explained by: Log GDP per capita"
## [8] "Explained by: Social support"
## [9] "Explained by: Healthy life expectancy"
## [10] "Explained by: Freedom to make life choices"
## [11] "Explained by: Generosity"
## [12] "Explained by: Perceptions of corruption"
## [13] "Dystopia + residual"
## [14] "Region"
The column names of the original dataset is too large that might obscure the essential information in the plots. Therefore, the column names are renamed with shorter names as below:
happiness_data <- happiness_data %>%
rename(
Year = Year,
Rank = Rank,
Country = `Country name`,
Score = `Ladder score`,
UpperWhisker = upperwhisker,
LowerWhisker = lowerwhisker,
GDP = `Explained by: Log GDP per capita`,
SocialSupport = `Explained by: Social support`,
Health = `Explained by: Healthy life expectancy`,
Freedom = `Explained by: Freedom to make life choices`,
Generosity = `Explained by: Generosity`,
Corruption = `Explained by: Perceptions of corruption`,
DystopiaResidual = `Dystopia + residual`,
Region = `Region`
)
colnames(happiness_data)
## [1] "Year" "Rank" "Country" "Score"
## [5] "UpperWhisker" "LowerWhisker" "GDP" "SocialSupport"
## [9] "Health" "Freedom" "Generosity" "Corruption"
## [13] "DystopiaResidual" "Region"
happiness_allVar <- happiness_data %>%
select(
GDP,
SocialSupport,
Health,
Freedom,
Generosity,
Corruption,
Rank,
Score,
Year,
Rank,
Country,
UpperWhisker,
LowerWhisker,
DystopiaResidual,
Region
) %>%
na.omit()
#ggpairs(happiness_subset)
happiness_subset <- happiness_data %>%
select(
GDP,
SocialSupport,
Health,
Freedom,
Generosity,
Corruption,
Rank,
Score,
) %>%
na.omit()
ggpairs(happiness_subset)
From the ggpairs plot, we can see that Rank is highly
correlated to Score (-0.98). This means Rank
is the inverse representation of Score. A country with
higher Score should have lower Rank. Other
explanatory variables such as GDP,
SocialSupport, Health, Freedom,
Corruption has higher correlation with Score.
Therefore, these variables could be useful to explain Score
of a country. On the other hand, Generosity has the lowest
correlation (0.043) with Score.
To find how GDP is related to Score, a linear model can be used as below:
lm_gdp <- lm(Score ~ GDP, data = happiness_subset)
happiness_augmented <- augment(lm_gdp)
summary(lm_gdp)
##
## Call:
## lm(formula = Score ~ GDP, data = happiness_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4711 -0.4444 0.0054 0.5490 2.1836
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.49940 0.07793 44.90 <2e-16 ***
## GDP 1.66807 0.05969 27.95 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8158 on 866 degrees of freedom
## Multiple R-squared: 0.4742, Adjusted R-squared: 0.4736
## F-statistic: 781.1 on 1 and 866 DF, p-value: < 2.2e-16
The linear model equation is: \[
\hat{\text{Score}} = 3.49940 + 1.66807 \times \text{GDP}
\] From the model equation and summary it can be said that for
one unit increase in GDP, Score increases by
1.66807. The p-value (2e-16) is much lower than alpha = 0.05, indicating
GDP to be a significant predictor for Score.
The R-square value reveals that around 47.42% of Score can
be interpreted by GDP.
A linear regression line and model diagnosis plots are presented below for validation. The residual plot shows that the model preserves linearity since the fitted values are evenly scattered on both sides of horizontal axis, and shows no pattern. The histogram and Q-Q plot also confirms the normality condition. Since the residual plot does not show any cone or funnel shapes, variability condition is also met.
# Create each plot
p1 <- ggplot(data = happiness_subset, aes(x = GDP, y = Score)) +
geom_jitter() +
geom_smooth(method = "lm") +
ggtitle("GDP vs Happiness Score") +
theme_minimal()
p2 <- ggplot(data = happiness_augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = happiness_augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = happiness_augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
To find how Health is related to Score, a linear model can be used as below:
lm_health <- lm(Score ~ Health, data = happiness_subset)
happiness_hlth_augmented <- augment(lm_health)
summary(lm_health)
##
## Call:
## lm(formula = Score ~ Health, data = happiness_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.88593 -0.55045 0.09366 0.60998 2.29471
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.73624 0.07568 49.37 <2e-16 ***
## Health 3.31411 0.12896 25.70 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8474 on 866 degrees of freedom
## Multiple R-squared: 0.4327, Adjusted R-squared: 0.432
## F-statistic: 660.4 on 1 and 866 DF, p-value: < 2.2e-16
The linear model equation is: \[ \hat{\text{Score}} = 3.73624 + 3.31411 \times \text{Health} \] 2x2 plot:
# Create each plot
p1 <- ggplot(data = happiness_subset, aes(x = Health, y = Score)) +
geom_jitter() +
geom_smooth(method = "lm") +
ggtitle("Health vs Happiness Score") +
theme_minimal()
p2 <- ggplot(data = happiness_hlth_augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = happiness_hlth_augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = happiness_hlth_augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
To find how Freedom is related to Score, a linear model can be used as below:
lm_freedom <- lm(Score ~ Freedom, data = happiness_subset)
happiness_freedm_augmented <- augment(lm_freedom)
summary(lm_freedom)
##
## Call:
## lm(formula = Score ~ Freedom, data = happiness_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89050 -0.54943 0.05824 0.66053 2.07614
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.6293 0.1055 34.39 <2e-16 ***
## Freedom 3.3828 0.1784 18.96 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9457 on 866 degrees of freedom
## Multiple R-squared: 0.2933, Adjusted R-squared: 0.2925
## F-statistic: 359.4 on 1 and 866 DF, p-value: < 2.2e-16
The linear model equation is: \[ \hat{\text{Score}} = 3.6293 + 3.3828 \times \text{Freedom} \] 2x2 plot:
# Create each plot
p1 <- ggplot(data = happiness_subset, aes(x = Freedom, y = Score)) +
geom_jitter() +
geom_smooth(method = "lm") +
ggtitle("Freedom vs Happiness Score") +
theme_minimal()
p2 <- ggplot(data = happiness_freedm_augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = happiness_freedm_augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = happiness_freedm_augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
To find how Corruption is related to Score, a linear model can be used as below:
lm_corruption <- lm(Score ~ Corruption, data = happiness_subset)
happiness_corrupt_augmented <- augment(lm_corruption)
summary(lm_corruption)
##
## Call:
## lm(formula = Score ~ Corruption, data = happiness_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1337 -0.6641 0.2520 0.7821 1.9682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.9523 0.0538 92.04 <2e-16 ***
## Corruption 4.0401 0.2865 14.10 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.015 on 866 degrees of freedom
## Multiple R-squared: 0.1867, Adjusted R-squared: 0.1858
## F-statistic: 198.8 on 1 and 866 DF, p-value: < 2.2e-16
The linear model equation is: \[ \hat{\text{Score}} = 4.9523 + 4.0401 \times \text{Corruption} \] 2x2 plot:
# Create each plot
p1 <- ggplot(data = happiness_subset, aes(x = Corruption, y = Score)) +
geom_jitter() +
geom_smooth(method = "lm") +
ggtitle("Corruption vs Happiness Score") +
theme_minimal()
p2 <- ggplot(data = happiness_corrupt_augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = happiness_corrupt_augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = happiness_corrupt_augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
To find how Generosity is related to Score, a linear model can be used as below:
lm_generosity <- lm(Score ~ Generosity, data = happiness_subset)
happiness_generosity_augmented <- augment(lm_generosity)
summary(lm_generosity)
##
## Call:
## lm(formula = Score ~ Generosity, data = happiness_subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1272 -0.7936 0.1232 0.8327 2.3237
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.44970 0.07797 69.899 <2e-16 ***
## Generosity 0.55346 0.43989 1.258 0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.124 on 866 degrees of freedom
## Multiple R-squared: 0.001825, Adjusted R-squared: 0.000672
## F-statistic: 1.583 on 1 and 866 DF, p-value: 0.2087
The linear model equation is: \[ \hat{\text{Score}} = 5.44970 + 0.55346 \times \text{Generosity} \] 2x2 plot:
# Create each plot
p1 <- ggplot(data = happiness_subset, aes(x = Generosity, y = Score)) +
geom_jitter() +
geom_smooth(method = "lm") +
ggtitle("Generosity vs Happiness Score") +
theme_minimal()
p2 <- ggplot(data = happiness_generosity_augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = happiness_generosity_augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = happiness_generosity_augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
## `geom_smooth()` using formula = 'y ~ x'
mlr_full <- lm(Score ~ GDP
+ SocialSupport + Health + Freedom + Generosity + Corruption, data = happiness_allVar)
mlr_full__augmented <- augment(mlr_full)
summary(mlr_full)
##
## Call:
## lm(formula = Score ~ GDP + SocialSupport + Health + Freedom +
## Generosity + Corruption, data = happiness_allVar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.92287 -0.35664 0.05344 0.40467 1.58561
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.04314 0.09075 22.513 < 2e-16 ***
## GDP 0.70679 0.06492 10.888 < 2e-16 ***
## SocialSupport 0.80622 0.08492 9.494 < 2e-16 ***
## Health 1.51755 0.11657 13.019 < 2e-16 ***
## Freedom 1.00947 0.14272 7.073 3.13e-12 ***
## Generosity 1.51052 0.25418 5.943 4.07e-09 ***
## Corruption 0.93746 0.20023 4.682 3.30e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6075 on 861 degrees of freedom
## Multiple R-squared: 0.7101, Adjusted R-squared: 0.7081
## F-statistic: 351.5 on 6 and 861 DF, p-value: < 2.2e-16
mlr_full <- lm(Score ~ GDP
+ SocialSupport + Health + Freedom + Generosity + Corruption + Region, data = happiness_allVar)
mlr_full__augmented <- augment(mlr_full)
summary(mlr_full)
##
## Call:
## lm(formula = Score ~ GDP + SocialSupport + Health + Freedom +
## Generosity + Corruption + Region, data = happiness_allVar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.79917 -0.30293 0.03624 0.35746 1.51265
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.29888 0.09348 24.593 < 2e-16 ***
## GDP 0.65591 0.06573 9.978 < 2e-16 ***
## SocialSupport 0.63606 0.07859 8.094 1.98e-15 ***
## Health 1.12739 0.12289 9.174 < 2e-16 ***
## Freedom 0.86195 0.13351 6.456 1.80e-10 ***
## Generosity 1.71891 0.24605 6.986 5.68e-12 ***
## Corruption 1.04664 0.21004 4.983 7.58e-07 ***
## RegionAsia-Pacific -0.02774 0.06456 -0.430 0.66752
## RegionEastern Europe 0.44253 0.08317 5.321 1.32e-07 ***
## RegionEurope 0.26254 0.17394 1.509 0.13159
## RegionLatin America 0.68826 0.07553 9.113 < 2e-16 ***
## RegionNorth America 0.65019 0.17951 3.622 0.00031 ***
## RegionWestern Europe 0.61473 0.09965 6.169 1.06e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5525 on 855 degrees of freedom
## Multiple R-squared: 0.7619, Adjusted R-squared: 0.7585
## F-statistic: 227.9 on 12 and 855 DF, p-value: < 2.2e-16
happiness_allVar$RegionSimplified <- as.character(happiness_allVar$Region)
happiness_allVar$RegionSimplified[happiness_allVar$Region %in% c("Asia-Pacific", "Europe")] <- "Other"
happiness_allVar$RegionSimplified <- as.factor(happiness_allVar$RegionSimplified)
# Fit new model
model_reduced <- lm(Score ~ GDP + SocialSupport + Health + Freedom +
Generosity + Corruption + RegionSimplified, data = happiness_allVar)
summary(model_reduced)
##
## Call:
## lm(formula = Score ~ GDP + SocialSupport + Health + Freedom +
## Generosity + Corruption + RegionSimplified, data = happiness_allVar)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.81062 -0.30973 0.03266 0.35671 1.65391
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.28902 0.09342 24.502 < 2e-16 ***
## GDP 0.66099 0.06575 10.053 < 2e-16 ***
## SocialSupport 0.64644 0.07846 8.239 6.46e-16 ***
## Health 1.13201 0.12301 9.203 < 2e-16 ***
## Freedom 0.86242 0.13367 6.452 1.85e-10 ***
## Generosity 1.73371 0.24620 7.042 3.89e-12 ***
## Corruption 0.99153 0.20791 4.769 2.18e-06 ***
## RegionSimplifiedEastern Europe 0.43114 0.08302 5.193 2.58e-07 ***
## RegionSimplifiedLatin America 0.68055 0.07549 9.015 < 2e-16 ***
## RegionSimplifiedNorth America 0.64449 0.17970 3.586 0.000354 ***
## RegionSimplifiedOther -0.01987 0.06448 -0.308 0.758002
## RegionSimplifiedWestern Europe 0.61225 0.09976 6.137 1.28e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5532 on 856 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7579
## F-statistic: 247.8 on 11 and 856 DF, p-value: < 2.2e-16
2x2 plot:
# Create each plot
p1 <- NULL
p2 <- ggplot(data = mlr_full__augmented, aes(x = .fitted, y = .resid)) +
geom_point() +
geom_hline(yintercept = 0, linetype = "dashed") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals vs Fitted") +
theme_minimal()
p3 <- ggplot(data = mlr_full__augmented, aes(x = .resid)) +
geom_histogram(bins = 25, fill = "steelblue", color = "black") +
xlab("Residuals") +
ggtitle("Histogram of Residuals") +
theme_minimal()
p4 <- ggplot(data = mlr_full__augmented, aes(sample = .resid)) +
stat_qq() +
stat_qq_line() +
ggtitle("QQ Plot of Residuals") +
theme_minimal()
# Combine the four plots into a 2x2 grid
(p1 | p2) / (p3 | p4)
# Define high and low-income groups based on GDP
median_gdp <- median(happiness_allVar$GDP, na.rm = TRUE)
# Create IncomeGroup variable: High and Low based on GDP
happiness_inf <- happiness_allVar %>%
mutate(IncomeGroup = ifelse(GDP > median_gdp, "High", "Low"))
ggplot(happiness_inf, aes(x = IncomeGroup, y = Score, fill = IncomeGroup)) +
geom_boxplot() +
labs(title = "Happiness Score by Income Group",
x = "Income Group", y = "Happiness Score") +
theme_minimal()
• Independence: we assume World_happiness_data collection method ensures independent sampling. The data has been collected randomly and the sample size is small (less than 10%) relative to the population size.
• Sample size: we can look into the sample size using the following code:
The distribution looks nearly normal. According to the income group sizes, the “High” and “Low” groups both have above 30 observations (433 and 435, respectively), meeting the sample size requirement.
happiness_inf %>%
group_by(IncomeGroup) %>%
summarise(mean_score = mean(Score, na.rm = TRUE), n = n())
#gdp_data <- na.omit(happiness_allVar$GDP)
# Create a data frame for plotting
gdp_df <- data.frame(GDP = happiness_allVar$GDP)
# Plot histogram with overlaid normal distribution curve
ggplot(gdp_df, aes(x = GDP)) +
geom_histogram(aes(y = ..density..), bins = 30, fill = "skyblue", color = "black") +
stat_function(fun = dnorm,
args = list(mean = mean(happiness_allVar$GDP), sd = sd(happiness_allVar$GDP)),
color = "red", size = 1) +
labs(title = "Distribution of GDP with Normal Curve",
x = "GDP",
y = "Density") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
### Define hypothesis
Null Hypothesis (H0): There is no significant difference in the mean happiness scores between high-income and low-income countries. \[ H_0: \mu_{\text{high}} = \mu_{\text{low}} \] Alternative Hypothesis (H1): There is a significant difference in the mean happiness scores between high-income and low-income countries. \[ H_1: \mu_{\text{high}} \ne \mu_{\text{low}} \]
obs_diff <- happiness_inf %>%
specify(Score ~ IncomeGroup) %>%
calculate(stat = "diff in means",
order = c("High", "Low"))
obs_diff
set.seed(1234)
null_dist <- happiness_inf %>%
specify(Score ~ IncomeGroup) %>%
hypothesize(null = "independence") %>%
generate(reps = 1000, type = "permute") %>%
calculate(stat = "diff in means", order = c("High", "Low"))
ggplot(data = null_dist, aes(x = stat)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
p_value <- suppressWarnings(null_dist %>%
get_p_value(obs_stat = obs_diff, direction = "two_sided"))
p_value
diff_ci <- null_dist %>%
get_confidence_interval(level = 0.95, type = "percentile")
diff_ci
This strongly suggests that we reject the null hypothesis in favor of the alternative hypothesis, which implies that there is a significant difference in happiness scores between high-income and low-income countries.