Import necessary libraries and load the dataset:
library(tidyverse) # attaches dplyr, ggplot2, and the %>% pipe
library(lubridate) # date helpers
library(skimr) # provides skim()
whr <- read.csv("data/WHR2023.csv")
Explore the structure and summary of the dataset:
str(whr) # View the structure of the dataset
## 'data.frame': 137 obs. of 19 variables:
## $ Country.name : chr "Finland" "Denmark" "Iceland" "Israel" ...
## $ Ladder.score : num 7.8 7.59 7.53 7.47 7.4 ...
## $ Standard.error.of.ladder.score : num 0.036 0.041 0.049 0.032 0.029 0.037 0.044 0.043 0.069 0.038 ...
## $ upperwhisker : num 7.88 7.67 7.62 7.54 7.46 ...
## $ lowerwhisker : num 7.73 7.51 7.43 7.41 7.35 ...
## $ Logged.GDP.per.capita : num 10.8 11 10.9 10.6 10.9 ...
## $ Social.support : num 0.969 0.954 0.983 0.943 0.93 0.939 0.943 0.92 0.879 0.952 ...
## $ Healthy.life.expectancy : num 71.2 71.2 72 72.7 71.5 ...
## $ Freedom.to.make.life.choices : num 0.961 0.934 0.936 0.809 0.887 0.948 0.947 0.891 0.915 0.887 ...
## $ Generosity : num -0.019 0.134 0.211 -0.023 0.213 0.165 0.141 0.027 0.024 0.175 ...
## $ Perceptions.of.corruption : num 0.182 0.196 0.668 0.708 0.379 0.202 0.283 0.266 0.345 0.271 ...
## $ Ladder.score.in.Dystopia : num 1.78 1.78 1.78 1.78 1.78 ...
## $ Explained.by..Log.GDP.per.capita : num 1.89 1.95 1.93 1.83 1.94 ...
## $ Explained.by..Social.support : num 1.58 1.55 1.62 1.52 1.49 ...
## $ Explained.by..Healthy.life.expectancy : num 0.535 0.537 0.559 0.577 0.545 0.562 0.544 0.582 0.549 0.513 ...
## $ Explained.by..Freedom.to.make.life.choices: num 0.772 0.734 0.738 0.569 0.672 0.754 0.752 0.678 0.71 0.672 ...
## $ Explained.by..Generosity : num 0.126 0.208 0.25 0.124 0.251 0.225 0.212 0.151 0.149 0.23 ...
## $ Explained.by..Perceptions.of.corruption : num 0.535 0.525 0.187 0.158 0.394 0.52 0.463 0.475 0.418 0.471 ...
## $ Dystopia...residual : num 2.36 2.08 2.25 2.69 2.11 ...
head(whr) # Preview the first few rows
summary(whr) # Summary statistics of the dataset
## Country.name Ladder.score Standard.error.of.ladder.score
## Length:137 Min. :1.859 Min. :0.02900
## Class :character 1st Qu.:4.724 1st Qu.:0.04700
## Mode :character Median :5.684 Median :0.06000
## Mean :5.540 Mean :0.06472
## 3rd Qu.:6.334 3rd Qu.:0.07700
## Max. :7.804 Max. :0.14700
##
## upperwhisker lowerwhisker Logged.GDP.per.capita Social.support
## Min. :1.923 Min. :1.795 Min. : 5.527 Min. :0.3410
## 1st Qu.:4.980 1st Qu.:4.496 1st Qu.: 8.591 1st Qu.:0.7220
## Median :5.797 Median :5.529 Median : 9.567 Median :0.8270
## Mean :5.667 Mean :5.413 Mean : 9.450 Mean :0.7991
## 3rd Qu.:6.441 3rd Qu.:6.243 3rd Qu.:10.540 3rd Qu.:0.8960
## Max. :7.875 Max. :7.733 Max. :11.660 Max. :0.9830
##
## Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## Min. :51.53 Min. :0.3820 Min. :-0.25400
## 1st Qu.:60.65 1st Qu.:0.7240 1st Qu.:-0.07400
## Median :65.84 Median :0.8010 Median : 0.00100
## Mean :64.97 Mean :0.7874 Mean : 0.02243
## 3rd Qu.:69.41 3rd Qu.:0.8740 3rd Qu.: 0.11700
## Max. :77.28 Max. :0.9610 Max. : 0.53100
## NA's :1
## Perceptions.of.corruption Ladder.score.in.Dystopia
## Min. :0.1460 Min. :1.778
## 1st Qu.:0.6680 1st Qu.:1.778
## Median :0.7740 Median :1.778
## Mean :0.7254 Mean :1.778
## 3rd Qu.:0.8460 3rd Qu.:1.778
## Max. :0.9290 Max. :1.778
##
## Explained.by..Log.GDP.per.capita Explained.by..Social.support
## Min. :0.000 Min. :0.000
## 1st Qu.:1.099 1st Qu.:0.962
## Median :1.449 Median :1.227
## Mean :1.407 Mean :1.156
## 3rd Qu.:1.798 3rd Qu.:1.401
## Max. :2.200 Max. :1.620
##
## Explained.by..Healthy.life.expectancy
## Min. :0.0000
## 1st Qu.:0.2485
## Median :0.3895
## Mean :0.3662
## 3rd Qu.:0.4875
## Max. :0.7020
## NA's :1
## Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
## Min. :0.000 Min. :0.0000
## 1st Qu.:0.455 1st Qu.:0.0970
## Median :0.557 Median :0.1370
## Mean :0.540 Mean :0.1485
## 3rd Qu.:0.656 3rd Qu.:0.1990
## Max. :0.772 Max. :0.4220
##
## Explained.by..Perceptions.of.corruption Dystopia...residual
## Min. :0.0000 Min. :-0.110
## 1st Qu.:0.0600 1st Qu.: 1.555
## Median :0.1110 Median : 1.849
## Mean :0.1459 Mean : 1.778
## 3rd Qu.:0.1870 3rd Qu.: 2.079
## Max. :0.5610 Max. : 2.955
## NA's :1
skim(whr) # Detailed skim of the dataset
| Name | whr |
|---|---|
| Number of rows | 137 |
| Number of columns | 19 |
| _______________________ | |
| Column type frequency: | |
| character | 1 |
| numeric | 18 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Country.name | 0 | 1 | 4 | 25 | 0 | 137 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Ladder.score | 0 | 1.00 | 5.54 | 1.14 | 1.86 | 4.72 | 5.68 | 6.33 | 7.80 | ▁▂▆▇▃ |
| Standard.error.of.ladder.score | 0 | 1.00 | 0.06 | 0.02 | 0.03 | 0.05 | 0.06 | 0.08 | 0.15 | ▆▇▃▁▁ |
| upperwhisker | 0 | 1.00 | 5.67 | 1.12 | 1.92 | 4.98 | 5.80 | 6.44 | 7.88 | ▁▂▆▇▃ |
| lowerwhisker | 0 | 1.00 | 5.41 | 1.16 | 1.79 | 4.50 | 5.53 | 6.24 | 7.73 | ▁▂▆▇▃ |
| Logged.GDP.per.capita | 0 | 1.00 | 9.45 | 1.21 | 5.53 | 8.59 | 9.57 | 10.54 | 11.66 | ▁▃▆▇▆ |
| Social.support | 0 | 1.00 | 0.80 | 0.13 | 0.34 | 0.72 | 0.83 | 0.90 | 0.98 | ▁▂▃▆▇ |
| Healthy.life.expectancy | 1 | 0.99 | 64.97 | 5.75 | 51.53 | 60.65 | 65.84 | 69.41 | 77.28 | ▃▃▇▇▂ |
| Freedom.to.make.life.choices | 0 | 1.00 | 0.79 | 0.11 | 0.38 | 0.72 | 0.80 | 0.87 | 0.96 | ▁▁▃▇▇ |
| Generosity | 0 | 1.00 | 0.02 | 0.14 | -0.25 | -0.07 | 0.00 | 0.12 | 0.53 | ▃▇▅▁▁ |
| Perceptions.of.corruption | 0 | 1.00 | 0.73 | 0.18 | 0.15 | 0.67 | 0.77 | 0.85 | 0.93 | ▁▁▁▅▇ |
| Ladder.score.in.Dystopia | 0 | 1.00 | 1.78 | 0.00 | 1.78 | 1.78 | 1.78 | 1.78 | 1.78 | ▁▁▇▁▁ |
| Explained.by..Log.GDP.per.capita | 0 | 1.00 | 1.41 | 0.43 | 0.00 | 1.10 | 1.45 | 1.80 | 2.20 | ▁▃▆▇▆ |
| Explained.by..Social.support | 0 | 1.00 | 1.16 | 0.33 | 0.00 | 0.96 | 1.23 | 1.40 | 1.62 | ▁▂▃▆▇ |
| Explained.by..Healthy.life.expectancy | 1 | 0.99 | 0.37 | 0.16 | 0.00 | 0.25 | 0.39 | 0.49 | 0.70 | ▃▃▇▇▂ |
| Explained.by..Freedom.to.make.life.choices | 0 | 1.00 | 0.54 | 0.15 | 0.00 | 0.46 | 0.56 | 0.66 | 0.77 | ▁▁▃▇▇ |
| Explained.by..Generosity | 0 | 1.00 | 0.15 | 0.08 | 0.00 | 0.10 | 0.14 | 0.20 | 0.42 | ▃▇▅▁▁ |
| Explained.by..Perceptions.of.corruption | 0 | 1.00 | 0.15 | 0.13 | 0.00 | 0.06 | 0.11 | 0.19 | 0.56 | ▇▅▁▁▁ |
| Dystopia...residual | 1 | 0.99 | 1.78 | 0.50 | -0.11 | 1.56 | 1.85 | 2.08 | 2.96 | ▁▂▅▇▂ |
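The dotted column names in the output above (for example Explained.by..Log.GDP.per.capita) are a side effect of read.csv(), which by default (check.names = TRUE) passes the file's headers through make.names(). The sketch below shows that conversion. The raw headers in it are assumed for illustration, since the CSV's original headers are not shown here.

```r
# Hypothetical raw headers; the real CSV headers are not shown in this document.
raw_headers <- c("Country name",
                 "Explained by: Log GDP per capita",
                 "Dystopia + residual")

# read.csv() applies make.names() to headers when check.names = TRUE (the default),
# replacing every character that is not valid in an R name with a dot.
clean_headers <- make.names(raw_headers)
print(clean_headers)
```

Passing check.names = FALSE to read.csv() would keep the headers unchanged, at the cost of needing backticks around any name that contains spaces or punctuation.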
Check for missing data:
missing_data <- colSums(is.na(whr)) # number of NAs in each column
missing_percent <- missing_data / nrow(whr) * 100 # share of rows that are missing, in percent
missing_df <- data.frame(
  variable = names(missing_data),
  missing_percent = missing_percent
)
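The percentages show which variables have gaps, but not which countries are affected. A minimal sketch of how the incomplete rows could be listed, using a small invented data frame in place of whr:

```r
# Toy data standing in for whr; the values are invented for illustration.
toy <- data.frame(
  Country.name = c("A", "B", "C"),
  Ladder.score = c(7.1, 5.2, 4.3),
  Healthy.life.expectancy = c(71.0, NA, 60.5)
)

# complete.cases() is TRUE for rows without any NA; negate it to keep incomplete rows.
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)

# Keep only the variables that have at least one missing value.
missing_counts <- colSums(is.na(toy))
print(missing_counts[missing_counts > 0])
```

The same two calls should work on whr directly, for example whr[!complete.cases(whr), ].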
Visualize missing data:
ggplot(missing_df, aes(x = reorder(variable, missing_percent),
                       y = missing_percent)) +
  geom_col() + # equivalent to geom_bar(stat = "identity")
  coord_flip() +
  theme_minimal() +
  labs(title = "Percentage of Missing Values by Variable",
       x = "Variables", y = "Missing Percentage")
Create a GDP category: Classify each country as High GDP if its Logged.GDP.per.capita is at or above the median value, and as Low GDP if it is below the median. Hint: Use the median() function to find the median GDP, and ifelse() to categorize the countries.
Clean the data: Remove any rows where the happiness score Ladder.score is missing.
whr_clean <- whr %>%
  mutate(
    # Label each country relative to the median of logged GDP per capita
    GDP = ifelse(Logged.GDP.per.capita >= median(Logged.GDP.per.capita, na.rm = TRUE),
                 "High GDP", "Low GDP")
  ) %>%
  filter(!is.na(Ladder.score)) # Remove rows with missing ladder score values
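To see how the median split behaves, including what happens to a missing GDP value, here is a small sketch on invented numbers (not WHR data):

```r
# Toy values; not taken from the WHR data.
log_gdp <- c(8.0, 9.0, 10.0, 11.0, NA)

# na.rm = TRUE keeps a single missing value from turning the median into NA.
cutoff <- median(log_gdp, na.rm = TRUE)

# ifelse() is vectorised: values at or above the cutoff become "High GDP".
# An NA input stays NA rather than being forced into either group.
gdp_group <- ifelse(log_gdp >= cutoff, "High GDP", "Low GDP")

print(cutoff)
print(table(gdp_group, useNA = "ifany"))
```

Running table() on the new column is a quick check that the two groups are roughly equal in size, which is what a median split should produce.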
Calculate average happiness scores: Group the dataset by GDP category (high vs. low GDP) and calculate the average happiness score Ladder.score for each group.
country_stats <- whr_clean %>%
  group_by(GDP) %>%
  summarise(Avg_Happiness_Score = mean(Ladder.score)) %>%
  arrange(desc(Avg_Happiness_Score))
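A group mean is easier to judge next to the group size and spread. The sketch below extends the same summarise() pattern with n() and sd(). The data frame and the extra column names (n, SD_Happiness_Score) are illustrative and not part of the original exercise.

```r
library(dplyr)

# Toy data; the scores are invented for illustration.
toy <- data.frame(
  GDP = c("High GDP", "High GDP", "Low GDP", "Low GDP"),
  Ladder.score = c(7.0, 6.0, 5.0, 4.0)
)

group_summary <- toy %>%
  group_by(GDP) %>%
  summarise(
    n = n(),                                  # number of countries in the group
    Avg_Happiness_Score = mean(Ladder.score), # group mean
    SD_Happiness_Score = sd(Ladder.score),    # spread within the group
    .groups = "drop"                          # return an ungrouped tibble
  )

print(group_summary)
```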
Create a box plot: Create a box plot that compares the happiness score Ladder.score between high and low GDP countries. Use ggplot2 to create the plot.
Box Plot (Happiness Score by GDP):
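box_plot <- ggplot(whr_clean,
                   aes(x = GDP, y = Ladder.score)) +
  geom_boxplot(outlier.shape = NA) + # hide box plot outliers; the jittered points already show them
  geom_jitter(width = 0.2, height = 0, alpha = 0.1) + # jitter horizontally only, so scores are not shifted
  theme_minimal() +
  labs(title = "Happiness Score by GDP",
       x = "GDP", y = "Happiness Score")
print(box_plot)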
Perform a t-test: Perform a t-test to compare the average happiness scores between high and low GDP countries. Interpret the result briefly (focus on the p-value).
t_test_result <- t.test(Ladder.score ~ GDP, data = whr_clean)
print(t_test_result)
##
## Welch Two Sample t-test
##
## data: Ladder.score by GDP
## t = 10.779, df = 130.93, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High GDP and group Low GDP is not equal to 0
## 95 percent confidence interval:
## 1.262241 1.829667
## sample estimates:
## mean in group High GDP mean in group Low GDP
## 6.307130 4.761176
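Interpretation: the p-value is below 2.2e-16, far under the usual 0.05 threshold, so the difference in mean happiness between the two groups is statistically significant. High GDP countries average about 6.31 against 4.76 for Low GDP countries, and the 95 percent confidence interval for the difference (1.26 to 1.83) does not include 0.

When these numbers are needed in a report, they can be read from the result object instead of copied from the printout. A minimal sketch on invented data, assuming the same formula interface as above:

```r
# Toy data; the scores are invented for illustration.
toy <- data.frame(
  GDP = rep(c("High GDP", "Low GDP"), each = 5),
  Ladder.score = c(6.8, 7.1, 6.5, 7.4, 6.9, 4.2, 5.0, 4.7, 3.9, 4.5)
)

# t.test() defaults to var.equal = FALSE, i.e. Welch's test,
# which does not assume equal variances in the two groups.
res <- t.test(Ladder.score ~ GDP, data = toy)

# The result is a list of class "htest"; its components can be read directly.
p_value <- res$p.value                                  # two-sided p-value
mean_diff <- unname(res$estimate[1] - res$estimate[2])  # High GDP mean minus Low GDP mean
ci <- res$conf.int                                      # 95% CI for the difference in means

cat("p-value:", format.pval(p_value), "\n")
cat("difference in means:", round(mean_diff, 2), "\n")
cat("95% CI:", round(ci, 2), "\n")
```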