1. Setting up the Environment

Import necessary libraries and load the dataset:

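library(tidyverse)  # attaches ggplot2, dplyr, and related packages
library(lubridate)
library(skimr)      # provides skim()
library(dplyr)      # already attached by tidyverse; harmless to load again
whr <- read.csv("data/WHR2023.csv")  # path is relative to the working directory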
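The dotted column names in the output that follows (for example Explained.by..Log.GDP.per.capita) come from read.csv(), which by default passes headers through make.names() and replaces spaces and punctuation with dots. The sketch below shows this on a tiny inline CSV; the header strings and values are illustrative assumptions, not copied from WHR2023.csv.

```r
# Toy CSV text standing in for the real file (headers and values are made up)
csv_text <- "Country name,Ladder score,Dystopia + residual
Country A,1.0,0.5"

# Default: headers are converted to syntactically valid R names
whr_default <- read.csv(text = csv_text)
print(names(whr_default))

# check.names = FALSE keeps the headers exactly as written in the file
whr_raw <- read.csv(text = csv_text, check.names = FALSE)
print(names(whr_raw))

# make.names() is the conversion that read.csv() applies by default
print(make.names("Explained by: Log GDP per capita"))
```

Keeping the default is convenient here because the dotted names can be used in formulas and dplyr verbs without backticks.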

2. Initial Exploration of the Dataset

Explore the structure and summary of the dataset:

str(whr)  # View the structure of the dataset
## 'data.frame':    137 obs. of  19 variables:
##  $ Country.name                              : chr  "Finland" "Denmark" "Iceland" "Israel" ...
##  $ Ladder.score                              : num  7.8 7.59 7.53 7.47 7.4 ...
##  $ Standard.error.of.ladder.score            : num  0.036 0.041 0.049 0.032 0.029 0.037 0.044 0.043 0.069 0.038 ...
##  $ upperwhisker                              : num  7.88 7.67 7.62 7.54 7.46 ...
##  $ lowerwhisker                              : num  7.73 7.51 7.43 7.41 7.35 ...
##  $ Logged.GDP.per.capita                     : num  10.8 11 10.9 10.6 10.9 ...
##  $ Social.support                            : num  0.969 0.954 0.983 0.943 0.93 0.939 0.943 0.92 0.879 0.952 ...
##  $ Healthy.life.expectancy                   : num  71.2 71.2 72 72.7 71.5 ...
##  $ Freedom.to.make.life.choices              : num  0.961 0.934 0.936 0.809 0.887 0.948 0.947 0.891 0.915 0.887 ...
##  $ Generosity                                : num  -0.019 0.134 0.211 -0.023 0.213 0.165 0.141 0.027 0.024 0.175 ...
##  $ Perceptions.of.corruption                 : num  0.182 0.196 0.668 0.708 0.379 0.202 0.283 0.266 0.345 0.271 ...
##  $ Ladder.score.in.Dystopia                  : num  1.78 1.78 1.78 1.78 1.78 ...
##  $ Explained.by..Log.GDP.per.capita          : num  1.89 1.95 1.93 1.83 1.94 ...
##  $ Explained.by..Social.support              : num  1.58 1.55 1.62 1.52 1.49 ...
##  $ Explained.by..Healthy.life.expectancy     : num  0.535 0.537 0.559 0.577 0.545 0.562 0.544 0.582 0.549 0.513 ...
##  $ Explained.by..Freedom.to.make.life.choices: num  0.772 0.734 0.738 0.569 0.672 0.754 0.752 0.678 0.71 0.672 ...
##  $ Explained.by..Generosity                  : num  0.126 0.208 0.25 0.124 0.251 0.225 0.212 0.151 0.149 0.23 ...
##  $ Explained.by..Perceptions.of.corruption   : num  0.535 0.525 0.187 0.158 0.394 0.52 0.463 0.475 0.418 0.471 ...
##  $ Dystopia...residual                       : num  2.36 2.08 2.25 2.69 2.11 ...
head(whr)  # Preview the first few rows
summary(whr)  # Summary statistics of the dataset
##  Country.name        Ladder.score   Standard.error.of.ladder.score
##  Length:137         Min.   :1.859   Min.   :0.02900               
##  Class :character   1st Qu.:4.724   1st Qu.:0.04700               
##  Mode  :character   Median :5.684   Median :0.06000               
##                     Mean   :5.540   Mean   :0.06472               
##                     3rd Qu.:6.334   3rd Qu.:0.07700               
##                     Max.   :7.804   Max.   :0.14700               
##                                                                   
##   upperwhisker    lowerwhisker   Logged.GDP.per.capita Social.support  
##  Min.   :1.923   Min.   :1.795   Min.   : 5.527        Min.   :0.3410  
##  1st Qu.:4.980   1st Qu.:4.496   1st Qu.: 8.591        1st Qu.:0.7220  
##  Median :5.797   Median :5.529   Median : 9.567        Median :0.8270  
##  Mean   :5.667   Mean   :5.413   Mean   : 9.450        Mean   :0.7991  
##  3rd Qu.:6.441   3rd Qu.:6.243   3rd Qu.:10.540        3rd Qu.:0.8960  
##  Max.   :7.875   Max.   :7.733   Max.   :11.660        Max.   :0.9830  
##                                                                        
##  Healthy.life.expectancy Freedom.to.make.life.choices   Generosity      
##  Min.   :51.53           Min.   :0.3820               Min.   :-0.25400  
##  1st Qu.:60.65           1st Qu.:0.7240               1st Qu.:-0.07400  
##  Median :65.84           Median :0.8010               Median : 0.00100  
##  Mean   :64.97           Mean   :0.7874               Mean   : 0.02243  
##  3rd Qu.:69.41           3rd Qu.:0.8740               3rd Qu.: 0.11700  
##  Max.   :77.28           Max.   :0.9610               Max.   : 0.53100  
##  NA's   :1                                                              
##  Perceptions.of.corruption Ladder.score.in.Dystopia
##  Min.   :0.1460            Min.   :1.778           
##  1st Qu.:0.6680            1st Qu.:1.778           
##  Median :0.7740            Median :1.778           
##  Mean   :0.7254            Mean   :1.778           
##  3rd Qu.:0.8460            3rd Qu.:1.778           
##  Max.   :0.9290            Max.   :1.778           
##                                                    
##  Explained.by..Log.GDP.per.capita Explained.by..Social.support
##  Min.   :0.000                    Min.   :0.000               
##  1st Qu.:1.099                    1st Qu.:0.962               
##  Median :1.449                    Median :1.227               
##  Mean   :1.407                    Mean   :1.156               
##  3rd Qu.:1.798                    3rd Qu.:1.401               
##  Max.   :2.200                    Max.   :1.620               
##                                                               
##  Explained.by..Healthy.life.expectancy
##  Min.   :0.0000                       
##  1st Qu.:0.2485                       
##  Median :0.3895                       
##  Mean   :0.3662                       
##  3rd Qu.:0.4875                       
##  Max.   :0.7020                       
##  NA's   :1                            
##  Explained.by..Freedom.to.make.life.choices Explained.by..Generosity
##  Min.   :0.000                              Min.   :0.0000          
##  1st Qu.:0.455                              1st Qu.:0.0970          
##  Median :0.557                              Median :0.1370          
##  Mean   :0.540                              Mean   :0.1485          
##  3rd Qu.:0.656                              3rd Qu.:0.1990          
##  Max.   :0.772                              Max.   :0.4220          
##                                                                     
##  Explained.by..Perceptions.of.corruption Dystopia...residual
##  Min.   :0.0000                          Min.   :-0.110     
##  1st Qu.:0.0600                          1st Qu.: 1.555     
##  Median :0.1110                          Median : 1.849     
##  Mean   :0.1459                          Mean   : 1.778     
##  3rd Qu.:0.1870                          3rd Qu.: 2.079     
##  Max.   :0.5610                          Max.   : 2.955     
##                                          NA's   :1
skim(whr)  # Detailed skim of the dataset
Data summary

Name                      whr
Number of rows            137
Number of columns         19
_______________________
Column type frequency:
  character               1
  numeric                 18
________________________
Group variables           None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
Country.name           0              1    4   25      0       137           0

Variable type: numeric

skim_variable                               n_missing  complete_rate   mean    sd     p0    p25    p50    p75   p100  hist
Ladder.score                                        0           1.00   5.54  1.14   1.86   4.72   5.68   6.33   7.80  ▁▂▆▇▃
Standard.error.of.ladder.score                      0           1.00   0.06  0.02   0.03   0.05   0.06   0.08   0.15  ▆▇▃▁▁
upperwhisker                                        0           1.00   5.67  1.12   1.92   4.98   5.80   6.44   7.88  ▁▂▆▇▃
lowerwhisker                                        0           1.00   5.41  1.16   1.79   4.50   5.53   6.24   7.73  ▁▂▆▇▃
Logged.GDP.per.capita                               0           1.00   9.45  1.21   5.53   8.59   9.57  10.54  11.66  ▁▃▆▇▆
Social.support                                      0           1.00   0.80  0.13   0.34   0.72   0.83   0.90   0.98  ▁▂▃▆▇
Healthy.life.expectancy                             1           0.99  64.97  5.75  51.53  60.65  65.84  69.41  77.28  ▃▃▇▇▂
Freedom.to.make.life.choices                        0           1.00   0.79  0.11   0.38   0.72   0.80   0.87   0.96  ▁▁▃▇▇
Generosity                                          0           1.00   0.02  0.14  -0.25  -0.07   0.00   0.12   0.53  ▃▇▅▁▁
Perceptions.of.corruption                           0           1.00   0.73  0.18   0.15   0.67   0.77   0.85   0.93  ▁▁▁▅▇
Ladder.score.in.Dystopia                            0           1.00   1.78  0.00   1.78   1.78   1.78   1.78   1.78  ▁▁▇▁▁
Explained.by..Log.GDP.per.capita                    0           1.00   1.41  0.43   0.00   1.10   1.45   1.80   2.20  ▁▃▆▇▆
Explained.by..Social.support                        0           1.00   1.16  0.33   0.00   0.96   1.23   1.40   1.62  ▁▂▃▆▇
Explained.by..Healthy.life.expectancy               1           0.99   0.37  0.16   0.00   0.25   0.39   0.49   0.70  ▃▃▇▇▂
Explained.by..Freedom.to.make.life.choices          0           1.00   0.54  0.15   0.00   0.46   0.56   0.66   0.77  ▁▁▃▇▇
Explained.by..Generosity                            0           1.00   0.15  0.08   0.00   0.10   0.14   0.20   0.42  ▃▇▅▁▁
Explained.by..Perceptions.of.corruption             0           1.00   0.15  0.13   0.00   0.06   0.11   0.19   0.56  ▇▅▁▁▁
Dystopia...residual                                 1           0.99   1.78  0.50  -0.11   1.56   1.85   2.08   2.96  ▁▂▅▇▂

Check for missing data:

missing_data <- colSums(is.na(whr))
missing_percent <- (missing_data/nrow(whr))*100
missing_df <- data.frame(
    variable = names(missing_data),
    missing_percent = missing_percent
)

Visualize missing data:

ggplot(missing_df, aes(x = reorder(variable, missing_percent),
                    y = missing_percent)) +
    geom_bar(stat = "identity") +
    coord_flip() +
    theme_minimal() +
    labs(title = "Percentage of Missing Values by Variable",
        x = "Variables", y = "Missing Percentage")
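The summaries above report one missing value each in Healthy.life.expectancy, Explained.by..Healthy.life.expectancy, and Dystopia...residual. A natural follow-up is to check which rows those values sit in. The sketch below shows one way to do that with complete.cases(), using a small made-up data frame (toy and its columns are hypothetical); the same two calls should work on whr.

```r
# Made-up data frame with a single missing value
toy <- data.frame(
  country = c("A", "B", "C"),
  score   = c(6.1, 5.2, 4.3),
  health  = c(70.5, NA, 61.0)
)

# Count missing values per column and keep only the affected columns
missing_counts <- colSums(is.na(toy))
print(missing_counts[missing_counts > 0])

# complete.cases() is TRUE for rows without any NA, so negate it
incomplete_rows <- toy[!complete.cases(toy), ]
print(incomplete_rows)
```

If all missing values turn out to belong to the same row, that tells you a single country is affected rather than several.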

3. Data Cleaning and Transformation

Create a GDP category: Classify countries as High GDP or Low GDP according to whether their Logged.GDP.per.capita is at or above the median value, or below it. Hint: Use the median() function to find the median GDP, and ifelse() to categorize the countries.

Clean the data: Remove any rows where the happiness score (Ladder.score) is missing.

whr_clean <- whr %>%
    # Label each country relative to the median logged GDP per capita
    mutate(GDP = ifelse(Logged.GDP.per.capita >= median(Logged.GDP.per.capita, na.rm = TRUE),
                        "High GDP", "Low GDP")) %>%
    # Remove rows with a missing ladder (happiness) score
    filter(!is.na(Ladder.score))
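To see how the median split and the filter behave when values are missing, here is a minimal sketch on made-up data (toy, log_gdp, and score are hypothetical names). It uses dplyr's if_else(), a stricter alternative to ifelse(), and passes na.rm = TRUE to median(), since median() returns NA if any input is NA. In whr itself, Logged.GDP.per.capita and Ladder.score have no missing values according to the summaries above, so both safeguards are precautionary.

```r
library(dplyr)

# Made-up data: one missing GDP value and one missing score
toy <- tibble(
  country = c("A", "B", "C", "D", "E"),
  log_gdp = c(7.5, 9.0, 9.6, 10.4, NA),
  score   = c(4.1, 5.0, NA, 6.8, 5.5)
)

# Median of the non-missing GDP values
gdp_median <- median(toy$log_gdp, na.rm = TRUE)

toy_clean <- toy %>%
  # Rows with a missing log_gdp get NA rather than a misleading label
  mutate(GDP = if_else(log_gdp >= gdp_median, "High GDP", "Low GDP")) %>%
  # Drop rows with a missing score
  filter(!is.na(score))

print(toy_clean)
```

Note that a country with a missing GDP value stays in the data with an NA category; whether to drop it is a separate decision.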

4. Data Summarization

Calculate average happiness scores: Group the dataset by GDP category (high vs. low GDP) and calculate the average happiness score (Ladder.score) for each group.

country_stats <- whr_clean %>%
    group_by(GDP) %>%
    summarise(
        Avg_Happiness_Score = mean(Ladder.score)) %>%
    arrange(desc(Avg_Happiness_Score))
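A group mean is easier to judge alongside the group size and spread. The sketch below extends the same group_by() and summarise() pattern with n() and sd() on made-up data (toy and its values are hypothetical); the column names N and SD_Happiness_Score are illustrative choices.

```r
library(dplyr)

# Made-up scores for two GDP groups
toy <- tibble(
  GDP          = c("High GDP", "High GDP", "Low GDP", "Low GDP"),
  Ladder.score = c(7.0, 6.0, 5.0, 4.0)
)

toy_stats <- toy %>%
  group_by(GDP) %>%
  summarise(
    N                   = n(),                              # countries per group
    Avg_Happiness_Score = mean(Ladder.score, na.rm = TRUE), # group mean
    SD_Happiness_Score  = sd(Ladder.score, na.rm = TRUE),   # group spread
    .groups = "drop"
  ) %>%
  arrange(desc(Avg_Happiness_Score))

print(toy_stats)
```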

5. Data Visualization

Create a box plot: Compare the happiness scores (Ladder.score) of high and low GDP countries in a box plot. Use ggplot2 to create the plot.

Box Plot (Happiness Score by GDP):

box_plot <- ggplot(whr_clean,
                aes(x = GDP, y = Ladder.score)) +
    geom_boxplot() +
    geom_jitter(alpha = 0.1) +
    theme_minimal() +
    labs(title = "Happiness Score by GDP",
        x = "GDP", y = "Happiness Score")
print(box_plot)
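Two optional refinements to the plot above are sketched here on made-up data (toy is hypothetical). First, ggplot2 orders a character x variable alphabetically, so High GDP is drawn before Low GDP; converting GDP to a factor with explicit levels controls that order. Second, when jittered points are overlaid, outliers appear twice unless the box plot's own outlier markers are hidden with outlier.shape = NA.

```r
library(ggplot2)

# Made-up scores for two GDP groups
toy <- data.frame(
  GDP          = c("High GDP", "Low GDP", "High GDP", "Low GDP", "High GDP", "Low GDP"),
  Ladder.score = c(7.1, 4.2, 6.5, 5.0, 6.9, 4.6)
)

# Explicit factor levels put Low GDP on the left
toy$GDP <- factor(toy$GDP, levels = c("Low GDP", "High GDP"))

toy_plot <- ggplot(toy, aes(x = GDP, y = Ladder.score)) +
  geom_boxplot(outlier.shape = NA) +        # hide outliers drawn by the box plot
  geom_jitter(width = 0.15, alpha = 0.5) +  # every observation, slightly spread out
  theme_minimal() +
  labs(title = "Happiness Score by GDP",
       x = "GDP", y = "Happiness Score")

print(toy_plot)
```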

6. Statistical Analysis

Perform a t-test: Perform a t-test to compare the average happiness scores between high and low GDP countries. Interpret the result briefly (focus on the p-value).

t_test_result <- t.test(Ladder.score ~ GDP, data = whr_clean)
print(t_test_result)
## 
##  Welch Two Sample t-test
## 
## data:  Ladder.score by GDP
## t = 10.779, df = 130.93, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group High GDP and group Low GDP is not equal to 0
## 95 percent confidence interval:
##  1.262241 1.829667
## sample estimates:
## mean in group High GDP  mean in group Low GDP 
##               6.307130               4.761176
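Interpretation: the p-value is below 2.2e-16, far under the usual 0.05 threshold, so the null hypothesis of equal mean happiness in the two groups is rejected. High GDP countries average about 6.31 against about 4.76 for Low GDP countries, a difference of roughly 1.55 points, with a 95 percent confidence interval from 1.26 to 1.83. This shows an association between GDP and happiness, not a causal effect.

The object returned by t.test() stores these values, so they can be extracted rather than read off the printout. A minimal sketch on simulated data follows; toy and its parameters are made up and are not the WHR values.

```r
set.seed(42)  # make the simulated data reproducible

# Simulated scores for two groups of 30 (hypothetical values)
toy <- data.frame(
  GDP          = rep(c("High GDP", "Low GDP"), each = 30),
  Ladder.score = c(rnorm(30, mean = 6.3, sd = 0.8),
                   rnorm(30, mean = 4.8, sd = 0.8))
)

# Welch two-sample t-test, the default for t.test()
toy_test <- t.test(Ladder.score ~ GDP, data = toy)

print(toy_test$p.value)   # p-value as a plain number
print(toy_test$conf.int)  # confidence interval for High minus Low
print(toy_test$estimate)  # the two group means

# Decision at the 5 percent significance level
significant <- toy_test$p.value < 0.05
print(significant)
```

The difference is reported as High GDP minus Low GDP because the groups are ordered alphabetically.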