Every year, Gallup World Poll releases a publication about the world’s happiness score which they named “World Happiness Report”. Joining experts from various expertise, they measure the happiness score for every country in the world and rank them from the happiest country to the least happy country.

By studying this data, we could find what’s matters when we talk about people’s happiness. Further, we can push the government (in the respective country) to make the right public policies and focus on the sectors which have been proven to have a significant relationship with the country’s happiness index.

About the Report

By using Multiple Regression Model, we would find out what variables provided in the data which have significant effects on the Happiness Score.

You can access the dataset here

1. Data Extraction and Cleaning

1.1 Read the dataset

First, we’re going to call all the library packages we would need for this study

library(ggplot2)
library(ggthemes)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble  3.1.6     v dplyr   1.0.8
## v tidyr   1.2.0     v stringr 1.4.0
## v readr   2.1.2     v forcats 0.5.1
## v purrr   0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag()    masks stats::lag()
library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
## 
##     src, summarize
## The following object is masked from 'package:plotly':
## 
##     subplot
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
library(psych)
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit
## The following object is masked from 'package:Hmisc':
## 
##     describe
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

Second, we’re going to create a new variable called happiness_index to store our dataset

happiness_index <- read.csv("world happiness report.csv")
head(happiness_index)
##       Country Regional_indicator Ladder_score Standard_error_of_ladder_score
## 1     Finland     Western Europe        7.842                          0.032
## 2     Denmark     Western Europe        7.620                          0.035
## 3 Switzerland     Western Europe        7.571                          0.036
## 4     Iceland     Western Europe        7.554                          0.059
## 5 Netherlands     Western Europe        7.464                          0.027
## 6      Norway     Western Europe        7.392                          0.035
##   upperwhisker lowerwhisker GDP_per_capita Social_support
## 1        7.904        7.780         10.775          0.954
## 2        7.687        7.552         10.933          0.954
## 3        7.643        7.500         11.117          0.942
## 4        7.670        7.438         10.878          0.983
## 5        7.518        7.410         10.932          0.942
## 6        7.462        7.323         11.053          0.954
##   Healthy_life_expectancy Freedom_to_make_life_choices Generosity
## 1                    72.0                        0.949     -0.098
## 2                    72.7                        0.946      0.030
## 3                    74.4                        0.919      0.025
## 4                    73.0                        0.955      0.160
## 5                    72.4                        0.913      0.175
## 6                    73.3                        0.960      0.093
##   Perceptions_of_corruption Ladder_score_in_Dystopia
## 1                     0.186                     2.43
## 2                     0.179                     2.43
## 3                     0.292                     2.43
## 4                     0.673                     2.43
## 5                     0.338                     2.43
## 6                     0.270                     2.43
##   Explained_by_Log_GDP_per_capita Explained_by_Social_support
## 1                           1.446                       1.106
## 2                           1.502                       1.108
## 3                           1.566                       1.079
## 4                           1.482                       1.172
## 5                           1.501                       1.079
## 6                           1.543                       1.108
##   Explained_by_Healthy_life_expectancy
## 1                                0.741
## 2                                0.763
## 3                                0.816
## 4                                0.772
## 5                                0.753
## 6                                0.782
##   Explained_by_Freedom_to_make_life_choices Explained_by_Generosity
## 1                                     0.691                   0.124
## 2                                     0.686                   0.208
## 3                                     0.653                   0.204
## 4                                     0.698                   0.293
## 5                                     0.647                   0.302
## 6                                     0.703                   0.249
##   Explained_by_Perceptions_of_corruption Dystopia_._residual
## 1                                  0.481               3.253
## 2                                  0.485               2.868
## 3                                  0.413               2.839
## 4                                  0.170               2.967
## 5                                  0.384               2.798
## 6                                  0.427               2.580
summary(happiness_index)
##    Country          Regional_indicator  Ladder_score  
##  Length:149         Length:149         Min.   :2.523  
##  Class :character   Class :character   1st Qu.:4.852  
##  Mode  :character   Mode  :character   Median :5.534  
##                                        Mean   :5.533  
##                                        3rd Qu.:6.255  
##                                        Max.   :7.842  
##  Standard_error_of_ladder_score  upperwhisker    lowerwhisker  
##  Min.   :0.02600                Min.   :2.596   Min.   :2.449  
##  1st Qu.:0.04300                1st Qu.:4.991   1st Qu.:4.706  
##  Median :0.05400                Median :5.625   Median :5.413  
##  Mean   :0.05875                Mean   :5.648   Mean   :5.418  
##  3rd Qu.:0.07000                3rd Qu.:6.344   3rd Qu.:6.128  
##  Max.   :0.17300                Max.   :7.904   Max.   :7.780  
##  GDP_per_capita   Social_support   Healthy_life_expectancy
##  Min.   : 6.635   Min.   :0.4630   Min.   :48.48          
##  1st Qu.: 8.541   1st Qu.:0.7500   1st Qu.:59.80          
##  Median : 9.569   Median :0.8320   Median :66.60          
##  Mean   : 9.432   Mean   :0.8147   Mean   :64.99          
##  3rd Qu.:10.421   3rd Qu.:0.9050   3rd Qu.:69.60          
##  Max.   :11.647   Max.   :0.9830   Max.   :76.95          
##  Freedom_to_make_life_choices   Generosity       Perceptions_of_corruption
##  Min.   :0.3820               Min.   :-0.28800   Min.   :0.0820           
##  1st Qu.:0.7180               1st Qu.:-0.12600   1st Qu.:0.6670           
##  Median :0.8040               Median :-0.03600   Median :0.7810           
##  Mean   :0.7916               Mean   :-0.01513   Mean   :0.7274           
##  3rd Qu.:0.8770               3rd Qu.: 0.07900   3rd Qu.:0.8450           
##  Max.   :0.9700               Max.   : 0.54200   Max.   :0.9390           
##  Ladder_score_in_Dystopia Explained_by_Log_GDP_per_capita
##  Min.   :2.43             Min.   :0.0000                 
##  1st Qu.:2.43             1st Qu.:0.6660                 
##  Median :2.43             Median :1.0250                 
##  Mean   :2.43             Mean   :0.9772                 
##  3rd Qu.:2.43             3rd Qu.:1.3230                 
##  Max.   :2.43             Max.   :1.7510                 
##  Explained_by_Social_support Explained_by_Healthy_life_expectancy
##  Min.   :0.0000              Min.   :0.0000                      
##  1st Qu.:0.6470              1st Qu.:0.3570                      
##  Median :0.8320              Median :0.5710                      
##  Mean   :0.7933              Mean   :0.5202                      
##  3rd Qu.:0.9960              3rd Qu.:0.6650                      
##  Max.   :1.1720              Max.   :0.8970                      
##  Explained_by_Freedom_to_make_life_choices Explained_by_Generosity
##  Min.   :0.0000                            Min.   :0.000          
##  1st Qu.:0.4090                            1st Qu.:0.105          
##  Median :0.5140                            Median :0.164          
##  Mean   :0.4987                            Mean   :0.178          
##  3rd Qu.:0.6030                            3rd Qu.:0.239          
##  Max.   :0.7160                            Max.   :0.541          
##  Explained_by_Perceptions_of_corruption Dystopia_._residual
##  Min.   :0.0000                         Min.   :0.648      
##  1st Qu.:0.0600                         1st Qu.:2.138      
##  Median :0.1010                         Median :2.509      
##  Mean   :0.1351                         Mean   :2.430      
##  3rd Qu.:0.1740                         3rd Qu.:2.794      
##  Max.   :0.5470                         Max.   :3.482
str(happiness_index)
## 'data.frame':    149 obs. of  20 variables:
##  $ Country                                  : chr  "Finland" "Denmark" "Switzerland" "Iceland" ...
##  $ Regional_indicator                       : chr  "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Ladder_score                             : num  7.84 7.62 7.57 7.55 7.46 ...
##  $ Standard_error_of_ladder_score           : num  0.032 0.035 0.036 0.059 0.027 0.035 0.036 0.037 0.04 0.036 ...
##  $ upperwhisker                             : num  7.9 7.69 7.64 7.67 7.52 ...
##  $ lowerwhisker                             : num  7.78 7.55 7.5 7.44 7.41 ...
##  $ GDP_per_capita                           : num  10.8 10.9 11.1 10.9 10.9 ...
##  $ Social_support                           : num  0.954 0.954 0.942 0.983 0.942 0.954 0.934 0.908 0.948 0.934 ...
##  $ Healthy_life_expectancy                  : num  72 72.7 74.4 73 72.4 73.3 72.7 72.6 73.4 73.3 ...
##  $ Freedom_to_make_life_choices             : num  0.949 0.946 0.919 0.955 0.913 0.96 0.945 0.907 0.929 0.908 ...
##  $ Generosity                               : num  -0.098 0.03 0.025 0.16 0.175 0.093 0.086 -0.034 0.134 0.042 ...
##  $ Perceptions_of_corruption                : num  0.186 0.179 0.292 0.673 0.338 0.27 0.237 0.386 0.242 0.481 ...
##  $ Ladder_score_in_Dystopia                 : num  2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 ...
##  $ Explained_by_Log_GDP_per_capita          : num  1.45 1.5 1.57 1.48 1.5 ...
##  $ Explained_by_Social_support              : num  1.11 1.11 1.08 1.17 1.08 ...
##  $ Explained_by_Healthy_life_expectancy     : num  0.741 0.763 0.816 0.772 0.753 0.782 0.763 0.76 0.785 0.782 ...
##  $ Explained_by_Freedom_to_make_life_choices: num  0.691 0.686 0.653 0.698 0.647 0.703 0.685 0.639 0.665 0.64 ...
##  $ Explained_by_Generosity                  : num  0.124 0.208 0.204 0.293 0.302 0.249 0.244 0.166 0.276 0.215 ...
##  $ Explained_by_Perceptions_of_corruption   : num  0.481 0.485 0.413 0.17 0.384 0.427 0.448 0.353 0.445 0.292 ...
##  $ Dystopia_._residual                      : num  3.25 2.87 2.84 2.97 2.8 ...

Before the extraction, this dataset has 149 rows(observations) and 20 columns (variables).

1.2 Data Extraction

Using all the variables is not necessary. We could single out every variable that isn’t needed in the model and only leave variables in the dataset that we need.

happiness_index$Standard_error_of_ladder_score <- NULL
happiness_index$upperwhisker <- NULL
happiness_index$lowerwhisker <- NULL
happiness_index_num <- happiness_index[,1:9]

str(happiness_index_num)
## 'data.frame':    149 obs. of  9 variables:
##  $ Country                     : chr  "Finland" "Denmark" "Switzerland" "Iceland" ...
##  $ Regional_indicator          : chr  "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
##  $ Ladder_score                : num  7.84 7.62 7.57 7.55 7.46 ...
##  $ GDP_per_capita              : num  10.8 10.9 11.1 10.9 10.9 ...
##  $ Social_support              : num  0.954 0.954 0.942 0.983 0.942 0.954 0.934 0.908 0.948 0.934 ...
##  $ Healthy_life_expectancy     : num  72 72.7 74.4 73 72.4 73.3 72.7 72.6 73.4 73.3 ...
##  $ Freedom_to_make_life_choices: num  0.949 0.946 0.919 0.955 0.913 0.96 0.945 0.907 0.929 0.908 ...
##  $ Generosity                  : num  -0.098 0.03 0.025 0.16 0.175 0.093 0.086 -0.034 0.134 0.042 ...
##  $ Perceptions_of_corruption   : num  0.186 0.179 0.292 0.673 0.338 0.27 0.237 0.386 0.242 0.481 ...

After the extraction, our dataset left with 149 rows(observations) and 9 columns(variables) with the statistical summaries shown below.

summary(happiness_index_num)
##    Country          Regional_indicator  Ladder_score   GDP_per_capita  
##  Length:149         Length:149         Min.   :2.523   Min.   : 6.635  
##  Class :character   Class :character   1st Qu.:4.852   1st Qu.: 8.541  
##  Mode  :character   Mode  :character   Median :5.534   Median : 9.569  
##                                        Mean   :5.533   Mean   : 9.432  
##                                        3rd Qu.:6.255   3rd Qu.:10.421  
##                                        Max.   :7.842   Max.   :11.647  
##  Social_support   Healthy_life_expectancy Freedom_to_make_life_choices
##  Min.   :0.4630   Min.   :48.48           Min.   :0.3820              
##  1st Qu.:0.7500   1st Qu.:59.80           1st Qu.:0.7180              
##  Median :0.8320   Median :66.60           Median :0.8040              
##  Mean   :0.8147   Mean   :64.99           Mean   :0.7916              
##  3rd Qu.:0.9050   3rd Qu.:69.60           3rd Qu.:0.8770              
##  Max.   :0.9830   Max.   :76.95           Max.   :0.9700              
##    Generosity       Perceptions_of_corruption
##  Min.   :-0.28800   Min.   :0.0820           
##  1st Qu.:-0.12600   1st Qu.:0.6670           
##  Median :-0.03600   Median :0.7810           
##  Mean   :-0.01513   Mean   :0.7274           
##  3rd Qu.: 0.07900   3rd Qu.:0.8450           
##  Max.   : 0.54200   Max.   :0.9390

1.3 Checking for Missing Values

colSums(is.na(happiness_index_num))
##                      Country           Regional_indicator 
##                            0                            0 
##                 Ladder_score               GDP_per_capita 
##                            0                            0 
##               Social_support      Healthy_life_expectancy 
##                            0                            0 
## Freedom_to_make_life_choices                   Generosity 
##                            0                            0 
##    Perceptions_of_corruption 
##                            0

Missing value is not found in the dataset

2. Exploratory Data Analysis (EDA)

2.1 Ladder Score (Happiness Score)

ggplot(data=happiness_index_num, aes(x=Ladder_score,
                                      fill=Regional_indicator)) +
  geom_density(alpha=0.7, col=FALSE) +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
                               "#AFB83B", "#6ACCBC", "#FF8D85",
                               "#EB96EB", "#299438", "#B8B8B8",
                               "#158FAD")) +
  labs(title="Distribution of Happiness Score",
       x="Ladder Score",
       fill="Regional",
       y=NULL) +
  theme_gdocs()

This plot shows us the distribution of the happiness score for every regions. The higher the ladder score, the happier people tend to be.

2.2 GDP per Capita

Top 10 GDP per Capita

ggplot(data=happiness_index_num %>% top_n(10, GDP_per_capita),
       aes(x=GDP_per_capita,
           y=Country,
           fill=Regional_indicator)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title = "Top 10 GDP per Capita",
       x = "GDP per Capita",
       y = NULL,
       fill = "Regional") +
  scale_fill_manual(values = c("#FAD000", "#6ACCBC", "#FF8D85",
                               "#299438", "#158FAD")) +
  theme_fivethirtyeight()+
  theme(axis.title = element_text(),
        axis.text.x = element_text(angle = 15))

Bottom 10 GDP per Capita

ggplot(data=happiness_index_num %>% top_n(-10, GDP_per_capita),
       aes(x=GDP_per_capita,
           y=Country,
           fill=Regional_indicator)) +
  geom_bar(stat="identity", position="dodge") +
  labs(title = "Bottom 10 GDP per Capita",
       x = "GDP per Capita",
       y = NULL,
       fill = "Regional") +
  scale_fill_manual(values = c("#AFB83B", "#B8B8B8")) +
  theme_fivethirtyeight()+
  theme(axis.title = element_text(),
        axis.text.x = element_text(angle = 15))

GDP per Capita is used by economists, usually to measure the standard of living of people in a country.

These plots show us that 6/10 of countries with the highest GDP are from Western Europe and 9/10 of countries with the lowest GDP are from Sub-Saharan Africa.

2.3 Social Support Index

Top 10 Social Support

ggplot(data=happiness_index_num %>% top_n(10, Social_support),
       aes(x=Country,
           y=Social_support,
           fill=Regional_indicator)) +
  geom_bar(position="dodge", stat="identity") +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FF8D85",
                               "#158FAD")) +
  labs(title = "Top 10 Social Support",
       x = NULL,
       y = "Social Support",
       fill = "Regional") +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        axis.text.x = element_text(angle = 15))

Bottom 10 Social Support

ggplot(data=happiness_index_num %>% top_n(-10, Social_support),
       aes(x=Country,
           y=Social_support,
           fill=Regional_indicator)) +
  geom_bar(position="dodge", stat="identity") +
  scale_fill_manual(values = c("#AFB83B", "#6ACCBC", "#EB96EB",
                               "#B8B8B8")) +
  labs(title = "Bottom 10 Social Support",
       x = NULL,
       y = "Social Support",
       fill = "Regional") +
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        axis.text.x = element_text(angle = 15))

Social Support value is used to answer a simple question “Do you have someone you could rely on when you need help?”.

High Social Support index means the people feel secure enough because they indeed have someone (could be family, friends, public organization, etc) they could ask for help whenever they need it.

2.4 Healthy Life Expectancy

ggplot(data=happiness_index_num, aes(x = Regional_indicator,
                                     y = Healthy_life_expectancy,
                                     fill = Regional_indicator)) +
  geom_boxplot() +
  geom_point(alpha = 0.7, color = "darkblue", position = "jitter") +
  labs(title = "Healthy Life Expectancy Index",
       subtitle = "Distribution of World's Healthy Life Expectancy Index",
       x = "Regional Indicator",
       y = "",
       fill = "Regional") +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
                               "#AFB83B", "#6ACCBC", "#FF8D85",
                               "#EB96EB", "#299438", "#B8B8B8",
                               "#158FAD"))+
  theme_fivethirtyeight()+
  theme(axis.title = element_text(),
        axis.text.x = element_text(angle = 45))

The term ‘health expectancy’ is used to describe, within a person’s life expectancy, the expected years spent in various health states, such as years with disability.

2.5 Freedom to Make Life Choices

ggplot(data=happiness_index_num, aes(x=Freedom_to_make_life_choices,
                                     fill=Regional_indicator)) +
  geom_histogram(bins=10, alpha=1.5) +
  labs(title = "Freedom to Make Life Choices",
       x = "Freedom to Make Life Choices",
       y = "",
       fill = "Regional") +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
                               "#AFB83B", "#6ACCBC", "#FF8D85",
                               "#EB96EB", "#299438", "#B8B8B8",
                               "#158FAD"))+
  theme_fivethirtyeight()

This variable shows us how much people in a country could have the freedom to make choices regarding how they’re going to lead their own life. The higher the value, the better.

2.6 Generosity

ggplot(data=happiness_index_num, aes(x=Generosity,
                                     y=Regional_indicator,
                                     fill=Regional_indicator)) +
  geom_violin(scale="count",
              alpha=0.7,
              width=3,
              position=position_dodge(width=0.5)) +
  geom_point(color="red",
             position="jitter",
             show.legend=F) +
  labs(title = "Generosity Index",
       subtitle = "Distribution of Generosity Index for each Region",
       x = "Generosity Index",
       y = "",
       fill = "Regional") +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
                               "#AFB83B", "#6ACCBC", "#FF8D85",
                               "#EB96EB", "#299438", "#B8B8B8",
                               "#158FAD"))+
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
## Warning: position_dodge requires non-overlapping x intervals

Samples from all countries were asked if they have donated their money in the past month. The high Generosity index indicates more people answer “Yes” than “No” to that question.

2.7 Perceptions of Corruption

ggplot(data=happiness_index_num, aes(x=Perceptions_of_corruption,
                                     y=Regional_indicator,
                                     fill=Regional_indicator)) +
  geom_violin(scale="count",
              alpha=0.7,
              width=3,
              position=position_dodge(width=0.5)) +
  geom_point(color="forestgreen",
             position="jitter",
             show.legend=F) +
  labs(title = "Perceptions of Corruption Index",
       subtitle = "Distribution of Perceptions of Corruption for each Region",
       x = "PoC Index",
       y = "",
       fill = "Regional") +
  scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
                               "#AFB83B", "#6ACCBC", "#FF8D85",
                               "#EB96EB", "#299438", "#B8B8B8",
                               "#158FAD"))+
  theme_fivethirtyeight() +
  theme(axis.title = element_text(),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())
## Warning: position_dodge requires non-overlapping x intervals

This variable measures people’s perceptions of corruption levels in their own country. A high PoC index means the corruption in both public and business sectors is high.

3. Machine Learning Design

3.1 Transforming Data into PCA Format

set.seed(2022)
happiness_index_num2 <- happiness_index_num[,3:9] #to single out the char variables

pcaout <- prcomp(happiness_index_num2, scale=T)

3.2 Computing and Ploting VE, PVE, and Cum PVE

ve <- pcaout$sdev^2
plot(ve)

pve <- ve/sum(ve)
plot(pve,
     xlab = "Principal Component",
     ylab = "PVE",
     ylim = c(0,1),
     type = "b")

pve
## [1] 0.55907243 0.18418147 0.10117376 0.07406394 0.03585500 0.02762710 0.01802630
total_3 <- 0.55907243 + 0.18418147 + 0.10117376
total_3
## [1] 0.8444277
total_4 <- total_3 + 0.07406394
total_4
## [1] 0.9184916

We’re going to use 4 PCA variables for it could represent 92% of our data.

new_feat <- pcaout$x[,1:4]

new_feat_df <- data.frame(new_feat)
new_happiness_df <- cbind(happiness_index_num2$Ladder_score,
                          new_feat_df)
head(new_happiness_df)
##   happiness_index_num2$Ladder_score       PC1       PC2        PC3        PC4
## 1                             7.842 -3.782594 0.9903290  1.8989054 -0.3461422
## 2                             7.620 -3.760957 1.5886547  1.5601997  0.1177404
## 3                             7.571 -3.642531 1.1341103  1.1569381  0.3099945
## 4                             7.554 -3.134687 0.9595203 -1.1374648  0.1328065
## 5                             7.464 -3.263927 1.8175064  0.4948283  0.6596040
## 6                             7.392 -3.646376 1.6735290  0.9596761  0.1892724
colnames(new_happiness_df)[1] <- "LadderScore"
colnames(new_happiness_df)
## [1] "LadderScore" "PC1"         "PC2"         "PC3"         "PC4"

Train and Test Data Split

row = dim(new_happiness_df) [1]
column = dim(new_happiness_df) [2]

trainidx_pca <- sample(row, row*0.75)
trainidx_pca[1:10]
##  [1]  55  75   6 123  14   7  93 112   1  51
training_data_pca <- new_happiness_df[trainidx_pca,]
testing_data_pca <- new_happiness_df[-trainidx_pca,]

4. Modelling

4.1 Multiple Linear Regression-1st Model

mlr_pca <- lm(formula = LadderScore~.,
          data = training_data_pca)
summary(mlr_pca)
## 
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.11609 -0.20220 -0.00111  0.22986  0.94014 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.54271    0.03801 145.835   <2e-16 ***
## PC1         -0.49651    0.01933 -25.692   <2e-16 ***
## PC2          0.04280    0.03184   1.344   0.1818    
## PC3         -0.09121    0.04376  -2.084   0.0395 *  
## PC4          0.04856    0.05705   0.851   0.3966    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3977 on 106 degrees of freedom
## Multiple R-squared:  0.8639, Adjusted R-squared:  0.8588 
## F-statistic: 168.2 on 4 and 106 DF,  p-value: < 2.2e-16

The level of significance of this model is 85%

Eliminating Extreme Values

cutoff <- 4/((nrow(training_data_pca)-length(mlr_pca$coefficients)-2))
plot(mlr_pca, which = 4, cook.levels = cutoff)

plot(mlr_pca, which = 5, cook.levels = cutoff)

training_data_pca2 <- training_data_pca[-which(rownames
                                               (training_data_pca)
                                       %in% c("32","147","126")),]

4.2 Multiple Linear Regression-2nd Model

mlr2_pca <- lm(formula = LadderScore~., 
           data = training_data_pca2)
summary(mlr2_pca)
## 
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09855 -0.20177 -0.00831  0.23519  0.79088 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.57422    0.03513 158.682   <2e-16 ***
## PC1         -0.49254    0.01801 -27.345   <2e-16 ***
## PC2          0.09995    0.03135   3.188   0.0019 ** 
## PC3         -0.02365    0.04442  -0.532   0.5956    
## PC4          0.04175    0.05261   0.793   0.4293    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3615 on 103 degrees of freedom
## Multiple R-squared:  0.8843, Adjusted R-squared:  0.8798 
## F-statistic: 196.7 on 4 and 103 DF,  p-value: < 2.2e-16

After eliminating the extreme values, the significance level increases to 88%.

Eliminating Extreme Values - 2nd Attempt

cutoff <- 4/((nrow(training_data_pca2)-length(mlr2_pca$coefficients)-2))
plot(mlr2_pca, which = 4, cook.levels = cutoff)

plot(mlr2_pca, which = 5, cook.levels = cutoff)

training_data_pca3 <- training_data_pca2[-which(rownames
                                                (training_data_pca)
                                               %in% c("97","139","99")),]

4.3 Multiple Linear Regression-3rd Model

mlr3_pca <- lm(formula = LadderScore~., 
               data = training_data_pca3)
summary(mlr3_pca)
## 
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca3)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.09319 -0.20082 -0.00858  0.22673  0.77250 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  5.585427   0.035137 158.963  < 2e-16 ***
## PC1         -0.494907   0.017856 -27.717  < 2e-16 ***
## PC2          0.099278   0.030941   3.209  0.00179 ** 
## PC3         -0.006439   0.044762  -0.144  0.88591    
## PC4          0.043899   0.052533   0.836  0.40534    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3567 on 100 degrees of freedom
## Multiple R-squared:  0.8898, Adjusted R-squared:  0.8854 
## F-statistic: 201.9 on 4 and 100 DF,  p-value: < 2.2e-16

After 2nd attempt of eliminating extreme values and remodeling, the significance level doesn’t add up much so this is going to be our final model.

Next, we want to make sure that our Target Variable has linear relationships with each Independent Variable. The more linear the Target Variable with the Independent Variables, the more accurate our result is going to be.

These crPlots could help us to see the relationships between our Target Variable and Independent Variables

crPlots(mlr3_pca)

5. Checking for Multicolinearity

vif(mlr3_pca) 
##      PC1      PC2      PC3      PC4 
## 1.019623 1.013313 1.004008 1.010215

With multicolinearity, we want to make sure there is no linear relationships between the X variables.

vif < 5 indicates the X variables don’t have linear relationships and we are good to go.

6. Testing and Visualizing our Actual and Predicting Model

actual_pca <- testing_data_pca$LadderScore
pred.mlr_pca <- predict(mlr3_pca, testing_data_pca)

score_df_pca <- data.frame(actual_pca,pred.mlr_pca)
score_df_pca
##     actual_pca pred.mlr_pca
## 10       7.268     7.163054
## 13       7.155     7.005813
## 21       6.690     6.724584
## 22       6.647     6.551215
## 25       6.561     6.638393
## 27       6.491     6.437176
## 30       6.435     5.821737
## 37       6.309     5.837322
## 43       6.172     5.940958
## 44       6.166     6.194837
## 46       6.140     5.762657
## 47       6.106     6.144448
## 52       6.012     5.767886
## 61       5.880     5.656836
## 65       5.766     5.570233
## 70       5.677     5.561843
## 71       5.653     5.784540
## 77       5.477     6.529812
## 82       5.345     5.707926
## 85       5.306     4.443409
## 89       5.198     5.963814
## 91       5.142     4.521592
## 101      5.025     5.225796
## 104      4.948     5.261702
## 110      4.875     5.282775
## 114      4.830     5.210215
## 115      4.794     4.750475
## 116      4.759     4.431292
## 117      4.723     4.257964
## 118      4.721     5.148685
## 131      4.289     4.098085
## 138      3.849     3.885144
## 140      3.775     3.581440
## 142      3.623     4.752849
## 144      3.600     3.982021
## 145      3.512     3.928257
## 146      3.467     4.820939
## 148      3.145     4.104291
ggplot(data=score_df_pca, aes(x=actual_pca, y=pred.mlr_pca)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE, color="red") +
  labs(title = "Actual VS Prediction Model")
## `geom_smooth()` using formula 'y ~ x'

7. The Correlation between ‘Actual’ and ‘Prediction’

cor(score_df_pca$actual_pca, score_df_pca$pred.mlr_pca)
## [1] 0.8924991

The correlation between our ‘Actual’ and ‘Prediction’ model is 89%

8. Performance Test

performance_pca <- function(prediction, actual, method){
  e <- prediction - actual
  se <- e^2
  mse <- mean(se)
  rmse <- sqrt(mse)
  
  result <- paste("Method:", method,
                  "\n MSE:", round(mse,3),
                  "\n RMSE:", round(rmse,3))
  cat(result)
}

performance_pca(pred.mlr_pca,actual_pca,"Multiple Linear Regression")
## Method: Multiple Linear Regression 
##  MSE: 0.246 
##  RMSE: 0.496

9. Summarize

From this study, we could tell that 89% Happiness Score could be explained by GDP per Capita, Social Support, Healthy Life Expectancy, Generosity, Perceptions of Corruption, and Freedom to Make Life Choices.

Powerpoint could be accessed here