Every year, Gallup World Poll releases a publication about the world’s happiness score which they named “World Happiness Report”. Joining experts from various expertise, they measure the happiness score for every country in the world and rank them from the happiest country to the least happy country.
By studying this data, we could find what’s matters when we talk about people’s happiness. Further, we can push the government (in the respective country) to make the right public policies and focus on the sectors which have been proven to have a significant relationship with the country’s happiness index.
By using Multiple Regression Model, we would find out what variables provided in the data which have significant effects on the Happiness Score.
You can access the dataset here
First, we’re going to call all the library packages we would need for this study
library(ggplot2)
library(ggthemes)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v tibble 3.1.6 v dplyr 1.0.8
## v tidyr 1.2.0 v stringr 1.4.0
## v readr 2.1.2 v forcats 0.5.1
## v purrr 0.3.4
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks plotly::filter(), stats::filter()
## x dplyr::lag() masks stats::lag()
library(dplyr)
library(Hmisc)
## Loading required package: lattice
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following object is masked from 'package:plotly':
##
## subplot
## The following objects are masked from 'package:base':
##
## format.pval, units
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:car':
##
## logit
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Second, we’re going to create a new variable called happiness_index to store our dataset
happiness_index <- read.csv("world happiness report.csv")
head(happiness_index)
## Country Regional_indicator Ladder_score Standard_error_of_ladder_score
## 1 Finland Western Europe 7.842 0.032
## 2 Denmark Western Europe 7.620 0.035
## 3 Switzerland Western Europe 7.571 0.036
## 4 Iceland Western Europe 7.554 0.059
## 5 Netherlands Western Europe 7.464 0.027
## 6 Norway Western Europe 7.392 0.035
## upperwhisker lowerwhisker GDP_per_capita Social_support
## 1 7.904 7.780 10.775 0.954
## 2 7.687 7.552 10.933 0.954
## 3 7.643 7.500 11.117 0.942
## 4 7.670 7.438 10.878 0.983
## 5 7.518 7.410 10.932 0.942
## 6 7.462 7.323 11.053 0.954
## Healthy_life_expectancy Freedom_to_make_life_choices Generosity
## 1 72.0 0.949 -0.098
## 2 72.7 0.946 0.030
## 3 74.4 0.919 0.025
## 4 73.0 0.955 0.160
## 5 72.4 0.913 0.175
## 6 73.3 0.960 0.093
## Perceptions_of_corruption Ladder_score_in_Dystopia
## 1 0.186 2.43
## 2 0.179 2.43
## 3 0.292 2.43
## 4 0.673 2.43
## 5 0.338 2.43
## 6 0.270 2.43
## Explained_by_Log_GDP_per_capita Explained_by_Social_support
## 1 1.446 1.106
## 2 1.502 1.108
## 3 1.566 1.079
## 4 1.482 1.172
## 5 1.501 1.079
## 6 1.543 1.108
## Explained_by_Healthy_life_expectancy
## 1 0.741
## 2 0.763
## 3 0.816
## 4 0.772
## 5 0.753
## 6 0.782
## Explained_by_Freedom_to_make_life_choices Explained_by_Generosity
## 1 0.691 0.124
## 2 0.686 0.208
## 3 0.653 0.204
## 4 0.698 0.293
## 5 0.647 0.302
## 6 0.703 0.249
## Explained_by_Perceptions_of_corruption Dystopia_._residual
## 1 0.481 3.253
## 2 0.485 2.868
## 3 0.413 2.839
## 4 0.170 2.967
## 5 0.384 2.798
## 6 0.427 2.580
summary(happiness_index)
## Country Regional_indicator Ladder_score
## Length:149 Length:149 Min. :2.523
## Class :character Class :character 1st Qu.:4.852
## Mode :character Mode :character Median :5.534
## Mean :5.533
## 3rd Qu.:6.255
## Max. :7.842
## Standard_error_of_ladder_score upperwhisker lowerwhisker
## Min. :0.02600 Min. :2.596 Min. :2.449
## 1st Qu.:0.04300 1st Qu.:4.991 1st Qu.:4.706
## Median :0.05400 Median :5.625 Median :5.413
## Mean :0.05875 Mean :5.648 Mean :5.418
## 3rd Qu.:0.07000 3rd Qu.:6.344 3rd Qu.:6.128
## Max. :0.17300 Max. :7.904 Max. :7.780
## GDP_per_capita Social_support Healthy_life_expectancy
## Min. : 6.635 Min. :0.4630 Min. :48.48
## 1st Qu.: 8.541 1st Qu.:0.7500 1st Qu.:59.80
## Median : 9.569 Median :0.8320 Median :66.60
## Mean : 9.432 Mean :0.8147 Mean :64.99
## 3rd Qu.:10.421 3rd Qu.:0.9050 3rd Qu.:69.60
## Max. :11.647 Max. :0.9830 Max. :76.95
## Freedom_to_make_life_choices Generosity Perceptions_of_corruption
## Min. :0.3820 Min. :-0.28800 Min. :0.0820
## 1st Qu.:0.7180 1st Qu.:-0.12600 1st Qu.:0.6670
## Median :0.8040 Median :-0.03600 Median :0.7810
## Mean :0.7916 Mean :-0.01513 Mean :0.7274
## 3rd Qu.:0.8770 3rd Qu.: 0.07900 3rd Qu.:0.8450
## Max. :0.9700 Max. : 0.54200 Max. :0.9390
## Ladder_score_in_Dystopia Explained_by_Log_GDP_per_capita
## Min. :2.43 Min. :0.0000
## 1st Qu.:2.43 1st Qu.:0.6660
## Median :2.43 Median :1.0250
## Mean :2.43 Mean :0.9772
## 3rd Qu.:2.43 3rd Qu.:1.3230
## Max. :2.43 Max. :1.7510
## Explained_by_Social_support Explained_by_Healthy_life_expectancy
## Min. :0.0000 Min. :0.0000
## 1st Qu.:0.6470 1st Qu.:0.3570
## Median :0.8320 Median :0.5710
## Mean :0.7933 Mean :0.5202
## 3rd Qu.:0.9960 3rd Qu.:0.6650
## Max. :1.1720 Max. :0.8970
## Explained_by_Freedom_to_make_life_choices Explained_by_Generosity
## Min. :0.0000 Min. :0.000
## 1st Qu.:0.4090 1st Qu.:0.105
## Median :0.5140 Median :0.164
## Mean :0.4987 Mean :0.178
## 3rd Qu.:0.6030 3rd Qu.:0.239
## Max. :0.7160 Max. :0.541
## Explained_by_Perceptions_of_corruption Dystopia_._residual
## Min. :0.0000 Min. :0.648
## 1st Qu.:0.0600 1st Qu.:2.138
## Median :0.1010 Median :2.509
## Mean :0.1351 Mean :2.430
## 3rd Qu.:0.1740 3rd Qu.:2.794
## Max. :0.5470 Max. :3.482
str(happiness_index)
## 'data.frame': 149 obs. of 20 variables:
## $ Country : chr "Finland" "Denmark" "Switzerland" "Iceland" ...
## $ Regional_indicator : chr "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
## $ Ladder_score : num 7.84 7.62 7.57 7.55 7.46 ...
## $ Standard_error_of_ladder_score : num 0.032 0.035 0.036 0.059 0.027 0.035 0.036 0.037 0.04 0.036 ...
## $ upperwhisker : num 7.9 7.69 7.64 7.67 7.52 ...
## $ lowerwhisker : num 7.78 7.55 7.5 7.44 7.41 ...
## $ GDP_per_capita : num 10.8 10.9 11.1 10.9 10.9 ...
## $ Social_support : num 0.954 0.954 0.942 0.983 0.942 0.954 0.934 0.908 0.948 0.934 ...
## $ Healthy_life_expectancy : num 72 72.7 74.4 73 72.4 73.3 72.7 72.6 73.4 73.3 ...
## $ Freedom_to_make_life_choices : num 0.949 0.946 0.919 0.955 0.913 0.96 0.945 0.907 0.929 0.908 ...
## $ Generosity : num -0.098 0.03 0.025 0.16 0.175 0.093 0.086 -0.034 0.134 0.042 ...
## $ Perceptions_of_corruption : num 0.186 0.179 0.292 0.673 0.338 0.27 0.237 0.386 0.242 0.481 ...
## $ Ladder_score_in_Dystopia : num 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 2.43 ...
## $ Explained_by_Log_GDP_per_capita : num 1.45 1.5 1.57 1.48 1.5 ...
## $ Explained_by_Social_support : num 1.11 1.11 1.08 1.17 1.08 ...
## $ Explained_by_Healthy_life_expectancy : num 0.741 0.763 0.816 0.772 0.753 0.782 0.763 0.76 0.785 0.782 ...
## $ Explained_by_Freedom_to_make_life_choices: num 0.691 0.686 0.653 0.698 0.647 0.703 0.685 0.639 0.665 0.64 ...
## $ Explained_by_Generosity : num 0.124 0.208 0.204 0.293 0.302 0.249 0.244 0.166 0.276 0.215 ...
## $ Explained_by_Perceptions_of_corruption : num 0.481 0.485 0.413 0.17 0.384 0.427 0.448 0.353 0.445 0.292 ...
## $ Dystopia_._residual : num 3.25 2.87 2.84 2.97 2.8 ...
Before the extraction, this dataset has 149 rows(observations) and 20 columns (variables).
Using all the variables is not necessary. We could single out every variable that isn’t needed in the model and only leave variables in the dataset that we need.
happiness_index$Standard_error_of_ladder_score <- NULL
happiness_index$upperwhisker <- NULL
happiness_index$lowerwhisker <- NULL
happiness_index_num <- happiness_index[,1:9]
str(happiness_index_num)
## 'data.frame': 149 obs. of 9 variables:
## $ Country : chr "Finland" "Denmark" "Switzerland" "Iceland" ...
## $ Regional_indicator : chr "Western Europe" "Western Europe" "Western Europe" "Western Europe" ...
## $ Ladder_score : num 7.84 7.62 7.57 7.55 7.46 ...
## $ GDP_per_capita : num 10.8 10.9 11.1 10.9 10.9 ...
## $ Social_support : num 0.954 0.954 0.942 0.983 0.942 0.954 0.934 0.908 0.948 0.934 ...
## $ Healthy_life_expectancy : num 72 72.7 74.4 73 72.4 73.3 72.7 72.6 73.4 73.3 ...
## $ Freedom_to_make_life_choices: num 0.949 0.946 0.919 0.955 0.913 0.96 0.945 0.907 0.929 0.908 ...
## $ Generosity : num -0.098 0.03 0.025 0.16 0.175 0.093 0.086 -0.034 0.134 0.042 ...
## $ Perceptions_of_corruption : num 0.186 0.179 0.292 0.673 0.338 0.27 0.237 0.386 0.242 0.481 ...
After the extraction, our dataset left with 149 rows(observations) and 9 columns(variables) with the statistical summaries shown below.
summary(happiness_index_num)
## Country Regional_indicator Ladder_score GDP_per_capita
## Length:149 Length:149 Min. :2.523 Min. : 6.635
## Class :character Class :character 1st Qu.:4.852 1st Qu.: 8.541
## Mode :character Mode :character Median :5.534 Median : 9.569
## Mean :5.533 Mean : 9.432
## 3rd Qu.:6.255 3rd Qu.:10.421
## Max. :7.842 Max. :11.647
## Social_support Healthy_life_expectancy Freedom_to_make_life_choices
## Min. :0.4630 Min. :48.48 Min. :0.3820
## 1st Qu.:0.7500 1st Qu.:59.80 1st Qu.:0.7180
## Median :0.8320 Median :66.60 Median :0.8040
## Mean :0.8147 Mean :64.99 Mean :0.7916
## 3rd Qu.:0.9050 3rd Qu.:69.60 3rd Qu.:0.8770
## Max. :0.9830 Max. :76.95 Max. :0.9700
## Generosity Perceptions_of_corruption
## Min. :-0.28800 Min. :0.0820
## 1st Qu.:-0.12600 1st Qu.:0.6670
## Median :-0.03600 Median :0.7810
## Mean :-0.01513 Mean :0.7274
## 3rd Qu.: 0.07900 3rd Qu.:0.8450
## Max. : 0.54200 Max. :0.9390
colSums(is.na(happiness_index_num))
## Country Regional_indicator
## 0 0
## Ladder_score GDP_per_capita
## 0 0
## Social_support Healthy_life_expectancy
## 0 0
## Freedom_to_make_life_choices Generosity
## 0 0
## Perceptions_of_corruption
## 0
Missing value is not found in the dataset
ggplot(data=happiness_index_num, aes(x=Ladder_score,
fill=Regional_indicator)) +
geom_density(alpha=0.7, col=FALSE) +
scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
"#AFB83B", "#6ACCBC", "#FF8D85",
"#EB96EB", "#299438", "#B8B8B8",
"#158FAD")) +
labs(title="Distribution of Happiness Score",
x="Ladder Score",
fill="Regional",
y=NULL) +
theme_gdocs()
This plot shows us the distribution of the happiness score for every regions. The higher the ladder score, the happier people tend to be.
ggplot(data=happiness_index_num %>% top_n(10, GDP_per_capita),
aes(x=GDP_per_capita,
y=Country,
fill=Regional_indicator)) +
geom_bar(stat="identity", position="dodge") +
labs(title = "Top 10 GDP per Capita",
x = "GDP per Capita",
y = NULL,
fill = "Regional") +
scale_fill_manual(values = c("#FAD000", "#6ACCBC", "#FF8D85",
"#299438", "#158FAD")) +
theme_fivethirtyeight()+
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 15))
ggplot(data=happiness_index_num %>% top_n(-10, GDP_per_capita),
aes(x=GDP_per_capita,
y=Country,
fill=Regional_indicator)) +
geom_bar(stat="identity", position="dodge") +
labs(title = "Bottom 10 GDP per Capita",
x = "GDP per Capita",
y = NULL,
fill = "Regional") +
scale_fill_manual(values = c("#AFB83B", "#B8B8B8")) +
theme_fivethirtyeight()+
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 15))
GDP per Capita is used by economists, usually to measure the standard of living of people in a country.
These plots show us that 6/10 of countries with the highest GDP are from Western Europe and 9/10 of countries with the lowest GDP are from Sub-Saharan Africa.
ggplot(data=happiness_index_num, aes(x = Regional_indicator,
y = Healthy_life_expectancy,
fill = Regional_indicator)) +
geom_boxplot() +
geom_point(alpha = 0.7, color = "darkblue", position = "jitter") +
labs(title = "Healthy Life Expectancy Index",
subtitle = "Distribution of World's Healthy Life Expectancy Index",
x = "Regional Indicator",
y = "",
fill = "Regional") +
scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
"#AFB83B", "#6ACCBC", "#FF8D85",
"#EB96EB", "#299438", "#B8B8B8",
"#158FAD"))+
theme_fivethirtyeight()+
theme(axis.title = element_text(),
axis.text.x = element_text(angle = 45))
The term ‘health expectancy’ is used to describe, within a person’s life expectancy, the expected years spent in various health states, such as years with disability.
ggplot(data=happiness_index_num, aes(x=Freedom_to_make_life_choices,
fill=Regional_indicator)) +
geom_histogram(bins=10, alpha=1.5) +
labs(title = "Freedom to Make Life Choices",
x = "Freedom to Make Life Choices",
y = "",
fill = "Regional") +
scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
"#AFB83B", "#6ACCBC", "#FF8D85",
"#EB96EB", "#299438", "#B8B8B8",
"#158FAD"))+
theme_fivethirtyeight()
This variable shows us how much people in a country could have the freedom to make choices regarding how they’re going to lead their own life. The higher the value, the better.
ggplot(data=happiness_index_num, aes(x=Generosity,
y=Regional_indicator,
fill=Regional_indicator)) +
geom_violin(scale="count",
alpha=0.7,
width=3,
position=position_dodge(width=0.5)) +
geom_point(color="red",
position="jitter",
show.legend=F) +
labs(title = "Generosity Index",
subtitle = "Distribution of Generosity Index for each Region",
x = "Generosity Index",
y = "",
fill = "Regional") +
scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
"#AFB83B", "#6ACCBC", "#FF8D85",
"#EB96EB", "#299438", "#B8B8B8",
"#158FAD"))+
theme_fivethirtyeight() +
theme(axis.title = element_text(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
## Warning: position_dodge requires non-overlapping x intervals
Samples from all countries were asked if they have donated their money in the past month. The high Generosity index indicates more people answer “Yes” than “No” to that question.
ggplot(data=happiness_index_num, aes(x=Perceptions_of_corruption,
y=Regional_indicator,
fill=Regional_indicator)) +
geom_violin(scale="count",
alpha=0.7,
width=3,
position=position_dodge(width=0.5)) +
geom_point(color="forestgreen",
position="jitter",
show.legend=F) +
labs(title = "Perceptions of Corruption Index",
subtitle = "Distribution of Perceptions of Corruption for each Region",
x = "PoC Index",
y = "",
fill = "Regional") +
scale_fill_manual(values = c("#B8255F", "#FF9933", "#FAD000",
"#AFB83B", "#6ACCBC", "#FF8D85",
"#EB96EB", "#299438", "#B8B8B8",
"#158FAD"))+
theme_fivethirtyeight() +
theme(axis.title = element_text(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())
## Warning: position_dodge requires non-overlapping x intervals
This variable measures people’s perceptions of corruption levels in their own country. A high PoC index means the corruption in both public and business sectors is high.
set.seed(2022)
happiness_index_num2 <- happiness_index_num[,3:9] #to single out the char variables
pcaout <- prcomp(happiness_index_num2, scale=T)
ve <- pcaout$sdev^2
plot(ve)
pve <- ve/sum(ve)
plot(pve,
xlab = "Principal Component",
ylab = "PVE",
ylim = c(0,1),
type = "b")
pve
## [1] 0.55907243 0.18418147 0.10117376 0.07406394 0.03585500 0.02762710 0.01802630
total_3 <- 0.55907243 + 0.18418147 + 0.10117376
total_3
## [1] 0.8444277
total_4 <- total_3 + 0.07406394
total_4
## [1] 0.9184916
We’re going to use 4 PCA variables for it could represent 92% of our data.
new_feat <- pcaout$x[,1:4]
new_feat_df <- data.frame(new_feat)
new_happiness_df <- cbind(happiness_index_num2$Ladder_score,
new_feat_df)
head(new_happiness_df)
## happiness_index_num2$Ladder_score PC1 PC2 PC3 PC4
## 1 7.842 -3.782594 0.9903290 1.8989054 -0.3461422
## 2 7.620 -3.760957 1.5886547 1.5601997 0.1177404
## 3 7.571 -3.642531 1.1341103 1.1569381 0.3099945
## 4 7.554 -3.134687 0.9595203 -1.1374648 0.1328065
## 5 7.464 -3.263927 1.8175064 0.4948283 0.6596040
## 6 7.392 -3.646376 1.6735290 0.9596761 0.1892724
colnames(new_happiness_df)[1] <- "LadderScore"
colnames(new_happiness_df)
## [1] "LadderScore" "PC1" "PC2" "PC3" "PC4"
row = dim(new_happiness_df) [1]
column = dim(new_happiness_df) [2]
trainidx_pca <- sample(row, row*0.75)
trainidx_pca[1:10]
## [1] 55 75 6 123 14 7 93 112 1 51
training_data_pca <- new_happiness_df[trainidx_pca,]
testing_data_pca <- new_happiness_df[-trainidx_pca,]
mlr_pca <- lm(formula = LadderScore~.,
data = training_data_pca)
summary(mlr_pca)
##
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.11609 -0.20220 -0.00111 0.22986 0.94014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.54271 0.03801 145.835 <2e-16 ***
## PC1 -0.49651 0.01933 -25.692 <2e-16 ***
## PC2 0.04280 0.03184 1.344 0.1818
## PC3 -0.09121 0.04376 -2.084 0.0395 *
## PC4 0.04856 0.05705 0.851 0.3966
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3977 on 106 degrees of freedom
## Multiple R-squared: 0.8639, Adjusted R-squared: 0.8588
## F-statistic: 168.2 on 4 and 106 DF, p-value: < 2.2e-16
The level of significance of this model is 85%
cutoff <- 4/((nrow(training_data_pca)-length(mlr_pca$coefficients)-2))
plot(mlr_pca, which = 4, cook.levels = cutoff)
plot(mlr_pca, which = 5, cook.levels = cutoff)
training_data_pca2 <- training_data_pca[-which(rownames
(training_data_pca)
%in% c("32","147","126")),]
mlr2_pca <- lm(formula = LadderScore~.,
data = training_data_pca2)
summary(mlr2_pca)
##
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09855 -0.20177 -0.00831 0.23519 0.79088
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.57422 0.03513 158.682 <2e-16 ***
## PC1 -0.49254 0.01801 -27.345 <2e-16 ***
## PC2 0.09995 0.03135 3.188 0.0019 **
## PC3 -0.02365 0.04442 -0.532 0.5956
## PC4 0.04175 0.05261 0.793 0.4293
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3615 on 103 degrees of freedom
## Multiple R-squared: 0.8843, Adjusted R-squared: 0.8798
## F-statistic: 196.7 on 4 and 103 DF, p-value: < 2.2e-16
After eliminating the extreme values, the significance level increases to 88%.
cutoff <- 4/((nrow(training_data_pca2)-length(mlr2_pca$coefficients)-2))
plot(mlr2_pca, which = 4, cook.levels = cutoff)
plot(mlr2_pca, which = 5, cook.levels = cutoff)
training_data_pca3 <- training_data_pca2[-which(rownames
(training_data_pca)
%in% c("97","139","99")),]
mlr3_pca <- lm(formula = LadderScore~.,
data = training_data_pca3)
summary(mlr3_pca)
##
## Call:
## lm(formula = LadderScore ~ ., data = training_data_pca3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.09319 -0.20082 -0.00858 0.22673 0.77250
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.585427 0.035137 158.963 < 2e-16 ***
## PC1 -0.494907 0.017856 -27.717 < 2e-16 ***
## PC2 0.099278 0.030941 3.209 0.00179 **
## PC3 -0.006439 0.044762 -0.144 0.88591
## PC4 0.043899 0.052533 0.836 0.40534
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3567 on 100 degrees of freedom
## Multiple R-squared: 0.8898, Adjusted R-squared: 0.8854
## F-statistic: 201.9 on 4 and 100 DF, p-value: < 2.2e-16
After 2nd attempt of eliminating extreme values and remodeling, the significance level doesn’t add up much so this is going to be our final model.
Next, we want to make sure that our Target Variable has linear relationships with each Independent Variable. The more linear the Target Variable with the Independent Variables, the more accurate our result is going to be.
These crPlots could help us to see the relationships between our Target Variable and Independent Variables
crPlots(mlr3_pca)
vif(mlr3_pca)
## PC1 PC2 PC3 PC4
## 1.019623 1.013313 1.004008 1.010215
With multicolinearity, we want to make sure there is no linear relationships between the X variables.
vif < 5 indicates the X variables don’t have linear relationships and we are good to go.
actual_pca <- testing_data_pca$LadderScore
pred.mlr_pca <- predict(mlr3_pca, testing_data_pca)
score_df_pca <- data.frame(actual_pca,pred.mlr_pca)
score_df_pca
## actual_pca pred.mlr_pca
## 10 7.268 7.163054
## 13 7.155 7.005813
## 21 6.690 6.724584
## 22 6.647 6.551215
## 25 6.561 6.638393
## 27 6.491 6.437176
## 30 6.435 5.821737
## 37 6.309 5.837322
## 43 6.172 5.940958
## 44 6.166 6.194837
## 46 6.140 5.762657
## 47 6.106 6.144448
## 52 6.012 5.767886
## 61 5.880 5.656836
## 65 5.766 5.570233
## 70 5.677 5.561843
## 71 5.653 5.784540
## 77 5.477 6.529812
## 82 5.345 5.707926
## 85 5.306 4.443409
## 89 5.198 5.963814
## 91 5.142 4.521592
## 101 5.025 5.225796
## 104 4.948 5.261702
## 110 4.875 5.282775
## 114 4.830 5.210215
## 115 4.794 4.750475
## 116 4.759 4.431292
## 117 4.723 4.257964
## 118 4.721 5.148685
## 131 4.289 4.098085
## 138 3.849 3.885144
## 140 3.775 3.581440
## 142 3.623 4.752849
## 144 3.600 3.982021
## 145 3.512 3.928257
## 146 3.467 4.820939
## 148 3.145 4.104291
ggplot(data=score_df_pca, aes(x=actual_pca, y=pred.mlr_pca)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color="red") +
labs(title = "Actual VS Prediction Model")
## `geom_smooth()` using formula 'y ~ x'
cor(score_df_pca$actual_pca, score_df_pca$pred.mlr_pca)
## [1] 0.8924991
The correlation between our ‘Actual’ and ‘Prediction’ model is 89%
performance_pca <- function(prediction, actual, method){
e <- prediction - actual
se <- e^2
mse <- mean(se)
rmse <- sqrt(mse)
result <- paste("Method:", method,
"\n MSE:", round(mse,3),
"\n RMSE:", round(rmse,3))
cat(result)
}
performance_pca(pred.mlr_pca,actual_pca,"Multiple Linear Regression")
## Method: Multiple Linear Regression
## MSE: 0.246
## RMSE: 0.496
From this study, we could tell that 89% Happiness Score could be explained by GDP per Capita, Social Support, Healthy Life Expectancy, Generosity, Perceptions of Corruption, and Freedom to Make Life Choices.
Powerpoint could be accessed here
2.3 Social Support Index
Top 10 Social Support
Bottom 10 Social Support
Social Support value is used to answer a simple question “Do you have someone you could rely on when you need help?”.
High Social Support index means the people feel secure enough because they indeed have someone (could be family, friends, public organization, etc) they could ask for help whenever they need it.