Factors contributing to Happiness Score

Prepared by Ong Jeng Quan

31-May-2021

1.0 Introduction

1.1 Background

Happiness is a subjective term, as every individual has their own definition of happiness, what makes us happy might not apply to others. It associates with how we feel and is one of the most significant values that relate to hope, delight and optimism. It is a collection of fun, a combination of the greatest positive influence and minimum negative sentiment without suffering. The pursuit of happiness is inherent in human life. Do note that it is common to be occupied with negativity occasionally as we are all human and are not emotionless. However, knowing how to deal with negativity effectively is important too, we should always find ways to keep ourselves happy and make the best out of the good times. According to RICH theory, happy people possess four attributes which are resources (a sense of freedom or independence), intimacy (friendship and the ability to get along with others), competence (the abilities and awareness of these skills) and health (alert of and exercise healthy lifestyle) (Garaigordobil, 2015). It is undeniable that happiness plays a significant role in our everyday lives and researches have shown that happiness has a strong correlation with our daily performance, body health and also mental health.

1.2 Research motivation

According to a research from Warwick Business School (2014), people who are happy tend to outperform others by 11% in terms of productivity. Besides, being happy is said to have a strong link with better body health. Thompson (2019) has concluded that people who are optimistic and are satisfied with their lives would tend to live longer, a 50-year-old woman who enjoys her life would live up to 87 years old as compared to 80 years old for someone who feels depressed. In addition, Dwyer (2012) reported that someone who is optimistic could help with preventing heart diseases by 50%, this is due to the engagement of exercising, balanced diets and adequate sleeping hours in their daily routines. These activities have allowed them to stay active, having lesser worries and a lower stress level, which eventually leads to being healthier. Furthermore, we have also reached to a point where mental health is one of the biggest challenges in our societies as it is often neglected by the public. Being happy is always the first step in preventing mental health issues as happy people tend to have a better mental health. Therefore, everyone should start striving for happiness in their own lives and find ways to make themselves feel happy.

1.3 Research objective

It is worthwhile to note that factors affecting happiness have become a major concern in most of the countries as everyone is interested in achieving the greatest degree of happiness in life. Not to mention, happiness is a crucial indicator to a nationâs social and economic development. Hence, we will probe into the factors that contribute to the happiness score, which are GDP per capita, life expectancy, freedom of choices, generosity and social support.

1.4 Research questions

Research Question 1 :
Is there a difference between the factors that contribute to the happiness score of the top 10 happiest countries and the top 10 least happy countries?

Motivation behind this :
We would like to find out if the factors affecting happiness score would be different for the top 10 happiest countries and the top 10 least happy countries. Does GDP per capita only contributes to the happiness score of the top 10 happiest countries ? What about the top 10 least happy countries, is money the key to happiness for them as well ?

Research Question 2 :
What are the factors that contribute to the happiness score of countries in the Western Europe region?

Motivation behind this :
We found out that a majority of the top 10 happiest countries are from the Western Europe region through visualisation. Hence, we would like to explore if the the factors contributing to the happiness score of countries in Western Europe region would differ as compared to countries in the rest of the world in other regions.

Research Question 3 :
What are the most influential factors that contribute to happiness?

Motivation behind this :
We are curious to find out what are the factors that could clearly distinguish between countries with a higher average happiness score as compared to countries with a happiness score lesser than the average.

Usage of different regression models will be carried out to answer our research question.

2.0 Data and methodology

2.1 Data source

The World Happiness Report is released by the United Nations Sustainable Development Solutions Network, where the data is collected involves a total of 150 countries. This report investigates the factors that constitutes happiness and acts as a benchmark to establish public policies. The happiness score ranges from 0 to 10, where 10 represents having the happiest life possible and high satisfaction in life. Each country would compare with an imagined country called Dystopia, it serves as a benchmark for a country’s living standard. In this research, we would be looking at the happiness score from the year 2015 to 2019 and the data is retrieved from Kaggle https://www.kaggle.com/unsdsn/world-happiness.

2.2 Data wrangling

Before we proceed, some data wrangling steps have been taken to ensure our dataset is suitable for the regression analysis. First of all, we have identified the structure of the dataset and it is noted that our dataset contains 767 rows and 10 columns.

# Identify structure of the 'world' dataset
str(world)

## 'data.frame':    767 obs. of  10 variables:
##  $ Overall.rank      : int  8 8 8 8 9 9 10 10 10 11 ...
##  $ Year              : int  2016 2018 2017 2019 2016 2015 2015 2017 2018 2019 ...
##  $ Country           : Factor w/ 163 levels "Afghanistan",..: 104 104 104 104 7 104 7 7 7 7 ...
##  $ Region            : Factor w/ 12 levels "Australia and New Zealand",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Score             : num  7.33 7.32 7.31 7.31 7.31 ...
##  $ GDP_per_capita    : num  1.36 1.27 1.41 1.3 1.44 ...
##  $ Social_Support    : num  1.17 1.6 1.55 1.56 1.1 ...
##  $ Life_expectancy   : num  0.831 0.876 0.817 1.026 0.851 ...
##  $ Freedom_to_choices: num  0.581 0.669 0.614 0.585 0.568 0.639 0.651 0.602 0.647 0.557 ...
##  $ Generosity        : num  0.494 0.365 0.5 0.33 0.474 0.475 0.436 0.478 0.361 0.332 ...

Next, the ‘Year’ variable from our original dataset which represents the year of each observation has been excluded from our dataset as it is not required for our current analysis.

# Removed the 'Year' column from our dataset
world <- world %>%
  select(-Year)

Besides that, we have also renamed two columns in our dataset and created a function to identify if there exists any missing values or empty cells in our dataset.

# Identify the names of the columns in 'world' dataset
names(world)

## [1] "Overall.rank"       "Country"            "Region"            
## [4] "Score"              "GDP_per_capita"     "Social_Support"    
## [7] "Life_expectancy"    "Freedom_to_choices" "Generosity"

# Renaming column 'Overall.rank' and 'Score' in 'world' dataset
world <- world %>% 
  rename(Rank = Overall.rank, Happiness_score = Score, Social_support = Social_Support)

We found that there were no missing values or empty cells in our dataset.

new_table

##   num_missing_values num_empty_cells num_unique
## 1                  0               0        158
## 2                  0               0        163
## 3                  0               0         12
## 4                  0               0        668
## 5                  0               0        603
## 6                  0               0        559
## 7                  0               0        499
## 8                  0               0        410
## 9                  0               0        365

After we have successfully completed the data preprocessing stage, our final dataset will consist of 767 rows and 9 columns in total.

str(world)

## 'data.frame':    767 obs. of  9 variables:
##  $ Rank              : int  8 8 8 8 9 9 10 10 10 11 ...
##  $ Country           : Factor w/ 163 levels "Afghanistan",..: 104 104 104 104 7 104 7 7 7 7 ...
##  $ Region            : Factor w/ 12 levels "Australia and New Zealand",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Happiness_score   : num  7.33 7.32 7.31 7.31 7.31 ...
##  $ GDP_per_capita    : num  1.36 1.27 1.41 1.3 1.44 ...
##  $ Social_support    : num  1.17 1.6 1.55 1.56 1.1 ...
##  $ Life_expectancy   : num  0.831 0.876 0.817 1.026 0.851 ...
##  $ Freedom_to_choices: num  0.581 0.669 0.614 0.585 0.568 0.639 0.651 0.602 0.647 0.557 ...
##  $ Generosity        : num  0.494 0.365 0.5 0.33 0.474 0.475 0.436 0.478 0.361 0.332 ...

Details regarding variables which are finally included in our dataset to answer our research questions are stated below:

Variables	Description
GDP_per_capita	Reflected using the purchasing power parity of a country, it is about the rate of converting the currency of country A into the currency of country B to buy an equivalent amount of goods and services (World Happiness Report, 2020).
Life_expectancy	An approximate value that an individual is expected to live, on average (World Happiness Report, 2020).
Freedom_to_choices	Whether they are satisfied with having the freedom to make life choices, on national average (World Happiness Report, 2020).
Generosity	Whether they have made donations in the past months, on national average (World Happiness Report, 2020).
Social_support	Whether they have anyone to lean on whenever they are in any difficulties, on national average (World Happiness Report, 2020).
Happiness_score	Happiness score of a specific country.
Country	An independent nation with its own government, occupying a particular territory.
Region	The specific region where the country is located.
Rank	An integer value that represents the overall ranking of a country based on the happiness score for each year.

2.3 Methodology

2.3.1 Creation of dataset

A few datasets has been derived from our main dataset we have created earlier to investigate each research question individually.

Research Question 1 :

Firstly, a new dataset has been created that only contains observations where the ‘Country’ is one of the top 10 countries with the highest overall ranking (‘Rank’) for happiness score (top 10 happiest countries).

# Top 10 happiest countries in the world  RQ 1
top10 <- world %>%
  select(c(Country, Rank)) %>%
  group_by(Country) %>%
  summarise(avg_rank = mean(Rank)) %>%
  arrange(avg_rank) %>%
  select(Country) %>%
  head(10)

# Convert results into a vector form 
top10 <- as.vector(unlist(top10$Country))

lm_top10 <- world %>%
  filter(Country %in% c(top10)) %>%
  select(-c(Country, Rank, Region))

The same steps are repeated for our top 10 least happiest countries in the world as well.

# Top 10 least happiest countries in the world  RQ 1
bottom10 <- world %>%
  select(c(Country, Rank)) %>%
  group_by(Country) %>%
  summarise(avg_rank = mean(Rank)) %>%
  arrange(desc(avg_rank)) %>%
  select(Country) %>%
  head(10)

# Convert results into a vector form 
bottom10 <- as.vector(unlist(bottom10$Country))

lm_bottom10 <- world %>%
  filter(Country %in% c(bottom10)) %>%
  select(-c(Country, Rank, Region))

Next, we will fit a linear regression model where all 5 factors of happiness score (GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices, Generosity) are included as our independent variables and ‘Happiness_score’ will be our dependent variable as our base model for this research question.

Research Question 2 :

To investigate this research question, a new dataset will be created that only contains observations where the ‘Region’ is equal to ‘Western Europe’ as we are interested in Western Europe countries.

# Western Europe countries RQ 2
lm_western <- world %>%
  filter(Region == "Western Europe")%>%
  select(-c(Country, Rank, Region))

Next, we will fit a linear regression model where all 5 factors of happiness score are included as our independent variables and ‘Happiness_score’ will be our dependent variable as our base model for this research question.

Research Question 3 :

Furthermore, since the dependent variable (Y) for a logistic regression is in a binary form where it only takes two values. A new dataset has been created where a new dummy variable was created as a new column named ‘Happy’ or when x = 1, if the the happiness score of a country is >= 5.39 (average ‘Happiness_score’); and x = 0 otherwise. We have also convert the data type of ‘Happy’ from an integer data type to a factor data type.

# Obtain average happiness score
avg_happiness_score <- round(mean(world$Happiness_score),2)

# A new dataset has been created with a new column 'happy' RQ 3
lm_world <- world %>%
  mutate(Happy = ifelse(Happiness_score >= avg_happiness_score, 1, 0)) %>%
  select(GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices, Generosity, Happy)

# Change 'Happy' column to factor data type from integer data type
lm_world$Happy <- as.factor(lm_world$Happy)

Next, we will fit a logistic regression model where all 5 factors of happiness score are included as our independent variables and our new dummy variable ‘Happy’ will be our dependent variable as our base model for this research question.

Note that all of our base models for each research question will begin with all 5 predictors of happiness score being included as the independent variables (predictors) and refinement of each model to obtain the best model to answer each research question will be discussed in the upcoming session.

2.3.2 Data Partitioning

Before we fit our model for each research question, we will sample our data set into 80% training and 20% testing data with randomized observations based on the random seed number that we have used. 80% training data will be used to train the model whereas the remaining 20% testing data will be used to evaluate the model performance by using a new set of unseen observations from our testing data.

2.3.3 Feature selection

The best subset selection was performed to identify a subset of all p (predictors) that was believed to be related to our response variable for both of our linear and logistic regression models, and then we will fit a new model using the subset of predictor(s) that was selected from our subset section method. Additionally, both 10-fold and 5-fold cross validation are used to indicate the best number of predictors that we should include in our model as well. By using both of the information from both methods, we can determine which predictors to include in our refined model.

2.3.4 Evaluation Metrics

In order to evaluate the model prediction, several evaluation metrics were used to determine the best model for each regression model from each research question. As for linear regression, evaluation metrics such as R-squared, RMSE, MAE, AIC and BIC value will be used to assess the prediction accuracy and the model interpretability. Model comparisons for making a decision on selecting the best model will be based on the model which has the highest R-squared and the lowest RMSE, MAE, AIC and BIC value.

On the other hand, evaluation metrics such as the residual deviance, AIC, accuracy of model from a confusion matrix and the AUC value from ROC curve will be used to determine the best model for logistic regression. Model comparisons for making a decision on selecting the best model will be based on the model which has the lowest residual deviance and AIC value, the highest value from both accuracy from confusion matrix and AUC value will be chosen as our best performing model for a logistic regression.

Details regarding interpretation for each evaluation metrics is shown below :

Evaluation Metric	Description	Interpretation
R-squared	A measure of how well a model fits to data and it measures the percentage of the variability in the response variable that is explained by the model.	The higher the R-squared, the better the model fits the data.
Root Mean Square Error (RMSE)	Indicates the absolute fit of the model to the data, that is how close the observed data points are to the modelâs predicted values.	The lower the RMSE, the better the model fits the data.
Mean absolute error (MAE)	A measure of how accurate the predictions are. It is the mean difference between the predictions and the true values.	The lower the MAE, the better the model fits the data.
Akaike Information Criterion (AIC)	A mathematical test used to evaluate how well a model fits the data and it penalizes models which have lots of independent variables as a way to avoid over-fitting.	The smaller the AIC, the better the model fits the data.
Bayesian Information Criterion (BIC)	Very similar to AIC. It is a metric of how good a model fits the data and it penalizes overly complex models. The penalty term is larger in BIC than in AIC.	The smaller the BIC, the better the model fits the data.
Residual deviance	Indicates how well the response variable is predicted by the model when the predictors are included. It can also be used to test whether the logistic regression model provides an adequate fit for the data.	The lower the residual deviance, the better the model fits the data.
Confusion matrix (Accuracy)	A table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. It is extremely useful for measuring accuracy. Accuracy is simply a ratio of correctly predicted observation to the total observations.	The higher the accuracy, the more accurate a model performs in prediction.
AUC - ROC curve	A performance measurement for the classification problems at various threshold settings. ROC is a probability curve and AUC represents the degree or measure of separability. It tells how much the model is capable of distinguishing between classes.	The higher the AUC, the better the performance of the model at distinguishing between the positive and negative classes.

3.0 Visualization

3.1 Scatterplot Matrix

Scatter plot matrix is useful in determining if there are any linear correlation between the variables.

Based on the scatter plot above, we can conclude that most of the pairs have no significant relationship between them. However, the pair of âLife_expectancyâ and âGDP_per_capitaâ has a correlation of 0.78, which means that there is a moderately strong positive correlation between the two variables and might even have a high possibility of being collinear, this can be further proved with the correlation proportion in the plot below. Collinearity would reduce the precision of the coefficients and further weaken the statistical power of your model, which could lead to high inaccuracy of the model.

3.2 Correlation Matrix

A correlation matrix is a table which displays the correlation coefficients between variables. The matrix illustrates the correlation between all the possible pairs of values in a table.

Based on the upper part of the correlation matrix shown above, we could see that the pair of âLife_expectancyâ and âGDP_per_capitaâ has a slightly bigger and darker circle as compared to other pairs. This shows that the pair is highly correlated positively, which shows a similar results with the scatter plot matrix above.

3.3 Histogram

A histogram is useful in identifying the frequency distribution and skewness of the variables.

Happiness_score :

From the histogram above, we can identify the distribution of Happiness_score over the years. Besides that, the dark-blue vertical dashed line represents the overall mean of the Happiness_score while the light-blue vertical dashed line represents the overall median of the Happiness_score of all countries over the years. We can see that the distribution of Happiness_score is positively skewed.

GDP_per_capita :

The histogram above shows the distribution of GDP_per_capita over the years. The dark-blue vertical dashed line represents the overall mean of the GDP_per_capita while the light-blue vertical dashed line represents the overall median of the GDP_per_capita of all countries over the years. It is observed that the distribution of GDP_per_capita is negatively skewed.

Social_support :

The histogram above shows the distribution of Social_support over the years. The dark-blue vertical dashed line represents the overall mean of the Social_support while the light-blue vertical dashed line represents the overall median of the Social_support of all countries over the years. It is observed that the distribution of Social_support is negatively skewed.

Life_expectancy :

The histogram above shows the distribution of Life_expectancy over the years. The dark-blue vertical dashed line represents the overall mean of the Life_expectancy while the light-blue vertical dashed line represents the overall median of the Life_expectancy of all countries over the years. It is observed that the distribution of Life_expectancy is negatively skewed.

Freedom to choices :

The histogram above shows the distribution of Freedom_to_choices over the years. The dark-blue vertical dashed line represents the overall mean of the Freedom_to_choices s while the light-blue vertical dashed line represents the overall median of the Freedom_to_choices of all countries over the years. It is observed that the distribution of Freedom_to_choices is negatively skewed.

Generosity :

The histogram above shows the distribution of Generosity over the years. The dark-blue vertical dashed line represents the overall mean of the Generosity while the light-blue vertical dashed line represents the overall median of the Generosity of all countries over the years. It is observed that the distribution of Generosity is positively skewed.

3.4 Box plot

A boxplot is useful in identifying the presence of outliers and variation of the variables.

Happiness_score :

From the boxplot above, we could observe that there are no outliers for Happiness Score. The length of the box is relatively short, this indicates that there is less variation in the data and every data is around the same point. Also, the median is almost in the centre of the box, the top and bottom whiskers are almost the same length, hence we could conclude that the distribution of the data is almost symmetric.

GDP_per_capita :

From the boxplot above, we could observe that there are no outliers for GDP per capita. The length of the box is relatively short, this indicates that there is less variation in the data and every data is around the same point. However, the median is not in the centre of the box, hence we could conclude that the distribution of the data is not symmetric, the median line is slightly rising to the top of the box and this shows that it is negatively skewed.

Social_support :

From the boxplot above, we could observe that there are low outliers for âSocial Supportâ, indicating that there are values that are too low. The length of the box is relatively short, this indicates that there is less variation in the data and every data is around the same point. However, the bottom whisker is much longer than the top one, so we could also say that it is negatively skewed as it is skewed to the left.

Life_expectancy :

From the boxplot above, we could observe that there are no outliers for âLife Expectancyâ, The length of the box is relatively short, this indicates that there is less variation in the data and every data is around the same point. However, the median is not in the centre of the box, hence we could conclude that the distribution of the data is not symmetric, the median line is slightly rising to the top of the box and this shows that it is negatively skewed. Moreover, the bottom whisker is slightly longer than the top, hence it is true that it is negatively skewed.

Freedom to choices :

From the boxplot above, we could observe that there are no outliers for âFreedom to choicesâ, indicating that there are no values that are either too high or too low. The median is not in the centre of the box, hence we could conclude that the distribution of the data is not symmetric, the median line is slightly rising to the top of the box and this shows that it is negatively skewed. Moreover, the bottom whisker is slightly longer than the top, hence it is true that it is negatively skewed.

Generosity :

From the boxplot above, we could observe that there are high outliers for âGenerosityâ, indicating that there are values that are too high. The length of the box is relatively short, this indicates that there is less variation in the data and every data is around the same point. However, the top whisker is slightly longer than the bottom, so we could say that it is positively skewed.

4.0 Analysis Results

4.1 Linear Regression

4.1.1 Research Question 1

Top 10 happiest countries :

Model 1 (lm.top10):

Linear regression assumes a linear relationship between variables. It is an approach for predicting the quantitative response Y based on single or multiple predictor variables X. Our objective is to examine the impact of GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity on Happiness_score for the top 10 happiest countries. Having included five independent variables into our model, we used the multiple linear regression model to address the pertaining question.

# split dataset
set.seed(123)
split <- sample(1:nrow(lm_top10), round(nrow(lm_top10) * 0.8)) # 80-20 split
train_top10 <- lm_top10[split, ]
test_top10 <- lm_top10[-split, ]

Estimation

The estimated model (lm.top10) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.top10 <- lm(Happiness_score ~., data = train_top10)

Interpretation

summary(lm.top10)

## 
## Call:
## lm(formula = Happiness_score ~ ., data = train_top10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16418 -0.06959 -0.00267  0.06118  0.18424 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6.35035    0.76747   8.274 1.17e-09 ***
## GDP_per_capita      0.56045    0.25507   2.197  0.03491 *  
## Social_support     -0.05999    0.11674  -0.514  0.61066    
## Life_expectancy    -0.07528    0.31952  -0.236  0.81516    
## Freedom_to_choices  1.07110    0.55405   1.933  0.06157 .  
## Generosity         -0.63600    0.23109  -2.752  0.00943 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09638 on 34 degrees of freedom
## Multiple R-squared:  0.3748, Adjusted R-squared:  0.2829 
## F-statistic: 4.077 on 5 and 34 DF,  p-value: 0.005247

Estimated Happiness_score = 6.3504 + 0.5605 GDP_per_capita â 0.06 Social_support â 0.0753 Life_expectancy + 1.0711 Freedom_to_choices â 0.6360 Generosity

Ã0 = 6.3504 suggests that if GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to increase by 6.3504.

Ã1 = 0.5605 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase by 0.5605, holding all other variables constant.

Ã2 = â 0.06 suggests that if Social_support increases by 1, we would expect the Happiness_score to decrease by 0.06, holding all other variables constant.

Ã3 = â 0.0753 suggests that if Life_expectancy increases by 1, we would expect the Happiness_score to decrease by 0.0753, holding all other variables constant.

Ã4 = 1.0711 suggests that if Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 1.0711, holding all other variables constant.

Ã5 = â 0.6360 suggests that if Generosity increases by 1, we would expect the Happiness_score to decrease by 0.6360, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.top10) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

We use residual plots to check the assumptions of an OLS linear regression model to determine whether the residuals are consistent with random error. Residual plot 1 displayed the residual values on the y-axis and fitted values for estimated model (lm.top10) on the x-axis.

Having included five independent variables into our model, we used the multiple linear regression model to address the pertaining question.

Residual plot and Q-Q plot for Model(lm.top10):

autoplot(lm.top10, which = 1:2, nrow = 2, ncol = 1 )

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals follow a straight line well and this shows that the residuals are normally distributed.

ii) Multicollinearity

The presence of multicollinearity can have a negative impact and can severely limit the conclusions of the research study. To identify if multicollinearity exists in our model (lm.top10), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated. The higher the value of VIF, the higher the correlation between that variable with the other variables.

vif(lm.top10)

##     GDP_per_capita     Social_support    Life_expectancy Freedom_to_choices 
##           1.582743           1.897888           2.589537           2.124132 
##         Generosity 
##           1.547655

From the results, it is observed that all five independent variables (GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity) are not highly correlated because all of them met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(lm.top10, level = 0.95)

##                          2.5 %     97.5 %
## (Intercept)         4.79067125  7.9100319
## GDP_per_capita      0.04208937  1.0788179
## Social_support     -0.29722682  0.1772503
## Life_expectancy    -0.72461638  0.5740592
## Freedom_to_choices -0.05486567  2.1970703
## Generosity         -1.10563317 -0.1663627

The 95% confidence interval for GDP_per_capita is 0.0421 to 1.0788. We are 95% confidence that if GDP_per_capita increases by 1, the Happiness_score will increase between 0.0421 and 1.0788. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Social_support, Life_expectancy and Freedom_to_choices are â 0.2972 to 0.1773, â 0.7246 to 0.5741 and â 0.0549 to 2.1971 respectively. These three intervals include zero and thus the slope Ã2, Ã3 and Ã4 are not significantly different from zero at the 5% level of significance and the effects will not be significant at the 0.05 level as well.

The 95% confidence interval for Generosity is â 1.1056 to â 0.1664. We are 95% confident that if Generosity increases by 1, the Happiness_score will decrease between 0.1664 and 1.1056. This interval does not include zero and thus slope Ã5 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.top10 <- train(
  form = Happiness_score~., 
  data = train_top10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.top10

## Linear Regression 
## 
## 40 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 36, 37, 36, 35, 36, 36, ... 
## Resampling results:
## 
##   RMSE        Rsquared   MAE       
##   0.09414616  0.6331909  0.08258742
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By comparing RMSE and MAE, we can determine whether the forecast contains large but infrequent errors. The larger the difference between RMSE and MAE, the more inconsistent the error size. By using 10-fold cross validation, it was observed that model (lm.top10) has a RMSE of 0.0941 and a MAE of 0.0826. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.0826 indicates that the average difference between the predicted and the actual Happiness_score was 0.0826. Furthermore, it has a R-squared of 0.9579. This suggests that 95.79% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity. It also has an AIC value of â 66.1384 and a BIC value of â 54.3162.

Improving the Model

The best subset selection was performed in order to identify a subset of all predictor(s) that was believed to be related to our response variable (Happiness_score).

regfit.2.1 <- regsubsets(Happiness_score~ ., data = train_top10)

par(mfrow=c(2,2))
plot(regfit.2.1, scale = "r2")
plot(regfit.2.1, scale = "adjr2")
plot(regfit.2.1, scale = "Cp")
plot(regfit.2.1, scale = "bic")

For R-squared, it was observed that all the five independent variables (GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity) are the best subset of predictors. As for adjusted R-squared, Cp and Bayesian Information Criterion (BIC), all three of them have indicated that GDP_per_capita, Freedom_to_choices and Generosity were the best subset of predictors.

# create prediction function (global)
predict.regsubsets <- function(object, newdata, id, ...) {
  form <- as.formula(object$call[[2]])
  mat <- model.matrix(form, newdata)
  coefi <- coef(object, id = id)
  xvars <- names(coefi)
  mat[ , xvars] %*% coefi
}

Besides that, 10-fold cross validation was used to identify the best number of predictors that we should include in our model.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_top10), replace = TRUE) 
cv_errors <- matrix(NA, k, 5, dimnames = list(NULL, paste(1:5)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ ., data = train_top10[folds != j, ], nvmax = 10) 
  for(i in 1:5) {
    pred <- predict(best_fit, train_top10[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_top10$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

Based on the results above, it was observed that a total of 3 predictors was the best as it has the lowest mean cross validation error.

Based on the information that we have obtained from the above methods, we have concluded that all three independent variables GDP_per_capita, Freedom_to_choices and Generosity would be selected as our predictors for our next model where Social_support and Life_expectancy will be excluded from our original model.

Model 2 (lm.top10.2):

Our objective is to examine the impact of GDP_per_capita, Freedom_to_choices and Generosity on Happiness_score for the top 10 happiest countries. Having included three independent variables from the subset selection method and 10-fold cross validation method into our model, we used the multiple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.top10.2) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.top10.2 <- lm(Happiness_score ~ GDP_per_capita + Freedom_to_choices + Generosity, data = train_top10)

Interpretation

summary(lm.top10.2)

## 
## Call:
## lm(formula = Happiness_score ~ GDP_per_capita + Freedom_to_choices + 
##     Generosity, data = train_top10)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.168591 -0.075685  0.006639  0.058902  0.189988 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          6.1506     0.4368  14.081 3.26e-16 ***
## GDP_per_capita       0.6016     0.2158   2.788  0.00842 ** 
## Freedom_to_choices   1.0213     0.3999   2.554  0.01503 *  
## Generosity          -0.5790     0.1962  -2.951  0.00555 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09455 on 36 degrees of freedom
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.3099 
## F-statistic: 6.838 on 3 and 36 DF,  p-value: 0.0009173

Estimated Happiness_score = 6.1506 + 0.6016 GDP_per_capita + 1.0213 Freedom_to_choices â 0.5790 Generosity

Ã0 = 6.1506 suggests that if GDP_per_capita, Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to increase by 6.1506.

Ã1 = 0.6016 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase by 0.6016, holding all other variables constant.

Ã2 = 1.0213 suggests that if Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 1.0213, holding all other variables constant.

Ã3 = â 0.5790 suggests that if Generosity increases by 1, we would expect the Happiness_score to decrease by 0.5790, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.top10.2) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.top10.2):

autoplot(lm.top10.2, which = 1:2, nrow = 2, ncol = 1 )

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.top10.2), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.top10.2)

##     GDP_per_capita Freedom_to_choices         Generosity 
##           1.177075           1.149950           1.159661

From the results, it is observed that all three independent variables (GDP_per_capita, Freedom_to_choices and Generosity) are not highly correlated because all of them met the criterion cut-off point which is below 10.

Prediction

confint(lm.top10.2, level = 0.95)

##                         2.5 %     97.5 %
## (Intercept)         5.2647310  7.0364827
## GDP_per_capita      0.1639943  1.0392421
## Freedom_to_choices  0.2102981  1.8323842
## Generosity         -0.9769778 -0.1810234

The 95% confidence interval for GDP_per_capita is 0.1640 to 1.0392. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will increase between 0.1640 and 1.0392. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Freedom_to_choices is 0.2103 to 1.8324. We are 95% confident that if Freedom_to_choices increases by 1, the Happiness_score will increase between 0.2103 and 1.8324. This interval does not include zero and thus slope Ã2 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Generosity is â 0.9770 to â 0.1810. We are 95% confident that if Generosity increases by 1, the Happiness_score will decrease between 0.1810 and 0.9770. This interval does not include zero and thus slope Ã3 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.top10.2 <- train(
  form = Happiness_score ~ GDP_per_capita + Freedom_to_choices + Generosity, 
  data = train_top10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.top10.2

## Linear Regression 
## 
## 40 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 36, 36, 36, 37, 36, 36, ... 
## Resampling results:
## 
##   RMSE        Rsquared   MAE       
##   0.09625309  0.5317547  0.08041897
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.top10.2) has a RMSE of 0.0963 and a MAE of 0.0804. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.0804 indicates that the average difference between the predicted and the actual Happiness_score was 0.0804. Furthermore, it has a R-squared of 0.8581. This suggests that 85.81% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Freedom_to_choices and Generosity. It also has an AIC value of â 69.3889 and a BIC value of â 60.9445.

Improving the Model

regfit.2.2 <- regsubsets(Happiness_score ~ GDP_per_capita + Freedom_to_choices + Generosity, data = train_top10)

par(mfrow=c(2,2))
plot(regfit.2.2, scale = "r2")
plot(regfit.2.2, scale = "adjr2")
plot(regfit.2.2, scale = "Cp")
plot(regfit.2.2, scale = "bic")

The best subset selection was performed and it was observed that R-squared, adjusted R-squared, Cp and Bayesian Information Criterion (BIC) have indicated that GDP_per_capita, Freedom_to_choices and Generosity were the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_top10), replace = TRUE) 
cv_errors <- matrix(NA, k, 3, dimnames = list(NULL, paste(1:3)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ GDP_per_capita + Freedom_to_choices + Generosity, data = train_top10[folds != j, ], nvmax = 10) 
  for(i in 1:3) {
    pred <- predict(best_fit, train_top10[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_top10$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that a total of 3 predictors was the best as it has the lowest mean cross validation error.

Model 3 (lm.top10.3):

Our objective is to examine the impact of GDP_per_capita and Generosity on Happiness_score for the top 10 happiest countries. We have made some assumptions that removing predictors with no relationship with the response variable (Happiness_score) will lead to a more effective model. As predictors that do not have a significant impact on the response variable tend to cause a deterioration in the test error rate. Having included two independent variables with the smallest p-value into our model as our second method, we used the multiple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.top10.3) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.top10.3 <- lm(Happiness_score ~ GDP_per_capita + Generosity, data = train_top10)

Interpretation

summary(lm.top10.3)

## 
## Call:
## lm(formula = Happiness_score ~ GDP_per_capita + Generosity, data = train_top10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.19199 -0.08023 -0.00747  0.09584  0.15389 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      6.9996     0.3038  23.042  < 2e-16 ***
## GDP_per_capita   0.4713     0.2248   2.097  0.04290 *  
## Generosity      -0.6814     0.2059  -3.309  0.00209 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1014 on 37 degrees of freedom
## Multiple R-squared:  0.2476, Adjusted R-squared:  0.2069 
## F-statistic: 6.087 on 2 and 37 DF,  p-value: 0.005185

Estimated Happiness_score = 6.9996 + 0.4713 GDP_per_capita â 0.6814 Generosity

Ã0 = 6.9996 suggests that if GDP_per_capita and Generosity are equal to 0, we would expect the Happiness_score to increase by 6.9996.

Ã1 = 0.4713 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase 0.4713, holding all other variables constant.

Ã2 = â 0.6814 suggests that if Generosity increases by 1, we would expect theHappiness_score to decrease by 0.6814, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.top10.3) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.top10.3):

autoplot(lm.top10.3, which = 1:2, nrow = 2, ncol = 1)

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly at the beginning and towards the end. This shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.top10.3), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.top10.3)

## GDP_per_capita     Generosity 
##       1.111253       1.111253

From the results, it is observed that all two independent variables (GDP_per_capita and Generosity) are not highly correlated because all of them met the criterion cut-off point which is below 10.

Prediction

confint(lm.top10.3, level = 0.95)

##                      2.5 %     97.5 %
## (Intercept)     6.38408762  7.6150981
## GDP_per_capita  0.01588258  0.9267137
## Generosity     -1.09865002 -0.2641399

The 95% confidence interval for GDP_per_capita is 0.0159 to 0.9267. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will increase between 0.0159 and 0.9267. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Generosity is â 1.0987 to â 0.2641. We are 95% confident that if Generosity increases by 1, the Happiness_score will decrease between 0.2641 and 1.0987. This interval does not include zero and thus slope Ã2 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.top10.3 <- train(
  form = Happiness_score ~ GDP_per_capita + Generosity, 
  data = train_top10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.top10.3

## Linear Regression 
## 
## 40 samples
##  2 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 36, 36, 36, 37, 36, 36, ... 
## Resampling results:
## 
##   RMSE        Rsquared   MAE       
##   0.09996555  0.5070088  0.08767941
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.top10.3) has a RMSE of 0.09997 and a MAE of 0.0877. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.0877 indicates that the average difference between the predicted and the actual Happiness_score was 0.0877. Furthermore, it has a R-squared of 0.9175. This suggests that 91.75% of the variation in Happiness_score is explained by the variation in GDP_per_capita and Generosity. It also has an AIC value of â 64.7281 and a BIC value of â 57.9726.

Improving the Model

regfit.2.3 <- regsubsets(Happiness_score ~ GDP_per_capita + Generosity, data = train_top10)

par(mfrow=c(2,2))
plot(regfit.2.3, scale = "r2")
plot(regfit.2.3, scale = "adjr2")
plot(regfit.2.3, scale = "Cp")
plot(regfit.2.3, scale = "bic")

The best subset selection was performed and it was observed that R-squared, adjusted R-squared, Cp and Bayesian Information Criterion (BIC) have indicated that GDP_per_capita and Generosity were the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_top10), replace = TRUE) 
cv_errors <- matrix(NA, k, 2, dimnames = list(NULL, paste(1:2)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ GDP_per_capita + Generosity, data = train_top10[folds != j, ], nvmax = 10) 
  for(i in 1:2) {
    pred <- predict(best_fit, train_top10[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_top10$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that a total of 2 predictors was the best as it has the lowest mean cross validation error.

Based on the information that we have obtained from the above methods, we have concluded that all two independent variables GDP_per_capita and Generosity would be considered as predictors for our model.

Model 4 (lm.top10.4):

Our objective is to examine the impact of Generosity on Happiness_score for the top 10 happiest countries. Having included only one independent variable which is adjusted from the p-value method into our model, we used the simple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.top10.4) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.top10.4 <- lm(Happiness_score ~ Generosity, data = train_top10)

Interpretation

summary(lm.top10.4)

## 
## Call:
## lm(formula = Happiness_score ~ Generosity, data = train_top10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.21021 -0.08858  0.01068  0.09240  0.14424 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.6191     0.0738 103.241   <2e-16 ***
## Generosity   -0.5448     0.2039  -2.672    0.011 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1058 on 38 degrees of freedom
## Multiple R-squared:  0.1581, Adjusted R-squared:  0.136 
## F-statistic: 7.139 on 1 and 38 DF,  p-value: 0.01105

Estimated Happiness_score = 7.6191 â 0.5448 Generosity

Ã0 = 7.6191 suggests that if Generosity is equal to 0, we would expect the Happiness_score to increase by 7.6191.

Ã1 = â 0.5448 suggests that if Generosity increases by 1, we would expect the Happiness_score to decrease by 0.5448, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.top10.4) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model (lm.top10.4):

autoplot(lm.top10.4, which = 1:2, nrow = 2, ncol = 1 )

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly at the beginning and towards the end. This shows that the residuals are not normally distributed.

Prediction

confint(lm.top10.4, level = 0.95)

##                  2.5 %     97.5 %
## (Intercept)  7.4696712  7.7684668
## Generosity  -0.9575303 -0.1320038

The 95% confidence interval for Generosity is â 0.9575 to â 0.1320. We are 95% confident that if Generosity increases by 1, the Happiness_score will decrease between 0.1320 and 0.9575. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.top10.4 <- train(
  form = Happiness_score ~ Generosity, 
  data = train_top10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.top10.4

## Linear Regression 
## 
## 40 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 36, 36, 36, 37, 36, 36, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE       
##   0.1049372  0.4903464  0.09337823
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.top10.4) has a RMSE of 0.1049 and a MAE of 0.0934. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.0934 indicates that the average difference between the predicted and the actual Happiness_score was 0.0934. Furthermore, it has a R-squared of 0.9225. This suggests that 92.25% of the variation in Happiness_score is explained by the variation in Generosity. It also has an AIC value of â 62.2366 and a BIC value of â 57.1700.

Top 10 least happiest countries :

# split for bottom 10
set.seed(123)
split <- sample(1:nrow(lm_bottom10), round(nrow(lm_bottom10) * 0.8)) # 80-20 split
train_bottom10 <- lm_bottom10[split, ]
test_bottom10 <- lm_bottom10[-split, ]

Model 1 (lm.bottom10):

Our objective is to examine the impact of GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity on Happiness_score for the top 10 least happiest countries. Having included all five independent variables into our model, we used the multiple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.bottom10) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.bottom10 <- lm(Happiness_score ~., data = train_bottom10)

Interpretation

summary(lm.bottom10)

## 
## Call:
## lm(formula = Happiness_score ~ ., data = train_bottom10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50935 -0.28452  0.00991  0.25540  0.91697 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          3.3204     0.2288  14.514 6.96e-16 ***
## GDP_per_capita       0.4872     0.4530   1.076    0.290    
## Social_support       0.3632     0.2269   1.601    0.119    
## Life_expectancy     -0.2551     0.5487  -0.465    0.645    
## Freedom_to_choices   0.1810     0.3810   0.475    0.638    
## Generosity          -0.7439     0.7753  -0.960    0.344    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3573 on 33 degrees of freedom
## Multiple R-squared:  0.1865, Adjusted R-squared:  0.06329 
## F-statistic: 1.514 on 5 and 33 DF,  p-value: 0.2124

Estimated Happiness_score = 3.3204 + 0.4872 GDP_per_capita + 0.3632 Social_support â 0.2551 Life_expectancy + 0.1810 Freedom_to_choices â 0.7439 Generosity

Ã0 = 3.3204 suggests that if GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to increase by 3.3204.

Ã1 = 0.4872 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase by 0.4872, holding all other variables constant.

Ã2 = 0.3632 suggests that if the Social_support increases by 1, we would expect the Happiness_score to increase by 0.3632, holding all other variables constant.

Ã3 = â 0.2551 suggests that if the Life_expectancy increases by 1, we would expect the Happiness_score to decrease by 0.2551, holding all other variables constant.

Ã4 = 0.1810 suggests that if the Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 0.1810, holding all other variables constant.

Ã5 = â 0.7439 suggests that if the Generosity increases by 1, we would expect the Happiness_score to decrease by 0.7439, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.bottom10) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.bottom10):

autoplot(lm.bottom10, which = 1:2, nrow = 2, ncol = 1 )

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly at the beginning and towards the end. This shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.bottom10), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.bottom10)

##     GDP_per_capita     Social_support    Life_expectancy Freedom_to_choices 
##           2.459384           1.475577           2.113764           1.225632 
##         Generosity 
##           1.658770

Prediction

confint(lm.bottom10, level = 0.95)

##                          2.5 %    97.5 %
## (Intercept)         2.85495497 3.7858197
## GDP_per_capita     -0.43440708 1.4087866
## Social_support     -0.09839511 0.8247848
## Life_expectancy    -1.37146300 0.8611739
## Freedom_to_choices -0.59414031 0.9560817
## Generosity         -2.32113431 0.8333956

The 95% confidence interval for GDP_per_capita is â 0.4344 to 1.4088. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will either decrease by 0.4344 or increase by 1.4088.

The 95% confidence interval for Social_support is â 0.0984 to 0.8248. We are 95% confident that if Social_support increases by 1, the Happiness_score will either decrease by 0.0984 or increase by 0.8248.

The 95% confidence interval for Life_expectancy is â 1.3715 to 0.8612. We are 95% confident that if Life_expectancy increases by 1, the Happiness_score will either decrease by 1.3715 or increase by 0.8612.

The 95% confidence interval for Freedom_to_choices is â 0.5941 to 0.9561. We are 95% confident that if Freedom_to_choices increases by 1, the Happiness_score will either decrease by 0.5941 or increase by 0.9561.

The 95% confidence interval for Generosity is â 2.3211 to 0.8334. We are 95% confident that if Generosity increases by 1, the Happiness_score will either decrease by 2.3211 or increase by 0.8334.

All these five intervals include zero and thus the slope Ã1, Ã2, Ã3, Ã4 and Ã5 are not significantly different from zero at the 5% level of significance and the effects will not be significant at the 0.05 level as well.

Evaluation

cv_lm.bottom10 <- train(
  form = Happiness_score~., 
  data = train_bottom10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.bottom10

## Linear Regression 
## 
## 39 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3653142  0.2794597  0.3121033
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.bottom10) has a RMSE of 0.3653 and a MAE of 0.3121. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.3121 indicates that the average difference between the predicted and the actual Happiness_score was 0.3121. Furthermore, it has a R-squared of 0.9876. This suggests that 98.76% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity. It also has an AIC value of 37.8817 and a BIC value of 49.5267.

Improving the Model

regfit.2.3 <- regsubsets(Happiness_score~ ., data = train_bottom10)

par(mfrow=c(2,2))
plot(regfit.2.3, scale = "r2")
plot(regfit.2.3, scale = "adjr2")
plot(regfit.2.3, scale = "Cp")
plot(regfit.2.3, scale = "bic")

The best subset selection was performed. For R-squared, it was observed that all the five independent variables (GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity) are the best subset of predictors. As for adjusted R-squared, Cp and Bayesian Information Criterion (BIC), all three of them have indicated that Social_support was the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 5 
set.seed(1)
folds <- sample(1:k, nrow(train_bottom10), replace = TRUE) 
cv_errors <- matrix(NA, k, 5, dimnames = list(NULL, paste(1:5)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ ., data = train_bottom10[folds != j, ], nvmax = 10) 
  for(i in 1:5) {
    pred <- predict(best_fit, train_bottom10[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_bottom10$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that to include only one predictor was the best as it has the lowest mean cross validation error.

Based on the information that we have obtained from the above methods, we have concluded that one independent variable Social_support would be selected as our predictor for our next model where GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity will be excluded from our original model.

Model 2 (lm.bottom10.2):

Our objective is to examine the impact of Social_support on Happiness_score for the top 10 least happiest countries. Having included only one independent variable into our model, we used the simple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.bottom10.2) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.bottom10.2 <- lm(Happiness_score ~ Social_support, data = train_bottom10)

Interpretation

summary(lm.bottom10.2)

## 
## Call:
## lm(formula = Happiness_score ~ Social_support, data = train_bottom10)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59052 -0.28765  0.07949  0.20367  0.94779 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.2084     0.1113  28.818   <2e-16 ***
## Social_support   0.4557     0.1807   2.523   0.0161 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3456 on 37 degrees of freedom
## Multiple R-squared:  0.1467, Adjusted R-squared:  0.1237 
## F-statistic: 6.363 on 1 and 37 DF,  p-value: 0.01608

Estimated Happiness_score = 3.2084 + 0.4557 Social_support

Ã0 = 3.2084 suggests that if Social_support is equal to 0, we would expect the Happiness_score to increase by 3.2084.

Ã1 = 0.4557 suggests that if the Social_support increases by 1, we would expect the Happiness_score to increase by 0.4557, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.bottom10.2) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.bottom10.2):

autoplot(lm.bottom10.2, which = 1:2, nrow = 2, ncol = 1 )

Prediction

confint(lm.bottom10.2, level = 0.95)

##                     2.5 %    97.5 %
## (Intercept)    2.98285193 3.4340155
## Social_support 0.08966412 0.8217365

The 95% confidence interval for Social_support is 0.0897 to 0.8217. We are 95% confident that if Social_support increases by 1, the Happiness_score will increase between 0.0897 and 0.8217. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.bottom10.2 <- train(
  form = Happiness_score ~ Social_support, 
  data = train_bottom10, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.bottom10.2

## Linear Regression 
## 
## 39 samples
##  1 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.3360635  0.565561  0.2894052
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.bottom10.2) has a RMSE of 0.3361 and a MAE of 0.2894. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.2894 indicates that the average difference between the predicted and the actual Happiness_score was 0.2894. Furthermore, it has a R-squared of 0.9959. This suggests that 99.59% of the variation in Happiness_score is explained by the variation in Social_support. It also has an AIC value of 31.7447 and a BIC value of 36.7354.

Best model selection :

Comparing goodness of fit between models :

Top 10 happiest countries

	lm.top10	lm.top10.2	lm.top10.3	lm.top10.4
R-squared	95.79%	85.81%	91.75%	92.25%
RMSE	0.0941	0.0963	0.1000	0.1049
MAE	0.0826	0.0804	0.0877	0.0934
AIC	-66.14	-69.39	-64.73	-62.24
BIC	-54.32	-60.94	-57.97	-57.17

Based on the table above, it is observed that Model (lm.top10) has the highest R-squared value (95.79%) and the lowest RMSE value (0.0941) as compared to other models.

Besides that, the RMSE, MAE and AIC values in Model (lm.top10.3) and Model (lm.top10.4) are higher than in Model (lm.top10) and Model (lm.top10.2), thus both Model (lm.top10.3) and Model (lm.top10.4) do not fit the data well.

Although the MAE, AIC and BIC values in Model (lm.top10.2) are the smallest which are 0.0804, â 69.39 and â 60.94 respectively, the variables (GDP_per_capita, Freedom_to_choices and Generosity) in Model (lm.top10.2) are not highly correlated. Furthermore, Model (lm.top10.2) has the lowest R-squared value (85.81%) which is approximately 10% lower than the R-squared in Model (lm.top10) which indicates that Model (lm.top10.2) do not perform better than model (lm.top10). Hence, we can conclude that Model (lm.top10) is our best model from the 4 models that we have generated in predicting the Happiness_score for the top 10 happiest countries as it has the largest R-squared value and the smallest RMSE value. These metrics have indicated that the data could fit Model (lm.top10) well where it is best to use all 5 variables which are GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity.

Top 10 least happiest countries

	lm.bottom10	lm.bottom10.2
R-squared	98.76%	99.59%
RMSE	0.3653	0.3361
MAE	0.3121	0.2894
AIC	37.88	37.74
BIC	49.53	36.74

By comparing between Model (lm.bottom10) and Model (lm.bottom10.2), we can see that Model (lm.bottom10.2) has a higher R-squared value of 99.59% than Model (lm.bottom10). This indicates that Model (lm.bottom10.2) is better in terms of explaining the variation in Happiness_score.

Besides that, the RMSE, MAE, AIC and BIC values in Model (lm.bottom10.2) are 0.3361, 0.2894, 37.74 and 36.74 respectively which are smaller than (lm.bottom10.2) as shown in the table above. Hence, we can clearly see that Model (lm.bottom10.2) is the best model in predicting the Happiness_score for the top 10 least happiest countries where it is best to only use Social_support as the predictor for this model.

4.1.2 Research Question 2

Model 1 (lm.western):

Our objective is to examine the impact of GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity on Happiness_score for the countries in Western Europe region. Having included five independent variables into our model, we used the multiple linear regression model to address the pertaining question.

# split data
set.seed(123)
split <- sample(1:nrow(lm_western), round(nrow(lm_western) * 0.8)) # 80-20 split
train_western <- lm_western[split, ]
test_western <- lm_western[-split, ]

Estimation

The estimated model (lm.western) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.western <- lm(Happiness_score ~., data = train_western)

Interpretation

summary(lm.western)

## 
## Call:
## lm(formula = Happiness_score ~ ., data = train_western)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.90620 -0.28174  0.05924  0.28728  0.69152 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.6037     0.8155  -0.740 0.461448    
## GDP_per_capita       2.3344     0.4126   5.658 2.58e-07 ***
## Social_support       0.4663     0.2457   1.898 0.061501 .  
## Life_expectancy      2.2455     0.6535   3.436 0.000959 ***
## Freedom_to_choices   2.1746     0.4199   5.179 1.77e-06 ***
## Generosity           1.3392     0.4240   3.159 0.002275 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3762 on 76 degrees of freedom
## Multiple R-squared:  0.7827, Adjusted R-squared:  0.7684 
## F-statistic: 54.75 on 5 and 76 DF,  p-value: < 2.2e-16

Estimated Happiness_score = â 0.6037 + 2.3344 GDP_per_capita + 0.4663 Social_support + 2.2455 Life_expectancy + 2.1746 Freedom_to_choices + 1.3392 Generosity

Ã0 = -0.6037 suggests that if GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to decrease by 0.6037.

Ã1 = 2.3344 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase by 2.3344, holding all other variables constant.

Ã2 = 0.4663 suggests that if the Social_support increases by 1, we would expect the Happiness_score to increase by 0.4663, holding all other variables constant.

Ã3 = 2.2455 suggests that if the Life_expectancy increases by 1, we would expect the Happiness_score to increase by 2.2455, holding all other variables constant.

Ã4 = 2.1746 suggests that if the Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 2.1746, holding all other variables constant.

Ã5 = 1.3392 suggests that if the Generosity increases by 1, we would expect the Happiness_score to increase by 1.3392, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.western) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.western):

autoplot(lm.western, which = 1:2, nrow = 2, ncol = 1 )

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly towards the end. This shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.western), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.western)

##     GDP_per_capita     Social_support    Life_expectancy Freedom_to_choices 
##           1.393302           1.668647           1.337811           2.337401 
##         Generosity 
##           1.797192

Prediction

confint(lm.western, level = 0.95)

##                          2.5 %    97.5 %
## (Intercept)        -2.22789749 1.0205799
## GDP_per_capita      1.51261570 3.1562197
## Social_support     -0.02302934 0.9557007
## Life_expectancy     0.94397226 3.5470490
## Freedom_to_choices  1.33837023 3.0108981
## Generosity          0.49477505 2.1836590

The 95% confidence interval for GDP_per_capita is 1.5126 to 3.1562. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will increase between 1.5126 and 3.1562. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Social_support is â 0.0230 to 0.9557 respectively. This interval includes zero and thus the slope Ã2 is not significantly different from zero at the 5% level of significance and the effects will not be significant at the 0.05 level as well.

The 95% confidence interval for Life_expectancy is 0.9440 to 3.5470. We are 95% confident that if Life_expectancy increases by 1, the Happiness_score will increase between 0.9440 and 3.5470. This interval does not include zero and thus slope Ã3 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Freedom_to_choices is 1.3384 to 3.0109. We are 95% confident that if Freedom_to_choices increases by 1, the Happiness_score will increase between 1.3384 and 3.0109. This interval does not include zero and thus slope Ã4 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Generosity is 0.4948 to 2.1837. We are 95% confident that if Generosity increases by 1, the Happiness_score will increase between 0.4948 and 2.1837. This interval does not include zero and thus slope Ã5 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.western <- train(
  form = Happiness_score~., 
  data = train_western, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.western

## Linear Regression 
## 
## 82 samples
##  5 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 74, 74, 74, 73, 74, 74, ... 
## Resampling results:
## 
##   RMSE       Rsquared  MAE      
##   0.3856888  0.791963  0.3176116
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that the model (lm.western) has a RMSE of 0.3857 and a MAE of 0.3176. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.3176 indicates that the average difference between the predicted and the actual Happiness_score was 0.3176. Furthermore, it has a R-squared of 0.7920. This suggests that 79.20% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity. It also has an AIC value of 80.1321 and a BIC value of 96.9791.

Improving the Model

regfit.3.1 <- regsubsets(Happiness_score~ ., data = train_western)

par(mfrow=c(2,2))
plot(regfit.3.1, scale = "r2")
plot(regfit.3.1, scale = "adjr2")
plot(regfit.3.1, scale = "Cp")
plot(regfit.3.1, scale = "bic")

The best subset selection was performed. For R-squared, adjusted R-squared and Cp, all three of them have indicated that GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity were the best subset of predictors. As for Bayesian Information Criterion (BIC), it was observed that four independent variables (GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity) are the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_western), replace = TRUE) 
cv_errors <- matrix(NA, k, 5, dimnames = list(NULL, paste(1:5)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ ., data = train_western[folds != j, ], nvmax = 10) 
  for(i in 1:5) {
    pred <- predict(best_fit, train_western[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_western$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that a total of 5 predictors were the best as it has the lowest mean cross validation error.

Based on the information that we have obtained from the above methods, we have concluded that four independent variables GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity would be selected as our predictors for our next model where Social_support will be excluded from our original model.

Model 2 (lm.western.2):

Our objective is to examine the impact of GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity on Happiness_score for the countries in Western Europe region. Having included four independent variables from the subset selection method and 10-fold cross validation method into our model, we used the multiple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.western.2) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.western.2 <- lm(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices + Generosity, data = train_western)

Interpretation

summary(lm.western.2)

## 
## Call:
## lm(formula = Happiness_score ~ GDP_per_capita + Life_expectancy + 
##     Freedom_to_choices + Generosity, data = train_western)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86317 -0.26073  0.09358  0.26050  0.64164 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.8330     0.8200  -1.016  0.31288    
## GDP_per_capita       2.5115     0.4087   6.145 3.27e-08 ***
## Life_expectancy      2.7650     0.6034   4.583 1.74e-05 ***
## Freedom_to_choices   2.5049     0.3885   6.447 9.07e-09 ***
## Generosity           1.2452     0.4281   2.908  0.00474 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3825 on 77 degrees of freedom
## Multiple R-squared:  0.7724, Adjusted R-squared:  0.7606 
## F-statistic: 65.33 on 4 and 77 DF,  p-value: < 2.2e-16

Estimated Happiness_score = â 0.8330 + 2.5115 GDP_per_capita + 2.7650 Life_expectancy + 2.5049 Freedom_to_choices + 1.2452 Generosity

Ã0 = -0.8330 suggests that if GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to decrease by 0.8330.

Ã1 = 2.5115 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase 2.5115, holding all other variables constant.

Ã2 = 2.7650 suggests that if the Life_expectancy increases by 1, we would expect the Happiness_score to increase by 2.7650, holding all other variables constant.

Ã3 = 2.5049 suggests that if the Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 2.5049, holding all other variables constant.

Ã4 = 1.2452 suggests that if the Generosity increases by 1, we would expect the Happiness_score to increase by 1.2452, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.western.2) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.western.2):

autoplot(lm.western.2, which = 1:2, nrow = 2, ncol = 1 )

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly towards the end. This shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.western.2), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.western.2)

##     GDP_per_capita    Life_expectancy Freedom_to_choices         Generosity 
##           1.322098           1.103131           1.935925           1.772675

From the results, it is observed that all four independent variables (GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity) are not highly correlated because all of them met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(lm.western.2, level = 0.95)

##                         2.5 %    97.5 %
## (Intercept)        -2.4659007 0.7998601
## GDP_per_capita      1.6976830 3.3252255
## Life_expectancy     1.5635529 3.9664160
## Freedom_to_choices  1.7312519 3.2785608
## Generosity          0.3926921 2.0977654

The 95% confidence interval for GDP_per_capita is 1.6977 to 3.3252. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will increase between 1.6977 and 3.3252. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Life_expectancy is 1.5636 to 3.9664. We are 95% confident that if Life_expectancy increases by 1, the Happiness_scoree will increase between 1.5636 and 3.9664. This interval does not include zero and thus slope Ã2 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Freedom_to_choices is 1.7313 to 3.2786. We are 95% confident that if Freedom_to_choices increases by 1, the Happiness_score will increase between 1.7313 and 3.2786. This interval does not include zero and thus slope Ã3 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Generosity is 0.3927 to 2.0978. We are 95% confident that if Generosity increases by 1, the Happiness_score will increase between 0.3927 and 2.0978. This interval does not include zero and thus slope Ã4 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.western.2 <- train(
  form = Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices + Generosity, 
  data = train_western, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.western.2

## Linear Regression 
## 
## 82 samples
##  4 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 74, 74, 74, 73, 74, 74, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3813104  0.7993535  0.3191809
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.western.2) has a RMSE of 0.3813 and a MAE of 0.3192. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.3192 indicates that the average difference between the predicted and the actual Happiness_score was 0.3192. Furthermore, it has a R-squared of 0.7994. This suggests that 79.94% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity. It also has an AIC value of 81.9294 and a BIC value of 96.3697.

Improving the Model

regfit.2.2 <- regsubsets(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices + Generosity, data = train_western)

par(mfrow=c(2,2))
plot(regfit.2.2, scale = "r2")
plot(regfit.2.2, scale = "adjr2")
plot(regfit.2.2, scale = "Cp")
plot(regfit.2.2, scale = "bic")

The best subset selection was performed. As for R-squared, adjusted R-squared, Cp and Bayesian Information Criterion (BIC), all four of them have indicated that GDP_per_capita, Life_expectancy, Freedom_to_choices and Generosity were the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_western), replace = TRUE) 
cv_errors <- matrix(NA, k, 4, dimnames = list(NULL, paste(1:4)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices + Generosity, data = train_western[folds != j, ], nvmax = 10) 
  for(i in 1:4) {
    pred <- predict(best_fit, train_western[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_western$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that a total of 4 predictors were the best as it has the lowest mean cross validation error.

Model 3 (lm.western3):

Our objective is to examine the impact of GDP_per_capita, Life_expectancy and Freedom_to_choices on Happiness_score for the countries in Western Europe region. Having included three independent variables with the smallest p-value into our model, we used the multiple linear regression model to address the pertaining question.

Estimation

The estimated model (lm.western.3) was performed using the âlmâ function in R using the Ordinary Least Square criterion to identify the âbest fittingâ line that minimizes the sum of the residuals (RSS) in estimating regression coefficients.

lm.western.3 <- lm(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, data = train_western)

Interpretation

summary(lm.western.3)

## 
## Call:
## lm(formula = Happiness_score ~ GDP_per_capita + Life_expectancy + 
##     Freedom_to_choices, data = train_western)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0944 -0.2057  0.1007  0.2481  0.6583 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -0.5611     0.8527  -0.658 0.512472    
## GDP_per_capita       2.5475     0.4276   5.958 6.96e-08 ***
## Life_expectancy      2.4115     0.6186   3.898 0.000204 ***
## Freedom_to_choices   3.1748     0.3275   9.693 4.91e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4003 on 78 degrees of freedom
## Multiple R-squared:  0.7474, Adjusted R-squared:  0.7377 
## F-statistic: 76.93 on 3 and 78 DF,  p-value: < 2.2e-16

Estimated Happiness_score = â 0.5611 + 2.5475 GDP_per_capita + 2.4115 Life_expectancy + 3.1748 Freedom_to_choices

Ã0 = -0.5611 suggests that if GDP_per_capita, Life_expectancy and Freedom_to_choices and Generosity are equal to 0, we would expect the Happiness_score to decrease by 0.5611.

Ã1 = 2.5475 suggests that if GDP_per_capita increases by 1, we would expect the Happiness_score to increase 2.5475, holding all other variables constant.

Ã2 = 2.4115 suggests that if the Life_expectancy increases by 1, we would expect the Happiness_score to increase by 2.4115, holding all other variables constant.

Ã3 = 3.1748 suggests that if the Freedom_to_choices increases by 1, we would expect the Happiness_score to increase by 3.1748, holding all other variables constant.

Diagnosis

In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (lm.western.3) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(lm.western.3):

autoplot(lm.western.3, which = 1:2, nrow = 2, ncol = 1)

From the Residuals vs Fitted plot, it is observed that the residuals spread around a horizontal line without any distinct pattern. This is a good indication that we do not have non-linear relationships. As for the Normal Q-Q plot, it is observed that the residuals deviate slightly at the beginning and towards the end. This shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (lm.western.3), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(lm.western.3)

##     GDP_per_capita    Life_expectancy Freedom_to_choices 
##           1.320881           1.058364           1.255651

From the results, it is observed that all three independent variables (GDP_per_capita, Life_expectancy and Freedom_to_choices) are not highly correlated because all of them met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(lm.western.3, level = 0.95)

##                        2.5 %   97.5 %
## (Intercept)        -2.258793 1.136573
## GDP_per_capita      1.696295 3.398749
## Life_expectancy     1.179942 3.643008
## Freedom_to_choices  2.522712 3.826810

The 95% confidence interval for GDP_per_capita is 1.6963 to 3.3987. We are 95% confident that if GDP_per_capita increases by 1, the Happiness_score will increase between 1.6963 and 3.3987. This interval does not include zero and thus slope Ã1 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Life_expectancy is 1.1799 to 3.6430. We are 95% confident that if Life_expectancy increases by 1, the Happiness_score will increase between 1.1799 and 3.6430. This interval does not include zero and thus slope Ã2 is significantly different from zero at the 5% level of significance.

The 95% confidence interval for Freedom_to_choices is 2.5227 to 3.8268. We are 95% confident that if Freedom_to_choices increases by 1, the Happiness_score will increase between 2.5227 and 3.8268. This interval does not include zero and thus slope Ã3 is significantly different from zero at the 5% level of significance.

Evaluation

cv_lm.western.3 <- train(
  form = Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, 
  data = train_western, 
  method = "lm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_lm.western.3

## Linear Regression 
## 
## 82 samples
##  3 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 74, 74, 74, 73, 74, 74, ... 
## Resampling results:
## 
##   RMSE       Rsquared   MAE      
##   0.3983567  0.7799532  0.3205176
## 
## Tuning parameter 'intercept' was held constant at a value of TRUE

By using 10-fold cross validation, it was observed that model (lm.western.3) has a RMSE of 0.3984 and a MAE of 0.3205. This suggests that there is some variation in the magnitude of the errors but very large errors are unlikely to occur. MAE of 0.3205 indicates that the average difference between the predicted and the actual Happiness_score was 0.3205. Furthermore, it has a R-squared of 0.78. This suggests that 78% of the variation in Happiness_score is explained by the variation in GDP_per_capita, Life_expectancy and Freedom_to_choices. It also has an AIC value of 88.4765 and a BIC value of 100.5101.

Improving the Model

regfit.2.3 <- regsubsets(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, data = train_western)

par(mfrow=c(2,2))
plot(regfit.2.3, scale = "r2")
plot(regfit.2.3, scale = "adjr2")
plot(regfit.2.3, scale = "Cp")
plot(regfit.2.3, scale = "bic")

The best subset selection was performed. As for R-squared, adjusted R-squared, Cp and Bayesian Information Criterion (BIC), all four of them have indicated that GDP_per_capita, Life_expectancy and Freedom_to_choices were the best subset of predictors.

# k-fold CV method (2nd method for subset selection)
k <- 10  
set.seed(1)
folds <- sample(1:k, nrow(train_western), replace = TRUE) 
cv_errors <- matrix(NA, k, 3, dimnames = list(NULL, paste(1:3)))

# Perform cross validation using a for loop where the folds that are equal to j are in the testing set
# While the remaining are in the training set
# We then perform our predictions for each model size using the predict function we have created earlier
# To compute the test erros on the appropriate subset and allocate them into their appropriate slots in the matrix cv_errors

for(j in 1:k) {
  best_fit <- regsubsets(Happiness_score ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, data = train_western[folds != j, ], nvmax = 10) 
  for(i in 1:3) {
    pred <- predict(best_fit, train_western[folds == j, ], id = i) 
    cv_errors[j,i] <- mean((train_western$Happiness_score[folds == j] - pred)^2) 
  }
}

mean_cv_errors <- apply(cv_errors, 2, mean)
plot(mean_cv_errors, type = "b")

min_point <- which.min(mean_cv_errors)
points(min_point, mean_cv_errors[min_point], col = "red", cex = 2, pch = 20)

# k-fold CV method is preferred as it takes into account of any outliers

By using 10-fold cross validation, it was observed that a total of 3 predictors were the best as it has the lowest mean cross validation error.

Based on the information that we have obtained from the above methods, we have concluded that three independent variables GDP_per_capita, Life_expectancy and Freedom_to_choices would be considered as predictors for our model.

Best model selection :

Comparing goodness of fit between models :

	lm.western	lm.western.2	lm.western.3
R-squared	92.32%	90.85%	93.45%
RMSE	0.3857	0.3813	0.3984
MAE	0.3176	0.3192	0.3205
AIC	80.13	81.93	88.48
BIC	96.98	96.37	100.51

Referring to the table above, we can see that although Model (lm.western.3) has the highest R-squared value (93.45%), the RMSE, MAE, AIC and BIC values in Model (lm.western.3) are the highest as compared to the other models. Hence, Model (lm.western.3) does not fit the data well.

Besides that, by comparing between Model (lm.western) and Model (lm.western.2), it is observed that Model (lm.western) has a higher R-squared value of 92.32% than Model (lm.western.2). This indicates that Model (lm.western) is better in terms of explaining the variation in Happiness_score.

Furthermore, the MAE and AIC values in Model (lm.western) are 0.3176 and 80.13 respectively which are smaller than the MAE and AIC values in Model (lm.western.2). Although Model (lm.western.2) has a lower RMSE and BIC values which are 0.3813 and 96.37 respectively as compared to Model (lm.western), Model (lm.western) is still selected as the best model because there is not much difference in RMSE and BIC values between both models. Therefore, Model (lm.western) is the best in predicting the Happiness_score for the countries in Western Europe region where it is best to use all 5 variables which are GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity.

4.2 Logistic Regression

4.2.1 Research Question 3

Model 1 (glm.world):

Logistic regression is used to investigate the relationship between the binary dependent variables with other independent variables. We classify the instances into two groups (0 and 1) by calculating the probability of them in one group. In this research question, we would examine the impact of GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity on Happiness_score, where 0 represents low happiness score (with the happiness score less than the mean) and 1 represents high happiness score.

# split data into training and testing set 
set.seed(123)
split <- sample(1:nrow(lm_world), round(nrow(lm_world) * 0.8)) # 80-20 split
train_world <- lm_world[split, ]
test_world <- lm_world[-split, ]

Estimation

The estimated model (glm.world) was performed using the âglmâ function in R using the logistic regression to identify the association of the binary independent variable with other dependent variables and fit the line values to the sigmoid curves.

glm.world <- glm(Happy~., data = train_world, family = binomial)

Interpretation

summary(glm.world)

## 
## Call:
## glm(formula = Happy ~ ., family = binomial, data = train_world)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.38803  -0.26160   0.05166   0.37126   2.73839  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -13.0128     1.2115 -10.741  < 2e-16 ***
## GDP_per_capita       5.3536     0.7608   7.037 1.97e-12 ***
## Social_support       1.8573     0.5791   3.207  0.00134 ** 
## Life_expectancy      4.0072     0.9887   4.053 5.06e-05 ***
## Freedom_to_choices   7.6789     1.2339   6.223 4.87e-10 ***
## Generosity           0.1316     1.2400   0.106  0.91549    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.08  on 613  degrees of freedom
## Residual deviance: 347.46  on 608  degrees of freedom
## AIC: 359.46
## 
## Number of Fisher Scoring iterations: 7

Estimated happiness score = -13.0128 + 5.3536 GDP per capita + 1.8573 Social support + 4.0072 Life expectancy + 7.6789 Freedom to choices + 0.1316 Generosity

Ã0 = -13.0128 suggests that the probability of happiness score not being 1 is 13.01%

Ã1 = 5.3536 suggests that if the GDP per capita increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(4.7110^{-4}\)%, holding all other variables constant. \(\frac{e^{(-13.0128+5.3536)}}{1+e^{(-13.0128+5.3536)}}=4.7110^{-4}\)%

Ã2 = 1.8573 suggests that if the social support increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(1.4310^{-5}\)%, holding all other variables constant. \(\frac{e^{(-13.0128+1.8573)}}{1+e^{(-13.0128+1.8573)}}=1.4310 ^{-5}\)%

Ã3 = 4.0072 suggests that if the life expectancy increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(1.2310^{-4}\)%, holding all other variables constant. \(\frac{e^{(-13.0128+4.0072)}}{1+e^{(-13.0128+4.0072)}}=1.2310 ^{-4}\)%

Ã4 = 7.6789 suggests that if the freedom to choices increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(4.8010^{-3}\)%, holding all other variables constant. \(\frac{e^{(-13.0128+7.6789)}}{1+e^{(-13.0128+7.6789)}}=4.8010^{-3}\)%

Ã5 = 0.1316 suggests that if the generosity increases by 1, the probability of having a happiness score of 1 increases multiplicatively by 4.8010-3%, holding all other variables constant. \(\frac{e^{(-13.0128+0.1316)}}{1+e^{(-13.0128+0.1316)}}=2.5510^{-6}\)%

Diagnosis

Regression diagnostics are a powerful ally in the model and data validation stages of regression model building. It allows us to assess the overall adequacy of our models. In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (glm.model) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(glm.world):

autoplot(glm.world, which = 1:2, nrow = 2, ncol = 1 )

Based on the Residuals vs Fitted plot, the residuals are equally spread without any distinct patterns, hence we could conclude that there are no non-linear relationships in this model. However, we could observe that the residuals deviate slightly at the beginning and towards the end in the Normal Q-Q plot, this shows that the residuals are not normally distributed.

ii) Multicollinearity

To identify if multicollinearity exists in our model (glm.world), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(glm.world)

##     GDP_per_capita     Social_support    Life_expectancy Freedom_to_choices 
##           1.382507           1.059949           1.219299           1.427642 
##         Generosity 
##           1.166290

From the results, it is observed that all five independent variables (GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity) are not highly correlated as all variables have met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(glm.world, level = 0.95)

## Waiting for profiling to be done...

##                          2.5 %     97.5 %
## (Intercept)        -15.5649930 -10.801557
## GDP_per_capita       3.9371722   6.927085
## Social_support       0.7423078   3.017803
## Life_expectancy      2.1284406   6.018697
## Freedom_to_choices   5.3405752  10.192480
## Generosity          -2.2980744   2.575804

The 95% confidence interval for GDP_per_capita and Social_support are 3.94 to 6.93 and 0.74 to 3.02 respectively. Since the interval doesnât include 0, hence we could conclude that Ã1 and Ã2 are significant at the 5% level of significance. Similarly, Life_expectancy and Freedom_to_choices have an interval of 2.13 to 6.02 and 5.34 to 10.19 respectively at the 5% significance level, hence Ã3 and Ã4 are significant since 0 is not included.

However, the 95% confidence interval for Generosity is -2.30 to 2.58. The interval includes 0 and thus Ã5 is not significant at the 5% level of significance.

Evaluation

# k fold cv
cv_glm.world <- train(
  form = Happy~., 
  data = train_world, 
  method = "glm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_glm.world

## Generalized Linear Model 
## 
## 614 samples
##   5 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 553, 553, 553, 552, 552, 552, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8633527  0.7263887

Through 10-fold cross validation, we could observe that this model has an accuracy of 0.8634, this indicates that the accuracy for the correctly predicted observations to the total observations is as high as 86.34%. Furthermore, the kappa value for this model is 0.7264. Kappa is measured to examine the reliability of the collected data in studying the variables of the model, whether the data could represent the variables correctly. In this model, we could conclude that the collected data are reliable and could represent the variables correctly.

Improving the Model

regfit.1.1 <- regsubsets(Happy ~ ., data = train_world)

par(mfrow=c(2,2))
plot(regfit.1.1, scale = "r2")
plot(regfit.1.1, scale = "adjr2")
plot(regfit.1.1, scale = "Cp")
plot(regfit.1.1, scale = "bic")

The best subset selection is performed to find out the best fitted variables for the model. R-squared has indicated that all 5 variables would be the best subset of predictors for this model. However, adjusted R-squared, Cp and Bayesian Information Criterion (BIC) have suggested that only GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices were the best subset of predictors.

Therefore, we have concluded that three independent variables GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices would be the best predictors for our next model.

Model 2 (glm.world.2):

Our objective would be to examine the impact of the 4 significant variables, which are GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices on Happiness_score, where 0 represents low happiness score (with the happiness score less than the mean) and 1 represents high happiness score. We have excluded Generosity from the previous model as it is insignificant at 5% significance level in the previous model and its p-value is 0.92.

Estimation

The estimated model (glm.world.2) was performed using the âglmâ function in R using the logistic regression to identify the association of the binary independent variable with other dependent variables and fit the line values to the sigmoid curves.

glm.world.2 <- glm(Happy ~ GDP_per_capita + Social_support + Life_expectancy + Freedom_to_choices, data = train_world, family = binomial)

Interpretation

summary(glm.world.2)

## 
## Call:
## glm(formula = Happy ~ GDP_per_capita + Social_support + Life_expectancy + 
##     Freedom_to_choices, family = binomial, data = train_world)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.39159  -0.25881   0.05221   0.37102   2.73990  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -12.9939     1.1976 -10.850  < 2e-16 ***
## GDP_per_capita       5.3478     0.7585   7.050 1.79e-12 ***
## Social_support       1.8516     0.5764   3.212  0.00132 ** 
## Life_expectancy      4.0083     0.9880   4.057 4.98e-05 ***
## Freedom_to_choices   7.7187     1.1764   6.561 5.34e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.08  on 613  degrees of freedom
## Residual deviance: 347.47  on 609  degrees of freedom
## AIC: 357.47
## 
## Number of Fisher Scoring iterations: 7

Estimated happiness score = -12.9939 + 5.3478 GDP per capita + 1.8516 Social support + 4.0083 Life expectancy + 7.7187 Freedom to choices

Ã0 = -13.0128 suggests that the probability of happiness score not being 1 is 12.99%

Ã1 = 5.3478 suggests that if GDP per capita increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(4.7810^{-4}\)%, holding all other variables constant. \(\frac{e^{(-12.9939+5.3478)}}{1+e^{(-12.9939+5.3478)}}=4.7810^{-4}\)%

Ã2 = 1.8516 suggests that if the social support increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(1.4510^{-5}\)%, holding all other variables constant. \(\frac{e^{(-12.9939+1.8516)}}{1+e^{(-12.9939+1.8516)}}=1.4510^{-5}\)%

Ã3 =4.0083 suggests that if the life expectancy increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(1.2510^{-4}\)%, holding all other variables constant. \(\frac{e^{(-12.9939+4.0083)}}{1+e^{(-12.9939+4.0083)}}=1.2510^{-4}\)%

Ã4 =7.7187 suggests that if the freedom to choose increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(5.0910^{-3}\)%, holding all other variables constant. \(\frac{e^{(-12.9939+7.7187)}}{1+e^{(-12.9939+7.7187)}}=5.0910^{-3}\)%

Diagnosis

Regression diagnostics are a powerful ally in the model and data validation stages of regression model building. It allows us to assess the overall adequacy of our models. In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (glm.model.2) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(glm.world.2):

autoplot(glm.world.2, which = 1:2, nrow = 2, ncol = 1 )

Based on the Residuals vs Fitted plot, there are no distinct patterns for the residuals in the plot, hence we could come to a conclusion that there are no non-linear relationships in this model. In the Normal Q-Q plot, residuals deviate at the beginning and towards the end, it only falls along the line in the middle of the plot, hence this indicates that the residuals are not normally distributed and there is a possibility that there are extreme values.

ii) Multicollinearity

To identify if multicollinearity exists in our model (glm.world.2), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(glm.world.2)

##     GDP_per_capita     Social_support    Life_expectancy Freedom_to_choices 
##           1.374020           1.050486           1.219158           1.297832

From the results, it is observed that all four independent variables (GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices) are not highly correlated as all variables have met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(glm.world.2, level = 0.95)

## Waiting for profiling to be done...

##                          2.5 %     97.5 %
## (Intercept)        -15.5182230 -10.809450
## GDP_per_capita       3.9354312   6.916437
## Social_support       0.7413638   3.006032
## Life_expectancy      2.1309436   6.018335
## Freedom_to_choices   5.4933686  10.120224

The 95% confidence interval for GDP_per_capita and Social_support are -15.52 to -10.81 and 3.94 to 6.92 respectively. As it doesnât include 0, hence we could conclude that both Ã1 and Ã2 are significant at the 5% level of significance. Also, Life_expectancy and Freedom_to_choices have an interval of 2.13 to 6.02 respectively, this has indicated that both Ã3 and Ã4 are significant at the 5% level of significance since 0 is not included.

Evaluation

cv_glm.world.2 <- train(
  form = Happy ~ GDP_per_capita + Social_support + Life_expectancy + Freedom_to_choices,
  data = train_world, 
  method = "glm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_glm.world.2

## Generalized Linear Model 
## 
## 614 samples
##   4 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 553, 552, 553, 552, 553, 553, ... 
## Resampling results:
## 
##   Accuracy   Kappa    
##   0.8615812  0.7229857

Through 10-fold cross validation, we could observe that this model has an accuracy of 0.8616, this indicates that the accuracy for this the correctly predicted observations to the total observations is 86.16%, which is considered high. Furthermore, the kappa value for the model is 0.7230. Kappa is measured to examine whether the collected data could represent the variables correctly. In this model, we could conclude that the collected data are reliable and could represent the variables correctly since its kappa value is relatively high.

Improving the Model

regfit.1.2 <- regsubsets(Happy ~ GDP_per_capita + Social_support + Life_expectancy + Freedom_to_choices, data = train_world)

par(mfrow=c(2,2))
plot(regfit.1.2, scale = "r2")
plot(regfit.1.2, scale = "adjr2")
plot(regfit.1.2, scale = "Cp")
plot(regfit.1.2, scale = "bic")

The best subset selection is performed to improve and find out the best fitted variables for the model. R-squared, adjusted R-squared, Cp and Bayesian Information Criterion have indicated that all 4 variables would be the best subset of predictors for this model, which includes GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices.

Therefore, we have concluded that three independent variables GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices would be the best predictors for our model.

Model 3 (glm.world3):

Our objective would be to examine the impact of the 4 significant variables, which are GDP_per_capita, Life_expectancy and Freedom_to_choices on Happiness_score, where 0 represents low happiness score (with the happiness score less than the mean) and 1 represents high happiness score. We only include the 3 variables that have the smallest p-value into this model.

Estimation

The estimated model (glm.world.3) was performed using the âglmâ function in R using the logistic regression to identify the association of the binary independent variable with other dependent variables and fit the line values to the sigmoid curves.

glm.world.3 <- glm(Happy ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, data = train_world, family = binomial)

Interpretation

summary(glm.world.3)

## 
## Call:
## glm(formula = Happy ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, 
##     family = binomial, data = train_world)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.49246  -0.29281   0.04748   0.37817   2.66679  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -12.0961     1.1318 -10.687  < 2e-16 ***
## GDP_per_capita       5.9186     0.7590   7.798 6.31e-15 ***
## Life_expectancy      4.3923     0.9702   4.527 5.98e-06 ***
## Freedom_to_choices   8.6361     1.1499   7.510 5.90e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 851.08  on 613  degrees of freedom
## Residual deviance: 358.36  on 610  degrees of freedom
## AIC: 366.36
## 
## Number of Fisher Scoring iterations: 7

Estimated happiness score = -12.0961 + 5.9186 GDP per capita + 4.3923 Life expectancy + 8.6361 Freedom to choices

Ã0 = -12.0961 suggests that the probability of happiness score not being 1 is 12.10%

Ã1 = 5.9186 suggests that if GDP per capita increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(2.0710^{-3}\)%, holding all other variables constant. \(\frac{e^{(-12.0961+5.9186)}}{1+e^{(-12.0961+5.9186)}}=5.0910^{-3}\)%

Ã2 = 4.3923 suggests that if life expectancy increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(4.5110^{-4}\)%, holding all other variables constant. \(\frac{e^{(-12.0961+4.3923)}}{1+e^{(-12.0961+4.3923)}}=4.5110^{-4}\)%

Ã3 =8.6361 suggests that if the freedom to choose increases by 1, the probability of having a happiness score of 1 increases multiplicatively by \(0.03\)%, holding all other variables constant. \(\frac{e^{(-12.0961+8.6361)}}{1+e^{(-12.0961+8.6361)}}=0.03\)%

Diagnosis

Regression diagnostics are a powerful ally in the model and data validation stages of regression model building. It allows us to assess the overall adequacy of our models. In this part, we will perform relevant diagnostics to detect underlying problems and assess if the estimated model (glm.model.3) works well for the data at hand, thus coming up with a much more improved model for prediction.

i) Residual Diagnostics

Residual plot and Q-Q plot for Model(glm.world.3):

autoplot(glm.world.3, which = 1:2, nrow = 2, ncol = 1 )

Based on the Residuals vs Fitted plot, we could see that the residuals are equally distributed along the line, hence there are no non-linear relationships in this model. In contrast, residuals only fall along the line in the middle, this could indicate that there might be extreme values in the data.

ii) Multicollinearity

To identify if multicollinearity exists in our model (glm.world.3), we apply the Variance Inflation Factor (VIF) to determine if the predictors in our multiple linear regression model are correlated.

vif(glm.world.3)

##     GDP_per_capita    Life_expectancy Freedom_to_choices 
##           1.393040           1.209211           1.298513

From the results, it is observed that all three independent variables (GDP_per_capita, Life_expectancy and Freedom_to_choices) are not highly correlated as all variables have met the criterion cut-off point which is below 10. However, both scatter plot matrix and correlation matrix show that GDP_per_capita and Life_expectancy has a moderately strong positive correlation.

Prediction

confint(glm.world.3, level = 0.95)

## Waiting for profiling to be done...

##                         2.5 %     97.5 %
## (Intercept)        -14.486219 -10.036201
## GDP_per_capita       4.503541   7.487018
## Life_expectancy      2.554101   6.370489
## Freedom_to_choices   6.471742  10.993504

The 95% confidence interval for GDP_per_capita and Life_expectancy are 4.50 to 7.49 and 2.55 to 6.37 respectively. As it doesnât include 0, hence we could conclude that both Ã1 and Ã2 are significant at the 5% level of significance. Also, Freedom_to_choices has an interval of 6.47 to 10.99, this has indicated that both Ã3 are significant as well at the 5% level of significance since 0 is not included.

Evaluation

cv_glm.world.3 <- train(
  form = Happy ~ GDP_per_capita + Life_expectancy + Freedom_to_choices,
  data = train_world, 
  method = "glm",
  trControl = trainControl(method = "cv", number = 10)
)
cv_glm.world.3

## Generalized Linear Model 
## 
## 614 samples
##   3 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 553, 552, 553, 551, 553, 553, ... 
## Resampling results:
## 
##   Accuracy  Kappa    
##   0.856891  0.7137284

Through 10-fold cross validation, we could observe that this model has an accuracy of 0.8569, this indicates that the accuracy for this the correctly predicted observations to the total observations is 85.69%, which is considered high. Furthermore, the kappa value for the model is 0.7137. Kappa is measured to examine whether the collected data could represent the variables correctly. In this model, we conclude that the collected data could represent the variables correctly and it is highly reliable since its kappa value is relatively high.

Improving the Model

regfit.1.3 <- regsubsets(Happy ~ GDP_per_capita + Life_expectancy + Freedom_to_choices, data = train_world)

par(mfrow=c(2,2))
plot(regfit.1.3, scale = "r2")
plot(regfit.1.3, scale = "adjr2")
plot(regfit.1.3, scale = "Cp")
plot(regfit.1.3, scale = "bic")

The best subset selection is performed to improve and find out the best fitted variables for the model. R-squared, adjusted R-squared, Cp and Bayesian Information Criterion have indicated that GDP_per_capita, Life_expectancy and Freedom_to_choices would be the best subset of predictors for this model.

Therefore, we could conclude that three independent variables, GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices would be the best predictors for our model.

Best model selection :

Comparing goodness of fit between models :

	glm.world	glm.world.2	glm.world.3
Residual deviance	347.46	347.47	358.36
AIC	359.46	357.47	366.36
Accuracy (CV)	86.34%	86.16%	85.69%
Accuracy (confusion matrix)	86.64%	86.81%	86.81%
AUC value	0.9516	0.9516	0.9476

Residual deviance is used to examine if the model is a good fit for the collected data, the smaller the residual deviance, the better the model fits the data. From the table above, it is observed that both Model (glm.world) and Model (glm.world.2) have the two lowest residual deviance, which indicates that the data in these two models fit well. Besides that, based on the AIC value, we can see that Model (glm.world.2) has the lowest AIC value as compared to other models, which indicates that the model fits the data well and the penalisation is the lowest for Model (glm.world.2).

In addition, both Model (glm.world.2) and Model (glm.world.3) have a higher confusion matrix accuracy than Model (glm.world.3) which are 86.64% and 86.81% respectively. This indicates that the tested accuracy from predicting is performing well as the accuracy is quite high. Lastly, both Model (glm.world) and Model (glm.world.2) have the highest AUC value, this suggests that both models are performing well in distinguishing between the high happiness score and low happiness score since the higher the AUC value, the higher the degree of separability.

Therefore, we can conclude that Model (glm.world.2) is our best model as it not only has the smallest residual deviance and AIC value but also the highest confusion matrix accuracy and AUC value. These metrics have indicated the data could fit Model (glm.world.2) well and could highly distinguish the high happiness score and low happiness score.

4.3 Regularized Regression

Regularized regression is a form of regression where the estimated coefficients are constrained towards zero and by reducing the variance of the model at the cost of introducing some biases in the model. The larger the Î» value, the stronger the penalty on coefficientsâ size (s). Therefore, variance decreases while bias increases. Hence, this will essentially penalize or eliminate non-important variables from the model to prevent overfitting due to the presence of multicollinearity in our data which has caused a high variability in our coefficient terms.

We would be using ridge regression and lasso regression for our Model (lm.top10) from our first research question.

Ridge Regression

Reasons for using ridge regression:

Although there are no presence of multicollinearity from all of our models based on the VIF value we have calculated earlier. It is observed that the VIF value for Model (lm.top10) is the highest as compared to other models from other research question.

Besides that, based on the VIF values obtained from Model (lm.top10), we have found that both VIF values for ‘Life_expectancy’ and ‘Freedom_to_choices’ are slightly higher than the rest. This shows that there is a small chance of multicollinearity in this model where both variables might have a correlation with the other variables. Not to mention, we have found that the R-squared value for Model (lm.top10) is substantially higher as compared to other models during our best model selection in Research Question 1, this may indicate that Model (lm.top10) is overfitted. Therefore, we would like to identify if our model will be improved by applying ridge regression to our model.

Firstly, we will split the data set into 80% training set and 20% test set.

set.seed(123)

# Split data into training and testing set
split <- sample(1:nrow(lm_top10), round(nrow(lm_top10) * 0.8)) # 80-20 split
train_top10_2 <- lm_top10[split, ]
test_top10_2 <- lm_top10[-split, ]

Next, we have to standardize the training set and test set because ridge regression assumes that the predictors are standardized and the response is centered. Therefore standardization will be required to avoid HSK-ridge regression.

train_x <- model.matrix(Happiness_score ~., train_top10_2)[,-1]  # discard intercept
train_y <- train_top10_2$Happiness_score

test_x <- model.matrix(Happiness_score ~., test_top10_2)[,-1]
test_y <- test_top10_2$Happiness_score

Furthermore, we will create a grid of values ranging from lambda 10^10 to 10^â2 to cover a full range of scenarios from the null model (only intercept) to a full model (OLS fit). This is because the glmnet package only performs a ridge regression for an automatically selected range of Î» values by default.

grid <- 10^seq(10, -2, length = 100)

# Apply ridge regression to the data
ridge_reg <- glmnet(x = train_x, y = train_y, alpha = 0, lambda = grid)

# The higher the lambda, the more the coefficients are shrinking towards zero.
plot(ridge_reg, xvar= "lambda", label = TRUE, main = "Ridge penalty")
legend("topright", lwd = 1, col = 1:6, legend = colnames(train_x), cex = .7)

Î» is a tuning parameter that allows us to control our model from overfitting to the training data. We will perform the 10-fold cross-validation to identify the optimal Î» value.

# Perform ridge regression with 10-fold CV
ridge_reg_cv <- cv.glmnet(
  x = train_x,
  y = train_y,
  alpha = 0        # alpha = 0 for ridge penalty
)

# Plot the results of penalty versus CV MSE
plot(ridge_reg_cv,  main = "MSE of a range of 10-fold cv-lambda")

Based on the figure above, the first vertical dashed line represents the Î» value with the minimum MSE, while the second vertical dashed lines represents the largest Î» value within one standard error of the minimum MSE. We can see that we can minimize our MSE by applying approximately -3.8 â¤ log(Î») â¤ â1.

# select optimal cv-lambda with the minimum MSE
ridge.pred <- predict(ridge_reg, s = ridge_reg_cv$lambda.min, newx = test_x) # unstandardized best cv-lambda value

Mean squared error (MSE) is a measure of how close a regression line is to a set of points. The lower the mean squared error, the better the model fits the data.

# calculate test MSE
mean((ridge.pred - test_y)^2) # test MSE of best cv-lambda

## [1] 0.01657888

# fit ridge regression model on full dataset
final.mod <- glmnet(train_x, train_y, alpha = 0)

# display coefficient using lambda chosen by cv
ridge.coef <- predict(final.mod, type = "coefficients", s = ridge_reg_cv$lambda.min)[1:6, ] 
ridge.coef

##        (Intercept)     GDP_per_capita     Social_support    Life_expectancy 
##         6.71894800         0.38034972        -0.01564159        -0.11875855 
## Freedom_to_choices         Generosity 
##         0.76863135        -0.49792503

We can see that all variables remain in the model with some penalty being applied to them as ridge regression does not perform variable selection where none of the estimated coefficients are zero.

# CV-RMSE using 1-SE rule lambda (1-standard error of minimum MSE)
ridge_reg_cv$cvm[ridge_reg_cv$lambda == ridge_reg_cv$lambda.1se]

## [1] 0.0119372

# The lambda for the MSE using 1-SE rule lambda
ridge_reg_cv$lambda.1se

## [1] 0.3887182

ridge_reg_min <- glmnet(
  x = train_x,
  y = train_y,
  alpha = 0
)

plot(ridge_reg_min, xvar = "lambda")
legend("topright", lwd = 1, col = 1:6, legend = colnames(train_x), cex = .7)
abline(v = log(ridge_reg_cv$lambda.1se), col = "red", lty = "dashed")

We have plot the coefficients across the Î» values and the vertical dashed red line represents the largest Î» value within one standard error of the minimum MSE. This allows us to identify how much we can apply constraints to our coefficients while maximizing the predictive accuracy of our model.

Based on the plot above, we can see that the top influential variables in order are ‘GDP_per_capita’, ‘Freedom_to_choices’, ‘Generosity’, followed by ‘Life_expectancy’ and ‘Social_support’. The higher the coefficient value, the higher the influence.

Lasso Regression

Reasons for using lasso regression:

Given that Model (lm.top10) is the best in predicting the Happiness_score for the top 10 happiest countries, where it has used all 5 variables which are GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity as shown earlier. We would like to identify if we can further automate certain parts of model selection such as the variable selection or parameter elimination where it will arbitrality select features which highly correlates to Happiness_score and reduced the coefficients of the rest to zero.

# Apply lasso regression to the data
lasso_reg <- glmnet(x = train_x, y = train_y, alpha = 1)

# The higher the lambda, the more the coefficients are shrinking towards zero.
plot(lasso_reg, xvar= "lambda", label = TRUE)
legend("topright", lwd = 1, col = 1:6, legend = colnames(train_x), cex = .7)

We will perform the 10-fold cross-validation to identify the optimal Î» value for our lasso regression.

# Perform lasso regression with 10-fold CV
lasso_reg_cv <- cv.glmnet(
  x = train_x,
  y = train_y,
  alpha = 1        # the alpha = 1 for lasso penalty
)

# Plot the results of penalty versus CV MSE
plot(lasso_reg_cv)

Based on the plot above, we can see that we can minimize our MSE by applying approximately -4.8 â¤ log(Î») â¤ â3.8.

# select optimal cv-lambda with the minimum MSE
lasso.pred <- predict(lasso_reg, s = lasso_reg_cv$lambda.min, newx = test_x) # manually set s 
mean((lasso.pred - test_y)^2) # test MSE of best cv-lambda

## [1] 0.01568489

# display coefficient using lambda chosen by cv
final.mod2 <- glmnet(train_x, train_y, alpha = 1)
lasso.coef <- predict(final.mod2, type = "coefficients", s = lasso_reg_cv$lambda.min)[1:6, ] 
lasso.coef

##        (Intercept)     GDP_per_capita     Social_support    Life_expectancy 
##         6.60547825         0.39246383         0.00000000        -0.03367441 
## Freedom_to_choices         Generosity 
##         0.74920307        -0.46879125

We can see that lasso regression has reduced the estimated coefficients of some of the less important variables to zero as lasso regression performs variable selection unlike ridge regression. Hence, some variables will only remain in the model.

# CV-RMSE using 1-SE rule lambda (1-standard error of minimum MSE)
lasso_reg_cv$cvm[lasso_reg_cv$lambda == lasso_reg_cv$lambda.1se]

## [1] 0.01229725

# The lambda for the MSE using 1-SE rule lambda
lasso_reg_cv$lambda.1se

## [1] 0.02330304

lasso_reg_min <- glmnet(
  x = train_x,
  y = train_y,
  alpha = 1
)

plot(lasso_reg_min, xvar = "lambda")
legend("topright", lwd = 1, col = 1:6, legend = colnames(train_x), cex = .7)
abline(v = log(lasso_reg_cv$lambda.1se), col = "red", lty = "dashed")

lasso_reg_min <- glmnet(
  x = train_x,
  y = train_y,
  alpha = 1
)

plot(lasso_reg_min, xvar = "lambda")
legend("topright", lwd = 1, col = 1:6, legend = colnames(train_x), cex = .7)
abline(v = log(lasso_reg_cv$lambda.1se), col = "red", lty = "dashed")

Based on the plot above, we can see that only the estimated coefficients for ‘Freedom_to_choices’, ‘GDP_per_capita’ and ‘Generosity’ are not penalize to zero, while the estimated coefficients for ‘Social_support’ and ‘Life_expectancy’ has been penalized to zero in order to maximize the predictive accuracy of our model.

Based on the plot above, we can see that the top most influential variables are ‘Freedom_to_choices’ followed by ‘GDP_per_capita’ and ‘Generosity’.

Insights drawn from both methods :

By comparing between ridge regression and lasso regression, we can see that lasso regression performs better as the MSE value for lasso regression is 0.0157 which is lower as compared to ridge regression with a MSE value of 0.0166. Therefore, we do not have to retain all variables in our model and also reduce the noise caused by less influential variables and minimize the chances of multicollinearity through ridge regression in this case.

Ultimately, we can see that the variables which strongly correlates with the ‘Happiness_score’ for the top 10 happiest countries are only ‘Freedom_to_choices’, ‘GDP_per_capita’ and ‘Generosity’ from our lasso regression as compared to using all 5 variables as a predictor for ‘Happiness_score’ in our Model (lm.top10) using a linear regression method.

5.0 Conclusion

5.1 Implications

In this report, various models were created by utilizing all three regression methods: Linear Regression, Logistic Regression and Regularized Regression.

Linear Regression method was used to address the first and second research questions. For research question 1, the best linear regression model suggested that GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity are the major factors that contribute to the Happiness_score of the top 10 happiest countries whereas Social_support is the only factor that contributes to the Happiness_score of the top 10 least happiest countries. Besides that, Regularized Regression method was also used to address the first research question and suggested that GDP_per_capita, Freedom_to_choices and Generosity are the only factors that strongly correlates with the ‘Happiness_score’ for the top 10 happiest countries. In terms of why different variables correlates to Happiness_score of the top 10 happiest and least happiest countries, we believe this could be due to the fact that citizens from every country have their own way to achieve happiness.

For the top 10 happiest countries, people would feel happy and satisfied when their countries have a strong GDP and are economically stable. As when the country is striving forward stably, consequently people could enjoy a high living standard and be satisfied with their own life. Furthermore, when people can live longer and are free to make choices without any interference, they would feel happy too as they are not restricted to anything and could live the way they wanted to. In addition, donations made in the past months and the presence of human support when facing difficulties could also affect the happiness of the people living in the top 10 happiest countries. This is because by helping others in overcoming their obstacles and difficulties, this would make them happier rather than spending time by themselves. Not to mention, by having a great social support around them, this will greatly allow them to overcome any negativity in their lives which eventually makes them feel happier in the long run.

On the other hand, we found that citizens living in the top 10 least happiest countries are happy mainly due to having a more supportive family and friends that you can turn to in times of difficulty. As people with a better social network are generally happier and able to cope with stress easily (Cherry, 2020). This could be due to the warmth and the help others are offering to them during their hardest time could really brighten their day and make them feel better, like a glimpse of hope in a dark world.

Furthermore, for research question 2, the best linear regression model suggested that GDP_per_capita, Social_support, Life_expectancy, Freedom_to_choices and Generosity are the major factors that contribute to the Happiness_score of countries in the Western Europe region. This is mainly due to the economy of the countries in the Western Europe region being generally more stable than the other regions in the world. It is known that its average GDP per capita is roughly 50,000, which is 4 times higher than the average GDP per capita globally (Nation Master, n.d.). In addition, the government officials in the Western Europe region are highly supportive of their people. Hence, when citizens have a good welfare, they tend to be happier and are prone to lesser worries leading to a higher life expectancy which indirectly correlates to the happiness of the citizens. Moreover, most of the countries in Western Europe are democratic where government officials allow their citizens to make choices freely based on their decisions, which is the reason why people tend to be happier in this way.

Furthermore, we have applied Logistic Regression to address our third research question which is to distinguish a country with high happiness score and low happiness score. The best model suggested that all 4 variables GDP_per_capita, Social_support, Life_expectancy and Freedom_to_choices except for Generosity are the best factors that contribute to distinguishing between countries with a high happiness score and low happiness score. Therefore, we can see that most countries with a high happiness score in general tend to have a stable financial status, having a great social network, living a longer life and being able to make choices on their own without having to worry about other matters. In short, individuals would be considered very happy when they have these factors in their life.

5.2 Limitations

The current research has some limitations. Firstly, there are some other factors that could potentially have a stronger influence on the Happiness score of a country such as age, marital status and income. However, these factors were not included in this research. Additionally, the sample size of this research can be improved by including more countries in our dataset, as our current analysis only consists of observations within the sample. Therefore, it does not forecast observations that lie outside the sample size or outside the model. Moreover, some countries in our current dataset might have inaccurate data due to data entry errors that might affect our results. Hence, this leaves room for further research to be carried out by other researchers that plan to conduct a more in-depth analysis to support our findings from each research question and to further improve our existing model or verify our hypothesis that we have found earlier.

5.3 Summary and Contributions

In summary, we have found out what citizens in the top 10 happiest countries, top 10 least happiest countries, countries in Western Europe region or in general value the most in terms of finding happiness in their lives. Interestingly, we can see how there are different combinations of factors that contribute to the happiness score of the citizens by looking from a different perspective. However, a country’s happiness score will stay unchanged or will not be improved if measures have not been taken to enhance it. Therefore, actions can be taken by health organizations in every country by emphasizing on these factors with a strong correlation with happiness score and organizing events that will clearly improve the country’s overall happiness in the future. Besides that, further research or explorations regarding contributing factors of happiness can also be carried out to broaden our knowledge or understanding about happiness. All in all, everyone should strive for happiness as happiness will definitely play a crucial role to one’s well-being in the long run.

References

Cherry, K. (2020). How Social Support Contributes to Psychological Health. Retrieved from https://www.verywellmind.com/social-support-for-psychological-health-4119970

Dwyer, M. (2012). Positive feelings may help protect cardiovascular health. Retrieved from https://www.hsph.harvard.edu/news/press-releases/positive-emotions-cardiovascular-health/

Garaigordobil, M. (2015). Predictor variables of happiness and its connection with risk and protective factors for health. Frontiers in Psychology, 6(1176), 1-10. doi: 10.3389/fpsyg.2015.01176

Nation Master. (n.d.). Western Europe: Statistical Profile. Retrieved from https://www.nationmaster.com/country-info/groups/Western-Europe

Thompson, D. (2019). The Happiness Dividend: Longer, Healthier Lives. Retrieved from https://www.webmd.com/healthy-aging/news/20190715/the-happiness-dividend-longer-healthier-lives#1

Warwick Business School. (2014). New study shows we work harder when we are happy. Retrieved from https://warwick.ac.uk/newsandevents/pressreleases/new_study_shows

Wike, R. & Simmons, K. (2015). Concerns and Priorities in Sub-Saharan Africa. Retrieved from https://www.pewresearch.org/global/2015/09/16/concerns-and-priorities-in-sub-saharan-africa/

World Happiness Report. (2020). Social Environments for World Happiness. Retrieved from https://worldhappiness.report/ed/2020/social-environments-for-world-happiness/