1 Prelude

South Asia, a melting pot of cultures, religions and histories, offers a unique backdrop against which the pursuit of happiness unfolds in diverse and intricate ways. This region, home to the majestic Himalayas, the serene beaches of Sri Lanka, the bustling markets of India and the ancient ruins of civilizations past, presents a tapestry of life that is rich in contrast and contradiction.

The World Happiness Report serves as a compass that navigates through the various dimensions of well-being and satisfaction across countries worldwide. In the context of South Asia, it offers us valuable insights into how people perceive their happiness, influenced by a myriad of factors from economic stability and social support to health, freedom and the environment. This analysis is not merely an academic exercise but a window into the lives of millions, shedding light on their joys, struggles, and aspirations.

Join us as we navigate through the landscapes of South Asia, seeking to understand what drives happiness in this diverse region and what lessons can be learned from its approach to fostering well-being and contentment among its people.

1.1 Data Preparation

To analyze happiness levels across several South Asian countries, we will use the World Happiness Report 2023 data published on Kaggle by USAMA BUTTAR. The South Asia subset used here covers 2005 to 2022.

Explanation of each column:

Country.Name : Name of the country
Regional.Indicator : Region the country belongs to
Year : Year of the observation
Life.Ladder : Happiness level (Cantril ladder score) [0-10]
Log.GDP.Per.Capita : Natural log of gross domestic product per capita
Social.Support : Having someone to rely on in times of trouble [0-1]
Healthy.Life.Expectancy.At.Birth : Expected number of years lived in good health from birth
Freedom.To.Make.Life.Choices : Satisfaction with the freedom to make life choices [0-1]
Generosity : Tendency to donate money to charity (in the World Happiness Report, a residual after adjusting for GDP per capita)
Perceptions.Of.Corruption : Perceived level of corruption [0-1]
Positive.Affect : Average of the previous day's positive affect measures (laughter, enjoyment and interest)
Negative.Affect : Average of the previous day's negative affect measures (worry, sadness and anger)
Confidence.In.National.Government : Level of trust in the national government

Needed Libraries

#Packages for dataframe transformation
library(dplyr)
library(tidyr)
library(lubridate)

#Packages for visualization
library(ggcorrplot)
library(gplots)
library(ggplot2)
library(plotly)
library(foreign)

#Packages for further analysis
library(plm)
library(lfe)
library(lmtest)
library(car)
library(tseries)
library(MLmetrics)

We will do several steps including:

Read World_Happiness_Report.csv from the data_input folder
Keep only the South Asian countries

# 1. dataset import
df <- read.csv("data_input/World_Happiness_Report.csv")

# 2. using only south asia data and remove regional indicator column
df_sasia <- df %>% 
  filter(Regional.Indicator == "South Asia") %>% 
  select(-Regional.Indicator)

head(df_sasia)

1.2 Checking for the Balance of the Data

We can check this in two ways:

1. Checking the frequencies of the data for each individual index

table(df_sasia$Country.Name)

## 
## Afghanistan  Bangladesh       India    Maldives       Nepal    Pakistan 
##          14          17          17           1          17          16 
##   Sri Lanka 
##          15

2. Using is.pbalanced() Function

This function expects the data in pdata.frame format; otherwise, we can pass the argument index = c("individual column", "time column"). It returns TRUE when the panel is balanced.

is.pbalanced(df_sasia,index = c("Country.Name","Year"))

## [1] FALSE
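The same idea can be checked without plm. The sketch below, in base R, treats a panel as balanced when every individual appears exactly once in every time period. The toy data frame and the helper name is_balanced are hypothetical and serve only as an illustration.

```r
# Hypothetical toy panel: country "B" is missing the year 2021
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  year    = c(2020, 2021, 2022, 2020, 2022)
)

# A panel is balanced when every individual-by-time cell holds exactly one row
is_balanced <- function(data, individual, time) {
  counts <- table(data[[individual]], data[[time]])
  all(counts == 1)
}

is_balanced(toy, "country", "year")  # FALSE, because B has no 2021 row
```

This is only a conceptual check; is.pbalanced() remains the function used in the analysis.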

From the frequency check and the balance check above, we can see that:

The data is not balanced.
Maldives has by far the fewest observations (1), followed by Afghanistan (14), Sri Lanka (15) and Pakistan (16).

Because a single observation is far too little for the later steps, I removed Maldives from the data.

df_sasia <- df_sasia %>% filter(Country.Name != "Maldives")
df_sasia

1.2.1 Data Structure Adjustment

1. Create Panel Data Frame

To balance the data, we first have to convert it into a panel data frame. We can do this with the pdata.frame() function, which takes these parameters:

data : The data that will be used
index : c("individual information", "time information")

#creating pdata.frame
df_sasia <- df_sasia %>% pdata.frame(index = c("Country.Name","Year"))

# check the data structure
glimpse(df_sasia)

## Rows: 96
## Columns: 12
## $ Country.Name                      <fct> Afghanistan, Afghanistan, Afghanista…
## $ Year                              <fct> 2008, 2009, 2010, 2011, 2012, 2013, …
## $ Life.Ladder                       <pseries> 3.723590, 4.401778, 4.758381, 3.…
## $ Log.GDP.Per.Capita                <pseries> 7.350416, 7.508646, 7.613900, 7.…
## $ Social.Support                    <pseries> 0.4506623, 0.5523084, 0.5390752,…
## $ Healthy.Life.Expectancy.At.Birth  <pseries> 50.500, 50.800, 51.100, 51.400, …
## $ Freedom.To.Make.Life.Choices      <pseries> 0.7181143, 0.6788964, 0.6001272,…
## $ Generosity                        <pseries> 0.167652458, 0.190808803, 0.1213…
## $ Perceptions.Of.Corruption         <pseries> 0.8816863, 0.8500354, 0.7067661,…
## $ Positive.Affect                   <pseries> 0.4142970, 0.4814214, 0.5169067,…
## $ Negative.Affect                   <pseries> 0.2581955, 0.2370924, 0.2753238,…
## $ Confidence.In.National.Government <pseries> 0.6120721, 0.6115452, 0.2993574,…

Doing this automatically changes the data type of each column:

The index columns are converted to factor
All other columns are converted to pseries

2. Checking Data Dimension

We can use pdim() function to check the dimension of the data

pdim(df_sasia)

## Unbalanced Panel: n = 6, T = 14-17, N = 96

From this check, we get the following information:

The data is not balanced.
There are 6 countries (n = 6).
Each country has between 14 and 17 time periods (T = 14-17).
There are 96 observations in total (N = 96).
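The three numbers reported by pdim() can be traced back to the per-country frequencies shown earlier. The following base R sketch recomputes them from those counts; the variable names are chosen here for illustration.

```r
# Observations per country after removing Maldives (from the table() output)
obs <- c(Afghanistan = 14, Bangladesh = 17, India = 17,
         Nepal = 17, Pakistan = 16, "Sri Lanka" = 15)

n_countries <- length(obs)  # n: number of individuals
t_range     <- range(obs)   # T: shortest and longest time series
n_total     <- sum(obs)     # N: total number of rows

n_countries  # 6
t_range      # 14 17
n_total      # 96
```

In a balanced panel every country would have the same T, so N would equal n × T.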

1.2.2 Balancing Data

We now balance the data with the make.pbalanced() function, whose balance.type parameter accepts 3 options:

fill (default): The union of available time periods over all individuals is taken (w/o NA values). Missing time periods for an individual are identified and corresponding rows (elements for pseries) are inserted and filled with NA for the non–index variables (elements for a pseries). This means, only time periods present for at least one individual are inserted, if missing.
shared.times : The intersect of available time periods over all individuals is taken (w/o NA values). Thus, time periods not available for all individuals are discarded, i. e., only time periods shared by all individuals are left in the result).
shared.individuals: All available time periods are kept and those individuals are dropped for which not all time periods are available, i. e., only individuals shared by all time periods are left in the result (symmetric to “shared.times”).
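To illustrate what the fill option does, the base R sketch below builds every combination of individual and observed time period and left-joins the data onto it, so that missing combinations appear as rows with NA. The toy data and object names are hypothetical, and this is a conceptual equivalent rather than plm's internal implementation.

```r
# Hypothetical toy panel: country "B" has no row for 2021
toy <- data.frame(
  country = c("A", "A", "A", "B", "B"),
  year    = c(2020, 2021, 2022, 2020, 2022),
  score   = c(4.1, 4.3, 4.2, 5.0, 5.4)
)

# Every combination of individual and observed time period
full_grid <- expand.grid(
  country = unique(toy$country),
  year    = sort(unique(toy$year)),
  stringsAsFactors = FALSE
)

# Left join: combinations absent from the data get NA in the value columns
filled <- merge(full_grid, toy, by = c("country", "year"), all.x = TRUE)
filled <- filled[order(filled$country, filled$year), ]

filled
```

The result has 2 × 3 = 6 rows, and the inserted row for B in 2021 carries NA in score.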

We use the fill option because it keeps every observed row and only adds NA rows for the missing periods, which matches our needs.

Balancing with fill

balance1 <- df_sasia %>% make.pbalanced(balance.type = "fill")

table(balance1$Country.Name)

## 
## Afghanistan  Bangladesh       India       Nepal    Pakistan   Sri Lanka 
##          18          18          18          18          18          18

unique(balance1$Year)

##  [1] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## [16] 2020 2021 2022
## 18 Levels: 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 ... 2022

The balanced data is saved to an object named balance1.

is.pbalanced(balance1)

## [1] TRUE

pdim(balance1)

## Balanced Panel: n = 6, T = 18, N = 108

The data is now balanced.

1.3 Checking of Missing Value

Before checking the completeness of the data, we need to know how many rows were added in the previous step.

# NA count after balancing - NA count before balancing
colSums(is.na(balance1)) - colSums(is.na(df_sasia))

##                      Country.Name                              Year 
##                                 0                                 0 
##                       Life.Ladder                Log.GDP.Per.Capita 
##                                12                                12 
##                    Social.Support  Healthy.Life.Expectancy.At.Birth 
##                                12                                12 
##      Freedom.To.Make.Life.Choices                        Generosity 
##                                12                                12 
##         Perceptions.Of.Corruption                   Positive.Affect 
##                                12                                12 
##                   Negative.Affect Confidence.In.National.Government 
##                                12                                12

This shows that balancing added 12 rows, each with NA in every non-index column.

Next, we check the completeness of the data:

colSums(is.na(balance1))

##                      Country.Name                              Year 
##                                 0                                 0 
##                       Life.Ladder                Log.GDP.Per.Capita 
##                                12                                13 
##                    Social.Support  Healthy.Life.Expectancy.At.Birth 
##                                12                                12 
##      Freedom.To.Make.Life.Choices                        Generosity 
##                                12                                16 
##         Perceptions.Of.Corruption                   Positive.Affect 
##                                12                                14 
##                   Negative.Affect Confidence.In.National.Government 
##                                12                                19

Based on the inspection above, every non-index column has missing values, and two columns stand out:

The Confidence.In.National.Government column is missing in 19 of 108 rows (about 18%), so it will not be included in the modeling.
The Generosity column is missing in 16 of 108 rows (about 15%), so it is dropped as well.

balance1 <-
  balance1 %>% select(-Confidence.In.National.Government,-Generosity)
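The missing-value shares of the two dropped columns can be verified from the NA counts and the panel size reported above. This is a small arithmetic check in base R; the variable names are chosen here for illustration.

```r
# NA counts and total rows taken from the colSums() and pdim() output
n_rows    <- 108
na_counts <- c(Confidence.In.National.Government = 19, Generosity = 16)

# Share of missing values per column
na_share <- round(na_counts / n_rows, 3)
na_share  # 0.176 and 0.148
```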

To fill the remaining missing values, we interpolate them separately for each country.

1.4 Fill Missing Values (for Each Country)

1.4.1 Afghanistan

afghan <- balance1 %>% filter(Country.Name == "Afghanistan")

colSums(is.na(afghan))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                4                                5 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                4                                4 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                4                                4 
##                  Positive.Affect                  Negative.Affect 
##                                4                                4

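Several columns contain missing values. Let's fill them with the na.fill() function (from the zoo package) using fill = "extend", which linearly interpolates interior gaps and repeats the nearest observed value at the leading and trailing ends.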

afghan <- afghan %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(afghan)

## [1] FALSE
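To make the behaviour of fill = "extend" concrete, here is a minimal base R sketch that mimics it with approx(): interior gaps are linearly interpolated and, through rule = 2, leading and trailing gaps take the nearest observed value. The helper name fill_extend and the toy vector are hypothetical; the analysis itself uses na.fill().

```r
fill_extend <- function(v) {
  idx <- seq_along(v)
  ok  <- !is.na(v)
  # rule = 2: outside the observed range, reuse the closest observed value
  approx(x = idx[ok], y = v[ok], xout = idx, rule = 2)$y
}

x <- c(NA, 2, NA, 4, NA)
fill_extend(x)  # 2 2 3 4 4
```

Note that values filled at the ends are simple carry-overs, so a long run of missing years at the start or end of a series is replaced by a flat line.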

1.4.2 Bangladesh

bangla <- balance1 %>% filter(Country.Name == "Bangladesh")

colSums(is.na(bangla))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                1                                1 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                1                                1 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                1                                1 
##                  Positive.Affect                  Negative.Affect 
##                                2                                1

Several columns contain missing values here as well, so we fill them with the na.fill() function using fill = "extend".

bangla <- bangla %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(bangla)

## [1] FALSE

1.4.3 India

india <- balance1 %>% filter(Country.Name == "India")

colSums(is.na(india))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                1                                1 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                1                                1 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                1                                1 
##                  Positive.Affect                  Negative.Affect 
##                                1                                1

Several columns contain missing values here as well, so we fill them with the na.fill() function using fill = "extend".

india <- india %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(india)

## [1] FALSE

1.4.4 Nepal

nepal <- balance1 %>% filter(Country.Name == "Nepal")

colSums(is.na(nepal))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                1                                1 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                1                                1 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                1                                1 
##                  Positive.Affect                  Negative.Affect 
##                                1                                1

Several columns contain missing values here as well, so we fill them with the na.fill() function using fill = "extend".

nepal <- nepal %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(nepal)

## [1] FALSE

1.4.5 Pakistan

pakis <- balance1 %>% filter(Country.Name == "Pakistan")

colSums(is.na(pakis))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                2                                2 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                2                                2 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                2                                2 
##                  Positive.Affect                  Negative.Affect 
##                                3                                2

pakis <- pakis %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(pakis)

## [1] FALSE

1.4.6 Sri Lanka

srilan <- balance1 %>% filter(Country.Name == "Sri Lanka")

colSums(is.na(srilan))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                3                                3 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                3                                3 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                3                                3 
##                  Positive.Affect                  Negative.Affect 
##                                3                                3

srilan <- srilan %>% mutate(
  Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
  Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
  Social.Support = na.fill(Social.Support, fill = "extend"),
  Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
  Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
  Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
  Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
  Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
  
anyNA(srilan)

## [1] FALSE

After all of the missing values have been filled, we bind the country data frames back together and save the result to balanced2.

balanced2 <- bind_rows(afghan, bangla, india, nepal, pakis, srilan)
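The six per-country blocks repeat the same operation, so the same result could in principle be obtained with a single split-apply-combine step. The sketch below shows that pattern in base R on a hypothetical toy panel; fill_extend is a stand-in for na.fill() with fill = "extend", and none of these names come from the original analysis.

```r
# Stand-in for na.fill(fill = "extend"): interpolate inside, carry values at the ends
fill_extend <- function(v) {
  idx <- seq_along(v)
  ok  <- !is.na(v)
  approx(x = idx[ok], y = v[ok], xout = idx, rule = 2)$y
}

# Hypothetical toy panel with gaps in both countries
toy <- data.frame(
  country = rep(c("A", "B"), each = 4),
  year    = rep(2005:2008, times = 2),
  score   = c(NA, 2, NA, 4, 5, NA, NA, 8)
)

# Split by country, fill each series in time order, then bind the pieces back
pieces <- lapply(split(toy, toy$country), function(d) {
  d <- d[order(d$year), ]
  d$score <- fill_extend(d$score)
  d
})
filled <- do.call(rbind, pieces)
rownames(filled) <- NULL

filled$score  # 2 2 3 4 5 6 7 8
```

Filling within each country matters: interpolating over the whole column at once would let values from one country leak into the gaps of the next.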

Recheck the balance of the data

pdim(balanced2)

## Balanced Panel: n = 6, T = 18, N = 108

Checking the completeness of the data

colSums(is.na(balanced2))

##                     Country.Name                             Year 
##                                0                                0 
##                      Life.Ladder               Log.GDP.Per.Capita 
##                                0                                0 
##                   Social.Support Healthy.Life.Expectancy.At.Birth 
##                                0                                0 
##     Freedom.To.Make.Life.Choices        Perceptions.Of.Corruption 
##                                0                                0 
##                  Positive.Affect                  Negative.Affect 
##                                0                                0

The data is now ready for the next step.

2 Exploratory Data Analysis

summary(balanced2)

##       Country.Name      Year     Life.Ladder    Log.GDP.Per.Capita
##  Afghanistan:18    2005   : 6   Min.   :1.281   Min.   :7.324     
##  Bangladesh :18    2006   : 6   1st Qu.:4.180   1st Qu.:7.961     
##  India      :18    2007   : 6   Median :4.479   Median :8.312     
##  Nepal      :18    2008   : 6   Mean   :4.445   Mean   :8.344     
##  Pakistan   :18    2009   : 6   3rd Qu.:4.931   3rd Qu.:8.638     
##  Sri Lanka  :18    2010   : 6   Max.   :5.982   Max.   :9.529     
##                    (Other):72                                     
##  Social.Support   Healthy.Life.Expectancy.At.Birth Freedom.To.Make.Life.Choices
##  Min.   :0.2282   Min.   :50.50                    Min.   :0.3352              
##  1st Qu.:0.5380   1st Qu.:55.48                    1st Qu.:0.6056              
##  Median :0.6150   Median :59.71                    Median :0.7322              
##  Mean   :0.6442   Mean   :59.06                    Mean   :0.6916              
##  3rd Qu.:0.7807   3rd Qu.:62.28                    3rd Qu.:0.8182              
##  Max.   :0.8737   Max.   :67.20                    Max.   :0.9064              
##                                                                                
##  Perceptions.Of.Corruption Positive.Affect  Negative.Affect 
##  Min.   :0.6169            Min.   :0.1789   Min.   :0.1523  
##  1st Qu.:0.7676            1st Qu.:0.4740   1st Qu.:0.2340  
##  Median :0.8210            Median :0.5176   Median :0.2952  
##  Mean   :0.8128            Mean   :0.5374   Mean   :0.2998  
##  3rd Qu.:0.8616            3rd Qu.:0.5897   3rd Qu.:0.3500  
##  Max.   :0.9544            Max.   :0.7894   Max.   :0.6067  
##

From the summary above, we can see that:

The highest Life.Ladder value among the South Asian countries is 5.982
The lowest Life.Ladder value among the South Asian countries is 1.281

3 Correlation between Variable

We can use the ggcorrplot() function to visualize the correlations conveniently.

We first need to drop the non-numeric columns, in this case Country.Name and Year.

balanced2 %>% select(-Country.Name, -Year) %>% cor() %>% ggcorrplot(type = "lower", lab = TRUE)

From the plot above, we can see that:

The variables with a fairly strong correlation with Life.Ladder are:
- Social.Support
- Positive.Affect
We also found indications of multicollinearity (threshold >= 0.75) between:
- Log.GDP.Per.Capita and Healthy.Life.Expectancy.At.Birth, and
- Log.GDP.Per.Capita and Positive.Affect
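Reading pairs off a plot is easy to get wrong, so the 0.75 threshold can also be applied programmatically. The following base R sketch lists every variable pair whose absolute correlation reaches the threshold; the simulated data, seed and column names are assumptions made only for illustration.

```r
set.seed(1)
x <- rnorm(50)

# Hypothetical data: b is almost a copy of a, c is unrelated noise
toy <- data.frame(a = x, b = x + rnorm(50, sd = 0.1), c = rnorm(50))

r <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(r) >= 0.75 & upper.tri(r), arr.ind = TRUE)

flagged <- data.frame(
  var1 = rownames(r)[high[, "row"]],
  var2 = colnames(r)[high[, "col"]],
  r    = r[high]
)
flagged
```

Applied to the real data, the same steps would run on the numeric columns of balanced2 instead of toy.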

3.0.1 Socio Demography Exploration

We can use the coplot() function to get a better view of our data, with these parameters:

formula = target ~ index1 | index2 (target against index1, given index2)
type = "l" for a line plot and "b" for a point-and-line plot
data = dataset
rows = number of rows in the panel layout
col = color of the plot

3.0.1.1 Life Ladder

coplot(Life.Ladder ~ Year|Country.Name,
       type = "b",
       data = balanced2,
       rows = 1,
       col = "red")

From the line plot above, we can see that:

In general, the happiest citizens in South Asia are in Pakistan, followed by Nepal, although happiness in Pakistan has been decreasing since 2017.
In contrast, the least happy citizens in South Asia are in Afghanistan, and the situation there is getting worse.

3.0.1.2 Log GDP Per Capita

coplot(Log.GDP.Per.Capita ~ Year|Country.Name,
       type = "b",
       data = balanced2,
       rows = 1,
       col = "red")

From the line plot above, we can see that:

In general, Sri Lanka has the highest Log.GDP.Per.Capita among the South Asian countries
Afghanistan is the only country facing a stagnant-to-decreasing trend in Log.GDP.Per.Capita

3.0.1.3 Social Support

coplot(Social.Support ~ Year|Country.Name,
       type = "b",
       data = balanced2,
       rows = 1,
       col = "red")

From the line plot above, we can see that:

Afghanistan and Bangladesh show a sharp decrease in Social.Support
India shows a fairly stagnant Social.Support value
Nepal and Sri Lanka share high Social.Support values
Pakistan's Social.Support value is growing

4 Modeling

4.1 Cross-Validation

This step comes before creating the model: the data is split into train data and test data. Because the data has year information, we split it sequentially by year.

Train data uses the earlier years (2005-2021)
Test data uses the latest year (2022)

Using the filter() function:

#creating train data
ladder_train <- balanced2 %>% filter(Year != 2022) 
  
#creating test data
ladder_test <- balanced2 %>% filter(Year == 2022)

After that, we have to make sure that the train data is balanced, and rebalance it if needed.

ladder_train <- ladder_train %>% 
  droplevels() %>%    # Cleaning 2022 time information
  make.pbalanced()    # doing rebalance

is.pbalanced(ladder_train)

## [1] TRUE

4.2 Multicollinearity Assumption Checking

In the earlier step, we found indications of multicollinearity between predictor variables. We therefore check the multicollinearity assumption by first fitting a regression model with the lm() function and then passing it to the vif() function.

The rule is:

VIF value > 10: multicollinearity is present in the model

VIF value < 10: no multicollinearity detected

We leave Country.Name and Year out because they are index (factor) variables.

lm(Life.Ladder ~ .-Country.Name -Year, ladder_train) %>% vif()

##               Log.GDP.Per.Capita                   Social.Support 
##                         6.364906                         2.642561 
## Healthy.Life.Expectancy.At.Birth     Freedom.To.Make.Life.Choices 
##                         4.346578                         1.973927 
##        Perceptions.Of.Corruption                  Positive.Affect 
##                         1.483278                         4.696021 
##                  Negative.Affect 
##                         1.959480

The result: the model shows no multicollinearity (all VIF values < 10).
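For intuition, the VIF of a predictor j is 1 / (1 - R²_j), where R²_j comes from regressing that predictor on all the other predictors. The base R sketch below computes it by hand; it uses the built-in mtcars dataset purely because it ships with R, and the helper name vif_manual is hypothetical.

```r
# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the others
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    r2 <- summary(lm(reformulate(others, response = p), data = data))$r.squared
    1 / (1 - r2)
  })
}

v <- vif_manual(mtcars, c("wt", "hp", "disp"))
v
```

A VIF of 1 means the predictor is uncorrelated with the others; the value grows as the other predictors explain more of its variance.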

4.3 Picking the Estimation Model

4.3.1 Model Creation

For each model, we will use the plm() function from the plm package with these parameters:

formula = Target ~ Predictors
data = dataframe
index = c("individual_column", "time_column")
model =
- "pooling" : for CEM model
- "within" : for FEM model
- "random" : for REM model

where

Target variable : Life.Ladder
Predictor variables :
- Log.GDP.Per.Capita
- Social.Support
- Healthy.Life.Expectancy.At.Birth
- Freedom.To.Make.Life.Choices
- Perceptions.Of.Corruption
- Positive.Affect
- Negative.Affect

4.3.1.1 Common Effect Model (CEM)

Create the Common Effect Model and save it to an object named cem.

cem <- plm(
  Life.Ladder ~ Log.GDP.Per.Capita
  + Social.Support
  + Healthy.Life.Expectancy.At.Birth
  + Freedom.To.Make.Life.Choices
  + Perceptions.Of.Corruption
  + Positive.Affect
  + Negative.Affect,
  data = ladder_train,
  index = c("Country.Name", "Year"),
  model = "pooling"
)

4.3.1.2 Fixed Effect Model (FEM)

Create the Fixed Effect Model and save it to an object named fem. No effect argument is passed, so plm() uses its default, effect = "individual", which gives a one-way model with country-specific effects only.

fem <- plm(
  Life.Ladder ~ Log.GDP.Per.Capita
  + Social.Support
  + Healthy.Life.Expectancy.At.Birth
  + Freedom.To.Make.Life.Choices
  + Perceptions.Of.Corruption
  + Positive.Affect
  + Negative.Affect,
  data = ladder_train,
  index = c("Country.Name", "Year"),
  model = "within"
)

4.3.1.3 Chow Test

The Chow test is used to choose between the Common Effect Model and the Fixed Effect Model. To run it, we can use the pooltest(model_cem, model_fem) function.

The hypothesis that will be tested are:

H0 : Common Effect Model
H1 : Fixed Effect Model

H0 will be rejected if P-value < α. The α value is 5%.

pooltest(cem,fem)

## 
##  F statistic
## 
## data:  Life.Ladder ~ Log.GDP.Per.Capita + Social.Support + Healthy.Life.Expectancy.At.Birth +  ...
## F = 24.717, df1 = 5, df2 = 89, p-value = 1.572e-15
## alternative hypothesis: unstability

From the test above, p-value < α, so we reject H0: the Fixed Effect Model fits the World Happiness data better than the Common Effect Model.
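The degrees of freedom in the output follow from the panel dimensions: df1 = n - 1 = 6 - 1 = 5 and df2 = N - n - K = 102 - 6 - 7 = 89. As a minimal sketch of the same kind of F-test, the base R code below compares a pooled regression with a least-squares dummy variable (LSDV) regression on simulated data. The toy data, seed and variable names are assumptions for illustration, and lm() with dummies stands in for the plm() within estimator.

```r
set.seed(42)
n_id <- 4   # individuals
n_t  <- 10  # time periods per individual

# Hypothetical panel with clearly different intercepts per individual
toy <- data.frame(
  id = factor(rep(seq_len(n_id), each = n_t)),
  x  = rnorm(n_id * n_t)
)
alpha <- c(1, 3, 5, 7)
toy$y <- alpha[as.integer(toy$id)] + 0.5 * toy$x + rnorm(n_id * n_t, sd = 0.5)

pooled <- lm(y ~ x, data = toy)       # one common intercept
lsdv   <- lm(y ~ x + id, data = toy)  # one intercept per individual

rss_pooled <- deviance(pooled)
rss_lsdv   <- deviance(lsdv)
df1 <- df.residual(pooled) - df.residual(lsdv)  # n_id - 1
df2 <- df.residual(lsdv)                        # N - n_id - K

f_stat  <- ((rss_pooled - rss_lsdv) / df1) / (rss_lsdv / df2)
p_value <- pf(f_stat, df1, df2, lower.tail = FALSE)

c(F = f_stat, df1 = df1, df2 = df2, p = p_value)
```

A small p-value means the common intercept is too restrictive, which is the same conclusion the Chow test reaches for the happiness data.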

4.3.1.4 Random Effect Model (REM)

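Create the Random Effect Model and save it to an object named rem. Note that this model uses only the first four predictors, unlike cem and fem, which use all seven.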

rem <- plm(
  Life.Ladder ~ Log.GDP.Per.Capita
  + Social.Support
  + Healthy.Life.Expectancy.At.Birth
  + Freedom.To.Make.Life.Choices,
  data = ladder_train,
  index = c("Country.Name", "Year"),
  model = "random"
)

4.3.1.5 Hausman Test

Use phtest(model_rem, model_fem) function to do the test with hypothesis:

H0 : Random Effect Model
H1 : Fixed Effect Model

Decision to reject H0 if p-value < α.

phtest(rem,fem)

## 
##  Hausman Test
## 
## data:  Life.Ladder ~ Log.GDP.Per.Capita + Social.Support + Healthy.Life.Expectancy.At.Birth +  ...
## chisq = 5.0966, df = 4, p-value = 0.2775
## alternative hypothesis: one model is inconsistent

Based on the Hausman test, p-value > α, so we fail to reject the null hypothesis (H0), which points to the Random Effect Model. For now, however, we continue with the Fixed Effect Model and run the assumption tests on it.
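The reported p-value can be reproduced from the test statistic, because the Hausman statistic is compared against a chi-squared distribution whose degrees of freedom equal the number of coefficients being compared (4 here). A short base R check, with variable names chosen for illustration:

```r
# Statistic and degrees of freedom copied from the phtest() output
chisq <- 5.0966
df    <- 4

# Upper-tail probability of the chi-squared distribution
p_value <- pchisq(chisq, df = df, lower.tail = FALSE)
round(p_value, 4)
```

The result agrees with the p-value of 0.2775 printed by phtest().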

5 Assumption Test

5.1 Normality

The hypotheses are:

H0 : The residuals are normally distributed
H1 : The residuals are not normally distributed

H0 will be rejected if P-value < α. The α value is 5%.

fem$residuals %>% shapiro.test()

## 
##  Shapiro-Wilk normality test
## 
## data:  .
## W = 0.98697, p-value = 0.4209

Based on the residual normality test, the p-value is > 0.05, so we fail to reject H0: the residuals are normally distributed.

5.2 Homogeneity

The hypotheses are:

H0 : The residuals have homogeneous variance (homoscedasticity)
H1 : The residuals do not have homogeneous variance (heteroscedasticity)

H0 will be rejected if P-value < α. The α value is 5%.

fem %>% bptest()

## 
##  studentized Breusch-Pagan test
## 
## data:  .
## BP = 11.694, df = 7, p-value = 0.1111

Based on the homogeneity test, the p-value is > 0.05, so we fail to reject H0: the residuals have homogeneous variance.

5.3 Autocorrelation

The hypotheses are:

H0 : There is no autocorrelation in the residuals
H1 : Autocorrelation occurs in the residuals

H0 will be rejected if P-value < α. The α value is 5%.

fem$residuals %>% Box.test(type = "Ljung-Box")

## 
##  Box-Ljung test
## 
## data:  .
## X-squared = 17.093, df = 1, p-value = 3.559e-05

Based on the autocorrelation test, the p-value is < 0.05, so we reject H0: the residuals are autocorrelated.
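All three assumption tests share the same decision rule, which can be written as a small helper. The base R sketch below applies it to the p-values reported in this section; the function name decide is hypothetical.

```r
# H0 is rejected only when the p-value falls below alpha
decide <- function(p_value, alpha = 0.05) {
  ifelse(p_value < alpha, "reject H0", "fail to reject H0")
}

# p-values copied from the three test outputs in this section
p_values <- c(normality = 0.4209, homogeneity = 0.1111, autocorrelation = 3.559e-05)
decide(p_values)
```

Failing to reject H0 is the favourable outcome for the normality and homogeneity tests, while rejecting H0 in the Ljung-Box test signals autocorrelation in the residuals.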

6 Model Interpretation

6.1 Coefficient

summary(fem)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = Life.Ladder ~ Log.GDP.Per.Capita + Social.Support + 
##     Healthy.Life.Expectancy.At.Birth + Freedom.To.Make.Life.Choices + 
##     Perceptions.Of.Corruption + Positive.Affect + Negative.Affect, 
##     data = ladder_train, model = "within", index = c("Country.Name", 
##         "Year"))
## 
## Balanced Panel: n = 6, T = 17, N = 102
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -0.926489 -0.280156  0.033346  0.276894  1.252755 
## 
## Coefficients:
##                                   Estimate Std. Error t-value  Pr(>|t|)    
## Log.GDP.Per.Capita                1.432868   0.651477  2.1994 0.0304402 *  
## Social.Support                    1.999088   0.787535  2.5384 0.0128742 *  
## Healthy.Life.Expectancy.At.Birth -0.253399   0.066212 -3.8271 0.0002406 ***
## Freedom.To.Make.Life.Choices      0.235235   0.477405  0.4927 0.6234118    
## Perceptions.Of.Corruption        -0.699380   0.955635 -0.7318 0.4661843    
## Positive.Affect                   0.047065   0.950073  0.0495 0.9606017    
## Negative.Affect                  -1.365632   0.892104 -1.5308 0.1293656    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    23.896
## Residual Sum of Squares: 14.977
## R-Squared:      0.37326
## Adj. R-Squared: 0.28876
## F-statistic: 7.57217 on 7 and 89 DF, p-value: 3.8176e-07

Interpretation:

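\[
\begin{aligned}
\text{Life.Ladder}_{it} ={}& 1.432868 \cdot \text{Log.GDP.Per.Capita}_{it} + 1.999088 \cdot \text{Social.Support}_{it} \\
&- 0.253399 \cdot \text{Healthy.Life.Expectancy.At.Birth}_{it} + 0.235235 \cdot \text{Freedom.To.Make.Life.Choices}_{it} \\
&- 0.699380 \cdot \text{Perceptions.Of.Corruption}_{it} + 0.047065 \cdot \text{Positive.Affect}_{it} \\
&- 1.365632 \cdot \text{Negative.Affect}_{it} + u_{it}
\end{aligned}
\]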

The variables that significantly affect the level of happiness (p-value < 0.05) are:
- Log.GDP.Per.Capita
- Social.Support
- Healthy.Life.Expectancy.At.Birth

The level of people’s happiness in a country will increase by 1.432868 for every 1-unit increase in Log.GDP.Per.Capita, provided that the other variables are held fixed.
The level of people’s happiness in a country will increase by 1.999088 for every 1-unit increase in Social.Support, provided that the other variables are held fixed.
The level of people’s happiness in a country will decrease by 0.253399 for every 1-unit increase in Healthy.Life.Expectancy.At.Birth, provided that the other variables are held fixed.
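The 5% significance screen can be reproduced from the coefficient table printed by `summary(fem)`. The sketch below copies the estimates and p-values from that output into a data frame (the names `coefs`, `significant` and `effect_social` are illustrative) and also works out the effect of a 0.1-unit change, which is a more realistic step for variables measured on a 0-1 scale.

```r
# Estimates and p-values copied from the summary(fem) output above
coefs <- data.frame(
  term = c("Log.GDP.Per.Capita", "Social.Support",
           "Healthy.Life.Expectancy.At.Birth", "Freedom.To.Make.Life.Choices",
           "Perceptions.Of.Corruption", "Positive.Affect", "Negative.Affect"),
  estimate = c(1.432868, 1.999088, -0.253399, 0.235235,
               -0.699380, 0.047065, -1.365632),
  p_value  = c(0.0304402, 0.0128742, 0.0002406, 0.6234118,
               0.4661843, 0.9606017, 0.1293656),
  stringsAsFactors = FALSE
)

alpha <- 0.05
significant <- coefs[coefs$p_value < alpha, ]   # keep terms with p < 0.05
significant$term

# Expected change in Life.Ladder for a 0.1-unit rise in Social.Support,
# holding the other variables fixed
effect_social <- 0.1 * coefs$estimate[coefs$term == "Social.Support"]
```

Only the first three terms pass the screen; Negative.Affect (p = 0.129) and Positive.Affect (p = 0.961) do not.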

6.2 Fixed Effect Model Information Extraction

Using the fixef() function on the fem model, we can extract the country-specific intercepts:

fixef(fem)

## Afghanistan  Bangladesh       India       Nepal    Pakistan   Sri Lanka 
##      5.8346      8.1019      6.6609      7.7865      6.8733      6.5465

Interpretation:

Each value is the country-specific intercept, i.e. the estimated baseline level of happiness when all predictors are held at zero:

The baseline level of people’s happiness in Afghanistan is 5.8346
The baseline level of people’s happiness in Bangladesh is 8.1019
The baseline level of people’s happiness in India is 6.6609
The baseline level of people’s happiness in Nepal is 7.7865
The baseline level of people’s happiness in Pakistan is 6.8733
The baseline level of people’s happiness in Sri Lanka is 6.5465
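To show where such country intercepts come from, here is a small base-R sketch of the within (fixed effects) estimator on a toy panel. The data frame `toy`, its three countries and the true slope of 1.5 are all made-up assumptions. The slope is estimated on group-demeaned data, and each intercept is then recovered as the group mean of y minus the slope times the group mean of x.

```r
# Within estimator and fixed-effect recovery on a toy panel (hypothetical data)
set.seed(7)
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 10),
  x       = rnorm(30),
  stringsAsFactors = FALSE
)
true_alpha <- c(A = 5, B = 7, C = 6)
toy$y <- unname(true_alpha[toy$country]) + 1.5 * toy$x + rnorm(30, sd = 0.2)

# Within transformation: subtract each country's mean
toy$x_dm <- toy$x - ave(toy$x, toy$country)
toy$y_dm <- toy$y - ave(toy$y, toy$country)
beta_within <- coef(lm(y_dm ~ x_dm - 1, data = toy))[["x_dm"]]

# Country intercepts: alpha_i = mean(y_i) - beta * mean(x_i)
alpha_hat <- tapply(toy$y, toy$country, mean) -
  beta_within * tapply(toy$x, toy$country, mean)

# Cross-check with a dummy-variable (LSDV) regression
lsdv <- lm(y ~ 0 + factor(country) + x, data = toy)
```

The LSDV regression yields the same slope and intercepts, which is why the fixed effects can be read as one intercept per country.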

6.2.1 Prediction and Evaluation

We use the predict() function to make predictions, with these parameters:

- object = name of the model
- newdata = new data to be predicted

pred <- predict(fem, ladder_test, na.fill = FALSE)

We can use the MAPE (mean absolute percentage error) metric to evaluate how well the model predicts, using the MAPE() function with these parameters:

y_pred = prediction result value
y_true = real target value

MAPE(y_pred = pred,
     y_true = ladder_test$Life.Ladder)

## [1] 0.3730909

Insight: A MAPE of 0.373 means the predictions deviate from the actual values by about 37% on average, so the FEM model is only about 63% accurate (1 - 0.373). The modeling process can be adjusted to obtain a better result.
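For reference, the metric can be written in a few lines of base R. This is a minimal sketch assuming MAPE() follows the usual definition, the mean of |actual - predicted| / |actual|; the helper name `mape` and the three example values are hypothetical.

```r
# Mean absolute percentage error, written out explicitly
mape <- function(y_pred, y_true) {
  mean(abs((y_true - y_pred) / y_true))
}

# Hypothetical values: the absolute percentage errors are 0.25, 0.20 and 0.20
y_true <- c(4.0, 5.0, 2.5)
y_pred <- c(5.0, 4.0, 3.0)

err <- mape(y_pred, y_true)   # (0.25 + 0.20 + 0.20) / 3
accuracy <- 1 - err           # the "1 - MAPE" reading used above
```

Because each error is divided by the actual value, MAPE is scale-free, but it becomes unstable when actual values are close to zero. That is not a concern for Life.Ladder scores in this range.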

7 Conclusion

From the analysis steps above, the conclusions are:

Pakistan has the highest level of happiness in South Asia.
Afghanistan has the lowest level of happiness in South Asia.
The variables that significantly influence the level of happiness of people in a country are Log.GDP.Per.Capita, Social.Support and Healthy.Life.Expectancy.At.Birth.
From the final model, we know that the individual index (country) influences people’s level of happiness, which means that each country has different characteristics regarding the level of happiness of its people.

Unveiling Socio Demographic Patterns: Exploring Data Panels in R

Part of Learn By Building by Algoritma Data Science School

Ronny G.

2 March 2024
