Are you happy? How are other people? And were they happier in the past? These questions are very difficult to answer due to the uncertainty of the definition of “happiness”. However, “The World Happiness Report” generates an annual report regarding global happiness and life satisfaction. The score is based on the pooled results from the Gallup World Poll, which surveys more than 160 countries. The survey itself asks: “On which step ladder do you imagine your life is?” The highest and lowest possible levels of satisfaction are 10 and 0, respectively. This life satisfaction scale is called the “Cantril Ladder” method. The score incorporates several factors: economic production, social support, life expectancy, freedom, absence of corruption, and generosity. These factors do not affect the total score assigned to each country, but they do describe why some countries rank higher than others.
In this project, we will analyze the happiness levels across various
countries in the North America and ANZ region by leveraging
socio-demographic data available in the World Happiness Report. Our
analysis will be conducted using the plm() function in R,
which allows for panel data analysis. Through this analysis, we aim to
gain insights into the socio-economic dynamics and their impact on the
well-being of individuals in North America and ANZ.
Project Expectation
This project, as part of Learn by Building (LBB), focuses on applying
knowledge related to exploring panel data using the plm()
function in R. The sequence of the process include:
- Library Preparation
- Importing Data
- Data Preparation
- Exploratory Data Analysis
- Summary of Data
- Modeling
- Conclusion and Recommendations
Libraries Preparation
Below is the list of libraries used in this project.
#Packages for processing the dataframe
library(dplyr)
library(tidyr)
library(lubridate)
#Packages for visualization
library(ggcorrplot)
library(gplots)
library(ggplot2)
library(plotly)
library(foreign)
#Packages for analyzing
library(plm)
library(lfe)
library(lmtest)
library(car)
library(tseries)
library(MLmetrics)
library(inspectdf)
Importing Data
These datasets were created by “The World Happiness Report”. However, the datasets is collected in csv format from the Algoritma Database
First, read raw dataset from the working directory and assign them as
variables called data. Then, observe the data by applying
glimpse().
# Read the dataframe
data <- read.csv("data_input/World_Happiness_Report.csv")
# Check the dataframe
glimpse(data)
## Rows: 2,199
## Columns: 13
## $ Country.Name <chr> "Afghanistan", "Afghanistan", "Afgha…
## $ Regional.Indicator <chr> "South Asia", "South Asia", "South A…
## $ Year <int> 2008, 2009, 2010, 2011, 2012, 2013, …
## $ Life.Ladder <dbl> 3.723590, 4.401778, 4.758381, 3.8317…
## $ Log.GDP.Per.Capita <dbl> 7.350416, 7.508646, 7.613900, 7.5812…
## $ Social.Support <dbl> 0.4506623, 0.5523084, 0.5390752, 0.5…
## $ Healthy.Life.Expectancy.At.Birth <dbl> 50.500, 50.800, 51.100, 51.400, 51.7…
## $ Freedom.To.Make.Life.Choices <dbl> 0.7181143, 0.6788964, 0.6001272, 0.4…
## $ Generosity <dbl> 0.167652458, 0.190808803, 0.12131604…
## $ Perceptions.Of.Corruption <dbl> 0.8816863, 0.8500354, 0.7067661, 0.7…
## $ Positive.Affect <dbl> 0.4142970, 0.4814214, 0.5169067, 0.4…
## $ Negative.Affect <dbl> 0.2581955, 0.2370924, 0.2753238, 0.2…
## $ Confidence.In.National.Government <dbl> 0.6120721, 0.6115452, 0.2993574, 0.3…
We found that the data has 2199 rows and 13 columns. The
variables or factors in the dataset include:
- Life.Ladder: An indicator of life satisfaction, ranging from 0 to 10.
- Log.GDP.Per.Capita: A measure representing a country’s standard of living, describing how much citizens benefit from their country’s economy.
- Healthy.Life.Expectancy.At.Birth: The average number of years of healthy life for the population.
- Social.Support: The national average of binary responses (either 0 or 1) to the question of whether a person has someone to count on in times of trouble.
- Freedom.To.Make.Life.Choices: The national average of responses to the question of whether individuals are happy with the freedom they have to choose what they want to do in their life.
- Generosity: The residual of regressing the national average response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
- Perceptions.Of.Corruption: The national average of survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” The overall perception is the average of the two 0-or-1 responses.
- Positive.Affect : Average size of the effect the day before for laughter, pleasure, and interest.
- Negative.Affect : Average size of the effect the day before for worries, sadness, and anger.
- Confidence.In.National.Goverment : How much trust in the government.
For detailed information, please refer to this link.”
Data Preparation
Select Region
The objective for this section entails specific data manipulation tasks.
- Observe the Distinct Values in Regional.Indicator:
- Identify the unique categories within the
Regional.Indicatorvariable to understand the diversity of regions in the dataset.
- Identify the unique categories within the
- Filter the Database for Only North America and ANZ
Region:
- Extract data related to countries categorized under the North America and ANZ region to focus the analysis specifically on this geographical area.
- Drop Column Regional.Indicator:
- Remove the
Regional Indicatorcolumn from the dataset as it’s no longer needed after filtering for North America and ANZ.
- Remove the
# Check the distinct value for the Regional.Indicator
unique(data$Regional.Indicator)
## [1] "South Asia" "Central and Eastern Europe"
## [3] "Middle East and North Africa" ""
## [5] "Latin America and Caribbean" "Commonwealth of Independent States"
## [7] "North America and ANZ" "Western Europe"
## [9] "Sub-Saharan Africa" "Southeast Asia"
## [11] "East Asia"
As seen above, there are 11 distinct values in the
Regional.Indicator column. Since we want to focus
solely on the North America and ANZ region, we need to filter the
dataframe using the filter() function.
# Filter the dataset only for North America and ANZ region and drop the Regional.Indicator column
data <- data %>%
filter(Regional.Indicator == "North America and ANZ") %>%
select(-c(Regional.Indicator))
# Check the dataframe
glimpse(data)
## Rows: 66
## Columns: 12
## $ Country.Name <chr> "Australia", "Australia", "Australia…
## $ Year <int> 2005, 2007, 2008, 2010, 2011, 2012, …
## $ Life.Ladder <dbl> 7.340688, 7.285391, 7.253757, 7.4500…
## $ Log.GDP.Per.Capita <dbl> 10.66206, 10.69443, 10.70946, 10.713…
## $ Social.Support <dbl> 0.9678922, 0.9652762, 0.9466352, 0.9…
## $ Healthy.Life.Expectancy.At.Birth <dbl> 69.800, 69.960, 70.040, 70.200, 70.2…
## $ Freedom.To.Make.Life.Choices <dbl> 0.9349733, 0.8906820, 0.9157333, 0.9…
## $ Generosity <dbl> NA, 0.3434340, 0.3017218, 0.3131209,…
## $ Perceptions.Of.Corruption <dbl> 0.3904159, 0.5125785, 0.4308105, 0.3…
## $ Positive.Affect <dbl> 0.7697703, 0.7623039, 0.7289923, 0.7…
## $ Negative.Affect <dbl> 0.2380121, 0.2153506, 0.2184273, 0.2…
## $ Confidence.In.National.Government <dbl> 0.5320633, 0.5269328, 0.6446754, 0.6…
Insight:
- Our dataset consists of 66 rows and 12 columns.
- There are 4 countries in the North America and ANZ region.
Check Data Balance
Now, we will use two methods to check the data balance.
1. Observe data frequency by indexing each individual using
the table() function.
table(data$Country.Name)
##
## Australia Canada New Zealand United States
## 16 17 16 17
2. Using p_balanced() function
Since our dataframe is not yet converted to a
pdata.frame, we will add the parameter
index("individual index", "time index"). We expect to have
a value of TRUE if the dataframe is balanced.
is.pbalanced(data , index = c("Country.Name", "Year"))
## [1] FALSE
Insights:
- Data Frame Balance:
- The data frame is not balanced, suggesting that some countries may have more data points or observations compared to others.
- Information Availability among North America and ANZ
Countries:
- Canada and the United States have the highest amount of available information among the North America and ANZ countries.
- Conversely, New Zealand and Australia have the lowest amount of available information compared to the other countries.
- Based on this result, we will retain all the countries.
Data Structure Adjustment
1. Create a panel dataframe
Before balancing, we need to create a pdata.frame
using pdata.frame() with the following parameters:
data: The dataset to be used.index:c("individual index", "time index")
#Filter dataframe and hange datatype of Country.Name and Year as factor
data <- data %>%
filter(Country.Name %in% c( "Australia", "Canada", "New Zealand", "United States" )) %>%
mutate(Country.Name = as.factor(Country.Name)) %>%
mutate(Year= as.factor(Year))
# Change to pdata.frame
data <- data %>%
pdata.frame(index = c("Country.Name","Year"))
#Check structure of data
glimpse(data)
## Rows: 66
## Columns: 12
## $ Country.Name <fct> Australia, Australia, Australia, Aus…
## $ Year <fct> 2005, 2007, 2008, 2010, 2011, 2012, …
## $ Life.Ladder <pseries> 7.340688, 7.285391, 7.253757, 7.…
## $ Log.GDP.Per.Capita <pseries> 10.66206, 10.69443, 10.70946, 10…
## $ Social.Support <pseries> 0.9678922, 0.9652762, 0.9466352,…
## $ Healthy.Life.Expectancy.At.Birth <pseries> 69.800, 69.960, 70.040, 70.200, …
## $ Freedom.To.Make.Life.Choices <pseries> 0.9349733, 0.8906820, 0.9157333,…
## $ Generosity <pseries> NA, 0.3434340, 0.3017218, 0.3131…
## $ Perceptions.Of.Corruption <pseries> 0.3904159, 0.5125785, 0.4308105,…
## $ Positive.Affect <pseries> 0.7697703, 0.7623039, 0.7289923,…
## $ Negative.Affect <pseries> 0.2380121, 0.2153506, 0.2184273,…
## $ Confidence.In.National.Government <pseries> 0.5320633, 0.5269328, 0.6446754,…
Insight :
Based on the provided information, it seems that the data types of
most columns have been automatically changed to pseries,
except for the index columns Country.Name and
Year, which changed to factor.
2. Check Data Dimension
We can use pdim() to check our pdata.frame
dimension
pdim(data)
## Unbalanced Panel: n = 4, T = 16-17, N = 66
Insight:
- The number of data points is currently not balanced.
- There are data from 4 countries.
- The time index ranges from 16 to 17.
- The total number of observations is 66.
Balancing Data
Now, we will balance the data using make.pbalanced with
the parameter balance.type.
1. fill: Any missing values in the time columns
will be filled with NA.
balance1 <- data %>%
make.pbalanced(balance.type = "fill")
table(balance1$Country.Name)
##
## Australia Canada New Zealand United States
## 18 18 18 18
unique(balance1$Year)
## [1] 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
## [16] 2020 2021 2022
## 18 Levels: 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 ... 2022
2. shared.times: This option will select all
individuals with the condition that time information is available for
all individuals.
balance2 <- data %>%
make.pbalanced(balance.type = "shared.times")
table(balance2$Country.Name)
##
## Australia Canada New Zealand United States
## 15 15 15 15
unique(balance2$Year)
## [1] 2007 2008 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021 2022
## 18 Levels: 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 ... 2022
3.shared.individuals: This option will select
individuals with complete time information.
balance3 <- data %>%
make.pbalanced(balance.type = "shared.individuals")
table(balance3$Country.Name)
##
## Australia Canada New Zealand United States
## 0 0 0 0
unique(balance3$Year)
## factor()
## 18 Levels: 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 ... 2022
Insight :
Based on the result, we will use the fill option since
it offers the possibility to achieve the most optimal balancing. We will
save the balanced data in an object called balance1.
# Check the balance of the data
is.pbalanced(balance1)
## [1] TRUE
# Check the dimension of the data
pdim(balance1)
## Balanced Panel: n = 4, T = 18, N = 72
Insight :
The balance1 dataset consists of 4 countries, with a
total of 72 observations, and the time index information includes 18
time points.
Check Missing Value
colSums(is.na(balance1)) - colSums(is.na(data))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 6 6
## Social.Support Healthy.Life.Expectancy.At.Birth
## 6 6
## Freedom.To.Make.Life.Choices Generosity
## 6 6
## Perceptions.Of.Corruption Positive.Affect
## 6 6
## Negative.Affect Confidence.In.National.Government
## 6 6
Insight :
Based on the result above, after conducting the balancing process, 6 rows with NA values were added in each column. The next step involves checking the completeness of the balancing result to ensure that the data is properly aligned and no important information has been lost during the balancing process.
colSums(is.na(balance1))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 6 6
## Social.Support Healthy.Life.Expectancy.At.Birth
## 8 6
## Freedom.To.Make.Life.Choices Generosity
## 6 8
## Perceptions.Of.Corruption Positive.Affect
## 6 6
## Negative.Affect Confidence.In.National.Government
## 6 10
Insight:
Based on the provided information, the column Confident.In.National.Government has a high number of missing values compared to other columns, approximately 1/7 of the total entries. Therefore, it is suggested to drop this column as it may not provide meaningful information due to the large proportion of missing data.
# Drop the column
balance1 <- balance1 %>%
select(-c(Confidence.In.National.Government))
Handling missing values
To address missing values in other columns, filling them up will be done using interpolation for each country. Interpolation can help in estimating missing values based on existing data points, which can improve the completeness of the dataset while preserving the overall trend or pattern within each country’s data.
1. Australia
# Check the number of missing values for each country
aus <- balance1 %>%
filter(Country.Name == "Australia")
colSums(is.na(aus))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 2 2
## Social.Support Healthy.Life.Expectancy.At.Birth
## 2 2
## Freedom.To.Make.Life.Choices Generosity
## 2 3
## Perceptions.Of.Corruption Positive.Affect
## 2 2
## Negative.Affect
## 2
# Fill the missing values
aus <- aus %>% mutate(
Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
Social.Support = na.fill(Social.Support, fill = "extend"),
Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
Generosity = na.fill(Generosity, fill = "extend"),
Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
anyNA(aus)
## [1] FALSE
2. Canada
# Check the number of missing values for each country
can <- balance1 %>%
filter(Country.Name == "Canada")
colSums(is.na(can))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 1 1
## Social.Support Healthy.Life.Expectancy.At.Birth
## 2 1
## Freedom.To.Make.Life.Choices Generosity
## 1 1
## Perceptions.Of.Corruption Positive.Affect
## 1 1
## Negative.Affect
## 1
# Fill the missing values
can <- can %>% mutate(
Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
Social.Support = na.fill(Social.Support, fill = "extend"),
Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
Generosity = na.fill(Generosity, fill = "extend"),
Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
anyNA(can)
## [1] FALSE
3. New Zealand
# Check the number of missing values for each country
new <- balance1 %>%
filter(Country.Name == "New Zealand")
colSums(is.na(new))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 2 2
## Social.Support Healthy.Life.Expectancy.At.Birth
## 2 2
## Freedom.To.Make.Life.Choices Generosity
## 2 2
## Perceptions.Of.Corruption Positive.Affect
## 2 2
## Negative.Affect
## 2
# Fill the missing values
new <- new %>% mutate(
Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
Social.Support = na.fill(Social.Support, fill = "extend"),
Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
Generosity = na.fill(Generosity, fill = "extend"),
Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
anyNA(new)
## [1] FALSE
4. United States
# Check the number of missing values for each country
uni <- balance1 %>%
filter(Country.Name == "United States")
colSums(is.na(uni))
## Country.Name Year
## 0 0
## Life.Ladder Log.GDP.Per.Capita
## 1 1
## Social.Support Healthy.Life.Expectancy.At.Birth
## 2 1
## Freedom.To.Make.Life.Choices Generosity
## 1 2
## Perceptions.Of.Corruption Positive.Affect
## 1 1
## Negative.Affect
## 1
# Fill the missing values
uni <- uni %>% mutate(
Life.Ladder = na.fill(Life.Ladder, fill = "extend"),
Log.GDP.Per.Capita = na.fill(Log.GDP.Per.Capita, fill = "extend"),
Social.Support = na.fill(Social.Support, fill = "extend"),
Healthy.Life.Expectancy.At.Birth = na.fill(Healthy.Life.Expectancy.At.Birth, fill = "extend"),
Freedom.To.Make.Life.Choices = na.fill(Freedom.To.Make.Life.Choices, fill = "extend"),
Perceptions.Of.Corruption = na.fill(Perceptions.Of.Corruption, fill = "extend"),
Positive.Affect = na.fill(Positive.Affect, fill = "extend"),
Generosity = na.fill(Generosity, fill = "extend"),
Negative.Affect = na.fill(Negative.Affect, fill = "extend"))
anyNA(uni)
## [1] FALSE
After filling in all the missing values in each country, we will now
merge the data and save it in the object called “balanced2” using the
bind_rows function.
Merge the Data
# Merge data
balanced2 <- bind_rows(aus, can, new, uni)
# Check structure data
glimpse(balanced2)
## Rows: 72
## Columns: 11
## $ Country.Name <fct> Australia, Australia, Australia, Aust…
## $ Year <fct> 2005, 2006, 2007, 2008, 2009, 2010, 2…
## $ Life.Ladder <pseries> 7.340688, 7.313040, 7.285391, 7.2…
## $ Log.GDP.Per.Capita <pseries> 10.66206, 10.67825, 10.69443, 10.…
## $ Social.Support <pseries> 0.9678922, 0.9665842, 0.9652762, …
## $ Healthy.Life.Expectancy.At.Birth <pseries> 69.800, 69.880, 69.960, 70.040, 7…
## $ Freedom.To.Make.Life.Choices <pseries> 0.9349733, 0.9128276, 0.8906820, …
## $ Generosity <pseries> 0.3434340, 0.3434340, 0.3434340, …
## $ Perceptions.Of.Corruption <pseries> 0.3904159, 0.4514972, 0.5125785, …
## $ Positive.Affect <pseries> 0.7697703, 0.7660371, 0.7623039, …
## $ Negative.Affect <pseries> 0.2380121, 0.2266814, 0.2153506, …
Now, we are ready to move the next step which is exploratory data analysis
Exploratory Data Analysis
Summary Data
summary(balanced2)
## Country.Name Year Life.Ladder Log.GDP.Per.Capita
## Australia :18 2005 : 4 Min. :6.693 Min. :10.53
## Canada :18 2006 : 4 1st Qu.:7.137 1st Qu.:10.68
## New Zealand :18 2007 : 4 Median :7.252 Median :10.75
## United States:18 2008 : 4 Mean :7.236 Mean :10.77
## 2009 : 4 3rd Qu.:7.366 3rd Qu.:10.86
## 2010 : 4 Max. :7.650 Max. :11.08
## (Other):48
## Social.Support Healthy.Life.Expectancy.At.Birth Freedom.To.Make.Life.Choices
## Min. :0.8968 Min. :65.72 Min. :0.7356
## 1st Qu.:0.9254 1st Qu.:68.98 1st Qu.:0.8764
## Median :0.9424 Median :70.12 Median :0.9139
## Mean :0.9399 Mean :69.47 Mean :0.8989
## 3rd Qu.:0.9538 3rd Qu.:70.76 3rd Qu.:0.9321
## Max. :0.9873 Max. :71.45 Max. :0.9573
##
## Generosity Perceptions.Of.Corruption Positive.Affect Negative.Affect
## Min. :0.02932 Min. :0.1859 Min. :0.7060 Min. :0.1514
## 1st Qu.:0.19095 1st Qu.:0.3509 1st Qu.:0.7398 1st Qu.:0.2102
## Median :0.24089 Median :0.4199 Median :0.7692 Median :0.2333
## Mean :0.23037 Mean :0.4491 Mean :0.7654 Mean :0.2345
## 3rd Qu.:0.28326 3rd Qu.:0.5590 3rd Qu.:0.7852 3rd Qu.:0.2604
## Max. :0.36576 Max. :0.7469 Max. :0.8247 Max. :0.3067
##
inspect_num(balanced2) %>%
show_plot()
Insights:
- The highest level of happiness observed in several countries in North America and ANZ is 7.650.
- The lowest level of happiness observed in several countries in North America and ANZ is 6.693.
- Variable
Freedom.To.Make.Life.Choiceshas left-skew because most of the values are distributed on higher end of the scale.
Relationship between variable
To explore the relationships between predictor variables and the
target variable (happiness), we can utilize the ggcorrplot
function. This function allows us to visualize correlations between
variables, helping us understand how different factors may be related to
happiness levels in North America and ANZ countries.
balanced2 %>%
select (-Country.Name, -Year) %>%
cor() %>%
ggcorrplot(type = "lower", lab = TRUE)
Insight :
There is strong positive relationship between
Life.LadderandSocial.Support,Freedom.To.Make.Life.Choices,GenerosityandPositive.AffectThere is strong negative relationship between
Life.LadderandLog.GDP.Per.Capita.There are indications of multicollinearity between:
Log.GDP.Per.CapitaandPerception.Of.Curroption
Multicollinearity suggests that these variables may be highly correlated with each other, which can affect the reliability of regression analysis results.
Exploration of Socio-demographic Factors
To observe information from our data, we will use the
coplot() function with the following parameters:
formula: The formula specifying the target variable in relation to the two indices.type: Set to"l"for line plots and"b"for point and line plots.data: The dataset to be used for analysis.rows: The number of rows in the panel plot.col: The color of the plot.
1. Life.ladder
coplot(Life.Ladder ~ Year|Country.Name,
type = "b",
data= balanced2,
rows = 1,
col = "blue")
Insight :
- The Life Ladder across all countries has shown a declining trend in recent years, with a significant drop occurring from 2020 to 2022 in most countries.
- This declining trend is particularly notable in Canada and the United States.
- Specifically, the Life Ladder trend in the United States reached its lowest level in 2022. Conversely, the highest level of Life Ladder was observed in Canada in 2010.
2. Log.GDP.Per.Capita
# Your Code Here
coplot(Log.GDP.Per.Capita ~ Year|Country.Name,
type = "b",
data= balanced2,
rows = 1,
col = "blue")
Insight:
- Overall, the GDP trend over the past years shows an increase among the countries. It is observed that in 2020, the GDP in all countries dropped but continued to increase in the following years.
- The GDP in the United States reached its highest point in 2022 and has consistently been above the average GDP.
- New Zealand has the lowest GDP compared to other countries, while Australia and Canada exhibit relatively similar patterns of GDP trends.
3. Perceptions.Of.Corruption
# Perceptions.Of.Corruption
coplot(Perceptions.Of.Corruption ~ Year|Country.Name,
type = "b",
data= balanced2,
rows = 1,
col = "blue")
Insight:
- The Perception of Corruption in the United States has consistently been high over the past years, and the trend appears stagnant.
- Australia and Canada exhibit relatively similar trends, fluctuating around the average point.
- New Zealand has the lowest Perception of Corruption among the other countries.
4. Freedom.To.Make.Life.Choices
# Freedom.To.Make.Life.Choices
coplot(Freedom.To.Make.Life.Choices ~ Year|Country.Name,
type = "b",
data= balanced2,
rows = 1,
col = "blue")
Insight:
- Overall, there was a significant drop in the Freedom To Make Life Choices in most countries in 2021, with the highest drop observed in the United States.
- The declining trend of Freedom To Make Life Choices in the United States is notably significant compared to other countries.
- The level of Freedom To Make Life Choices is relatively similar and quite high in Australia, Canada, and New Zealand, contrasting with the situation in the United States.
Heterogeneity in Life Ladder
To observe heterogeneity among individuals and over time, we can use
the plotmeans() function from the gplots
package with the following parameters:
- formula: Target ~ individual/time index variable
- data: data frame
1. Heterogeneity among countries
plotmeans( Life.Ladder ~ Country.Name, data = balanced2, main="Heterogeneity among countries")
2. Heterogenitas antar Waktu
plotmeans(Life.Ladder ~ Year, data = balanced2, main="Heterogeneity among years")
Insight :
Based on the visual results above, it appears that the data among countries is quite heterogeneous.
Modeling
Cross-Validation
The data will be divided into training and testing datasets. Since panel data contains time information, the data splitting should not be random but sequential.
- The training data will use older data.
- The testing data will use newer data.
To achieve this, we can use the filter() function.
# Create data train
ladder_train <- balanced2 %>% filter(Year != 2022)
# Create data test
ladder_test <- balanced2 %>% filter(Year == 2022)
After performing cross-validation, we need to ensure that the training data is balanced by conducting balancing. In this step, we will drop the time information that was used for the testing dataset, and then we will conduct balancing.
# Drop the time information & balancing
ladder_train <- ladder_train %>%
droplevels() %>%
make.pbalanced()
# check the data train
is.pbalanced(ladder_train)
## [1] TRUE
Multicollinearity Assumption Checking
Due to the correlation analysis results in the previous EDA stage
indicating multicollinearity among predictor variables, we will first
check the assumption of multicollinearity by creating a regression model
using the lm() function and then testing using the
vif() function.
- VIF value > 10: indicates multicollinearity in the model
- VIF value < 10: indicates no multicollinearity in the model
# your code here
lm(Life.Ladder ~.-Country.Name -Year, ladder_train) %>%
vif()
## Log.GDP.Per.Capita Social.Support
## 8.680726 2.282650
## Healthy.Life.Expectancy.At.Birth Freedom.To.Make.Life.Choices
## 4.795869 4.766495
## Generosity Perceptions.Of.Corruption
## 1.618213 10.570769
## Positive.Affect Negative.Affect
## 1.645778 2.242920
Insight:
- The table above indicates that the
Perceptions.Of.Corruptionvariable has a VIF value greater than 10. Therefore, we will not use this column in our analysis. - In the correlation chart above, it is indicated that there is
evidence of multicollinearity between
Log.GDP.Per.CapitaandPerception.Of.Corruption. However, since thePerception.Of.Corruptionvariable is dropped, we can useLog.GDP.Per.Capita.
Model Estimation
Model Creation
For each model creation, we will use the plm() function
from the plm package with the following parameters:
formula: Target ~ Predictorsdata: dataframeindex: c(“individual column”,“time column”)model:"pooling": for Pooled OLS model"within": for Fixed Effects model"random": for Random Effects model
where
- Target variable: Life Ladder
- Predictor variables (which has the strong positive/negative
relationship with Life.Ladder as mentioned in correlation chart above)
- Log.GDP.Per.Capita
- Social.Support
- Freedom.To.Make.Life.Choices
- Positive.Affect
- Generosity
Combined Effects Model (CEM)
Creating a Common Effects Model and storing it in the
cem object.
cem <- plm(Life.Ladder ~ Log.GDP.Per.Capita
+ Social.Support
+ Freedom.To.Make.Life.Choices
+ Positive.Affect
+ Generosity,
data = ladder_train,
index = c("Country.Name", "Year"),
model = "pooling")
Fixed Effect Model (FEM)
fem <- plm(Life.Ladder ~ Log.GDP.Per.Capita
+ Social.Support
+ Freedom.To.Make.Life.Choices
+ Positive.Affect
+ Generosity,
data = ladder_train,
index = c("Country.Name","Year"),
model = "within")
Chow Test
The Chow test is conducted to select the best model between the
Combined Effects Model (CEM) and the Fixed Effects Model (FEM). To
perform the Chow test, we can use the pooltest()
function.
The hypotheses:
- H0: Combined Effects Model
- H1: Fixed Effects Model
H0 is rejected if P-value < α. The significance level α used is 5%.
pooltest(cem,fem)
##
## F statistic
##
## data: Life.Ladder ~ Log.GDP.Per.Capita + Social.Support + Freedom.To.Make.Life.Choices + ...
## F = 4.1007, df1 = 3, df2 = 59, p-value = 0.01037
## alternative hypothesis: unstability
Insight :
Based on the results of the Chow test above, we obtained a p-value < α. This means that the best model to use is the Fixed Effects Model.
Random Effects Model (REM)
Creating a Random Effects Model and storing it in the rem object. However, since the number of predictors exceeds the number of individual variables, we cannot use the random effect model effectively. Thus, we will continue to perform the assumption testing.
#rem <- plm(Life.Ladder ~ Log.GDP.Per.Capita
# + Social.Support
# + Freedom.To.Make.Life.Choices
# + Positive.Affect,
# data = ladder_train,
# index = c("Country.Name","Year"),
# model = "random")
Assumption Testing
Normality
The hypotheses tested are as follows:
- H0: Residuals are normally distributed
- H1: Residuals are not normally distributed
H0 is rejected if P-value < α. The significance level α used is 5%.
# your code here
fem$residuals %>% shapiro.test()
##
## Shapiro-Wilk normality test
##
## data: .
## W = 0.98468, p-value = 0.571
Insight :
Based on the normality assumption test results, we obtained a p-value > 0.05, indicating that the residuals are normally distributed.
Homogeneity
The hypotheses tested are as follows:
- H0: Residuals have homoscedasticity
- H1: Residuals do not have homoscedasticity
H0 is rejected if P-value < α. The significance level α used is 5%.
fem %>% bptest()
##
## studentized Breusch-Pagan test
##
## data: .
## BP = 11.098, df = 5, p-value = 0.04948
Insight :
Based on the homogeneity assumption test results, we obtained a p-value < 0.05. However, the p-value is relatively close to 0.05.
Autocorrelation
The hypotheses tested are as follows:
- H0: There is no autocorrelation in the residuals
- H1: There is autocorrelation in the residuals
H0 is rejected if P-value < α. The significance level α used is 5%.
fem$residuals %>% Box.test(type = "Ljung-Box")
##
## Box-Ljung test
##
## data: .
## X-squared = 7.234, df = 1, p-value = 0.007153
Insight :
Based on the autocorrelation assumption test results, we obtained a p-value < 0.05, indicating that there is autocorrelation among the residuals.
Model Interpretation
Coefficients
summary(fem)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = Life.Ladder ~ Log.GDP.Per.Capita + Social.Support +
## Freedom.To.Make.Life.Choices + Positive.Affect + Generosity,
## data = ladder_train, model = "within", index = c("Country.Name",
## "Year"))
##
## Balanced Panel: n = 4, T = 17, N = 68
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.2956204 -0.0620498 -0.0016029 0.0638440 0.2907850
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## Log.GDP.Per.Capita -0.15517 0.38547 -0.4026 0.6887296
## Social.Support 3.71697 1.00885 3.6844 0.0004998 ***
## Freedom.To.Make.Life.Choices 0.38583 0.63099 0.6115 0.5432400
## Positive.Affect 2.39006 0.93210 2.5642 0.0129081 *
## Generosity 0.42221 0.26425 1.5978 0.1154357
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 1.4321
## Residual Sum of Squares: 0.66616
## R-Squared: 0.53484
## Adj. R-Squared: 0.47177
## F-statistic: 13.5675 on 5 and 59 DF, p-value: 8.1878e-09
\[life.ladder = -0.15517 * Log.GDP.Per.Capita + 3.71697 * Social.Support + 0.38583 *Freedom.To.Make.Life.Choices + 2.39006 * Positive.Affect + 0.42221*Generosity\]
- The variables that significantly influence the happiness level in
the North America and ANZ region are
Social.SupportandPositive.Affect. - Among these variables, the
Social.Supporthas the most significant impact. - The happiness level of a country in North America and ANZ will
decrease by -0.15517 units for each one-unit increase in
Log.GDP.Per.Capita, holding all other variables constant. - The happiness level of a country in North America and ANZ will
increase by 3.71697 units for each one-unit increase in
Social.Support, holding all other variables constant. - The happiness level of a country in North America and ANZ will
increase by 0.38583 units for each one-unit increase in
Freedom.To.Make.Life.Choices, holding all other variables constant. - The happiness level of a country in North America and ANZ will
increase by 2.39006 units for each one-unit increase in
Positive.Affect, holding all other variables constant. - The happiness level of a country in North America and ANZ will
increase by 0.42221 units for each one-unit increase in
Generosity, holding all other variables constant.
Prediction & Evaluation
To make predictions, we will use the predict() function
with the following parameters:
- object: the name of the model we are using
- newdata: the new data we want to predict
# your code here
pred <- predict(fem, ladder_test, na.fill=F)
To test whether the model we have is good at predicting new data, we
will evaluate it using error values, and one commonly used error metric
is MAPE (Mean Absolute Percentage Error). We can do this using the
MAPE() function with the following parameters:
- y_pred: predicted values
- y_true: actual target values
# your code here
MAPE(y_pred = pred,
y_true =ladder_test$Life.Ladder)
## [1] 0.01806783
Insight:
The prediction error rate of the REM model in predicting new values is 1.8%, indicating that the model can be used to predict new data accurately.
Conclusion and Suggestions
Conclusion:
- The highest level of happiness in the North America and ANZ region was observed in Canada in 2010, while the lowest was recorded in the United States in 2022.
- Significant variables influencing happiness in the region include Social Support and Positive Affect, with Social Support exerting a greater influence.
- Despite a violation of the homogeneity assumption in the testing, we accept the model due to the p-value’s proximity to the significance level. However, dataset transformation could enhance the model’s performance.
- The final model indicates that individual indices influence happiness, suggesting that each country has different characteristics regarding its population’s happiness levels.
- There appears to be no time effect on the happiness level in the North America and ANZ region.
- Based on the prediction error rate, it suggests that the model can be utilized to predict a new dataset.
Suggestions:
To improve happiness levels in the North America and ANZ region, consider the following:
- For Positive Affect:
- Encourage volunteerism to enhance social support and fulfillment, leading to greater positive affect.
- Implement education and skill-building programs to improve skills and emotional intelligence, thus promoting positive affect.
- For Social Support:
- Launch mental health awareness campaigns to mitigate stigma and facilitate access to resources and support services.
Reference
Dataset: World Happiness Report 2023
Workflow : Algoritma