Submission details

WOD7004 Sem 2/2022/2023

Group Project

Occurrence 1 Group 10

Members:

  1. Wong Kian Wai (S2180506)
  2. Muhammad Fikri bin Muhammad Azli (S2198872)
  3. Chen Yi (22079565)
  4. Siti Nur Ani Yeap (17218658)
  5. Woo Yong Shen (S2175268)

Lecturer: Associate Professor Dr. Ang Tan Fong

The dataset used in this study can be obtained from: https://worldhappiness.report/ed/2022/#appendices-and-data

Introduction

Happiness is a fundamental right of humanity, a positive emotion that has become increasingly significant in scientific studies and policy decisions. The well-being of the population is essential because it is associated with physical and mental health, economic productivity, and overall education levels. Moreover, the happiness of the population is an indicator of social progress and development in countries. This is because happy individuals are generally more active, efficient, productive, and creative. A happier population translates to a more peaceful society, as people tend to be more tolerant and understanding of others when they are content with their own lives.

The World Happiness Report (WHR) is an annual publication that ranks countries by the happiness of their people. World happiness measurements are crucial because they provide insight into the quality of life experienced by individuals living in a country. They help governments make policies that promote the happiness and well-being of the population, which is an essential component of a healthy and prosperous society. By integrating the pursuit of happiness into public policy, society can work more efficiently and suffer fewer mental health problems. This leads to increased economic productivity and a higher standard of living for all citizens.

The world happiness index is measured using various parameters, and countries are ranked based on their happiness score. Specifically, six variables contribute to the overall happiness score: GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, and perceptions of corruption. These measures are collected through survey questions answered by individuals, with overall life evaluation rated on a scale of 0 to 10. By analysing these variables, governments can identify areas for improvement in their policies and work towards creating a happier and more prosperous society.

Developing a world happiness prediction model is an essential step towards understanding and improving the quality of life of people around the world. By measuring world happiness levels and predicting future trends, we can work towards building a happier and more prosperous world. Besides, the model can also help in understanding the factors that contribute to overall happiness levels and in predicting future happiness trends. The purpose of building a prediction model is to identify the key parameters that influence world happiness and develop strategies to improve them. This can help policymakers make informed decisions and enhance the well-being of their citizens.

GDP is said to be a key indicator of a country's general well-being and happiness. GDP is a gauge of a nation's economic output and productivity; a larger GDP often indicates a stronger economy with better job prospects, higher salaries, and higher living standards. Greater levels of pleasure and life satisfaction are commonly associated with economic well-being. Therefore, the aim of this project is to study the relationship of GDP with the other variables and how it relates to the happiness score.

Project Questions:

  1. How will happiness scores change in the future?
  2. What are the key predictors of happiness scores?

Objectives:

  1. To understand the factors affecting world happiness.
  2. To develop predictive models for world happiness using different ML algorithms.
  3. To evaluate the performance of the predictive models.
  4. To identify the best-performing predictive model for world happiness.
  5. To study the relationship between the dependent variable and the independent variables.

Data Loading

The read.csv() function is used to load the happiness score data, which is assigned to the data frame df_Hscore. For Exploratory Data Analysis (EDA) purposes, a copy of the dataset is then assigned to df_rev1.

#load the datasets

df_Hscore <- read.csv("Data_H_Score.csv", header = TRUE)

#Load datasets for data understanding and eda
df_rev1 <- df_Hscore

Data Understanding

This section explores the data with a few simple commands. The aim is to understand the content of the dataset, including its size, structure, a brief overview of the records, and its key features.

The dataset contains 2199 observations and 11 features. The details of the features are as follows:

  1. Country name - Name of the country
  2. Year - Year the survey was conducted
  3. Life Ladder - The happiness score
  4. Log GDP per capita - Log of Gross Domestic Product per person (how much each country produces, divided by its population); an indicator of the strength of a nation's economy
  5. Social support - Having someone to count on in times of trouble
  6. Healthy life expectancy at birth - Number of years an individual is expected to live in good health, combining both the length and the quality of life
  7. Freedom to make life choices - Right to life and liberty, freedom from slavery and torture, freedom of opinion and expression, the right to work and education
  8. Generosity - Charity, donation and community engagement
  9. Perceptions of corruption - Perceived corruption in government, business and among individuals
  10. Positive affect - Average of three positive affect measures: laughter, enjoyment and doing things one likes
  11. Negative affect - Average of three negative affect measures: worry, sadness and anger
#view the first 5 rows of df
head(df_rev1, 5)
#size of dataframe
dim(df_rev1)
## [1] 2199   11
#get to know the dataframe features
colnames(df_rev1)
##  [1] "Country.name"                     "year"                            
##  [3] "Life.Ladder"                      "Log.GDP.per.capita"              
##  [5] "Social.support"                   "Healthy.life.expectancy.at.birth"
##  [7] "Freedom.to.make.life.choices"     "Generosity"                      
##  [9] "Perceptions.of.corruption"        "Positive.affect"                 
## [11] "Negative.affect"
#take a glimpse of dataframe
glimpse(df_rev1)
## Rows: 2,199
## Columns: 11
## $ Country.name                     <chr> "Afghanistan", "Afghanistan", "Afghan…
## $ year                             <int> 2008, 2009, 2010, 2011, 2012, 2013, 2…
## $ Life.Ladder                      <dbl> 3.724, 4.402, 4.758, 3.832, 3.783, 3.…
## $ Log.GDP.per.capita               <dbl> 7.350, 7.509, 7.614, 7.581, 7.661, 7.…
## $ Social.support                   <dbl> 0.451, 0.552, 0.539, 0.521, 0.521, 0.…
## $ Healthy.life.expectancy.at.birth <dbl> 50.500, 50.800, 51.100, 51.400, 51.70…
## $ Freedom.to.make.life.choices     <dbl> 0.718, 0.679, 0.600, 0.496, 0.531, 0.…
## $ Generosity                       <dbl> 0.168, 0.191, 0.121, 0.164, 0.238, 0.…
## $ Perceptions.of.corruption        <dbl> 0.882, 0.850, 0.707, 0.731, 0.776, 0.…
## $ Positive.affect                  <dbl> 0.414, 0.481, 0.517, 0.480, 0.614, 0.…
## $ Negative.affect                  <dbl> 0.258, 0.237, 0.275, 0.267, 0.268, 0.…

The data type of each variable is as follows:

  1. Country name - Character
  2. Year - Integer
  3. Life Ladder - Numeric
  4. Log GDP per capita - Numeric
  5. Social support - Numeric
  6. Healthy life expectancy at birth - Numeric
  7. Freedom to make life choices - Numeric
  8. Generosity - Numeric
  9. Perceptions of corruption - Numeric
  10. Positive affect - Numeric
  11. Negative affect - Numeric
#summary of df
summary(df_rev1)
##  Country.name            year       Life.Ladder    Log.GDP.per.capita
##  Length:2199        Min.   :2005   Min.   :1.281   Min.   : 5.527    
##  Class :character   1st Qu.:2010   1st Qu.:4.647   1st Qu.: 8.500    
##  Mode  :character   Median :2014   Median :5.432   Median : 9.499    
##                     Mean   :2014   Mean   :5.479   Mean   : 9.390    
##                     3rd Qu.:2018   3rd Qu.:6.309   3rd Qu.:10.373    
##                     Max.   :2022   Max.   :8.019   Max.   :11.664    
##                                                    NA's   :20        
##  Social.support   Healthy.life.expectancy.at.birth Freedom.to.make.life.choices
##  Min.   :0.2280   Min.   : 6.72                    Min.   :0.2580              
##  1st Qu.:0.7470   1st Qu.:59.12                    1st Qu.:0.6562              
##  Median :0.8360   Median :65.05                    Median :0.7700              
##  Mean   :0.8107   Mean   :63.29                    Mean   :0.7479              
##  3rd Qu.:0.9050   3rd Qu.:68.50                    3rd Qu.:0.8590              
##  Max.   :0.9870   Max.   :74.47                    Max.   :0.9850              
##  NA's   :13       NA's   :54                       NA's   :33                  
##    Generosity       Perceptions.of.corruption Positive.affect  Negative.affect 
##  Min.   :-0.33800   Min.   :0.0350            Min.   :0.1790   Min.   :0.0830  
##  1st Qu.:-0.11200   1st Qu.:0.6880            1st Qu.:0.5720   1st Qu.:0.2080  
##  Median :-0.02300   Median :0.8000            Median :0.6630   Median :0.2610  
##  Mean   : 0.00009   Mean   :0.7452            Mean   :0.6521   Mean   :0.2715  
##  3rd Qu.: 0.09200   3rd Qu.:0.8690            3rd Qu.:0.7380   3rd Qu.:0.3230  
##  Max.   : 0.70300   Max.   :0.9830            Max.   :0.8840   Max.   :0.7050  
##  NA's   :73         NA's   :116               NA's   :24       NA's   :16
#get structure of df
str(df_rev1)
## 'data.frame':    2199 obs. of  11 variables:
##  $ Country.name                    : chr  "Afghanistan" "Afghanistan" "Afghanistan" "Afghanistan" ...
##  $ year                            : int  2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 ...
##  $ Life.Ladder                     : num  3.72 4.4 4.76 3.83 3.78 ...
##  $ Log.GDP.per.capita              : num  7.35 7.51 7.61 7.58 7.66 ...
##  $ Social.support                  : num  0.451 0.552 0.539 0.521 0.521 0.484 0.526 0.529 0.559 0.491 ...
##  $ Healthy.life.expectancy.at.birth: num  50.5 50.8 51.1 51.4 51.7 ...
##  $ Freedom.to.make.life.choices    : num  0.718 0.679 0.6 0.496 0.531 0.578 0.509 0.389 0.523 0.427 ...
##  $ Generosity                      : num  0.168 0.191 0.121 0.164 0.238 0.063 0.106 0.082 0.044 -0.119 ...
##  $ Perceptions.of.corruption       : num  0.882 0.85 0.707 0.731 0.776 0.823 0.871 0.881 0.793 0.954 ...
##  $ Positive.affect                 : num  0.414 0.481 0.517 0.48 0.614 0.547 0.492 0.491 0.501 0.435 ...
##  $ Negative.affect                 : num  0.258 0.237 0.275 0.267 0.268 0.273 0.375 0.339 0.348 0.371 ...
#how many countries are involved?
n_unique(df_rev1$`Country.name`)
## [1] 165
#summary of happiness score
summary(df_rev1$`Life.Ladder`)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.281   4.647   5.432   5.479   6.309   8.019

To explore further and find patterns in the statistics, the number of missing values has to be determined first. This avoids drawing wrong conclusions in the next step. Perceptions of corruption shows the highest number of missing values. The number of missing values is relatively small (1.4%) compared to the whole dataset.

#Analyzing Missing values NA
sum(is.na(df_rev1$Life.Ladder)) #no missing values in the happiness score
## [1] 0
sum(is.na(df_rev1)) #insight: 349 missing value in df
## [1] 349
dataviz2 <- gg_miss_var(df_rev1)
dataviz2 #perception corruption with highest missing values

dataviz1 <- vis_miss(df_rev1) #get viz of missing values
dataviz1

#Comments : since only 1.4% of the values are missing, the proportion is quite small,
#           hence, we simply remove the missing values.

Exploratory Data Analysis (EDA)

This project proceeds further by performing Exploratory Data Analysis (EDA) on the dataset, with the aim of understanding and uncovering patterns, relationships, and insights that can inform subsequent data modeling and decision-making processes. EDA involves a variety of statistical and visualization techniques to summarize, visualize, and explore the main characteristics of the data.
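The EDA below uses a working copy of the dataset, df_rev2, in which the column names have been shortened (for example HappyScore, LogGDPperCapita, SocialSupport). A minimal sketch of how df_rev2 could be prepared from df_rev1 is given here; the names for the positive and negative affect columns are assumed, since they are not referenced in the later code.

#prepare df_rev2 with shortened column names for the EDA (sketch; last two names assumed)
df_rev2 <- df_rev1
colnames(df_rev2) <- c("CountryName", "Year", "HappyScore", "LogGDPperCapita",
                       "SocialSupport", "HealthyLifeExpectancy_Birth", "FreedomMakeChoice",
                       "Generosity", "Perception_Corruption", "PositiveAffect", "NegativeAffect")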

Features Correlation

The code below generates a pairs correlation plot and assigns colours for the visualisation. A colour palette is created using the viridis::viridis() function; its argument n is set to the number of rows in the df_rev2 data frame, i.e. the number of colours to be created. The v_color variable holds the resulting colour scheme. The colours from the v_color palette are then assigned to the color column according to the ordering of the HappyScore variable: the HappyScore values are sorted with order(), and the permutation is inverted with Matrix::invPerm().

#declare the color based for pair correlation visualization
v_color <- viridis::viridis(
  n = nrow(df_rev2)
)

df_rev2$color <- v_color[Matrix::invPerm(
  p = order(
    x = df_rev2$HappyScore
  )
)]

#built pairs correlation graph
pairs(
  formula = HappyScore ~ LogGDPperCapita +
    SocialSupport +
    HealthyLifeExpectancy_Birth +
    FreedomMakeChoice +
    Generosity +
    Perception_Corruption,
  data = df_rev2,
  col = df_rev2$color,
  pch = 21)

A correlation graph provides a visual overview of the relationships between variables, to identify potential patterns or clusters of correlations. It gives a broad sense of which variables are positively or negatively correlated, and it helps to identify variables that may be worth investigating further.

Looking at the correlation graph, it is hard to find a clear pattern, but we can observe that the points are more clustered than scattered in the scatter plots between different pairs of variables. We focus on the first column, where we can observe the relationship between the other variables and the happiness score. From the scatter plots, variables such as Log GDP per Capita, Social Support, Healthy Life Expectancy, Freedom to Make Choices and Generosity have a positive correlation with the happiness score, while Perception of Corruption has a negative correlation with it.

Looking at the second column, we can see that variables such as Social Support, Healthy Life Expectancy and the happiness score have a positive correlation with GDP, while the perception of corruption appears to have a negative correlation with GDP. To study the relationships between variables further, we proceed with the correlation diagram.

#build correlogram
options(repr.plot.width = 12, repr.plot.height = 10)
cor_happy <- subset(df_rev2, select = c(3,4,5,6,7,8,9))
corr_happy <- na.omit(cor_happy)
corr <- round(cor(corr_happy), 1)

viz2 <- ggcorrplot(corr, hc.order = TRUE,
           #type = "lower",
           lab = TRUE,
           lab_size = 4,
           method = "circle",
           colors = c("orange", "white", "cyan"),
           title = "Correlation of Variables",
           ggtheme = theme_bw)
print(viz2)

#insights : 
#[1] Perception of corruption shows a negative correlation: the lower the corruption, the happier the people.
#[2] GDP, Social Support and Healthy Life Expectancy show positive correlations: the higher these are, the happier the people.
#[3] Generosity shows a coefficient close to 0: it is barely related to people's happiness.

With the correlation matrix corr and the ggcorrplot() function, a correlation plot is produced. From this diagram, we can observe several important correlations between the variables and the happiness score. Here is a further explanation of these findings:

  1. LogGDP per Capita and Happy Score: The correlation coefficient of +0.8 indicates a strong positive correlation between LogGDPperCapita (a measure of economic prosperity) and HappyScore. This suggests that countries with higher GDP per capita tend to have higher happiness scores. The finding aligns with the notion that wealthier countries often provide better living conditions, opportunities, and resources that contribute to overall happiness.

  2. Social Support and Happy Score: With a strong positive correlation coefficient of +0.7, SocialSupport (referring to having someone to rely on in times of need) shows a notable association with HappyScore. This implies that countries where individuals have strong social support networks tend to have higher levels of happiness. The presence of supportive relationships and a sense of belonging can contribute positively to people’s well-being.

  3. Healthy Life Expectancy at Birth and Happy Score: Another variable showing a strong positive correlation of +0.7 with HappyScore is HealthyLifeExpectancy_Birth. This suggests that countries with higher life expectancies and better health conditions at birth tend to have higher happiness scores. Good health and well-being are fundamental aspects of human happiness and are closely linked to overall life satisfaction.

  4. Generosity and Happy Score: The correlation coefficient of +0.2 indicates a positive but relatively weak correlation between Generosity and HappyScore. This finding suggests that while there is a slight positive association between generosity and happiness, it is not a strong or significant relationship. Other factors, such as economic prosperity, social support, and health, may have a more substantial impact on overall happiness.

  5. Perception of Corruption and Happy Score: The moderate negative correlation of -0.5 between Perception_Corruption and HappyScore indicates that countries where people perceive lower levels of corruption in the government tend to have higher happiness scores. This suggests that trust in the government and the absence of corruption contribute to the overall happiness and well-being of individuals.

There are also some notable relationships between the variables other than the happiness score. For example, there is a strong correlation between Healthy Life Expectancy and GDP, with a correlation coefficient of 0.8, meaning a healthier society tends to be a richer one. Besides that, the correlation coefficient between Social Support and GDP is 0.7, which is also strong. From this, we believe a society with good social connections and relationships tends to be richer. For the Perception of Corruption, the correlation coefficient with GDP is -0.4; we can interpret this as: when a society perceives its government to be corrupt, the country generally does not have a high GDP.

Happiness vs Countries

The code below calculates the average HappyScore for each nation, ranks the outcomes in order of highest to lowest HappyScore, and then chooses the top 15 nations for the bar chart display below.

HSvsCountry <- df_rev2 %>% 
  na.omit(df_rev2) %>%
  group_by(CountryName) %>% 
  summarise(HappyScore = mean(HappyScore)) 

HSvsCountry <- HSvsCountry %>%
  arrange(desc(HappyScore)) %>% 
  head(15)
HSvsCountry
viz4 <- ggplot(HSvsCountry,
               aes(x = reorder(CountryName, -HappyScore),
                   y = HappyScore
                   )) +
  geom_bar(stat = "identity",
           fill = "darkcyan")+
  labs(title = "Top 15 Countries of Good Happiness Report",
       subtitle = "Countries vs Happiness Score",
       x = "Countries",
       y = "Happiness Score"
       ) +
  geom_text(aes(label = round(HappyScore,2),
                vjust = -.5,
                fontface = "italic",
                color = "orange",
                ),
                show.legend = FALSE,
           size = 3.5 ) +
  theme(axis.text.x = element_text(angle = -90,
                                   hjust = 0,
                                   vjust = 0))
viz4

Then, the code below calculates the average HappyScore for each country, sorts the results in ascending order, and selects the 15 countries with the lowest HappyScore for the bar plot below.

#top worst happiness
HSvsCountry2 <- df_rev2 %>% 
  na.omit(df_rev2) %>%
  group_by(CountryName) %>% 
  summarise(HappyScore = mean(HappyScore)) 

HSvsCountry2 <- HSvsCountry2 %>%
  arrange(HappyScore) %>% 
  head(15)
HSvsCountry2
viz5 <- ggplot(HSvsCountry2,
               aes(x = reorder(CountryName, HappyScore),
                   y = HappyScore
               )) +
  geom_bar(stat = "identity",
           fill = "darkred")+
  labs(title = "Top 15 Countries of Bad Happiness Report",
       subtitle = "Countries vs Happiness Score",
       x = "Countries",
       y = "Happiness Score"
  ) +
  geom_text(aes(label = round(HappyScore, 2),
                vjust = -.5,
                fontface = "italic",
                color = "white"),
            size = 3.5,
            show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = -90,
                                   hjust = 0,
                                   vjust = 0))
viz5

The bar graphs show that developed countries tend to have happier people compared to third-world countries. We notice that the top countries with high happiness scores are mostly developed and rich European countries; for example, the top 3 countries in the bar chart are Denmark, Finland and Norway. There are no European countries in the list of the 15 countries with the worst happiness scores, but we can see several countries from the African region. The worst 3 countries are Afghanistan, CAR (Central African Republic) and Burundi.

Happiness scores are typically based on various factors such as economic well-being, social support, life expectancy, freedom, generosity, and perceptions of corruption. The higher happiness scores in Denmark, Finland, and Norway could be attributed to several factors including high living standards, robust social support systems, quality healthcare, stable political environments, and high levels of trust within society.

On the other hand, Afghanistan, CAR, and Burundi may face various challenges that can impact happiness levels. Factors such as political instability, conflict, economic hardships, limited access to quality healthcare and education, and other socio-cultural factors could contribute to lower reported happiness scores in these countries.

GDP vs Countries

The average Log GDP per Capita for each country is calculated, the results are sorted in ascending order of GDP, and the 15 countries with the lowest average Log GDP per Capita are selected.

#Top 15 by bad GDP
HSvsGDP <- df_rev2 %>% 
  na.omit(df_rev2) %>%
  group_by(CountryName) %>% 
  summarise(GDP = mean(LogGDPperCapita)) 

HSvsGDP <- HSvsGDP %>%
  arrange(GDP) %>% 
  head(15)
HSvsGDP

The bar plot below shows the 15 countries with the lowest GDP, and we notice that countries like the Central African Republic, Malawi, Afghanistan and Sierra Leone also appeared in the earlier bar plot of low happiness scores. This reconfirms that a poorer country tends to have a more unhappy society, as poverty affects the ability to maintain a good standard of living.

viz6 <- ggplot(HSvsGDP,
              aes(x = reorder(CountryName, GDP),
                  y = GDP
              )) +
  geom_bar(stat = "identity",
           fill = "Blue")+
  labs(title = "Top 15 Countries of Bad Gross Domestic Product",
       subtitle = "Countries vs GDP",
       x = "Countries",
       y = "Gross Domestic"
  ) +
  geom_text(aes(label = round(GDP, 2),
                vjust = -.5,
                fontface = "italic",
                color = "white"),
            size = 3.5,
            show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = -90,
                                   hjust = 0,
                                   vjust = 0))
viz6

The average Log GDP per Capita for each country is calculated, the results are sorted in descending order of GDP, and the 15 countries with the highest average Log GDP per Capita are selected.

#Top 15 by good GDP
HSvsGDP2 <- df_rev2 %>% 
  na.omit(df_rev2) %>%
  group_by(CountryName) %>% 
  summarise(GDP = mean(LogGDPperCapita)) 

HSvsGDP2 <- HSvsGDP2 %>%
  arrange(desc(GDP)) %>% 
  head(15)
HSvsGDP2

The bar plot below shows the top 15 countries with the highest GDP. It is worth observing that countries like Ireland, Norway, Denmark, the Netherlands, Luxembourg, Switzerland and Austria are also in the top 15 list of the happiest countries.

viz7 <- ggplot(HSvsGDP2,
               aes(x = reorder(CountryName, -GDP),
                   y = GDP
               )) +
  geom_bar(stat = "identity",
           fill = "Blue")+
  labs(title = "Top 15 Countries of Good Gross Domestic Product",
       subtitle = "Countries vs GDP",
       x = "Countries",
       y = "Gross Domestic"
  ) +
  geom_text(aes(label = round(GDP, 2),
                vjust = -.5,
                fontface = "italic",
                color = "white"),
            size = 3.5,
            show.legend = FALSE) +
  theme(axis.text.x = element_text(angle = -90,
                                   hjust = 0,
                                   vjust = 0))
viz7

Happiness Score vs GDP

This section of the code creates a new data frame named happy_gdp_df and averages the LogGDPperCapita and HappyScore for each country.

happy_gdp_df <- df_rev2 %>% 
  na.omit(df_rev2) %>% 
  group_by(CountryName) %>% 
  summarise(HappyScore = mean(HappyScore), GDP = mean(LogGDPperCapita))
happy_gdp_df

The top 15 countries with the highest happiness scores, together with their GDP data, are stored in the variable happy_gdp_df_top.

happy_gdp_df_top <- happy_gdp_df %>% 
  arrange(HappyScore) %>% 
  tail(15)

print(happy_gdp_df_top)
## # A tibble: 15 × 3
##    CountryName HappyScore   GDP
##    <chr>            <dbl> <dbl>
##  1 Ireland           7.04 11.1 
##  2 Luxembourg        7.06 11.6 
##  3 Costa Rica        7.08  9.82
##  4 Austria           7.22 10.9 
##  5 Australia         7.25 10.8 
##  6 Israel            7.27 10.5 
##  7 New Zealand       7.28 10.6 
##  8 Canada            7.31 10.8 
##  9 Sweden            7.38 10.8 
## 10 Iceland           7.45 10.9 
## 11 Netherlands       7.45 10.9 
## 12 Switzerland       7.47 11.1 
## 13 Norway            7.48 11.1 
## 14 Finland           7.62 10.8 
## 15 Denmark           7.65 10.9

The 15 countries with the lowest happiness scores, together with their GDP data, are stored in the variable happy_gdp_df_bot.

happy_gdp_df_bot <- happy_gdp_df %>% 
  arrange(HappyScore) %>% 
  head(15)

print(happy_gdp_df_bot)
## # A tibble: 15 × 3
##    CountryName              HappyScore   GDP
##    <chr>                         <dbl> <dbl>
##  1 Afghanistan                    3.51  7.59
##  2 Central African Republic       3.52  6.89
##  3 Burundi                        3.55  6.68
##  4 Rwanda                         3.60  7.46
##  5 Togo                           3.66  7.53
##  6 Tanzania                       3.69  7.70
##  7 Zimbabwe                       3.81  7.61
##  8 Comoros                        3.89  8.05
##  9 Yemen                          3.93  8.01
## 10 Botswana                       3.95  9.55
## 11 Haiti                          3.95  8.03
## 12 Sierra Leone                   3.97  7.35
## 13 Malawi                         3.97  7.24
## 14 Madagascar                     3.98  7.33
## 15 Lesotho                        4.00  7.84

Then, we plot a graph of GDP against happiness score for the top 15 countries. As mentioned above, countries with the strongest economies tend to have happier people compared to countries with poor economic growth. However, a high GDP does not always translate into a happier society, as other variables can affect happiness. For example, Luxembourg has the highest GDP among all the countries, yet it is not the happiest one.

viz8 <- ggplot(happy_gdp_df_top,
               aes(x = HappyScore,
                   y = GDP,
                   ))+
  geom_point(color = "turquoise")+
  geom_text(aes(label=CountryName),hjust=0.5, vjust=0.5, size=2.3)+
  geom_smooth(method = "lm") +
  labs(title = "GDP vs Happiness Score",
       subtitle = "Based on top 15 good happiness score",
       x = "Happiness Score",
       y = "Gross Domestic Product")
viz8 #pos corr
## `geom_smooth()` using formula = 'y ~ x'

Another plot shows GDP against happiness score for the 15 countries with the lowest happiness scores. A stronger relationship can be seen here, where a poorer country is generally more unhappy.

viz9 <- ggplot(happy_gdp_df_bot,
               aes(x = HappyScore,
                   y = GDP,
                   ))+
  geom_point(color = "pink")+
  geom_text(aes(label=CountryName),hjust=0.5, vjust=0.5, size=2.3)+
  geom_smooth(method = "lm") +
  labs(title = "GDP vs Happiness Score",
       subtitle = "Based on top 15 bad happiness score",
       x = "Happiness Score",
       y = "Gross Domestic Product")
viz9 #strong pos correlation
## `geom_smooth()` using formula = 'y ~ x'

Overview:

The comparison of GDP and Happiness Score for the 15 best and worst countries further supports the notion that countries with stronger economies tend to have higher happiness scores than countries with weaker economic growth.

Based on the graph, countries such as Luxembourg, Qatar, and Singapore, which are considered among the best countries in terms of GDP, also show higher happiness scores. This suggests that their economic prosperity contributes to the overall happiness of their populations.

On the other hand, countries like Burundi, Congo, and CAR, which are labeled as the worst countries in terms of GDP, also display lower happiness scores. This implies that their lower economic development may have a negative impact on the happiness levels within these countries.

Happiness Score vs Corrupt Perception

A similar piece of code calculates the average HappyScore and average Perception_Corruption for each country and creates a new data frame called HS_CP_df.

#Happiness Score vs Corrupt Perception
HS_CP_df <- df_rev2 %>% 
  na.omit(df_rev2) %>% 
  group_by(CountryName) %>% 
  summarise(HS = mean(HappyScore), PC = mean(Perception_Corruption))
HS_CP_df

The HS_CP_df data frame is sorted in descending order based on the average HappyScore column, and the top 15 rows with the highest HappyScore are selected.

HS_CP_df_top <- HS_CP_df %>% 
  arrange(desc(HS)) %>% 
  head(15)
HS_CP_df_top

The HS_CP_df data frame is again sorted in descending order based on the average HappyScore column, and the bottom 15 rows with the lowest HappyScore are selected.

HS_CP_df_bot <- HS_CP_df %>% 
  arrange(desc(HS)) %>% 
  tail(15)
HS_CP_df_bot

In general, people who do not perceive corruption in their government tend to be happier; countries perceived as more corrupt have lower happiness scores.

viz13 <- ggplot(HS_CP_df,
                aes(x = HS,
                    y = PC)) +
  geom_point(color = "yellow") + 
  geom_smooth(method = "lm")+
  labs(title = "Corruption Perception vs Happiness Score",
       subtitle = "of all countries",
       x = "Happiness Score",
       y = "Corruption Perception Index")+
  geom_text(aes(label=CountryName),hjust=0.5, vjust=0.5, size=2.3)
viz13 #neg corr
## `geom_smooth()` using formula = 'y ~ x'

Again, we can see that countries like Finland, Denmark and Norway have a low perception of corruption. From the plot, most European countries have a lower perception of corruption and tend to be happier.

viz14 <- ggplot(HS_CP_df_top,
                aes(x = HS,
                    y = PC)) +
  geom_point(color = "yellow") + 
  geom_smooth(method = "lm")+
  labs(title = "Corruption Perception vs Happiness Score",
       subtitle = "based on top 15 happiness score",
       x = "Happiness Score",
       y = "Corruption Perception Index")+
  geom_text(aes(label=CountryName),hjust=0.5, vjust=0.5, size=2.3)
viz14 #strong neg corr
## `geom_smooth()` using formula = 'y ~ x'

But when we dive into the specific countries with the worst happiness scores, their people tend to believe that their government is corrupt. In other words, perceived government corruption plays a significant role in explaining the happiness of a country's people.

viz15 <- ggplot(HS_CP_df_bot,
                aes(x = HS,
                    y = PC)) +
  geom_point(color = "yellow") + 
  geom_smooth(method = "lm")+
  labs(title = "Corruption Perception vs Happiness Score",
       subtitle = "based on top 15 bad happiness score",
       x = "Happiness Score",
       y = "Corruption Perception Index")+
  geom_text(aes(label=CountryName),hjust=0.5, vjust=0.5, size=2.3)
viz15 #weak pos corr
## `geom_smooth()` using formula = 'y ~ x'

Overview:

The comparison between Corruption Perceptions and Happiness Score for the top countries (Finland, Denmark, and Sweden) and the worst countries (Afghanistan, CAR, and Burundi) provides insights into the relationship between perceptions of corruption and happiness levels.

The data suggests that in the top countries, characterized by Finland, Denmark, and Sweden, there is a perception of lower corruption, and they also exhibit higher happiness scores. This implies that people in these countries tend to believe that corruption levels are relatively low within their governments or public institutions. The lower perception of corruption may contribute to a more stable and trustworthy environment, which, in turn, positively impacts happiness levels.

On the other hand, the worst countries, including Afghanistan, CAR, and Sierra Leone, are associated with higher perceptions of corruption and lower happiness scores. This suggests that people in these countries believe that corruption is prevalent within their governments or public institutions. The presence of corruption can undermine social trust, impede socio-economic development, and create an unstable environment, negatively impacting happiness levels.

Happiness Score vs Social Support

Now, the code calculates the average HappyScore and average SocialSupport for each country and creates a new data frame called HS_SS_df

#Happiness Score vs Social Support
HS_SS_df <- df_rev2 %>% 
  na.omit(df_rev2) %>% 
  group_by(CountryName) %>% 
  summarise(HS = mean(HappyScore), SS = mean(SocialSupport))
HS_SS_df

The HS_SS_df data frame is sorted in descending order based on the HS column (average HappyScore) and selects the top 15 rows with the highest HappyScores.

HS_SS_df_top <-  HS_SS_df %>% 
  arrange(desc(HS)) %>% 
  head(15)
HS_SS_df_top

The HS_SS_df data frame is sorted in descending order based on the HS column and selects the bottom 15 rows with the lowest HappyScores

HS_SS_df_bot <-  HS_SS_df %>% 
  arrange(desc(HS)) %>% 
  tail(15)
HS_SS_df_bot

If we plot Social Support against the happiness score, we can see a clear relationship between the two: when social support is high for a country, the country tends to be happier.

viz16 <- ggplot(data = HS_SS_df,
                aes(x = HS,
                    y = SS)) +
  geom_point(color = "darkblue") +
  geom_smooth(method = "lm") +
  labs(title = "Social Support vs Happiness Score",
       subtitle = "of all countries",
       x = "Happiness Score",
       y = "Social Support")
viz16 #strong pos corr
## `geom_smooth()` using formula = 'y ~ x'

Both graphs show that social support is crucial in defining people's happiness. A nation tends to be happier when its people have better connections with others and with society.

viz17 <- ggplot(HS_SS_df_top,
                aes(x = HS,
                    y = SS)) +
  geom_point( color = "darkblue") +
  geom_smooth(method = "lm") +
  geom_text(aes(label = CountryName),
            hjust=0.5, vjust=1.5, size=3.3) +
  labs(title = "Social Support vs Happiness Score",
       subtitle = "based on top 15 good happiness score",
       x = "Happiness Score",
       y = "Social Support Index")
viz17
## `geom_smooth()` using formula = 'y ~ x'

viz18 <- ggplot(HS_SS_df_bot,
                aes(x = HS,
                    y = SS)) +
  geom_point( color = "darkblue") +
  geom_smooth(method = "lm") +
  geom_text(aes(label = CountryName),
            hjust=0.5, vjust=1.5, size=3.3) +
  labs(title = "Social Support vs Happiness Score",
       subtitle = "based on top 15 bad happiness score",
       x = "Happiness Score",
       y = "Social Support Index")
viz18 #strong pos corr
## `geom_smooth()` using formula = 'y ~ x'

Overview:

The comparison between Social Support and Happiness Score for the top countries (Iceland, Denmark, and Finland) and the worst countries (Afghanistan, Burundi, and CAR) sheds light on the relationship between social support and happiness levels.

The data indicates that in the top countries, characterized by Iceland, Denmark, and Finland, there is a higher level of social support, which is also reflected in their higher happiness scores. This suggests that individuals in these countries have access to strong social networks, supportive relationships, and a sense of belonging. Social support plays a crucial role in promoting well-being, resilience, and overall happiness.

Conversely, the worst countries, including Afghanistan, Burundi, and CAR, display lower levels of social support and lower happiness scores. This implies that individuals in these countries may face challenges in accessing social networks and supportive relationships. The lack of social support can contribute to feelings of isolation, vulnerability, and lower levels of happiness.

Time Series Analysis

The code below filters the df_rev2 data frame to select the rows where the country name is "Denmark" or "Afghanistan".

YearHS_Denmark <- df_rev2 %>% 
  filter(CountryName == "Denmark")

YearHS_Afghanistan <- df_rev2 %>% 
  filter(CountryName == "Afghanistan")

Over the past decade, the happiness score of Afghanistan has fluctuated: the highest recorded score is 4.758 in 2010, while the lowest is a record low of 1.281 in 2022. The overall trend shows that the people of Afghanistan have become less happy over this period. This suggests that, on average, the people of Afghanistan have been experiencing lower levels of happiness in recent years compared to previous periods.

It’s important to note that the happiness score is influenced by various factors, including political stability, economic conditions, social support, health, education, and cultural contexts. Afghanistan has faced significant challenges such as ongoing conflict, political instability, economic struggles, and limited access to basic services, which can contribute to lower happiness levels.

viz19 <-  ggplot(YearHS_Afghanistan,
       aes(x = Year,
           y = HappyScore)) +
  geom_line() + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Happiness across 10 Years",
       subtitle = "Worst Country : Afghanistan")
viz19 #show downward trend for 10years
## `geom_smooth()` using formula = 'y ~ x'

Over the past decade, the happiness score for Denmark shows only small fluctuations: the highest recorded value is 7.97 in 2008, while the lowest is 7.51 in 2014. These fluctuations suggest that the happiness level of the Danish population has varied somewhat over the period.

However, it is worth noting that despite these fluctuations, the happiness score for Denmark has consistently remained above 7, indicating a relatively high level of happiness overall. This suggests that, on average, the people of Denmark have maintained a relatively positive sense of well-being and happiness throughout the past ten years.

Denmark is often cited as one of the happiest countries globally, and this is attributed to various factors such as high standards of living, a strong social support system, access to quality healthcare and education, social equality, and a high level of trust in institutions. These factors contribute to a positive environment that supports individual well-being and happiness.

viz20 <- ggplot(YearHS_Denmark,
       aes(x = Year,
           y = HappyScore)) +
  geom_line() + 
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Happiness Trend across 10 Years",
       subtitle = "Top County : Denmark",
       x = "Years",
       y = "Happiness Score")
viz20 #shows downward trend
## `geom_smooth()` using formula = 'y ~ x'

The code below calculates the average HappyScore for each year and creates a new data frame called HS_Year

#finding trend throughout the world for 10 years
#groupby according to year
HS_Year <- df_rev2 %>% 
  na.omit(df_rev2) %>%
  filter(Year != 2005) %>%
  group_by(Year) %>% 
  summarise(HS = mean(HappyScore))
HS_Year

The average happiness score across all countries shows an increasing trend over the past 10 years. This indicates a positive trend in overall happiness levels globally, suggesting that, on average, people have experienced an improvement in their well-being and subjective happiness over the past decade. There is, however, a sharp decrease from the year 2005, which might indicate external factors affecting the happiness score.

viz21 <- ggplot(data = HS_Year,
                aes(x = Year,
                    y = HS)) +
  geom_line(color = "blue")+
  labs(x = "Year", y = "Mean Happy Score", title = "Mean Happy Score by Year")
viz21  

This section of the code calculates several summary statistics (mean, median, minimum and maximum) of the HappyScore variable for each year in the df_rev2 data frame.

HS_Year_All <- df_rev2 %>% 
  group_by(Year) %>% 
  filter(Year>2005) %>% #remove 2005 because its sparse data makes the graph uneven
  summarise(Mean = mean(HappyScore), 
            Median = median(HappyScore),
            Min = min(HappyScore),
            Max = max(HappyScore))
HS_Year_All

Looking at the maximum score per year (plotted below), it shows a slightly increasing trend and can be said to remain between 7.7 and 7.8. This suggests that the happiest country each year achieves a score within that range.

viz22 <- ggplot(HS_Year_All,
                aes(x = Year,
                    y = Mean)) +
  geom_line() +
  geom_smooth(method = "lm") +
  labs(title = "Happiness Score Throughout 10 Years",
       subtitle = "By average score each year",
       x = "Year",
       y = "Happiness Score")
viz22  
## `geom_smooth()` using formula = 'y ~ x'

viz23 <- ggplot(HS_Year_All,
                aes(x = Year,
                    y = Median)) +
  geom_line() +
  geom_smooth(method="lm") +
  labs(title = "Happiness Score Throughout the 10 Years",
       subtitle = "By median score of every year",
       x = "Year",
       y = "Happiness Score")
viz23
## `geom_smooth()` using formula = 'y ~ x'

In contrast, the minimum score shows a steep downward trend over the past 10 years, starting above 3.0 and decreasing to below 1.3 in the latest year. This indicates an increasingly wide gap between the happiest and the saddest country each year. In other words, there has been a growing disparity in happiness levels among countries over the past decade.

viz24 <- ggplot(data = HS_Year_All,
                aes(x = Year,
                    y = Min)) +
  geom_line() +
  geom_smooth(method = "lm") +
  labs(title = "Happiness Score Throught 10 Years",
       subtitle = "By minimum score",
       x = "year",
       y = "Happiness Score")
viz24
## `geom_smooth()` using formula = 'y ~ x'

viz25 <- ggplot(data = HS_Year_All,
                aes(x = Year,
                    y = Max)) +
  geom_line(color = "red") +
  geom_smooth(method = "lm")+
  labs(title = "Happiness Score Throughout 10 Years",
       subtitle = "By maximum score",
       x = "Year",
       y = "Happiness Score")
viz25
## `geom_smooth()` using formula = 'y ~ x'

Data Processing and Data Cleaning

First, the line "names(df_Hscore)" retrieves the current column names of the df_Hscore data frame. This is to check the column names so that we can later change them into names that are shorter and consistent throughout the rest of the code.

Once we have all the current column names, we rename them. The code below uses logical indexing to find each existing column name and rename it by assigning a new name.

#rename the column names
names(df_Hscore)[names(df_Hscore) == "Country.name"] <- "Country"
names(df_Hscore)[names(df_Hscore) == "year"] <- "Year"
names(df_Hscore)[names(df_Hscore) == "Life.Ladder"] <- "H_score"
names(df_Hscore)[names(df_Hscore) == "Log.GDP.per.capita"] <- "Log_GDP"
names(df_Hscore)[names(df_Hscore) == "Social.support"] <- "Social_Support"
names(df_Hscore)[names(df_Hscore) == "Healthy.life.expectancy.at.birth"] <- "Life_Expectancy"
names(df_Hscore)[names(df_Hscore) == "Freedom.to.make.life.choices"] <- "Freedom"
names(df_Hscore)[names(df_Hscore) == "Perceptions.of.corruption"] <- "Corruption"
names(df_Hscore)[names(df_Hscore) == "Positive.affect"] <- "Pos_affect"
names(df_Hscore)[names(df_Hscore) == "Negative.affect"] <- "Neg_affect"
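As a design note, the same renaming of the original columns could be done in a single call, for example with dplyr (a sketch, assuming the package is loaded and used instead of the per-column assignments above):

#equivalent one-call renaming (sketch)
df_Hscore <- dplyr::rename(df_Hscore,
                           Country = Country.name,
                           Year = year,
                           H_score = Life.Ladder,
                           Log_GDP = Log.GDP.per.capita,
                           Social_Support = Social.support,
                           Life_Expectancy = Healthy.life.expectancy.at.birth,
                           Freedom = Freedom.to.make.life.choices,
                           Corruption = Perceptions.of.corruption,
                           Pos_affect = Positive.affect,
                           Neg_affect = Negative.affect)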

After renaming, we check the new column names of df_Hscore using the names() function and confirm that the column names have been updated.

names(df_Hscore)
##  [1] "Country"         "Year"            "H_score"         "Log_GDP"        
##  [5] "Social_Support"  "Life_Expectancy" "Freedom"         "Generosity"     
##  [9] "Corruption"      "Pos_affect"      "Neg_affect"

The “Country” column in df_Hscore is cleaned using the str_replace_all() function from the stringr package. The function effectively removes any instances of characters that are not letters, digits, underscores, or spaces by replacing them with an empty string.

# Clean the 'Country name' column 
df_Hscore$Country <- str_replace_all(df_Hscore$Country, "[^\\w\\s]", "")

The function "rankfunction" below derives a new variable called "Rank" based on the happiness scores in a given year. It takes the year as its input and operates on the df_Hscore data frame.

Within the function, a subset of the dataset is created based on the selected year. For example, if the year is 2022, the corresponding subset of df_Hscore is assigned to the data frame df. The sub data frame df is then arranged in descending order, so that the country with the highest happiness score is in the first row, followed by the country with the second-highest score, and so on until the last row.

Then, Rank is initialised with a value of 1, representing the first rank. Using a for loop from 1 to the total number of rows of df, the loop creates a new column called "Rank", assigns the current rank to the current row, and increments the rank by 1. This process repeats until every country has been assigned a rank.

#function to derive a new variable, Rank
rankfunction<- function(year){
  df <- subset(df_Hscore, df_Hscore$Year == year)
  df <- df[order(df$H_score, decreasing = TRUE), ]
  Rank <- 1
  for (x in 1:nrow(df)){
    df[x, "Rank" ] <- Rank
    Rank = Rank + 1
  }
  return (df)
}
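As a design note, the explicit loop inside the function could be replaced by a vectorised assignment, since after sorting in descending order the rank is simply the row position. A minimal sketch of an equivalent function (the name rankfunction_vec is illustrative):

#equivalent, vectorised version of the ranking step (sketch)
rankfunction_vec <- function(year){
  df <- subset(df_Hscore, df_Hscore$Year == year)
  df <- df[order(df$H_score, decreasing = TRUE), ]
  df$Rank <- seq_len(nrow(df))  #rank 1 = highest happiness score
  return(df)
}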

The following chunk of code runs the function "rankfunction" with different years as inputs. For each year, the sub data frame returned by the function is assigned to a new sub data frame. For example, when the year 2008 is passed to "rankfunction", the function subsets the data from 2008 and assigns ranks according to each country's happiness score; the resulting sub data frame is returned and assigned to df2008. The same process is repeated up to the year 2022.

#Use the function and the results are returned according to year
df2008 <- rankfunction(2008)
df2009 <- rankfunction(2009)
df2010 <- rankfunction(2010)
df2011 <- rankfunction(2011)
df2012 <- rankfunction(2012)
df2013 <- rankfunction(2013)
df2014 <- rankfunction(2014)
df2015 <- rankfunction(2015)
df2016 <- rankfunction(2016)
df2017 <- rankfunction(2017)
df2018 <- rankfunction(2018)
df2019 <- rankfunction(2019)
df2020 <- rankfunction(2020)
df2021 <- rankfunction(2021)
df2022 <- rankfunction(2022)

Since we now have multiple sub datasets from 2008 to 2022 that contain the Rank variable, we combine them all using the rbind() function into the data frame df_Hscore_new. Below is an example of the first few rows of the combined dataset.

#combine everything back to form a new data frame
df_Hscore_new <- rbind(df2008,df2009,df2010,df2011,df2012,df2013,df2014,df2015,df2016,df2017,
                       df2018,df2019,df2020,df2021,df2022)
head(df_Hscore_new)

Finding the number of missing values in a dataset allows you to understand the extent of missing data and its potential impact on your analysis. Identifying missing values is crucial because they can affect the accuracy and validity of statistical analysis. If missing values are not appropriately addressed, they can lead to biased results or incorrect interpretations. Therefore, it is important to handle them properly before proceeding with further exploration or analysis.

The data table contains some missing values, which are shown in the table below when we run the code colSums(is.na(df_Hscore_new)) %>% data.frame().

As there are quite a lot of missing values, data imputation is required: there are too many rows with null values to simply remove them all. Therefore, it was decided to first fill in the missing years for each country and then fill each missing value from the nearest available adjacent year (the code below fills upwards, i.e. from the next available year). The remaining null values are then removed, so that as little data as possible is discarded.

#check the number of nulls
colSums(is.na(df_Hscore_new)) %>% data.frame()
# Create a sequence of years from 2008 to 2022
allYears <- seq(2008, 2022)

# Complete the data for each country by filling in missing years with values from the nearest available year
completedData <- df_Hscore_new %>%
  complete(Country, Year = allYears) %>%
  group_by(Country) %>%
  fill(everything(), .direction = "up")

completedData <- completedData %>% drop_na()
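To make the direction of the fill concrete, here is a tiny illustration with made-up values (a sketch): a year that is missing for a country is created by complete() and then filled from the nearest following year, because .direction = "up" fills upwards.

#toy illustration of complete() + fill() with .direction = "up" (sketch, made-up values)
toy <- data.frame(Country = "X", Year = c(2018, 2020), H_score = c(5.0, 6.0))
toy %>%
  complete(Country, Year = 2018:2020) %>%
  group_by(Country) %>%
  fill(everything(), .direction = "up")
#the newly created 2019 row takes the 2020 value (6.0)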

The code colSums(is.na(completedData)) is used to confirm that there are no further nulls in the data table.

colSums(is.na(completedData)) %>% data.frame()

Before normalizing the variables, we temporarily remove the Year, Rank, Happiness score (H_score) and Log GDP per capita (Log_GDP) columns, since Year and Rank do not need to be normalized and only the independent variables are normalized. The temporary table temp_table_norm1 holds the data without these columns. The Log GDP per capita data is not normalized because it has already been transformed onto a logarithmic scale, which reduces the impact of extreme values.

#Remove Year, Rank, Happiness score and Log GDP per capita (by column position)
temp_table_norm1 <- completedData[ ,-2]      #drop Year
temp_table_norm1 <- temp_table_norm1[ ,-11]  #drop Rank
temp_table_norm1 <- temp_table_norm1[ ,-2]   #drop H_score
temp_table_norm1 <- temp_table_norm1[ ,-2]   #drop Log_GDP
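Selecting columns by numeric position is fragile if the column order changes; an equivalent removal by column name could look like this (a sketch, assuming the dplyr package is loaded):

#equivalent column removal by name (sketch)
temp_table_norm1 <- dplyr::select(completedData, -Year, -Rank, -H_score, -Log_GDP)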

The variables in the data table are normalized using the preProcess() function from the caret package. Since "range" is the method specified, the data is scaled to lie between 0 and 1. The function takes the temp_table_norm1 dataset as input. We then use the predict() function to apply the preprocessing defined in the previous step to temp_table_norm1, which performs the actual scaling to the 0-1 range.

Using the reshape2 package, the melt() function reshapes the df_Hscore_norm dataset into a long format with two columns, one containing the variable names and the other the corresponding values. The data is reshaped so that it can be plotted as a box plot to visualize outliers. We can see from the summary of df_Hscore_norm that the minimum and maximum values of each variable have been scaled to 0 and 1 respectively.

#Normalization and reshape the data frame for ggplot
process <- preProcess(temp_table_norm1, method=c("range"))
df_Hscore_norm <- predict(process, temp_table_norm1)
df_Hscore_long <- melt(df_Hscore_norm) 
## Using Country as id variables
summary(df_Hscore_norm)
##    Country          Social_Support   Life_Expectancy     Freedom      
##  Length:2107        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :character   1st Qu.:0.6263   1st Qu.:0.7063   1st Qu.:0.5481  
##  Mode  :character   Median :0.7661   Median :0.8297   Median :0.7070  
##                     Mean   :0.7310   Mean   :0.7963   Mean   :0.6741  
##                     3rd Qu.:0.8802   3rd Qu.:0.8919   3rd Qu.:0.8267  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##    Generosity       Corruption       Pos_affect       Neg_affect    
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.2209   1st Qu.:0.6825   1st Qu.:0.5518   1st Qu.:0.1827  
##  Median :0.3026   Median :0.8006   Median :0.6879   Median :0.2691  
##  Mean   :0.3260   Mean   :0.7471   Mean   :0.6714   Mean   :0.2885  
##  3rd Qu.:0.4097   3rd Qu.:0.8766   3rd Qu.:0.7936   3rd Qu.:0.3754  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000

The ggplot() function is used to draw a box plot for each variable, using the reshaped dataset df_Hscore_long. The aes() function specifies the plot's aesthetics: x = variable and y = value map the variable column and the value column of df_Hscore_long to the x-axis and y-axis respectively. The box plot shows that quite a few data points lie beyond the whiskers (1.5 times the interquartile range), which we may need to treat as outliers and remove.

# Applying ggplot function
ggplot(df_Hscore_long, aes(x = variable, y = value)) +            
  geom_boxplot(coef = 1.5) 

After checking the box plot, the variables Year, Happiness score (H_score) and Log GDP per capita (Log_GDP) are added back to the normalized dataset, df_Hscore_norm.

#combine them back
Year <- completedData$Year
H_score <- completedData$H_score
Log_GDP <- completedData$Log_GDP 
df_Hscore_norm$Year <- Year
df_Hscore_norm$H_score <- H_score
df_Hscore_norm$Log_GDP <- Log_GDP 

The function detect_outlier is created to detect outliers. It takes a numeric vector x, calculates the first and third quartiles, and obtains the interquartile range (IQR) by subtracting the first quartile from the third quartile.

The function then returns a logical vector of the same length as x, where each element is TRUE if it is an outlier and FALSE otherwise. A value is regarded as an outlier if it exceeds Quantile3 + (IQR * 1.5) or falls below Quantile1 - (IQR * 1.5).

# create detect outlier function
detect_outlier <- function(x) {
  # calculate first quantile
  Quantile1 <- quantile(x, probs=.25)
  
  # calculate third quantile
  Quantile3 <- quantile(x, probs=.75)
  
  # calculate inter quartile range
  IQR = Quantile3-Quantile1
  
  # return true or false
  x > Quantile3 + (IQR*1.5) | x < Quantile1 - (IQR*1.5)
}
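A quick sanity check of detect_outlier on a small made-up vector (a sketch; the expected output is noted in the comment):

#only the extreme value should be flagged as an outlier
detect_outlier(c(1, 2, 3, 100))
#expected: FALSE FALSE FALSE TRUE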

The function remove_outlier is designed to remove outliers. It takes a data frame and an optional vector, columns, defining the columns to be checked for outliers. If columns is not specified, all of the data frame's columns are used by default.

The function iterates through the columns using a for loop. For each column, it uses detect_outlier() to find the outliers within that column; the logical vector returned by detect_outlier() indicates which observations are outliers.

When outliers are found in the current column, the logical vector is used to subset the data frame and remove the offending rows. The function then returns the resulting data frame.

# create remove outlier function
remove_outlier <- function(dataframe, columns=names(dataframe)) {
  
  # for loop to traverse in columns vector
  for (col in columns) {
    
    # remove observation if it satisfies outlier function
    dataframe <- dataframe[!detect_outlier(dataframe[[col]]), ]
  }
  
  # return dataframe
  return (dataframe)
}

The remove_outlier function is applied to the df_Hscore_norm dataset together with the specified columns.

The detect_outlier() function identifies the outliers, which are then removed with the remove_outlier() function. The original data frame is replaced by happiness_data after the outlier observations are removed. The decision of whether to delete outliers should be based on an in-depth review of the data, domain expertise, and the effect of outliers on the analysis and interpretation of results.

Since the happiness score is a gauge of subjective well-being, it may not have obvious outliers; it is therefore excluded from the outlier-removal step. Some variables, such as economic statistics like GDP or social indicators like Life Expectancy, may have more skewed distributions and be more sensitive to outliers. In such cases it may be appropriate to use outlier-detection techniques to identify extreme values and eliminate them if they are judged to be unrepresentative of the bulk of the data.

# Final cleaned data: remove outliers from the selected columns
happiness_data <- remove_outlier(df_Hscore_norm, c('Log_GDP', 'Social_Support', 'Life_Expectancy', 'Freedom','Generosity', 'Corruption', 'Pos_affect', 'Neg_affect'))

Data Splitting

The cleaned data is then split into 70% training data and 30% test data using the createDataPartition function. The training data is assigned to trainData, while the test data, which we will use to evaluate our prediction models, is assigned to testData. The Country column is removed from both the training and test data since it does not contribute to the happiness score.

data <- happiness_data
#data <- data[, -which(names(data) == "Country")]

set.seed(123) 

trainIndex <- createDataPartition(data$H_score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]

trainData <- trainData[, -which(names(trainData) == "Country")]
testData <- testData[, -which(names(testData) == "Country")]

Modelling

The goal is to predict the happiness score (H_score) using the predictor variables available in the dataset. By employing predictive modelling techniques, we can establish a relationship between the predictor variables and the happiness score, which allows us to make informed predictions about each country’s happiness level. A further goal of this project is to predict another variable in the dataset, Log GDP per capita, which is one of the variables most closely related to the happiness of a country.

To train linear regression, random forest (RF) and gradient boosting machine (GBM) models on the World Happiness Report dataset, we will leverage the different methods available in R through the caret package.

By utilizing these different methods provided by the caret package, we will explore and compare the performance of linear regression, random forest, and gradient boosting machine models on the World Happiness Report dataset, ultimately aiming to make accurate predictions of the happiness score and GDP of a country.

After training the model on the training data, we can proceed to evaluate its performance. To do this, we utilize the test data that was previously set aside. By making predictions on the test data using the trained model, we can assess how accurately the model predicts the happiness scores based on the selected predictor variables.

Predictions on the test data are made with the predict function (caret’s predict method for train objects): it takes the trained model and the test data as input and generates the predicted happiness scores and GDP values.
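The same train()/predict() pattern applies to all three methods. The block below is a minimal sketch of that shared workflow, shown with 5-fold cross-validation purely for illustration; the subsections that follow use their own control settings and tuning choices.

# Shared caret workflow (sketch): fit each method, then predict on the hold-out set
ctrl    <- trainControl(method = "cv", number = 5)   # illustrative resampling scheme
methods <- c("lm", "rf", "gbm")                      # methods used in the subsections below

fits <- lapply(methods, function(m)
  train(H_score ~ ., data = trainData, method = m, trControl = ctrl))
names(fits) <- methods

# Hold-out predictions for each fitted model
preds <- lapply(fits, predict, newdata = testData)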

Linear Regression Model

For the linear regression model, we will employ the “lm” method provided by the caret package. This method will allow us to build a linear regression model to predict the happiness score (Happiness score) using the available predictor variables. By utilizing the “lm” method, we will estimate the coefficients of the linear regression equation and make predictions based on the relationships between the predictors and the target variable.

The ‘lm’ method fits a linear regression model to the training data; the fitted model is then used to predict the happiness score, and the predictions are stored in prediction_lr. We also predict the happiness score for year 2022 using the model and compare it with the actual happiness score. The scatter plot of predicted versus actual values shows that the predictions correlate with the real values quite well, with an R-squared value of 0.794.

# Fit the model on the training data
lr_model <- train(H_score ~ ., data = trainData, method = "lm")

# Predict the H_score using the model and the test data
prediction_lr <- predict(lr_model, newdata = testData)

#Test the model with data from year 2022
#Predict the 2022 H_score value
currentYear <- max(data$Year)
currentYearData <- data[data$Year == currentYear, ]
currentYearData$predicted_H_score <- predict(lr_model, newdata = currentYearData )

rsquared <- cor(currentYearData$predicted_H_score, currentYearData$H_score)^2

ggplot(currentYearData, aes(x = predicted_H_score, y = H_score)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Predicted Score", y = "Actual Score",
       title = "Predicted Happiness Score vs. Actual Happiness Score in 2022") +
  geom_text(aes(x = max(currentYearData$predicted_H_score),
                y = min(currentYearData$H_score),
                label = paste0("R-squared = ", round(rsquared, 3))),
            hjust = 1, vjust = 0, color = "red")  # Add R-squared annotation

Random Forest Model

The random forest model, known for its ensemble learning technique, combines multiple decision trees to make predictions. It considers a random subset of predictor variables at each split, resulting in improved accuracy and robustness. The caret package offers the “rf” method, which will enable us to train a random forest model on the World Happiness Report dataset.
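The model fitted below fixes mtry = 2 and uses no resampling to keep training fast. As an alternative, mtry could be tuned by cross-validation; the following is a minimal sketch with illustrative settings (the grid values and ntree are assumptions, not the settings used in this project).

# Optional: tune mtry by 5-fold cross-validation instead of fixing it (illustrative settings)
rf_tuned <- train(H_score ~ ., data = trainData, method = "rf",
                  trControl = trainControl(method = "cv", number = 5),
                  tuneGrid = expand.grid(mtry = c(2, 4, 6)),
                  ntree = 300)
rf_tuned$bestTune   # mtry value with the lowest cross-validated RMSE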

Next, for the Random Forest model, the ‘rf’ method is used to fit the training data; the fitted model is then used to predict the happiness score, and the predictions are stored in prediction_rf. We repeat the same process of predicting the happiness score for year 2022, and the scatter plot shows that the predicted values match the real values very well, better than the linear regression model, with an R-squared value of 0.95. This means that the predicted happiness scores closely match the true values with only small errors.

# Train the Random Forest regression model
model_rf <- train(H_score ~ ., data = trainData, method = "rf",
               trControl = trainControl(method = "none"),
               tuneGrid = data.frame(mtry = 2),
               verbose = TRUE)
print(model_rf)
## Random Forest 
## 
## 1276 samples
##    9 predictor
## 
## No pre-processing
## Resampling: None
# Predict the test data
prediction_rf  <- predict(model_rf, newdata = testData)

#Test the model with data from year 2022
#Predict the 2022 H_score value
Year2022Data_rf <- data[data$Year == currentYear, ]
Year2022Data_rf$predicted_H_score <- predict(model_rf, newdata = Year2022Data_rf )

rsquared <- cor(Year2022Data_rf$predicted_H_score, Year2022Data_rf$H_score)^2

ggplot(Year2022Data_rf, aes(x = predicted_H_score, y = H_score)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Predicted Score", y = "Actual Score",
       title = "Predicted Happiness Score vs. Actual Happiness Score in 2022") +
  geom_text(aes(x = max(Year2022Data_rf$predicted_H_score),
                y = min(Year2022Data_rf$H_score),
                label = paste0("R-squared = ", round(rsquared, 3))),
            hjust = 1, vjust = 0, color = "red")  # Add R-squared annotation

Gradient Boosting Model

We will utilize the gradient boosting machine method, available as the “gbm” method in the caret package. Gradient boosting is a powerful machine learning technique that combines weak learners, usually decision trees, to create a strong predictive model. It iteratively builds the model by minimizing the errors of the previous iterations. With the “gbm” method, we will train a gradient boosting machine model to predict the happiness score based on the available predictor variables.

The ‘gbm’ method is used to train the GBM model on the training data; the predicted happiness scores are assigned to prediction_gbm. The correlation between the predicted values and the real happiness scores is high, with an R-squared value of 0.859.

# Train the Gradient Boosting Regression model 
# Define the training control
ctrl <- trainControl(method = "repeatedcv", 
                     number = 5, 
                     repeats = 2, 
                     verboseIter = FALSE)

# Train the GBM model
model <- train(H_score ~ ., 
               data = trainData, 
               method = "gbm", 
               trControl = ctrl)

# Predict the test data
prediction_gbm <- predict(model, newdata = testData)

#Test the model with data from year 2022
#Predict the 2022 H_score value
Year2022Data_gbm <- data[data$Year == currentYear, ]
Year2022Data_gbm$predicted_H_score <- predict(model, newdata = Year2022Data_gbm )

rsquared <- cor(Year2022Data_gbm$predicted_H_score, Year2022Data_gbm$H_score)^2

ggplot(Year2022Data_gbm, aes(x = predicted_H_score, y = H_score)) +
  geom_point() +
  geom_abline(intercept = 0, slope = 1, linetype = "dashed") +
  labs(x = "Predicted Score", y = "Actual Score",
       title = "Predicted Happiness Score vs. Actual Happiness Score in 2022") +
  geom_text(aes(x = max(Year2022Data_gbm$predicted_H_score),
                y = min(Year2022Data_gbm$H_score),
                label = paste0("R-squared = ", round(rsquared, 3))),
            hjust = 1, vjust = 0, color = "red")  # Add R-squared annotation

Model Evaluation

Predictions on the test data are made using the predict function, which takes the trained model and the test data as input and generates predicted happiness scores. By comparing these predicted scores with the actual happiness scores in the test data, we can compute evaluation metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared.

MSE measures the average squared difference between the predicted and actual happiness scores. RMSE is the square root of MSE and provides a more interpretable measure of prediction accuracy, expressed in the same units as the happiness score. A lower MSE or RMSE indicates better model performance. Another commonly used metric is R-squared, which represents the proportion of variance in the happiness scores that can be explained by the predictor variables. R-squared ranges from 0 to 1, where a value closer to 1 suggests a higher degree of prediction accuracy.
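Since the same three metrics are computed for every model below, they can be wrapped in a small helper. This is a minimal sketch; the name eval_metrics is our own and not part of caret, and caret’s postResample() offers a comparable one-call summary (RMSE, R-squared, MAE).

# Hypothetical helper mirroring the manual metric calculations used in the subsections below
eval_metrics <- function(predicted, actual) {
  mse  <- mean((predicted - actual)^2)   # mean squared error
  rmse <- sqrt(mse)                      # root mean squared error
  rsq  <- cor(predicted, actual)^2       # squared correlation, used here as R-squared
  c(MSE = mse, RMSE = rmse, Rsquared = rsq)
}

# Example usage (after the models have been trained):
# eval_metrics(prediction_lr, testData$H_score)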

By computing these evaluation metrics, we will gain insights into how well the model performs in predicting the happiness score based on the selected predictor variables. This information is crucial for assessing the model’s reliability and determining if any adjustments or improvements are necessary.

Once we have obtained the best performing model, we will utilize the model to predict the happiness scores for the next year.

Happiness Score Linear Regression Model Evaluation

set.seed(123)

# Align the predicted and actual scores (they share the same test-set ordering)
common_indices <- intersect(1:length(prediction_lr), 1:length(testData$H_score))

aligned_predictions <- prediction_lr[common_indices]
# print(aligned_predictions)

aligned_actualH_score <- testData$H_score[common_indices]
# print(aligned_actualH_score)

# Calculate the MSE                 
mse_lr_hpy <- mean((aligned_predictions - aligned_actualH_score )^2)
# Calculate the RMSE
rmse_lr_hpy <- sqrt(mse_lr_hpy)
# Calculate the R-squared
rsquared_lr_hpy <-cor(aligned_predictions, aligned_actualH_score) ^ 2

plot(x = aligned_predictions, y = aligned_actualH_score,
     xlab = 'Predicted Score',
     ylab = 'Actual Score',
     main = 'Predicted vs. Actual Score for all years')
abline(a = 0, b = 1)

# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse_lr_hpy, "\n")
## Mean Squared Error (MSE): 0.3092783
cat("Root Mean Squared Error (RMSE):", rmse_lr_hpy, "\n")
## Root Mean Squared Error (RMSE): 0.5561279
cat("R-squared:", rsquared_lr_hpy, "\n")
## R-squared: 0.7104922

Happiness Score Random Forest Model Evaluation

set.seed(123)

# Find the common indices between the two vectors
common_indices <- intersect(1:length(prediction_rf), 1:length(testData$H_score))

aligned_predictions <- prediction_rf [common_indices]
aligned_actualH_score <- testData$H_score[common_indices]

plot(x = aligned_predictions, y = aligned_actualH_score,
     xlab = 'Predicted Score',
     ylab = 'Actual Score',
     main = 'Predicted vs. Actual Score for all years')
abline(a = 0, b = 1)

# Calculate the MSE                 
mse_rf_hpy <- mean((aligned_predictions - aligned_actualH_score )^2)
# Calculate the RMSE
rmse_rf_hpy <- sqrt(mse_rf_hpy)
# Calculate the R-squared
rsquared_rf_hpy <-cor(aligned_predictions, aligned_actualH_score) ^ 2

# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse_rf_hpy, "\n")
## Mean Squared Error (MSE): 0.1529868
cat("Root Mean Squared Error (RMSE):", rmse_rf_hpy, "\n")
## Root Mean Squared Error (RMSE): 0.3911353
cat("R-squared:", rsquared_rf_hpy, "\n")
## R-squared: 0.8598955

Happiness Score Gradient Boosting Model Evaluation

set.seed(123)
# Find the common indices between the two vectors
common_indices <- intersect(1:length(prediction_gbm), 1:length(testData$H_score))

aligned_predictions <- prediction_gbm[common_indices]

aligned_actualH_score <- testData$H_score[common_indices]

plot(x = aligned_predictions, y = aligned_actualH_score,
     xlab = 'Predicted Score',
     ylab = 'Actual Score',
     main = 'Predicted vs. Actual Score for all years')
abline(a = 0, b = 1)

# Calculate the MSE                 
mse_gbm_hpy <- mean((aligned_predictions - aligned_actualH_score )^2)
# Calculate the RMSE
rmse_gbm_hpy <- sqrt(mse_gbm_hpy)
# Calculate the R-squared
rsquared_gbm_hpy <-cor(aligned_predictions, aligned_actualH_score) ^ 2

# Print the evaluation metrics
cat("Mean Squared Error (MSE):", mse_gbm_hpy, "\n")
## Mean Squared Error (MSE): 0.2143828
cat("Root Mean Squared Error (RMSE):", rmse_gbm_hpy, "\n")
## Root Mean Squared Error (RMSE): 0.4630149
cat("R-squared:", rsquared_gbm_hpy, "\n")
## R-squared: 0.7996362

Model Comparison

Model_Comp <- data.frame(
  Models = c("LR Happiness", "RF Happiness", "GBM Happiness"),
  MSE = c(mse_lr_hpy, mse_rf_hpy, mse_gbm_hpy),
  RMSE = c(rmse_lr_hpy, rmse_rf_hpy, rmse_gbm_hpy),
  Rsquared = c(rsquared_lr_hpy, rsquared_rf_hpy, rsquared_gbm_hpy)
)
print(Model_Comp)
##          Models       MSE      RMSE  Rsquared
## 1  LR Happiness 0.3092783 0.5561279 0.7104922
## 2  RF Happiness 0.1529868 0.3911353 0.8598955
## 3 GBM Happiness 0.2143828 0.4630149 0.7996362

Among the three models, the Random Forest model performs best in predicting the happiness score, with an R-squared value of 0.8598955. This indicates that the model offers a good fit to the data and captures a sizable percentage of the variability in the happiness scores. Its Mean Squared Error and Root Mean Squared Error are 0.1529868 and 0.3911353 respectively, the lowest among the three models, meaning its predictions have small average errors and are generally accurate estimates of the happiness scores.

As for the Gradient Boosting model, the R-squared value of its happiness score predictions is 0.7996362, indicating that this model also fits the data well, although the Random Forest model is slightly better. Its Mean Squared Error and Root Mean Squared Error are 0.2143828 and 0.4630149 respectively, slightly higher than the Random Forest values but still indicating fairly low prediction error.

The worst performing model is the linear regression model, with an R-squared value of 0.7104922, the lowest among the happiness prediction models. Although this R-squared value can still be considered a reasonable fit to the data, the other models perform better. The same applies to its Mean Squared Error and Root Mean Squared Error, which are 0.3092783 and 0.5561279 respectively.
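As an optional visual summary, the comparison table can be reshaped to long format and drawn as a grouped bar chart. The following is a minimal sketch using the Model_Comp data frame defined above.

# Optional: visualise the model comparison as a grouped bar chart
library(tidyr)
Model_Comp_long <- pivot_longer(Model_Comp,
                                cols = c("MSE", "RMSE", "Rsquared"),
                                names_to = "Metric", values_to = "Value")

ggplot(Model_Comp_long, aes(x = Models, y = Value, fill = Metric)) +
  geom_col(position = "dodge") +
  labs(title = "Happiness Score Model Comparison", y = "Metric value")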

Feature Importance Analysis

The Random Forest model has the highest accuracy in predicting the happiness score. Therefore, this model is used for feature importance analysis to check which factors have the most significant impact on a country’s happiness score.

The varImp() function is used to calculate variable importance for the random forest model. The “Overall” importance reported here is based on the forest’s built-in importance measure (by default, the total decrease in node impurity attributable to splits on each variable; a permutation-based accuracy measure is available when the forest is trained with importance = TRUE). A bar plot is used to display the variable importance in descending order.

From the bar plot, it is evident that GDP is the most important variable affecting the happiness of a country’s population. The second most important variable is life expectancy. Based on this, a developed country that is wealthy and has a good healthcare system would tend to have a happier population. Social support is the third most important variable affecting the happiness score; in short, social support can be defined as the perception of having assistance, care and emotional support from family, friends, and the community, and people with strong social support networks tend to experience higher levels of happiness. Year contributes almost nothing to the happiness score, which is understandable since time itself does not affect happiness; it merely marks the temporal context in which the happiness scores are measured.

var_importance <- varImp(model_rf)

# Create a data frame with variables and their importance scores
variable_importance <- data.frame(
  Variable = row.names(var_importance$importance),
  Importance = var_importance$importance[, "Overall"])

# Sort the variable importance in descending order
variable_importance <- variable_importance[order(-variable_importance$Importance), ]

# Visualize the variable importance
library(ggplot2)
ggplot(variable_importance, aes(x = reorder(Variable, -Importance), y = Importance)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  xlab("Variable") +
  ylab("Importance") +
  ggtitle("Variable Importance for Happiness Score") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

2023 Happiness Score Prediction

To predict the happiness score for the next year, we impute the predictor values for the added year. Missing years are first completed for each country and filled with the nearest available following observation (a backfill), and any remaining gaps, including the newly added year, are then filled by carrying the previous year’s values forward. By imputing the predictors in this way, we obtain predictions for next year’s happiness scores and project the World Happiness Report country ranking for the next year.

# Create a sequence of years from 2008 to 2023
allYears <- seq(2008, 2023)
# Complete the data for each country: add missing years, then backfill them with the nearest following year's values
completedData <- df_Hscore_new %>%
  complete(Country, Year = allYears) %>%
  group_by(Country) %>%
  fill(everything(), .direction = "up")

data <-completedData

# Impute remaining missing values by carrying forward each country's previous value:
# Group the data by Country
data <- data %>%
  group_by(Country) %>%
  arrange(Year)

# Fill missing values with previous year's value for each country
imputedData <- data %>%
  fill(everything())

data <-imputedData
data <- data %>% drop_na()

trainIndex <- createDataPartition(data$H_score, p = 0.7, list = FALSE)
trainData <- data[trainIndex, ]
testData <- data[-trainIndex, ]
# Predict the next year's H_score value
currentYear <- max(data$Year)
nextYearData <- data[data$Year == currentYear, ]
nextYearData$H_score <- predict(model_rf, newdata = nextYearData)

# Print the latest predicted H_score for each country
print(nextYearData[, c("Country", "H_score","Year")])
## # A tibble: 155 × 3
## # Groups:   Country [155]
##    Country     H_score  Year
##    <chr>         <dbl> <int>
##  1 Afghanistan    4.24  2023
##  2 Albania        5.53  2023
##  3 Algeria        5.49  2023
##  4 Angola         5.48  2023
##  5 Argentina      6.22  2023
##  6 Armenia        5.47  2023
##  7 Australia      6.75  2023
##  8 Austria        6.60  2023
##  9 Azerbaijan     5.85  2023
## 10 Bahrain        6.68  2023
## # ℹ 145 more rows
# The averagedData dataframe will contain the averaged H_score values for each country
# across all the years from 2008 to 2023.
combinedData <- NULL

# Accumulate the data for every year from 2008 to 2023
for (year in 2008:2023) {
  currentYearData <- data[data$Year == year, ]
  combinedData <- rbind(combinedData, currentYearData)
}

# Merge next year data with combinedData
combinedData <- rbind(combinedData, nextYearData)

# View the updated combinedData dataframe
# combinedData

# Average the H_score per country across the combined years
averagedData <- aggregate(H_score ~ Country, data = combinedData, FUN = mean)

# Rank the average H_score
averagedData <- averagedData[order(-averagedData$H_score),]
averagedData$Rank <- 1:nrow(averagedData)
averagedData

Conclusion

In our analysis, we have successfully addressed the following questions:

  1. How will happiness scores change in the future? By utilizing the best-performing model, we have generated predictions for the next year’s happiness scores. These predictions provide insights into the expected changes in happiness rankings, allowing us to anticipate how countries’ happiness scores might evolve over time.

  2. What are the key predictors of happiness scores? Through an assessment of variance importance and correlation analysis, we have identified the key predictors that significantly influence happiness scores. Our findings indicate that GDP (Gross Domestic Product) emerges as the most important variable for predicting happiness. Additionally, we have explored the relationships between other variables such as social support, healthy life expectancy, freedom, generosity, and perceptions of corruption, shedding light on their contributions to happiness scores.

We also have successfully achieved all the objectives of our analysis:

  1. Understanding the factors affecting world happiness - Through our EDA, we have explored various variables and their relationships with happiness scores. This helps us gain insights into the factors that contribute to overall happiness.

  2. Developing predictive models for world happiness using different ML algorithms - We have utilized different machine learning algorithms, such as Linear Regression, Random Forest and Gradient Boosting, to develop predictive models for world happiness. These models utilize a range of predictor variables to forecast happiness scores.

  3. Evaluating the performance of the predictive models - We have assessed the performance of the developed models by evaluating metrics such as mean squared error (MSE), root mean squared error (RMSE), and R-squared. These metrics provide an indication of how well the models are able to predict happiness scores based on the selected predictor variables.

  4. Identifying the best performing predictive models for world happiness - Among the three algorithms applied in this study, the Random Forest algorithm has the best performance in predicting the happiness score, with an R-squared value of about 0.86 on the test data. This high R-squared value means the Random Forest algorithm has an excellent ability to predict the happiness score. Future work includes performing hyperparameter optimization to further improve model performance.

  5. Studying the relationship between dependent and independent variables - Comparing the variable importance and the correlation diagram, GDP is the most important variable for predicting the happiness score and, at the same time, has the highest correlation with it. Both indicate that GDP is crucial and heavily influences the happiness score. GDP, Healthy Life Expectancy and Social Support are the three variables with the highest correlation with the happiness score and are also the three most important factors in predicting it. Hence, governments need to focus on improving GDP, healthy life expectancy and social support so that society can live in a happier environment.

By accomplishing these objectives, we have gained a comprehensive understanding of world happiness, developed predictive models, assessed their performance, and explored the relationships between variables. These findings contribute to the field of happiness research and provide valuable insights for policymakers, researchers, and individuals interested in promoting happiness and well-being worldwide.
