I. INTRODUCTION TO DATA

Traditionally, economic prosperity has been treated as the main indicator of a country’s success. Increasingly, however, societal well-being is also taken into account when judging how well a nation is doing. It is both challenging and essential to understand which factors shape global happiness and how strongly each of them influences it.

We acknowledge the importance of life satisfaction and of the factors that influence residents’ happiness. This statistical report examines the relationship between key indicators and the Happiness index across 150 countries on 4 continents in 2022.

This report aims to identify the main factors contributing to nations’ happiness levels by looking at the economic factors (GDP, GDP per Capita, GDP Growth Rate, and Inflation Rate), and social metrics (Population, Corruption Index, and Unemployment Rate).

This dataset covers nations with varied economic structures, cultural backgrounds, and governance models, and is expected to reveal patterns and correlations among the investigated elements. As a result, the Happiness index of a particular country can be predicted, and the most influential variables can be identified.

The dataset gives an overview of each country through economic and socio-cultural indicators. Each entry represents a different country and includes the following information:

Rank: The ranking of the country based on GDP.
Country: The name of the country.
GDP (Gross Domestic Product): The country’s total economic output, which indicates the size of its economy.
Population: The total number of people living in the country.
Inflation Rate: The percentage increase in the general price level of goods and services over one year.
GDP Growth Rate: The rate at which the country’s economy expands or contracts over a specific period.
Happiness Index: A measure of the subjective well-being and happiness of the country’s citizens.
GDP per Capita: The GDP divided by the population, providing an average economic output per person.
Continent: The continent to which the country belongs.
Corruption Index: A measure of perceived corruption in the country, though the specific index used is not defined in the provided dataset.
Unemployment Rate: The percentage of the labor force that is unemployed and actively seeking employment.

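#load the packages used throughout the report
library(dplyr)    #provides %>% and filter()
library(ggplot2)  #provides ggplot() and the geoms
library(knitr)    #provides kable()

#build Dataset
group <- GDP$Continent
subgroup <- GDP$Country
value <- GDP$GDP
data <- data.frame(group,subgroup,value)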

#box plot for GDP per capita
GDP %>%
  filter (GDP_per_capita < 50) %>%
  ggplot(aes(x = reorder(Continent, -GDP_per_capita, FUN = median), y = GDP_per_capita, color = Continent)) +
  geom_line(linewidth = 0.8) +
  geom_boxplot(position = position_dodge(width = 5), outlier.shape = NA, linewidth = 0.8) +
  labs(title = "Average GDP per capita of the four Continents",
       x = "Continent",
       y = "GDP per capita") +
  theme_bw()

Description:

Most countries have a GDP per capita below 50 (thousand USD), so the plot is restricted to that range. Europe’s values stand out as the highest, while Africa’s are the lowest.
Europe has the highest lower and upper bounds, with a median of about 25. Although America’s upper bound is lower than Asia’s, its median is higher, at around 10. This is explained by the much wider spread of Asia’s values, from about 5 to 30, compared with about 10 to 20 for America.

#histogram for happiness index
GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.25,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution",
       x = "Happiness index",
       y = "Frequency")

Description:

The distribution of the overall Happiness index is roughly symmetric.
Values from 5.8 to 6.2 are the most frequent, with the highest count of 14 at 6.0. The value 5.0 is also common, with a frequency of 13.
There is a gap in the data before the lowest value, 2.4, which should be treated as an outlier.

GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.5,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution by Continents",
       x = "Happiness index",
       y = "Frequency")+
  facet_wrap(~Continent)

Description:

However, the picture changes moderately when the histogram is split by continent, using a bin width of 0.5.
The peaks for America and Asia are at around 6, roughly matching the overall histogram. Interestingly, most of Africa’s recorded values are below 5.5. Europe, on the other hand, shows high counts at 7 to 7.5 in addition to a similar peak at around 6.
We may tentatively conclude that Europeans are slightly happier and that Africa’s happiness level is comparatively low.

#bar chart for Corruption index
mean_values <- aggregate(`Corruption_index` ~ Continent, data = GDP, mean)

ggplot(mean_values, aes(x = Continent, y = `Corruption_index`, fill = Continent)) +
    geom_bar(stat = "identity", width = 0.7) +
    geom_text(aes(label = round(`Corruption_index`, 2)),
              position = position_nudge(y = 1), size = 3, color = "black") +
    labs(title = "Average Corruption Rate by Continent",
         x = "Continent",
         y = "Average Corruption Rate") +
    theme_minimal()

Description:

The bar chart shows the average Corruption index for the four continents: Africa, America, Asia, and Europe.
The averages differ clearly across continents: Africa has the lowest at 32.62, followed by America at 40.11, Asia at 41.35, and Europe with the highest at 58.49.
Note that the Corruption index is described later in this report as running from 0 (highly corrupt) to 100 (very clean), so a higher average indicates less perceived corruption. On that reading, Europe is perceived as the least corrupt continent and Africa as the most corrupt.

It is crucial to make accurate calculations in this section so that the factors can be compared precisely. We use Multiple Linear Regression and t-tests, as well as other techniques such as confidence intervals and correlation. With these techniques, we can decide which factors to eliminate during the analysis and which to examine in depth. Before turning to Multiple Linear Regression, we carry out the following calculations to simplify the later analysis:

#Calculate to eliminate outliers
IQR1 <- IQR(GDP_sample$Happiness_index)
Lower_limit1 <- quantile(GDP_sample$Happiness_index, probs = 0.25) - 1.5*IQR1
Upper_limit1 <- quantile(GDP_sample$Happiness_index, probs =0.75) + 1.5*IQR1

IQR2 <- IQR(GDP_sample$Corruption_index)
Lower_limit2 <- quantile(GDP_sample$Corruption_index, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_sample$Corruption_index, probs =0.75) + 1.5*IQR2

IQR3 <- IQR(GDP_sample$Inflation_rate)
Lower_limit3 <- quantile(GDP_sample$Inflation_rate, probs = 0.25) - 1.5*IQR3
Upper_limit3 <- quantile(GDP_sample$Inflation_rate, probs =0.75) + 1.5*IQR3

IQR4 <- IQR(GDP_sample$GDP_per_capita)
Lower_limit4 <- quantile(GDP_sample$GDP_per_capita, probs = 0.25) - 1.5*IQR4
Upper_limit4 <- quantile(GDP_sample$GDP_per_capita, probs =0.75) + 1.5*IQR4

IQR5 <- IQR(GDP_sample$Unemployment)
Lower_limit5 <- quantile(GDP_sample$Unemployment, probs = 0.25) - 1.5*IQR5
Upper_limit5 <- quantile(GDP_sample$Unemployment, probs =0.75) + 1.5*IQR5

IQR6 <- IQR(GDP_sample$GDP_growth_rate)
Lower_limit6 <- quantile(GDP_sample$GDP_growth_rate, probs = 0.25) - 1.5*IQR6
Upper_limit6 <- quantile(GDP_sample$GDP_growth_rate, probs =0.75) + 1.5*IQR6

#build dataset
GDP_5_stats <- subset(GDP_sample,
                      select=c("Happiness_index","Corruption_index",
                               "Inflation_rate","GDP_per_capita",
                               "Unemployment","Continent"))

GDP_no_outliers <- GDP_5_stats %>%
  filter(Happiness_index>Lower_limit1
         & Happiness_index<Upper_limit1) %>%
  filter(Inflation_rate>Lower_limit3
         & Inflation_rate<Upper_limit3) %>%
  filter(Corruption_index<Upper_limit2
         & Corruption_index>Lower_limit2) %>%
  filter(GDP_per_capita<Upper_limit4
         & GDP_per_capita>Lower_limit4) %>%
  filter(Unemployment<Upper_limit5
         & Unemployment>Lower_limit5)
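
The blocks above all apply the same 1.5×IQR rule to different columns. As a minimal sketch (not the code used to produce the results in this report), the rule could be wrapped in a small helper. The names `iqr_bounds` and `keep_inliers` and the data frame `toy` are hypothetical and introduced only for illustration, and the sketch assumes the columns contain no missing values.

```r
# Lower and upper fences of the 1.5*IQR rule for one numeric vector
iqr_bounds <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# Keep only the rows that lie strictly inside the fences for every listed column.
# The fences are computed on the full data, as in the report's own code.
keep_inliers <- function(df, cols, k = 1.5) {
  keep <- rep(TRUE, nrow(df))
  for (col in cols) {
    b <- iqr_bounds(df[[col]], k)
    keep <- keep & df[[col]] > b["lower"] & df[[col]] < b["upper"]
  }
  df[keep, , drop = FALSE]
}

# Toy data (hypothetical): the value 100 in column a is an obvious outlier
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(5, 6, 7, 8, 9))
toy_clean <- keep_inliers(toy, c("a", "b"))
print(toy_clean)
```

Under these assumptions, a call such as `keep_inliers(GDP_5_stats, c("Happiness_index", "Inflation_rate"))` would play the role of the chained `filter()` calls.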

Note: We analyze a sample of 50 countries rather than the full dataset in order to mimic a real investigation, in which the data must first be collected. A sample of this size is large enough for the analysis while saving time when constructing the dataset.
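
The report does not show how `GDP_sample` was drawn. One way such a sample could be obtained, assuming simple random sampling without replacement, is sketched below. The seed, the stand-in data frame `toy_GDP`, and the name `toy_sample` are hypothetical and do not reproduce the sample used in this report.

```r
# Hypothetical seed, fixed only so that the toy example is reproducible
set.seed(2022)

# Stand-in for the full dataset: 150 made-up countries
toy_GDP <- data.frame(
  Country = paste0("Country_", 1:150),
  Happiness_index = round(runif(150, min = 2.4, max = 7.8), 1),
  stringsAsFactors = FALSE
)

# Draw 50 distinct rows at random (sampling without replacement)
toy_sample <- toy_GDP[sample(nrow(toy_GDP), size = 50), ]
print(nrow(toy_sample))
```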

II. MULTIPLE LINEAR REGRESSION

1. Introduction

Because we mostly focus on factors beyond GDP, this section will discuss five main elements: GDP per Capita, Corruption Index, Inflation Rate, Unemployment Rate, and Happiness Index.

From those rigorous analyses and calculations, we can conclude which factors greatly impact the happiness index. This analytical calculation is also utilized in real-life economic growth and social development situations.

Every country has its ups and downs. Hence, governments weigh many elements to work out which of them have the greater effect on their citizens’ happiness. Based on thorough research and calculation, leaders can decide where to invest more or less in order to give citizens a better quality of life and make them more content.

After all, every leader wishes the best for their country, and strong economic growth and societal outcomes, especially happiness, are no exception. By acting on such analysis, they can hope to improve the well-being of their citizens.

Governments can then build a suitable socio-economic model to guide those investment decisions.

2. Motivation and methodology

Because several factors interact in complex ways, analyzing the impact of each variable separately is time-consuming. Instead, we examine their influence on the Happiness index simultaneously, a task for which Multiple Linear Regression is well suited.

The technique builds a linear model in which the dependent variable \(y_i\) is the Happiness index and the independent variables Corruption index, Inflation rate, GDP per capita, and Unemployment rate are \(x_{i1}\), \(x_{i2}\), \(x_{i3}\), and \(x_{i4}\), respectively.

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} \]

Coefficients: The coefficients \(\beta_1\), \(\beta_2\), \(\beta_3\), and \(\beta_4\) indicate whether each independent variable is positively or negatively associated with the dependent variable when the other variables are held fixed. A positive coefficient means a positive association, and a negative coefficient a negative one.

p-value: Multiple Linear Regression computes a p-value for each independent variable, which is used in the hypothesis test for that variable’s coefficient.

Hypothesis: The null hypothesis \(H_0\) states that there is no significant relationship between the independent and dependent variables.

\[ H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \]

Hypothesis Test:

If we reject the null hypothesis for a variable (\(p_{\text{value}} < 0.1\)), that independent variable has a noticeable influence on the dependent variable.
Otherwise, if we accept the null hypothesis for a variable (\(p_{\text{value}} > 0.5\)), its influence on the dependent variable is considered insignificant, and we may eliminate that variable from the model.
If \(0.1 < p_{\text{value}} < 0.5\), we may still accept the null hypothesis and remove the variable from the model. However, there is not sufficient evidence to conclude that the variable has no impact on the dependent variable.
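
The three-way rule above can be written as a small function. This is only an illustrative sketch: the function name `classify_p_value` and its argument names are hypothetical, and the p-values are those reported in the first summary table of the next subsection.

```r
# Apply the report's decision rule to a vector of p-values
classify_p_value <- function(p, reject_below = 0.1, accept_above = 0.5) {
  ifelse(p < reject_below, "reject H0",
         ifelse(p > accept_above, "accept H0",
                "accept H0 (weak evidence)"))
}

# p-values of the four-variable model, copied from the first summary table
p_values <- c(Corruption_index = 0.0125, Inflation_rate = 0.3058,
              GDP_per_capita = 0.0046, Unemployment = 0.3237)
decisions <- classify_p_value(p_values)
print(decisions)
```

Both Inflation rate and Unemployment fall in the middle band, which is why they are dropped from the model without claiming that they have no effect.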

3. Interpretation of the tests

First, Multiple Linear Regression is applied with four independent variables (Corruption index, Inflation rate, GDP per capita, and Unemployment rate) and the Happiness index as the dependent variable. The regression model is as follows:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + \beta_4 x_{i4} \]

Dependent Variable:

\(y_i\): Happiness index

Independent Variables:

\(x_{i1}\): Corruption index
\(x_{i2}\): Inflation rate
\(x_{i3}\): GDP per capita
\(x_{i4}\): Unemployment rate

Null Hypothesis (\(H_0\)):

The null hypothesis states that there is no significant relationship between four independent variables: GDP per capita, Corruption Index, Inflation Rate, Unemployment rate, and the dependent variable Happiness Index.

\[ H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0 \]

To test this hypothesis, we will use the following R code to calculate the p-values:

#table test of multiple linear regression
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita + Unemployment,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Corruption index',
                             'Inflation rate','GDP per capita','Unemployment')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.7416	0.3201	14.8126	0.0000
Corruption index	0.0208	0.0079	2.6459	0.0125
Inflation rate	-0.0185	0.0178	-1.0407	0.3058
GDP per capita	0.0253	0.0083	3.0468	0.0046
Unemployment	-0.0248	0.0248	-1.0023	0.3237

From the results, we can accept the null hypothesis for two independent variables, Inflation rate (p-value 0.3058) and Unemployment rate (p-value 0.3237), so they are not needed in the model. Of the two, the Unemployment rate has the higher p-value and is therefore removed first. The model is then refitted with Corruption index, Inflation rate, and GDP per capita.

Regression Equation:

The regression equation is given by:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} \]

Dependent Variable:

\(y_i\): Happiness index

Independent Variables:

\(x_{i1}\): Corruption index
\(x_{i2}\): Inflation rate
\(x_{i3}\): GDP per capita

Null Hypothesis (\(H_0\)):

The null hypothesis states that there is no significant relationship between three independent variables (Corruption Index, Inflation Rate, GDP per capita) and the dependent variable (Happiness Index).

\[ H_0: \beta_1 = \beta_2 = \beta_3 = 0 \]

To test this hypothesis, we will use the following R code to calculate the p-values:

#multiple-linear regression without unemployment
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter','Corruption index',
                             'Inflation rate','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.6753	0.3132	14.9266	0.0000
Corruption index	0.0198	0.0078	2.5316	0.0163
Inflation rate	-0.0245	0.0167	-1.4678	0.1516
GDP per capita	0.0264	0.0082	3.2026	0.0030

The p-value for Inflation rate is now 0.15 (it has dropped from 0.31 after the Unemployment rate was removed from the model), but it still indicates that we may accept the null hypothesis, so the variable should be removed from the model. The next model is fitted with Corruption index and GDP per capita.

Regression Equation:

The regression equation is given by:

\[ y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} \]

Dependent Variable:

\(y_i\): Happiness index

Independent Variables:

\(x_{i1}\): Corruption index
\(x_{i2}\): GDP per capita

Null Hypothesis (\(H_0\)):

The null hypothesis states that there is no significant relationship between two independent variables (Corruption Index, GDP per capita) and the dependent variable (Happiness Index).

\[ H_0: \beta_1 = \beta_2 = 0 \]

To test this hypothesis, we will use the following R code to calculate the p-values:

#multiple-linear regression without inflation rate
model <- lm(Happiness_index ~ Corruption_index + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter',
                             'Corruption index','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.4054	0.2578	17.0863	0.000
Corruption index	0.0195	0.0079	2.4633	0.019
GDP per capita	0.0292	0.0081	3.5880	0.001

The final Multiple Linear Regression shows that the p-values of GDP per capita (0.001) and Corruption index (0.019) are small enough to reject the null hypothesis and to conclude that there is a significant relationship between these two variables and the Happiness index.
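
The three fits above follow a backward-elimination pattern: fit the model, drop the variable with the largest p-value if it is not below 0.1, and refit. A minimal sketch of that loop is given below. It runs on synthetic data rather than on the report’s dataset; the function `backward_eliminate`, the data frame `toy`, and its column names are hypothetical and introduced only for illustration.

```r
# Repeatedly refit the model, dropping the predictor with the largest p-value
# until every remaining p-value is below the threshold
backward_eliminate <- function(data, response, predictors, threshold = 0.1) {
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    p <- summary(fit)$coefficients[predictors, "Pr(>|t|)"]
    names(p) <- predictors
    if (max(p) < threshold || length(predictors) == 1) {
      return(fit)
    }
    predictors <- setdiff(predictors, names(p)[which.max(p)])
  }
}

# Synthetic data: y depends on x1 and x2 only; noise1 and noise2 are unrelated
set.seed(1)  # hypothetical seed
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- 5 + 0.8 * toy$x1 + 0.5 * toy$x2 + rnorm(n, sd = 0.5)

final_fit <- backward_eliminate(toy, "y", c("x1", "x2", "noise1", "noise2"))
kept <- names(coef(final_fit))[-1]
print(kept)
```

Doing the elimination by hand, as in this section, has the advantage that each intermediate table can be inspected and discussed.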

4. Conclusion

From the Multiple Linear Regression analysis, we reach two final conclusions about the collected dataset:

Inflation and Unemployment rates are not needed in the multiple linear regression model. However, there is insufficient evidence to conclude that their impact on the Happiness index is insignificant.
From the demonstrated hypothesis tests of Multiple Linear Regression, we can conclude that the Corruption index and the GDP per capita significantly influence the Happiness index. As a result, it is important to consider these two variables more deeply.

In the following sections, other techniques are applied to identify how GDP per capita and Corruption index individually affect a country’s happiness level.

III. ANOVA

1. Introduction

We have been taught, or at least assumed, “the wealthier, the merrier” by intuition. But is it actually true in reality? And how can we prove it explicitly?

To answer this question, our team tests whether the average Happiness Indices of the continents are equal, using one-factor analysis of variance (ANOVA). If there are differences among them, in other words if the null hypothesis \(H_0\) is rejected, we then apply Tukey’s multiple comparisons procedure (Tukey’s HSD) to construct confidence intervals for these differences. Accordingly, we form the following hypotheses:

\([H_{0}]\): \(𝝁_{1}\) = \(𝝁_{2}\) = \(𝝁_{3}\) = \(𝝁_{4}\) versus \([H_{A}]\): at least one pair of means differs

2. Detailed analysis

Hypothesis testing at a 5% (=0.05) level of significance: \(H_0\): \(𝝁_{1}\) = \(𝝁_{2}\) = \(𝝁_{3}\) = \(𝝁_{4}\) versus \(H_A\): at least one pair of means differs
\(𝝁_{1}\): the mean of Happiness Index for Africa
\(𝝁_{2}\): the mean of Happiness Index for America (Continent)
\(𝝁_{3}\): the mean of Happiness Index for Asia
\(𝝁_{4}\): the mean of Happiness Index for Europe

First, we construct a boxplot in R. The code is as follows:

#Take out a portion of dataset
SubsetGDP <- subset(GDP_sample, select = c("Continent", "Happiness_index"))

#Draw boxplot
par(mar = c(3, 4, 3, 1))
boxplot(SubsetGDP$Happiness_index ~ SubsetGDP$Continent, 
        xlab = "Continents", ylab = "Happiness Index",
        main = "Happiness Index in accordance to Continents")

At first glance, the plot suggests possible differences between the means. To make this explicit, we compute the ANOVA statistics in R. The code is as follows:

#One-way ANOVA
oneway <- aov(Happiness_index ~ Continent, 
              data = SubsetGDP)
anova_table <- summary(oneway)
SumSq <- anova_table[[1]]$'Sum Sq'
Df <- anova_table[[1]]$'Df'
MeanSq <- anova_table[[1]]$'Mean Sq'
F_value <- anova_table[[1]]$'F value'
p_value <- anova_table[[1]]$'Pr(>F)'
table <- cbind(Df, SumSq, MeanSq, F_value, p_value)

# Display the ANOVA summary as a table
colnames(table) <- c('Degree of freedom', 'Sum squared', 'Mean squared', 'F-value', 'p-value')
rownames(table) <-c('Continent','Residuals')
kable(table, caption = "ANOVA Summary Table")

ANOVA Summary Table
	Degree of freedom	Sum squared	Mean squared	F-value	p-value
Continent	3	21.30352	7.101172	10.89693	1.58e-05
Residuals	46	29.97668	0.651667	NA	NA

From the ANOVA table above, the p-value is far smaller than 5% (indeed far smaller than 1%), so the null hypothesis \(H_0\) is clearly rejected. Hence, there is sufficient evidence that the mean Happiness Index differs for at least one pair of continents.
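
The entries of the ANOVA table are linked by simple arithmetic: each mean square is the sum of squares divided by its degrees of freedom, and the F-value is the ratio of the two mean squares. The short sketch below recomputes the F-value and p-value from the sums of squares reported in the table; the variable names are hypothetical.

```r
# Values copied from the ANOVA summary table
ss_between <- 21.30352   # sum of squares for Continent
df_between <- 3
ss_within  <- 29.97668   # residual sum of squares
df_within  <- 46

ms_between <- ss_between / df_between   # mean square for Continent
ms_within  <- ss_within / df_within     # residual mean square
f_value    <- ms_between / ms_within    # F statistic

# Upper-tail probability of the F distribution with (3, 46) degrees of freedom
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)
print(c(F = f_value, p = p_value))
```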

3. Constructing confidence intervals

But we want to delve into the data further: for how many pairs of continents do the means \(𝝁_{i}\) differ, and by how much?
First, recall that in terms of GDP per capita, Europe is considered the richest, followed by Asia and America (Continent), while Africa is, as expected, the poorest of the four.
Next, to answer this question, our team conducts Tukey’s honestly significant difference test (also known as Tukey’s multiple comparisons procedure) on this dataset. In particular, we conduct the test at a 10% significance level, that is, with 90% confidence intervals. The R code is as follows:

#Tukey's HSD test
tukey <- TukeyHSD(oneway, conf.level=0.9)
table2 <- tukey$Continent
colnames(table2) <- c('Difference', 'Lower', 'Upper', 'p-value')
kable(table2, caption="Tukey's HSD Table")

Tukey’s HSD Table
	Difference	Lower	Upper	p-value
America-Africa	1.0000000	0.1607861	1.8392139	0.0353294
Asia-Africa	1.0187500	0.2919694	1.7455306	0.0096522
Europe-Africa	1.8461538	1.0842808	2.6080269	0.0000046
Asia-America	0.0187500	-0.7742326	0.8117326	0.9999369
Europe-America	0.8461538	0.0208890	1.6714187	0.0879329
Europe-Asia	0.8274038	0.1167760	1.5380317	0.0413377

However, the table is hard to read at a glance and makes further comparisons difficult. To make the results easier to interpret, we plot them. The R code is as follows:

#Confidence level plotting
par(mar = c(5, 7, 3, 1))
plot(tukey, las = 2)

Overall, the 90% confidence interval for the difference in Happiness Index between Asia and America (Continent) contains 0, while none of the other intervals does.
It is therefore plausible that the mean Happiness Indices of Asia and America are equal, implying that people in Asia and in America are comparably happy in general. The 90% confidence interval for the difference between Asia and America is:

\(𝝁_{3}\) - \(𝝁_{2}\) ∈ (-0.7742326, 0.8117326)

On the other hand, there is sufficient evidence at the 10% level that the mean Happiness Indices of Europe and America differ, suggesting that, in general, people in Europe and in America are not equally satisfied with their lives. The evidence is marginal, however: the p-value is 0.088 and the lower bound of the interval is close to 0. The 90% confidence interval for the difference between Europe and America is:

\(𝝁_{4}\) - \(𝝁_{2}\) ∈ (0.0208890, 1.6714187)

The most noticeable difference is between Europe and Africa. The 90% confidence interval for the difference between Europe and Africa is:

\(𝝁_{4}\) - \(𝝁_{1}\) ∈ (1.0842808, 2.6080269)

This result implies that people in Africa are generally considerably less happy than those in Europe. Given that the GDP per capita of Europe is approximately 20 times higher than that of Africa, leading to a significantly higher standard of welfare and a much better living environment, the result is understandable.
Similarly, the 90% confidence intervals for the differences between America and Africa, between Asia and Africa, and between Europe and Asia are, in turn:

\(𝝁_{2}\) - \(𝝁_{1}\) ∈ (0.1607861, 1.8392139)

\(𝝁_{3}\) - \(𝝁_{1}\) ∈ (0.2919694, 1.7455306)

\(𝝁_{4}\) - \(𝝁_{3}\) ∈ (0.1167760, 1.5380317)
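
The reading of the intervals above comes down to one check per pair: does the interval contain 0? As a small sketch, the check can be applied to the bounds copied from the Tukey’s HSD table. The data frame `tukey_ci` and the column names are hypothetical.

```r
# Lower and upper bounds copied from the Tukey's HSD table
tukey_ci <- data.frame(
  pair  = c("America-Africa", "Asia-Africa", "Europe-Africa",
            "Asia-America", "Europe-America", "Europe-Asia"),
  lower = c(0.1607861, 0.2919694, 1.0842808, -0.7742326, 0.0208890, 0.1167760),
  upper = c(1.8392139, 1.7455306, 2.6080269, 0.8117326, 1.6714187, 1.5380317),
  stringsAsFactors = FALSE
)

# An interval that contains 0 gives no evidence of a difference at the 10% level
tukey_ci$contains_zero <- tukey_ci$lower <= 0 & tukey_ci$upper >= 0
no_difference <- tukey_ci$pair[tukey_ci$contains_zero]
print(tukey_ci)
```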

4. Conclusion

All things considered, the analysis above suggests that our initial intuition is plausible, i.e. as the wealth of a continent increases, so do the overall happiness and well-being of its population. In particular, people in Europe are the happiest, followed by those in Asia and America, while Africa exhibits the lowest level of happiness among its citizens.

So far, the possible relationship between GDP per capita (the “richness”) and happiness (measured by the Happiness Index) rests on intuition and on continent-level comparisons. To put it on a firmer footing, we next apply the non-linear regression technique as follows.

IV. NON-LINEAR REGRESSION

1. Introduction

We have compared GDP per capita and the happiness index among the four continents. To understand this connection further, however, we need a model of the relationship between the two variables. Our goal is to build a mathematical model and use it to investigate the relationship. First, we set up hypotheses to test whether our assumed formula is plausible enough to consider further.

Two-sided hypotheses:

\(H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0\)

Then, we check how well this model fits, so that we can use it to predict the happiness index, and hence gauge a population’s contentment with its quality of life, when the GDP per capita is known.

2. Detailed analysis

GDP per capita: an economic metric that breaks down a country’s economic output per person, calculated by dividing the country’s total GDP by its total population. Economists often use it to gauge the prosperity of a nation.

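#Remove GDP per capita outliers with the 1.5*IQR rule
#Note: R names are case-sensitive, so GDP_No_Outliers is a different object from
#GDP_no_outliers above; it is assumed to exist already (its construction is not shown)
IQR7 <- IQR(GDP_No_Outliers$GDP_per_capita)
Lower_limit7 <- quantile(GDP_No_Outliers$GDP_per_capita, probs = 0.25) - 1.5*IQR7
Upper_limit7 <- quantile(GDP_No_Outliers$GDP_per_capita, probs =0.75) + 1.5*IQR7

GDP_NO_outliers <- subset(GDP_No_Outliers, GDP_per_capita>Lower_limit7 & GDP_per_capita<Upper_limit7)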

*Here, GDP per capita is the independent variable and the happiness index is the dependent variable.

First of all, we take a general look at the relationship between happiness index and GDP per capita:

ggplot(GDP_NO_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'loess', formula = y ~ x, color = 'blue4', linewidth = 1.5, span=1.2) +
  labs(
    title = 'Relationship between happiness index and GDP per capita',
    y = 'Happiness index', x = 'GDP per capita',
    subtitle = 'Happiness index: scale of 10 | GDP per capita: thousand USD')

From the graph above, there is an obvious connection between happiness index and GDP per capita, and it appears to be non-linear. More specifically, the curve resembles the plot of the function \(y = m + \log(x)\ (x>0)\); therefore, we assume that the formula relating y = happiness index and x = GDP per capita is \(y_i=\beta_0 + \beta_1 \log(x_i)\). It is then necessary to estimate \(\beta_0\) and \(\beta_1\) and to assess how strong this relationship is.

model_non <- nls(Happiness_index ~ b + a*log(GDP_per_capita), 
             data = GDP_NO_outliers, start = list(a=1, b=1))
summary(model_non)

## 
## Formula: Happiness_index ~ b + a * log(GDP_per_capita)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  0.54147    0.06954   7.786 7.07e-10 ***
## b  4.42452    0.16106  27.472  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6151 on 45 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.205e-08

summary_table_non <- summary(model_non)$parameters
rownames(summary_table_non) <- c('Beta 1 (a)', 'Beta 0 (b)')
colnames(summary_table_non) <- c('Estimate', 'Standard Error', 't-value', 'p-value')
kable(summary_table_non, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Beta 1 (a)	0.5415	0.0695	7.7864	0
Beta 0 (b)	4.4245	0.1611	27.4717	0

From the table above, \(\beta_0=4.4245\) and \(\beta_1=0.5415\). Therefore, the fitted model is \(y_i=4.4245 + 0.5415\log(x_i)\), where \(\log\) is the natural logarithm, and it represents a positive relationship. The p-value for \(\beta_1\) is essentially 0 (7.07e-10), so we reject \(H_0: \beta_1 = 0\). Furthermore, the achieved convergence tolerance is very small (~0), which indicates that the estimation algorithm converged reliably; the quality of the fit itself is reflected in the residual standard error of 0.6151.

Hence, we can now use this model to predict the happiness index of a nation when its GDP per capita is known. Let us take Vietnam and China as examples, with GDP per capita of 4.2 and 12.7 (thousand USD), respectively.
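
Although the curve is non-linear in x, the model \(y_i=\beta_0 + \beta_1 \log(x_i)\) is linear in its parameters, so an ordinary `lm()` fit on \(\log(x)\) should give the same estimates as `nls()`. The sketch below illustrates this on synthetic data, not on the report’s dataset; the seed, the data frame `toy`, and its column names are hypothetical.

```r
# Synthetic data generated from a log-shaped relationship plus noise
set.seed(42)  # hypothetical seed
toy <- data.frame(gdp = runif(60, min = 0.5, max = 50))
toy$happiness <- 4.4 + 0.54 * log(toy$gdp) + rnorm(60, sd = 0.6)

# Same model fitted two ways
fit_nls <- nls(happiness ~ b + a * log(gdp), data = toy,
               start = list(a = 1, b = 1))
fit_lm  <- lm(happiness ~ log(gdp), data = toy)

# Put intercept and slope side by side for comparison
estimates <- rbind(nls = coef(fit_nls)[c("b", "a")],
                   lm  = unname(coef(fit_lm)))
colnames(estimates) <- c("Beta 0", "Beta 1")
print(estimates)
```

Because the model is linear in the parameters, `nls()` needs very few iterations, which is consistent with the single iteration reported in the output above.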

new_gdppercapita <- data.frame(GDP_per_capita = c(4.2,12.7))
#predict() for nls models currently ignores 'interval' and 'level', so only point predictions are returned
prediction_gdppercapita <- predict(model_non, newdata = new_gdppercapita)
GDP_per_cap <- new_gdppercapita$GDP_per_capita
Vietnam <- prediction_gdppercapita[1]
China <- prediction_gdppercapita[2]
predicted_happiness_bygdp <- rbind(Vietnam, China)
predicted_happiness_bygdp_table <- cbind(GDP_per_cap, predicted_happiness_bygdp)
colnames(predicted_happiness_bygdp_table) <- c('GDP per capita selected', 'Happiness index predicted')
rownames(predicted_happiness_bygdp_table) <- c('Viet Nam', 'China')
kable(predicted_happiness_bygdp_table, caption='Happiness Index Predicted of the Country Selected', digits = c(1,4), align = 'cc')

Happiness Index Predicted of the Country Selected
	GDP per capita selected	Happiness index predicted
Viet Nam	4.2	5.2016
China	12.7	5.8007

According to the table, the predicted happiness indices of Vietnam and China are about 5.2 and 5.8, respectively. These predictions are reasonably close to the observed values: Vietnam’s happiness index is 5.5 and China’s is 5.6.
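
As a check, the two predictions can be reproduced by hand from the fitted coefficients. The calculation below uses the rounded estimates and the natural logarithm, so the results agree with the table to two decimal places.

```latex
\[ \hat{y}_{\text{Viet Nam}} = 4.4245 + 0.5415 \log(4.2) \approx 4.4245 + 0.5415 \times 1.4351 \approx 5.20 \]
\[ \hat{y}_{\text{China}} = 4.4245 + 0.5415 \log(12.7) \approx 4.4245 + 0.5415 \times 2.5416 \approx 5.80 \]
```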

3. Conclusion

In conclusion, countries with a higher GDP per capita (in other words, more prosperous countries) tend to provide better welfare for their citizens, who in turn are more content with their lives. In addition, because the relationship between the two variables follows a curve similar to the graph of \(y=\log(x)\), countries with a lower GDP per capita gain a larger increase in happiness from the same increase in GDP per capita. This result is broadly consistent with the article “Beyond GDP: Economics and Happiness” published in the Berkeley Economic Review (a non-profit publication of the University of California that aims to foster undergraduate writing and research on economic issues) (Appendix). According to this publication, there is a positive relationship between GDP per capita and happiness index, and a 1% change in GDP per capita is associated with about a 0.3 unit change in happiness.
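
The diminishing-returns statement follows directly from the form of the fitted model. As a short derivation, with \(\beta_1 = 0.5415\) taken from the table of estimates and \(\log\) denoting the natural logarithm:

```latex
\[ y = \beta_0 + \beta_1 \log(x) \quad \Rightarrow \quad \frac{dy}{dx} = \frac{\beta_1}{x}, \qquad y(x_2) - y(x_1) = \beta_1 \log\left(\frac{x_2}{x_1}\right) \]
\[ y(2x) - y(x) = \beta_1 \log 2 \approx 0.5415 \times 0.6931 \approx 0.375 \]
```

The slope \(\beta_1 / x\) shrinks as \(x\) grows, so the same absolute increase in GDP per capita adds more happiness in a poorer country. Under this model, doubling GDP per capita raises the predicted happiness index by about 0.375 points, whatever the starting level.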

V. LINEAR REGRESSION

1. Introduction

The third method we used in the report is linear regression with the aim to build a mathematical model of two variables and investigate further relationships between them. Firstly, we formed the testing hypothesis to check whether there were any connections between 2 variables nor not:

Two-sided hypothesis:

\([H_0]:\ \beta_1=0\ \ \ versus\ \ \ [H_A]:\ \beta_1\neq 0\)

Furthermore, we investigated the relationship between the happiness index and the corruption index across countries worldwide to test whether citizens of nations with less corruption (in both state and non-state organizations) experience a better quality of life. In addition, we investigated how well this relationship holds for each continent.

2. Analysis

Happiness index: The happiness index is measured through a large survey of people in each country, on a scale from 0 to 10. It expresses how content a nation’s citizens are with their quality of life and with general issues affecting their community.
Corruption index: The corruption index, or Corruption Perceptions Index (CPI), ranks countries and territories worldwide by their perceived levels of public sector corruption, with scores ranging from 0 (highly corrupt) to 100 (very clean).

Here, the corruption index is the independent variable and the happiness index is the dependent variable.

First, we set up a two-sided hypothesis test to obtain an overview of this relationship:

\(\ \ \ H_0: \beta_1 = 0\ \ \ versus \ \ \ H_A: \beta_1 \neq 0\)

where \(\beta_1\) is the slope parameter.

Next, we compute the p-value of this test, which gives a first indication of whether the happiness index and the corruption index are related. We use the following code to obtain it:

model1 <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers) #Intercept parameter & Slope parameter
summary_table <- summary(model1)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Slope parameter')
summary(model1)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index, data = GDP_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.27101 -0.46493  0.07322  0.43154  0.88379 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.991542   0.266856  14.958  < 2e-16 ***
## Corruption_index 0.040705   0.006141   6.628 1.16e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5997 on 35 degrees of freedom
## Multiple R-squared:  0.5566, Adjusted R-squared:  0.5439 
## F-statistic: 43.93 on 1 and 35 DF,  p-value: 1.16e-07

kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	3.9915	0.2669	14.9576	0
Slope parameter	0.0407	0.0061	6.6280	0

The output above shows that the p-value for the slope is very small (\(1.16 \times 10^{-7}\), displayed as 0 in the table after rounding), which means the null hypothesis (\(H_0\)) is implausible; in other words, the data provide strong evidence that the slope parameter is non-zero. Therefore, there is a statistically significant linear relationship between the happiness index and the corruption index.
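The test statistic and p-value in the `summary(model1)` output can be reproduced by hand from the reported estimate and standard error. The sketch below assumes only the rounded values printed in that output; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the summary(model1) output above
slope_estimate <- 0.040705   # Estimate for Corruption_index
slope_se       <- 0.006141   # Std. Error for Corruption_index
residual_df    <- 35         # residual degrees of freedom

# t statistic for H0: beta_1 = 0
t_value <- slope_estimate / slope_se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(abs(t_value), df = residual_df, lower.tail = FALSE)

# In simple linear regression the F statistic equals the squared t statistic
f_value <- t_value^2

print(c(t = t_value, p = p_value, F = f_value))
```

Up to rounding, these should match the t value, `Pr(>|t|)`, and F-statistic that R reports, which is a useful consistency check on the output.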

However, we need to investigate further to understand how strong this relationship is and how this model can be used to forecast the happiness index of a country when its corruption score is known.

ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
  labs(
    title = 'Relationship between happiness index and corruption index',
    y = "Happiness index", x = 'Corruption index',
    subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100'
  ) +
  theme_minimal()

From the graph above, we can observe a clear positive correlation between the corruption index and the happiness index, and a linear relationship between the two indexes appears reasonable. In particular, the summary output above gives \(R^2=0.5566\), which indicates a moderately strong relationship between the corruption index and the happiness index.
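To put the coefficient of determination in more familiar terms, it can be converted into a correlation coefficient, and the adjusted value can be recomputed as a check. This is a minimal sketch that assumes the figures printed in the `summary(model1)` output (Multiple R-squared 0.5566 on 35 residual degrees of freedom); the variable names are illustrative.

```r
r_squared <- 0.5566   # Multiple R-squared from the summary output
n_obs     <- 35 + 2   # residual df plus the two estimated parameters

# With a single predictor, R-squared is the squared Pearson correlation;
# the positive root is taken because the estimated slope is positive
correlation <- sqrt(r_squared)

# Adjusted R-squared for a model with one predictor
adj_r_squared <- 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - 2)

print(round(c(correlation = correlation, adj_r_squared = adj_r_squared), 4))
```

The implied correlation is roughly 0.75, and the recomputed adjusted value agrees with the 0.5439 shown in the output.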

Furthermore, we can write down an explicit formula for this relationship using the slope parameter \(\beta_1\) and the intercept parameter \(\beta_0\) estimated above. As we can see in the table, the estimated slope parameter is \(\beta_1 = 0.0407\) and the estimated intercept parameter is \(\beta_0 = 3.9915\). Therefore, the fitted simple linear regression model is \(y_i=3.9915\ +\ 0.0407*x_i\); the data values \((x_i, y_i)\) lie closer to this line as the error variance decreases.
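The fitted line can be applied by hand to any corruption score. The sketch below hard-codes the coefficients printed in the `summary(model1)` output (3.991542 and 0.040705); `predict_happiness` is a hypothetical helper introduced only for illustration and is not part of the original analysis.

```r
# Coefficients copied from the summary(model1) output above
intercept_estimate <- 3.991542
slope_estimate     <- 0.040705

# Hypothetical helper: fitted happiness index for a given corruption index
predict_happiness <- function(corruption_index) {
  intercept_estimate + slope_estimate * corruption_index
}

# Corruption index values used for Vietnam and the USA in this section
fitted_values <- predict_happiness(c(`Viet Nam` = 42, `the USA` = 69))
print(round(fitted_values, 4))
```

Up to rounding of the coefficients, these values should agree with the fitted values that `predict()` returns for the same inputs.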

In addition, we can construct a prediction interval for the happiness index at a particular value of the corruption index. We use the following code to find a two-sided prediction interval at the 80% confidence level.

For example, consider corruption index values of 42 (Vietnam’s corruption index) and 69 (the corruption index of the USA). Running the code gives the following result:

model_prediction <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers)
new_corruption_index <- data.frame(Corruption_index = c(42,69))
prediction_happiness <- predict(model_prediction, interval = "prediction", newdata = new_corruption_index, level = 0.80)

Corruption_index = new_corruption_index$Corruption_index
Predicted_happiness = prediction_happiness[, 1]
Lower_CI = prediction_happiness[, 2]
Upper_CI = prediction_happiness[, 3]

summary_happiness <- cbind(Corruption_index, Lower_CI, Predicted_happiness, Upper_CI)
colnames(summary_happiness) <- c('Corruption index selected', 'Lower', 'Fit', 'Upper')
rownames(summary_happiness) <- c('Viet Nam', 'the USA')
kable(summary_happiness, caption='Prediction Interval of the Country Selected', digits = c(1,4,4,4), align = 'cccc')

Prediction Interval of the Country Selected
	Corruption index selected	Lower	Fit	Upper
Viet Nam	42	4.9072	5.7011	6.4951
the USA	69	5.9738	6.8002	7.6265
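The bounds in this table follow the standard prediction-interval formula for simple linear regression, which `predict()` applies when `interval = "prediction"` is requested. The notation below is an assumption made for illustration: \(x^*\) is the selected corruption index, \(\hat{y}^*\) the fitted value, \(s\) the residual standard error, and \(n\) the number of observations.

```latex
\begin{align*}
% Two-sided 100(1 - \alpha)% prediction interval at x = x^*
\hat{y}^* \;\pm\; t_{\alpha/2,\,n-2}\; s\,
  \sqrt{\,1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,}
\end{align*}
```

Here the level of 0.80 corresponds to \(\alpha = 0.20\), with \(n - 2 = 35\) degrees of freedom and \(s = 0.5997\) taken from the model summary. The leading 1 under the square root accounts for the variability of a single new observation, which is why a prediction interval is wider than a confidence interval for the mean response.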

The result is plausible, as Vietnam’s observed happiness index is 5.5 points and the figure for the USA is 7.0 points, both of which fall within the corresponding prediction intervals. The results are illustrated in the graph below:

ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
  labs(title = 'Relationship between happiness index and corruption index',
    y = "Happiness index", x = 'Corruption index',
    subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100') +
  # 80% prediction intervals as vertical segments; annotate() draws each mark once
  # instead of once per data row, which avoids overplotted text and lines
  annotate("segment", x = 42, xend = 42, y = prediction_happiness[1, 2], yend = prediction_happiness[1, 3],
    linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 42, y = prediction_happiness[1, 3] + 0.35, label = "Vietnam", size = 5) +
  annotate("segment", x = 69, xend = 69, y = prediction_happiness[2, 2], yend = prediction_happiness[2, 3],
    linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 69, y = prediction_happiness[2, 3] + 0.25, label = "the USA", size = 5) +
  # Dashed reference line at the global average happiness index
  geom_hline(yintercept = 5.54,
    linetype = "dashed", color = "green4", linewidth = 1) +
  annotate("text", x = 72, y = 5.35, label = "Global average level", size = 5, color = "green4")

The prediction intervals of the two nations are shown by the two vertical red lines in the graph above. From this visualization, we can observe that the USA’s happiness index, which is predicted with 80% confidence to lie between around 6.0 and 7.6 points, is noticeably higher than the global average level of 5.54 points. Meanwhile, the fitted value for Vietnam (about 5.7 points) is close to the global average.
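The comparison with the global average can also be checked numerically. This is a small sketch using the rounded interval bounds from the prediction table and the 5.54 value used for the dashed line; the data frame and column names are illustrative.

```r
# Interval bounds copied from the prediction table above
intervals <- data.frame(
  country = c("Viet Nam", "the USA"),
  lower   = c(4.9072, 5.9738),
  upper   = c(6.4951, 7.6265)
)
global_average <- 5.54  # value used for the dashed line in the plot

# Does each 80% prediction interval contain the global average?
intervals$contains_average <- intervals$lower <= global_average &
  global_average <= intervals$upper

# Does the whole interval sit above the global average?
intervals$entirely_above <- intervals$lower > global_average

print(intervals)
```

Vietnam’s interval contains the global average, while the interval for the USA lies entirely above it.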

3. Conclusion

In conclusion, there is a clear positive relationship between the corruption index and the happiness index. More specifically, countries with higher CPI scores tend to have higher happiness scores; in other words, citizens of nations with more transparent political systems and more effective control of corruption tend to experience a better quality of life. These results are consistent with the paper “The Most Influential Factors in Determining the Happiness of Nations” by Julie Lang (University of Northern Iowa). According to that study, the level of corruption plays a noticeable role (Appendix) in determining the life satisfaction of a country’s citizens: the better the control of corruption, the higher the life satisfaction index.

VI. APPENDIX

1. Appendix 1: Data set

2. Appendix 2: Beyond GDP: Economics and Happiness

3. Appendix 3: The Most Influential Factors in Determining the Happiness of Nations