library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(ggplot2)
library(dplyr)
library(lattice)
library(knitr)

#Import data
GDP <- read_csv("F:/VGU/ACADEMIC YEAR 1/OSTA/GDP.csv",show_col_types = FALSE)

#Sample data:
sample_data <- read_csv("F:/VGU/ACADEMIC YEAR 1/OSTA/GDP_sample.csv",show_col_types = FALSE)

#Change columns' names
colnames(sample_data)[5] <- "Inflation_rate"
colnames(sample_data)[6] <- "GDP_growth_rate"
colnames(sample_data)[7] <- "Happiness_index"
colnames(sample_data)[8] <- "GDP_per_capita"
colnames(sample_data)[10] <- "Corruption_index"

#Remove outliers (1.5*IQR rule)
IQR_happiness <- IQR(sample_data$Happiness_index) #named so it does not mask the IQR() function
Lower_limit <- quantile(sample_data$Happiness_index, probs = 0.25) - 1.5*IQR_happiness
Upper_limit <- quantile(sample_data$Happiness_index, probs = 0.75) + 1.5*IQR_happiness

GDP_no_outliers <- subset(sample_data, Happiness_index > Lower_limit & Happiness_index < Upper_limit)
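
The outlier rule used above can be sketched as a small reusable function. This is only an illustration: the helper name `remove_iqr_outliers` and the toy scores are assumptions made for the example and are not part of the GDP dataset.

```r
# Hypothetical helper: keep the rows whose value in `column` lies strictly
# between Q1 - k*IQR and Q3 + k*IQR (the same rule as in the code above).
remove_iqr_outliers <- function(df, column, k = 1.5) {
  values <- df[[column]]
  q <- quantile(values, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]            # interquartile range
  lower <- q[1] - k * spread
  upper <- q[2] + k * spread
  df[values > lower & values < upper, , drop = FALSE]
}

# Made-up scores for illustration only; 12.0 is an obvious outlier.
toy <- data.frame(id = 1:7,
                  score = c(5.1, 5.4, 5.6, 5.8, 6.0, 6.2, 12.0))
toy_clean <- remove_iqr_outliers(toy, "score")
print(toy_clean)
```

Wrapping the rule in a function makes it possible to apply the same filter to several columns without overwriting intermediate limits.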

Introduction:

Intuition suggests “the richer, the happier”, but does this hold in practice? To answer this question, our team tests whether the mean Happiness Index is equal across the continents under study, using one-factor analysis of variance (ANOVA). If the means differ, that is, if the null hypothesis \(H_0\) is rejected, we then apply the Tukey multiple comparisons procedure (Tukey’s HSD) to construct confidence intervals for the pairwise differences. The hypotheses are as follows:

\(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_A\): at least one pair of means differs

Formulas: see Appendix.

Data analysis:
Hypothesis testing at a 10% (\(\alpha = 0.1\)) level of significance: \(H_0: \mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_A\): at least one pair of means differs, where
\(\mu_1\): the mean Happiness Index for America (continent)
\(\mu_2\): the mean Happiness Index for Europe
\(\mu_3\): the mean Happiness Index for Africa
\(\mu_4\): the mean Happiness Index for Asia

First, we draw a boxplot of the Happiness Index for each continent. The R code is as follows:

library(tidyverse)
library(knitr)


#Importing dataset (the absolute path makes a separate setwd() call unnecessary)
GDP <- read.csv("F:/VGU/ACADEMIC YEAR 1/OSTA/GDP.csv") #data input

#Keep only the two columns needed for the ANOVA
SubsetGDP <- subset(GDP, select = c("Continent", "Happiness.index"))

#Draw boxplot
par(mar = c(3, 4, 3, 1))
boxplot(Happiness.index ~ Continent, data = SubsetGDP,
        xlab = "Continents", ylab = "Happiness Index",
        main = "Happiness Index by Continent")

#One-way ANOVA
oneway <- aov(Happiness.index ~ Continent, 
              data = SubsetGDP)
anova_table <- summary(oneway)

SumSq <- anova_table[[1]]$'Sum Sq'
Df <- anova_table[[1]]$'Df'
MeanSq <- anova_table[[1]]$'Mean Sq'
F_value <- anova_table[[1]]$'F value'
p_value <- anova_table[[1]]$'Pr(>F)'
#Named anova_summary so that it does not mask the base function table()
anova_summary <- cbind(Df, SumSq, MeanSq, F_value, p_value)

# Display the ANOVA summary as a table
colnames(anova_summary) <- c('Degree of freedom', 'Sum squared', 'Mean squared', 'F-value', 'p-value')
kable(anova_summary, caption = "ANOVA Summary Table")

ANOVA Summary Table
Degree of freedom	Sum squared	Mean squared	F-value	p-value
3	76.73511	25.578369	41.91498	0
146	89.09563	0.610244	NA	NA

From the ANOVA table above, the p-value is far smaller than 10% (indeed far smaller than 1%; it is displayed as 0 only because of rounding), so the null hypothesis \(H_0\) is rejected. Hence, there is sufficient evidence that the mean Happiness Index differs for at least one pair of continents.
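
As a quick check, the F statistic and its p-value can be recomputed from the sums of squares and degrees of freedom reported in the ANOVA table. The sketch below only re-uses the table values; the variable names are chosen for the example.

```r
# Values copied from the ANOVA summary table above.
ss_between <- 76.73511   # sum of squares between continents
df_between <- 3
ss_within  <- 89.09563   # residual (within-continent) sum of squares
df_within  <- 146

ms_between <- ss_between / df_between   # mean square between
ms_within  <- ss_within / df_within     # mean square error
f_stat     <- ms_between / ms_within

# Upper-tail probability of the F distribution with (3, 146) degrees of freedom.
p_val <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("F =", round(f_stat, 3), "\n")
cat("p =", format(p_val, scientific = TRUE, digits = 3), "\n")
```

Printing the p-value in scientific notation makes clear that it is tiny rather than exactly zero.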

Constructing confidence intervals:

But for how many pairs of continents do the means \(\mu_i\) differ, and by how much? To answer this question, we conduct Tukey’s honestly significant difference (HSD) test on this dataset. The R code is as follows:

#Tukey's HSD test
tukey <- TukeyHSD(oneway, conf.level=0.90)
table2 <- tukey$Continent
colnames(table2) <- c('Difference', 'Lower', 'Upper', 'p-value')
kable(table2, caption="Tukey's HSD Table")

Tukey’s HSD Table
	Difference	Lower	Upper	p-value
America-Africa	1.3732194	0.9210694	1.8253693	0.0000000
Asia-Africa	0.8862254	0.4868649	1.2855859	0.0000054
Europe-Africa	1.9037523	1.4997857	2.3077190	0.0000000
Asia-America	-0.4869940	-0.9304572	-0.0435307	0.0581966
Europe-America	0.5305330	0.0829172	0.9781487	0.0344875
Europe-Asia	1.0175269	0.6233073	1.4117465	0.0000001
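
One way to read the table is to flag the intervals that exclude 0. The sketch below re-enters the interval endpoints from the table by hand; the data frame and column names are assumptions made for the example.

```r
# Interval endpoints copied from the Tukey's HSD table above.
tukey_tbl <- data.frame(
  pair  = c("America-Africa", "Asia-Africa", "Europe-Africa",
            "Asia-America", "Europe-America", "Europe-Asia"),
  lower = c(0.9210694, 0.4868649, 1.4997857, -0.9304572, 0.0829172, 0.6233073),
  upper = c(1.8253693, 1.2855859, 2.3077190, -0.0435307, 0.9781487, 1.4117465)
)

# An interval excludes 0 exactly when both endpoints have the same sign.
tukey_tbl$excludes_zero <- tukey_tbl$lower * tukey_tbl$upper > 0
print(tukey_tbl)
```

A pair flagged `TRUE` has a mean difference that is significant at the 10% level, since the intervals were built with `conf.level = 0.90`.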

However, the table alone makes it hard to compare the intervals at a glance. To make the results easier to read, we plot them. The R code is as follows:

#Confidence level plotting
par(mar = c(5, 7, 3, 1))
plot(tukey, las = 2)

Overall, none of the six 90% confidence intervals contains 0, which implies that the mean Happiness Index differs between every pair of continents at the 10% significance level.

A comparison of particular interest is the one between Asia and America, the two richest continents (measured by total GDP). The 90% confidence interval for the difference between Asia and America is:

\(\mu_4 - \mu_1 \in (-0.93045722, -0.04353072)\)

The result implies that people in Asia are, on average, less happy than those in America. One possible explanation is the large cultural gap between the two: Western (in this case American) perspectives on creativity tend to emphasize the traits of creative individuals, whereas Eastern (in this case Asian) concepts center more on social aspects, such as teamwork and having support from others (Zotero).

Conclusion:

All things considered, the analysis above is consistent with our initial intuition that, as the wealth of a continent increases, so does the overall happiness and well-being of its population. Furthermore, people in Europe are the happiest on average, followed by those in America. Asia has a lower mean Happiness Index, and Africa has the lowest of the four continents.

Linear & Non-Linear Regression

The third method used in this report is linear regression, which builds a mathematical model of two variables so that we can investigate the relationship between them. First, we form the hypotheses to test whether there is any linear relationship between the two variables:

Two-sided hypothesis:

\(H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0\)

We investigate the relationship between the happiness index and the corruption index across countries to test whether citizens of nations with less corruption (in both state and non-state organizations) experience a better quality of life. In addition, we investigate how well this relationship holds for each continent.

1. Relationship between Happiness index and Corruption index

Happiness index: the happiness index is measured through a large survey of each country’s population, on a scale from 0 to 10. It expresses how content citizens are with their quality of life and with general issues in their community.
Corruption index: the corruption index, or Corruption Perceptions Index (CPI), ranks countries and territories worldwide by their perceived levels of public sector corruption, with scores ranging from 0 (highly corrupt) to 100 (very clean).

Here, the corruption index is the independent variable and the happiness index is the dependent variable.

First of all, we set up a two-sided hypothesis test to obtain an overview of this relationship:

\(H_0: \beta_1 = 0 \quad \text{versus} \quad H_A: \beta_1 \neq 0\)

where \(\beta_1\) is the slope parameter.

Next, we compute the p-value of this test, which gives a first indication of whether the happiness index and the corruption index are related. We use the following code:

model1 <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers) #Intercept parameter & Slope parameter
summary_table <- summary(model1)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Slope parameter')
summary(model1)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index, data = GDP_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8240 -0.4516  0.1294  0.5073  1.0067 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.800112   0.236255  16.085  < 2e-16 ***
## Corruption_index 0.042663   0.005177   8.241 9.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6652 on 48 degrees of freedom
## Multiple R-squared:  0.5859, Adjusted R-squared:  0.5772 
## F-statistic: 67.91 on 1 and 48 DF,  p-value: 9.56e-11

kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	3.8001	0.2363	16.0848	0
Slope parameter	0.0427	0.0052	8.2405	0

The output above shows that the p-value for the slope is very small (\(9.56 \times 10^{-11}\), displayed as 0 in the rounded table), so the null hypothesis \(H_0\) is implausible; in other words, the slope parameter is non-zero. Therefore, there is strong statistical evidence of a linear relationship between the happiness index and the corruption index.
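
The t statistic and two-sided p-value for the slope can be reproduced from the estimate, its standard error and the residual degrees of freedom in the summary output. This is a minimal sketch that re-uses only the printed numbers; the variable names are chosen for the example.

```r
# Values copied from the summary(model1) output above.
slope_estimate  <- 0.042663
slope_std_error <- 0.005177
df_residual     <- 48

# Test statistic for H0: beta_1 = 0.
slope_t <- slope_estimate / slope_std_error

# Two-sided p-value from the t distribution with 48 degrees of freedom.
slope_p <- 2 * pt(-abs(slope_t), df = df_residual)

cat("t =", round(slope_t, 3), "\n")
cat("p =", format(slope_p, scientific = TRUE, digits = 3), "\n")
```

Because the inputs are rounded, the recomputed values may differ from the printed output in the last digit.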

However, we need to investigate further to understand how strong this relationship is and how we can use this model to predict the happiness index of a country from its corruption score.

ggplot(GDP_no_outliers, aes(y=Happiness_index, x=Corruption_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='lm', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and corruption index', 
       y="Happiness index", x='Corruption index',
       subtitle='Happiness index: scale of 10 | Corruption index: scale of 100', caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

The graph above shows a clear positive correlation between the corruption index and the happiness index, and the relationship appears to be linear. In particular, \(R^2 = 0.5859\) from the results above means that the corruption index explains about 59% of the variation in the happiness index, which indicates a moderately strong relationship.
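
For simple linear regression, \(R^2\) is tied to both the sample correlation and the overall F-statistic, so the reported values can be cross-checked by hand. The short calculation below uses only numbers from the summary output (with \(n - 2 = 48\) residual degrees of freedom) and takes the positive root because the fitted slope is positive:

```latex
\[
r = +\sqrt{R^2} = \sqrt{0.5859} \approx 0.765,
\qquad
F = \frac{R^2}{1 - R^2}\,(n - 2) = \frac{0.5859}{0.4141} \times 48 \approx 67.91 .
\]
```

The result agrees with the F-statistic of 67.91 on 1 and 48 degrees of freedom printed by `summary(model1)`.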

Next, we write down the fitted equation. From the table, the slope parameter is \(\beta_1 = 0.0427\) and the intercept parameter is \(\beta_0 = 3.8001\). Therefore, the fitted simple linear regression line is \(y_i = 3.8001 + 0.0427 x_i\), and the data values \((x_i, y_i)\) lie closer to this line as the error variance decreases.
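
To illustrate how the fitted line is used, the sketch below evaluates it at CPI scores of 42 and 69, the two values this report uses for its prediction example. The function name `predict_happiness` is hypothetical; the coefficients are the unrounded estimates from the summary output.

```r
# Coefficients copied from the summary(model1) output above.
beta0 <- 3.800112   # intercept
beta1 <- 0.042663   # slope: change in happiness per CPI point

# Hypothetical helper: point prediction from the fitted line.
predict_happiness <- function(cpi) beta0 + beta1 * cpi

point_predictions <- predict_happiness(c(42, 69))
print(round(point_predictions, 4))
```

These are point predictions only; they say nothing yet about the uncertainty around an individual country.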

In addition, we can construct a prediction interval for the happiness index at a particular value of the corruption index. We use the following code to find a two-sided prediction interval at the 80% level.

For example, we take two corruption index values: 42 (Vietnam’s corruption index) and 69 (the corruption index of the USA). The code gives the following result:

model_prediction <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers)
new_corruption_index <- data.frame(Corruption_index = c(42,69))
prediction_happiness <- predict(model_prediction, interval = "prediction", newdata = new_corruption_index, level = 0.80)

Corruption_index = new_corruption_index$Corruption_index
Predicted_happiness = prediction_happiness[, 1]
Lower_CI = prediction_happiness[, 2]
Upper_CI = prediction_happiness[, 3]

summary_happiness <- cbind(Corruption_index, Lower_CI, Predicted_happiness, Upper_CI)
colnames(summary_happiness) <- c('Corruption index selected', 'Lower', 'Fit', 'Upper')
rownames(summary_happiness) <- c('Viet Nam', 'the USA')
kable(summary_happiness, caption='80% Prediction Intervals for the Selected Countries', digits = c(1,4,4,4), align = 'cccc')

80% Prediction Intervals for the Selected Countries
	Corruption index selected	Lower	Fit	Upper
Viet Nam	42	4.7190	5.5920	6.4649
the USA	69	5.8521	6.7439	7.6357
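
The intervals above are prediction intervals (for a single new country), which are wider than confidence intervals for the mean response at the same value. The sketch below shows the difference on R's built-in `cars` dataset, used here purely as a stand-in because the GDP data are not reproduced in this report.

```r
# Stand-in example on the built-in `cars` data (not the GDP dataset).
fit <- lm(dist ~ speed, data = cars)
new_speed <- data.frame(speed = c(10, 20))

# 80% interval for the MEAN response at each speed.
ci_mean <- predict(fit, newdata = new_speed, interval = "confidence", level = 0.80)
# 80% interval for a single NEW observation at each speed.
pi_new  <- predict(fit, newdata = new_speed, interval = "prediction", level = 0.80)

ci_width <- ci_mean[, "upr"] - ci_mean[, "lwr"]
pi_width <- pi_new[, "upr"] - pi_new[, "lwr"]
print(cbind(speed = new_speed$speed, ci_width, pi_width))
```

Both calls return the same fitted values; only the width of the interval changes, because a prediction interval also accounts for the scatter of individual observations around the line.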

The result is reasonable: Vietnam’s actual happiness index is 5.5 and the figure for the USA is 7.0, and both lie within the corresponding prediction intervals. The results are illustrated in the graph below:

ggplot(GDP_no_outliers, aes(y=Happiness_index, x=Corruption_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='lm', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and corruption index', 
       y="Happiness index", x='Corruption index',
       subtitle='Happiness index: scale of 10 | Corruption index: scale of 100', caption='OSTA 2023 - Group 2') +
  #annotate() draws each interval and label once, instead of once per data row
  annotate("segment", x = 42, xend = 42, y = prediction_happiness[1,2], yend = prediction_happiness[1,3], linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 42, y = prediction_happiness[1,3]+0.35, label = "Vietnam", size = 5) +
  annotate("segment", x = 69, xend = 69, y = prediction_happiness[2,2], yend = prediction_happiness[2,3], linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 69, y = prediction_happiness[2,3]+0.25, label = "the USA", size = 5)

## `geom_smooth()` using formula = 'y ~ x'

In conclusion, there is a clear positive relationship between the corruption index and the happiness index. More specifically, countries with higher CPI scores tend to have higher happiness scores; in other words, citizens of nations with more transparent political systems and better control of corruption tend to experience a better quality of life. These results are consistent with the paper “The Most Influential Factors in Determining the Happiness of Nations” by Julie Lang (University of Northern Iowa). According to that study, the level of corruption plays a noticeable role (Appendix) in determining the life satisfaction of a country’s citizens: the better the control of corruption, the higher the life satisfaction index.

2. Relationship between Happiness index and GDP per capita

GDP per capita: an economic metric that expresses a country’s economic output per person, calculated by dividing the country’s total GDP by its total population. Economists often use this measure to gauge the prosperity of a nation.

#Remove GDP-per-capita outliers (1.5*IQR rule)
#Note: the result GDP_NO_outliers differs from GDP_no_outliers only in letter case
IQR2 <- IQR(GDP_no_outliers$GDP_per_capita)
Lower_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs = 0.75) + 1.5*IQR2

GDP_NO_outliers <- subset(GDP_no_outliers, GDP_per_capita > Lower_limit2 & GDP_per_capita < Upper_limit2)

Here, GDP per capita is the independent variable and the happiness index is the dependent variable.

First of all, we take a general look at the relationship between the happiness index and GDP per capita:

ggplot(GDP_NO_outliers, aes(x=GDP_per_capita, y=Happiness_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='loess', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and GDP per capita', 
       y="Happiness index", x='GDP per capita',
       subtitle='Happiness index: scale of 10 | GDP per capita: thousand USD', caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

The graph above shows a clear relationship between the happiness index and GDP per capita, and this relationship appears to be non-linear. More specifically, the curve resembles the graph of the function \(y = m + \log(x)\ (x > 0)\); therefore, we assume that the relationship between \(y\) = happiness index and \(x\) = GDP per capita has the form \(y_i = \beta_0 + \beta_1 \log(x_i)\). It is then necessary to estimate \(\beta_0\) and \(\beta_1\) and to assess how strong this relationship is.

model_non <- nls(Happiness_index ~ b + a*log(GDP_per_capita), 
             data = GDP_NO_outliers, start = list(a=1, b=1))
summary(model_non)

## 
## Formula: Happiness_index ~ b + a * log(GDP_per_capita)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  0.54147    0.06954   7.786 7.07e-10 ***
## b  4.42452    0.16106  27.472  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6151 on 45 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.205e-08
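
Because the model \(y_i = \beta_0 + \beta_1 \log(x_i)\) is linear in its parameters, the same estimates can also be obtained with `lm()` on the log-transformed predictor, which additionally reports \(R^2\). The sketch below demonstrates this on simulated data; the data, the seed and the "true" coefficients are assumptions made for the example, not values from the GDP dataset.

```r
# Simulated stand-in data (NOT the GDP dataset).
set.seed(1)
gdp <- runif(60, min = 0.5, max = 60)                      # thousand USD
happiness <- 4.4 + 0.54 * log(gdp) + rnorm(60, sd = 0.6)
sim <- data.frame(gdp, happiness)

# Same model form as in the report, fitted by non-linear least squares.
fit_nls <- nls(happiness ~ b + a * log(gdp), data = sim,
               start = list(a = 1, b = 1))

# Equivalent ordinary least squares fit on log(gdp).
fit_lm <- lm(happiness ~ log(gdp), data = sim)

print(coef(fit_nls))                  # a = slope, b = intercept
print(coef(fit_lm))                   # (Intercept), log(gdp)
r_squared <- summary(fit_lm)$r.squared
print(r_squared)
```

The `lm()` route is a convenient cross-check, since `summary()` of an `nls` fit does not report \(R^2\).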

summary_table_non <- summary(model_non)$parameters
rownames(summary_table_non) <- c('Beta 1 (a)', 'Beta 0 (b)')
colnames(summary_table_non) <- c('Estimate', 'Standard Error', 't-value', 'p-value')
kable(summary_table_non, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Beta 1 (a)	0.5415	0.0695	7.7864	0
Beta 0 (b)	4.4245	0.1611	27.4717	0

From the table above, \(\beta_0 = 4.4245\) and \(\beta_1 = 0.5415\). Therefore, the fitted model is \(y_i = 4.4245 + 0.5415\log(x_i)\), which represents a positive relationship. The achieved convergence tolerance is very small (\(1.205 \times 10^{-8}\)), which shows that the fitting algorithm converged; the quality of the fit itself is reflected in the residual standard error (0.6151 on 45 degrees of freedom) and in the highly significant slope estimate.

In conclusion, countries with higher GDP per capita (in other words, more prosperous countries) tend to provide better welfare for their citizens, who in turn tend to be more content with their lives. In addition, because the relationship follows a curve similar to the graph of \(y = \log(x)\), countries with lower GDP per capita gain more happiness from the same absolute increase in GDP per capita. This result is broadly in line with the article “Beyond GDP: Economics and Happiness” in the Berkeley Economic Review (a non-profit publication of the University of California that aims to foster undergraduate writing and research on economic issues). According to this publication, there is a positive relationship between GDP per capita and the happiness index, and a 1% change in GDP per capita is associated with about a 0.3 unit change in happiness.
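
The diminishing-returns pattern can be illustrated directly from the fitted slope, because under the model the change in predicted happiness depends only on the ratio of the two GDP values. The sketch below assumes GDP per capita is measured in thousand USD (as in the plot subtitle) and uses the natural logarithm, as R's `log()` does; the function name and the chosen GDP levels are hypothetical.

```r
# Slope copied from the fitted model y = 4.4245 + 0.5415 * log(x).
beta1_log <- 0.5415

# Hypothetical helper: change in predicted happiness when GDP per capita
# moves from `from` to `to` (the intercept cancels out).
happiness_gain <- function(from, to) beta1_log * log(to / from)

gain_poor <- happiness_gain(1, 2)      # +1 thousand USD for a poor country
gain_rich <- happiness_gain(40, 41)    # +1 thousand USD for a rich country
gain_1pct <- happiness_gain(100, 101)  # any 1% increase gives the same gain

print(round(c(poor = gain_poor, rich = gain_rich, one_percent = gain_1pct), 4))
```

The same absolute increase in GDP per capita yields a much larger predicted gain at the low end of the scale than at the high end.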

OSTA Project 2023

Group 2 - BFA/BBA2022