I. INTRODUCTION TO DATA

Economic prosperity has traditionally been treated as the main indicator of a country's success. Societal well-being, however, is receiving growing attention alongside economic performance. Understanding which factors shape global happiness, and how strongly each one influences it, is both challenging and essential.

This statistical report examines the relationship between key indicators and the Happiness index across 150 countries on 4 continents in 2022.

The study aims to identify the main factors contributing to nations’ happiness levels by looking at economic factors (GDP, GDP per capita, GDP growth rate, and Inflation rate) and social metrics (Population, Corruption index, and Unemployment rate).

Because the dataset covers nations with varied economic structures, cultural backgrounds, and governance models, it should reveal patterns and correlations among the investigated variables. These in turn make it possible to predict a country's Happiness index and to identify the variables that contribute most.

The dataset describes each country through economic and socio-cultural indicators. Each entry represents one country and includes the following information:

Rank: The ranking of the country based on GDP.
Country: The name of the country.
GDP (Gross Domestic Product): The country’s total economic output, which indicates the size of its economy.
Population: The total number of people living in the country.
Inflation Rate: The percentage increase in the general price level of goods and services over one year.
GDP Growth Rate: The rate at which the country’s economy expands or contracts over a specific period.
Happiness Index: A measure of the subjective well-being and happiness of the country’s citizens.
GDP per Capita: The GDP divided by the population, providing an average economic output per person.
Continent: The continent to which the country belongs.
Corruption Index: A measure of perceived corruption in the country, though the specific index used is not defined in the provided dataset.
Unemployment Rate: The percentage of the labor force that is unemployed and actively seeking employment.

#build Dataset
group <- GDP$Continent
subgroup <- GDP$Country
value <- GDP$GDP
data <- data.frame(group,subgroup,value)

#box plot for GDP per capita
ggplot(data = GDP, aes(x = reorder(Continent, -GDP_per_capita, FUN = median), y = GDP_per_capita, color = Continent)) +
  geom_line(linewidth = 0.8) +
  geom_boxplot(position = position_dodge(width = 5), outlier.shape = NA, linewidth = 0.8) +
  labs(title = "Average GDP per capita of the four Continents",
       x = "Continent",
       y = "GDP per capita") +
  theme_bw()

Description:

Most countries have a GDP per capita below 50 (thousand USD). Europe’s figures are the highest, while Africa’s are by far the lowest.
Europe has the highest lower and upper bounds, with a median (the centre line of the box) of about 25. Although America’s upper bound is lower than Asia’s, its median is higher, at around 10. This reflects the much wider spread of the Asian figures, roughly 5 to 30, compared with around 10 to 20 in America.

#histogram for happiness index
GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.25,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution",
       x = "Happiness index (bin width = 0.25)",
       y = "Frequency")

Description:

The overall distribution of the Happiness index is roughly symmetric.
Values from 5.8 to 6.2 are the most frequent, peaking at a count of 14 around 6.0. Values around 5.0 are also common, with a count of 13.
There is a gap in the data above the lowest value, 2.4, which should be treated as an outlier.

GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.5,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution by Continents",
       x = "Happiness index (bin width = 0.5)",
       y = "Frequency")+
  facet_wrap(~Continent)

Description:

The picture changes moderately when the data are split by continent, using a bin width of 0.5.
The peaks for America and Asia sit at around 6, roughly where the overall histogram peaks. Most of Africa's recorded values, by contrast, are below 5.5. Europe shows high counts at 7 to 7.5 in addition to a similar peak at around 6.
This suggests that Europeans are somewhat happier, while Africa’s happiness levels are comparatively low.

#bar chart for Corruption index
mean_values <- aggregate(`Corruption_index` ~ Continent, data = GDP, mean)

ggplot(mean_values, aes(x = Continent, y = `Corruption_index`, fill = Continent)) +
    geom_bar(stat = "identity", width = 0.7) +
    geom_text(aes(label = round(`Corruption_index`, 2)),
              position = position_nudge(y = 1), size = 3, color = "black") +
    labs(title = "Average Corruption Rate by Continent",
         x = "Continent",
         y = "Average Corruption Rate") +
    theme_minimal()

Description:

The bar chart shows the average Corruption index for the four continents: Africa, America, Asia, and Europe.
Africa has the lowest average score at 32.62, followed by America at 40.11, Asia at 41.35, and Europe with the highest at 58.49.
If the index follows the Corruption Perceptions Index convention described in Section III (0 = highly corrupt, 100 = very clean), a higher score means less perceived corruption, so Africa has the highest perceived corruption on average and Europe the lowest.

II. ANOVA

1. Introduction:

Intuition tells us, or at least we tend to assume, “the wealthier, the merrier”. But does this hold in reality?

To answer this question, we test whether the mean Happiness Index is equal across the continents using one-factor analysis of variance (ANOVA). If the null hypothesis \(H_0\) is rejected, that is, if the means differ, we then apply the Tukey multiple comparisons procedure (Tukey’s HSD) to construct confidence intervals for the pairwise differences. The hypotheses are as follows:

\(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_A\): at least one pair of means differs

2. Formulas: (Appendix)

3. Detailed analysis:

Hypothesis testing at the 5% (\(\alpha = 0.05\)) level of significance: \(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_A\): at least one pair of means differs, where
\(\mu_1\): the mean Happiness Index for America (Continent)
\(\mu_2\): the mean Happiness Index for Europe
\(\mu_3\): the mean Happiness Index for Africa
\(\mu_4\): the mean Happiness Index for Asia

First, we draw a boxplot of the Happiness Index by continent. The R code is as follows:

#Take out a portion of dataset
SubsetGDP <- subset(GDP_sample, select = c("Continent", "Happiness_index"))

#Draw boxplot
par(mar = c(3, 4, 3, 1))
boxplot(SubsetGDP$Happiness_index ~ SubsetGDP$Continent, 
        xlab = "Continents", ylab = "Happiness_index",
        main = "Happiness Index in accordance to Continents")

Next, we compute the one-way ANOVA in R. The code is as follows:

#One-way ANOVA
oneway <- aov(Happiness_index ~ Continent, 
              data = SubsetGDP)
anova_table <- summary(oneway)
SumSq <- anova_table[[1]]$'Sum Sq'
Df <- anova_table[[1]]$'Df'
MeanSq <- anova_table[[1]]$'Mean Sq'
F_value <- anova_table[[1]]$'F value'
p_value <- anova_table[[1]]$'Pr(>F)'
table <- cbind(Df, SumSq, MeanSq, F_value, p_value)

# Display the ANOVA summary as a table
colnames(table) <- c('Degree of freedom', 'Sum squared', 'Mean squared', 'F-value', 'p-value')
rownames(table) <-c('Continent','Residuals')
kable(table, caption = "ANOVA Summary Table")

ANOVA Summary Table
	Degree of freedom	Sum squared	Mean squared	F-value	p-value
Continent	3	21.30352	7.101172	10.89693	1.58e-05
Residuals	46	29.97668	0.651667	NA	NA

The ANOVA table shows a p-value of 1.58e-05, far below the 5% significance level (and below 1%), so the null hypothesis \(H_0\) is rejected. We conclude that there is sufficient evidence that the mean Happiness Index differs for at least one pair of continents.
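As a cross-check, the F-value and p-value can be recomputed by hand from the sums of squares and degrees of freedom reported in the ANOVA table. The sketch below uses only those reported numbers; the variable names are illustrative and are not part of the original analysis.

```r
# Values copied from the ANOVA summary table
ss_continent <- 21.30352; df_continent <- 3
ss_residual  <- 29.97668; df_residual  <- 46

# Mean squares are sums of squares divided by their degrees of freedom
ms_continent <- ss_continent / df_continent
ms_residual  <- ss_residual / df_residual

# F statistic and its upper-tail p-value under the F(3, 46) distribution
f_stat <- ms_continent / ms_residual
p_val  <- pf(f_stat, df_continent, df_residual, lower.tail = FALSE)

round(c(MS_continent = ms_continent, MS_residual = ms_residual, F = f_stat), 4)
p_val
```

The recomputed F statistic should agree with the tabulated 10.897 up to rounding, since it is just the ratio of the two mean squares.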

4. Constructing confidence intervals:

We now examine the data further: for which pairs of continents do the means \(\mu_i\) differ, and by how much?
First, recall that in terms of GDP per capita, Europe is considered the richest, followed by Asia and America (Continent), while Africa is the poorest of the four.
Next, we run Tukey’s honestly significant difference test (also known as the Tukey multiple comparisons procedure) on this dataset. The R code is as follows:

#Tukey's HSD test
tukey <- TukeyHSD(oneway, conf.level=0.95)
table2 <- tukey$Continent
colnames(table2) <- c('Difference', 'Lower', 'Upper', 'p-value')
kable(table2, caption="Tukey's HSD Table")

Tukey’s HSD Table
	Difference	Lower	Upper	p-value
America-Africa	1.0000000	0.0511693	1.9488307	0.0353294
Asia-Africa	1.0187500	0.1970385	1.8404615	0.0096522
Europe-Africa	1.8461538	0.9847662	2.7075415	0.0000046
Asia-America	0.0187500	-0.8778107	0.9153107	0.9999369
Europe-America	0.8461538	-0.0869057	1.7792134	0.0879329
Europe-Asia	0.8274038	0.0239549	1.6308528	0.0413377

The table alone makes the comparisons hard to read, so we also plot the intervals. The R code is as follows:

#Confidence level plotting
par(mar = c(5, 7, 3, 1))
plot(tukey, las = 2)

Overall, the 95% confidence intervals for the differences in mean Happiness Index between Asia and America (Continent) and between Europe and America (Continent) contain 0, while the intervals for the other four pairs do not.
For Asia and America, the estimated difference is close to zero (0.019), so the data give no evidence that the mean Happiness Indices of the two continents differ. The 95% confidence interval for the difference between Asia and America is:

\(\mu_4 - \mu_1 \in (-0.8778107, 0.9153107)\)

Similarly, there is insufficient evidence at the 5% level that the mean Happiness Indices of Europe and America differ, although the interval lies mostly above zero (p-value 0.088). The 95% confidence interval for the difference between Europe and America is:

\(\mu_2 - \mu_1 \in (-0.0869057, 1.7792134)\)

On the other hand, the most noticeable difference is between Europe and Africa. The 95% confidence interval for the difference between Europe and Africa is:

\(\mu_2 - \mu_3 \in (0.9847662, 2.7075415)\)

This result implies that people in Africa are generally considerably less happy than those in Europe. Given that the GDP per capita of Europe is approximately 20 times higher than that of Africa, leading to significantly higher standards of welfare and a much better living environment, the result is understandable.
Similarly, the 95% confidence intervals for the differences between America and Africa, between Asia and Africa, and between Europe and Asia are, in turn:

\(\mu_1 - \mu_3 \in (0.0511693, 1.9488307)\)

\(\mu_4 - \mu_3 \in (0.1970385, 1.8404615)\)

\(\mu_2 - \mu_4 \in (0.0239549, 1.6308528)\)
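The reading of the six intervals can also be done programmatically. The following sketch re-enters the bounds from the Tukey's HSD table by hand (the data frame `tukey_ci` and its column names are illustrative, not objects from the analysis) and flags the pairs whose interval contains zero.

```r
# Interval bounds copied from the Tukey's HSD table
tukey_ci <- data.frame(
  pair  = c("America-Africa", "Asia-Africa", "Europe-Africa",
            "Asia-America", "Europe-America", "Europe-Asia"),
  lower = c(0.0511693, 0.1970385, 0.9847662, -0.8778107, -0.0869057, 0.0239549),
  upper = c(1.9488307, 1.8404615, 2.7075415, 0.9153107, 1.7792134, 1.6308528)
)

# An interval containing 0 means no significant difference at the 5% level
tukey_ci$contains_zero <- tukey_ci$lower < 0 & tukey_ci$upper > 0

# The midpoint of each interval recovers the estimated difference
tukey_ci$midpoint <- (tukey_ci$lower + tukey_ci$upper) / 2

tukey_ci
```

Only the Asia-America and Europe-America rows are flagged, which matches the two pairs identified in the text.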

5. Conclusion:

All things considered, the analysis suggests that our initial intuition is plausible: as the wealth of a continent increases, so do the overall happiness and well-being of its population. In particular, Europe has the highest mean Happiness Index, followed by Asia and America, although the difference between Europe and America is not statistically significant at the 5% level. Africa has the lowest mean Happiness Index, significantly below each of the other three continents.

III. LINEAR & NON-LINEAR REGRESSION

1. Introduction:

The third method used in this report is linear regression, which builds a mathematical model of two variables so that the relationship between them can be investigated. First, we form the hypotheses to test whether there is any linear relationship between the two variables:

Two-sided hypothesis:

\(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\)

We investigate the relationship between the happiness index and the corruption index across the countries in the dataset, to test whether citizens of nations with less corruption (in both state and non-state organizations) experience a better quality of life. We also investigate how well this relationship holds for each continent.

2. Relationship between Happiness index and Corruption index:

Happiness index: The happiness index is measured through a large survey of each country's residents, on a scale from 0 to 10. It expresses how content the citizens of a nation are with their quality of life and with general problems related to their community.
Corruption index: The corruption index, or Corruption Perceptions Index (CPI), ranks countries and territories worldwide by their perceived levels of public sector corruption, with scores ranging from 0 (highly corrupt) to 100 (very clean).

Here, the corruption index is the independent variable and the happiness index is the dependent variable.

First, we set up a two-sided hypothesis test to obtain an overview of this relationship:

\(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\)

where \(\beta_1\) is the slope parameter.

Next, we need the p-value of this test, which gives a first indication of the relationship between the happiness index and the corruption index. We use the following code to obtain it:

model1 <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers) #Intercept parameter & Slope parameter
summary_table <- summary(model1)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Slope parameter')
summary(model1)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index, data = GDP_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8240 -0.4516  0.1294  0.5073  1.0067 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.800112   0.236255  16.085  < 2e-16 ***
## Corruption_index 0.042663   0.005177   8.241 9.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6652 on 48 degrees of freedom
## Multiple R-squared:  0.5859, Adjusted R-squared:  0.5772 
## F-statistic: 67.91 on 1 and 48 DF,  p-value: 9.56e-11

kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	3.8001	0.2363	16.0848	0
Slope parameter	0.0427	0.0052	8.2405	0

The output shows that the p-value for the slope is very small (9.56e-11), so the null hypothesis \(H_0\) is implausible; in other words, the slope parameter is non-zero. There is therefore a statistically significant linear relationship between the happiness index and the corruption index.
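To make the link between the coefficient table and the p-value explicit, the t-value and two-sided p-value for the slope can be recomputed from the reported estimate, standard error, and residual degrees of freedom. This is a sketch using numbers copied from the `summary(model1)` output; the variable names are illustrative.

```r
# Slope estimate, standard error and residual df from summary(model1)
estimate  <- 0.042663
std_error <- 0.005177
df_resid  <- 48

# t statistic for H0: beta_1 = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution with 48 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# In simple linear regression the overall F statistic equals t squared
f_value <- t_value^2

c(t = t_value, F = f_value)
p_value
```

The recomputed values should agree with the reported t-value of 8.241 and F-statistic of 67.91 up to rounding of the inputs.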

However, we need to investigate further to understand how strong this relationship is and how the model can be used to forecast the happiness index when a country's corruption score is known.

ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
  labs(
    title = 'Relationship between happiness index and corruption index',
    y = "Happiness index", x = 'Corruption index',
    subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100'
  ) +
  theme_minimal()

The graph shows a clear positive association between the corruption index and the happiness index, and a linear relationship between the two indices appears reasonable. In particular, \(R^2 = 0.5859\) from the output above means that the corruption index explains about 58.6% of the variation in the happiness index, a moderately strong relationship.

We can now write down the fitted model. From the table, the estimated slope is \(\beta_1 = 0.0427\) and the estimated intercept is \(\beta_0 = 3.8001\). The simple linear regression model is therefore \(y_i = 3.8001 + 0.0427 x_i\); the data values \((x_i, y_i)\) lie closer to this line as the error variance decreases.

In addition, we can compute a prediction interval for the happiness index at a particular value of the corruption index. The following code finds a two-sided prediction interval at the 80% level.

As examples, we take corruption indexes of 42 (Vietnam’s corruption index) and 69 (the corruption index of the USA). The code and its result are:

model_prediction <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers)
new_corruption_index <- data.frame(Corruption_index = c(42,69))
prediction_happiness <- predict(model_prediction, interval = "prediction", newdata = new_corruption_index, level = 0.80)

Corruption_index = new_corruption_index$Corruption_index
Predicted_happiness = prediction_happiness[, 1]
Lower_CI = prediction_happiness[, 2]
Upper_CI = prediction_happiness[, 3]

summary_happiness <- cbind(Corruption_index, Lower_CI, Predicted_happiness, Upper_CI)
colnames(summary_happiness) <- c('Corruption index selected', 'Lower', 'Fit', 'Upper')
rownames(summary_happiness) <- c('Viet Nam', 'the USA')
kable(summary_happiness, caption='Confidence Interval of the Country Selected', digits = c(1,4,4,4), align = 'cccc')

Confidence Interval of the Country Selected
	Corruption index selected	Lower	Fit	Upper
Viet Nam	42	4.7190	5.5920	6.4649
the USA	69	5.8521	6.7439	7.6357
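The interval returned by `predict(..., interval = "prediction")` follows the standard formula: fitted value plus or minus a t quantile times s times the square root of (1 + 1/n + (x_new - mean(x))^2 / Sxx). The sketch below illustrates this on simulated data, because the report's data are not reproduced here; the simulated coefficients, the seed, and all variable names are assumptions chosen only to resemble the fitted model.

```r
# Simulated data loosely resembling the fitted model (illustrative only)
set.seed(42)
n <- 50
x <- runif(n, 10, 90)
y <- 3.8 + 0.043 * x + rnorm(n, sd = 0.65)
fit <- lm(y ~ x)

x_new <- 42
level <- 0.80

# Ingredients of the prediction interval
s       <- summary(fit)$sigma                      # residual standard error
t_crit  <- qt(1 - (1 - level) / 2, df = n - 2)     # two-sided t quantile
sxx     <- sum((x - mean(x))^2)
se_pred <- s * sqrt(1 + 1 / n + (x_new - mean(x))^2 / sxx)
y_hat   <- sum(coef(fit) * c(1, x_new))

manual  <- c(fit = y_hat, lwr = y_hat - t_crit * se_pred, upr = y_hat + t_crit * se_pred)
builtin <- predict(fit, newdata = data.frame(x = x_new),
                   interval = "prediction", level = level)

manual
builtin
```

The two results coincide, which shows that the "1 +" term under the square root is what makes a prediction interval for a single country wider than a confidence interval for the mean response.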

The result is sensible: Vietnam’s actual happiness index is 5.5 and the figure for the USA is 7.0, and both values fall inside the corresponding 80% prediction intervals. The results are illustrated in the graph below:

ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
  labs(title = 'Relationship between happiness index and corruption index',
    y = "Happiness index", x = 'Corruption index',
    subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100') +
  #annotate() draws each interval and label once instead of once per data row
  annotate("segment", x = 42, xend = 42, y = prediction_happiness[1, 2], yend = prediction_happiness[1, 3],
    linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 42, y = prediction_happiness[1, 3] + 0.35, label = "Vietnam", size = 5) +
  annotate("segment", x = 69, xend = 69, y = prediction_happiness[2, 2], yend = prediction_happiness[2, 3],
    linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 69, y = prediction_happiness[2, 3] + 0.25, label = "the USA", size = 5)

In conclusion, there is a clear positive relationship between the corruption index and the happiness index. More specifically, countries with higher CPI scores tend to have higher happiness scores; in other words, citizens of nations with more transparent political systems and less corruption tend to experience a better quality of life. These results are consistent with the paper “The Most Influential Factors in Determining the Happiness of Nations” by Julie Lang (University of Northern Iowa). According to that study, corruption plays a noticeable role (Appendix) in determining citizens' life satisfaction: the better the control of corruption, the higher the life satisfaction index.

3. Relationship between Happiness index and GDP per capita:

GDP per capita: an economic metric that breaks down a country’s economic output per person, calculated by dividing the country's total GDP by its total population. Economists often use it to gauge the prosperity of a nation.

IQR2 <- IQR(GDP_no_outliers$GDP_per_capita)
Lower_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs =0.75) + 1.5*IQR2

GDP_NO_outliers <- subset(GDP_no_outliers, GDP_per_capita>Lower_limit2 & GDP_per_capita<Upper_limit2)

Here, GDP per capita is the independent variable and the happiness index is the dependent variable.

First, we take a general look at the relationship between the happiness index and GDP per capita:

ggplot(GDP_NO_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
  geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
  geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
  labs(
    title = 'Relationship between happiness index and GDP per capita',
    y = 'Happiness index', x = 'GDP per capita',
    subtitle = 'Happiness index: scale of 10 | GDP per capita: thousand USD')

The graph shows a clear association between the happiness index and GDP per capita, and the pattern appears non-linear. More specifically, the point cloud resembles the graph of the function \(y = m + \log(x)\ (x > 0)\); we therefore assume that the relationship between y = happiness index and x = GDP per capita has the form \(y_i = \beta_0 + \beta_1 \log(x_i)\). It is then necessary to estimate \(\beta_0\) and \(\beta_1\) and to assess how strong this relationship is.

model_non <- nls(Happiness_index ~ b + a*log(GDP_per_capita), 
             data = GDP_NO_outliers, start = list(a=1, b=1))
summary(model_non)

## 
## Formula: Happiness_index ~ b + a * log(GDP_per_capita)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  0.54147    0.06954   7.786 7.07e-10 ***
## b  4.42452    0.16106  27.472  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6151 on 45 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.205e-08

summary_table_non <- summary(model_non)$parameters
rownames(summary_table_non) <- c('Beta 1 (a)', 'Beta 0 (b)')
colnames(summary_table_non) <- c('Estimate', 'Standard Error', 't-value', 'p-value')
kable(summary_table_non, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Beta 1 (a)	0.5415	0.0695	7.7864	0
Beta 0 (b)	4.4245	0.1611	27.4717	0

From the table above, \(\beta_0 = 4.4245\) and \(\beta_1 = 0.5415\). The fitted model is therefore \(y_i = 4.4245 + 0.5415 \log(x_i)\), which represents a positive relationship. Both parameters are highly significant (p-values below 1e-9). The achieved convergence tolerance is very small (about 1.2e-08), which shows that the fitting algorithm converged; it is not by itself a measure of how well the model fits the data. The residual standard error of 0.6151 gives the typical size of the deviations from the fitted curve.
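Because the model \(y_i = \beta_0 + \beta_1 \log(x_i)\) is linear in its parameters, an ordinary `lm()` fit on `log(x)` should give the same estimates as `nls()`, which is consistent with `nls()` converging after a single iteration. The sketch below illustrates this on simulated data; the data, the seed, and the variable names are assumptions and not the report's dataset.

```r
# Simulated data with a logarithmic relationship (illustrative only)
set.seed(1)
x <- runif(40, 0.5, 60)
y <- 4.4 + 0.54 * log(x) + rnorm(40, sd = 0.6)
d <- data.frame(x = x, y = y)

# Non-linear least squares, in the same form as the report's model
fit_nls <- nls(y ~ b + a * log(x), data = d, start = list(a = 1, b = 1))

# Ordinary least squares on the log-transformed predictor
fit_lm <- lm(y ~ log(x), data = d)

coef(fit_nls)   # a = slope, b = intercept
coef(fit_lm)    # (Intercept), log(x)
```

One practical advantage of the `lm()` form is that `summary()` then also reports \(R^2\), which `nls()` does not provide.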

In conclusion, countries with higher GDP per capita (in other words, more prosperous countries) tend to have citizens who are more content with their lives, plausibly because such countries can provide better welfare. Moreover, because the relationship follows a curve similar to the graph of \(y = \log(x)\), countries with lower GDP per capita gain more happiness from the same absolute increase in GDP per capita. This result broadly supports the article “Beyond GDP: Economics and Happiness” in the Berkeley Economic Review (a non-profit publication of the University of California that aims to foster undergraduate writing and research on economic issues). According to that article, there is a positive relationship between GDP per capita and the happiness index, and a 1% change in GDP per capita is associated with a change of about 0.3 units in happiness.
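The diminishing-returns property of the logarithmic model can be illustrated with the fitted coefficients. The sketch below compares the predicted gain from one extra thousand USD of GDP per capita for a poorer and a richer country; the function name and the two example income levels (2 and 40 thousand USD) are hypothetical choices for illustration.

```r
# Fitted coefficients from the summary table (natural logarithm)
beta0 <- 4.4245
beta1 <- 0.5415

# Predicted happiness index for a given GDP per capita (thousand USD)
predict_happiness <- function(gdp_pc) beta0 + beta1 * log(gdp_pc)

# Gain from +1 thousand USD at a low and at a high income level
gain_low  <- predict_happiness(3)  - predict_happiness(2)
gain_high <- predict_happiness(41) - predict_happiness(40)

c(gain_low = gain_low, gain_high = gain_high)
```

Under this model the gain equals \(\beta_1 \log(x_{new} / x_{old})\), so it depends only on the ratio of the two income levels, which is why the same absolute increase matters much more at low incomes.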

FURTHER ANALYSIS

#Calculate to eliminate outliers
IQR1 <- IQR(GDP_sample$Happiness_index)
Lower_limit1 <- quantile(GDP_sample$Happiness_index, probs = 0.25) - 1.5*IQR1
Upper_limit1 <- quantile(GDP_sample$Happiness_index, probs =0.75) + 1.5*IQR1

IQR2 <- IQR(GDP_sample$Corruption_index)
Lower_limit2 <- quantile(GDP_sample$Corruption_index, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_sample$Corruption_index, probs =0.75) + 1.5*IQR2

IQR3 <- IQR(GDP_sample$Inflation_rate)
Lower_limit3 <- quantile(GDP_sample$Inflation_rate, probs = 0.25) - 1.5*IQR3
Upper_limit3 <- quantile(GDP_sample$Inflation_rate, probs =0.75) + 1.5*IQR3

IQR4 <- IQR(GDP_sample$GDP_per_capita)
Lower_limit4 <- quantile(GDP_sample$GDP_per_capita, probs = 0.25) - 1.5*IQR4
Upper_limit4 <- quantile(GDP_sample$GDP_per_capita, probs =0.75) + 1.5*IQR4

IQR5 <- IQR(GDP_sample$Unemployment)
Lower_limit5 <- quantile(GDP_sample$Unemployment, probs = 0.25) - 1.5*IQR5
Upper_limit5 <- quantile(GDP_sample$Unemployment, probs =0.75) + 1.5*IQR5

#build dataset
GDP_5_stats <- subset(GDP_sample,
                      select=c("Happiness_index","Corruption_index",
                               "Inflation_rate","GDP_per_capita",
                               "Unemployment","Continent"))

GDP_no_outliers <- GDP_5_stats %>%
  filter(Happiness_index>Lower_limit1
         & Happiness_index<Upper_limit1) %>%
  filter(Inflation_rate>Lower_limit3
         & Inflation_rate<Upper_limit3) %>%
  filter(Corruption_index<Upper_limit2
         & Corruption_index>Lower_limit2) %>%
  filter(GDP_per_capita<Upper_limit4
         & GDP_per_capita>Lower_limit4) %>%
  filter(Unemployment<Upper_limit5
         & Unemployment>Lower_limit5)
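The five nearly identical IQR blocks above apply the same 1.5 × IQR rule to each variable. As a sketch of how the repetition could be factored out, the helper functions below compute the limits and the keep/drop flag for any numeric vector; `iqr_limits`, `inside_limits`, and the toy data frame are hypothetical names introduced here, not part of the original analysis.

```r
# Lower and upper limits of the 1.5 * IQR rule for one numeric vector
iqr_limits <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  spread <- q[2] - q[1]
  c(lower = q[1] - k * spread, upper = q[2] + k * spread)
}

# TRUE for values strictly inside the limits (same strict inequalities as above)
inside_limits <- function(x, k = 1.5) {
  lim <- iqr_limits(x, k)
  x > lim[["lower"]] & x < lim[["upper"]]
}

# Toy data: the value 100 in column a is an outlier
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))

# Keep only rows that are inside the limits for every listed column
keep <- Reduce(`&`, lapply(toy[c("a", "b")], inside_limits))
toy_clean <- toy[keep, ]
toy_clean
```

Note that this version, like the code above, computes every limit on the unfiltered data before any row is removed, so the order of the filters does not affect the result.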

1. Linear Regression

Null hypothesis: each of the following factors has no effect on happiness, i.e. its regression coefficient is zero.

#table test of multiple linear regression
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita + Unemployment,
            data=GDP_no_outliers)
summary(model)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index + Inflation_rate + 
##     GDP_per_capita + Unemployment, data = GDP_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.7925 -0.4147  0.1235  0.3673  0.8375 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.741614   0.320106  14.813 7.08e-16 ***
## Corruption_index  0.020847   0.007879   2.646  0.01253 *  
## Inflation_rate   -0.018497   0.017773  -1.041  0.30581    
## GDP_per_capita    0.025302   0.008305   3.047  0.00461 ** 
## Unemployment     -0.024810   0.024752  -1.002  0.32369    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5096 on 32 degrees of freedom
## Multiple R-squared:  0.7073, Adjusted R-squared:  0.6707 
## F-statistic: 19.33 on 4 and 32 DF,  p-value: 3.582e-08

summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Corruption index',
                             'Inflation rate','GDP per capita','Unemployment')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.7416	0.3201	14.8126	0.0000
Corruption index	0.0208	0.0079	2.6459	0.0125
Inflation rate	-0.0185	0.0178	-1.0407	0.3058
GDP per capita	0.0253	0.0083	3.0468	0.0046
Unemployment	-0.0248	0.0248	-1.0023	0.3237

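The p-values of Corruption index (0.0125) and GDP per capita (0.0046) are below 0.1 => reject the null hypothesis for these factors. The p-value of Unemployment is 0.32 > 0.1 and is the largest in the table => fail to reject.

=> Unemployment is not needed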

#multiple-linear regression without unemployment
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter','Corruption index',
                             'Inflation rate','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.6753	0.3132	14.9266	0.0000
Corruption index	0.0198	0.0078	2.5316	0.0163
Inflation rate	-0.0245	0.0167	-1.4678	0.1516
GDP per capita	0.0264	0.0082	3.2026	0.0030

The p-values of Corruption index (0.0163) and GDP per capita (0.0030) are below 0.1 => reject. The p-value of Inflation rate is 0.15 > 0.1 => fail to reject.

=> Inflation rate is not needed

#multiple-linear regression without inflation rate
model <- lm(Happiness_index ~ Corruption_index + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter',
                             'Corruption index','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.4054	0.2578	17.0863	0.000
Corruption index	0.0195	0.0079	2.4633	0.019
GDP per capita	0.0292	0.0081	3.5880	0.001

All p-values are below 0.1 => reject the null hypothesis for both factors, and both estimates are positive.

Conclusion: Corruption index and GDP per capita are two driving factors.
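The manual procedure above (fit, drop the predictor with the largest p-value above 0.1, refit) is a form of backward elimination. A minimal sketch of the same loop is given below, assuming numeric predictors only and the 0.1 threshold used in the text; the function `backward_eliminate` and the simulated data are hypothetical and only illustrate the idea.

```r
# Backward elimination by p-value (assumes numeric predictors, so that
# coefficient row names match the predictor names)
backward_eliminate <- function(data, response, predictors, alpha = 0.1) {
  repeat {
    fit   <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    # p-values of the predictors, intercept row excluded
    p <- setNames(coefs[-1, "Pr(>|t|)"], rownames(coefs)[-1])
    if (max(p) <= alpha || length(predictors) == 1) return(fit)
    # Drop the least significant predictor and refit
    predictors <- setdiff(predictors, names(p)[which.max(p)])
  }
}

# Simulated data: x1 and x2 drive y, x3 is pure noise (illustrative only)
set.seed(123)
n <- 100
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 5 + 2 * toy$x1 + 1.5 * toy$x2 + rnorm(n)

final_fit <- backward_eliminate(toy, "y", c("x1", "x2", "x3"))
names(coef(final_fit))
```

Selecting variables by p-value in this way is simple but can overstate the significance of the retained predictors, so the final model is best read as exploratory.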

#correlation of the Happiness index with GDP per capita and with Corruption index
cor2 <- cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$GDP_per_capita)$estimate
cor4 <- cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$Corruption_index)$estimate

#round once and stack the two estimates into a one-column table
correlation_table <- rbind(round(cor2, 3), round(cor4, 3))

colnames(correlation_table) <- c('Correlation')
rownames(correlation_table) <- c('Happiness index & GDP per capita',
                                 'Happiness index & Corruption index')

kable(correlation_table,
      caption = "Correlation results of GDP per capita and Corruption index",
      digits = 3,
      align = 'c')

Correlation results of GDP per capita and Corruption index
	Correlation
Happiness index & GDP per capita	0.788
Happiness index & Corruption index	0.746

Both GDP per capita (0.788) and Corruption index (0.746) are strongly and positively correlated with the Happiness index.

#plot the graph to illustrate
ggplot(GDP_no_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
  geom_point(color = 'black', fill = 'orange', shape = 21, alpha = 1, size = 3, stroke = 0.9, 
    position = "jitter") +
  geom_smooth(method = "lm", formula = y ~ x, linewidth = 1.2, color = "blue", fill = "transparent") +
  labs(title = 'Relationship between GDP per Capita and Happiness index',
    y = 'Happiness index', x = 'GDP per Capita')

ggplot(GDP_no_outliers, aes(x = Corruption_index, y = Happiness_index)) +
  geom_point(color = 'black', fill = 'orange', shape = 21, alpha = 1, size = 3, stroke = 0.9, 
    position = "jitter") +
  geom_smooth(method = "lm", formula = y ~ x, linewidth = 1.2, color = "blue", fill = "transparent") +
  labs(title = 'Relationship between Corruption index and Happiness index',
    y = 'Happiness index', x = 'Corruption index')

#R square test to test the fitness of the model
Rsquared <- summary(model)$r.squared
print(Rsquared)

## [1] 0.6783591

67.84% of the variation in the Happiness index is accounted for by Corruption index and GDP per capita, which indicates a good fit.
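
To make the meaning of this figure concrete, the following sketch recomputes R-squared by hand as one minus the ratio of the residual sum of squares to the total sum of squares. It uses simulated data; the variables `x`, `y`, and `fit` are hypothetical and are not part of the report's dataset.

```r
# Illustrative only: simulated data, not the report's dataset
set.seed(2)
x <- runif(30, 0, 10)
y <- 1 + 0.5 * x + rnorm(30)
fit <- lm(y ~ x)

ss_res <- sum(residuals(fit)^2)    # variation left unexplained by the model
ss_tot <- sum((y - mean(y))^2)     # total variation in y around its mean
r2_manual <- 1 - ss_res / ss_tot   # share of variation explained

# Should agree with the value reported by summary()
print(c(manual = r2_manual, summary = summary(fit)$r.squared))
```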

Country_test <- subset(GDP_sample, Country == 'South Africa',
                       select=c("Country","Corruption_index",
                              "GDP_per_capita","Happiness_index"))
Country_test

## # A tibble: 1 × 4
##   Country      Corruption_index GDP_per_capita Happiness_index
##   <chr>                   <dbl>          <dbl>           <dbl>
## 1 South Africa               43            6.8             5.2

new_data <- data.frame(Corruption_index = Country_test[1,2],
                       GDP_per_capita = Country_test[1,3])

prediction_interval <- predict(model,
                               newdata = new_data,
                               interval = "prediction",
                               level = 0.9)
print(prediction_interval)

##        fit     lwr     upr
## 1 5.464745 4.61857 6.31092

Lower <- prediction_interval[1,2]
Upper <- prediction_interval[1,3]
Fit <- prediction_interval[1,1]
Real <- Country_test[1,4]
summary_table <- cbind(Lower, Fit, Upper,Real)
colnames(summary_table) <- c('Lower', 'Fit', 'Upper','Real')
rownames(summary_table) <- c('Chosen country')
kable(summary_table, caption="Confidence Interval of Predicted Nation",
      digits = c(4,4,4,4), align = 'cccc')

Confidence Interval of Predicted Nation
	Lower	Fit	Upper	Real
Chosen country	4.6186	5.4647	6.3109	5.2

For the chosen country (South Africa), the Happiness index predicted from GDP per capita and Corruption index is 5.4647, which is close to the actual value of 5.2. The actual value also lies inside the 90% prediction interval (4.6186 to 6.3109), so the prediction is good.

Confidence Interval of Predicted Nation
	Lower	Fit	Upper	Real	Manual fit
Chosen country	4.6186	5.4647	6.3109	5.2	5.4647

The manual result is exactly the same as the result returned by R's predict() function.
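
A manual check of this kind can be sketched as follows. The example is an assumption-laden illustration: it fits a model on simulated data (`toy`, `toy_model` are hypothetical names), and only the new observation (Corruption index 43, GDP per capita 6.8) and the 90% level are taken from the report. It reproduces both the fitted value and the prediction interval from the coefficients and the design matrix.

```r
# Illustrative only: simulated data standing in for the report's dataset
set.seed(3)
n <- 40
toy <- data.frame(
  Corruption_index = runif(n, 20, 90),
  GDP_per_capita   = runif(n, 1, 60)
)
toy$Happiness_index <- 4.4 + 0.02 * toy$Corruption_index +
  0.03 * toy$GDP_per_capita + rnorm(n, sd = 0.4)
toy_model <- lm(Happiness_index ~ Corruption_index + GDP_per_capita, data = toy)

# New observation with the South Africa values quoted in the report
new_obs <- data.frame(Corruption_index = 43, GDP_per_capita = 6.8)
x0 <- c(1, new_obs$Corruption_index, new_obs$GDP_per_capita)

# Manual fit: intercept + slope * value for each predictor
manual_fit <- sum(coef(toy_model) * x0)

# Manual 90% prediction interval: fit +/- t * sigma * sqrt(1 + x0' (X'X)^-1 x0)
X <- model.matrix(toy_model)
sigma_hat <- summary(toy_model)$sigma
se_pred <- sigma_hat * sqrt(1 + drop(t(x0) %*% solve(t(X) %*% X) %*% x0))
t_crit <- qt(0.95, df = df.residual(toy_model))
manual_interval <- manual_fit + c(-1, 1) * t_crit * se_pred

pred <- predict(toy_model, newdata = new_obs, interval = "prediction", level = 0.9)
print(rbind(manual  = c(manual_fit, manual_interval),
            predict = as.numeric(pred[1, ])))
```

The extra `1 +` inside the square root is what separates a prediction interval for a single country from a confidence interval for the mean response.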

#Confidence interval of the sample mean = 95% through the t.test

# Extract the 95% confidence bounds and the sample means from t.test()
lower_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$conf.int[1]
upper_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$conf.int[2]
lower_2 <- t.test(GDP_no_outliers$GDP_per_capita)$conf.int[1]
upper_2 <- t.test(GDP_no_outliers$GDP_per_capita)$conf.int[2]
lower_3 <- t.test(GDP_no_outliers$Inflation_rate)$conf.int[1]
upper_3 <- t.test(GDP_no_outliers$Inflation_rate)$conf.int[2]

fit_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$estimate
fit_2 <- t.test(GDP_no_outliers$GDP_per_capita)$estimate
fit_3 <- t.test(GDP_no_outliers$Inflation_rate)$estimate

A <- c(lower_1,fit_1,upper_1)
B <-  c(lower_2,fit_2,upper_2)
C <-  c(lower_3,fit_3,upper_3)

table_1 <- rbind(A,B,C)

rownames(table_1) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_1) <- c('Lower','Sample','Upper')
kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean with 95% confidence
	Lower	Sample	Upper
GDP growth rate	3.0759	4.1286	5.1813
GDP per capita	10.2692	15.7771	21.2851
Inflation rate	7.3731	8.7886	10.2040

#draw table
table_2 <- rbind(confidence_interval_1, confidence_interval_2, confidence_interval_3)
rownames(table_2) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_2) <- c('Lower','Upper')

kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean with 95% confidence
	Lower	Sample	Upper
GDP growth rate	3.0759	4.1286	5.1813
GDP per capita	10.2692	15.7771	21.2851
Inflation rate	7.3731	8.7886	10.2040

kable(table_2, caption="Confidence Interval of sample mean calculated manually",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean calculated manually
	Lower	Upper
GDP growth rate	3.2792	4.9780
GDP per capita	11.3328	20.2215
Inflation rate	7.6465	9.9307

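Comparing the two tables, the intervals from t.test() and from the manual calculation are centered on the same sample means, but the manual intervals are narrower by a factor of about 1.24 in every row. A constant ratio like this suggests that the manual calculation used a smaller critical value (for example a normal quantile or a lower confidence level) than the t quantile used by t.test(); it does not mean the manual result is more precise.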
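
The relationship between t.test() and a hand-computed interval can be sketched as below. This uses a simulated sample (`x` is hypothetical) and shows two manual variants: one with the t quantile, which should match t.test(), and one with a normal quantile, which is offered only as an example of how a narrower interval can arise and is not a claim about how the report's manual values were produced.

```r
# Illustrative only: simulated sample, not the report's dataset
set.seed(4)
x <- rnorm(32, mean = 4, sd = 3)
n <- length(x)
se <- sd(x) / sqrt(n)                 # standard error of the sample mean

# Manual 95% interval with the t quantile (n - 1 degrees of freedom)
t_crit <- qt(0.975, df = n - 1)
manual_t_ci <- mean(x) + c(-1, 1) * t_crit * se

# Same formula with a normal quantile instead: always narrower for finite n
z_crit <- qnorm(0.975)
manual_z_ci <- mean(x) + c(-1, 1) * z_crit * se

tt <- t.test(x)
print(rbind(t_test   = as.numeric(tt$conf.int),
            manual_t = manual_t_ci,
            manual_z = manual_z_ci))
```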

OSTA Project 2023

Group 2 - BFA/BBA2022
