INTRODUCTION TO DATA

Traditionally, economic prosperity has been treated as the main indicator of a country’s success. Increasingly, however, societal well-being is receiving attention alongside economic growth. Understanding which factors shape global happiness, and how strongly each one influences it, is both challenging and essential.

We acknowledge the importance of life satisfaction and of the features that influence residents’ happiness. This statistical report examines the relationship between key indicators and the Happiness index across 150 countries on 4 continents.

This study aims to identify the main factors contributing to nations’ happiness levels by looking at the economic factors (GDP, GDP per capita, GDP growth rate, and Inflation rate) and social metrics (Population, Corruption index, and Unemployment rate).

The dataset covers nations with varied economic structures, cultural backgrounds, and governance models, and is expected to reveal patterns and correlations among the investigated factors. As a result, the Happiness index of a particular country can be predicted, and the variables that contribute most can be identified.

The dataset gives an overview of these countries through economic and socio-cultural indicators. Each entry represents a different country and includes the following information:

Rank: The ranking of the country based on GDP.

Country: The name of the country.

GDP (Gross Domestic Product): The country’s total economic output, which indicates the size of its economy.

Population: The total number of people living in the country.

Inflation Rate: The percentage increase in the general price level of goods and services over one year.

GDP Growth Rate: The rate at which the country’s economy expands or contracts over a specific period.

Happiness Index: A measure of the subjective well-being and happiness of the country’s citizens.

GDP per Capita: The GDP divided by the population, providing an average economic output per person.

Continent: The continent to which the country belongs.

Corruption Index: A measure of perceived corruption in the country, though the specific index used is not defined in the provided dataset.

Unemployment Rate: The percentage of the labor force that is unemployed and actively seeking employment.

#build Dataset
group <- GDP_sample$Continent
subgroup <- GDP_sample$Country
value <- GDP_sample$GDP
data <- data.frame(group,subgroup,value)

#treemap for GDP distribution
treemap(data,
        index=c("group","subgroup"),
        vSize="value",
        type="index",
        fontsize.labels = c(11,8),
        fontcolor.labels = c("#45474B","black"),
        fontface.labels = c(2,3),
        bg.labels = c("#F3EEEA"),
        align.labels = list(c("right","top"),
                            c("center","center")),
        #inflate.labels = F,
        border.col=c("black","white"),
        border.lwds=c(3,2),
        fontsize.title = 18,
        palette = "Set3",
        title = "GDP distribution around the world")

#boxplot for inflation rate
GDP %>%
  filter(Inflation_rate< 30) %>%
  filter(Inflation_rate>0) %>%
  ggplot(aes(x=reorder(Continent,-Inflation_rate,
                       FUN=median),
             y=Inflation_rate,
             color=Continent))+
  # linewidth replaces the size aesthetic for lines (ggplot2 >= 3.4.0)
  geom_line(linewidth=0.8)+
  geom_boxplot(position = position_dodge(width = 5),
               outlier.shape=NA,linewidth=0.8)+
  labs(title = "Inflation rate across the four Continents",
       x="Continent",
       y="Inflation rate")+
  theme_bw()

#histogram for happiness index
GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.2,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution")

GDP %>%
  ggplot(aes(x=Happiness_index)) +
  geom_histogram(binwidth=0.25,
                 fill="#B0A695",
                 color="#F3EEEA", alpha=0.9) +
  theme_minimal()+
  labs(title = "Happiness Index distribution")+
  facet_wrap(~Continent)

#boxplot for GDP growth rate
GDP %>%
  filter(GDP_growth_rate<20) %>%
  filter(GDP_growth_rate>0) %>%
  ggplot(aes(x=reorder(Continent,-GDP_growth_rate, FUN=median),
             y=GDP_growth_rate,
             color=Continent,
             group=Continent))+
  # linewidth replaces the size aesthetic for lines (ggplot2 >= 3.4.0)
  geom_line(linewidth=0.8)+
  geom_boxplot(position = position_dodge(width = 5),
               outlier.shape=NA,linewidth=0.8)+
  labs(title = "GDP growth rate across the four Continents",
       x="Continent",
       y="GDP growth rate")+
  theme_bw()

#pie chart for GDP per capita
continent_data <- aggregate(GDP$GDP_per_capita, by = list(Continent = GDP$Continent), FUN = sum)
continent_data$percentage <- continent_data$x / sum(continent_data$x)
ggplot(continent_data, aes(x = "", y = percentage, fill = Continent)) +
    geom_bar(stat = "identity", width = 1, color = "white") +
    coord_polar("y") +
    theme_void() +
    ggtitle("GDP per Capita Distribution by Continent") +
    geom_text(
        aes(label = scales::percent(percentage), fontface = "bold"),
        position = position_stack(vjust = 0.5),
        size = 3.5,
        check_overlap = TRUE,
        angle = 45
    ) +
    scale_fill_brewer(palette = "Set3") +
    theme(
        axis.text = element_blank(),
        axis.title = element_blank(),
        text = element_text(size = 8))

#bar chart for Population
total_population <- GDP %>%
    group_by(Continent) %>%
    summarize(Total_Population = sum(Population, na.rm = TRUE))
# Plot the per-continent totals directly: joining them back onto the
# country-level data and using stat = "identity" would stack each total
# once per country and inflate the bars.
ggplot(data = total_population, aes(x = reorder(Continent, -Total_Population), y = Total_Population / 1e6, fill = Continent)) +
    geom_col() +
    # Labels sit above the bars, so they need a dark colour to be visible
    geom_text(aes(label = sprintf("%.1f", Total_Population / 1e6)),
              vjust = -0.3, color = "black", size = 3, fontface = "bold") +
    labs(x = "Continent", y = "", fill = "Continent",
         title = "Total Population across Continents",
         subtitle = "Group 2",
         caption = "Note: Population values are in millions.") +
    theme_minimal() +
    theme(axis.text.y = element_blank(),
        axis.ticks.y = element_blank(),
        panel.grid = element_blank()) +
    scale_fill_brewer(palette = "Set1")

#line chart for Corruption rate
mean_values <- aggregate(`Corruption_index` ~ Continent, data = GDP, mean)
ggplot(GDP, aes(x = Continent, y = `Corruption_index`, group = 1)) +
    geom_line(stat = "summary", fun = "mean", color = "darkblue", linewidth = 1.5) +
    geom_text(data = mean_values, aes(label = round(`Corruption_index`, 2)),
              position = position_nudge(y = 1.5, x=-0.1), size = 3, color = "darkgreen") +
    geom_point(data = mean_values, aes(x = Continent, y = `Corruption_index`), 
               size = 3, color = "red") +
    labs(title = "Average Corruption Rate by Continent",
         x = "Continent",
         y = "Average Corruption Rate") +
    theme_minimal() +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))


#violin plot for Unemployment rate

library(ggplot2)

# Average unemployment rate per continent, used only for the labels
average_data <- aggregate(GDP$Unemployment, by = list(GDP$Continent), mean)
colnames(average_data) <- c("Continent", "Unemployment_avg")

ggplot(GDP, aes(x = Continent, y = Unemployment, fill = Continent)) +
    geom_violin() +
    # Draw one label per continent instead of one per country
    geom_text(data = average_data,
              aes(y = Unemployment_avg, label = round(Unemployment_avg, 1)),
              vjust = -0.5, color = "white", size = 3, fontface = "bold") +
    labs(title = "Distribution of Unemployment Rates by Continent",
        x = "Continent",
        y = "Unemployment Rate (%)") +
    theme_minimal()

ANOVA

Introduction:

We are often taught, or at least intuitively assume, that “the richer, the happier”. But does this hold in reality? To answer this question, we test whether the average Happiness Index is statistically equal across the continents concerned, using one-factor analysis of variance (ANOVA). If differences exist, in other words if the null hypothesis $H_0$ is rejected, we then apply the Tukey multiple comparisons procedure (Tukey’s HSD) to construct confidence intervals for these differences. The hypotheses are as follows:

$H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{4}$ versus $H_{A}$: at least one pair of means differs

Formulas: see Appendix.

Data analysis:
Hypothesis testing at a 10% ($\alpha = 0.1$) level of significance: $H_{0}: \mu_{1} = \mu_{2} = \mu_{3} = \mu_{4}$ versus $H_{A}$: at least one pair of means differs
$\mu_{1}$: the mean Happiness Index for America (the continent)
$\mu_{2}$: the mean Happiness Index for Europe
$\mu_{3}$: the mean Happiness Index for Africa
$\mu_{4}$: the mean Happiness Index for Asia

First, we draw a boxplot of the Happiness Index by continent and then fit the one-way ANOVA. The R code is as follows:

#Importing dataset
setwd("F:/VGU/ACADEMIC YEAR 1/OSTA") 
GDP <- read.csv("F:/VGU/ACADEMIC YEAR 1/OSTA/GDP.csv") #data input

#Take out a portion of dataset
SubsetGDP <- subset(GDP, select = c("Continent", "Happiness.index"))

#Draw boxplot
par(mar = c(3, 4, 3, 1))
boxplot(SubsetGDP$Happiness.index ~ SubsetGDP$Continent, 
        xlab = "Continents", ylab = "Happiness.Index",
        main = "Happiness Index in accordance to Continents")

#One-way ANOVA
oneway <- aov(Happiness.index ~ Continent, 
              data = SubsetGDP)
anova_table <- summary(oneway)

SumSq <- anova_table[[1]]$'Sum Sq'
Df <- anova_table[[1]]$'Df'
MeanSq <- anova_table[[1]]$'Mean Sq'
F_value <- anova_table[[1]]$'F value'
p_value <- anova_table[[1]]$'Pr(>F)'
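# Use a name that does not mask base::table()
anova_tbl <- cbind(Df, SumSq, MeanSq, F_value, p_value)
# Keep the term names so the rows of the printed table are labelled
rownames(anova_tbl) <- trimws(rownames(anova_table[[1]]))

# Display the ANOVA summary as a table
colnames(anova_tbl) <- c('Degree of freedom', 'Sum squared', 'Mean squared', 'F-value', 'p-value')
kable(anova_tbl, caption = "ANOVA Summary Table")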

ANOVA Summary Table
	Degree of freedom	Sum squared	Mean squared	F-value	p-value
Continent	3	76.73511	25.578369	41.91498	< 0.0001
Residuals	146	89.09563	0.610244	NA	NA

The ANOVA table above shows that the p-value is far smaller than the 10% significance level (indeed far smaller than 1%), so the null hypothesis $H_0$ is clearly rejected. There is therefore sufficient evidence that the mean Happiness Index differs for at least one pair of continents.
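As a cross-check, the F statistic and its p-value can be recomputed from the sums of squares and degrees of freedom printed in the ANOVA Summary Table. The sketch below hard-codes those printed values (the variable names are ours and do not come from the original script) and uses base R’s `pf()`; the result is subject to the rounding of the printed figures.

```r
# Values copied from the ANOVA Summary Table above
ss_between <- 76.73511   # sum of squares between continents
ss_within  <- 89.09563   # residual (within-continent) sum of squares
df_between <- 3          # k - 1, with k = 4 continents
df_within  <- 146        # n - k, with n = 150 countries

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic: ratio of the two mean squares
f_stat <- ms_between / ms_within

# Upper-tail probability of the F distribution
p_val <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

cat("F =", round(f_stat, 3), "\n")
cat("p-value below 0.10:", p_val < 0.10, "\n")
```

The p-value printed as 0 in the table is a display rounding of a very small positive number, which is why the comparison with 0.10 is reported here rather than the raw value.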

Constructing confidence intervals:

But for how many pairs of continents do the means $\mu_{i}$ differ, and by how much? To answer this question, we conduct Tukey’s honestly significant difference (HSD) test on the dataset. The R code is as follows:

#Tukey's HSD test
tukey <- TukeyHSD(oneway, conf.level=0.90)
table2 <- tukey$Continent
colnames(table2) <- c('Difference', 'Lower', 'Upper', 'p-value')
kable(table2, caption="Tukey's HSD Table")

Tukey’s HSD Table
	Difference	Lower	Upper	p-value
America-Africa	1.3732194	0.9210694	1.8253693	0.0000000
Asia-Africa	0.8862254	0.4868649	1.2855859	0.0000054
Europe-Africa	1.9037523	1.4997857	2.3077190	0.0000000
Asia-America	-0.4869940	-0.9304572	-0.0435307	0.0581966
Europe-America	0.5305330	0.0829172	0.9781487	0.0344875
Europe-Asia	1.0175269	0.6233073	1.4117465	0.0000001

However, the table is hard to read at a glance, which makes further comparisons difficult. To make the result more illustrative, we plot the intervals. The R code is as follows:

#Confidence level plotting
par(mar = c(5, 7, 3, 1))
plot(tukey, las = 2)

Overall, none of the six 90% confidence intervals for the pairwise differences contains 0, so at the 90% confidence level the mean Happiness Index differs between every pair of continents.
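The reading of the Tukey table can be made mechanical. The following sketch re-enters the printed table by hand (the data frame `tukey_tbl` and its column names are our own, introduced only for illustration), flags every interval that excludes 0, and sorts the pairs by the size of the estimated difference.

```r
# Values copied from the Tukey's HSD Table above
tukey_tbl <- data.frame(
  pair  = c("America-Africa", "Asia-Africa", "Europe-Africa",
            "Asia-America", "Europe-America", "Europe-Asia"),
  diff  = c(1.3732194, 0.8862254, 1.9037523, -0.4869940, 0.5305330, 1.0175269),
  lower = c(0.9210694, 0.4868649, 1.4997857, -0.9304572, 0.0829172, 0.6233073),
  upper = c(1.8253693, 1.2855859, 2.3077190, -0.0435307, 0.9781487, 1.4117465)
)

# An interval excludes 0 exactly when both ends have the same sign
tukey_tbl$excludes_zero <- tukey_tbl$lower * tukey_tbl$upper > 0

# Order the pairs from the largest to the smallest absolute difference
tukey_tbl <- tukey_tbl[order(-abs(tukey_tbl$diff)), ]
print(tukey_tbl, row.names = FALSE)
```

Sorting by absolute difference makes it easy to see which pairs of continents are furthest apart and which intervals come closest to 0.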

The most notable comparison is between Asia and America, the two richest continents when total GDP is used as the measure. The 90% confidence interval for the difference between Asia and America is:

$\mu_{4} - \mu_{1} \in (-0.93045722,\ -0.04353072)$

The result implies that people in Asia are, on average, less happy than those in America. One possible explanation is the large cultural gap between the two: Western (here American) perspectives on creativity tend to emphasize the traits of creative individuals, whereas Eastern (here Asian) concepts center more on social aspects such as teamwork and support from others (Zotero).

Conclusion:

All things considered, the analysis above suggests that our initial intuition is plausible: as the wealth of a continent increases, so do the overall happiness and well-being of its population. People in Europe are the happiest, followed by those in America. Asia shows a lower level of happiness among its citizens, and Africa has the lowest mean Happiness Index of the four continents.

LINEAR & NON-LINEAR REGRESSION

The third method used in this report is linear regression, which builds a mathematical model of two variables so that the relationship between them can be investigated further. First, we form the hypotheses to test whether there is any linear relationship between the two variables:

Two-sided hypothesis:

$H_0:\ \beta_1=0\ \ \ \text{versus}\ \ \ H_A:\ \beta_1\neq 0$

We then investigate the relationship between the happiness index and the corruption index across countries, to test whether people in nations with less corruption (in both state and non-state organizations) experience a better quality of life. We also investigate how well this relationship holds for each continent.

1. Relationship between Happiness index and Corruption index

Happiness index: The happiness index is measured on a scale from 0 to 10 by surveying the people of each country. It expresses how content the citizens of a nation are with their quality of life and with general issues in their community.
Corruption index: The corruption index, or Corruption Perceptions Index (CPI), ranks countries and territories worldwide by their perceived levels of public sector corruption, with scores ranging from 0 (highly corrupt) to 100 (very clean).

Here, the corruption index is the independent variable and the happiness index is the dependent variable.

First, we set up a two-sided hypothesis test to obtain an overview of this relationship:

$\ \ \ H_0: \beta_1 = 0\ \ \ \text{versus} \ \ \ H_A: \beta_1 \neq 0$

where $\beta_1$ is the slope parameter.

Next, we compute the p-value, which gives a first indication of the relationship between the happiness index and the corruption index. We use the following code:

model1 <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers) #Intercept parameter & Slope parameter
summary_table <- summary(model1)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Slope parameter')
summary(model1)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index, data = GDP_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.8240 -0.4516  0.1294  0.5073  1.0067 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      3.800112   0.236255  16.085  < 2e-16 ***
## Corruption_index 0.042663   0.005177   8.241 9.56e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6652 on 48 degrees of freedom
## Multiple R-squared:  0.5859, Adjusted R-squared:  0.5772 
## F-statistic: 67.91 on 1 and 48 DF,  p-value: 9.56e-11

kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	3.8001	0.2363	16.0848	0
Slope parameter	0.0427	0.0052	8.2405	0

The output above shows that the p-value is very small (about $9.56 \times 10^{-11}$), so the null hypothesis ($H_0$) is implausible; in other words, the slope parameter is non-zero. There is therefore strong evidence of a linear association between the happiness index and the corruption index.
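The test statistic behind this p-value is simply the slope estimate divided by its standard error, compared against a t distribution with the residual degrees of freedom. A minimal sketch, using only the numbers printed in the `summary(model1)` output (the variable names and the 90% level of the interval are our own choices for illustration):

```r
# Values copied from the summary(model1) output above
estimate  <- 0.042663   # slope estimate for Corruption_index
std_error <- 0.005177   # standard error of the slope
df_resid  <- 48         # residual degrees of freedom

# t statistic and two-sided p-value for H0: beta_1 = 0
t_value <- estimate / std_error
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

# 90% confidence interval for the slope (level chosen for illustration)
ci_90 <- estimate + c(-1, 1) * qt(0.95, df = df_resid) * std_error

cat("t =", round(t_value, 3), "\n")
cat("p-value below 0.10:", p_value < 0.10, "\n")
cat("90% CI for the slope:", round(ci_90, 4), "\n")
```

Because the printed estimate and standard error are rounded, the recomputed t value can differ from the reported one in the last digit.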

However, we need to investigate further how strong this relationship is and how the model can be used to forecast the happiness index of a country from its corruption score.

ggplot(GDP_no_outliers, aes(y=Happiness_index, x=Corruption_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='lm', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and corruption index', 
       y="Happiness index", x='Corruption index',
       subtitle='Happiness index: scale of 10 | Corruption index: scale of 100', caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

From the graph above, there is a clear positive correlation between the corruption index and the happiness index, and a linear relationship between the two appears reasonable. In particular, the output above gives $R^2=0.5859$, meaning that the corruption index explains about 59% of the variation in the happiness index, which indicates a relatively strong relationship.
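For a simple linear regression, $R^2$ is the square of the sample correlation coefficient, and the adjusted $R^2$ follows from $R^2$, the sample size and the number of predictors. The sketch below recomputes both from the printed output; the sample size of 50 is inferred from the 48 residual degrees of freedom, and the variable names are ours.

```r
r_squared <- 0.5859   # Multiple R-squared from summary(model1)
n <- 50               # 48 residual degrees of freedom + 2 estimated parameters
p <- 1                # number of predictors in the model

# With one predictor and a positive slope, r is the positive root of R^2
r <- sqrt(r_squared)

# Adjusted R-squared penalises the number of predictors
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)

cat("r =", round(r, 4), "\n")
cat("adjusted R-squared =", round(adj_r_squared, 4), "\n")
```

The recomputed correlation can be compared with the value reported for Happiness index and Corruption index in the correlation table at the end of the report, and the adjusted value with the one in the regression output.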

Next, we write down the fitted model. The results above give the slope parameter $\beta_1$ and the intercept parameter $\beta_0$: from the table, $\beta_1 = 0.0427$ and $\beta_0 = 3.8001$. Therefore, the simple linear regression model is $y_i=3.8001\ +\ 0.0427*x_i$, and the data values $(x_i, y_i)$ lie closer to the line $y=3.8001\ +\ 0.0427*x$ as the error variance decreases.

In addition, we can compute a prediction interval for the happiness index at a particular value of the corruption index. We find two-sided prediction intervals at the 80% level.

As an example, we take two corruption index values: 42 (Vietnam’s corruption index) and 69 (the corruption index of the USA). The code and its result are as follows:

model_prediction <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers)
new_corruption_index <- data.frame(Corruption_index = c(42,69))
prediction_happiness <- predict(model_prediction, interval = "prediction", newdata = new_corruption_index, level = 0.80)

Corruption_index = new_corruption_index$Corruption_index
Predicted_happiness = prediction_happiness[, 1]
Lower_CI = prediction_happiness[, 2]
Upper_CI = prediction_happiness[, 3]

summary_happiness <- cbind(Corruption_index, Lower_CI, Predicted_happiness, Upper_CI)
colnames(summary_happiness) <- c('Corruption index selected', 'Lower', 'Fit', 'Upper')
rownames(summary_happiness) <- c('Viet Nam', 'the USA')
kable(summary_happiness, caption='80% Prediction Interval of the Country Selected', digits = c(1,4,4,4), align = 'cccc')

80% Prediction Interval of the Country Selected
	Corruption index selected	Lower	Fit	Upper
Viet Nam	42	4.7190	5.5920	6.4649
the USA	69	5.8521	6.7439	7.6357

The result is plausible: Vietnam’s actual happiness index is 5.5 points and that of the USA is 7.0 points, and both values lie inside the corresponding intervals. The results are illustrated in the graph below:

ggplot(GDP_no_outliers, aes(y=Happiness_index, x=Corruption_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='lm', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and corruption index', 
       y="Happiness index", x='Corruption index',
       subtitle='Happiness index: scale of 10 | Corruption index: scale of 100', caption='OSTA 2023 - Group 2') +
  # annotate() draws each layer once; constants inside aes() are redrawn for every row
  annotate("segment", x = 42, xend = 42, y = prediction_happiness[1,2], yend = prediction_happiness[1,3], linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 42, y = prediction_happiness[1,3]+0.35, label = "Vietnam", size=5) +
  annotate("segment", x = 69, xend = 69, y = prediction_happiness[2,2], yend = prediction_happiness[2,3], linetype = "solid", color = "red", linewidth = 1.5) +
  annotate("text", x = 69, y = prediction_happiness[2,3]+0.25, label = "the USA", size=5)

## `geom_smooth()` using formula = 'y ~ x'
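The prediction intervals reported above come from `predict()` with `interval = "prediction"`. To show what that call computes, the sketch below rebuilds an 80% prediction interval by hand from the textbook formula and compares it with `predict()`. It runs on simulated data, because the report’s dataset is not reproduced here; the simulated coefficients, sample size and noise level are assumptions chosen only to resemble the fitted model.

```r
# Simulated data for illustration only; NOT the report's dataset
set.seed(1)
x <- runif(50, min = 10, max = 90)            # hypothetical corruption scores
y <- 3.8 + 0.04 * x + rnorm(50, sd = 0.6)     # hypothetical happiness scores
fit <- lm(y ~ x)

x0 <- 42                                      # value at which to predict
n <- length(x)
s <- summary(fit)$sigma                       # residual standard error
t_crit <- qt(0.90, df = n - 2)                # 80% two-sided: 10% in each tail

# Standard error of a new observation at x0
se_pred <- s * sqrt(1 + 1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))

# Point prediction and manual interval
y_hat <- unname(coef(fit)[1] + coef(fit)[2] * x0)
manual_pi <- c(lower = y_hat - t_crit * se_pred,
               upper = y_hat + t_crit * se_pred)

# Built-in interval for comparison
builtin_pi <- predict(fit, newdata = data.frame(x = x0),
                      interval = "prediction", level = 0.80)

print(manual_pi)
print(builtin_pi)
```

The extra `1` under the square root is what distinguishes a prediction interval for a single new country from a confidence interval for the mean response, and it is the reason the intervals in the table are fairly wide.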

In conclusion, there is a clear positive relationship between the corruption index and the happiness index. More specifically, countries with a higher CPI score tend to have a higher happiness score; in other words, people in nations with more transparent political systems and better control of corruption tend to experience a better quality of life. These results agree with the paper “The Most Influential Factors in Determining the Happiness of Nations” by Julie Lang (University of Northern Iowa). According to that study, corruption plays a noticeable role (Appendix) in determining the life satisfaction of a country’s citizens: the better the control of corruption, the higher the life satisfaction index.

2. Relationship between Happiness index and GDP per capita

GDP per capita: an economic metric that expresses a country’s economic output per person, calculated by dividing the country’s total GDP by its total population. Economists often use it to gauge the prosperity of a nation.

IQR2 <- IQR(GDP_no_outliers$GDP_per_capita)
Lower_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs =0.75) + 1.5*IQR2

GDP_NO_outliers <- subset(GDP_no_outliers, GDP_per_capita>Lower_limit2 & GDP_per_capita<Upper_limit2)

Here, GDP per capita is the independent variable and the happiness index is the dependent variable.

First, we take a general look at the relationship between the happiness index and GDP per capita:

ggplot(GDP_NO_outliers, aes(x=GDP_per_capita, y=Happiness_index)) +
  geom_point(color='black', fill='darkorange', shape=21, alpha=1, size=3.5, stroke=1) +
  geom_smooth(method='loess', color='blue4', linewidth=1.5) +
  labs(title='Relationship between happiness index and GDP per capita', 
       y="Happiness index", x='GDP per capita',
       subtitle='Happiness index: scale of 10 | GDP per capita: thousand USD', caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

From the graph above, there is a clear relationship between the happiness index and GDP per capita, and it appears to be non-linear. More specifically, the curve resembles the graph of the function $y=\ m\ + \ log(x) \ (x>0)$; we therefore assume that the relationship between $y$ = happiness index and $x$ = GDP per capita has the form $y_i=\beta_0\ + \beta_1*log(x_i)$. It is then necessary to estimate $\beta_0$ and $\beta_1$ and to assess how strong this relationship is.

model_non <- nls(Happiness_index ~ b + a*log(GDP_per_capita), 
             data = GDP_NO_outliers, start = list(a=1, b=1))
summary(model_non)

## 
## Formula: Happiness_index ~ b + a * log(GDP_per_capita)
## 
## Parameters:
##   Estimate Std. Error t value Pr(>|t|)    
## a  0.54147    0.06954   7.786 7.07e-10 ***
## b  4.42452    0.16106  27.472  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.6151 on 45 degrees of freedom
## 
## Number of iterations to convergence: 1 
## Achieved convergence tolerance: 1.205e-08

summary_table_non <- summary(model_non)$parameters
rownames(summary_table_non) <- c('Beta 1 (a)', 'Beta 0 (b)')
colnames(summary_table_non) <- c('Estimate', 'Standard Error', 't-value', 'p-value')
kable(summary_table_non, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Beta 1 (a)	0.5415	0.0695	7.7864	0
Beta 0 (b)	4.4245	0.1611	27.4717	0

From the table above, $\beta_0=4.4245$ and $\beta_1=0.5415$. Therefore, the fitted model is $y_i=4.4245\ +\ 0.5415*log(x_i)$, which represents a positive relationship. The achieved convergence tolerance is very small (~0), which shows that the estimation algorithm converged numerically; it does not by itself measure how well the model fits. The quality of the fit is reflected in the residual standard error of 0.6151 on 45 degrees of freedom.
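Although the curve is non-linear in $x$, the model $y_i=\beta_0\ + \beta_1*log(x_i)$ is linear in its parameters, so ordinary least squares on $log(x)$ should give the same estimates as `nls()`. The sketch below illustrates this on simulated data (the data, seed and object names are assumptions for illustration, not the report’s dataset).

```r
# Simulated data for illustration only; NOT the report's dataset
set.seed(42)
gdp_pc <- runif(60, min = 1, max = 60)                 # hypothetical GDP per capita
happy  <- 4.4 + 0.55 * log(gdp_pc) + rnorm(60, sd = 0.5)
sim <- data.frame(gdp_pc, happy)

# Same model form as in the report, fitted in two ways
fit_nls <- nls(happy ~ b + a * log(gdp_pc), data = sim,
               start = list(a = 1, b = 1))
fit_lm  <- lm(happy ~ log(gdp_pc), data = sim)

# Intercept (b) and slope (a) from both fits, side by side
print(rbind(nls = coef(fit_nls)[c("b", "a")],
            lm  = unname(coef(fit_lm))))

# lm() additionally reports R-squared, which nls() does not
cat("R-squared from lm():", round(summary(fit_lm)$r.squared, 3), "\n")
```

Fitting the same form with `lm()` is one way to obtain an $R^2$ for the logarithmic model, which makes its strength comparable with the linear model of the previous section.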

In conclusion, countries with higher GDP per capita (in other words, more prosperous countries) tend to provide better welfare for their citizens, who in turn tend to be more content with their lives. Moreover, because the relationship follows a curve similar to the graph of $y=log(x)$, countries with lower GDP per capita gain more happiness from the same increase in GDP per capita. This result is broadly in line with the article “Beyond GDP: Economics and Happiness” in the Berkeley Economic Review (a non-profit publication of the University of California that aims to foster undergraduate writing and research on economic issues). According to this publication, there is a positive relationship between GDP per capita and the happiness index, and a 1% change in GDP per capita will cause about a 0.3 unit change in happiness.
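The diminishing-returns argument can be checked directly on the fitted curve. The sketch below plugs the reported coefficients into $y=4.4245\ +\ 0.5415*log(x)$ and compares the gain from the same one-unit increase at a low and at a high level of GDP per capita (in the units used in the plot). The function name `predict_happiness` and the example values of $x$ are our own.

```r
# Coefficients copied from the nls summary table above
b0 <- 4.4245
b1 <- 0.5415

# Fitted happiness index as a function of GDP per capita (x > 0)
predict_happiness <- function(x) b0 + b1 * log(x)

# Same absolute increase of one unit at a low and at a high starting level
gain_low  <- predict_happiness(3)  - predict_happiness(2)
gain_high <- predict_happiness(41) - predict_happiness(40)

# Doubling GDP per capita adds b1 * log(2), whatever the starting level
gain_doubling <- predict_happiness(20) - predict_happiness(10)

cat("gain from 2 to 3:   ", round(gain_low, 3), "\n")
cat("gain from 40 to 41: ", round(gain_high, 3), "\n")
cat("gain from doubling: ", round(gain_doubling, 3), "\n")
```

Under this model a proportional change in GDP per capita always yields the same change in the fitted happiness index, while a fixed absolute change matters less and less as a country gets richer.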

FURTHER ANALYSIS

#Change the column names
colnames(GDP_sample)[5] <- "Inflation_rate"
colnames(GDP_sample)[6] <- "GDP_growth_rate"
colnames(GDP_sample)[7] <- "Happiness_index"
colnames(GDP_sample)[8] <- "GDP_per_capita"
colnames(GDP_sample)[10] <- "Corruption_index"

#Calculate to eliminate outliers
IQR1 <- IQR(GDP_sample$Happiness_index)
Lower_limit1 <- quantile(GDP_sample$Happiness_index, probs = 0.25) - 1.5*IQR1
Upper_limit1 <- quantile(GDP_sample$Happiness_index, probs =0.75) + 1.5*IQR1

IQR2 <- IQR(GDP_sample$Corruption_index)
Lower_limit2 <- quantile(GDP_sample$Corruption_index, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_sample$Corruption_index, probs =0.75) + 1.5*IQR2

IQR3 <- IQR(GDP_sample$Inflation_rate)
Lower_limit3 <- quantile(GDP_sample$Inflation_rate, probs = 0.25) - 1.5*IQR3
Upper_limit3 <- quantile(GDP_sample$Inflation_rate, probs =0.75) + 1.5*IQR3

IQR4 <- IQR(GDP_sample$GDP_per_capita)
Lower_limit4 <- quantile(GDP_sample$GDP_per_capita, probs = 0.25) - 1.5*IQR4
Upper_limit4 <- quantile(GDP_sample$GDP_per_capita, probs =0.75) + 1.5*IQR4

IQR5 <- IQR(GDP_sample$Unemployment)
Lower_limit5 <- quantile(GDP_sample$Unemployment, probs = 0.25) - 1.5*IQR5
Upper_limit5 <- quantile(GDP_sample$Unemployment, probs =0.75) + 1.5*IQR5

#build dataset
GDP_5_stats <- subset(GDP_sample, select=c("Happiness_index","Corruption_index",
                                           "Inflation_rate","GDP_per_capita",
                                           "Unemployment","Continent"))

GDP_no_outliers <- GDP_5_stats %>%
  filter(Happiness_index>Lower_limit1 & Happiness_index<Upper_limit1) %>%
  filter(Inflation_rate>Lower_limit3 & Inflation_rate<Upper_limit3) %>%
  filter(Corruption_index<Upper_limit2 & Corruption_index>Lower_limit2) %>%
  filter(GDP_per_capita<Upper_limit4 & GDP_per_capita>Lower_limit4) %>%
  filter(Unemployment<Upper_limit5 & Unemployment>Lower_limit5)
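The five blocks above repeat the same 1.5 × IQR rule for each variable. As a sketch of how the rule works, the helper below (a hypothetical function, `inside_fences`, that is not part of the original script) computes the fences for one column and returns which values lie strictly inside them; it is demonstrated on a small toy data frame rather than on the report’s data.

```r
# Hypothetical helper: TRUE for values strictly inside the 1.5 * IQR fences
inside_fences <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[[2]] - q[[1]]
  x > q[[1]] - k * iqr & x < q[[2]] + k * iqr
}

# Toy data for illustration only; NOT the report's dataset
toy <- data.frame(a = c(1, 2, 3, 4, 5, 100),
                  b = c(10, 11, 12, 13, -50, 14))

# Keep the rows in which every checked column is inside its own fences
keep <- Reduce(`&`, lapply(toy[c("a", "b")], inside_fences))
toy_no_outliers <- toy[keep, ]
print(toy_no_outliers)
```

As in the script above, the fences of every column are computed on the full data before any row is removed, so the order in which the filters are applied does not change the result.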


Linear Regression

Hypothesis ($H_0$): none of the following factors makes a difference to happiness.

#table test of multiple linear regression
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita + Unemployment,
            data=GDP_no_outliers)
summary(model)

## 
## Call:
## lm(formula = Happiness_index ~ Corruption_index + Inflation_rate + 
##     GDP_per_capita + Unemployment, data = GDP_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.2238 -0.4336  0.1074  0.4200  1.0041 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4.597567   0.282768  16.259  < 2e-16 ***
## Corruption_index  0.026891   0.007177   3.747 0.000508 ***
## Inflation_rate   -0.008407   0.002822  -2.979 0.004653 ** 
## GDP_per_capita    0.009815   0.005018   1.956 0.056675 .  
## Unemployment     -0.022976   0.014768  -1.556 0.126760    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5776 on 45 degrees of freedom
## Multiple R-squared:  0.7072, Adjusted R-squared:  0.6812 
## F-statistic: 27.18 on 4 and 45 DF,  p-value: 1.679e-11

summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Corruption index',
                             'Inflation rate','GDP per capita','Unemployment')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.5976	0.2828	16.2592	0.0000
Corruption index	0.0269	0.0072	3.7467	0.0005
Inflation rate	-0.0084	0.0028	-2.9786	0.0047
GDP per capita	0.0098	0.0050	1.9561	0.0567
Unemployment	-0.0230	0.0148	-1.5558	0.1268

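The p-values of Corruption index (0.0005) and Inflation rate (0.0047) are close to 0, and that of GDP per capita (0.0567) is below 0.1 => reject $H_0$ for these factors. The p-value of Unemployment is 0.1268 > 0.1 => fail to reject $H_0$.

=> Unemployment is not needed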

#multiple-linear regression without unemployment
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter','Corruption index',
                             'Inflation rate','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.4493	0.2703	16.4605	0.0000
Corruption index	0.0261	0.0073	3.5856	0.0008
Inflation rate	-0.0093	0.0028	-3.3345	0.0017
GDP per capita	0.0115	0.0050	2.3014	0.0259

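The p-values of Corruption index (0.0008) and GDP per capita (0.0259) are below 0.1 => reject $H_0$. For Inflation rate, this step relies on a p-value of 0.15 > 0.1 => fail to reject $H_0$; note, however, that the table above reports 0.0017 for Inflation rate, so this step is not supported by the table as printed.

=> Inflation rate is treated as not needed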

#multiple-linear regression without inflation rate


model <- lm(Happiness_index ~ Corruption_index + GDP_per_capita,
            data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter',
                             'Corruption index','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
      digits = c(4,4,4,4), align = 'cccc')

Summary table of all nations worldwide
	Estimate	Standard Error	t-value	p-value
Intercept parameter	4.0885	0.2731	14.9727	0.0000
Corruption index	0.0309	0.0078	3.9381	0.0003
GDP per capita	0.0107	0.0055	1.9515	0.0570

All p-values are below 0.1 (the largest, for GDP per capita, is 0.0570), so we reject the null hypothesis for each coefficient, and both slope estimates are positive.

Conclusion: Corruption index and GDP per capita are the two driving factors.

#correlation test of each pair to test the influence of corruption index and GDP per capita

# Pearson correlation of Happiness index with each retained predictor
cor2 <- cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$GDP_per_capita)$estimate
cor4 <- cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$Corruption_index)$estimate

# One row per pair; rounding is left to kable()
correlation_table <- rbind(cor2, cor4)

colnames(correlation_table) <- c('Correlation')
rownames(correlation_table) <- c('Happiness index & GDP per capita',
                                 'Happiness index & Corruption index')

kable(correlation_table,
      caption = "Correlation results of GDP per capita and Corruption index",
      digits = 4,
      align = 'c')

Correlation results of GDP per capita and Corruption index
	Correlation
Happiness index & GDP per capita	0.7004
Happiness index & Corruption index	0.7654

Happiness index is positively correlated with both GDP per capita (0.70) and Corruption index (0.77), and both correlations are strong.
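
For reference, the Pearson coefficient reported by `cor.test()` is the covariance of the two variables divided by the product of their standard deviations. The sketch below checks this on the built-in `mtcars` data, which is only a stand-in for the report's dataset; the variable names `r_manual` and `ct` are hypothetical.

```r
# Pearson correlation computed by hand and via cor.test(),
# using mtcars as a stand-in dataset.
x <- mtcars$wt
y <- mtcars$mpg

# r = cov(x, y) / (sd(x) * sd(y))
r_manual <- cov(x, y) / (sd(x) * sd(y))

# cor.test() returns the same estimate plus a p-value and a confidence interval
ct <- cor.test(x, y)

print(r_manual)
print(unname(ct$estimate))
print(ct$p.value)
print(ct$conf.int)
```

Besides the estimate used in the table above, `cor.test()` also provides the p-value and confidence interval, which indicate whether a correlation is distinguishable from zero.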

#plot the graph to illustrate
ggplot(GDP_no_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
  geom_point(color='black', fill='orange', shape=21, alpha=1, size=3, stroke=.9, 
             position = "jitter") +
  geom_smooth(method = "lm", linewidth = 1.2) +
  labs(title='Relationship between GDP per Capita and Happiness index', 
       y="Happiness index", x='GDP per Capita',
       caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

ggplot(GDP_no_outliers, aes(x = Corruption_index, y = Happiness_index)) +
  geom_point(color='black', fill='orange', shape=21, alpha=1, size=3, stroke=.9, 
             position = "jitter") +
  geom_smooth(method = "lm", linewidth = 1.2) +
  labs(title='Relationship between Corruption index and Happiness index', 
       y="Happiness index", x='Corruption index',
       caption='OSTA 2023 - Group 2')

## `geom_smooth()` using formula = 'y ~ x'

#R square test to test the fitness of the model
Rsquared <- summary(model)$r.squared
print(Rsquared)

## [1] 0.6169141

61.69% of the variation in the Happiness index is accounted for by Corruption index and GDP per capita => good
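
The value returned by `summary(model)$r.squared` is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch of that calculation follows, again on the built-in `mtcars` data as a stand-in; `fit`, `ss_res`, and `ss_tot` are hypothetical names.

```r
# R-squared computed by hand and compared with summary(),
# using mtcars as a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                   # unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)  # total variation
r2_manual <- 1 - ss_res / ss_tot

r2_summary <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared  # penalises the number of predictors

print(c(manual = r2_manual, summary = r2_summary, adjusted = adj_r2))
```

When models with different numbers of predictors are compared, as in the elimination steps above, the adjusted R-squared is usually the fairer figure because plain R-squared never decreases when a predictor is added.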

Country_test <- subset(GDP_sample, Country == 'South Africa',
                       select=c("Country","Corruption_index",
                              "GDP_per_capita","Happiness_index"))
Country_test

## # A tibble: 1 × 4
##   Country      Corruption_index GDP_per_capita Happiness_index
##   <chr>                   <dbl>          <dbl>           <dbl>
## 1 South Africa               43            6.8             5.2

new_data <- data.frame(Corruption_index = Country_test[1,2],
                       GDP_per_capita = Country_test[1,3])

prediction_interval <- predict(model,
                               newdata = new_data,
                               interval = "prediction",
                               level = 0.9)
print(prediction_interval)

##        fit     lwr     upr
## 1 5.464745 4.61857 6.31092

Lower <- prediction_interval[1,2]
Upper <- prediction_interval[1,3]
Fit <- prediction_interval[1,1]
Real <- Country_test[1,4]
summary_table <- cbind(Lower, Fit, Upper,Real)
colnames(summary_table) <- c('Lower', 'Fit', 'Upper','Real')
rownames(summary_table) <- c('Chosen country')
kable(summary_table, caption="90% Prediction Interval of Predicted Nation",
      digits = c(4,4,4,4), align = 'cccc')

90% Prediction Interval of Predicted Nation
	Lower	Fit	Upper	Real
Chosen country	4.6186	5.4647	6.3109	5.2

Multiplying each coefficient estimate by the corresponding value of its variable and adding the intercept gives the fitted Happiness index, which should equal the fit returned by predict().

90% Prediction Interval of Predicted Nation
	Lower	Fit	Upper	Real	Manual fit
Chosen country	4.6186	5.4647	6.3109	5.2	5.4647

The manual result is exactly the same as the result from R's predict() function.
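
The code for the manual fit is not shown in this section, so the following is only a sketch of how such a check could look. It uses the built-in `mtcars` data and a hypothetical observation `new_obs` in place of the report's model and chosen country.

```r
# Manual fitted value versus predict(), using mtcars as a stand-in dataset.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observation (values chosen only for illustration)
new_obs <- data.frame(wt = 3.0, hp = 120)

# Intercept + sum of (coefficient * predictor value);
# the leading 1 multiplies the intercept
manual_fit <- sum(coef(fit) * c(1, new_obs$wt, new_obs$hp))

# Same point estimate from predict(), with a 90% prediction interval
predicted <- predict(fit, newdata = new_obs,
                     interval = "prediction", level = 0.9)

print(manual_fit)
print(predicted)
```

The order of the values in `c(1, ...)` has to match the order of `coef(fit)`, and the unrounded coefficients should be used, since coefficients rounded for display can shift the result slightly.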

#95% confidence interval of the sample mean through t.test()

# Run each one-sample t-test once and reuse its interval and estimate
tt_1 <- t.test(GDP_no_outliers$GDP_growth_rate)
tt_2 <- t.test(GDP_no_outliers$GDP_per_capita)
tt_3 <- t.test(GDP_no_outliers$Inflation_rate)

# Each row holds: lower bound, sample mean, upper bound
A <- c(tt_1$conf.int[1], tt_1$estimate, tt_1$conf.int[2])
B <- c(tt_2$conf.int[1], tt_2$estimate, tt_2$conf.int[2])
C <- c(tt_3$conf.int[1], tt_3$estimate, tt_3$conf.int[2])

table_1 <- rbind(A,B,C)

rownames(table_1) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_1) <- c('Lower','Sample','Upper')
kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean with 95% confidence
	Lower	Sample	Upper
GDP growth rate	3.0759	4.1286	5.1813
GDP per capita	10.2692	15.7771	21.2851
Inflation rate	7.3731	8.7886	10.2040

#draw table
table_2 <- rbind(confidence_interval_1, confidence_interval_2, confidence_interval_3)
rownames(table_2) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_2) <- c('Lower','Upper')

kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean with 95% confidence
	Lower	Sample	Upper
GDP growth rate	3.0759	4.1286	5.1813
GDP per capita	10.2692	15.7771	21.2851
Inflation rate	7.3731	8.7886	10.2040

kable(table_2, caption="Confidence Interval of sample mean calculated manually",
      digits = c(4,4,4), align = 'ccc')

Confidence Interval of sample mean calculated manually
	Lower	Upper
GDP growth rate	3.2792	4.9780
GDP per capita	11.3328	20.2215
Inflation rate	7.6465	9.9307

Comparing the two tables, the sample means and confidence intervals from t.test() are close to the manual calculations. Both sets of intervals are centered on the same sample means, but the manual intervals are narrower than those from t.test().
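
The manual calculation itself is not shown in this section, so the sketch below only illustrates the usual formula, mean ± critical value × s/√n, on the built-in `mtcars` data as a stand-in. It assumes a 95% level; the names `ci_t`, `ci_z`, and `ci_ttest` are hypothetical. With the t critical value the result matches `t.test()`, while a normal (z) critical value gives a narrower interval around the same mean.

```r
# Confidence interval of a sample mean by hand and via t.test(),
# using mtcars$mpg as stand-in data.
x <- mtcars$mpg
n <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean

t_crit <- qt(0.975, df = n - 1)  # t critical value for a 95% interval
z_crit <- qnorm(0.975)           # normal critical value for a 95% interval

ci_t <- mean(x) + c(-1, 1) * t_crit * se
ci_z <- mean(x) + c(-1, 1) * z_crit * se
ci_ttest <- as.numeric(t.test(x)$conf.int)

print(rbind(manual_t = ci_t, manual_z = ci_z, t_test = ci_ttest))
```

A narrower manual interval can therefore come from the critical value or confidence level used rather than from a more accurate estimate, so both should be checked before comparing widths.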

OSTA Project 2023

Group 2 - BFA/BBA2022