Traditionally, economic prosperity has been treated as the main indicator of a country's success. However, societal well-being is receiving increasing attention alongside economic growth. Understanding which factors shape global happiness, and how strongly each one influences it, is both challenging and essential.
We acknowledge the importance of life satisfaction and of the factors that influence residents' happiness. This statistical report examines the relationship between key indicators and the Happiness index across 150 countries on 4 continents in 2022.
This study aims to identify the main factors contributing to nations’ happiness levels by looking at the economic factors (GDP, GDP per capita, GDP growth rate, and Inflation rate) and social metrics (Population, Corruption index, and Unemployment rate).
The dataset covers nations with varied economic structures, cultural backgrounds, and governance models, and is expected to reveal patterns and correlations among the investigated variables. As a result, the Happiness index of a particular country can be predicted, and the variables that contribute most can be identified.
The dataset provides an overview of the countries through economic and socio-cultural indicators. Each entry represents one country and includes the following information:
Rank: The ranking of the country based on GDP.
Country: The name of the country.
GDP (Gross Domestic Product): The country's total economic output, which indicates the size of its economy.
Population: The total number of people living in the country.
Inflation Rate: The percentage increase in the general price level of goods and services over one year.
GDP Growth Rate: The rate at which the country’s economy expands or contracts over a specific period.
Happiness Index: A measure of the subjective well-being and happiness of the country’s citizens.
GDP per Capita: The GDP divided by the population, providing an average economic output per person.
Continent: The continent to which the country belongs.
Corruption Index: A measure of perceived corruption in the country, though the specific index used is not defined in the provided dataset.
Unemployment Rate: The percentage of the labor force that is unemployed and actively seeking employment.
#build Dataset
group <- GDP$Continent
subgroup <- GDP$Country
value <- GDP$GDP
data <- data.frame(group,subgroup,value)
#box plot for GDP per capita
ggplot(data = GDP, aes(x = reorder(Continent, -GDP_per_capita, FUN = median), y = GDP_per_capita, color = Continent)) +
geom_line(linewidth = 0.8) +
geom_boxplot(position = position_dodge(width = 5), outlier.shape = NA, linewidth = 0.8) +
labs(title = "Average GDP per capita of the four Continents",
x = "Continent",
y = "GDP per capita") +
theme_bw()
Description:
Most countries have a GDP per capita lower than 50. Europe's figures stand out as the highest, while Africa's are the lowest.
Europe has the highest lower and upper bounds, with a median of about 25. Although America's upper bound is lower than Asia's, its median is higher, at around 10. This can be explained by the much wider spread of Asia's figures, from about 5 to 30, compared with about 10 to 20 in America.
#histogram for happiness index
GDP %>%
ggplot(aes(x=Happiness_index)) +
geom_histogram(binwidth=0.25,
fill="#B0A695",
color="#F3EEEA", alpha=0.9) +
theme_minimal()+
labs(title = "Happiness Index distribution",
x = "Happiness index (bin width = 0.25)",
y = "Frequency")
Description:
The overall Happiness index distribution is roughly symmetric.
Values from 5.8 to 6.2 are the most frequent, peaking at 14 counts at 6.0. The value 5.0 also has a high frequency of 13.
There are some gaps in the data before the lower endpoint of 2.4, which should be considered an outlier.
GDP %>%
ggplot(aes(x=Happiness_index)) +
geom_histogram(binwidth=0.5,
fill="#B0A695",
color="#F3EEEA", alpha=0.9) +
theme_minimal()+
labs(title = "Happiness Index distribution by Continents",
x = "Happiness index (bin width = 0.5)",
y = "Frequency")+
facet_wrap(~Continent)
Description:
However, the picture changes moderately when the histogram is split by continent with a bin width of 0.5.
The peaks for America and Asia are roughly the same as in the overall histogram, at around 6. Interestingly, most of Africa's recorded values are lower than 5.5. Europe, on the other hand, shows high counts at 7 to 7.5 in addition to a similar peak at around 6.
We may conclude that Europeans are slightly happier and that Africa's happiness level is comparatively low.
#bar chart for Corruption index
mean_values <- aggregate(`Corruption_index` ~ Continent, data = GDP, mean)
ggplot(mean_values, aes(x = Continent, y = `Corruption_index`, fill = Continent)) +
geom_bar(stat = "identity", width = 0.7) +
geom_text(aes(label = round(`Corruption_index`, 2)),
position = position_nudge(y = 1), size = 3, color = "black") +
labs(title = "Average Corruption Rate by Continent",
x = "Continent",
y = "Average Corruption Rate") +
theme_minimal()
Description:
The bar chart shows the average Corruption index for the four continents: Africa, America, Asia, and Europe.
The averages differ clearly across continents: Africa has the lowest average score at 32.62, followed by America at 40.11, Asia at 41.35, and Europe with the highest at 58.49. If the index is read on the Corruption Perceptions Index scale described later in this report (0 = highly corrupt, 100 = very clean), a lower score means more perceived corruption, so Africa shows the highest perceived corruption and Europe the lowest.
The ascending order of the average scores is clearly illustrated, offering a comparative view of the relative corruption levels.
We have been taught, or at least assumed by intuition, that “the wealthier, the merrier”. But is it actually true?
To answer this question, we test whether the mean Happiness Index is equal across the continents using one-factor analysis of variance (ANOVA). If there are differences, that is, if the null hypothesis \(H_0\) is rejected, we then apply the Tukey multiple comparisons procedure (Tukey's HSD) to construct confidence intervals for these differences. The hypotheses are as follows:
Hypothesis testing at a 5% (= 0.05) level of significance: \(H_0\): \(\mu_1 = \mu_2 = \mu_3 = \mu_4\) versus \(H_A\): at least one pair of means differ from each other
\(\mu_1\): the mean of Happiness Index for America (Continent)
\(\mu_2\): the mean of Happiness Index for Europe
\(\mu_3\): the mean of Happiness Index for Africa
\(\mu_4\): the mean of Happiness Index for Asia
First, we draw a boxplot of the Happiness Index by continent in R. The R code is as follows:
#Take out a portion of dataset
SubsetGDP <- subset(GDP_sample, select = c("Continent", "Happiness_index"))
#Draw boxplot
par(mar = c(3, 4, 3, 1))
boxplot(SubsetGDP$Happiness_index ~ SubsetGDP$Continent,
xlab = "Continents", ylab = "Happiness_index",
main = "Happiness Index in accordance to Continents")
Next, we compute the ANOVA table in R. The R code is as follows:
#One-way ANOVA
oneway <- aov(Happiness_index ~ Continent,
data = SubsetGDP)
anova_table <- summary(oneway)
SumSq <- anova_table[[1]]$'Sum Sq'
Df <- anova_table[[1]]$'Df'
MeanSq <- anova_table[[1]]$'Mean Sq'
F_value <- anova_table[[1]]$'F value'
p_value <- anova_table[[1]]$'Pr(>F)'
table <- cbind(Df, SumSq, MeanSq, F_value, p_value)
# Display the ANOVA summary as a table
colnames(table) <- c('Degree of freedom', 'Sum squared', 'Mean squared', 'F-value', 'p-value')
rownames(table) <-c('Continent','Residuals')
kable(table, caption = "ANOVA Summary Table")
| | Degree of freedom | Sum squared | Mean squared | F-value | p-value |
|---|---|---|---|---|---|
| Continent | 3 | 21.30352 | 7.101172 | 10.89693 | 1.58e-05 |
| Residuals | 46 | 29.97668 | 0.651667 | NA | NA |
From the ANOVA table above, the p-value is far smaller than 5% (and even than 1%), so the null hypothesis \(H_0\) is rejected. Hence, there is sufficient evidence that the mean Happiness Index differs for at least one pair of continents.
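As a quick sanity check, the F-value and p-value can be recomputed from the sums of squares and degrees of freedom reported in the table. The sketch below simply hard-codes those reported numbers; the variable names are illustrative and not part of the original analysis.

```r
# Values copied from the ANOVA summary table above
ss_between <- 21.30352   # sum of squares for Continent
df_between <- 3
ss_within  <- 29.97668   # residual sum of squares
df_within  <- 46

# Mean squares are sums of squares divided by their degrees of freedom
ms_between <- ss_between / df_between
ms_within  <- ss_within / df_within

# F statistic and its upper-tail p-value under H0
f_value <- ms_between / ms_within
p_value <- pf(f_value, df_between, df_within, lower.tail = FALSE)

print(round(c(MeanSq_between = ms_between, MeanSq_within = ms_within, F = f_value), 4))
print(signif(p_value, 3))
```

The recomputed F-value should agree with the reported 10.89693 up to rounding of the inputs.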
But we want to delve into the data further: for how many pairs of continents do the means \(\mu_i\) differ, and by how much?
First, recall that in terms of GDP per capita, Europe is considered to be the richest, followed by Asia and America (Continent), while Africa is, as expected, the poorest among the four.
Next, to answer this question, our team conducts Tukey's honestly significant difference test (also known as the Tukey multiple comparisons procedure) on this dataset. The R code is as follows:
#Tukey's HSD test
tukey <- TukeyHSD(oneway, conf.level=0.95)
table2 <- tukey$Continent
colnames(table2) <- c('Difference', 'Lower', 'Upper', 'p-value')
kable(table2, caption="Tukey's HSD Table")
| | Difference | Lower | Upper | p-value |
|---|---|---|---|---|
| America-Africa | 1.0000000 | 0.0511693 | 1.9488307 | 0.0353294 |
| Asia-Africa | 1.0187500 | 0.1970385 | 1.8404615 | 0.0096522 |
| Europe-Africa | 1.8461538 | 0.9847662 | 2.7075415 | 0.0000046 |
| Asia-America | 0.0187500 | -0.8778107 | 0.9153107 | 0.9999369 |
| Europe-America | 0.8461538 | -0.0869057 | 1.7792134 | 0.0879329 |
| Europe-Asia | 0.8274038 | 0.0239549 | 1.6308528 | 0.0413377 |
However, the table is hard to read at a glance and makes further comparisons difficult. To make the results more illustrative, we plot the confidence intervals. The R code is as follows:
#Confidence level plotting
par(mar = c(5, 7, 3, 1))
plot(tukey, las = 2)
All things considered, the analysis above suggests that our initial intuition is plausible, i.e. as the wealth of a continent increases, so does the overall happiness and well-being of its population. In particular, Europe has the highest mean Happiness Index, followed by Asia and America, whose means are almost identical (difference 0.019, p-value 0.9999). Africa has the lowest mean, and its difference from each of the other three continents is significant at the 5% level. Note, however, that the Europe-America difference (p-value 0.088) is not significant at the 5% level.
The third method we used in the report is linear regression, with the aim of building a mathematical model of two variables and investigating the relationship between them. First, we formed the hypotheses to check whether there is any connection between the two variables:
Two-sided hypothesis: \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\)
Furthermore, we investigated the relationship between the happiness index and the corruption index of the countries to test whether people in nations with less corruption (in both state and non-state organizations) experience a better quality of life. In addition, we also investigated how well this relationship holds for each continent.
Happiness index: The happiness index is measured by surveying the people of each country on a scale from 0 to 10. It expresses how content the people of a nation are with their quality of life and with general issues in their community.
Corruption index: Corruption index or Corruption Perceptions Index (CPI) ranks countries and territories worldwide by their perceived levels of public sector corruption, with the scores ranging from 0 (highly corrupt) to 100 (very clean).
Here, the corruption index is the independent variable and the happiness index is the dependent variable.
First of all, we set up two-sided hypotheses to obtain an overview of this relationship: \(H_0: \beta_1 = 0\) versus \(H_A: \beta_1 \neq 0\),
with \(\beta_1\) the slope parameter.
Then, we compute the p-value, which gives a first indication of whether the happiness index and the corruption index are related. We use the following code to evaluate the p-value:
model1 <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers) #Intercept parameter & Slope parameter
summary_table <- summary(model1)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Slope parameter')
summary(model1)
##
## Call:
## lm(formula = Happiness_index ~ Corruption_index, data = GDP_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8240 -0.4516 0.1294 0.5073 1.0067
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.800112 0.236255 16.085 < 2e-16 ***
## Corruption_index 0.042663 0.005177 8.241 9.56e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6652 on 48 degrees of freedom
## Multiple R-squared: 0.5859, Adjusted R-squared: 0.5772
## F-statistic: 67.91 on 1 and 48 DF, p-value: 9.56e-11
kable(summary_table, caption="Summary table of all nations worldwide",
digits = c(4,4,4,4), align = 'cccc')
| | Estimate | Standard Error | t-value | p-value |
|---|---|---|---|---|
| Intercept parameter | 3.8001 | 0.2363 | 16.0848 | 0 |
| Slope parameter | 0.0427 | 0.0052 | 8.2405 | 0 |
The output above shows that the p-value is very low (about \(9.56 \times 10^{-11}\), displayed as 0 after rounding), so the null hypothesis \(H_0\) is rejected; in other words, the slope parameter is non-zero. Therefore, there is a statistically significant linear relationship between the happiness index and the corruption index.
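The reported test statistics can be cross-checked by hand: the t-value is the estimate divided by its standard error, the two-sided p-value comes from the t distribution with 48 degrees of freedom, and in a simple linear regression \(R^2 = t^2 / (t^2 + \text{df})\). A minimal sketch using only the numbers printed in the output above (the variable names are illustrative):

```r
# Slope estimate, its standard error, and residual degrees of freedom,
# copied from the summary(model1) output above
estimate  <- 0.042663
std_error <- 0.005177
df_resid  <- 48

# t statistic for H0: beta_1 = 0
t_value <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# In simple linear regression, F = t^2 and R^2 = F / (F + df)
r_squared <- t_value^2 / (t_value^2 + df_resid)

print(round(c(t = t_value, R2 = r_squared), 4))
print(signif(p_value, 3))
```

Small differences from the printed output are expected because the inputs are already rounded.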
However, we need to investigate further to understand how strong this relationship is and how we can use this model to forecast the happiness index from a country's corruption score.
ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
labs(
title = 'Relationship between happiness index and corruption index',
y = "Happiness index", x = 'Corruption index',
subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100'
) +
theme_minimal()
From the graph above, we can observe a clear positive correlation between the corruption index and the happiness index, and a linear relationship between the two appears reasonable. In particular, \(R^2 = 0.5859\) from the results above means that the corruption index explains about 58.6% of the variation in the happiness index, which indicates a relatively strong relationship.
Furthermore, we write down the formula for this relationship. From the table above, the slope parameter is \(\beta_1 = 0.0427\) and the intercept parameter is \(\beta_0 = 3.8001\). Therefore, the fitted simple linear regression model is \(y_i = 3.8001 + 0.0427 x_i\); the data values \((x_i, y_i)\) lie closer to this line as the error variance decreases.
In addition, we can compute a prediction interval for the happiness index at a particular value of the corruption index. We use the following code to find a two-sided prediction interval at the 80% level.
For example, we take the corruption index values 42 (Vietnam's corruption index) and 69 (the corruption index of the USA). Running the code gives the following result:
model_prediction <- lm(Happiness_index ~ Corruption_index, data = GDP_no_outliers)
new_corruption_index <- data.frame(Corruption_index = c(42,69))
prediction_happiness <- predict(model_prediction, interval = "prediction", newdata = new_corruption_index, level = 0.80)
Corruption_index = new_corruption_index$Corruption_index
Predicted_happiness = prediction_happiness[, 1]
Lower_CI = prediction_happiness[, 2]
Upper_CI = prediction_happiness[, 3]
summary_happiness <- cbind(Corruption_index, Lower_CI, Predicted_happiness, Upper_CI)
colnames(summary_happiness) <- c('Corruption index selected', 'Lower', 'Fit', 'Upper')
rownames(summary_happiness) <- c('Viet Nam', 'the USA')
kable(summary_happiness, caption='Confidence Interval of the Country Selected', digits = c(1,4,4,4), align = 'cccc')
| | Corruption index selected | Lower | Fit | Upper |
|---|---|---|---|---|
| Viet Nam | 42 | 4.7190 | 5.5920 | 6.4649 |
| the USA | 69 | 5.8521 | 6.7439 | 7.6357 |
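The Fit column can be reproduced directly from the fitted line \(y = \beta_0 + \beta_1 x\). The sketch below plugs the two corruption index values into the coefficients printed in the regression output; the helper name predict_happiness is hypothetical.

```r
# Coefficients copied from the summary(model1) output
beta_0 <- 3.800112   # intercept
beta_1 <- 0.042663   # slope for Corruption_index

# Hypothetical helper: point prediction from the fitted line
predict_happiness <- function(corruption_index) {
  beta_0 + beta_1 * corruption_index
}

fit <- predict_happiness(c(`Viet Nam` = 42, `the USA` = 69))
print(round(fit, 4))
```

This only reproduces the point predictions. The interval bounds additionally depend on the residual standard error and on the spread of the corruption index in the data, which predict() takes from the fitted model.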
The result makes sense, as Vietnam's actual happiness index is 5.5 and the figure for the USA is 7.0, both of which fall inside the corresponding prediction intervals. The results can be illustrated in the graph below:
ggplot(GDP_no_outliers, aes(y = Happiness_index, x = Corruption_index)) +
geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
labs(title = 'Relationship between happiness index and corruption index',
y = "Happiness index", x = 'Corruption index',
subtitle = 'Happiness index: scale of 10 | Corruption index: scale of 100') +
geom_segment(aes(x = 42, xend = 42, y = prediction_happiness[1, 2], yend = prediction_happiness[1, 3]),
linetype = "solid", color = "red", linewidth = 1.5) +
geom_text(aes(x = 42, y = prediction_happiness[1, 3] + 0.35, label = "Vietnam"), size = 5) +
geom_segment(aes(x = 69, xend = 69, y = prediction_happiness[2, 2], yend = prediction_happiness[2, 3]),
linetype = "solid", color = "red", linewidth = 1.5) +
geom_text(aes(x = 69, y = prediction_happiness[2, 3] + 0.25, label = "the USA"), size = 5)
In conclusion, there is a clear positive relationship between the corruption index and the happiness index. More specifically, countries with a higher CPI score tend to have a higher happiness score; in other words, people in nations with more transparent political systems and better control of corruption tend to experience a better quality of life. These results are consistent with the paper “The Most Influential Factors in Determining the Happiness of Nations” by Julie Lang (University of Northern Iowa). According to that investigation, corruption plays a noticeable role (Appendix) in determining the life satisfaction of a country's people: the better the control of corruption, the higher the life satisfaction index.
IQR2 <- IQR(GDP_no_outliers$GDP_per_capita)
Lower_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_no_outliers$GDP_per_capita, probs =0.75) + 1.5*IQR2
GDP_NO_outliers <- subset(GDP_no_outliers, GDP_per_capita>Lower_limit2 & GDP_per_capita<Upper_limit2)
*Here, GDP per capita is the independent variable and the happiness index is the dependent variable.
First of all, we take a general look at the relationship between the happiness index and GDP per capita:
ggplot(GDP_NO_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
geom_point(color = 'black', fill = 'darkorange', shape = 21, alpha = 1, size = 3.5, stroke = 1) +
geom_smooth(method = 'lm', formula = y ~ x, color = 'blue4', linewidth = 1.5) +
labs(
title = 'Relationship between happiness index and GDP per capita',
y = 'Happiness index', x = 'GDP per capita',
subtitle = 'Happiness index: scale of 10 | GDP per capita: thousand USD')
From the graph above, we can see a clear connection between the happiness index and GDP per capita, and the connection appears to be non-linear. More specifically, the trend of the points resembles the graph of the function \(y = m + \log(x)\ (x > 0)\); therefore, we assume that the non-linear formula between \(y\) = happiness index and \(x\) = GDP per capita is \(y_i = \beta_0 + \beta_1 \log(x_i)\). It is then necessary to estimate \(\beta_0\) and \(\beta_1\) and to assess how strong this relationship is.
model_non <- nls(Happiness_index ~ b + a*log(GDP_per_capita),
data = GDP_NO_outliers, start = list(a=1, b=1))
summary(model_non)
##
## Formula: Happiness_index ~ b + a * log(GDP_per_capita)
##
## Parameters:
## Estimate Std. Error t value Pr(>|t|)
## a 0.54147 0.06954 7.786 7.07e-10 ***
## b 4.42452 0.16106 27.472 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6151 on 45 degrees of freedom
##
## Number of iterations to convergence: 1
## Achieved convergence tolerance: 1.205e-08
summary_table_non <- summary(model_non)$parameters
rownames(summary_table_non) <- c('Beta 1 (a)', 'Beta 0 (b)')
colnames(summary_table_non) <- c('Estimate', 'Standard Error', 't-value', 'p-value')
kable(summary_table_non, caption="Summary table of all nations worldwide",
digits = c(4,4,4,4), align = 'cccc')
| | Estimate | Standard Error | t-value | p-value |
|---|---|---|---|---|
| Beta 1 (a) | 0.5415 | 0.0695 | 7.7864 | 0 |
| Beta 0 (b) | 4.4245 | 0.1611 | 27.4717 | 0 |
From the table above, \(\beta_0 = 4.4245\) and \(\beta_1 = 0.5415\). Therefore, the fitted model is \(y_i = 4.4245 + 0.5415 \log(x_i)\), which represents a positive relationship. Furthermore, the achieved convergence tolerance is very small (about \(1.2 \times 10^{-8}\)), which indicates that the fitting algorithm converged; the quality of the fit itself is reflected by the residual standard error of 0.6151.
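Because the assumed model is linear in \(\beta_0\) and \(\beta_1\) once GDP per capita is log-transformed, an ordinary lm() fit on log(GDP_per_capita) should give the same estimates as nls(), and it also reports \(R^2\). The sketch below illustrates this on simulated data; the data frame toy and its parameters are assumptions, not the report's dataset.

```r
# Simulated stand-in data (NOT the report's dataset)
set.seed(1)
toy <- data.frame(GDP_per_capita = runif(40, min = 1, max = 60))
toy$Happiness_index <- 4.4 + 0.54 * log(toy$GDP_per_capita) + rnorm(40, sd = 0.6)

# Same model form as in the report, fitted with nls()
fit_nls <- nls(Happiness_index ~ b + a * log(GDP_per_capita),
               data = toy, start = list(a = 1, b = 1))

# Equivalent linear model on the log-transformed predictor
fit_lm <- lm(Happiness_index ~ log(GDP_per_capita), data = toy)

# Intercept (b) and slope (a) from both fits, side by side
comparison <- rbind(nls = unname(coef(fit_nls)[c("b", "a")]),
                    lm  = unname(coef(fit_lm)))
colnames(comparison) <- c("Beta 0", "Beta 1")
print(round(comparison, 4))

# R-squared is available from the lm() fit
print(round(summary(fit_lm)$r.squared, 4))
```

Using lm() here is a design choice for convenience: it avoids choosing starting values and gives the usual regression diagnostics.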
In conclusion, countries with a higher GDP per capita (in other words, more prosperous countries) tend to provide better welfare for their citizens, who also tend to be more content with their lives. Besides that, because the relationship follows a curve similar to the graph of \(y = \log(x)\), countries with a lower GDP per capita gain more happiness from the same increase in GDP per capita. This result broadly supports the article “Beyond GDP: Economics and Happiness” published in the Berkeley Economic Review (a non-profit publication of the University of California that aims to foster undergraduate writing and research on economic issues). According to this publication, there is a positive relationship between GDP per capita and happiness index and a 1% change in GDP per capita will cause about 0.3 unit change in happiness.
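The diminishing effect described above can be made concrete with the fitted curve \(y = 4.4245 + 0.5415 \log(x)\). This sketch compares the predicted gain from one extra unit of GDP per capita at a low and at a high starting level; the starting levels 1 and 40 are arbitrary illustrative choices, and the function name is hypothetical.

```r
# Fitted log model from the nls() output (coefficients rounded as in the table)
predict_happiness_log <- function(gdp_per_capita) {
  4.4245 + 0.5415 * log(gdp_per_capita)
}

# Gain in predicted happiness from +1 unit of GDP per capita
gain_low  <- predict_happiness_log(2)  - predict_happiness_log(1)
gain_high <- predict_happiness_log(41) - predict_happiness_log(40)

print(round(c(gain_low = gain_low, gain_high = gain_high), 4))
```

Under this model the gain depends only on the ratio of the new to the old GDP per capita, which is why the same absolute increase matters more for poorer countries.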
#Calculate to eliminate outliers
IQR1 <- IQR(GDP_sample$Happiness_index)
Lower_limit1 <- quantile(GDP_sample$Happiness_index, probs = 0.25) - 1.5*IQR1
Upper_limit1 <- quantile(GDP_sample$Happiness_index, probs =0.75) + 1.5*IQR1
IQR2 <- IQR(GDP_sample$Corruption_index)
Lower_limit2 <- quantile(GDP_sample$Corruption_index, probs = 0.25) - 1.5*IQR2
Upper_limit2 <- quantile(GDP_sample$Corruption_index, probs =0.75) + 1.5*IQR2
IQR3 <- IQR(GDP_sample$Inflation_rate)
Lower_limit3 <- quantile(GDP_sample$Inflation_rate, probs = 0.25) - 1.5*IQR3
Upper_limit3 <- quantile(GDP_sample$Inflation_rate, probs =0.75) + 1.5*IQR3
IQR4 <- IQR(GDP_sample$GDP_per_capita)
Lower_limit4 <- quantile(GDP_sample$GDP_per_capita, probs = 0.25) - 1.5*IQR4
Upper_limit4 <- quantile(GDP_sample$GDP_per_capita, probs =0.75) + 1.5*IQR4
IQR5 <- IQR(GDP_sample$Unemployment)
Lower_limit5 <- quantile(GDP_sample$Unemployment, probs = 0.25) - 1.5*IQR5
Upper_limit5 <- quantile(GDP_sample$Unemployment, probs =0.75) + 1.5*IQR5
#build dataset
GDP_5_stats <- subset(GDP_sample,
select=c("Happiness_index","Corruption_index",
"Inflation_rate","GDP_per_capita",
"Unemployment","Continent"))
GDP_no_outliers <- GDP_5_stats %>%
filter(Happiness_index>Lower_limit1
& Happiness_index<Upper_limit1) %>%
filter(Inflation_rate>Lower_limit3
& Inflation_rate<Upper_limit3) %>%
filter(Corruption_index<Upper_limit2
& Corruption_index>Lower_limit2) %>%
filter(GDP_per_capita<Upper_limit4
& GDP_per_capita>Lower_limit4) %>%
filter(Unemployment<Upper_limit5
& Unemployment>Lower_limit5)
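The five nearly identical calculations above all apply the usual rule of 1.5 times the IQR. A compact way to express the same rule is sketched below with two hypothetical helpers, iqr_limits() and is_inlier(), applied to a small made-up data frame rather than GDP_sample.

```r
# Hypothetical helper: lower and upper limits of the 1.5 * IQR rule
iqr_limits <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  iqr <- q[2] - q[1]
  c(lower = q[1] - 1.5 * iqr, upper = q[2] + 1.5 * iqr)
}

# Hypothetical helper: TRUE for values strictly inside the limits
is_inlier <- function(x) {
  lim <- iqr_limits(x)
  x > lim[["lower"]] & x < lim[["upper"]]
}

# Made-up example data with one outlier in each column
toy <- data.frame(
  Happiness_index = c(2.4, 5.0, 5.5, 5.8, 6.0, 6.2, 6.5, 7.0),
  Inflation_rate  = c(8, 7, 9, 6, 8, 60, 7, 9)
)

# Keep only rows that are inliers in every column
keep <- Reduce(`&`, lapply(toy, is_inlier))
toy_no_outliers <- toy[keep, ]
print(toy_no_outliers)
```

As in the original code, the limits are computed on the full data before any filtering, so the order in which the columns are filtered does not change the result.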
Null hypothesis: none of the following factors makes a difference to happiness, i.e. each regression coefficient equals zero.
#table test of multiple linear regression
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita + Unemployment,
data=GDP_no_outliers)
summary(model)
##
## Call:
## lm(formula = Happiness_index ~ Corruption_index + Inflation_rate +
## GDP_per_capita + Unemployment, data = GDP_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.7925 -0.4147 0.1235 0.3673 0.8375
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.741614 0.320106 14.813 7.08e-16 ***
## Corruption_index 0.020847 0.007879 2.646 0.01253 *
## Inflation_rate -0.018497 0.017773 -1.041 0.30581
## GDP_per_capita 0.025302 0.008305 3.047 0.00461 **
## Unemployment -0.024810 0.024752 -1.002 0.32369
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5096 on 32 degrees of freedom
## Multiple R-squared: 0.7073, Adjusted R-squared: 0.6707
## F-statistic: 19.33 on 4 and 32 DF, p-value: 3.582e-08
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter', 'Corruption index',
'Inflation rate','GDP per capita','Unemployment')
kable(summary_table, caption="Summary table of all nations worldwide",
digits = c(4,4,4,4), align = 'cccc')
| | Estimate | Standard Error | t-value | p-value |
|---|---|---|---|---|
| Intercept parameter | 4.7416 | 0.3201 | 14.8126 | 0.0000 |
| Corruption index | 0.0208 | 0.0079 | 2.6459 | 0.0125 |
| Inflation rate | -0.0185 | 0.0178 | -1.0407 | 0.3058 |
| GDP per capita | 0.0253 | 0.0083 | 3.0468 | 0.0046 |
| Unemployment | -0.0248 | 0.0248 | -1.0023 | 0.3237 |
p_value of Corruption index (0.0125) and GDP per capita (0.0046) is < 0.1 => reject; p_value of Inflation rate (0.31) and Unemployment (0.32) is > 0.1 => fail to reject, and Unemployment has the largest p_value
=> Unemployment is not needed
#multiple-linear regression without unemployment
model <- lm(Happiness_index ~ Corruption_index + Inflation_rate + GDP_per_capita,
data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter','Corruption index',
'Inflation rate','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
digits = c(4,4,4,4), align = 'cccc')
| | Estimate | Standard Error | t-value | p-value |
|---|---|---|---|---|
| Intercept parameter | 4.6753 | 0.3132 | 14.9266 | 0.0000 |
| Corruption index | 0.0198 | 0.0078 | 2.5316 | 0.0163 |
| Inflation rate | -0.0245 | 0.0167 | -1.4678 | 0.1516 |
| GDP per capita | 0.0264 | 0.0082 | 3.2026 | 0.0030 |
p_value of Corruption index (0.0163) and GDP per capita (0.0030) is < 0.1 => reject; p_value of Inflation rate is 0.15 > 0.1 => fail to reject
=> Inflation rate is not needed
#multiple-linear regression without inflation rate
model <- lm(Happiness_index ~ Corruption_index + GDP_per_capita,
data=GDP_no_outliers)
summary_table <- summary(model)$coefficients
colnames(summary_table) <- c('Estimate','Standard Error','t-value','p-value')
rownames(summary_table) <- c('Intercept parameter',
'Corruption index','GDP per capita')
kable(summary_table, caption="Summary table of all nations worldwide",
digits = c(4,4,4,4), align = 'cccc')
| | Estimate | Standard Error | t-value | p-value |
|---|---|---|---|---|
| Intercept parameter | 4.4054 | 0.2578 | 17.0863 | 0.000 |
| Corruption index | 0.0195 | 0.0079 | 2.4633 | 0.019 |
| GDP per capita | 0.0292 | 0.0081 | 3.5880 | 0.001 |
All p-values (0.019 and 0.001) are < 0.1 => reject the null hypothesis for both factors, and both estimates are positive
Conclusion: Corruption index and GDP per capita are two driving factors.
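The step-by-step removal of the least significant factor used above is a form of backward elimination. The sketch below automates that procedure with a hypothetical helper, backward_eliminate(), on simulated data; the data frame toy and its coefficients are assumptions chosen for illustration, not the report's dataset, and the 0.1 threshold mirrors the one used in this section.

```r
# Simulated stand-in data: only two predictors truly affect the response
set.seed(42)
n <- 60
toy <- data.frame(
  Corruption_index = runif(n, 20, 90),
  Inflation_rate   = runif(n, 0, 20),
  GDP_per_capita   = runif(n, 1, 60),
  Unemployment     = runif(n, 2, 15)
)
toy$Happiness_index <- 4.4 + 0.02 * toy$Corruption_index +
  0.03 * toy$GDP_per_capita + rnorm(n, sd = 0.3)

# Hypothetical helper: repeatedly drop the predictor with the largest
# p-value until every remaining p-value is at most alpha
backward_eliminate <- function(data, response, predictors, alpha = 0.1) {
  while (length(predictors) > 0) {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    p_values <- setNames(coefs[-1, 4], rownames(coefs)[-1])
    worst <- names(which.max(p_values))
    if (p_values[[worst]] <= alpha) return(fit)
    predictors <- setdiff(predictors, worst)
  }
  # No predictor survived: fall back to the intercept-only model
  lm(reformulate("1", response = response), data = data)
}

final_fit <- backward_eliminate(
  toy, "Happiness_index",
  c("Corruption_index", "Inflation_rate", "GDP_per_capita", "Unemployment")
)
kept <- names(coef(final_fit))[-1]
print(kept)
```

Dropping one variable at a time matters because the remaining p-values change after each refit, as the Inflation rate p-value does between the first and second tables above.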
#correlation test of each pair to test the influence of corruption index and GDP per capita
# Pearson correlation of Happiness index with each factor, rounded to 3 decimals
cor2 <- round(cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$GDP_per_capita)$estimate, 3)
cor4 <- round(cor.test(GDP_no_outliers$Happiness_index, GDP_no_outliers$Corruption_index)$estimate, 3)
correlation_table <- rbind(cor2,cor4)
colnames(correlation_table) <- c('Correlation')
rownames(correlation_table) <- c('Happiness index & GDP per capita',
'Happiness index & Corruption index')
kable(correlation_table,
caption = "Correlation results of GDP per capita and Corruption index",
digits = c(4,4,4,4),
align = 'cccc')
| | Correlation |
|---|---|
| Happiness index & GDP per capita | 0.788 |
| Happiness index & Corruption index | 0.746 |
GDP per capita (0.788) and Corruption index (0.746) are both positively correlated with the Happiness index, and both correlations are high.
#plot the graph to illustrate
ggplot(GDP_no_outliers, aes(x = GDP_per_capita, y = Happiness_index)) +
geom_point(color = 'black', fill = 'orange', shape = 21, alpha = 1, size = 3, stroke = 0.9,
position = "jitter") +
geom_smooth(method = "lm", formula = y ~ x, linewidth = 1.2, color = "blue", fill = "transparent") +
labs(title = 'Relationship between GDP per Capita and Happiness index',
y = 'Happiness index', x = 'GDP per Capita')
ggplot(GDP_no_outliers, aes(x = Corruption_index, y = Happiness_index)) +
geom_point(color = 'black', fill = 'orange', shape = 21, alpha = 1, size = 3, stroke = 0.9,
position = "jitter") +
geom_smooth(method = "lm", formula = y ~ x, linewidth = 1.2, color = "blue", fill = "transparent") +
labs(title = 'Relationship between Corruption index and Happiness index',
y = 'Happiness index', x = 'Corruption_index')
#R square test to test the fitness of the model
Rsquared <- summary(model)$r.squared
print(Rsquared)
## [1] 0.6783591
67.84% of the variation in the y values is accounted for by the x values => good
Country_test <- subset(GDP_sample, Country == 'South Africa',
select=c("Country","Corruption_index",
"GDP_per_capita","Happiness_index"))
Country_test
## # A tibble: 1 × 4
## Country Corruption_index GDP_per_capita Happiness_index
## <chr> <dbl> <dbl> <dbl>
## 1 South Africa 43 6.8 5.2
new_data <- data.frame(Corruption_index = Country_test[1,2],
GDP_per_capita = Country_test[1,3])
prediction_interval <- predict(model,
newdata = new_data,
interval = "prediction",
level = 0.9)
print(prediction_interval)
## fit lwr upr
## 1 5.464745 4.61857 6.31092
Lower <- prediction_interval[1,2]
Upper <- prediction_interval[1,3]
Fit <- prediction_interval[1,1]
Real <- Country_test[1,4]
summary_table <- cbind(Lower, Fit, Upper,Real)
colnames(summary_table) <- c('Lower', 'Fit', 'Upper','Real')
rownames(summary_table) <- c('Chosen country')
kable(summary_table, caption="Confidence Interval of Predicted Nation",
digits = c(4,4,4,4), align = 'cccc')
| | Lower | Fit | Upper | Real |
|---|---|---|---|---|
| Chosen country | 4.6186 | 5.4647 | 6.3109 | 5.2 |
We can see that for the chosen country (South Africa), the happiness index predicted from the two factors GDP per capita and Corruption index is 5.4647, which is close to the actual value of 5.2
=> the actual value lies inside the prediction interval, so the prediction is good
| | Lower | Fit | Upper | Real | Manual fit |
|---|---|---|---|---|---|
| Chosen country | 4.6186 | 5.4647 | 6.3109 | 5.2 | 5.4647 |
The manual result is exactly the same as the result of the predict function of R.
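A manual fit of this kind is just the intercept plus each coefficient times the corresponding predictor value. The sketch below shows the calculation on simulated data, since the report's dataset is not reproduced here; the data frame toy is an assumption, while the predictor values 43 and 6.8 are the South Africa values shown above.

```r
# Simulated stand-in data (NOT the report's dataset)
set.seed(7)
n <- 30
toy <- data.frame(
  Corruption_index = runif(n, 20, 90),
  GDP_per_capita   = runif(n, 1, 60)
)
toy$Happiness_index <- 4.4 + 0.02 * toy$Corruption_index +
  0.03 * toy$GDP_per_capita + rnorm(n, sd = 0.5)

fit <- lm(Happiness_index ~ Corruption_index + GDP_per_capita, data = toy)

# Predictor values for the country to be predicted
new_country <- data.frame(Corruption_index = 43, GDP_per_capita = 6.8)

# Manual fit: intercept + slope_1 * x_1 + slope_2 * x_2
manual_fit <- sum(coef(fit) * c(1, new_country$Corruption_index,
                                new_country$GDP_per_capita))

# Same point prediction from predict()
predict_fit <- unname(predict(fit, newdata = new_country))

print(c(manual = manual_fit, predict = predict_fit))
```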
#Confidence interval of the sample mean = 95% through the t.test
# Lower bound, upper bound, and sample mean for each indicator
lower_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$conf.int[1]
upper_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$conf.int[2]
lower_2 <- t.test(GDP_no_outliers$GDP_per_capita)$conf.int[1]
upper_2 <- t.test(GDP_no_outliers$GDP_per_capita)$conf.int[2]
lower_3 <- t.test(GDP_no_outliers$Inflation_rate)$conf.int[1]
upper_3 <- t.test(GDP_no_outliers$Inflation_rate)$conf.int[2]
fit_1 <- t.test(GDP_no_outliers$GDP_growth_rate)$estimate
fit_2 <- t.test(GDP_no_outliers$GDP_per_capita)$estimate
fit_3 <- t.test(GDP_no_outliers$Inflation_rate)$estimate
A <- c(lower_1,fit_1,upper_1)
B <- c(lower_2,fit_2,upper_2)
C <- c(lower_3,fit_3,upper_3)
table_1 <- rbind(A,B,C)
rownames(table_1) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_1) <- c('Lower','Sample','Upper')
kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
digits = c(4,4,4), align = 'ccc')
| | Lower | Sample | Upper |
|---|---|---|---|
| GDP growth rate | 3.0759 | 4.1286 | 5.1813 |
| GDP per capita | 10.2692 | 15.7771 | 21.2851 |
| Inflation rate | 7.3731 | 8.7886 | 10.2040 |
#draw table
table_2 <- rbind(confidence_interval_1, confidence_interval_2, confidence_interval_3)
rownames(table_2) <- c('GDP growth rate','GDP per capita','Inflation rate')
colnames(table_2) <- c('Lower','Upper')
kable(table_1, caption="Confidence Interval of sample mean with 95% confidence",
digits = c(4,4,4), align = 'ccc')
| | Lower | Sample | Upper |
|---|---|---|---|
| GDP growth rate | 3.0759 | 4.1286 | 5.1813 |
| GDP per capita | 10.2692 | 15.7771 | 21.2851 |
| Inflation rate | 7.3731 | 8.7886 | 10.2040 |
kable(table_2, caption="Confidence Interval of sample mean calculated manually",
digits = c(4,4,4), align = 'ccc')
| | Lower | Upper |
|---|---|---|
| GDP growth rate | 3.2792 | 4.9780 |
| GDP per capita | 11.3328 | 20.2215 |
| Inflation rate | 7.6465 | 9.9307 |
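After all of these calculations, we can see that the manual intervals are centred on the same sample means as those from t.test(), and the two sets of confidence intervals are broadly similar. The manual intervals are somewhat narrower; a narrower interval does not by itself mean higher precision, since its width depends on the critical value and standard error used in the manual calculation.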
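For reference, a 95% confidence interval for a mean can be computed by hand as \(\bar{x} \pm t_{0.975,\,n-1} \cdot s/\sqrt{n}\), which is what t.test() reports. The sketch below checks this on simulated data (the vector x is made up) and also shows that swapping the t quantile for a normal quantile gives a slightly narrower interval.

```r
# Simulated sample (NOT one of the report's variables)
set.seed(123)
x <- rnorm(35, mean = 4, sd = 3)

n  <- length(x)
se <- sd(x) / sqrt(n)            # standard error of the mean

# Manual 95% interval with the t quantile (n - 1 degrees of freedom)
manual_ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se

# Interval reported by t.test()
builtin_ci <- as.numeric(t.test(x)$conf.int)

# Same formula with a normal quantile instead of the t quantile
z_ci <- mean(x) + c(-1, 1) * qnorm(0.975) * se

print(round(rbind(manual_t = manual_ci, t_test = builtin_ci, normal_z = z_ci), 4))
```

The t-based manual interval matches t.test() exactly, so any gap between a manual table and t.test() usually points to a different quantile, confidence level, or standard deviation formula.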