3 How can countries be segmented based on key happiness-related
variables (Family, Life Expectancy, Freedom, Trust, and Generosity) to
identify patterns in well-being and socioeconomic
characteristics?
#Standardizing the clustering variables
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity")]))
#Euclidean distance of each unit from the overall (standardized) mean
mydata$Dissimilarity <- sqrt(mydata_clu_std$Family^2 + mydata_clu_std$Life_Expectancy^2 + mydata_clu_std$Freedom^2 + mydata_clu_std$Trust^2 + mydata_clu_std$Generosity^2)
#Units with the largest dissimilarity are outlier candidates
head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")])
## ID Dissimilarity
## 129 129 4.585831
## 148 148 4.341715
## 154 154 3.750601
## 9 9 3.724099
## 3 3 3.697382
## 28 28 3.602008
I identified ID129 and ID148 as potential outliers, as there is a clear jump in the dissimilarity values between these two units and the rest. I decided to remove them.
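The jump is easiest to see when the dissimilarity values are plotted in decreasing order; a minimal sketch using base graphics (not part of the original analysis):
plot(sort(mydata$Dissimilarity, decreasing = TRUE),
type = "b", #Points connected by lines
xlab = "Unit (ordered by dissimilarity)",
ylab = "Dissimilarity") #The gap after the first two points marks the two outliers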
print(mydata[c(129, 148), ])
## Country Region ID Happiness_Score
## 129 Myanmar Southeastern Asia 129 4.307
## 148 Central African Republic Sub-Saharan Africa 148 3.678
## GDP_per_capita Family Life_Expectancy Freedom Trust Generosity Dystopia
## 129 0.27108 0.70905 0.48246 0.44017 0.19034 0.79588 1.41805
## 148 0.07850 0.00000 0.06699 0.48879 0.08289 0.23835 2.72230
## Dissimilarity
## 129 4.585831
## 148 4.341715
mydata <- mydata %>%
filter(!ID %in% c(129, 148)) #Removing the two outlier units
mydata$ID <- seq(1, nrow(mydata)) #Re-numbering the IDs
mydata_clu_std <- as.data.frame(scale(mydata[c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity")])) #Re-standardizing without the outliers
rownames(mydata_clu_std) <- mydata$Country
Distances <- get_dist(mydata_clu_std,
method = "euclidean")
fviz_dist(Distances, #Showing matrix of distances
gradient = list(low = "darkred",
mid = "grey95",
high = "white"))

library(factoextra)
get_clust_tendency(mydata_clu_std, #Hopkins statistics
n = nrow(mydata_clu_std) - 1,
graph = FALSE)
## $hopkins_stat
## [1] 0.6830286
##
## $plot
## NULL
The Hopkins statistic is 0.68; since it is above 0.5, the data is clusterable. Using hierarchical clustering (a dendrogram) and K-means diagnostics (the elbow method and silhouette analysis), I will now determine how many clusters to use.
WARD <- mydata_clu_std %>%
get_dist(method = "euclidean") %>%
hclust(method = "ward.D2")
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 156
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Based on the dendrogram, I would choose 3 clusters, as that is where the largest jump in the merge heights (the vertical distances) occurs.
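To make this cut visible, the dendrogram can be redrawn with the three clusters highlighted; a minimal sketch (the k = 3 cut and the rectangles are illustrative additions, not part of the original call):
fviz_dend(WARD,
k = 3, #Cut the tree into 3 clusters
rect = TRUE) #Draw a rectangle around each cluster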
library(factoextra)
library(NbClust)
fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
labs(subtitle = "Elbow method")

With the elbow method the slope changes most markedly at 3 clusters, so I would choose 3 clusters.
fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
labs(subtitle = "Silhouette analysis")

The highest average silhouette width is at 2 clusters.
library(NbClust)
NbClust(mydata_clu_std,
distance = "euclidean",
min.nc = 2, max.nc = 10,
method = "kmeans",
index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##

## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 6 proposed 2 as the best number of clusters
## * 8 proposed 3 as the best number of clusters
## * 2 proposed 5 as the best number of clusters
## * 2 proposed 7 as the best number of clusters
## * 2 proposed 8 as the best number of clusters
## * 3 proposed 10 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
## $All.index
## KL CH Hartigan CCC Scott Marriot TrCovW TraceW
## 2 1.3166 73.1674 44.9343 -1.5735 195.3316 35315734948 11930.198 525.3834
## 3 2.0042 69.2725 27.5202 -0.2824 388.4529 23041609882 7117.389 406.7125
## 4 1.3647 63.2457 20.7245 0.4435 513.8824 18331678934 5527.242 344.7094
## 5 1.5892 58.6942 7.5026 1.1318 615.2159 14959454904 4127.544 303.3490
## 6 3.1160 50.4525 8.3136 -0.1628 639.4804 18438577563 3618.243 288.9902
## 7 0.0751 45.4546 26.1072 -0.8857 700.8194 16937798874 3464.357 273.8143
## 8 14.0319 49.1849 8.4114 2.1103 812.2501 10829902828 2158.047 232.9906
## 9 0.4673 46.2198 9.0653 1.8704 848.3482 10875136180 1939.257 220.4610
## 10 15.0520 44.3215 5.8537 1.9318 891.5193 10180380410 1769.317 207.6552
## Friedman Rubin Cindex DB Silhouette Duda Pseudot2 Beale Ratkowsky
## 2 3.1311 1.4751 0.4039 1.4501 0.2746 0.9513 5.1717 0.1581 0.3905
## 3 5.4273 1.9055 0.3448 1.3972 0.2695 0.8513 11.1827 0.5389 0.3947
## 4 7.3639 2.2483 0.3454 1.3639 0.2507 1.3808 -20.1320 -0.8440 0.3707
## 5 9.0857 2.5548 0.3854 1.3663 0.2563 1.9208 -34.5147 -1.4662 0.3477
## 6 9.5448 2.6818 0.3329 1.3624 0.2363 1.4354 -13.9523 -0.9196 0.3224
## 7 11.1882 2.8304 0.3129 1.3345 0.2142 1.5957 -15.6800 -1.1295 0.3028
## 8 13.2899 3.3263 0.3940 1.3100 0.2380 1.2300 -6.1715 -0.5644 0.2953
## 9 13.9415 3.5154 0.3822 1.3482 0.2247 1.2898 -7.8649 -0.6740 0.2817
## 10 14.8895 3.7321 0.3761 1.2708 0.2209 1.0139 -0.4671 -0.0413 0.2705
## Ball Ptbiserial Frey McClain Dunn Hubert SDindex Dindex SDbw
## 2 262.6917 0.4451 0.1263 0.7159 0.1253 0.0023 1.8367 1.7447 1.2727
## 3 135.5708 0.5458 0.8865 1.1204 0.1081 0.0034 1.7352 1.5212 0.9307
## 4 86.1773 0.5084 0.2018 1.6782 0.1272 0.0036 1.6230 1.3906 0.5690
## 5 60.6698 0.5213 0.9459 1.9563 0.1523 0.0041 1.5588 1.3112 0.4438
## 6 48.1650 0.4977 -1.6330 2.2866 0.0510 0.0039 1.7100 1.2835 0.5630
## 7 39.1163 0.4533 0.0967 2.7039 0.0473 0.0040 1.8817 1.2448 0.3736
## 8 29.1238 0.4725 0.4949 3.1287 0.1271 0.0044 1.8822 1.1630 0.3509
## 9 24.4957 0.4566 0.5625 3.4987 0.1109 0.0045 1.7358 1.1291 0.3323
## 10 20.7655 0.4459 0.6159 3.7482 0.1109 0.0047 1.7848 1.0938 0.3176
##
## $All.CriticalValues
## CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2 0.6588 52.3182 0.9775
## 3 0.6513 34.2721 0.7467
## 4 0.5995 48.7619 1.0000
## 5 0.5965 48.7016 1.0000
## 6 0.5502 37.6080 1.0000
## 7 0.5399 35.7855 1.0000
## 8 0.5287 29.4215 1.0000
## 9 0.5022 34.6984 1.0000
## 10 0.5094 32.7506 1.0000
##
## $Best.nc
## KL CH Hartigan CCC Scott Marriot TrCovW
## Number_clusters 10.000 2.0000 7.0000 8.0000 3.0000 3 3.000
## Value_Index 15.052 73.1674 17.7936 2.1103 193.1213 7564194118 4812.809
## TraceW Friedman Rubin Cindex DB Silhouette Duda
## Number_clusters 3.0000 3.0000 8.0000 7.0000 10.0000 2.0000 2.0000
## Value_Index 56.6678 2.2963 -0.3069 0.3129 1.2708 0.2746 0.9513
## PseudoT2 Beale Ratkowsky Ball PtBiserial Frey McClain
## Number_clusters 2.0000 2.0000 3.0000 3.0000 3.0000 1 2.0000
## Value_Index 5.1717 0.1581 0.3947 127.1209 0.5458 NA 0.7159
## Dunn Hubert SDindex Dindex SDbw
## Number_clusters 5.0000 0 5.0000 0 10.0000
## Value_Index 0.1523 0 1.5588 0 0.3176
##
## $Best.partition
## Switzerland Iceland Denmark
## 3 3 3
## Norway Canada Finland
## 3 3 3
## Netherlands Sweden New Zealand
## 3 3 3
## Australia Israel Costa Rica
## 3 1 1
## Austria Mexico United States
## 3 1 3
## Brazil Luxembourg Ireland
## 1 3 3
## Belgium United Arab Emirates United Kingdom
## 3 3 3
## Oman Venezuela Singapore
## 3 1 3
## Panama Germany Chile
## 1 3 1
## Qatar France Argentina
## 3 1 1
## Czech Republic Uruguay Colombia
## 1 3 1
## Thailand Saudi Arabia Spain
## 3 1 1
## Malta Taiwan Kuwait
## 3 1 1
## Suriname Trinidad and Tobago El Salvador
## 1 1 1
## Guatemala Uzbekistan Slovakia
## 1 3 1
## Japan South Korea Ecuador
## 1 1 1
## Bahrain Italy Bolivia
## 3 1 1
## Moldova Paraguay Kazakhstan
## 1 1 1
## Slovenia Lithuania Nicaragua
## 1 1 3
## Peru Belarus Poland
## 1 1 1
## Malaysia Croatia Libya
## 1 1 1
## Russia Jamaica North Cyprus
## 1 1 1
## Cyprus Algeria Kosovo
## 1 1 2
## Turkmenistan Mauritius Hong Kong
## 3 1 3
## Estonia Indonesia Vietnam
## 1 1 1
## Turkey Kyrgyzstan Nigeria
## 1 1 2
## Bhutan Azerbaijan Pakistan
## 3 1 2
## Jordan Montenegro China
## 1 1 1
## Zambia Romania Serbia
## 2 1 1
## Portugal Latvia Philippines
## 1 1 1
## Somaliland region Morocco Macedonia
## 3 2 1
## Mozambique Albania Bosnia and Herzegovina
## 2 1 1
## Lesotho Dominican Republic Laos
## 2 1 3
## Mongolia Swaziland Greece
## 1 2 1
## Lebanon Hungary Honduras
## 1 1 1
## Tajikistan Tunisia Palestinian Territories
## 2 2 1
## Bangladesh Iran Ukraine
## 2 2 1
## Iraq South Africa Ghana
## 2 2 2
## Zimbabwe Liberia India
## 2 2 2
## Sudan Haiti Congo (Kinshasa)
## 2 2 2
## Nepal Ethiopia Sierra Leone
## 2 2 2
## Mauritania Kenya Djibouti
## 2 2 2
## Armenia Botswana Georgia
## 1 2 2
## Malawi Sri Lanka Cameroon
## 2 1 2
## Bulgaria Egypt Yemen
## 1 2 2
## Angola Mali Congo (Brazzaville)
## 2 2 2
## Comoros Uganda Senegal
## 2 2 2
## Gabon Niger Cambodia
## 2 2 2
## Tanzania Madagascar Chad
## 2 2 2
## Guinea Ivory Coast Burkina Faso
## 2 2 2
## Afghanistan Rwanda Benin
## 2 3 2
## Syria Burundi Togo
## 2 2 2
We used four approaches to determine the optimal number of clusters:
Dendrogram (hierarchical clustering): suggested 3 clusters.
Elbow method: largest slope change at 3 clusters.
Silhouette analysis: optimal at 2 clusters.
NbClust majority rule: 3 clusters.
Since most criteria point to 3 clusters, we proceed with 3 clusters.
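Because kmeans() chooses its starting leaders at random, the exact solution can vary slightly between runs even with nstart = 25. Fixing a seed before the call (the value below is arbitrary and not part of the original analysis) makes the grouping reproducible:
set.seed(123) #Arbitrary seed so that the kmeans() call below returns the same solution on every run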
Clustering <- kmeans(mydata_clu_std,
centers = 3, #Number of groups
nstart = 25) #Number of attempts at different starting leader positions
library(factoextra)
fviz_cluster(Clustering,
palette = "Set1",
repel = FALSE,
ggtheme = theme_bw(),
data = mydata_clu_std)

The cluster plot projects the 5 standardized variables onto their first two principal components, which together show about 67.8% (46.8% + 21%) of the information in the data.
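These percentages can be reproduced directly from a principal component analysis of the standardized clustering variables; a minimal sketch (the object name PCA is only illustrative):
PCA <- prcomp(mydata_clu_std) #Variables are already standardized, so no extra scaling is needed
summary(PCA)$importance["Proportion of Variance", 1:2] #Share of variability captured by the first two components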
Clustering
## K-means clustering with 3 clusters of sizes 31, 73, 52
##
## Cluster means:
## Family Life_Expectancy Freedom Trust Generosity
## 1 0.8028006 0.7727040 1.19108894 1.4387314 1.10542963
## 2 0.3320646 0.4634078 -0.05718025 -0.4072969 -0.40343257
## 3 -0.9447603 -1.1112038 -0.62979997 -0.2859231 -0.09264887
##
## Clustering vector:
## Switzerland Iceland Denmark
## 1 1 1
## Norway Canada Finland
## 1 1 1
## Netherlands Sweden New Zealand
## 1 1 1
## Australia Israel Costa Rica
## 1 2 2
## Austria Mexico United States
## 1 2 1
## Brazil Luxembourg Ireland
## 2 1 1
## Belgium United Arab Emirates United Kingdom
## 1 1 1
## Oman Venezuela Singapore
## 1 2 1
## Panama Germany Chile
## 2 1 2
## Qatar France Argentina
## 1 2 2
## Czech Republic Uruguay Colombia
## 2 1 2
## Thailand Saudi Arabia Spain
## 1 2 2
## Malta Taiwan Kuwait
## 1 2 2
## Suriname Trinidad and Tobago El Salvador
## 2 2 2
## Guatemala Uzbekistan Slovakia
## 2 1 2
## Japan South Korea Ecuador
## 2 2 2
## Bahrain Italy Bolivia
## 2 2 2
## Moldova Paraguay Kazakhstan
## 2 2 2
## Slovenia Lithuania Nicaragua
## 2 2 1
## Peru Belarus Poland
## 2 2 2
## Malaysia Croatia Libya
## 2 2 2
## Russia Jamaica North Cyprus
## 2 2 2
## Cyprus Algeria Kosovo
## 2 2 3
## Turkmenistan Mauritius Hong Kong
## 2 2 1
## Estonia Indonesia Vietnam
## 2 2 2
## Turkey Kyrgyzstan Nigeria
## 2 2 3
## Bhutan Azerbaijan Pakistan
## 1 2 3
## Jordan Montenegro China
## 2 2 2
## Zambia Romania Serbia
## 3 2 2
## Portugal Latvia Philippines
## 2 2 2
## Somaliland region Morocco Macedonia
## 1 3 2
## Mozambique Albania Bosnia and Herzegovina
## 3 2 2
## Lesotho Dominican Republic Laos
## 3 2 1
## Mongolia Swaziland Greece
## 2 3 2
## Lebanon Hungary Honduras
## 2 2 2
## Tajikistan Tunisia Palestinian Territories
## 2 3 2
## Bangladesh Iran Ukraine
## 3 3 2
## Iraq South Africa Ghana
## 3 3 3
## Zimbabwe Liberia India
## 3 3 3
## Sudan Haiti Congo (Kinshasa)
## 3 3 3
## Nepal Ethiopia Sierra Leone
## 3 3 3
## Mauritania Kenya Djibouti
## 3 3 3
## Armenia Botswana Georgia
## 2 3 3
## Malawi Sri Lanka Cameroon
## 3 2 3
## Bulgaria Egypt Yemen
## 2 3 3
## Angola Mali Congo (Brazzaville)
## 3 3 3
## Comoros Uganda Senegal
## 3 3 3
## Gabon Niger Cambodia
## 3 3 3
## Tanzania Madagascar Chad
## 3 3 3
## Guinea Ivory Coast Burkina Faso
## 3 3 3
## Afghanistan Rwanda Benin
## 3 1 3
## Syria Burundi Togo
## 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 83.86715 161.73718 160.97704
## (between_SS / total_SS = 47.5 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups
## Family Life_Expectancy Freedom Trust Generosity
## 1 0.8028006 0.7727040 1.19108894 1.4387314 1.10542963
## 2 0.3320646 0.4634078 -0.05718025 -0.4072969 -0.40343257
## 3 -0.9447603 -1.1112038 -0.62979997 -0.2859231 -0.09264887
Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
##
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
##
## extract
Figure <- pivot_longer(Figure, cols = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"))
Figure$Group <- factor(Figure$ID,
levels = c(1, 2, 3),
labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
levels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"),
labels = c("Family", "Life_Expectancy", "Freedom", "Trust", "Generosity"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
geom_hline(yintercept = 0) +
theme_bw() +
geom_point(aes(shape = Group, col = Group), size = 3) +
geom_line(aes(group = ID), linewidth = 1) +
ylab("Averages") +
xlab("Cluster variables")+
ylim(-2.2, 2.2) +
theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

Group 1 (High Performers in Happiness):
Countries in this cluster consistently exhibit above-average values for all analyzed variables: Family, Life Expectancy, Freedom, Trust, and Generosity. These nations tend to have strong institutional and social support systems, robust economies, and high happiness scores. As shown below, Group 1 also has the highest average GDP per capita contribution among the clusters.
Examples: Switzerland, Iceland, Norway, Canada, and Hong Kong.
Group 2 (Moderate Performers in Happiness):
This cluster consists of countries with mixed performance across the variables. Family and Life Expectancy sit around or slightly above the global average, while Trust and Generosity fall below it. These nations are often transitioning economies, with moderate happiness levels and stable but average institutional support systems.
Examples: Slovenia, Brazil, Chile, Romania, and Indonesia.
Group 3 (Low Performers in Happiness):
Countries in this group score below average across all variables, particularly in Family, Life Expectancy, and Freedom. These nations face significant socioeconomic challenges, weaker institutions, and lower levels of social support. This cluster represents countries struggling with systemic issues that hold down their happiness scores.
Examples: Zimbabwe, Afghanistan, Haiti, Yemen, and Burundi.
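The full assignment behind these examples can be listed directly from the k-means result; a minimal sketch (the country names are available because they were set as row names above):
split(names(Clustering$cluster), Clustering$cluster) #Countries belonging to each of the three clusters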
mydata$Group <- Clustering$cluster
#Checking if clustering variables successfully differentiate between groups
fit <- aov(cbind(Family, Life_Expectancy, Freedom, Trust, Generosity) ~ as.factor(Group),
data = mydata)
summary(fit)
## Response Family :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.0789 2.53943 70.693 < 2.2e-16 ***
## Residuals 153 5.4961 0.03592
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Life_Expectancy :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 5.8669 2.93344 132.97 < 2.2e-16 ***
## Residuals 153 3.3752 0.02206
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Freedom :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.4899 0.74496 55.022 < 2.2e-16 ***
## Residuals 153 2.0715 0.01354
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Trust :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 1.1722 0.58610 82.724 < 2.2e-16 ***
## Residuals 153 1.0840 0.00709
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Generosity :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Group) 2 0.71448 0.35724 36.654 9.873e-14 ***
## Residuals 153 1.49120 0.00975
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Response for Family:
H0: μ(Family, G1) = μ(Family, G2) = μ(Family, G3)
H1: At least one μ(Family, j) is different.
We reject H0 at p < 0.001, and the same holds for every other clustering variable: the clusters differ significantly in all clustering variables.
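To see which pairs of clusters differ on a given variable, a pairwise post-hoc comparison can follow each ANOVA; a minimal sketch for Family (the same call works for the other variables):
TukeyHSD(aov(Family ~ as.factor(Group), data = mydata)) #Pairwise cluster differences with family-wise error control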
aggregate(mydata$GDP_per_capita,
by = list(mydata$Group),
FUN = mean)
## Group.1 x
## 1 1 1.1773374
## 2 2 0.9976774
## 3 3 0.4617729
On average, Group 1 has the largest GDP per capita contribution to the Happiness Score.
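The difference is also easy to inspect visually; a minimal sketch using ggplot2 (loaded above):
ggplot(mydata, aes(x = as.factor(Group), y = GDP_per_capita)) +
geom_boxplot() + #Distribution of GDP per capita within each cluster
xlab("Group") +
ylab("GDP per capita contribution") +
theme_bw()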
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
leveneTest(mydata$GDP_per_capita, as.factor(mydata$Group))
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 2 1.7196 0.1826
## 153
H0: σ²(GDP_per_capita, G1) = σ²(GDP_per_capita, G2) = σ²(GDP_per_capita, G3)
H1: At least one σ²(GDP_per_capita, j) is different.
We cannot reject H0, so we can assume homogeneity of variances.
library(dplyr)
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
mydata %>%
group_by(as.factor(mydata$Group)) %>%
shapiro_test(GDP_per_capita)
## # A tibble: 3 × 4
## `as.factor(mydata$Group)` variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 1 GDP_per_capita 0.799 0.0000504
## 2 2 GDP_per_capita 0.984 0.491
## 3 3 GDP_per_capita 0.959 0.0703
H0: GDP per capita is normally distributed in group j.
H1: GDP per capita is not normally distributed in group j.
For Group 1 we reject H0 at p < 0.001; for Groups 2 and 3 we cannot reject H0 (p = 0.491 and p = 0.070).
Because normality is violated in Group 1, the assumptions for a parametric comparison are not met, so we check the result with the non-parametric Kruskal-Wallis test.
kruskal.test(GDP_per_capita ~ as.factor(Group),
data = mydata)
##
## Kruskal-Wallis rank sum test
##
## data: GDP_per_capita by as.factor(Group)
## Kruskal-Wallis chi-squared = 79.541, df = 2, p-value < 2.2e-16
H0: The distribution (location) of GDP per capita is the same in all groups.
H1: The distribution (location) of GDP per capita differs in at least one of the groups.
We reject H0 at p < 0.001: the groups differ significantly in GDP per capita.
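To identify which pairs of groups drive this difference, a pairwise Dunn test can follow the Kruskal-Wallis test; a minimal sketch using rstatix (loaded above), with Holm-adjusted p-values:
mydata %>%
mutate(Group = as.factor(Group)) %>%
dunn_test(GDP_per_capita ~ Group, p.adjust.method = "holm") #Pairwise post-hoc comparisons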
chi_square <- chisq.test(mydata$Region, as.factor(mydata$Group))
## Warning in chisq.test(mydata$Region, as.factor(mydata$Group)): Chi-squared
## approximation may be incorrect
chi_square
##
## Pearson's Chi-squared test
##
## data: mydata$Region and as.factor(mydata$Group)
## X-squared = 154.96, df = 18, p-value < 2.2e-16
H0: There is no association between Region and the classification of countries into 3 groups.
H1: There is an association between Region and the classification of countries into 3 groups.
We reject H0 at p < 0.001.
addmargins(chi_square$observed)
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 2 0 0 2
## Central and Eastern Europe 1 26 2 29
## Eastern Asia 1 5 0 6
## Latin America and Caribbean 2 19 1 22
## Middle East and Northern Africa 3 10 7 20
## North America 2 0 0 2
## Southeastern Asia 3 4 1 8
## Southern Asia 1 1 5 7
## Sub-Saharan Africa 2 1 36 39
## Western Europe 14 7 0 21
## Sum 31 73 52 156
addmargins(round(chi_square$expected, 2))
## as.factor(mydata$Group)
## mydata$Region 1 2 3 Sum
## Australia and New Zealand 0.40 0.94 0.67 2.01
## Central and Eastern Europe 5.76 13.57 9.67 29.00
## Eastern Asia 1.19 2.81 2.00 6.00
## Latin America and Caribbean 4.37 10.29 7.33 21.99
## Middle East and Northern Africa 3.97 9.36 6.67 20.00
## North America 0.40 0.94 0.67 2.01
## Southeastern Asia 1.59 3.74 2.67 8.00
## Southern Asia 1.39 3.28 2.33 7.00
## Sub-Saharan Africa 7.75 18.25 13.00 39.00
## Western Europe 4.17 9.83 7.00 21.00
## Sum 30.99 73.01 52.01 156.01
18 of the 30 expected frequencies (60%) are below 5, far more than the usual 20% threshold, so the chi-squared approximation is unreliable and the result should be interpreted with caution.
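A common remedy when many expected counts are small is to replace the asymptotic p-value with a simulated (Monte Carlo) one; a minimal sketch (B, the number of replicates, is an arbitrary choice):
chisq.test(mydata$Region, as.factor(mydata$Group),
simulate.p.value = TRUE, B = 10000) #p-value by Monte Carlo simulation, no expected-count assumption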
round(chi_square$res, 2)
## as.factor(mydata$Group)
## mydata$Region 1 2 3
## Australia and New Zealand 2.54 -0.97 -0.82
## Central and Eastern Europe -1.98 3.37 -2.47
## Eastern Asia -0.18 1.31 -1.41
## Latin America and Caribbean -1.13 2.71 -2.34
## Middle East and Northern Africa -0.49 0.21 0.13
## North America 2.54 -0.97 -0.82
## Southeastern Asia 1.12 0.13 -1.02
## Southern Asia -0.33 -1.26 1.75
## Sub-Saharan Africa -2.07 -4.04 6.38
## Western Europe 4.81 -0.90 -2.65
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
effectsize::cramers_v(mydata$Region, mydata$Group)
## Cramer's V (adj.) | 95% CI
## --------------------------------
## 0.67 | [0.52, 1.00]
##
## - One-sided CIs: upper bound fixed at [1.00].
interpret_cramers_v(0.67) #Interpreting the adjusted Cramer's V computed above
## [1] "very large"
## (Rules: funder2019)
We can’t use residuals, since not all assumptions are met.