Homework assignment 2 at course MVA

2 Data description

head(mydata,10)

##    ID Country.name Ladder.score Logged.GDP.per.capita Social.support
## 1   1      Finland        7.842                10.775          0.954
## 2   2      Denmark        7.620                10.933          0.954
## 3   3  Switzerland        7.571                11.117          0.942
## 4   4      Iceland        7.554                10.878          0.983
## 5   5  Netherlands        7.464                10.932          0.942
## 6   6       Norway        7.392                11.053          0.954
## 7   7       Sweden        7.363                10.867          0.934
## 8   8   Luxembourg        7.324                11.647          0.908
## 9   9  New Zealand        7.277                10.643          0.948
## 10 10      Austria        7.268                10.906          0.934
##    Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## 1                     72.0                        0.949     -0.098
## 2                     72.7                        0.946      0.030
## 3                     74.4                        0.919      0.025
## 4                     73.0                        0.955      0.160
## 5                     72.4                        0.913      0.175
## 6                     73.3                        0.960      0.093
## 7                     72.7                        0.945      0.086
## 8                     72.6                        0.907     -0.034
## 9                     73.4                        0.929      0.134
## 10                    73.3                        0.908      0.042
##    Perceptions.of.corruption
## 1                      0.186
## 2                      0.179
## 3                      0.292
## 4                      0.673
## 5                      0.338
## 6                      0.270
## 7                      0.237
## 8                      0.386
## 9                      0.242
## 10                     0.481

Unit of observation is an individual country, with different variables, which describe the situations in the country, which will be explained down below. Sample size is 149.

Explanation of variables:

Ladder score
Represents the overall happiness score of a country, based on survey responses to the Cantril ladder question, where respondents evaluate their current life relative to the best and worst possible life. A higher score indicates a higher level of well-being and happiness.
Logged GDP per capita
Measures the economic production of a country on a per-person basis, adjusted logarithmically to reflect diminishing returns of income on happiness. Higher values suggest a higher standard of living and economic prosperity.
Social support index
Reflects the extent to which people feel they have someone to count on in times of trouble, based on survey responses. A higher value indicates stronger social support networks and connections within the country.
Healthy life expectancy
Represents the number of years a person can expect to live in good health. A higher value suggests better healthcare systems and living conditions.
Freedom to make life choices index

Measures people’s perceptions of the freedom they have to make life decisions. Higher values indicate greater personal autonomy and life satisfaction.
Generosity index

Captures the extent to which people donate time and money to charities and help others. A higher value suggests a more altruistic society.
Perceptions of corruption index

Reflects the public’s perception of the level of corruption within their government and businesses. Lower values indicate lower perceived corruption and higher trust in institutions.

Source of data: https://www.kaggle.com/datasets/ajaypalsinghlo/world-happiness-report-2021

library(pastecs)

## 
## Attaching package: 'pastecs'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

round(stat.desc(mydata[,c(-1,-2)]), 2)

##              Ladder.score Logged.GDP.per.capita Social.support
## nbr.val            149.00                149.00         149.00
## nbr.null             0.00                  0.00           0.00
## nbr.na               0.00                  0.00           0.00
## min                  2.52                  6.64           0.46
## max                  7.84                 11.65           0.98
## range                5.32                  5.01           0.52
## sum                824.39               1405.40         121.40
## median               5.53                  9.57           0.83
## mean                 5.53                  9.43           0.81
## SE.mean              0.09                  0.09           0.01
## CI.mean.0.95         0.17                  0.19           0.02
## var                  1.15                  1.34           0.01
## std.dev              1.07                  1.16           0.11
## coef.var             0.19                  0.12           0.14
##              Healthy.life.expectancy Freedom.to.make.life.choices Generosity
## nbr.val                       149.00                       149.00     149.00
## nbr.null                        0.00                         0.00       0.00
## nbr.na                          0.00                         0.00       0.00
## min                            48.48                         0.38      -0.29
## max                            76.95                         0.97       0.54
## range                          28.48                         0.59       0.83
## sum                          9683.93                       117.95      -2.26
## median                         66.60                         0.80      -0.04
## mean                           64.99                         0.79      -0.02
## SE.mean                         0.55                         0.01       0.01
## CI.mean.0.95                    1.09                         0.02       0.02
## var                            45.73                         0.01       0.02
## std.dev                         6.76                         0.11       0.15
## coef.var                        0.10                         0.14      -9.95
##              Perceptions.of.corruption
## nbr.val                         149.00
## nbr.null                          0.00
## nbr.na                            0.00
## min                               0.08
## max                               0.94
## range                             0.86
## sum                             108.39
## median                            0.78
## mean                              0.73
## SE.mean                           0.01
## CI.mean.0.95                      0.03
## var                               0.03
## std.dev                           0.18
## coef.var                          0.25

Minimum (Min):

The minimum healthy life expectancy is 48.48 years, indicating that the country with the lowest healthy life expectancy has an expected lifespan of 48.48 years in good health.

Maximum (Max):

The maximum healthy life expectancy is 76.95 years, meaning that the country with the highest healthy life expectancy has an expected lifespan of 76.95 years in good health.

Median: The median healthy life expectancy is 66.60 years, which means that half of the countries have a healthy life expectancy of 66.60 years or less, while the other half have a healthy life expectancy greater than 66.60 years.

#Saving standardized cluster variables into new data frame
mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))

mydata$Dissimilarity <- sqrt(mydata_clu_std$Logged.GDP.per.capita^2 + mydata_clu_std$Social.support^2  + mydata_clu_std$Healthy.life.expectancy^2  + mydata_clu_std$Freedom.to.make.life.choices^2 + mydata_clu_std$Generosity^2 + mydata_clu_std$Perceptions.of.corruption^2) #Finding outliers

head(mydata[order(-mydata$Dissimilarity), c("ID", "Dissimilarity")], 10) #Finding units with highest value of dissimilarity

##      ID Dissimilarity
## 149 149      5.443879
## 143 143      4.684666
## 32   32      4.629847
## 140 140      4.398610
## 147 147      4.325374
## 128 128      3.988596
## 2     2      3.969410
## 1     1      3.904402
## 82   82      3.868888
## 98   98      3.842972

Since we can spot that there is a big gap between units with highest value of dissimiliarty, we remove following units:

mydata <- mydata %>%
  filter(!ID %in% c(149,32,143,140,147)) #Removing ID 32, 140,143,147, 149 from original data frame

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))

We show the matrix of distances, from which we can recognize, that we have multiple groups, which can be indentified.

library(factoextra)

## Warning: package 'factoextra' was built under R version 4.4.2

## Loading required package: ggplot2

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#Finding Eudlidean distances, based on 6 Cluster variables, then saving them into object Distances

Distances <- get_dist(mydata_clu_std, 
                      method = "euclidian")

fviz_dist(Distances, #Showing matrix of distances
          gradient = list(low = "darkred",
                          mid = "grey95",
                          high = "white"))

We also calculate Hopkins statistics, which is also above 0.5 - our clusters are good.

library(factoextra) 
get_clust_tendency(mydata_clu_std, #Hopkins statistics
                   n = nrow(mydata_clu_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.7133091
## 
## $plot
## NULL

For further clustering, we selected K-Means clustering. First we predetermine the number of groups with a help of elbow and silhouette analysis.

library(factoextra)
library(NbClust)

fviz_nbclust(mydata_clu_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

fviz_nbclust(mydata_clu_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

library(NbClust)
NbClust(mydata_clu_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 7 proposed 2 as the best number of clusters 
## * 12 proposed 3 as the best number of clusters 
## * 2 proposed 5 as the best number of clusters 
## * 1 proposed 9 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##        KL      CH Hartigan     CCC    Scott      Marriot    TrCovW   TraceW
## 2  1.4646 77.7111  50.5306 -1.5166 204.4336 464874908577 13186.697 554.5281
## 3  3.6099 77.3987  22.5467  0.2894 381.9961 304790789090  5103.647 408.9895
## 4  0.8139 66.8879  21.4936  0.5364 466.3435 301643079108  3679.977 352.6058
## 5  4.2803 62.7894  11.1952  1.6211 571.5838 226942369375  2795.892 305.6765
## 6  0.5570 56.1097  11.8020  1.1950 622.0625 230164041938  2324.914 282.8920
## 7  1.1800 52.3418  10.1196  1.2826 701.3449 180642752049  1806.601 260.6047
## 8  0.8524 49.2618  10.0932  1.3262 739.2676 181314396821  1805.312 242.6791
## 9  2.6609 47.2150   6.5736  1.5525 811.7554 138713494017  1553.825 225.9129
## 10 2.1967 44.4114   3.5427  1.3127 856.1134 125849902307  1447.072 215.4232
##    Friedman  Rubin Cindex     DB Silhouette   Duda Pseudot2   Beale Ratkowsky
## 2    7.9547 1.5473 0.3917 1.2987     0.3254 0.7530  33.1273  1.2482    0.3808
## 3   11.1927 2.0979 0.3413 1.1711     0.3224 0.5472  33.9222  3.1424    0.4092
## 4   12.4770 2.4333 0.4018 1.2502     0.3008 1.6853 -19.1116 -1.4933    0.3820
## 5   14.8348 2.8069 0.3630 1.3237     0.2605 1.7393 -17.4271 -1.5535    0.3575
## 6   15.6725 3.0330 0.3910 1.3523     0.2317 1.1785  -4.9988 -0.5504    0.3335
## 7   18.0169 3.2923 0.3864 1.3492     0.2262 1.3629  -5.8575 -0.9561    0.3147
## 8   18.4804 3.5355 0.3559 1.3821     0.2296 1.5254  -8.9548 -1.2046    0.2989
## 9   20.4884 3.7979 0.3402 1.3573     0.2195 1.1817  -3.3827 -0.5324    0.2857
## 10  22.0848 3.9829 0.3827 1.3607     0.2125 1.2336  -3.7868 -0.6880    0.2734
##        Ball Ptbiserial    Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  277.2640     0.5323  0.2208  0.5596 0.1597 0.0023  1.9152 1.8382 0.9108
## 3  136.3298     0.6261  0.7058  0.8501 0.1249 0.0031  1.6285 1.5676 0.5427
## 4   88.1515     0.6143  1.0228  1.1622 0.1375 0.0032  1.6473 1.4732 0.4739
## 5   61.1353     0.5449  2.2211  1.8704 0.1017 0.0034  1.9415 1.3722 0.4254
## 6   47.1487     0.4751  0.6225  2.6418 0.0863 0.0034  1.9845 1.3182 0.3810
## 7   37.2292     0.4507 -0.1628  3.1382 0.1373 0.0036  2.0513 1.2668 0.3478
## 8   30.3349     0.4803  1.1307  2.9675 0.0874 0.0037  1.9746 1.2264 0.3394
## 9   25.1014     0.4348  1.0884  3.7974 0.1319 0.0038  2.1685 1.1774 0.3140
## 10  21.5423     0.4162  0.7337  4.2153 0.1509 0.0038  2.1883 1.1447 0.2961
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.7157            40.1151       0.2799
## 3          0.7006            17.5226       0.0050
## 4          0.5356            40.7476       1.0000
## 5          0.5190            38.0049       1.0000
## 6          0.4997            33.0377       1.0000
## 7          0.4643            25.3784       1.0000
## 8          0.3979            39.3438       1.0000
## 9          0.3758            36.5355       1.0000
## 10         0.4997            20.0228       1.0000
## 
## $Best.nc
##                     KL      CH Hartigan    CCC    Scott      Marriot  TrCovW
## Number_clusters 5.0000  2.0000   3.0000 5.0000   3.0000            3    3.00
## Value_Index     4.2803 77.7111  27.9839 1.6211 177.5625 156936409505 8083.05
##                  TraceW Friedman   Rubin Cindex     DB Silhouette  Duda
## Number_clusters  3.0000   3.0000  3.0000 9.0000 3.0000     2.0000 2.000
## Value_Index     89.1549   3.2381 -0.2151 0.3402 1.1711     0.3254 0.753
##                 PseudoT2  Beale Ratkowsky     Ball PtBiserial Frey McClain
## Number_clusters   2.0000 2.0000    3.0000   3.0000     3.0000    1  2.0000
## Value_Index      33.1273 1.2482    0.4092 140.9342     0.6261   NA  0.5596
##                   Dunn Hubert SDindex Dindex    SDbw
## Number_clusters 2.0000      0  3.0000      0 10.0000
## Value_Index     0.1597      0  1.6285      0  0.2961
## 
## $Best.partition
##   [1] 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 1 3 1 3 1 3 3 3 1 3 1 1 1 1 1 3 1 1 1 1 1 1
##  [38] 1 3 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 3 1 1 1 1 2 2 1 2 1 2 1 1 1 2 2 1 1 2 2 1 2 2 2 2 2 1 1 2 2 1 1 1 1 2 1
## [112] 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2

From here we can see that we will have 3 groups. Now we proceed with K-Means method.

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions
Clustering

## K-means clustering with 3 clusters of sizes 24, 79, 41
## 
## Cluster means:
##   Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1             1.1733233      0.9899720               1.0233188
## 2             0.2787572      0.3257858               0.3282251
## 3            -1.2239408     -1.2072294              -1.2314497
##   Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1                   1.02130098  0.5727610                -1.7417075
## 2                  -0.01723578 -0.4458824                 0.3996773
## 3                  -0.56462432  0.5238645                 0.2494263
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2
##  [38] 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 1 2 2 2 2 3 3 2 3 2 3 2 2 2 3 3 2 2 3 3 2 3 3 3 3 3 2 2 3 3 2 2 2 2 3 2
## [112] 3 3 3 3 3 3 3 3 3 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3
## 
## Within cluster sum of squares by cluster:
## [1]  51.36146 212.36967 145.25836
##  (between_SS / total_SS =  52.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Since we see that ID 41,67 81,94,97,125 are clearly outliers, we remove them and repeat the process.

library(dplyr)

mydata <- mydata %>%
  filter(!ID %in% c(41,67,81,94,97,125)) #Removing IDs from original data frame

mydata$ID <- seq(1, nrow(mydata))

mydata_clu_std <- as.data.frame(scale(mydata[c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption")]))

Clustering <- kmeans(mydata_clu_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of attempts at different starting leader positions
Clustering

## K-means clustering with 3 clusters of sizes 36, 80, 22
## 
## Cluster means:
##   Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1            -1.2908312     -1.2856269               -1.298512
## 2             0.2349987      0.3112376                0.290719
## 3             1.2577285      0.9719799                1.067678
##   Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1                  -0.70874597  0.4260038                 0.2358709
## 2                   0.04147238 -0.3977309                 0.3835836
## 3                   1.00895748  0.7491973                -1.7808198
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 3 2 3 2 3 3 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2
##  [38] 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3
##  [75] 2 2 2 2 1 2 1 2 1 2 2 2 1 1 2 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 1 2 1 2 1 1 1
## [112] 1 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 1 1 1 1 1 1 1 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 108.64553 233.81083  44.06584
##  (between_SS / total_SS =  53.0 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

rownames(mydata_clu_std) <- mydata$Country.name
library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_clu_std)

Averages <- Clustering$centers
Averages #Average values of cluster variables to describe groups

##   Logged.GDP.per.capita Social.support Healthy.life.expectancy
## 1            -1.2908312     -1.2856269               -1.298512
## 2             0.2349987      0.3112376                0.290719
## 3             1.2577285      0.9719799                1.067678
##   Freedom.to.make.life.choices Generosity Perceptions.of.corruption
## 1                  -0.70874597  0.4260038                 0.2358709
## 2                   0.04147238 -0.3977309                 0.3835836
## 3                   1.00895748  0.7491973                -1.7808198

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Logged.GDP.per.capita", "Social.support", "Healthy.life.expectancy","Freedom.to.make.life.choices","Generosity","Perceptions.of.corruption"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

library(ggplot2)
ggplot(Figure, aes(x = name, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2) +
  theme(axis.text.x = element_text(angle = 45, vjust = 0.50, size = 10))

From here we can see which clustering variables are above and below average for each formed group.

For group 1, variables like Perception of corruption and Generosity are above average,while Healthy life expectancy, Logged GDP per capita Freedom to make life choices and Social support are below average.

For group 2, all variables are above average except Generosity.

For group 3, all variables are above average except Perceptions of corruption.

mydata$Group <- Clustering$cluster

Then we test if variables are different for each group. Here is just an example of 1 variable:

\(H0: 𝜇_{Generosity,G1} = 𝜇_{Generosity,G2} = 𝜇_{Generosity,G3}\)

\(H1\): At least one group is different.

#Checking if clustering variables successfully differentiate between groups

fit <- aov(cbind(Logged.GDP.per.capita, Social.support, Healthy.life.expectancy,Freedom.to.make.life.choices,Generosity,Perceptions.of.corruption) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##  Response Logged.GDP.per.capita :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 123.385  61.692  177.17 < 2.2e-16 ***
## Residuals        135  47.008   0.348                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Social.support :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.97873 0.48937  121.36 < 2.2e-16 ***
## Residuals        135 0.54435 0.00403                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Healthy.life.expectancy :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 4033.7 2016.86   140.5 < 2.2e-16 ***
## Residuals        135 1937.9   14.35                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Freedom.to.make.life.choices :
##                   Df  Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.45553 0.227767  28.445 4.913e-11 ***
## Residuals        135 1.08097 0.008007                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Generosity :
##                   Df  Sum Sq  Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 0.49356 0.246781  20.185 2.141e-08 ***
## Residuals        135 1.65052 0.012226                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Perceptions.of.corruption :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 2.3594 1.17970  105.49 < 2.2e-16 ***
## Residuals        135 1.5097 0.01118                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From the output we can conduct that we have to reject \(H0\) at p<0,001, which applies for all selected variables.

Then we also check the ladder score, for which group it is the highest (official happiness leaderboard score)

aggregate(mydata$Ladder.score, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 4.522278
## 2       2 5.683650
## 3       3 7.062818

3rd group has the highest average score.

We perform Levene’s test for homogeneity of variance

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$Ladder.score, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value Pr(>F)
## group   2  1.8227 0.1655
##       135

Based on the levene test, we have:

\(H0\): \(σ_1^2 = σ_2^2 = σ_3^2\)

\(H1\): At least one is different.

From output we can see that we can not reject \(H0\).

We also check the normality:

\(H0\): Normality of distribution is met.

\(H1\): Normality of distribution is not met.

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(Ladder.score)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable     statistic      p
##   <fct>                     <chr>            <dbl>  <dbl>
## 1 1                         Ladder.score     0.941 0.0534
## 2 2                         Ladder.score     0.978 0.174 
## 3 3                         Ladder.score     0.904 0.0363

From the output we can see we have to reject \(H0\) at p < 0.04.

Therefore, we have to perform kruskal-wallis test:

\(H0\): Location distribution of Ladder.Score is the same.

\(H1\): Location distribution of Ladder.Score is not the same.

kruskal.test(Ladder.score ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  Ladder.score by as.factor(Group)
## Kruskal-Wallis chi-squared = 81.148, df = 2, p-value < 2.2e-16

We have to reject \(H0\) at p < 0,001.

Homework assignment 2 at course MVA

2025-01-21

1 Data import

2 Data description

Final comment