HW 2: CLUSTERING -Ajda Robida

knitr::opts_chunk$set(echo = TRUE)
options(width = 120)
#install.packages(ggplot2)
library(ggplot2)
#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

#install.packages("Hmisc")
library(Hmisc)

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

#install.packages("factoextra")
library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

#install.packages("cluster")
library(cluster)
#install.packages("magrittr")
library(magrittr)
#install.packages("NbClust")
library("NbClust")

mydata <- read.table("~/insurance.csv", header=TRUE, sep=",", dec=".")

head(mydata)

##   age    sex    bmi children smoker    region   charges
## 1  19 female 27.900        0    yes southwest 16884.924
## 2  18   male 33.770        1     no southeast  1725.552
## 3  28   male 33.000        3     no southeast  4449.462
## 4  33   male 22.705        0     no northwest 21984.471
## 5  32   male 28.880        0     no northwest  3866.855
## 6  31 female 25.740        0     no southeast  3756.622

Dataset description: - age: Age of primary beneficiary (in years) - sex: Insurance contractor gender, can be female or male - bmi: Body mass index, (kg / m ^ 2) using the ratio of height to weight - children: Number of children covered by health insurance / Number of dependents - smoker: Is the beneficiary a smoker or not - region: The beneficiary’s residential area in the US, northeast, southeast, southwest, northwest. - charges: Individual medical costs billed by health insurance (in USD)

A unit of observation in this dataset is the primary beneficiary (person) of health insurance.

The data was taken from the website Kaggle.com, more specifically from the link https://www.kaggle.com/datasets/mirichoi0218/insurance?resource=download.

mydata <- data.frame(ID = seq(1, nrow(mydata)), mydata)

head(mydata)

##   ID age    sex    bmi children smoker    region   charges
## 1  1  19 female 27.900        0    yes southwest 16884.924
## 2  2  18   male 33.770        1     no southeast  1725.552
## 3  3  28   male 33.000        3     no southeast  4449.462
## 4  4  33   male 22.705        0     no northwest 21984.471
## 5  5  32   male 28.880        0     no northwest  3866.855
## 6  6  31 female 25.740        0     no southeast  3756.622

I added the ID column and place it as the first column.

install.packages("dplyr")

## Warning: package 'dplyr' is in use and will not be installed

library("dplyr")
set.seed(3)
mydata <- read.table("~/insurance.csv", header=TRUE, sep=",", dec=".") %>% 
  sample_n(500)

head(mydata)

##   age    sex   bmi children smoker    region   charges
## 1  44 female 36.48        0     no northeast 12797.210
## 2  53 female 39.60        1     no southeast 10579.711
## 3  33 female 36.29        3     no northeast  6551.750
## 4  54 female 46.70        2     no southwest 11538.421
## 5  41   male 35.75        1    yes southeast 40273.645
## 6  44   male 21.85        3     no northeast  8891.139

The sample size in this data set is now 500 units of observation. I decided to create such an sample to have better analysation.

The research question for this analysis is to find out the meaningful clusters of primary beneficiaries of health insurance based on their BMI, number of children, and medical charges, and how are these clusters corresponding to age (of this primary beneficiaries of health insurance).

mydata$sexF <- factor(mydata$sex,
                          labels = c("male", "female"),
                          levels = c("male", "female")) # Adding a factor variable for "sex"

mydata$smokerF <- factor(mydata$smoker,
                             labels = c("smoker", "non-smoker"),
                             levels = c("yes", "no")) # Adding a factor variable for "smoker"

mydata$regionF <- factor(mydata$region,
                             labels = c("northeast", "southeast", "southwest", "northwest"),
                             levels = c("northeast", "southeast", "southwest", "northwest"))

I added factor variables where needed.

summary(mydata[c("age", "bmi", "children", "charges")])

##       age             bmi           children        charges     
##  Min.   :18.00   Min.   :17.29   Min.   :0.000   Min.   : 1137  
##  1st Qu.:26.00   1st Qu.:26.11   1st Qu.:0.000   1st Qu.: 4529  
##  Median :39.00   Median :30.28   Median :1.000   Median : 8863  
##  Mean   :38.96   Mean   :30.83   Mean   :1.084   Mean   :12863  
##  3rd Qu.:51.00   3rd Qu.:34.99   3rd Qu.:2.000   3rd Qu.:14476  
##  Max.   :64.00   Max.   :53.13   Max.   :5.000   Max.   :63770

The youngest person is 18 years old, while the oldest is 64, showing a wide range. The median BMI in 30.28, indicating that half of the observed people have a BMI below this value and the other half above it. The mean for children is 1.084.

var(mydata$age)

## [1] 189.3808

var(mydata$charges)

## [1] 151953647

I am checking variance before scaling the data, to see if it is the same. We can see that it is not the same. We want all the variables to have the same effect, not the one with the highest variance to contribute the most, for this reason we standardize all the cluster variables.

mydata_std <- as.data.frame(scale(mydata[c("age", "bmi", "children", "charges")]))

Scaling the variables so they can be used in clustering.

var(mydata_std$age)

## [1] 1

var(mydata_std$charges)

## [1] 1

Now both variables have the same variance.

mydata_std$Dissimilarity <-sqrt(mydata_std$bmi^2 + mydata_std$children^2 + mydata_std$charges^2)

Computing the dissimilarity.

head(mydata_std[order(-mydata_std$Dissimilarity), ], 10)

##              age       bmi    children    charges Dissimilarity
## 366  1.093189173  2.680950 -0.88866794  4.1297333      5.003191
## 218 -0.578131685  1.174535 -0.06886357  3.7079458      3.890133
## 127 -1.522791300  3.605984 -0.88866794 -0.9491366      3.833238
## 36   0.003197309 -2.026698  3.21035394  0.4997052      3.829306
## 338 -0.432799436  1.870736  3.21035394 -0.5027341      3.749502
## 279  0.511860179  1.862650  1.57074519  2.7004012      3.637152
## 132  0.366527931  2.111698  0.75094081  2.7044460      3.512436
## 498  0.875190801  1.951596  0.75094081  2.8068159      3.500122
## 202 -1.014128430  2.378535  0.75094081  2.3727540      3.442572
## 431  1.311187546  1.827072 -0.06886357  2.9051877      3.432645

Identifying unit with high dissimilarity -> unit 366

print(mydata[c(366), ])

##     age    sex   bmi children smoker    region  charges   sexF smokerF   regionF
## 366  54 female 47.41        0    yes southeast 63770.43 female  smoker southeast

I see unit 366 as a potential outlier, as there is a big jump in dissimilarity numbers between units. For this reason I will remove this unit.

mydata <- mydata[c(-366), ]

Now rescaling the data.

mydata_std <- as.data.frame(scale(mydata[c("bmi", "children", "charges")]))

Now the sampler size got smaller, the new sample size is now 499.

library(factoextra)

Distances <- get_dist(mydata_std, 
                      method = "euclidian")

Distances2 <- Distances^2

fviz_dist(Distances2)

Graphical presentation of dissimilarity matrix.

From the matrix of distances displayed above, we can see that some groups of clusters are forming, and I see 4 or 5 groups.

library(factoextra) 


get_clust_tendency(mydata_std, 
                   n = nrow(mydata_std) - 1,
                   graph = FALSE)

## $hopkins_stat
## [1] 0.828396
## 
## $plot
## NULL

H0: There are no natural groups of objects H1: There are natural groups of objects

Hopkins statistics in my case is above 0.5, so I can conclude that the data is clusterable and there are natural groups present. If it would be greater, even closer to 1, it would be even more appropriate. Now the next question is how many clusters to use, so I will check this.

WARD <- mydata_std %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")          


WARD

## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 499

fviz_dend(WARD)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.

Based on the dendrogram, I would choose 4 clusters.

fviz_nbclust(mydata_std, kmeans, method = "wss") +
  labs(subtitle = "Elbow method")

With the elbow method, I can see the slope changes the most at 4, so I suggest 4 as the number of clusters.

fviz_nbclust(mydata_std, kmeans, method = "silhouette")+
  labs(subtitle = "Silhouette analysis")

The highest value of the Silhouette analysis is at 3, so this is the suggested numbers of clusters.

NbClust(mydata_std, 
        distance = "euclidean", 
        min.nc = 2, max.nc = 10,
        method = "kmeans", 
        index = "all")

## *** : The Hubert index is a graphical method of determining the number of clusters.
##                 In the plot of Hubert index, we seek a significant knee that corresponds to a 
##                 significant increase of the value of the measure i.e the significant peak in Hubert
##                 index second differences plot. 
##

## *** : The D index is a graphical method of determining the number of clusters. 
##                 In the plot of D index, we seek a significant knee (the significant peak in Dindex
##                 second differences plot) that corresponds to a significant increase of the value of
##                 the measure. 
##  
## ******************************************************************* 
## * Among all indices:                                                
## * 5 proposed 2 as the best number of clusters 
## * 8 proposed 3 as the best number of clusters 
## * 6 proposed 4 as the best number of clusters 
## * 1 proposed 5 as the best number of clusters 
## * 2 proposed 7 as the best number of clusters 
## * 1 proposed 10 as the best number of clusters 
## 
##                    ***** Conclusion *****                            
##  
## * According to the majority rule, the best number of clusters is  3 
##  
##  
## *******************************************************************

## $All.index
##         KL       CH Hartigan     CCC     Scott   Marriot    TrCovW    TraceW Friedman  Rubin Cindex     DB Silhouette
## 2   1.3986 180.6902 227.2798 -3.5565  572.3163 150755260 202456.18 1095.6599   1.3797 1.3636 0.2398 1.6396     0.3362
## 3   0.8467 244.8066 196.5838  0.5883 1275.3472  82906081  86588.50  751.8406   5.2113 1.9871 0.2579 1.1681     0.3689
## 4  24.9168 292.8249  82.9373  3.5241 1552.2808  84613635  30622.38  538.4373   5.7924 2.7747 0.2812 0.9903     0.3551
## 5   0.1087 276.5902  33.9191  3.2618 1754.7546  88113294  21311.09  461.1684   6.8963 3.2396 0.3094 1.0458     0.3431
## 6   0.7736 242.7566  94.0711  0.5863 1939.4354  87633612  24596.05  431.5381   8.9437 3.4620 0.2692 1.0867     0.2863
## 7   3.3838 256.0563  34.2191  3.2227 2142.5705  79390736  13994.43  362.3893  10.0836 4.1226 0.2825 1.0808     0.2976
## 8   0.6714 239.1431  57.7170  2.1162 2241.1155  85111265  13546.83  338.8238  10.9688 4.4094 0.2699 1.0110     0.3051
## 9   1.8213 240.5712  45.4017  3.1000 2444.7992  71617666  10132.77  303.1844  13.2709 4.9277 0.2769 1.0500     0.3131
## 10  1.4943 238.2124  38.6262  3.5347 2543.7000  72520169   9940.87  277.4746  14.0491 5.3843 0.2910 1.0428     0.3030
##      Duda Pseudot2   Beale Ratkowsky     Ball Ptbiserial   Frey McClain   Dunn Hubert SDindex Dindex   SDbw
## 2  1.0248  -8.7643 -0.0411    0.3610 547.8300     0.5274 0.2879  0.4566 0.0288 0.0015  2.5298 1.3453 1.2267
## 3  0.6572 167.4131  0.8849    0.3733 250.6135     0.6006 0.8234  0.6467 0.0354 0.0017  1.9323 1.1112 0.8806
## 4  1.5223 -88.5204 -0.5811    0.3992 134.6093     0.5636 0.4745  1.2189 0.0069 0.0017  1.6715 0.9492 0.6474
## 5  1.2877 -33.2935 -0.3783    0.3717  92.2337     0.5556 1.7456  1.4215 0.0061 0.0018  1.6290 0.8820 0.7571
## 6  1.6098 -38.2590 -0.6320    0.3435  71.9230     0.4891 0.2444  1.9991 0.0113 0.0020  2.0562 0.8478 0.4937
## 7  3.6518 -65.3545 -1.2061    0.3287  51.7699     0.4851 0.7620  2.2338 0.0228 0.0020  2.0751 0.7800 0.2865
## 8  2.0721 -39.8404 -0.8599    0.3108  42.3530     0.4546 0.0814  2.6576 0.0174 0.0022  2.0681 0.7473 0.4511
## 9  1.7028 -42.0992 -0.6889    0.2974  33.6872     0.4585 0.5343  2.6848 0.0176 0.0022  1.8874 0.7084 0.3402
## 10 1.4410 -27.8510 -0.5128    0.2853  27.7475     0.4366 0.2649  3.0440 0.0211 0.0023  1.9987 0.6800 0.3962
## 
## $All.CriticalValues
##    CritValue_Duda CritValue_PseudoT2 Fvalue_Beale
## 2          0.6673           180.5070       1.0000
## 3          0.6596           165.6852       0.4484
## 4          0.6280           152.8601       1.0000
## 5          0.6232            90.0723       1.0000
## 6          0.4752           111.5242       1.0000
## 7          0.4434           112.9878       1.0000
## 8          0.4474            95.1037       1.0000
## 9          0.4783           111.2731       1.0000
## 10         0.5088            87.8579       1.0000
## 
## $Best.nc
##                      KL       CH Hartigan     CCC    Scott  Marriot   TrCovW   TraceW Friedman   Rubin Cindex     DB
## Number_clusters  4.0000   4.0000   4.0000 10.0000   3.0000        3      3.0   4.0000   3.0000  7.0000 2.0000 4.0000
## Value_Index     24.9168 292.8249 113.6465  3.5347 703.0309 69556733 115867.7 136.1345   3.8316 -0.3739 0.2398 0.9903
##                 Silhouette   Duda PseudoT2   Beale Ratkowsky     Ball PtBiserial Frey McClain   Dunn Hubert SDindex
## Number_clusters     3.0000 2.0000   2.0000  2.0000    4.0000   3.0000     3.0000    1  2.0000 3.0000      0   5.000
## Value_Index         0.3689 1.0248  -8.7643 -0.0411    0.3992 297.2164     0.6006   NA  0.4566 0.0354      0   1.629
##                 Dindex   SDbw
## Number_clusters      0 7.0000
## Value_Index          0 0.2865
## 
## $Best.partition
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
##   1   1   2   2   3   2   1   1   1   1   2   2   1   1   1   1   1   1   1   1   2   3   1   1   2   1   1   2   3   3 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   1   2   2   2   2   1   1   1   1   1   2   1   2   1   2   1   2   1   2   1   1   2   1   1   1   1   1   1   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   1   3   1   2   3   1   2   1   3   1   1   3   2   2   1   1   1   2   1   2   1   1   1   1   1   2   2   2   1   3 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   1   2   1   2   1   1   1   3   1   3   1   1   1   1   1   1   2   2   2   1   2   1   3   2   1   1   1   3   3   1 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
##   1   1   1   1   3   1   1   2   1   1   1   3   1   1   1   3   1   1   3   1   1   2   2   1   1   3   1   1   1   1 
## 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   1   1   2   1   2   2   1   2   1   1   1   1   3   2   1   2   2   1   2   1   2   1   2   1   3   1   1   3   2   1 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 
##   1   1   1   2   1   1   3   3   1   1   2   1   1   1   2   1   1   1   1   1   1   3   1   1   2   2   1   1   3   1 
## 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   1   1   2   1   3   1   1   3   3   1   2   1   1   1   1   1   2   2   2   2   1   1   1   1   1   2   2   1   1   1 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 
##   1   1   3   2   2   1   1   1   2   1   3   1   1   2   2   1   3   1   2   3   2   2   1   1   1   3   2   3   1   1 
## 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   3   2   2   1   1   1   1   1   3   3   1   1   1   2   3   2   1   1   1   1   2   1   1   2   1   1   2   1   2   2 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 
##   1   2   1   2   2   1   1   3   1   1   1   1   1   2   1   3   1   3   1   2   3   3   1   1   1   1   2   2   1   2 
## 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   1   1   2   1   1   1   1   2   1   2   1   1   1   1   2   1   1   2   2   1   1   1   1   3   1   1   2   3   1   3 
## 361 362 363 364 365 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 
##   1   2   2   2   1   1   3   1   1   1   1   1   2   1   2   3   1   1   1   3   1   1   1   1   1   2   1   1   1   1 
## 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 
##   2   1   2   1   2   1   1   1   1   1   2   1   2   1   1   2   1   1   1   1   1   1   3   1   2   1   2   1   2   2 
## 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 
##   2   1   2   2   1   3   2   2   2   3   1   1   1   2   1   1   1   1   1   1   2   1   1   2   1   2   2   2   1   3 
## 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 
##   2   1   1   3   1   2   3   1   2   1   1   2   1   2   2   1   1   1   1   2   1   3   3   1   3   1   1   1   2   1 
## 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   3   1   1   1   1   1   1   1   1   1   2   1   2   1   1   2   3   1   1

The most of all indices suggested 3 as the number of clusters to use, so I will proceed with 3 clusters.

Clustering <- kmeans(mydata_std, 
                     centers = 3, #Number of groups
                     nstart = 25) #Number of different positions of initial leaders

Clustering

## K-means clustering with 3 clusters of sizes 61, 133, 305
## 
## Cluster means:
##           bmi   children    charges
## 1  0.68512387  0.1443559  2.3163945
## 2 -0.09767091  1.3036416 -0.2694814
## 3 -0.09443385 -0.5973444 -0.3457673
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30 
##   3   3   2   2   1   2   3   3   3   3   2   2   3   3   3   3   3   3   3   3   2   1   3   3   2   3   3   2   1   1 
##  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   2   3   2   2   2   2   3   3   3   3   3   2   3   2   3   2   3   2   3   2   3   3   2   3   3   3   3   3   3   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90 
##   3   1   3   2   1   3   2   3   1   3   3   1   2   2   3   3   3   2   3   2   3   3   3   3   3   2   2   2   3   3 
##  91  92  93  94  95  96  97  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   3   2   3   2   3   3   3   1   3   1   3   3   3   3   3   3   2   2   2   3   1   3   1   2   3   3   3   1   1   3 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 
##   3   3   3   3   1   3   3   2   3   3   3   1   3   3   3   1   3   3   1   3   3   2   2   3   3   1   3   3   3   3 
## 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179 180 
##   3   3   2   3   2   2   3   2   3   3   3   3   1   2   3   2   2   3   2   3   2   3   2   3   1   3   3   1   2   3 
## 181 182 183 184 185 186 187 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205 206 207 208 209 210 
##   3   3   3   2   3   3   1   1   3   3   2   3   3   3   2   3   3   3   3   3   3   1   3   3   2   2   3   3   1   3 
## 211 212 213 214 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 
##   3   3   2   3   1   3   3   1   1   3   2   3   3   3   3   3   2   2   2   2   3   3   3   3   3   2   2   3   3   3 
## 241 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268 269 270 
##   3   3   1   2   2   3   3   3   2   3   1   3   3   2   2   3   1   3   2   1   2   2   3   3   3   1   2   3   3   3 
## 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295 296 297 298 299 300 
##   1   2   2   3   3   3   3   3   1   1   3   3   3   2   1   2   3   3   3   3   2   3   3   2   3   3   2   3   2   2 
## 301 302 303 304 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322 323 324 325 326 327 328 329 330 
##   3   2   3   2   2   3   3   1   3   3   3   3   3   2   3   1   3   1   3   2   1   1   3   3   3   3   2   2   3   2 
## 331 332 333 334 335 336 337 338 339 340 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358 359 360 
##   3   3   2   3   3   3   3   2   3   2   3   3   3   3   2   3   3   2   2   3   3   3   3   1   3   3   2   1   3   1 
## 361 362 363 364 365 367 368 369 370 371 372 373 374 375 376 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 
##   3   2   2   2   3   3   1   3   3   3   3   3   2   3   2   1   3   3   3   1   3   3   3   3   3   2   3   3   3   3 
## 392 393 394 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412 413 414 415 416 417 418 419 420 421 
##   2   3   2   3   2   3   3   3   3   3   2   3   2   3   3   2   3   3   3   3   3   3   1   3   2   3   2   3   2   2 
## 422 423 424 425 426 427 428 429 430 431 432 433 434 435 436 437 438 439 440 441 442 443 444 445 446 447 448 449 450 451 
##   2   3   2   2   3   1   2   2   1   1   3   3   3   2   3   3   3   3   3   3   2   3   3   2   3   2   2   2   3   1 
## 452 453 454 455 456 457 458 459 460 461 462 463 464 465 466 467 468 469 470 471 472 473 474 475 476 477 478 479 480 481 
##   2   3   3   1   3   2   1   3   2   3   3   2   3   2   2   3   3   3   3   2   3   1   1   3   1   3   3   3   2   3 
## 482 483 484 485 486 487 488 489 490 491 492 493 494 495 496 497 498 499 500 
##   1   3   3   3   3   3   3   3   3   3   2   3   2   3   3   2   1   3   3 
## 
## Within cluster sum of squares by cluster:
## [1] 105.7491 229.2559 416.8114
##  (between_SS / total_SS =  49.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"    "size"        
## [8] "iter"         "ifault"

I did K-Means clustering. The ratio between the sum of squares and the totals sum of squares 49.7%. This number indicates a relatively good fit (as it’s above 50%) of clustering the 3 variables into 3 groups.

library(factoextra)
fviz_cluster(Clustering, 
             palette = "Set1", 
             repel = FALSE,
             ggtheme = theme_bw(),
             data = mydata_std)

With the help of Principal Component Analysis around 72.6% (40.4+32.5) of is showed when combining 3 variables into 3 dimensions. I won`t remove any unit since they are not quite far away from the other units in this cluster.

Averages <- Clustering$centers
Averages

##           bmi   children    charges
## 1  0.68512387  0.1443559  2.3163945
## 2 -0.09767091  1.3036416 -0.2694814
## 3 -0.09443385 -0.5973444 -0.3457673

library(tidyr)

## 
## Attaching package: 'tidyr'

## The following object is masked from 'package:magrittr':
## 
##     extract

install.packages("magrittr")

## Warning: package 'magrittr' is in use and will not be installed

Figure <- as.data.frame(Averages)
Figure$ID <- 1:nrow(Figure)

Figure <- pivot_longer(Figure, cols = c("bmi", "children", "charges"))

Figure$Group <- factor(Figure$ID, 
                       levels = c(1, 2, 3), 
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name, 
                       levels = c("bmi", "children", "charges"), 
                       labels = c("Bmi", "N of Children", "Charges"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
 ylim(-2.4,2.4)

We can see from the graph how the clusters differ from each other in our 3 variables. The 0 line represents the average. We can see that the 1st group is above average in BMI, group 2 and 3 has (almost) the average BMI. Group 2 has the highest number of children covered by their insurance and are the group with the second highest charges, although they are still below the total average. Group 3 has the lowest BMI, adn has all three cluster variables bellow average. Group 1 has charges really high above average.

mydata$Group <- Clustering$cluster

fit <- aov(cbind(bmi, children, charges) ~ as.factor(Group), 
           data = mydata)

summary(fit)

##  Response bmi :
##                   Df  Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2  1231.8  615.90  17.384 5.047e-08 ***
## Residuals        496 17572.8   35.43                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response children :
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 500.35 250.174  514.99 < 2.2e-16 ***
## Residuals        496 240.95   0.486                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response charges :
##                   Df     Sum Sq    Mean Sq F value    Pr(>F)    
## as.factor(Group)   2 5.4911e+10 2.7455e+10  743.44 < 2.2e-16 ***
## Residuals        496 1.8317e+10 3.6930e+07                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I did one-way anova for the variable of interest across clusters, to check if all the cluster variables are successful at classifing units into group. There is the summary of the ANOVA results, which shows that all the variables in our clustering analysis are statistically significant, so we can claim that all are successful at classifying units into groups.

aggregate(mydata$age, 
          by = list(mydata$Group), 
          FUN = mean)

##   Group.1        x
## 1       1 41.39344
## 2       2 37.96241
## 3       3 38.85246

I did check the criterion validity of the classification with variable that was not used in the clustering process -age. Group 1 has highest age (is the oldest).

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

leveneTest(mydata$age, as.factor(mydata$Group))

## Levene's Test for Homogeneity of Variance (center = median)
##        Df F value    Pr(>F)    
## group   2  13.142 2.744e-06 ***
##       496                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

I did perform the Levene’s test for checking homogenity of variances.

H0: μ (age, G1) = μ (age, G2) = μ (age, G3)

H1: At least one μ (age, j) is different. We can reject H0, because p < 0.001, meaning that we can say the variances across the three clusters are significantly different. The p-value of 2.744e-0 is extremely small. Since the assumption of homogeneity of variances is violated, a standard one-way ANOVA is not appropriate, as it assumes equal variances.

Instead, we should use Welch’s ANOVA, which adjusts for unequal variances and provides more reliable results when variance heterogeneity is present.

library(dplyr)
library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

mydata %>%
  group_by(as.factor(mydata$Group)) %>%
  shapiro_test(age)

## # A tibble: 3 × 4
##   `as.factor(mydata$Group)` variable statistic        p
##   <fct>                     <chr>        <dbl>    <dbl>
## 1 1                         age          0.940 4.88e- 3
## 2 2                         age          0.974 1.16e- 2
## 3 3                         age          0.925 2.83e-11

I did the Shapiro-Wilk test for normality, for the age variable within each cluster. The results shows that for G3, the p-value is extremely small (2.826867e-11), which is well below 0.05, meaning that the age distribution in this group significantly deviates from normality. For G1 and G2, the p-value is also below 0.05, suggesting a violation of normality. Since all of the groups do not follow a normal distribution, a standard one-way ANOVA is not appropriate because it assumes normally distributed residuals. Also, Levene’s test previously indicated that the assumption of homogeneity of variances was violated for age. Given these conditions, the best approach is to use the Kruskal-Wallis test, which is a non-parametric alternative to ANOVA that does not require normality.

kruskal.test(age ~ as.factor(Group), 
             data = mydata)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  age by as.factor(Group)
## Kruskal-Wallis chi-squared = 2.2503, df = 2, p-value = 0.3246

We cannot reject H0 (p > 0.005),

The Kruskal-Wallis rank sum test was performed to compare the distribution of age across the three clusters. The test produced a chi-squared value of 2.2503 with 2 degrees of freedom, and the p-value is 0.3246, which is above 0.05 and means that there is a statistically unsignificant difference in the distribution of age among the three clusters.

library(rstatix)
kruskal_effsize(age ~ Group, 
                data = mydata)

## # A tibble: 1 × 5
##   .y.       n  effsize method  magnitude
## * <chr> <int>    <dbl> <chr>   <ord>    
## 1 age     499 0.000505 eta2[H] small

The effect size is small, which means that, even if there is some minor difference between the groups, it doesn’t have practical significance.

Given both the non-significant p-value (0.3246) and the small effect size, I can conclude that age is not an important factor for differentiating the three clusters in analysis.

HW 2: CLUSTERING -Ajda Robida

2025-01-28