#Research question: Can we categorize the cancer cells in the data set into homogeneous groups based on four cluster variables (radius, texture, perimeter and area)?
mydata <- read.csv("C:/Users/nadja/Downloads/breast-cancer-wisconsin-data_data.csv")
#Keeping only id, diagnosis and the four cluster variables (columns 7-33 are dropped)
mydata <- mydata[, -c(7:33)]
head(mydata)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
##Description
Unit of observation: a cancer cell
Sample size: 569
Diagnosis - The diagnosis of breast tissues (M = malignant, B = benign)
Radius_mean - mean of distances from the center to points on the perimeter, in millimeters
Texture_mean - standard deviation of gray-scale values
Perimeter_mean - mean perimeter of the core tumor, in millimeters
Area_mean - average area (size) of the cell nuclei, in square millimeters
##Data manipulation and Descriptive statistics
library(tidyr)
mydata <- drop_na(mydata) #Removing units with missing values
mydata$diagnosisF <- factor(mydata$diagnosis,
                            levels = c("M", "B"),
                            labels = c("Malignant", "Benign"))
head(mydata)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## diagnosisF
## 1 Malignant
## 2 Malignant
## 3 Malignant
## 4 Malignant
## 5 Malignant
## 6 Malignant
summary(mydata[, c(-1, -2)]) #Descriptive statistics (excluding the id and diagnosis columns)
## radius_mean texture_mean perimeter_mean area_mean
## Min. : 6.981 Min. : 9.71 Min. : 43.79 Min. : 143.5
## 1st Qu.:11.700 1st Qu.:16.17 1st Qu.: 75.17 1st Qu.: 420.3
## Median :13.370 Median :18.84 Median : 86.24 Median : 551.1
## Mean :14.127 Mean :19.29 Mean : 91.97 Mean : 654.9
## 3rd Qu.:15.780 3rd Qu.:21.80 3rd Qu.:104.10 3rd Qu.: 782.7
## Max. :28.110 Max. :39.28 Max. :188.50 Max. :2501.0
## diagnosisF
## Malignant:212
## Benign :357
##
##
##
##
The smallest area of the cell nuclei is 143.5 square millimeters, while the largest is 2501 square millimeters.
The first quartile, or 25th percentile, signifies that 25% of the areas of the cell nuclei are less than or equal to 420.3 square millimeters.
Half (50%) of the areas of the cell nuclei are less than or equal to 551.1 square millimeters.
The cell nuclei have an average area of 654.9 square millimeters.
The third quartile, or 75th percentile, signifies that 75% of the areas of the cell nuclei are less than or equal to 782.7 square millimeters.
The majority of diagnoses (357) are benign, while the remaining 212 are malignant.
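As a quick check of the quartiles quoted above, they can also be computed directly (a small sketch added here, not part of the original output):
quantile(mydata$area_mean, probs = c(0.25, 0.50, 0.75)) #Should reproduce 420.3, 551.1 and 782.7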
##Clustering
#Standardizing cluster variables
mydata$radius_z <- scale(mydata$radius_mean)
mydata$texture_z <- scale(mydata$texture_mean)
mydata$perimeter_z <- scale(mydata$perimeter_mean)
mydata$area_z <- scale(mydata$area_mean)
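For transparency, scale() on a single column is equivalent to subtracting the mean and dividing by the standard deviation; a minimal check (our addition, using the radius variable):
#scale() computes (x - mean(x)) / sd(x); verify on radius_mean
all.equal(as.numeric(mydata$radius_z),
          (mydata$radius_mean - mean(mydata$radius_mean)) / sd(mydata$radius_mean))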
library(tidyr)
mydata <- drop_na(mydata) #Removing units with missing values
mydata$Dissimilarity <- sqrt(mydata$radius_z^2 + mydata$texture_z^2 + mydata$perimeter_z^2 + mydata$area_z^2) #Finding outliers
head(mydata[order(-mydata$Dissimilarity), c("id", "Dissimilarity") ], 10) #10 units with the highest value of dissimilarity
## id Dissimilarity
## 462 911296202 7.722987
## 213 8810703 7.682549
## 181 873592 6.961796
## 353 899987 6.107474
## 83 8611555 5.865842
## 522 91762702 5.448662
## 123 865423 5.244761
## 203 878796 5.085291
## 340 89812 4.981009
## 237 88299702 4.956238
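If we wanted to exclude the most extreme profiles, a trimming step could look like the sketch below (the cutoff of 6 and the object name mydata_trimmed are illustrative assumptions):
#Hypothetical trimming (not applied in this report): keep units whose
#standardized profile lies within a chosen distance of the origin
mydata_trimmed <- mydata[mydata$Dissimilarity <= 6, ] #565 of 569 units would remain, per the ranking above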
* If outliers were removed, the cluster variables would have to be standardized again on the reduced data. In this report all 569 units are kept, so re-running the standardization below leaves the values unchanged.
#Standardizing cluster variables
mydata$radius_z <- scale(mydata$radius_mean)
mydata$texture_z <- scale(mydata$texture_mean)
mydata$perimeter_z <- scale(mydata$perimeter_mean)
mydata$area_z <- scale(mydata$area_mean)
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.3.2
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
distance <- get_dist(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")],
                     method = "euclidean")
distance2 <- distance^2
fviz_dist(distance2) #Dissimilarity matrix
The dissimilarity matrix above shows the squared Euclidean distances between all pairs of units.
Dimensions: 569 x 569
##Hopkins statistic
get_clust_tendency(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")], #Hopkins statistic
                   n = nrow(mydata) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.9473338
##
## $plot
## NULL
* The Hopkins statistic is about 0.95; values close to 1 indicate that the data are far from uniformly distributed, i.e. there is a strong clustering tendency.
##Hierarchical clustering
library(dplyr) #Allowing use of %>%
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD <- mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")] %>%
  #scale() %>% #Not needed here: the variables are already standardized
  get_dist(method = "euclidean") %>% #Selecting distance
  hclust(method = "ward.D2") #Selecting algorithm
WARD
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 569
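As an optional diagnostic (not part of the original analysis), the cophenetic correlation measures how faithfully the dendrogram preserves the original pairwise distances:
#Cophenetic correlation: agreement between the input distances and the
#distances implied by the dendrogram (values near 1 are good)
d <- get_dist(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")],
              method = "euclidean")
cor(d, cophenetic(WARD))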
##Dendrogram
library(factoextra)
fviz_dend(WARD)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
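Before fixing the number of groups, the choice of k can be sanity-checked; a sketch using factoextra's fviz_nbclust with the average silhouette width (the choice of criterion is our assumption):
#Average silhouette width for hierarchical cuts at k = 1..10; a peak suggests a good k
fviz_nbclust(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")],
             FUNcluster = hcut, method = "silhouette")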
fviz_dend(WARD,
          k = 3,
          cex = 0.5,
          palette = "jama",
          color_labels_by_k = TRUE,
          rect = TRUE)
* We cut the dendrogram into 3 groups.
mydata$ClusterWard <- cutree(WARD,
                             k = 3) #Cutting the dendrogram
head(mydata)
## id diagnosis radius_mean texture_mean perimeter_mean area_mean
## 1 842302 M 17.99 10.38 122.80 1001.0
## 2 842517 M 20.57 17.77 132.90 1326.0
## 3 84300903 M 19.69 21.25 130.00 1203.0
## 4 84348301 M 11.42 20.38 77.58 386.1
## 5 84358402 M 20.29 14.34 135.10 1297.0
## 6 843786 M 12.45 15.70 82.57 477.1
## diagnosisF radius_z texture_z perimeter_z area_z Dissimilarity
## 1 Malignant 1.0960995 -2.0715123 1.2688173 0.9835095 2.840737
## 2 Malignant 1.8282120 -0.3533215 1.6844726 1.9070303 3.153000
## 3 Malignant 1.5784992 0.4557859 1.5651260 1.5575132 2.752248
## 4 Malignant -0.7682333 0.2535091 -0.5921661 -0.7637917 1.260352
## 5 Malignant 1.7487579 -1.1508038 1.7750113 1.8246238 3.295819
## 6 Malignant -0.4759559 -0.8346009 -0.3868077 -0.5052059 1.152365
## ClusterWard
## 1 1
## 2 1
## 3 1
## 4 2
## 5 1
## 6 2
#Calculating positions of initial leaders
Initial_leaders <- aggregate(mydata[, c("radius_z", "texture_z", "perimeter_z", "area_z")],
                             by = list(mydata$ClusterWard),
                             FUN = mean)
Initial_leaders
## Group.1 radius_z texture_z perimeter_z area_z
## 1 1 1.7088083 0.4953569 1.7074300 1.7605512
## 2 2 -0.5714602 -0.3351839 -0.5754811 -0.5559728
## 3 3 0.4029972 0.7806526 0.4202220 0.2867965
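These Ward centroids can be used to seed k-means directly; hkmeans (used below) performs this hierarchical initialization internally, but an explicit equivalent looks like this sketch (K_manual is a hypothetical name):
#Plain k-means started from the Ward group means computed above
K_manual <- kmeans(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")],
                   centers = as.matrix(Initial_leaders[, -1]))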
library(factoextra)
#Performing K-Means clustering
K_MEANS <- hkmeans(mydata[c("radius_z", "texture_z", "perimeter_z", "area_z")],
                   k = 3,
                   hc.metric = "euclidean",
                   hc.method = "ward.D2")
K_MEANS
## Hierarchical K-means clustering with 3 clusters of sizes 107, 296, 166
##
## Cluster means:
## radius_z texture_z perimeter_z area_z
## 1 1.67511174 0.5291436 1.67462636 1.7212771
## 2 -0.64646288 -0.6141324 -0.64983176 -0.6133779
## 3 0.07298829 0.7540049 0.07930831 -0.0157639
##
## Clustering vector:
## [1] 1 1 1 2 1 2 1 3 3 3 3 3 1 3 3 3 3 3 1 2 2 2 2 1 3 1 3 1 3 1 1 2 3 1 3 3 3
## [38] 2 3 3 3 2 1 3 3 1 2 2 2 3 2 2 2 1 3 2 1 3 2 2 2 2 3 2 3 3 2 2 2 2 1 2 3 2
## [75] 2 3 2 1 1 2 2 2 1 1 2 1 3 1 3 2 3 3 2 2 3 1 2 2 2 3 3 2 2 2 2 2 2 2 1 2 2
## [112] 3 3 2 2 2 2 3 3 1 2 1 1 2 2 2 3 1 3 1 2 3 3 2 1 3 2 2 3 2 2 3 2 2 2 2 2 3
## [149] 2 2 3 2 2 2 2 2 1 3 2 2 2 1 1 3 1 3 2 3 1 3 2 3 2 2 2 2 2 3 3 2 1 1 3 2 3
## [186] 2 1 2 2 2 3 3 2 3 3 2 3 1 1 3 2 1 1 3 2 3 2 3 3 2 1 2 1 3 3 2 2 2 1 1 2 2
## [223] 2 3 2 2 2 2 3 3 3 3 3 1 2 3 1 1 3 3 2 2 2 3 1 2 2 2 3 2 1 2 1 1 1 2 1 3 3
## [260] 3 1 3 1 3 3 1 2 3 2 2 2 2 1 2 1 2 2 1 2 2 1 2 1 3 2 2 2 2 2 2 3 3 2 2 2 2
## [297] 2 2 3 2 1 2 1 2 2 3 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 3 3 3 2 2
## [334] 2 2 3 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 3 1 3 2 2 2 2 2 2 2 3 2 3 2 1 1 2 1 1
## [371] 3 2 1 1 2 3 2 3 2 2 2 2 3 2 2 3 2 2 2 1 2 2 3 1 2 2 2 2 2 2 1 2 2 2 2 2 3
## [408] 3 1 2 2 2 2 3 3 2 2 3 2 2 2 2 2 3 2 2 2 2 2 2 3 2 1 1 3 3 2 2 3 2 2 3 2 2
## [445] 1 3 1 3 3 1 2 1 3 2 2 3 3 3 3 3 3 1 3 2 2 3 3 2 1 2 2 3 2 3 2 2 3 2 2 3 2
## [482] 3 2 2 2 2 2 1 2 3 3 1 1 2 3 3 2 2 1 1 3 3 2 1 2 2 2 2 3 3 2 2 3 2 3 2 1 1
## [519] 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 2 1 3 3 2 2 2 3 3 3 3 3 2 2 2 3 2 2 3 2 3
## [556] 3 2 3 3 3 3 3 3 1 1 1 3 1 2
##
## Within cluster sum of squares by cluster:
## [1] 242.7607 278.0061 235.8988
## (between_SS / total_SS = 66.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
###Size of clusters
* The first cluster contains the fewest units - 19% (107/569).
* The second cluster contains the most units - 52% (296/569).
* The third cluster contains 29% (166/569).
* Between_SS / total_SS = 66.7%, meaning the clustering explains about two thirds of the total variability of the units (we want this ratio to be as high as possible).
##Cluster Plot
fviz_cluster(K_MEANS,
             palette = "jama",
             repel = FALSE,
             ggtheme = theme_classic()) #Visualization
Because there are four cluster variables, fviz_cluster projects the units onto the first two principal components for plotting.
mydata$ClusterK_Means <- K_MEANS$cluster
head(mydata[c("id", "ClusterWard", "ClusterK_Means")])
## id ClusterWard ClusterK_Means
## 1 842302 1 1
## 2 842517 1 1
## 3 84300903 1 1
## 4 84348301 2 2
## 5 84358402 1 1
## 6 843786 2 2
#Checking for reclassifications
table(mydata$ClusterWard)
##
## 1 2 3
## 102 372 95
table(mydata$ClusterK_Means)
##
## 1 2 3
## 107 296 166
table(mydata$ClusterWard, mydata$ClusterK_Means)
##
## 1 2 3
## 1 102 0 0
## 2 0 296 76
## 3 5 0 90
###First group
* Initially the first group had 102 units; all of them stayed in the same group.
###Second group
* Initially the second group had 372 units; 296 stayed in the same group, while 76 were reclassified into the third group.
###Third group
* Initially the third group had 95 units; 90 stayed in the same group, while 5 were reclassified into the first group.
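Since the cluster labels happen to correspond one to one between the two solutions, the overall agreement can be summarized in a single number (a supplementary computation, not in the original output):
#Share of units keeping their label between Ward and k-means
sum(diag(table(mydata$ClusterWard, mydata$ClusterK_Means))) / nrow(mydata) #about 0.86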
##Final leaders (Centroids)
Centroids <- K_MEANS$centers
Centroids
## radius_z texture_z perimeter_z area_z
## 1 1.67511174 0.5291436 1.67462636 1.7212771
## 2 -0.64646288 -0.6141324 -0.64983176 -0.6133779
## 3 0.07298829 0.7540049 0.07930831 -0.0157639
library(ggplot2)
library(tidyr)
Figure <- as.data.frame(Centroids)
Figure$id <- 1:nrow(Figure)
Figure <- pivot_longer(Figure, cols = c("radius_z", "texture_z", "perimeter_z", "area_z"))
Figure$Groups <- factor(Figure$id,
                        levels = c(1, 2, 3),
                        labels = c("1", "2", "3"))
Figure$NameFactor <- factor(Figure$name,
                            levels = c("radius_z", "texture_z", "perimeter_z", "area_z"),
                            labels = c("radius_z", "texture_z", "perimeter_z", "area_z"))
ggplot(Figure, aes(x = NameFactor, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Groups, col = Groups), size = 3) +
  geom_line(aes(group = id), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster Variables") +
  ylim(-3, 3)
* Group 1: the group with the largest cells; all cluster variables are well above average. The lowest value is texture, but it is still above average.
* Group 2: characterized by below-average values of all cluster variables.
* Group 3: area is roughly average (slightly below), radius and perimeter are just above average, while texture is well above average.
##Performing ANOVA
#Are all the cluster variables successful at classifying units into groups? Performing ANOVA
fit <- aov(cbind(radius_z, texture_z, perimeter_z, area_z) ~ as.factor(ClusterK_Means),
           data = mydata)
summary(fit)
## Response 1 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 424.83 212.414 839.74 < 2.2e-16 ***
## Residuals 566 143.17 0.253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 2 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 235.97 117.987 201.13 < 2.2e-16 ***
## Residuals 566 332.03 0.587
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 3 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 426.11 213.054 849.86 < 2.2e-16 ***
## Residuals 566 141.89 0.251
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response 4 :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 428.43 214.213 868.67 < 2.2e-16 ***
## Residuals 566 139.57 0.247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
###Radius
* H0: arithmetic mean (radius, G1) = arithmetic mean (radius, G2) = arithmetic mean (radius, G3)
* H1: at least one μ is different
* We reject the null hypothesis at p < 0.001 and conclude that at least one group has a different arithmetic mean of radius than the others.
###Texture
* H0: arithmetic mean (texture, G1) = arithmetic mean (texture, G2) = arithmetic mean (texture, G3)
* H1: at least one μ is different
* We reject the null hypothesis at p < 0.001 and conclude that at least one group has a different arithmetic mean of texture than the others.
###Perimeter
* H0: arithmetic mean (perimeter, G1) = arithmetic mean (perimeter, G2) = arithmetic mean (perimeter, G3)
* H1: at least one μ is different
* We reject the null hypothesis at p < 0.001 and conclude that at least one group has a different arithmetic mean of perimeter than the others.
###Area
* H0: arithmetic mean (area, G1) = arithmetic mean (area, G2) = arithmetic mean (area, G3)
* H1: at least one μ is different
* We reject the null hypothesis at p < 0.001 and conclude that at least one group has a different arithmetic mean of area than the others.
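The F-tests only say that the group means differ; as a supplementary effect-size check (not in the original output), eta squared = SS_between / SS_total shows how much of each variable's variance the clustering explains:
#Eta squared per cluster variable, computed from one-way ANOVA tables
vars <- c("radius_z", "texture_z", "perimeter_z", "area_z")
sapply(vars, function(v) {
  a <- anova(lm(mydata[[v]] ~ as.factor(mydata$ClusterK_Means)))
  a[1, "Sum Sq"] / sum(a[, "Sum Sq"])
}) #about 0.75, 0.42, 0.75, 0.75 given the sums of squares above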
##Validation of results
#Pearson Chi2 test
chisq_results <- chisq.test(mydata$diagnosisF, as.factor(mydata$ClusterK_Means))
chisq_results
##
## Pearson's Chi-squared test
##
## data: mydata$diagnosisF and as.factor(mydata$ClusterK_Means)
## X-squared = 332.26, df = 2, p-value < 2.2e-16
* We reject the null hypothesis at p < 0.001, which means there is an association between the diagnosis and the K-means clusters.
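To quantify the strength of this association (a supplementary computation, not part of the original analysis), Cramér's V rescales the chi-squared statistic to the 0-1 range:
#Cramér's V = sqrt(X^2 / (n * (min(dim) - 1))); here min(dim) - 1 = 1
n <- sum(chisq_results$observed)
sqrt(unname(chisq_results$statistic) / (n * (min(dim(chisq_results$observed)) - 1)))
#about 0.76, a strong association between diagnosis and cluster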
addmargins(chisq_results$observed)
##
## mydata$diagnosisF 1 2 3 Sum
## Malignant 106 14 92 212
## Benign 1 282 74 357
## Sum 107 296 166 569
round(chisq_results$expected, 2)
##
## mydata$diagnosisF 1 2 3
## Malignant 39.87 110.28 61.85
## Benign 67.13 185.72 104.15
round(chisq_results$residuals, 2) #Pearson residuals
##
## mydata$diagnosisF 1 2 3
## Malignant 10.47 -9.17 3.83
## Benign -8.07 7.07 -2.95
In the first group there are more malignant and fewer benign cells than expected (alpha = 0.001).
In the second group there are more benign and fewer malignant cells than expected (alpha = 0.001).
In the third group there are more malignant and fewer benign cells than expected, although the deviation is smaller here (Pearson residuals 3.83 and -2.95).
#Conclusion
* Group 1: After K-means clustering, 107 cancer cells (19% of the sample) fall into the first group. Within this group, 99% are malignant and 1% are benign; there are more malignant and fewer benign cells than expected. All cluster variables are well above average (the lowest value is texture, but it is still above average).
* Group 2: After K-means clustering, 296 cancer cells (52% of the sample) fall into the second group. Within this group, 5% are malignant and 95% are benign; there are more benign and fewer malignant cells than expected. The group is characterized by below-average values of all cluster variables.
* Group 3: After K-means clustering, 166 cancer cells (29% of the sample) fall into the third group. Within this group, 55% are malignant and 45% are benign; there are more malignant and fewer benign cells than expected. Area is roughly average, radius and perimeter are just above average, while texture is well above average.