R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
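In the .Rmd source, such a chunk sets the option in its header. For example, the default template's plot chunk looks like this:

```{r pressure, echo=FALSE}
plot(pressure)
```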

data(iris)
clustering <- iris # work on a copy of the built-in iris data
head(clustering)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

DESCRIPTION:

The data is the iris dataset built into R. It has 150 observations and 5 variables, 4 numerical and 1 categorical, which are:

Sepal.Length: The length of the sepal, in centimeters.

Sepal.Width: The width of the sepal, in centimeters.

Petal.Length: The length of the petal, in centimeters.

Petal.Width: The width of the petal, in centimeters.

Species: The species of the iris flower, which can be setosa, versicolor, or virginica.
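The structure can be confirmed directly (a quick check, not part of the original output):

str(clustering)

This reports 150 observations of 5 variables: four numeric columns and the Species factor with three levels.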

# Recode Species into numeric labels 1-3 and drop it, keeping only the four numerical variables
clustering$Species <- factor(clustering$Species, levels = c("setosa", "versicolor", "virginica"), labels = c(1, 2, 3))
clustering1 <- clustering[, -5]

The main goal of this analysis is to cluster the iris flowers on the basis of the four numerical variables.

mydata_std1 <- as.data.frame(scale(clustering1))

head(mydata_std1)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   -0.8976739  1.01560199    -1.335752   -1.311052
## 2   -1.1392005 -0.13153881    -1.335752   -1.311052
## 3   -1.3807271  0.32731751    -1.392399   -1.311052
## 4   -1.5014904  0.09788935    -1.279104   -1.311052
## 5   -1.0184372  1.24503015    -1.335752   -1.311052
## 6   -0.5353840  1.93331463    -1.165809   -1.048667

Standardizing the data so that all four variables are on a comparable scale and contribute equally to the distance computations.
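As a quick sanity check (not in the original output), every standardized column should now have mean approximately 0 and standard deviation 1:

round(colMeans(mydata_std1), 3)
apply(mydata_std1, 2, sd)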

mydata_std1$Dissimilarity <- sqrt(mydata_std1$Sepal.Length^2 + mydata_std1$Sepal.Width^2 + mydata_std1$Petal.Length^2 + mydata_std1$Petal.Width^2)

Computing each unit's dissimilarity as its Euclidean distance from the overall mean, which is the origin of the standardized data.
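An equivalent, more compact way to compute the same quantity (an alternative sketch, not used in the original code):

mydata_std1$Dissimilarity <- sqrt(rowSums(mydata_std1[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]^2))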

head(mydata_std1[order(-mydata_std1$Dissimilarity), ], 10)
##     Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 118    2.2421720   1.7038865     1.666574    1.312801      3.525830
## 132    2.4836986   1.7038865     1.496631    1.050416      3.523530
## 16    -0.1730941   3.0804554    -1.279104   -1.048667      3.500711
## 119    2.2421720  -1.0492515     1.779869    1.443994      3.373621
## 34    -0.4146207   2.6215991    -1.335752   -1.311052      3.247735
## 33    -0.7769106   2.3921710    -1.279104   -1.442245      3.168951
## 123    2.2421720  -0.5903951     1.666574    1.050416      3.042490
## 42    -1.6222537  -1.7375359    -1.392399   -1.179859      2.996929
## 110    1.6383555   1.2450302     1.326688    1.706379      2.984316
## 136    2.2421720  -0.1315388     1.326688    1.443994      2.981586

Showing the 10 most dissimilar units.

print(clustering1[c(118,132,16,119), ])
##     Sepal.Length Sepal.Width Petal.Length Petal.Width
## 118          7.7         3.8          6.7         2.2
## 132          7.9         3.8          6.4         2.0
## 16           5.7         4.4          1.5         0.4
## 119          7.7         2.6          6.9         2.3

Printing the 4 most dissimilar units to check whether they contain errors or completely implausible values and should consequently be removed. I decided to keep them in the analysis, since they do not show any signs of abnormality.

#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances1 <- get_dist(mydata_std1,
                       method = "euclidean")

Distances3 <- Distances1^2

fviz_dist(Distances3, show_labels = FALSE)

Through the Euclidean distance heatmap I can see three clusters forming.

library(factoextra)
get_clust_tendency(mydata_std1,
                   n = nrow(mydata_std1) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8363266
## 
## $plot
## NULL

The Hopkins statistic (0.84) is well above 0.5, meaning that the data has a clear clustering tendency and is appropriate for cluster analysis.
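For comparison (an illustrative check, not part of the original analysis), uniformly random data spanning the same ranges should yield a Hopkins statistic close to 0.5:

set.seed(1)
random_data <- as.data.frame(apply(mydata_std1, 2, function(x) runif(length(x), min(x), max(x))))
get_clust_tendency(random_data, n = nrow(random_data) - 1, graph = FALSE)$hopkins_stat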

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
WARD1 <- mydata_std1 %>%
  get_dist(method = "euclidean") %>%  
  hclust(method = "ward.D2")        

WARD1
## 
## Call:
## hclust(d = ., method = "ward.D2")
## 
## Cluster method   : ward.D2 
## Distance         : euclidean 
## Number of objects: 150
library(factoextra)
fviz_dend(WARD1, show_labels = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.

Through the dendrogram I can also observe three formed clusters. The left one is the most homogeneous, as its units merge at the lowest heights.
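To make the three groups explicit, the same dendrogram can be cut and colored by cluster (an optional variant of the call above):

fviz_dend(WARD1, k = 3, show_labels = FALSE)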

clustering1$ClusterWard <- cutree(WARD1,
                                  k = 3)

clustering2 <- cbind(ID = 1:nrow(clustering1), clustering1)
head(clustering2[c("ID", "ClusterWard")])
##   ID ClusterWard
## 1  1           1
## 2  2           1
## 3  3           1
## 4  4           1
## 5  5           1
## 6  6           1

Checking which cluster each unit falls into.

Leaders_initial <- aggregate(mydata_std1,
                             by = list(clustering1$ClusterWard),
                             FUN = mean)

Leaders_initial
##   Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 1       1    -1.011191   0.8504137     -1.30063  -1.2507035      2.416490
## 2       2     0.184300  -0.6617039      0.45707   0.3962255      1.289667
## 3       3     1.420053   0.2479001      1.20032   1.2774803      2.406324

I compute the initial leaders of the clusters as the means of the standardized variables within each Ward cluster.
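This is essentially what hkmeans() does in the next step: it runs k-means starting from the hierarchical leaders rather than from random centers. A manual sketch of the same idea (assuming the leaders computed above):

kmeans(mydata_std1, centers = Leaders_initial[, -1])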

library(factoextra)

kmeans_clu <- hkmeans(mydata_std1,
                      k = 3,
                      hc.metric = "euclidean",
                      hc.method = "ward.D2")

kmeans_clu
## Hierarchical K-means clustering with 3 clusters of sizes 50, 64, 36
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 1  -1.01119138   0.8504137    -1.300630  -1.2507035      2.416490
## 2   0.07031946  -0.7266181     0.386691   0.3247565      1.229557
## 3   1.27942010   0.1106354     1.118980   1.1597433      2.203003
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
##  [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3
## [112] 2 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
## 
## Within cluster sum of squares by cluster:
## [1] 52.81060 76.98213 45.39127
##  (between_SS / total_SS =  74.0 %)
## 
## Available components:
## 
##  [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
##  [6] "betweenss"    "size"         "iter"         "ifault"       "data"        
## [11] "hclust"

Cluster 1 has 50 units, cluster 2 has 64 and cluster 3 has 36. The ratio between the between-group sum of squares and the total sum of squares, 74%, indicates the proportion of the overall variability in the clustering variables that is accounted for by the differences between clusters.
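The same ratio can be recovered from the components of the fitted object (a quick check using the components listed above):

kmeans_clu$betweenss / kmeans_clu$totss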

library(factoextra)
fviz_cluster(kmeans_clu,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw())

I look for outliers in the cluster plot; however, there are no obvious ones, so I will not adjust the data.
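A numeric complement to the visual check (a hypothetical follow-up, not part of the original analysis) is to compute each unit's distance to its assigned cluster center and inspect the largest values:

center_per_unit <- kmeans_clu$centers[kmeans_clu$cluster, ]
dist_to_center <- sqrt(rowSums((as.matrix(mydata_std1) - center_per_unit)^2))
head(sort(dist_to_center, decreasing = TRUE))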

clustering1$ClusterK_Means <- kmeans_clu$cluster
table(clustering1$ClusterWard)
## 
##  1  2  3 
## 50 74 26
table(clustering1$ClusterK_Means)
## 
##  1  2  3 
## 50 64 36
table(clustering1$ClusterWard, clustering1$ClusterK_Means)
##    
##      1  2  3
##   1 50  0  0
##   2  0 64 10
##   3  0  0 26

From this table, I can see that no units were reclassified from Ward cluster 1; however, 10 units were reclassified from Ward cluster 2 to k-means cluster 3.

Leaders_final <- kmeans_clu$centers

Figure <- as.data.frame(Leaders_final)
Figure$ID <- 1:nrow(Figure)

library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))

Figure$Group <- factor(Figure$ID,
                       levels = c(1, 2, 3),
                       labels = c("1", "2", "3"))

Figure$NameF <- factor(Figure$name,
                       levels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                       labels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))

library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables")+
  ylim(-2.2, 2.2)

From the chart above we can say that the irises in cluster 1 (red circles) have below-average sepal length, petal length and petal width, but a well above-average sepal width. The irises in cluster 2 (green triangles) have an average sepal length, are a bit above average in petal length and petal width, but are quite below average in sepal width. The third cluster (blue squares) is above average in all variables; it stands out most in sepal length and is closest to the average in sepal width.
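Since the true species labels are available for this dataset, one can also cross-tabulate them with the k-means clusters as an external check (an optional validation, not part of the original analysis):

table(Species = clustering$Species, Cluster = clustering1$ClusterK_Means)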

fit <- aov(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ as.factor(ClusterK_Means),
           data = clustering1)

summary(fit)
##  Response Sepal.Length :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 75.681   37.84     210 < 2.2e-16 ***
## Residuals                 147 26.488    0.18                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Sepal.Width :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 13.373  6.6864  65.816 < 2.2e-16 ***
## Residuals                 147 14.934  0.1016                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Length :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 433.87 216.937  1047.2 < 2.2e-16 ***
## Residuals                 147  30.45   0.207                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##  Response Petal.Width :
##                            Df Sum Sq Mean Sq F value    Pr(>F)    
## as.factor(ClusterK_Means)   2 77.496  38.748  627.75 < 2.2e-16 ***
## Residuals                 147  9.074   0.062                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the ANOVA I check whether the clustering variables successfully differentiate between the clusters. Since all p-values are below 0.05, we can reject the null hypothesis of equal cluster means for every variable, meaning each of the four variables differentiates well between the clusters.
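As a rough effect-size summary (computed by hand from the sums of squares printed above), the share of each variable's variance explained by the clustering is:

75.681 / (75.681 + 26.488) # Sepal.Length, ~0.74
13.373 / (13.373 + 14.934) # Sepal.Width, ~0.47
433.87 / (433.87 + 30.45) # Petal.Length, ~0.93
77.496 / (77.496 + 9.074) # Petal.Width, ~0.90

Petal length and petal width therefore separate the clusters most strongly.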