This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
##      speed           dist
##  Min.   : 4.0   Min.   :  2.00
##  1st Qu.:12.0   1st Qu.: 26.00
##  Median :15.0   Median : 36.00
##  Mean   :15.4   Mean   : 42.98
##  3rd Qu.:19.0   3rd Qu.: 56.00
##  Max.   :25.0   Max.   :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
# load the built-in iris data and work on a copy
data(iris)
clustering <- iris
head(clustering)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
DESCRIPTION:
The data come from R's built-in iris dataset. It has 150 observations and five variables, four numerical and one categorical, which are:
Sepal.Length: The length of the sepal, in centimeters.
Sepal.Width: The width of the sepal, in centimeters.
Petal.Length: The length of the petal, in centimeters.
Petal.Width: The width of the petal, in centimeters.
Species: The species of the iris flower, which can be setosa, versicolor, or virginica.
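A quick structural check of these claims (a minimal sketch using base R; it should report 150 observations, four numeric variables, and one factor):
# inspect dimensions and variable types of the working copy
str(clustering)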
# recode Species with numeric labels, then drop it so only the
# four numeric variables enter the clustering
clustering$Species <- factor(clustering$Species, levels = c("setosa", "versicolor", "virginica"), labels = c(1, 2, 3))
clustering1 <- clustering[, -5]
The main goal of this analysis is to cluster the iris flowers on the basis of the four numerical variables.
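As a sanity check, the recode above should preserve the balanced design of 50 flowers per species (a minimal sketch):
# counts per recoded species label; iris is balanced at 50/50/50
table(clustering$Species)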
mydata_std1 <- as.data.frame(scale(clustering1))
head(mydata_std1)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 -0.8976739 1.01560199 -1.335752 -1.311052
## 2 -1.1392005 -0.13153881 -1.335752 -1.311052
## 3 -1.3807271 0.32731751 -1.392399 -1.311052
## 4 -1.5014904 0.09788935 -1.279104 -1.311052
## 5 -1.0184372 1.24503015 -1.335752 -1.311052
## 6 -0.5353840 1.93331463 -1.165809 -1.048667
Standardizing the data so that all four variables are on a common scale and contribute equally to the distance calculations.
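A quick way to confirm the standardization worked (a minimal sketch; each column should have mean approximately 0 and standard deviation 1):
# column means should be ~0 and standard deviations ~1 after scale()
round(colMeans(mydata_std1), 3)
round(apply(mydata_std1, 2, sd), 3)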
mydata_std1$Dissimilarity <- sqrt(mydata_std1$Sepal.Length^2 + mydata_std1$Sepal.Width^2 +
                                    mydata_std1$Petal.Length^2 + mydata_std1$Petal.Width^2)
Computing each unit's Euclidean distance from the origin of the standardized space (the overall mean) as a dissimilarity score.
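The same score can be computed in vectorized form; a minimal cross-check, assuming the four standardized variables sit in the first four columns:
# distance of each unit from the origin of the standardized space
stopifnot(all.equal(mydata_std1$Dissimilarity,
                    unname(sqrt(rowSums(mydata_std1[, 1:4]^2)))))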
head(mydata_std1[order(-mydata_std1$Dissimilarity), ], 10)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 118 2.2421720 1.7038865 1.666574 1.312801 3.525830
## 132 2.4836986 1.7038865 1.496631 1.050416 3.523530
## 16 -0.1730941 3.0804554 -1.279104 -1.048667 3.500711
## 119 2.2421720 -1.0492515 1.779869 1.443994 3.373621
## 34 -0.4146207 2.6215991 -1.335752 -1.311052 3.247735
## 33 -0.7769106 2.3921710 -1.279104 -1.442245 3.168951
## 123 2.2421720 -0.5903951 1.666574 1.050416 3.042490
## 42 -1.6222537 -1.7375359 -1.392399 -1.179859 2.996929
## 110 1.6383555 1.2450302 1.326688 1.706379 2.984316
## 136 2.2421720 -0.1315388 1.326688 1.443994 2.981586
Showing the 10 most dissimilar units.
print(clustering1[c(118, 132, 16, 119), ])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 118 7.7 3.8 6.7 2.2
## 132 7.9 3.8 6.4 2.0
## 16 5.7 4.4 1.5 0.4
## 119 7.7 2.6 6.9 2.3
Printing the four most dissimilar units to check whether there are any errors or wildly different values in them that would justify removal; however, I decided to leave them in the analysis since they show no signs of abnormality.
#install.packages("factoextra")
library(factoextra)
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Distances1 <- get_dist(mydata_std1,
                       method = "euclidean")
Distances3 <- Distances1^2
fviz_dist(Distances3, show_labels = FALSE)
Through the Euclidean distance heatmap I can see three clusters forming.
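For the Euclidean metric, get_dist() should agree with base R's dist(); a minimal check under that assumption:
# factoextra's get_dist with the Euclidean method matches stats::dist here
all.equal(as.numeric(Distances1), as.numeric(dist(mydata_std1)))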
library(factoextra)
get_clust_tendency(mydata_std1,
                   n = nrow(mydata_std1) - 1,
                   graph = FALSE)
## $hopkins_stat
## [1] 0.8363266
##
## $plot
## NULL
The Hopkins statistic (0.84) is well above 0.5, meaning the data has a genuine clustering tendency and is appropriate for cluster analysis.
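For comparison, uniformly random data of the same shape should yield a Hopkins statistic near 0.5 (a hypothetical check; the exact value depends on the seed):
set.seed(123)  # for reproducibility of the random reference data
random_ref <- as.data.frame(matrix(runif(prod(dim(mydata_std1))),
                                   ncol = ncol(mydata_std1)))
get_clust_tendency(random_ref, n = nrow(random_ref) - 1, graph = FALSE)$hopkins_stat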
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
WARD1 <- mydata_std1 %>%
  get_dist(method = "euclidean") %>%
  hclust(method = "ward.D2")
WARD1
##
## Call:
## hclust(d = ., method = "ward.D2")
##
## Cluster method : ward.D2
## Distance : euclidean
## Number of objects: 150
library(factoextra)
fviz_dend(WARD1, show_labels = FALSE)
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
## Please report the issue at <https://github.com/kassambara/factoextra/issues>.
Through the dendrogram I can also observe three formed clusters. The left one is the most homogeneous.
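The three-cluster cut can also be visualized directly on the dendrogram; a sketch assuming the same fviz_dend() interface (k colours the branches, rect draws boxes around the groups):
fviz_dend(WARD1, k = 3, show_labels = FALSE, rect = TRUE)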
clustering1$ClusterWard <- cutree(WARD1, k = 3)
clustering2 <- cbind(ID = 1:nrow(clustering1), clustering1)
head(clustering2[c("ID", "ClusterWard")])
## ID ClusterWard
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
Checking which cluster each unit falls into.
Leaders_initial <- aggregate(mydata_std1,
                             by = list(clustering1$ClusterWard),
                             FUN = mean)
Leaders_initial
## Group.1 Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 1 1 -1.011191 0.8504137 -1.30063 -1.2507035 2.416490
## 2 2 0.184300 -0.6617039 0.45707 0.3962255 1.289667
## 3 3 1.420053 0.2479001 1.20032 1.2774803 2.406324
I assign the Ward cluster means as initial leaders for the k-means step.
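These leaders can equivalently seed base R's kmeans() directly; the hkmeans() call below automates essentially this hierarchical seeding. A minimal sketch of the idea (the first column of Leaders_initial is the group index, so it is dropped):
# k-means started from the Ward leaders instead of random centers
seeded <- kmeans(mydata_std1, centers = as.matrix(Leaders_initial[, -1]))
seeded$size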
library(factoextra)
kmeans_clu <- hkmeans(mydata_std1,
                      k = 3,
                      hc.metric = "euclidean",
                      hc.method = "ward.D2")
kmeans_clu
## Hierarchical K-means clustering with 3 clusters of sizes 50, 64, 36
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width Dissimilarity
## 1 -1.01119138 0.8504137 -1.300630 -1.2507035 2.416490
## 2 0.07031946 -0.7266181 0.386691 0.3247565 1.229557
## 3 1.27942010 0.1106354 1.118980 1.1597433 2.203003
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 3 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [75] 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 3 2 3 3 3 3
## [112] 2 3 2 2 3 3 3 3 2 3 2 3 2 3 3 2 2 3 3 3 3 3 2 2 3 3 3 2 3 3 3 2 3 3 3 2 3
## [149] 3 2
##
## Within cluster sum of squares by cluster:
## [1] 52.81060 76.98213 45.39127
## (between_SS / total_SS = 74.0 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault" "data"
## [11] "hclust"
Cluster 1 has 50 units, cluster 2 has 64, and cluster 3 has 36. The ratio of the between-cluster sum of squares to the total sum of squares, 74%, indicates the proportion of the overall variability in the clustering variables that is accounted for by the cluster partition.
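The 74% figure can be reproduced from the components returned by hkmeans (a quick check):
# ratio of between-cluster to total sum of squares, as a percentage
round(kmeans_clu$betweenss / kmeans_clu$totss * 100, 1)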
library(factoextra)
fviz_cluster(kmeans_clu,
             palette = "Set1",
             repel = FALSE,
             ggtheme = theme_bw())
I look for outliers in the cluster plot, but there are no obvious ones, so I will not adjust the data.
clustering1$ClusterK_Means <- kmeans_clu$cluster
table(clustering1$ClusterWard)
##
## 1 2 3
## 50 74 26
table(clustering1$ClusterK_Means)
##
## 1 2 3
## 50 64 36
table(clustering1$ClusterWard, clustering1$ClusterK_Means)
##
## 1 2 3
## 1 50 0 0
## 2 0 64 10
## 3 0 0 26
From this table, I can see that no units were reclassified out of Ward clusters 1 or 3; however, 10 units were reclassified from Ward cluster 2 to k-means cluster 3.
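Given that the labels line up along the diagonal of the table, the overall agreement between the two solutions can be quantified directly (a minimal sketch; 140 of the 150 units keep their label):
# proportion of units assigned the same cluster label by Ward and k-means
mean(clustering1$ClusterK_Means == clustering1$ClusterWard)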
Leaders_final <- kmeans_clu$centers
Figure <- as.data.frame(Leaders_final)
Figure$ID <- 1:nrow(Figure)
library(tidyr)
Figure <- pivot_longer(Figure, cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
Figure$Group <- factor(Figure$ID,
                       levels = c(1, 2, 3),
                       labels = c("1", "2", "3"))
Figure$NameF <- factor(Figure$name,
                       levels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                       labels = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"))
library(ggplot2)
ggplot(Figure, aes(x = NameF, y = value)) +
  geom_hline(yintercept = 0) +
  theme_bw() +
  geom_point(aes(shape = Group, col = Group), size = 3) +
  geom_line(aes(group = ID), linewidth = 1) +
  ylab("Averages") +
  xlab("Cluster variables") +
  ylim(-2.2, 2.2)
From the chart above we can say that the irises in cluster 1 (red circles) have below-average sepal length, petal length, and petal width, but well above-average sepal width. The irises in cluster 2 (green triangles) have roughly average sepal length, are a bit above average in petal length and petal width, but are well below average in sepal width. The third cluster (blue squares) is above average on all variables; it stands out most in sepal length and is closest to average in sepal width.
fit <- aov(cbind(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) ~ as.factor(ClusterK_Means),
           data = clustering1)
summary(fit)
## Response Sepal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 75.681 37.84 210 < 2.2e-16 ***
## Residuals 147 26.488 0.18
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Sepal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 13.373 6.6864 65.816 < 2.2e-16 ***
## Residuals 147 14.934 0.1016
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Length :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 433.87 216.937 1047.2 < 2.2e-16 ***
## Residuals 147 30.45 0.207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Response Petal.Width :
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(ClusterK_Means) 2 77.496 38.748 627.75 < 2.2e-16 ***
## Residuals 147 9.074 0.062
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
With ANOVA I check whether the clustering variables differ between the clusters. Since all p-values are below 0.05, we can reject the null hypothesis of equal cluster means for each variable, meaning every variable helps to differentiate the clusters. These p-values should be read descriptively rather than inferentially, since the clusters were formed from the same variables being tested.
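As a rough effect-size check, eta-squared per response can be computed from the sums of squares printed above as SS_between / (SS_between + SS_residual):
# eta-squared from the ANOVA tables above
75.681 / (75.681 + 26.488)   # Sepal.Length, about 0.74
13.373 / (13.373 + 14.934)   # Sepal.Width, about 0.47
433.87 / (433.87 + 30.45)    # Petal.Length, about 0.93
77.496 / (77.496 + 9.074)    # Petal.Width, about 0.90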