library(ggplot2)   # plotting
library(tidyr)     # reshaping the data (gather/separate) and the %>% pipe
library(class)     # knn()
library(gmodels)   # CrossTable()
library(fpc)       # plotcluster()
library(rsconnect) # publishing the document
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Checking the dimensions of the data
dim(iris)
## [1] 150 5
Summarizing the data
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
ggplot(iris,aes(x=Sepal.Length,y=Sepal.Width,color=Species)) + geom_jitter()
Restructuring the data for further visualization
iris_tidy<- iris %>% gather(key,Value,-Species) %>% separate(key,c("Part","Measure"),"\\.")
head(iris_tidy)
## Species Part Measure Value
## 1 setosa Sepal Length 5.1
## 2 setosa Sepal Length 4.9
## 3 setosa Sepal Length 4.7
## 4 setosa Sepal Length 4.6
## 5 setosa Sepal Length 5.0
## 6 setosa Sepal Length 5.4
ggplot(iris_tidy,aes(x=Species,y=Value,color=Part,shape=Part))+geom_jitter()+facet_grid(. ~ Measure)
Creating a jittered scatter plot of length and width for the three species of flowers in our data. Overall, the lengths are greater than the widths. Compared to versicolor and virginica, the petal length and width of setosa are smaller.
ggplot(iris_tidy, aes(x=Species,y=Value,color=Part,fill=Part))+geom_col(position = "dodge")+facet_grid(~Measure)
Visualizing the same measurements on a bar chart, it's fairly obvious that for both petal and sepal the length is larger than the width.
ggplot(iris, aes(x=Sepal.Width,fill=Species)) +geom_histogram(binwidth =.2,position="dodge")+labs(x='Sepal Width',y="count")
Plotting the variable Sepal width for all three species. We see that the sepal width of versicolor ranges from about 2.0 to 3.4, while that of setosa lies mostly between 3.0 and 4.4. The y-axis shows the count of samples of these flowers in the data.
ggplot(iris, aes(x = Petal.Length, y = ..scaled.., fill = Species)) + geom_density(alpha = 0.4)
Plotting the density curve of petal length.
The KNN algorithm assumes that similar things exist in close proximity: similar observations are near each other. It captures this idea of similarity (sometimes called distance, proximity, or closeness) with some mathematics we might have learned in school, namely calculating the distance between points on a graph.
There are other ways of calculating distance, and one may be preferable depending on the problem we are solving, but the straight-line distance (also called the Euclidean distance) is a popular and familiar choice.
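For example, the Euclidean distance between the first two flowers in the iris data can be computed directly from their four measurements; a minimal sketch using base R and dist():
x1 <- unlist(iris[1, 1:4])
x2 <- unlist(iris[2, 1:4])
sqrt(sum((x1 - x2)^2))   # straight-line (Euclidean) distance by hand
dist(iris[1:2, 1:4])     # dist() returns the same pairwise distance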
We have 150 observations in our dataset, which we will split into training and test sets. The training set will contain 80% of the data (120 rows) and the test set the remaining 20% (30 rows).
set.seed(13383610)
size <- nrow(iris)
# Shuffle the rows so the split is random
shuffled <- iris[sample(size), ]
# First 80% of the shuffled rows for training, the remaining 20% for testing
train_data <- shuffled[1:(0.8 * size), ]
test_data  <- shuffled[(0.8 * size + 1):size, ]
dim(train_data)
## [1] 120 5
Size of training data
dim(test_data)
## [1] 30 5
Size of test data
knn_iris <- knn(train = train_data[,-5],test= test_data[,-5],cl=train_data[,5],k=5)
knn_iris
## [1] versicolor versicolor virginica virginica versicolor setosa
## [7] versicolor setosa setosa setosa setosa virginica
## [13] versicolor setosa versicolor versicolor virginica versicolor
## [19] versicolor virginica virginica setosa virginica setosa
## [25] virginica versicolor virginica setosa setosa versicolor
## Levels: setosa versicolor virginica
Checking the confusion matrix to see how our model has performed.
table(test_data[,5],knn_iris,dnn=c("True","Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 11 0
## virginica 0 0 9
It seems our model has performed well. If we check the accuracy of the model over the 30 samples in our test set, we get (10 + 11 + 9) * 100 / 30, i.e. 100% accuracy. Calculating the accuracy in code:
mean(test_data[,5]==knn_iris)
## [1] 1
miserror <- sum(test_data[,5]!=knn_iris)/nrow(test_data)
miserror
## [1] 0
The misclassification rate is 1 - accuracy, which is 0 here. Plotting the counts of the predicted classes.
plot(knn_iris)
CrossTable(x=test_data[,5],y=knn_iris,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 30
##
##
## | knn_iris
## test_data[, 5] | setosa | versicolor | virginica | Row Total |
## ---------------|------------|------------|------------|------------|
## setosa | 10 | 0 | 0 | 10 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## ---------------|------------|------------|------------|------------|
## versicolor | 0 | 11 | 0 | 11 |
## | 0.000 | 1.000 | 0.000 | 0.367 |
## | 0.000 | 1.000 | 0.000 | |
## | 0.000 | 0.367 | 0.000 | |
## ---------------|------------|------------|------------|------------|
## virginica | 0 | 0 | 9 | 9 |
## | 0.000 | 0.000 | 1.000 | 0.300 |
## | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.300 | |
## ---------------|------------|------------|------------|------------|
## Column Total | 10 | 11 | 9 | 30 |
## | 0.333 | 0.367 | 0.300 | |
## ---------------|------------|------------|------------|------------|
##
##
Here we can see a more detailed view of the confusion matrix as well as the accuracy in each category. Setosa makes up 33.3% of our test data, versicolor 36.7%, and virginica 30%. The accuracy for setosa is 1, i.e. 100%, with no misclassifications, and the same holds for the other two species.
The overall accuracy is 0.333 + 0.367 + 0.300 = 1.00.
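Equivalently, the overall accuracy is the sum of the diagonal of the confusion matrix divided by the total number of test samples; a small check using the same table as above:
cm <- table(test_data[, 5], knn_iris)   # rows = true species, columns = predicted
sum(diag(cm)) / sum(cm)                 # (10 + 11 + 9) / 30 = 1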
Let us train another model with the parameter k set to 10.
knn_iris2 <- knn(train = train_data[,-5],test= test_data[,-5],cl=train_data[,5],k=10)
knn_iris2
## [1] versicolor versicolor virginica versicolor versicolor setosa
## [7] versicolor setosa setosa setosa setosa virginica
## [13] versicolor setosa versicolor versicolor virginica versicolor
## [19] versicolor virginica virginica setosa virginica setosa
## [25] virginica versicolor virginica setosa setosa versicolor
## Levels: setosa versicolor virginica
table(test_data[,5],knn_iris2,dnn=c("True","Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 11 0
## virginica 0 1 8
mean(test_data[,5]==knn_iris2)
## [1] 0.9666667
miserror2 <- sum(test_data[,5]!=knn_iris2)/nrow(test_data)
miserror2
## [1] 0.03333333
CrossTable(x=test_data[,5],y=knn_iris2,prop.chisq = FALSE)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 30
##
##
## | knn_iris2
## test_data[, 5] | setosa | versicolor | virginica | Row Total |
## ---------------|------------|------------|------------|------------|
## setosa | 10 | 0 | 0 | 10 |
## | 1.000 | 0.000 | 0.000 | 0.333 |
## | 1.000 | 0.000 | 0.000 | |
## | 0.333 | 0.000 | 0.000 | |
## ---------------|------------|------------|------------|------------|
## versicolor | 0 | 11 | 0 | 11 |
## | 0.000 | 1.000 | 0.000 | 0.367 |
## | 0.000 | 0.917 | 0.000 | |
## | 0.000 | 0.367 | 0.000 | |
## ---------------|------------|------------|------------|------------|
## virginica | 0 | 1 | 8 | 9 |
## | 0.000 | 0.111 | 0.889 | 0.300 |
## | 0.000 | 0.083 | 1.000 | |
## | 0.000 | 0.033 | 0.267 | |
## ---------------|------------|------------|------------|------------|
## Column Total | 10 | 12 | 8 | 30 |
## | 0.333 | 0.400 | 0.267 | |
## ---------------|------------|------------|------------|------------|
##
##
Here we again have 100% accuracy for setosa and versicolor, but one virginica sample is predicted as versicolor, so virginica is classified with 8/9 = 88.9% accuracy (and only 91.7% of the versicolor predictions are correct, hence the 0.917 column proportion above).
The overall accuracy is 0.333 + 0.367 + 0.267 = 0.967, i.e. about 96.7%.
K-means clustering is an unsupervised machine learning algorithm. It groups similar data points together and discovers underlying patterns by identifying a fixed number (K) of clusters in the dataset. 'Means' refers to the averaging of the data, i.e. finding the centroid of each cluster. K is the number of centroids we need in the dataset, and a centroid is the (imaginary or real) location representing the center of a cluster.
The algorithm starts with a first group of randomly selected centroids, which are the starting points.
It then performs iterative (repetitive) calculations to optimize the positions of the centroids.
The process stops when the centroids are stabilized, i.e. their values don't change with further iterations, or when the defined number of iterations is reached.
The bigger the value of K, the lower the variance within the groups in the clustering. If K is equal to the number of observations, then each point is its own group and the variance is 0, so it's necessary to find an optimum number of clusters. Variance within a group measures how different the members of the group are: a large variance shows that there's more dissimilarity within the groups.
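To make the steps above concrete, here is a minimal sketch of a single assign-and-update iteration on the four numeric iris columns (the kmeans() call below does all of this for us, with multiple random starts):
X <- as.matrix(iris[, 1:4])
k <- 3
set.seed(13383610)
centroids <- X[sample(nrow(X), k), ]   # step 1: randomly chosen starting centroids
# Assign each point to its nearest centroid (squared Euclidean distance)
d <- sapply(1:k, function(j) colSums((t(X) - centroids[j, ])^2))
cluster <- max.col(-d)                 # index of the closest centroid per point
# Update each centroid to the mean of the points assigned to it
centroids <- apply(X, 2, function(col) tapply(col, cluster, mean))
# Repeating the assign/update steps until assignments stop changing gives K-means.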
set.seed(13383610)
input <- iris[,1:4]
kmeans_fit<-kmeans(input, centers = 3, nstart = 20)
kmeans_fit
## K-means clustering with 3 clusters of sizes 62, 50, 38
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 5.901613 2.748387 4.393548 1.433871
## 2 5.006000 3.428000 1.462000 0.246000
## 3 6.850000 3.073684 5.742105 2.071053
##
## Clustering vector:
## [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [38] 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [75] 1 1 1 3 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 3 3 3 3 1 3 3 3 3
## [112] 3 3 1 1 3 3 3 3 1 3 1 3 1 3 3 1 1 3 3 3 3 3 1 3 3 3 3 1 3 3 3 1 3 3 3 1 3
## [149] 3 1
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 15.15100 23.87947
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
The kmeans() function outputs the results of the clustering. The output shows the cluster to which each observation was allocated, the cluster means, and a percentage (88.4%) that represents the compactness of the clustering, i.e. how similar the members within the same group are. If all the observations within a group were at the exact same point in the n-dimensional space, we would achieve 100% compactness.
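As a quick check, the 88.4% compactness figure can be recomputed directly from the components returned by kmeans():
kmeans_fit$betweenss / kmeans_fit$totss   # between_SS / total_SS, roughly 0.884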
plotcluster(input, kmeans_fit$cluster)   # discriminant projection of the data, colored by cluster
table(kmeans_fit$cluster, iris$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 50 0 0
## 3 0 2 36
As we can see, the data belonging to the setosa species got grouped into cluster 2, most of versicolor into cluster 1, and most of virginica into cluster 3. The algorithm wrongly placed 2 versicolor points in the virginica-dominated cluster and 14 virginica points in the versicolor-dominated cluster.
Let's plot a chart showing the "within sum of squares" by the number of groups (K value). The within sum of squares is a metric that shows the dissimilarity within members of a group. The greater the sum, the greater the dissimilarity.
wssplot <- function(input, nc = 15, seed = 13383610) {
  # For K = 1 the within-group sum of squares is just the total sum of squares
  wss <- (nrow(input) - 1) * sum(apply(input, 2, var))
  # For K = 2 .. nc, run kmeans and record the total within-cluster sum of squares
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(input, centers = i)$withinss)
  }
  plot(1:nc, wss, type = "b", xlab = "Number of groups",
       ylab = "Sum of squares within a group")
}
wssplot(input, nc = 20)
We can see that going from K = 3 to K = 4 there is still a clear decrease in the sum of squares, which means dissimilarity within the groups decreases and compactness increases if we take K = 4. So, let's choose K = 4 and run K-means again.
kmeans_fit2<-kmeans(input, centers = 4, nstart = 20)
kmeans_fit2
## K-means clustering with 4 clusters of sizes 40, 32, 28, 50
##
## Cluster means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1 6.252500 2.855000 4.815000 1.625000
## 2 6.912500 3.100000 5.846875 2.131250
## 3 5.532143 2.635714 3.960714 1.228571
## 4 5.006000 3.428000 1.462000 0.246000
##
## Clustering vector:
## [1] 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
## [38] 4 4 4 4 4 4 4 4 4 4 4 4 4 1 1 1 3 1 3 1 3 1 3 3 3 3 1 3 1 3 3 1 3 1 3 1 1
## [75] 1 1 1 1 1 3 3 3 3 1 3 1 1 1 3 3 3 1 3 3 3 3 3 1 3 3 2 1 2 2 2 2 3 2 2 2 1
## [112] 1 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 1 1
## [149] 2 1
##
## Within cluster sum of squares by cluster:
## [1] 13.624750 18.703437 9.749286 15.151000
## (between_SS / total_SS = 91.6 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
Using 3 groups (K = 3) we had 88.4% of well-grouped data. Using 4 groups (K = 4), that value rose to 91.6%, which is a good value for us.
Hierarchical clustering is an alternative approach to clustering which builds a hierarchy from the bottom up and doesn't require us to specify the number of clusters beforehand. There are two types of hierarchical clustering:
Agglomerative: each data point is initially considered a separate cluster, and at each iteration the most similar clusters are merged until one cluster, or the desired number of clusters, remains.
Divisive: the opposite of agglomerative clustering. All data points start in a single cluster, which is then split repeatedly until we get the desired number of clusters.
The agglomerative procedure works as follows:
1) Compute the proximity/dissimilarity/distance matrix. This is the backbone of our clustering: a mathematical expression of how different, or distant, the data points are from each other.
2) Let each data point be a cluster.
3) Merge the two closest clusters based on the distances from the distance matrix; as a result, the number of clusters goes down by 1.
4) Update the proximity/distance matrix and repeat step 3 until the desired number of clusters remains.
Let us see how well the hierarchical clustering algorithm performs on our dataset. We will use hclust() for this, which requires the data in the form of a distance matrix; we will create one using dist().
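To see what such a distance matrix looks like, here is a minimal sketch for just the first five flowers (the full matrix passed to hclust() covers all 150 observations):
round(as.matrix(dist(iris[1:5, 1:4])), 2)   # pairwise Euclidean distances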
clusters <- hclust(dist(iris[, 1:4]))
In hierarchical clustering, we organize the objects into a hierarchy represented by a tree-like diagram called a dendrogram.
plot(clusters, xlab = "Clusters",
     ylab = "Height of dendrogram")
We'll cut our dendrogram to obtain 3 clusters and check how it performs.
clusterCut <- cutree(clusters, 3)
table(clusterCut, iris$Species)
##
## clusterCut setosa versicolor virginica
## 1 50 0 0
## 2 0 23 49
## 3 0 27 1
It looks like the algorithm successfully placed all flowers of species setosa in cluster 1 and nearly all virginica in cluster 2, but had trouble with versicolor, which is split between clusters 2 and 3.
Let us see if we can do better by using a different linkage method. This time, we will use the average (mean) linkage method.
clusters2 <- hclust(dist(iris[, 1:4]), method = 'average')
plot(clusters2, xlab = "Clusters",
     ylab = "Height of dendrogram")
Next, we'll cut the dendrogram in order to create the desired number of clusters. Since in this case we already know that there are three species, we will choose the number of clusters to be k = 3. We will use the cutree() function.
clusterCut2 <- cutree(clusters2, k= 3)
plot(clusters2, xlab = "Clusters",
     ylab = "Height of dendrogram")
rect.hclust(clusters2 , k = 3, border = 2:6)
abline(h = 3, col = 'red')
table(clusterCut2, iris$Species)
##
## clusterCut2 setosa versicolor virginica
## 1 50 0 0
## 2 0 50 14
## 3 0 0 36
We can see that this time the algorithm did a little better, but it still has problems separating virginica properly: 14 virginica samples end up in the versicolor-dominated cluster.