library(factoextra)
library(stringr)
library(tidyr)
library(gridExtra)
library(FunCluster)
library(rpart)
library(caret)
library(rattle)
#Loading the data
data<-read.csv("Disputed_Essay_data.csv", stringsAsFactors = TRUE)
#str(data)
#Summary of the authors
summary(data$author)
## dispt Hamilton HM Jay Madison
## 11 51 3 5 15
data<-extract(data, filename, into = c("Name", "Num"), "([^(]+)\\s*[^0-9]+([0-9].).")
rownames(data)<-data$file
data<-data[c(-(ncol(data)-1))]
data<-data[c(-(ncol(data)))]
data<-data[c(-2,-3)]
data<-droplevels(data)
Since we have made a few changes to the data, let us take a look at it.
head(data, 5)
## author a all also an and any are as at be
## D - 49 dispt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411
## D - 50 dispt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393
## D - 51 dispt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474
## D - 52 dispt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365
## D - 53 dispt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344
## been but by can do down even every for. from had
## D - 49 0.026 0.009 0.140 0.035 0.026 0.000 0.009 0.044 0.096 0.044 0.035
## D - 50 0.165 0.000 0.139 0.000 0.013 0.000 0.025 0.000 0.076 0.101 0.101
## D - 51 0.015 0.038 0.173 0.023 0.000 0.008 0.015 0.023 0.098 0.053 0.008
## D - 52 0.127 0.032 0.167 0.056 0.000 0.000 0.024 0.040 0.103 0.079 0.016
## D - 53 0.047 0.061 0.209 0.088 0.000 0.000 0.020 0.027 0.141 0.074 0.000
## has have her his if. in. into is it its may
## D - 49 0.017 0.044 0 0.017 0.000 0.262 0.009 0.157 0.175 0.070 0.035
## D - 50 0.013 0.152 0 0.000 0.025 0.291 0.025 0.038 0.127 0.038 0.038
## D - 51 0.015 0.023 0 0.000 0.023 0.308 0.038 0.150 0.173 0.030 0.120
## D - 52 0.024 0.143 0 0.024 0.040 0.238 0.008 0.151 0.222 0.048 0.056
## D - 53 0.054 0.047 0 0.020 0.034 0.263 0.013 0.189 0.108 0.013 0.047
## more must my no not now of on one only or our
## D - 49 0.026 0.026 0 0.035 0.114 0 0.900 0.140 0.026 0.035 0.096 0.017
## D - 50 0.000 0.013 0 0.000 0.127 0 0.747 0.139 0.025 0.000 0.114 0.000
## D - 51 0.038 0.083 0 0.030 0.068 0 0.858 0.150 0.030 0.023 0.060 0.000
## D - 52 0.056 0.071 0 0.032 0.087 0 0.802 0.143 0.032 0.048 0.064 0.016
## D - 53 0.067 0.013 0 0.047 0.128 0 0.869 0.054 0.047 0.027 0.081 0.027
## shall should so some such than that the their then there
## D - 49 0.017 0.017 0.035 0.009 0.026 0.009 0.184 1.425 0.114 0.000 0.009
## D - 50 0.000 0.013 0.013 0.063 0.000 0.000 0.152 1.254 0.165 0.000 0.000
## D - 51 0.008 0.068 0.038 0.030 0.045 0.023 0.188 1.490 0.053 0.015 0.015
## D - 52 0.016 0.032 0.040 0.024 0.008 0.000 0.238 1.326 0.071 0.008 0.000
## D - 53 0.000 0.000 0.027 0.067 0.027 0.047 0.162 1.193 0.027 0.007 0.007
## things this to up upon was were what when which who
## D - 49 0.009 0.044 0.507 0 0.000 0.009 0.017 0.000 0.009 0.175 0.044
## D - 50 0.000 0.051 0.355 0 0.013 0.051 0.000 0.000 0.000 0.114 0.038
## D - 51 0.000 0.075 0.361 0 0.000 0.008 0.015 0.008 0.000 0.105 0.008
## D - 52 0.000 0.103 0.532 0 0.000 0.087 0.079 0.008 0.024 0.167 0.000
## D - 53 0.000 0.094 0.485 0 0.000 0.027 0.020 0.020 0.007 0.155 0.027
## will with would your
## D - 49 0.009 0.087 0.192 0
## D - 50 0.089 0.063 0.139 0
## D - 51 0.173 0.045 0.068 0
## D - 52 0.079 0.079 0.064 0
## D - 53 0.168 0.074 0.040 0
The Euclidean distance measures the distance between two vectors; here we use it to measure the similarity between the files. In the plot below, pairs of files whose cell is blue (low distance) are very similar, and pairs whose cell is red (high distance) are not.
distance<-get_dist(data)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
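As a small illustration of what the distance matrix contains, the sketch below computes the Euclidean distance between two vectors by hand and compares it with base R's dist(). The two vectors are made-up values chosen only for illustration; they are not rows of the essay data.

```r
# Two hypothetical word-frequency vectors (made-up values, not from the data set)
x <- c(0.28, 0.05, 0.36)
y <- c(0.18, 0.06, 0.39)

# Euclidean distance: square root of the sum of squared differences
manual <- sqrt(sum((x - y)^2))

# dist() computes the same quantity for every pair of rows of a matrix
builtin <- as.numeric(dist(rbind(x, y), method = "euclidean"))

print(manual)
print(all.equal(manual, builtin))
```

The smaller this value, the more alike the two frequency profiles are, which is what the colour scale in the plot encodes.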
Clustering is an unsupervised learning technique. It is the task of grouping a set of objects so that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure that reflects the strength of the relationship between two data objects. Clustering is mainly used for exploratory data mining. It is used in many fields such as machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
set.seed(42)
def <- kmeans(data[c(-1)], centers = 5)
t(table(data[,1],def$cluster))
##
## dispt Hamilton Madison
## 1 4 1 8
## 2 4 0 7
## 3 2 13 0
## 4 1 10 0
## 5 0 27 0
From the above result we can see that the disputed articles are spread across clusters dominated by both authors. The likely reason is that we used too many clusters, so we need to find the optimal number of clusters to get a clearer answer. Let us have a look at the clusters that we have so far.
fviz_cluster(def, data = data[c(-1)])
set.seed(123)
wss <- function(k){
return(kmeans(data[c(-1)], k, nstart = 25)$tot.withinss)
}
k_values <- 1:10
wss_values <- purrr::map_dbl(k_values, wss)
plot(x = k_values, y = wss_values,
type = "b", frame = FALSE,
xlab = "Number of clusters K",
ylab = "Total within-clusters sum of square")
Judging by the elbow in the above graph, 4 is a suitable number of clusters for this dataset.
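To make the quantity on the vertical axis concrete, here is a minimal sketch that recomputes the total within-cluster sum of squares by hand and compares it with the tot.withinss value returned by kmeans(). It runs on hypothetical toy data (two made-up groups of 2-D points), not on the essay data.

```r
set.seed(1)
# Hypothetical toy data: two well-separated groups of 2-D points
toy <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))

fit <- kmeans(toy, centers = 2, nstart = 10)

# Squared distance of each point to the centre of its assigned cluster
assigned_centres <- fit$centers[fit$cluster, , drop = FALSE]
manual_wss <- sum((toy - assigned_centres)^2)

print(all.equal(manual_wss, fit$tot.withinss))

# The value typically keeps shrinking as k grows, which is why we look
# for the point where the improvement levels off (the elbow)
wss_by_k <- sapply(1:4, function(k) kmeans(toy, centers = k, nstart = 10)$tot.withinss)
print(round(wss_by_k, 2))
```

On data with two clear groups, the drop from k = 1 to k = 2 is large and later drops are small, which is the pattern the elbow method looks for.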
set.seed(48)
def <- kmeans(data[c(-1)], centers = 4, nstart = 15, iter.max = 100)
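#Avoid naming the result 't', which would mask the base transpose function t()
cluster_table <- t(table(data[,1],def$cluster))
cluster_table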
##
## dispt Hamilton Madison
## 1 7 1 8
## 2 0 28 0
## 3 4 0 7
## 4 0 22 0
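As the above result shows, all 11 disputed articles fall into clusters 1 and 3, the two clusters that contain every Madison essay and only one Hamilton essay. This suggests that the disputed articles were authored by Madison.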
fviz_cluster(def, data = data[c(-1)])
Let us take another look at how the cluster formation varies as the number of clusters gradually increases.
k2 <- kmeans(data[c(-1)], centers = 2, nstart = 25)
k3 <- kmeans(data[c(-1)], centers = 3, nstart = 25)
k4 <- kmeans(data[c(-1)], centers = 4, nstart = 25)
k5 <- kmeans(data[c(-1)], centers = 5, nstart = 25)
k6 <- kmeans(data[c(-1)], centers = 6, nstart = 25)
k7 <- kmeans(data[c(-1)], centers = 7, nstart = 25)
p2 <- fviz_cluster(k2, geom = "point", data = data[c(-1)]) + ggtitle("k = 2")
p3 <- fviz_cluster(k3, geom = "point", data = data[c(-1)]) + ggtitle("k = 3")
p4 <- fviz_cluster(k4, geom = "point", data = data[c(-1)]) + ggtitle("k = 4")
p5 <- fviz_cluster(k5, geom = "point", data = data[c(-1)]) + ggtitle("k = 5")
p6 <- fviz_cluster(k6, geom = "point", data = data[c(-1)]) + ggtitle("k = 6")
p7 <- fviz_cluster(k7, geom = "point", data = data[c(-1)]) + ggtitle("k = 7")
grid.arrange(p2, p3, p4, p5, p6, p7, nrow = 3)
Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.
hac_output <- hclust(dist(data[c(-1)], method = "euclidean"), method = "ward.D2")
plot.new()
plot(hac_output, main = "Dendrogram using HAC algorithm", xlab = "Author", ylab = "Euclidean Distance", cex = 0.6, hang = -1)
rect.hclust(hac_output, k=4)
Even here, we can clearly see that the disputed articles have been clustered together with the articles authored by Madison.
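The dendrogram can also be summarised as a table. The following sketch, run on hypothetical toy data with two made-up authors "A" and "B", uses the same distance and linkage settings as the hclust() call in this section and then cuts the tree with cutree() to cross-tabulate clusters against labels.

```r
set.seed(2)
# Hypothetical toy data: 6 "essays" from each of two made-up authors
toy <- rbind(matrix(rnorm(12, mean = 0, sd = 0.2), ncol = 2),
             matrix(rnorm(12, mean = 2, sd = 0.2), ncol = 2))
label <- rep(c("A", "B"), each = 6)

# Euclidean distance with Ward linkage, as in the hclust() call of this section
hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# cutree() returns the cluster membership for a given number of clusters
groups <- cutree(hc, k = 2)
tab <- table(label, groups)
print(tab)
```

Applied to the essay data, the same kind of table would show which authors share a cluster with the disputed articles without having to read the labels off the plot.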
The decision tree algorithm belongs to the family of supervised learning algorithms. It can be used to solve both regression and classification problems.
The general goal of using a decision tree is to build a model that can predict the class or value of a target variable by learning decision rules inferred from prior data (the training data).
We split the data into training and test sets based on the author name: the disputed essays form the test set, and the essays of known authorship form the training set.
test <- data[data$author=="dispt",]
train <- data[data$author!="dispt",]
train<-droplevels(train)
test<-droplevels(test)
Let us now fit a decision tree to the training data. In the prediction step, we set ‘type’ to "prob" so that predict() returns class probabilities.
dt_model <- train(author ~ ., data = train, metric = "Accuracy", method = "rpart")
dt_predict <- predict(dt_model, newdata = test, na.action = na.omit, type = "prob")
head(dt_predict, 11)
## Hamilton Madison
## D - 49 0.0625 0.9375
## D - 50 0.0625 0.9375
## D - 51 0.0625 0.9375
## D - 52 0.0625 0.9375
## D - 53 0.0625 0.9375
## D - 54 0.0625 0.9375
## D - 55 0.0625 0.9375
## D - 56 0.0625 0.9375
## D - 57 0.0625 0.9375
## D - 62 0.0625 0.9375
## D - 63 0.0625 0.9375
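Thus, the model assigns every disputed article to Madison with a probability of 93.75%. This value is the share of Madison essays among the training essays that fall into the same leaf of the tree.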
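The sketch below illustrates where such a leaf probability comes from. It fits rpart() directly (rather than through caret) on a hypothetical toy data set in which 15 of the 16 low-frequency rows carry the label "M"; the column names freq and author and all values are assumptions made for illustration.

```r
library(rpart)

# Hypothetical training set: 16 low-frequency rows (15 "M", 1 "H") and
# 20 high-frequency rows (all "H"); the values are made up for illustration
toy <- data.frame(
  freq   = c(seq(0.000, 0.015, length.out = 16), seq(0.030, 0.100, length.out = 20)),
  author = factor(c(rep("M", 7), "H", rep("M", 8), rep("H", 20)))
)

fit <- rpart(author ~ freq, data = toy, method = "class")

# A new observation that falls into the low-frequency leaf
new_obs <- data.frame(freq = 0.005)
probs <- predict(fit, newdata = new_obs, type = "prob")
print(probs)
```

The probability reported for "M" is simply the class proportion in the leaf, here 15 out of 16, so identical probabilities for all test rows mean that they all landed in the same leaf.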
print(dt_model)
## CART
##
## 66 samples
## 70 predictors
## 2 classes: 'Hamilton', 'Madison'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 66, 66, 66, 66, 66, 66, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.0000000 0.9393337 0.7711225
## 0.4666667 0.9393337 0.7711225
## 0.9333333 0.8584306 0.4312663
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.4666667.
fancyRpartPlot(dt_model$finalModel)
dt_predict2 <- predict(dt_model, newdata = test, type = "raw")
print(dt_predict2)
## [1] Madison Madison Madison Madison Madison Madison Madison Madison
## [9] Madison Madison Madison
## Levels: Hamilton Madison
The predictions of type ‘raw’, which return class labels instead of probabilities, confirm that the disputed articles were authored by Madison.
dt_model_preprune <- train(author ~ ., data = train, method = "rpart",
metric = "Accuracy",
tuneLength = 8,
control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 6))
print(dt_model_preprune$finalModel)
## n= 66
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 66 15 Hamilton (0.7727273 0.2272727)
## 2) upon>=0.0235 46 0 Hamilton (1.0000000 0.0000000) *
## 3) upon< 0.0235 20 5 Madison (0.2500000 0.7500000) *
fancyRpartPlot(dt_model_preprune$finalModel)
In both models above, the word ‘upon’ plays the decisive role: its frequency alone determines the predicted authorship of a whole file (surprisingly!). The pre-pruning constraints raised the split threshold from 0.019 to 0.024. If the frequency of ‘upon’ is at or above the threshold, the file is attributed to Hamilton; otherwise, it is attributed to Madison.
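Because each tree consists of a single split, it can be written as a one-line rule. The helper classify_by_upon() below is hypothetical and only restates the split printed in the model output; the frequencies are the ‘upon’ values of essays D - 49 to D - 53 copied from the head(data, 5) output.

```r
# Hypothetical helper: attribute an essay from its 'upon' frequency alone.
# 0.019 and 0.0235 are the split values printed in the model outputs.
classify_by_upon <- function(upon_freq, threshold = 0.019) {
  ifelse(upon_freq >= threshold, "Hamilton", "Madison")
}

# 'upon' frequencies of essays D - 49 to D - 53, taken from the head(data, 5) output
upon_freq <- c(0.000, 0.013, 0.000, 0.000, 0.000)

print(classify_by_upon(upon_freq))
print(classify_by_upon(upon_freq, threshold = 0.0235))
```

All five frequencies lie below both thresholds, so the rule attributes these essays to Madison under either model.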
Cross-validation is a statistical method used to estimate the skill of machine learning models.
It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
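To show what k-fold cross-validation does under the hood, here is a minimal hand-written sketch with k = 3. It uses rpart() directly on hypothetical toy data (one made-up predictor x and two made-up classes); caret automates the same procedure and additionally tunes the model.

```r
library(rpart)
set.seed(3)

# Hypothetical toy data: one predictor that separates two classes
n <- 60
toy <- data.frame(x = c(rnorm(n / 2, mean = 0), rnorm(n / 2, mean = 4)),
                  y = factor(rep(c("A", "B"), each = n / 2)))

# Randomly assign every row to one of k folds of equal size
k <- 3
fold <- sample(rep(1:k, length.out = n))

# For each fold: fit on the other folds, measure accuracy on the held-out fold
fold_accuracy <- sapply(1:k, function(i) {
  fit  <- rpart(y ~ x, data = toy[fold != i, ], method = "class")
  pred <- predict(fit, newdata = toy[fold == i, ], type = "class")
  mean(pred == toy$y[fold == i])
})

print(fold_accuracy)
print(mean(fold_accuracy))  # cross-validated accuracy estimate
```

Every row is used for testing exactly once, and the average of the fold accuracies is the cross-validated estimate.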
tr_control <- trainControl(method = "cv", number = 3)
dt_model_cv <- train(author ~ ., data = train, method = "rpart",
metric = "Accuracy",
tuneLength = 8,
#trControl is an argument of train(); inside rpart.control() it is silently ignored
trControl = tr_control,
control = rpart.control(minsplit = 30, minbucket = 10, maxdepth = 5))
print(dt_model_cv$finalModel)
## n= 66
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 66 15 Hamilton (0.7727273 0.2272727)
## 2) upon>=0.019 50 0 Hamilton (1.0000000 0.0000000) *
## 3) upon< 0.019 16 1 Madison (0.0625000 0.9375000) *
dt_predict3 <- predict(dt_model_cv, newdata = test, type = "raw")
print(dt_predict3)
## [1] Madison Madison Madison Madison Madison Madison Madison Madison
## [9] Madison Madison Madison
## Levels: Hamilton Madison
We can therefore conclude that the disputed articles were authored by Madison.