Loading the required packages

library(factoextra)
library(stringr)
library(tidyr)
library(gridExtra)
library(FunCluster)
library(rpart)
library(caret)
library(rattle)

Loading the data

#Loading the data
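# stringsAsFactors = TRUE keeps author as a factor (no longer the default from R 4.0 onwards)
data<-read.csv("Disputed_Essay_data.csv", stringsAsFactors = TRUE)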
#str(data)

#Summary of the authors
summary(data$author)

##    dispt Hamilton       HM      Jay  Madison 
##       11       51        3        5       15

Data Manipulation

Creating a new column with a short form of the author name:

data$owner <- ifelse(data$author == 'HM', 'HM', ifelse(data$author == 'Jay', "J", ifelse(data$author == 'Madison', 'M', ifelse(data$author == 'dispt', 'D', ifelse(data$author == 'Hamilton', 'H', NA)))))
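The same mapping can also be written with a named lookup vector, which some readers find easier to follow than nested ifelse() calls. The sketch below is only an illustration: the `author` vector is a made-up stand-in for data$author, and `short_codes` is a hypothetical name.

```r
# Hypothetical toy vector standing in for data$author
author <- c("dispt", "Hamilton", "HM", "Jay", "Madison", "Hamilton")

# Named lookup vector: names are the full labels, values are the short codes
short_codes <- c(dispt = "D", Hamilton = "H", HM = "HM", Jay = "J", Madison = "M")

# Index by name; as.character() guards against factor input,
# and unname() drops the names carried over from the lookup vector
owner <- unname(short_codes[as.character(author)])
print(owner)
```

Labels that are missing from the lookup vector come back as NA, which matches the NA fallback of the nested version.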

Splitting the file name & number:

data<-extract(data, filename, into = c("Name", "Num"), "([^(]+)\\s*[^0-9]+([0-9].).")

Creating a new column combining the author name along with the file number:

data$file<-paste(data$owner,"-",data$Num)

Column to Index:

rownames(data)<-data$file

Dropping the unwanted columns:

data<-data[c(-(ncol(data)-1))]
data<-data[c(-(ncol(data)))]
data<-data[c(-2,-3)]

Moving aside the files authored by Jay and Hamilton+Madison:

We are only concerned with whether Hamilton or Madison wrote the disputed articles, so we can go ahead and remove the essays by ‘Jay’ and the jointly authored ‘HM’ essays.

d <- data[data$author!="Jay",]
data <- d[d$author!="HM",]

Dropping unused levels:

data<-droplevels(data)
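To see why this step is needed, here is a small sketch on a made-up factor (not the essay data): subsetting removes the rows but keeps every original level, and droplevels() discards the levels that no longer occur.

```r
# Hypothetical toy factor mirroring the author labels
author <- factor(c("dispt", "Hamilton", "HM", "Jay", "Madison"))

# Subsetting removes the elements but keeps all five levels
kept <- author[!(author %in% c("Jay", "HM"))]
print(levels(kept))

# droplevels() discards the levels that no longer occur
print(levels(droplevels(kept)))
```

Leftover empty levels would otherwise show up as all-zero columns in later tables.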

Sample data post manipulation:

As we have made a few changes to the data, let us have a look at it.

head(data, 5)

##        author     a   all  also    an   and   any   are    as    at    be
## D - 49  dispt 0.280 0.052 0.009 0.096 0.358 0.026 0.131 0.122 0.017 0.411
## D - 50  dispt 0.177 0.063 0.013 0.038 0.393 0.063 0.051 0.139 0.114 0.393
## D - 51  dispt 0.339 0.090 0.008 0.030 0.301 0.008 0.068 0.203 0.023 0.474
## D - 52  dispt 0.270 0.024 0.016 0.024 0.262 0.056 0.064 0.111 0.056 0.365
## D - 53  dispt 0.303 0.054 0.027 0.034 0.404 0.040 0.128 0.148 0.013 0.344
##         been   but    by   can    do  down  even every  for.  from   had
## D - 49 0.026 0.009 0.140 0.035 0.026 0.000 0.009 0.044 0.096 0.044 0.035
## D - 50 0.165 0.000 0.139 0.000 0.013 0.000 0.025 0.000 0.076 0.101 0.101
## D - 51 0.015 0.038 0.173 0.023 0.000 0.008 0.015 0.023 0.098 0.053 0.008
## D - 52 0.127 0.032 0.167 0.056 0.000 0.000 0.024 0.040 0.103 0.079 0.016
## D - 53 0.047 0.061 0.209 0.088 0.000 0.000 0.020 0.027 0.141 0.074 0.000
##          has  have her   his   if.   in.  into    is    it   its   may
## D - 49 0.017 0.044   0 0.017 0.000 0.262 0.009 0.157 0.175 0.070 0.035
## D - 50 0.013 0.152   0 0.000 0.025 0.291 0.025 0.038 0.127 0.038 0.038
## D - 51 0.015 0.023   0 0.000 0.023 0.308 0.038 0.150 0.173 0.030 0.120
## D - 52 0.024 0.143   0 0.024 0.040 0.238 0.008 0.151 0.222 0.048 0.056
## D - 53 0.054 0.047   0 0.020 0.034 0.263 0.013 0.189 0.108 0.013 0.047
##         more  must my    no   not now    of    on   one  only    or   our
## D - 49 0.026 0.026  0 0.035 0.114   0 0.900 0.140 0.026 0.035 0.096 0.017
## D - 50 0.000 0.013  0 0.000 0.127   0 0.747 0.139 0.025 0.000 0.114 0.000
## D - 51 0.038 0.083  0 0.030 0.068   0 0.858 0.150 0.030 0.023 0.060 0.000
## D - 52 0.056 0.071  0 0.032 0.087   0 0.802 0.143 0.032 0.048 0.064 0.016
## D - 53 0.067 0.013  0 0.047 0.128   0 0.869 0.054 0.047 0.027 0.081 0.027
##        shall should    so  some  such  than  that   the their  then there
## D - 49 0.017  0.017 0.035 0.009 0.026 0.009 0.184 1.425 0.114 0.000 0.009
## D - 50 0.000  0.013 0.013 0.063 0.000 0.000 0.152 1.254 0.165 0.000 0.000
## D - 51 0.008  0.068 0.038 0.030 0.045 0.023 0.188 1.490 0.053 0.015 0.015
## D - 52 0.016  0.032 0.040 0.024 0.008 0.000 0.238 1.326 0.071 0.008 0.000
## D - 53 0.000  0.000 0.027 0.067 0.027 0.047 0.162 1.193 0.027 0.007 0.007
##        things  this    to up  upon   was  were  what  when which   who
## D - 49  0.009 0.044 0.507  0 0.000 0.009 0.017 0.000 0.009 0.175 0.044
## D - 50  0.000 0.051 0.355  0 0.013 0.051 0.000 0.000 0.000 0.114 0.038
## D - 51  0.000 0.075 0.361  0 0.000 0.008 0.015 0.008 0.000 0.105 0.008
## D - 52  0.000 0.103 0.532  0 0.000 0.087 0.079 0.008 0.024 0.167 0.000
## D - 53  0.000 0.094 0.485  0 0.000 0.027 0.020 0.020 0.007 0.155 0.027
##         will  with would your
## D - 49 0.009 0.087 0.192    0
## D - 50 0.089 0.063 0.139    0
## D - 51 0.173 0.045 0.068    0
## D - 52 0.079 0.079 0.064    0
## D - 53 0.168 0.074 0.040    0

Euclidean distance calculation & visualization:

The Euclidean distance measures the distance between two vectors; here we use it to measure how similar the files are. In the plot below, pairs of files whose cell is blue (low distance) are very similar, while pairs whose cell is red (high distance) are not.

# Exclude the author column so that only the word frequencies enter the distance
distance<-get_dist(data[c(-1)])
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
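As a small worked example of the distance itself, the sketch below uses base R only and the first three word columns (a, all, also) of three disputed essays, copied from the sample output above. The real analysis uses all 70 word columns, so these numbers are illustrative only.

```r
# Word frequencies copied from the sample output (columns a, all, also)
freq <- rbind(
  "D - 49" = c(a = 0.280, all = 0.052, also = 0.009),
  "D - 50" = c(a = 0.177, all = 0.063, also = 0.013),
  "D - 51" = c(a = 0.339, all = 0.090, also = 0.008)
)

# Euclidean distance by hand: square root of the summed squared differences
manual_49_50 <- sqrt(sum((freq["D - 49", ] - freq["D - 50", ])^2))

# The same quantity from base R's dist(), which returns all pairwise distances
d <- dist(freq, method = "euclidean")
dist_49_50 <- as.matrix(d)["D - 49", "D - 50"]

print(d)
print(all.equal(manual_49_50, dist_49_50))
```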

K-means - Default

Clustering is an unsupervised learning technique. It is the task of grouping a set of objects so that objects in the same cluster are more similar to each other than to objects in other clusters. Similarity is a measure of the strength of the relationship between two data objects. Clustering is mainly used for exploratory data mining, in fields such as machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.

set.seed(42)
def <- kmeans(data[c(-1)], centers = 5)
t(table(data[,1],def$cluster))

##    
##     dispt Hamilton Madison
##   1     4        1       8
##   2     4        0       7
##   3     2       13       0
##   4     1       10       0
##   5     0       27       0

From the above result we can see that the disputed articles are spread across several clusters instead of being grouped with a single author. The likely reason is that we asked for too many clusters, so we have to find the optimal number of clusters to get a more reliable answer. Let us have a look at the clusters that we have so far.

Plotting the clusters

fviz_cluster(def, data = data[c(-1)])

Finding the optimal number of clusters

set.seed(123)
wss <- function(k){
  return(kmeans(data[c(-1)], k, nstart = 25)$tot.withinss)
}

k_values <- 1:10

wss_values <- purrr::map_dbl(k_values, wss)

plot(x = k_values, y = wss_values, 
     type = "b", frame = FALSE,
     xlab = "Number of clusters K",
     ylab = "Total within-clusters sum of square")

From the above graph, the curve flattens out after about 4 clusters (the "elbow"), so we take 4 as the optimal number of clusters for this dataset.
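Reading an elbow off a plot is a judgment call. One way to make it a little more explicit is to look at how much the within-cluster sum of squares drops each time a cluster is added. The sketch below does this on made-up toy data with three obvious groups (not the essay data); `toy` and `rel_drop` are hypothetical names.

```r
set.seed(1)
# Hypothetical toy data: three well-separated 2-D groups of 20 points each
toy <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

k_values <- 1:6
# Total within-cluster sum of squares for each k
wss_values <- vapply(
  k_values,
  function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss,
  numeric(1)
)

# Relative improvement gained by adding one more cluster
rel_drop <- -diff(wss_values) / head(wss_values, -1)
print(round(data.frame(from_k = head(k_values, -1),
                       to_k = k_values[-1],
                       rel_drop = rel_drop), 3))
```

On data like this the drop is large up to the true number of groups and small afterwards; on real data the pattern is usually less clear-cut, so the plot should still be inspected.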

set.seed(48)
def <- kmeans(data[c(-1)], centers = 4, nstart = 15, iter.max = 100)
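# Use a descriptive name instead of t, which would shadow the transpose function
cluster_table <- t(table(data[,1],def$cluster))
cluster_table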

##    
##     dispt Hamilton Madison
##   1     7        1       8
##   2     0       28       0
##   3     4        0       7
##   4     0       22       0

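As we can see from the above result, all 11 disputed articles fall into clusters 1 and 3, which together contain all 15 Madison articles and only 1 Hamilton article. This suggests that the disputed articles were authored by Madison.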
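The reading of the table can be checked with a few lines of base R. The sketch below re-enters the counts printed above by hand and assigns each cluster to whichever known author holds the majority in it; the majority rule is an illustrative assumption, not part of k-means itself.

```r
# Counts copied from the k = 4 contingency table above
cluster_table <- matrix(
  c(7,  1, 8,
    0, 28, 0,
    4,  0, 7,
    0, 22, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(cluster = as.character(1:4),
                  author = c("dispt", "Hamilton", "Madison"))
)

# Majority known author in each cluster (ignoring the disputed column)
known <- cluster_table[, c("Hamilton", "Madison")]
majority <- colnames(known)[max.col(known, ties.method = "first")]

# Disputed essays attributed to each author via their cluster's majority
disputed_by_author <- tapply(cluster_table[, "dispt"], majority, sum)
print(disputed_by_author)
```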

Plotting the Clusters

fviz_cluster(def, data = data[c(-1)])

Cluster Growth:

Let us have another look at how the cluster formation varies as the number of clusters gradually increases.

k2 <- kmeans(data[c(-1)], centers = 2, nstart = 25)
k3 <- kmeans(data[c(-1)], centers = 3, nstart = 25)
k4 <- kmeans(data[c(-1)], centers = 4, nstart = 25)
k5 <- kmeans(data[c(-1)], centers = 5, nstart = 25)
k6 <- kmeans(data[c(-1)], centers = 6, nstart = 25)
k7 <- kmeans(data[c(-1)], centers = 7, nstart = 25)

Plotting the clusters

p2 <- fviz_cluster(k2, geom = "point", data = data[c(-1)]) + ggtitle("k = 2")
p3 <- fviz_cluster(k3, geom = "point",  data = data[c(-1)]) + ggtitle("k = 3")
p4 <- fviz_cluster(k4, geom = "point",  data = data[c(-1)]) + ggtitle("k = 4")
p5 <- fviz_cluster(k5, geom = "point",  data = data[c(-1)]) + ggtitle("k = 5")
p6 <- fviz_cluster(k6, geom = "point",  data = data[c(-1)]) + ggtitle("k = 6")
p7 <- fviz_cluster(k7, geom = "point",  data = data[c(-1)]) + ggtitle("k = 7")

grid.arrange(p2, p3, p4, p5, p6, p7, nrow = 3)

Hierarchical Clustering

Hierarchical clustering, also known as hierarchical cluster analysis, is an algorithm that groups similar objects into groups called clusters. The endpoint is a set of clusters, where each cluster is distinct from each other cluster, and the objects within each cluster are broadly similar to each other.

hac_output <- hclust(dist(data[c(-1)], method = "euclidean"), method = "ward.D2")

Plot the hierarchical clustering

plot.new()
plot(hac_output,main="Dendrogram using HAC algorithm",xlab = "Author", ylab = "Euclidean Distance", cex = 0.6, hang = -1)
rect.hclust(hac_output, k=4)

Here too, we can clearly see that the disputed articles have been clustered together with the articles authored by Madison.
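A dendrogram can be hard to read with many leaves. As a complement, cutree() converts the tree into flat cluster labels that can be cross-tabulated against known labels. The sketch below uses made-up toy data with two groups (not the essay data) and the same distance and linkage settings as above; `toy` and `labels` are hypothetical.

```r
set.seed(7)
# Hypothetical toy data: two groups of 2-D points with made-up labels
toy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.2), ncol = 2)
)
labels <- rep(c("A", "B"), each = 10)

# Same distance and linkage choices as in the report
hc <- hclust(dist(toy, method = "euclidean"), method = "ward.D2")

# cutree() turns the dendrogram into k flat cluster assignments
membership <- cutree(hc, k = 2)

# Cross-tabulate clusters against the known labels
tab <- table(cluster = membership, label = labels)
print(tab)
```

Applied to the essay data, the same idea would be to tabulate cutree(hac_output, k = 4) against data$author.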

Decision Tree Algorithm

The decision tree algorithm belongs to the family of supervised learning algorithms. It can be used to solve both regression and classification problems.

The general aim of using a decision tree is to create a model that can predict the class or value of a target variable by learning decision rules inferred from prior data (the training data).

Train and Test Split

Splitting the data into training and testing based on the author name.

test <- data[data$author=="dispt",]
train <- data[data$author!="dispt",]

Dropping Unused Levels

train<-droplevels(train)
test<-droplevels(test)

Training the model with the training dataset

Let us now fit a decision tree to the training data. In the prediction step, we set ‘type’ to "prob" so that class probabilities are returned instead of class labels.

dt_model <- train(author ~ ., data = train, metric = "Accuracy", method = "rpart")
dt_predict <- predict(dt_model, newdata = test, na.action = na.omit, type = "prob")
head(dt_predict, 11)

##        Hamilton Madison
## D - 49   0.0625  0.9375
## D - 50   0.0625  0.9375
## D - 51   0.0625  0.9375
## D - 52   0.0625  0.9375
## D - 53   0.0625  0.9375
## D - 54   0.0625  0.9375
## D - 55   0.0625  0.9375
## D - 56   0.0625  0.9375
## D - 57   0.0625  0.9375
## D - 62   0.0625  0.9375
## D - 63   0.0625  0.9375

Thus, the model assigns each disputed article to Madison with a probability of 93.75%.
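The identical probability for every essay is what one would expect if all of them land in the same leaf of the tree, since a leaf's class probabilities are simply the class proportions of the training rows in it. The sketch below assumes a leaf with 16 training essays, 1 by Hamilton and 15 by Madison; that composition is an assumption here, although it matches the leaf printed for the cross-validated tree later in this report.

```r
# Assumed leaf composition: 16 training essays, 1 by Hamilton and 15 by Madison
leaf_counts <- c(Hamilton = 1, Madison = 15)

# Class probabilities in a leaf are the class proportions of its training rows
leaf_probs <- leaf_counts / sum(leaf_counts)
print(leaf_probs)
```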

Printing the final model

print(dt_model)

## CART 
## 
## 66 samples
## 70 predictors
##  2 classes: 'Hamilton', 'Madison' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 66, 66, 66, 66, 66, 66, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa    
##   0.0000000  0.9393337  0.7711225
##   0.4666667  0.9393337  0.7711225
##   0.9333333  0.8584306  0.4312663
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.4666667.

Plotting the final model

fancyRpartPlot(dt_model$finalModel)

Model Prediction - ‘RAW’

dt_predict2 <- predict(dt_model, newdata = test, type = "raw")
print(dt_predict2)

##  [1] Madison Madison Madison Madison Madison Madison Madison Madison
##  [9] Madison Madison Madison
## Levels: Hamilton Madison

From the predictions of type ‘raw’, we can reconfirm that the disputed articles were authored by Madison.

Model Tuning & Pruning

dt_model_preprune <- train(author ~ ., data = train, method = "rpart",
                           metric = "Accuracy",
                           tuneLength = 8,
                           control = rpart.control(minsplit = 50, minbucket = 20, maxdepth = 6))
print(dt_model_preprune$finalModel)

## n= 66 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 66 15 Hamilton (0.7727273 0.2272727)  
##   2) upon>=0.0235 46  0 Hamilton (1.0000000 0.0000000) *
##   3) upon< 0.0235 20  5 Madison (0.2500000 0.7500000) *

Plotting the new model

fancyRpartPlot(dt_model_preprune$finalModel)

In both models above, the word ‘upon’ plays the decisive role: the frequency of this single word seems to determine the authorship of the whole file (surprisingly!). The stricter tuning and pruning settings raise the split threshold from 0.019 to about 0.024. If the frequency of ‘upon’ is at or above the threshold, the file is attributed to Hamilton; otherwise, it is attributed to Madison.
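Because the tree has a single split, its decision rule fits in one line. The sketch below applies it to the ‘upon’ frequencies of the first five disputed essays, copied from the sample output earlier in this report; `classify_by_upon` is a hypothetical helper written for illustration, not part of rpart or caret.

```r
# 'upon' frequencies of the first five disputed essays, copied from the sample output
upon <- c("D - 49" = 0.000, "D - 50" = 0.013, "D - 51" = 0.000,
          "D - 52" = 0.000, "D - 53" = 0.000)

# Hypothetical helper: apply a single-split rule on the 'upon' frequency
classify_by_upon <- function(freq, threshold) {
  ifelse(freq >= threshold, "Hamilton", "Madison")
}

# Compare the two thresholds reported by the fitted trees
print(classify_by_upon(upon, threshold = 0.019))
print(classify_by_upon(upon, threshold = 0.0235))
```

For these five essays the two thresholds give the same answer, since every frequency lies below both.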

Cross-Validation

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
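The mechanics of k-fold cross-validation can be sketched in a few lines of base R. The example below only builds the folds for 66 rows (the size of the training set reported earlier) and 3 folds; it does not fit a model. It is a simplified illustration with plain random folds, whereas caret stratifies its folds by class.

```r
set.seed(10)
n <- 66          # number of training essays reported by the fitted models
k <- 3           # number of folds

# Randomly assign each row index to one of k equally sized folds
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)      # rows used to score the model
  fit_rows <- which(fold_id != fold)      # rows used to fit the model
  cat("Fold", fold, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, so every essay contributes to the accuracy estimate without having been used to fit the model that scores it.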

tr_control <- trainControl(method = "cv", number = 3)
# trControl is an argument of train(), not of rpart.control()
# cp is left out of rpart.control() because caret tunes it over tuneLength values
dt_model_cv <- train(author ~ ., data = train, method = "rpart",
                           metric = "Accuracy",
                           tuneLength = 8,
                           trControl = tr_control,
                           control = rpart.control(minsplit = 30, minbucket = 10, maxdepth = 5))

print(dt_model_cv$finalModel)

## n= 66 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 66 15 Hamilton (0.7727273 0.2272727)  
##   2) upon>=0.019 50  0 Hamilton (1.0000000 0.0000000) *
##   3) upon< 0.019 16  1 Madison (0.0625000 0.9375000) *

dt_predict3 <- predict(dt_model_cv, newdata = test, type = "raw")
print(dt_predict3)

##  [1] Madison Madison Madison Madison Madison Madison Madison Madison
##  [9] Madison Madison Madison
## Levels: Hamilton Madison

Conclusion

Both the clustering and the decision tree analyses point to the same conclusion: the disputed articles were authored by Madison.

IST 707 HW 2 - Cluster Analysis & Decision Tree Induction

Henglong

10/10/2019