Assign2-707

library(knitr)
knitr::opts_chunk$set(fig.width = 12,fig.height = 8)

Part 1. Data Cleaning and preparation

Step 1.1 Data cleaning

library(readr)
# read the dataset
EssayData <- read_csv("Disputed_Essay_data.csv")
# Create dataset including only explanatory factors
DfNum <- EssayData[,-1:-2]
sum(!complete.cases(EssayData))

## [1] 0

This Dataset has no missing value, which is clean enough.

Step 1.2 Data Preparation for Clustering analysis

# Create vectors for Author labels
index <- rep(NA,85)
for (i in 1:85){
  index[i] <- paste(substr(EssayData$author[i],1,2),i)  
}
# Change rownames for the dataframe for futher demonstration of Clustering/PCA methods
rownames(DfNum) <- index

Step 1.3 Data prepartion for Decision tree modeling

DFTree <- EssayData[,-2]
# Using stratified sampling methods to split the dataset into training and testing datasets
Ham <- DFTree[12:62,]
Mad <- DFTree[71:85,]

SizeH <- floor(nrow(Ham)*0.8) # sample size from Hamilton
SizeM <- floor(nrow(Mad)*0.8) # sample size from not Madison


#Sample index generating
H <- sample(12:62,SizeH,replace = F)
M <- sample(71:85,SizeM,replace = F)

# Training Data
TrDF <- DFTree[c(H,M),]
# Testing Data
TeDF <- DFTree[-c(H,M,63:70),]
TeDF <- TeDF[-1:-11,]
# Prediction Data
PreDF <- DFTree[1:11,]

colnames(TrDF) <- make.names(colnames(TrDF))
colnames(TeDF) <- make.names(colnames(TeDF))
colnames(PreDF) <- make.names(colnames(PreDF))

Part 2. Clustering and Decision Tree Modelling/Tuning

Part 2.1 Clustering Analysis

Step 2.1 Data Standardization

# Data Standardization
DfNum <- scale(DfNum,center = T,scale = T)

# summary(DfNum)
# Distance Function and visualization
library(factoextra)
# Initial Visualization
distance <- get_dist(DfNum,method = 'euclidean')
fviz_dist(distance,gradient = list(low='#00AFBB',mid='white',high='#FC4E07'))

From the graph, we can see that Dispute 1,3,4,5,6,7,8 is more similar to Maison,While dipute 2 is more similar to joint authorship-HM, and Ja 66, Ma 83,Ma76 also share more features with HM.

Step 2.2 Opimize numbers of clusters

# Elbow method to decide Optimal Number of clusters
set.seed(8)
wss <- function(k){
  return(kmeans(DfNum,k,nstart = 25)$tot.withinss)
}
k_value <- 1:10
wss_value <- purrr::map_dbl(k_value,wss)
plot(x=k_value,y=wss_value,
     type = 'b',frame=F,
     xlab = 'Number of clusters K',
     ylab = 'Total within-clusters sum of square')

# There is no apparent bend(knee) in this plot, it's hard to decide the k clusters by this

# Using Silhouette method
fviz_nbclust(DfNum,kmeans,method = 'silhouette')+labs(sutitle='Silhouette method')

# Gap statistics
set.seed(123)
fviz_nbclust(DfNum,kmeans,nstart=25,method='gap_stat',nboot=50)+labs(subtitle = 'Gap Statistic method')

# From the gap statistics, either choose 2 or 4 as numbers of clusters

Step 2.3 Kmeans function

Observation: * There is bigger difference while using 2 clusters, which indicates groups between Ha and non-Han. * Dipute 1, 3, 6, 7 and 10 is both close to Halmiton. * Di 4,5,9 are exclusively close to Ma. * Di 2 is close to Ma and HM

# Using 4
km_output4 <- kmeans(DfNum,centers = 4,nstart = 25,iter.max = 1000,algorithm = 'Hartigan-Wong')
fviz_cluster(km_output4,data=DfNum)

# Using 2 
km_output2 <- kmeans(DfNum,centers = 2,nstart = 25,iter.max = 1000,algorithm = 'Hartigan-Wong')
#str(km_output2)
fviz_cluster(km_output2,data=DfNum)

Step 2.4 PCA Analysis

pca <- prcomp(DfNum,scale. = T,center = T)
#summary(pca)
pcaDf <- as.data.frame(pca$rotation)

PC 1-PC 10 collectively explain 49.83% of the original information

Interpretation One of PC1-PC4

‘upon,to,there,of,an’ is most positively related to PC1,PC3
‘and, their, was, one, had’ is most negatively realted PC1
‘will, than, may, be, more’ is most postively related to PC2
‘had, was, been, of , were’ is most negatively related to PC2

Step 2.5 Analyze Characterics of PCA by cluster/author

pcs <- as.data.frame(predict(pca,newdata = DfNum))
summary(pcs[12:62,])[,1:8] # Ham is relatively positively related to PC1,PC3 on average

##       PC1               PC2               PC3                PC4         
##  Min.   :-2.4981   Min.   :-3.6749   Min.   :-2.20344   Min.   :-4.8451  
##  1st Qu.: 0.7453   1st Qu.:-1.2235   1st Qu.:-0.07684   1st Qu.:-1.4602  
##  Median : 1.8176   Median :-0.2604   Median : 0.85428   Median :-0.3921  
##  Mean   : 1.5243   Mean   :-0.1873   Mean   : 0.90519   Mean   :-0.1037  
##  3rd Qu.: 2.5310   3rd Qu.: 0.7047   3rd Qu.: 2.05794   3rd Qu.: 0.9985  
##  Max.   : 3.5031   Max.   : 6.1508   Max.   : 5.78036   Max.   : 4.9241  
##       PC5                PC6               PC7               PC8         
##  Min.   :-7.23853   Min.   :-5.9376   Min.   :-4.1748   Min.   :-3.5021  
##  1st Qu.:-1.03095   1st Qu.:-1.0250   1st Qu.:-0.9648   1st Qu.:-0.7885  
##  Median : 0.43574   Median : 0.1486   Median : 0.1361   Median : 0.3526  
##  Mean   :-0.08723   Mean   : 0.1103   Mean   : 0.1025   Mean   : 0.1058  
##  3rd Qu.: 1.10796   3rd Qu.: 1.6481   3rd Qu.: 1.4417   3rd Qu.: 0.9378  
##  Max.   : 2.89376   Max.   : 3.2758   Max.   : 3.0490   Max.   : 3.3011

summary(pcs[71:85,])[,1:8] # Ma is relatively negatively related to PC1,PC2,PC3, and posively related to PC4

##       PC1                PC2                PC3         
##  Min.   :-3.57966   Min.   :-2.47961   Min.   :-3.9561  
##  1st Qu.:-2.01823   1st Qu.:-0.73491   1st Qu.:-2.7515  
##  Median :-1.41793   Median : 0.04877   Median :-2.3190  
##  Mean   :-1.29549   Mean   :-0.29613   Mean   :-2.0737  
##  3rd Qu.:-0.39040   3rd Qu.: 0.34731   3rd Qu.:-1.1060  
##  Max.   :-0.09796   Max.   : 1.33855   Max.   :-0.6137  
##       PC4                PC5                PC6               PC7         
##  Min.   :-3.72667   Min.   :-2.77133   Min.   :-3.3073   Min.   :-2.5548  
##  1st Qu.: 0.06367   1st Qu.:-0.66157   1st Qu.:-1.4403   1st Qu.:-1.0010  
##  Median : 0.58283   Median : 0.01185   Median :-0.7234   Median : 0.2539  
##  Mean   : 0.52825   Mean   : 0.09991   Mean   :-0.5274   Mean   :-0.1372  
##  3rd Qu.: 1.74892   3rd Qu.: 0.99273   3rd Qu.: 0.3519   3rd Qu.: 0.5998  
##  Max.   : 2.72509   Max.   : 2.39256   Max.   : 1.8413   Max.   : 2.2543  
##       PC8          
##  Min.   :-1.90009  
##  1st Qu.:-1.19969  
##  Median :-0.09133  
##  Mean   :-0.08380  
##  3rd Qu.: 0.49210  
##  Max.   : 2.90651

summary(pcs[66:70,])[,1:8] # Ja is extremely negatively related to PC1, PC4, and positively related to PC2

##       PC1              PC2               PC3               PC4         
##  Min.   :-7.778   Min.   :-0.6614   Min.   :-2.5064   Min.   :-5.0773  
##  1st Qu.:-6.957   1st Qu.: 4.3354   1st Qu.:-0.6867   1st Qu.:-3.4455  
##  Median :-5.419   Median : 4.5522   Median : 0.3486   Median :-1.6006  
##  Mean   :-5.515   Mean   : 4.4384   Mean   : 0.8435   Mean   :-2.3375  
##  3rd Qu.:-5.144   3rd Qu.: 5.8934   3rd Qu.: 2.2450   3rd Qu.:-1.4467  
##  Max.   :-2.278   Max.   : 8.0722   Max.   : 4.8168   Max.   :-0.1174  
##       PC5               PC6               PC7              PC8         
##  Min.   :-2.5939   Min.   :-1.3844   Min.   :-1.836   Min.   :-2.6330  
##  1st Qu.:-1.4817   1st Qu.:-0.1490   1st Qu.: 1.188   1st Qu.:-2.0730  
##  Median :-0.7780   Median : 1.2887   Median : 1.519   Median : 2.5887  
##  Mean   :-0.7186   Mean   : 0.6465   Mean   : 1.273   Mean   : 0.7763  
##  3rd Qu.: 0.2021   3rd Qu.: 1.2982   3rd Qu.: 2.604   3rd Qu.: 2.6933  
##  Max.   : 1.0585   Max.   : 2.1790   Max.   : 2.889   Max.   : 3.3053

summary(pcs[63:65,])[,1:8] # Co-author is extremly negatively related to PC1,PC2, and relatively positively related to PC3 and PC4

##       PC1              PC2              PC3               PC4         
##  Min.   :-7.463   Min.   :-6.426   Min.   :0.07157   Min.   :-0.9980  
##  1st Qu.:-6.447   1st Qu.:-5.178   1st Qu.:0.36036   1st Qu.: 0.4174  
##  Median :-5.431   Median :-3.930   Median :0.64916   Median : 1.8327  
##  Mean   :-5.842   Mean   :-4.341   Mean   :0.97380   Mean   : 1.1440  
##  3rd Qu.:-5.031   3rd Qu.:-3.298   3rd Qu.:1.42491   3rd Qu.: 2.2150  
##  Max.   :-4.630   Max.   :-2.665   Max.   :2.20067   Max.   : 2.5972  
##       PC5               PC6               PC7                PC8         
##  Min.   :-1.8075   Min.   :-2.2432   Min.   :-0.33332   Min.   :-4.5237  
##  1st Qu.:-0.3856   1st Qu.:-1.2036   1st Qu.:-0.32628   1st Qu.:-2.0688  
##  Median : 1.0364   Median :-0.1640   Median :-0.31924   Median : 0.3860  
##  Mean   : 0.3800   Mean   :-0.8469   Mean   : 0.05581   Mean   :-1.2134  
##  3rd Qu.: 1.4737   3rd Qu.:-0.1487   3rd Qu.: 0.25037   3rd Qu.: 0.4418  
##  Max.   : 1.9110   Max.   :-0.1333   Max.   : 0.81999   Max.   : 0.4976

Interpretation TWO of PC1-PC4:

Hamilton used more ‘upon,to,there,of,an’ while Madison used less these words, which distinguish these two authors’ writing styles.

Step 2.6 Recreate Cluster Assignment with PCA

cluster <- km_output4$cluster
pcs$cluster <- cluster[match(rownames(pcs),names(cluster))]
library(ggplot2)
ggplot(pcs,aes(x=PC1,y=PC2))+geom_point(col=cluster)+geom_text(label=rownames(pcs))

Observation: * Articles of Joint Authorship are closer to Madison on the graph.

Step 2.7 Hierarchical Clustering

The following result shows that the Hierarchical Clustering is not that efficient for clustering compared with Kmeans due to the subtle difference. Not be able to produce desirable number of clusters by consistent comparison methods

hac_output <- hclust(dist(DfNum,method = 'euclidean'),method = 'complete')
plot(hac_output)

hac_cut2 <- cutree(hac_output,2)
for (i in 1:length(hac_cut2)) {
  if(hac_cut2[i]!=km_output2$cluster[i]) print(names(hac_cut2)[i])
}

## [1] "di 1"
## [1] "di 2"
## [1] "di 3"
## [1] "di 4"
## [1] "di 5"
## [1] "di 6"
## [1] "di 8"
## [1] "di 9"
## [1] "di 10"
## [1] "di 11"
## [1] "Ha 15"
## [1] "Ha 36"
## [1] "Ha 62"
## [1] "HM 63"
## [1] "HM 64"
## [1] "HM 65"
## [1] "Ja 66"
## [1] "Ja 67"
## [1] "Ja 70"
## [1] "Ma 71"
## [1] "Ma 72"
## [1] "Ma 73"
## [1] "Ma 74"
## [1] "Ma 75"
## [1] "Ma 76"
## [1] "Ma 77"
## [1] "Ma 78"
## [1] "Ma 79"
## [1] "Ma 80"
## [1] "Ma 81"
## [1] "Ma 82"
## [1] "Ma 83"
## [1] "Ma 84"
## [1] "Ma 85"

hac_cut4 <- cutree(hac_output,4)
for (i in 1:length(hac_cut4)) {
  if(hac_cut4[i]!=km_output4$cluster[i]) print(names(hac_cut4)[i])
}

## [1] "di 1"
## [1] "di 2"
## [1] "di 3"
## [1] "di 4"
## [1] "di 5"
## [1] "di 6"
## [1] "di 7"
## [1] "di 8"
## [1] "di 9"
## [1] "di 10"
## [1] "di 11"
## [1] "Ha 12"
## [1] "Ha 13"
## [1] "Ha 14"
## [1] "Ha 15"
## [1] "Ha 16"
## [1] "Ha 17"
## [1] "Ha 18"
## [1] "Ha 19"
## [1] "Ha 20"
## [1] "Ha 21"
## [1] "Ha 22"
## [1] "Ha 23"
## [1] "Ha 24"
## [1] "Ha 25"
## [1] "Ha 26"
## [1] "Ha 27"
## [1] "Ha 28"
## [1] "Ha 29"
## [1] "Ha 30"
## [1] "Ha 31"
## [1] "Ha 32"
## [1] "Ha 33"
## [1] "Ha 34"
## [1] "Ha 35"
## [1] "Ha 36"
## [1] "Ha 37"
## [1] "Ha 38"
## [1] "Ha 39"
## [1] "Ha 40"
## [1] "Ha 41"
## [1] "Ha 42"
## [1] "Ha 43"
## [1] "Ha 44"
## [1] "Ha 45"
## [1] "Ha 46"
## [1] "Ha 47"
## [1] "Ha 48"
## [1] "Ha 49"
## [1] "Ha 50"
## [1] "Ha 51"
## [1] "Ha 52"
## [1] "Ha 53"
## [1] "Ha 54"
## [1] "Ha 55"
## [1] "Ha 56"
## [1] "Ha 57"
## [1] "Ha 58"
## [1] "Ha 59"
## [1] "Ha 60"
## [1] "Ha 61"
## [1] "Ha 62"
## [1] "HM 64"
## [1] "HM 65"
## [1] "Ja 68"
## [1] "Ja 69"
## [1] "Ja 70"
## [1] "Ma 71"
## [1] "Ma 72"
## [1] "Ma 73"
## [1] "Ma 74"
## [1] "Ma 75"
## [1] "Ma 77"
## [1] "Ma 78"
## [1] "Ma 79"
## [1] "Ma 80"
## [1] "Ma 81"
## [1] "Ma 82"
## [1] "Ma 83"
## [1] "Ma 84"
## [1] "Ma 85"

However, from intuitively reading the graph, some observations are as follows: * Dispute 1 could be ha or Ma * Dispute 3,4,5,6,10,11 are very likely to be Ma * Dispute 7,8,9 are more likely to be Ha * Dispute 2 could be by HM co-auther

Part 2.2 DecisionTree

Step 2.21 Building Model

library(caret)
library(rpart)
dt_model <- train(author~.,data=TrDF,metric='Accuracy',method='rpart')
print(dt_model)

## CART 
## 
## 52 samples
## 70 predictors
##  2 classes: 'Hamilton', 'Madison' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 52, 52, 52, 52, 52, 52, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa    
##   0.0000000  0.9120824  0.7542551
##   0.4583333  0.8987491  0.7188121
##   0.9166667  0.8186825  0.3628433
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

print(dt_model$finalModel)

## n= 52 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 52 12 Hamilton (0.76923077 0.23076923)  
##   2) upon>=0.019 39  0 Hamilton (1.00000000 0.00000000) *
##   3) upon< 0.019 13  1 Madison (0.07692308 0.92307692) *

The final value used for the model was cp=0.458, which is relatively high enough to ensure the fitting of the model.
It uses the ‘upon’ as the root node.

Step 2.22 Model Testing & Performance Evaluation

dt_pred <- predict(dt_model,newdata=TeDF,type = 'prob')
dt_pred

##      Hamilton   Madison
## 1  1.00000000 0.0000000
## 2  1.00000000 0.0000000
## 3  1.00000000 0.0000000
## 4  1.00000000 0.0000000
## 5  1.00000000 0.0000000
## 6  1.00000000 0.0000000
## 7  1.00000000 0.0000000
## 8  1.00000000 0.0000000
## 9  1.00000000 0.0000000
## 10 1.00000000 0.0000000
## 11 1.00000000 0.0000000
## 12 0.07692308 0.9230769
## 13 0.07692308 0.9230769
## 14 0.07692308 0.9230769

dt_pred2 <- predict(dt_model,newdata=TeDF,type = 'raw')
dt_pred2

##  [1] Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton Hamilton
##  [8] Hamilton Hamilton Hamilton Hamilton Madison  Madison  Madison 
## Levels: Hamilton Madison

table(dt_pred2,TeDF$author)

##           
## dt_pred2   Hamilton Madison
##   Hamilton       11       0
##   Madison         0       3

# Accuracy of this model is 100%.

precision <- posPredValue(as.factor(dt_pred2), as.factor(TeDF$author), positive = "Hamilton")
recall <- sensitivity(as.factor(dt_pred2), as.factor(TeDF$author), positive = "Hamilton")
f <- 2 * precision * recall / (precision + recall)
sprintf("Precision is %.2f; recall is %.2f; F measure if %.2f",precision, recall, f)

## [1] "Precision is 1.00; recall is 1.00; F measure if 1.00"

Step 2.23 Model Tuning1

dt_model_tune <- train(author~.,data=TrDF,method='rpart',
                       metric='Accuracy',
                       tuneLength=8)
print(dt_model_tune$finalModel)

## n= 52 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 52 12 Hamilton (0.76923077 0.23076923)  
##   2) upon>=0.019 39  0 Hamilton (1.00000000 0.00000000) *
##   3) upon< 0.019 13  1 Madison (0.07692308 0.92307692) *

dt_predictTune <- predict(dt_model_tune,newdata = TeDF,type = 'prob')
head(dt_predictTune)

##   Hamilton Madison
## 1        1       0
## 2        1       0
## 3        1       0
## 4        1       0
## 5        1       0
## 6        1       0

Step 2.24 Postprune

dt_model_postprune <- prune(dt_model$finalModel,cp=0.1)
print(dt_model_postprune)

## n= 52 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 52 12 Hamilton (0.76923077 0.23076923)  
##   2) upon>=0.019 39  0 Hamilton (1.00000000 0.00000000) *
##   3) upon< 0.019 13  1 Madison (0.07692308 0.92307692) *

dt_predictTune <- predict(dt_model_postprune,newdata = TeDF,type = 'prob')
dt_predictTune

##      Hamilton   Madison
## 1  1.00000000 0.0000000
## 2  1.00000000 0.0000000
## 3  1.00000000 0.0000000
## 4  1.00000000 0.0000000
## 5  1.00000000 0.0000000
## 6  1.00000000 0.0000000
## 7  1.00000000 0.0000000
## 8  1.00000000 0.0000000
## 9  1.00000000 0.0000000
## 10 1.00000000 0.0000000
## 11 1.00000000 0.0000000
## 12 0.07692308 0.9230769
## 13 0.07692308 0.9230769
## 14 0.07692308 0.9230769

## the decrease of cp does not do harm to the model fitting and prediction precision. The accuracy is still 100%.

plot(dt_model$finalModel)
text(dt_model$finalModel)

library(rattle)
fancyRpartPlot(dt_model$finalModel)

library(rpart.plot)
prp(dt_model$finalModel)

Based on the model, there are 75% of articles written by Hamilton, 25% written by Madison in the training dataset. The probability of cases in Halmiton that are classified rightly is 100%, while that in Madison is 92%.

Part 3. Prediction and Interpretation

dt_judge <- predict(dt_model,newdata = PreDF,type = 'prob')
dt_judge

##      Hamilton   Madison
## 1  0.07692308 0.9230769
## 2  0.07692308 0.9230769
## 3  0.07692308 0.9230769
## 4  0.07692308 0.9230769
## 5  0.07692308 0.9230769
## 6  0.07692308 0.9230769
## 7  0.07692308 0.9230769
## 8  0.07692308 0.9230769
## 9  0.07692308 0.9230769
## 10 0.07692308 0.9230769
## 11 0.07692308 0.9230769

dt_judge2 <- predict(dt_model,newdata = PreDF,type = 'raw')
dt_judge2

##  [1] Madison Madison Madison Madison Madison Madison Madison Madison
##  [9] Madison Madison Madison
## Levels: Hamilton Madison

Based on the prediction result of decision tree model, all the articles are written by Madison.

Comparison between cluster analysis:

Among the cluter Analysis, Kmeans is more useful and informative than Hierarchical in this case.

Comparison between cluster analysis and decision tree:

Kmeans clusters give more details about the article and authorship belonging from map and localization.
PCA is also an important methods, from which we can know which words are more frequently used or closely related to each author.
Decision tree give uniform and more direct classification for the items.
The root node ‘upon (larger frequency are assigned to Halmiton)’ and child node ‘which’ in decision tree are consistent with the results from the PCA ( upon are positively related with Halmiton ).