AS10-1：誰是來亂的？

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

Vandal = 1 if this edit was vandalism, 0 if not. 資料原本是對的，又被別人改錯
Minor = 1 if the user marked this edit as a “minor edit”, 0 if not.
Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.
Added = The unique words added.
Removed = The unique words removed.

packages = c(
  "dplyr","ggplot2","caTools","tm","SnowballC","ROCR","rpart","rpart.plot","randomForest")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)

rm(list=ls(all=TRUE))
Sys.setlocale("LC_ALL","C")

## [1] "C"

options(digits=5, scipen=10)

library(dplyr)
library(tm)
library(SnowballC)
library(ROCR)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)

Problem 1 - Bags of Words

1.1 The data set

wiki = read.csv("data/wiki.csv" , stringsAsFactors = FALSE)
wiki$Vandal = as.factor(wiki$Vandal)
table(wiki$Vandal)

## 
##    0    1 
## 2061 1815

【P1.1】How many cases of vandalism were detected in the history of this page?

1815

1.2 DTM, The Added Words

library(tm)
library(SnowballC)

# Create corpus for Added Words
txt = iconv(wiki$Added, to = "utf-8", sub="")
corpus = Corpus(VectorSource(txt))
corpus = tm_map(corpus, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

corpus = tm_map(corpus, stemDocument)

## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents

dtm = DocumentTermMatrix(corpus)
dtm

## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

#---------------------------------
library(tm)
library(SnowballC)
txt = iconv(wiki$Added, to = "utf-8", sub="")
corpusAdded = Corpus(VectorSource(txt))
corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(corpusAdded, removeWords,
## stopwords("english")): transformation drops documents

corpusAdded = tm_map(corpusAdded , stemDocument)

## Warning in tm_map.SimpleCorpus(corpusAdded, stemDocument): transformation
## drops documents

dtmAdded = DocumentTermMatrix(corpusAdded)
dtmAdded

## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

【P1.2】How many terms appear in dtmAdded?

6675

1.3 Handle Sparsity

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded.

nwAdded = rowSums(as.matrix(dtm))     # no. word added in each edit
dtm = removeSparseTerms(dtm, 0.997)   #篩選掉最後0.3%資料
dtm

## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

#---------------------------------
dtmAdded = removeSparseTerms(dtmAdded , 0.997)   #篩選掉最後0.3%資料
dtmAdded

## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

【P1.3】How many terms appear in sparseAdded?

1.4 Create Data Frames, `wordAdded` & `wordRemoved`

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

wordsAdded = as.data.frame(as.matrix(dtm))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))  # for proper column names

#---------------------------------
wordsAdded = as.data.frame(as.matrix(dtmAdded))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))

Now repeat all of the steps we’ve done so far to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:

# Create corpus
txt = iconv(wiki$Removed, to = "utf-8", sub="")
corpus = Corpus(VectorSource(txt))
corpus = tm_map(corpus, removeWords, stopwords("english"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

corpus = tm_map(corpus, stemDocument)

## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents

dtm = DocumentTermMatrix(corpus)
dtm

## <<DocumentTermMatrix (documents: 3876, terms: 5404)>>
## Non-/sparse entries: 13294/20932610
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

nwRemoved = rowSums(as.matrix(dtm))
dtm = removeSparseTerms(dtm, 0.997)
dtm

## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

wordsRemoved = as.data.frame(as.matrix(dtm))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

#---------------------------------
txt = iconv(wiki$Removed, to = "utf-8", sub="")
corpus = Corpus(VectorSource(txt))
corpus = tm_map(corpus, removeWords , stopwords("english"))

## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents

corpus = tm_map(corpus, stemDocument)

## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents

dtmRemoved = DocumentTermMatrix(corpus)
dtmAdded

## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

nwAdded = rowSums(as.matrix(dtmAdded)) 
sparseRemoved = removeSparseTerms(dtmRemoved , 0.997)   #篩選掉最後0.3%資料
sparseRemoved

## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

wordsRemoved = as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

【P1.4】How many words are in the wordsRemoved data frame?

1.5 Prepare the Data Frame

Combine the Data Frames wordsAdded & wordsRemoved with the Target Variable wiki$Vandal

wikiWords = cbind(wordsAdded, wordsRemoved)
wikiWords$Vandal = wiki$Vandal

Split the data frame for train and test data

library(caTools)
set.seed(123)
spl = sample.split(wikiWords$Vandal, 0.7)
train = subset(wikiWords, spl == TRUE)
test = subset(wikiWords, spl == FALSE)
table(test$Vandal) %>% prop.table

## 
##       0       1 
## 0.53138 0.46862

#---------------------------------
library(caTools)
set.seed(123)
spl = sample.split(wikiWords$Vandal , 0.7)
train = subset(wikiWords , spl == TRUE )
test = subset(wikiWords , spl == FALSE)
table(test$Vandal)

## 
##   0   1 
## 618 545

(618)/(618+545)

## [1] 0.53138

【P1.5】What is the accuracy on the test set of a baseline method that always predicts “not vandalism”?

0.53138

1.6 CART Model

library(rpart)
library(rpart.plot)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # 0.54428

## [1] 0.54428

#---------------------------------
#隨機種子會影響最後產生的準確率
library(rpart)
library(rpart.plot)
cart = rpart(Vandal~. , train , method = "class")
pred = predict(cart , newdata = test , type = "class")
table(test$Vandal , pred)

##    pred
##       0   1
##   0 614   4
##   1 526  19

(614+19)/(614+4+526+19)

## [1] 0.54428

【P1.6】What is the accuracy of the model on the test set, using a threshold of 0.5?

0.54428

1.7 Plot the Decision Tree

prp(cart)

【P1.7】How many word stems does the CART model use?

1.8 Predictability of the CART model

【P1.8】Given the performance of the CART model relative to the baseline, what is the best explanation of these results?

Although it beats the baseline, bag of words is not very predictive for this problem.
產生的模型並無過度適配的問題，主要問題在於字詞切割與分析，會影響預測結果

Problem 2 - Add Features with Problem-specific Knowledge

2.1 Add `HTTP` column

Add a new column based on whether "http" is added

wiki2 = wikiWords
wiki2$HTTP = ifelse( grepl("http",wiki$Added,fixed=TRUE) , 1, 0)
table(wiki2$HTTP) # 217

## 
##    0    1 
## 3659  217

#---------------------------------
wikiWords2 = wikiWords
wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
table(wikiWords2$HTTP)

## 
##    0    1 
## 3659  217

【P2.1】Based on this new column, how many revisions added a link?

2.2 Check accuracy again

train2 = subset(wiki2, spl==T)
test2 = subset(wiki2, spl==F)
cart2 = rpart(Vandal~., train2, method="class")
pred2 = predict(cart2,test2,type='class')
table(test2$Vandal, pred2) %>% {sum(diag(.)) / sum(.)}

## [1] 0.57524

table(test2$Vandal, pred2)

##    pred2
##       0   1
##   0 605  13
##   1 481  64

(605+64)/(605+13+481+64)  #0.57524

## [1] 0.57524

【P2.2】What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

0.57524

2.3 Total numbers of words added and removed

wiki2$nwAdded = nwAdded
wiki2$nwRemoved = nwRemoved
mean(nwAdded) # 4.0501

## [1] 0.73271

#---------------------------------
wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))
mean(wikiWords2$NumWordsAdded)

## [1] 0.73271

【P2.3】What is the average number of words added?

4.0501

2.4 Check accuracy again

train = subset(wiki2, spl)
test = subset(wiki2, !spl)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # 0.6552

## [1] 0.57782

#---------------------------------
wikiTrain3 = subset(wikiWords2, spl==TRUE)
wikiTest3 = subset(wikiWords2, spl==FALSE)
wikiCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")
testPredictCART3 = predict(wikiCART3, newdata=wikiTest3, type="class")
table(wikiTest3$Vandal, testPredictCART3)

##    testPredictCART3
##       0   1
##   0 349 269
##   1 222 323

(349+323)/(349+269+222+323) #與答案不同

## [1] 0.57782

【P2.4】What is the new accuracy of the CART model on the test set?

0.6552

Problem 3 - Using Non-Textual Data

原始資料之中還有一些之前沒有用到的欄位，我們把它們也加進來

wiki3 = wiki2
wiki3$Minor = wiki$Minor
wiki3$Loggedin = wiki$Loggedin

#---------------------------------
wikiWords3 = wikiWords2
wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin

3.1 Check accuracy again

train = subset(wiki3, spl=T)
test = subset(wiki3, spl=F)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # .72472

## [1] 0.71259

#---------------------------------
train = subset(wikiWords3, spl=T)
test = subset(wikiWords3, spl=F)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart , newdata = test , type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # 0.71259

## [1] 0.71259

【P3.1】What is the accuracy of the model on the test set?

0.71259(Ans:0.7188306)

3.2 The Decision Tree

prp(cart)  #1

【P3.2】How many splits are there in the tree?

1(Ans:3)

討論議題：
■ 請舉出一些可以繼續提高模型準確率的方法，方法越多越好：
● 模型相互組合訓練及預測(例如：邏輯式回歸+隨機森林)。
● 篩選掉更多較不重要字詞，留下常出現的字詞。
● 增加變數預測，找出關聯性。

AS10-1：誰是來亂的？

施采彣 M064020017

2018-08-17 23:20:26

Problem 1 - Bags of Words

1.1 The data set

1.2 DTM, The Added Words

1.3 Handle Sparsity

1.4 Create Data Frames, `wordAdded` & `wordRemoved`

1.5 Prepare the Data Frame

1.6 CART Model

1.7 Plot the Decision Tree

1.8 Predictability of the CART model

Problem 2 - Add Features with Problem-specific Knowledge

2.1 Add `HTTP` column

2.2 Check accuracy again

2.3 Total numbers of words added and removed

2.4 Check accuracy again

Problem 3 - Using Non-Textual Data

3.1 Check accuracy again

3.2 The Decision Tree

AS10-1：誰是來亂的？

施采彣 M064020017

2018-08-17 23:20:26

Problem 1 - Bags of Words

1.1 The data set

1.2 DTM, The Added Words

1.3 Handle Sparsity

1.4 Create Data Frames, wordAdded & wordRemoved

1.5 Prepare the Data Frame

1.6 CART Model

1.7 Plot the Decision Tree

1.8 Predictability of the CART model

Problem 2 - Add Features with Problem-specific Knowledge

2.1 Add HTTP column

2.2 Check accuracy again

2.3 Total numbers of words added and removed

2.4 Check accuracy again

Problem 3 - Using Non-Textual Data

3.1 Check accuracy again

3.2 The Decision Tree

1.4 Create Data Frames, `wordAdded` & `wordRemoved`

2.1 Add `HTTP` column