The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common text-processing tasks have already been done, including lower-casing and punctuation removal. Each row of the dataset corresponds to one revision. The columns used below are: Vandal (1 if the revision was eventually reverted and hence flagged as vandalism, 0 otherwise), Minor (whether the author marked the edit as a minor edit), Loggedin (whether the author was logged in), Added (the words added by the revision), and Removed (the words removed).



# Install any required packages that are not already present
packages = c(
  "dplyr","ggplot2","caTools","tm","SnowballC","ROCR","rpart","rpart.plot","randomForest")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)
rm(list=ls(all=TRUE))                # clear the workspace
Sys.setlocale("LC_ALL","C")          # use the C locale for consistent text handling in tm
## [1] "C"
options(digits=5, scipen=10)         # compact numeric output, avoid scientific notation

library(dplyr)
library(tm)
library(SnowballC)
library(ROCR)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)


Problem 1 - Bags of Words

1.1 The data set
wiki = read.csv("data/wiki.csv", stringsAsFactors = F)
wiki$Vandal = factor(wiki$Vandal)
table(wiki$Vandal)
## 
##    0    1 
## 2061 1815

【P1.1】How many cases of vandalism were detected in the history of this page?

  • 1815
1.2 DTM, The Added Words
library(tm)
library(SnowballC)

# Create corpus for Added Words
txt = iconv(wiki$Added, to = "utf-8", sub="")
corpus = Corpus(VectorSource(txt))
corpus = tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

【P1.2】How many terms appear in dtmAdded?

  • 6675
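
As an optional sanity check (a sketch using the dtm built above; the cutoff of 20 is an arbitrary choice), tm's findFreqTerms() lists the terms that occur at least a given number of times across all revisions:

findFreqTerms(dtm, lowfreq = 20)   # terms appearing 20 or more times in total
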
1.3 Handle Sparsity

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions. The assignment calls this matrix sparseAdded; here we simply keep reusing the name dtm.

nwAdded = rowSums(as.matrix(dtm))     # number of words added in each revision
dtm = removeSparseTerms(dtm, 0.997)
dtm
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

【P1.3】How many terms appear in sparseAdded?

  • 166
1.4 Create Data Frames, wordAdded & wordRemoved

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

wordsAdded = as.data.frame(as.matrix(dtm))
colnames(wordsAdded) = paste("A", colnames(wordsAdded))  # prefix added-word columns with "A"

Now repeat all of the steps so far to create a Removed bag-of-words data frame called wordsRemoved, this time prepending all of the words with the letter R:

# Create corpus
txt = iconv(wiki$Removed, to = "utf-8", sub="")
corpus = Corpus(VectorSource(txt))
corpus = tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
dtm = DocumentTermMatrix(corpus)
dtm
## <<DocumentTermMatrix (documents: 3876, terms: 5404)>>
## Non-/sparse entries: 13294/20932610
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)
nwRemoved = rowSums(as.matrix(dtm))   # number of words removed in each revision
dtm = removeSparseTerms(dtm, 0.997)
dtm
## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
wordsRemoved = as.data.frame(as.matrix(dtm))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))  # prefix removed-word columns with "R"

【P1.4】How many words are in the wordsRemoved data frame?

  • 162
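
Since the Added and Removed columns go through exactly the same preprocessing, the steps above could be wrapped in a small helper. The function below is a sketch; the name buildWordsDF and its arguments are hypothetical, not part of the assignment:

buildWordsDF = function(text, prefix, sparsity = 0.997) {
  txt = iconv(text, to = "utf-8", sub = "")
  corpus = Corpus(VectorSource(txt))
  corpus = tm_map(corpus, removeWords, stopwords("english"))
  corpus = tm_map(corpus, stemDocument)
  dtm = removeSparseTerms(DocumentTermMatrix(corpus), sparsity)
  df = as.data.frame(as.matrix(dtm))
  colnames(df) = paste(prefix, colnames(df))   # prefix columns, e.g. "A word" / "R word"
  df
}
# Equivalent to the manual steps above (the word-count vectors nwAdded and
# nwRemoved would still be computed separately, before removeSparseTerms):
# wordsAdded   = buildWordsDF(wiki$Added,   "A")
# wordsRemoved = buildWordsDF(wiki$Removed, "R")
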
1.5 Prepare the Data Frame

Combine the data frames wordsAdded and wordsRemoved with the target variable wiki$Vandal:

wikiWords = cbind(wordsAdded, wordsRemoved)
wikiWords$Vandal = wiki$Vandal

Split the data frame into training and test sets:

library(caTools)
set.seed(123)
spl = sample.split(wikiWords$Vandal, 0.7)
train = subset(wikiWords, spl == TRUE)
test = subset(wikiWords, spl == FALSE)
table(test$Vandal) %>% prop.table
## 
##       0       1 
## 0.53138 0.46862

【P1.5】What is the accuracy on the test set of a baseline method that always predicts “not vandalism”?

  • 0.53138
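
Equivalently (a small sketch), the baseline accuracy is the share of the majority class, "not vandalism", in the test set:

max(table(test$Vandal)) / nrow(test)   # share of the majority class (Vandal == 0)
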
1.6 CART Model
library(rpart)
library(rpart.plot)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # 0.54428
## [1] 0.54428

【P1.6】What is the accuracy of the model on the test set, using a threshold of 0.5?

  • 0.54428
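
The ROCR package is loaded above but not otherwise used; as an optional check (a sketch), the test-set AUC of this CART model can be computed from the class-1 probabilities:

probs = predict(cart, test)[, 2]                  # P(Vandal == 1) for each test revision
rocrPred = prediction(probs, test$Vandal)
as.numeric(performance(rocrPred, "auc")@y.values) # area under the ROC curve
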
1.7 Plot the Decision Tree
prp(cart)

【P1.7】How many word stems does the CART model use?

  • 3
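
Rather than reading the stems off the plot, they can also be pulled out of the fitted object (a sketch; cart$frame$var holds the splitting variable at each node, with "<leaf>" marking terminal nodes):

setdiff(unique(as.character(cart$frame$var)), "<leaf>")  # word stems used for splitting
sum(cart$frame$var != "<leaf>")                          # number of splits
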
1.8 Predictability of the CART model

【P1.8】Given the performance of the CART model relative to the baseline, what is the best explanation of these results?

  • Although it beats the baseline, bag of words is not very predictive for this problem.


Problem 2 - Add Features with Problem-specific Knowledge

2.1 Add HTTP column

Add a new column flagging whether the text "http" appears in the added words, i.e. whether the revision added a link:

wiki2 = wikiWords
wiki2$HTTP = ifelse( grepl("http",wiki$Added,fixed=TRUE) , 1, 0)
table(wiki2$HTTP) # 217
## 
##    0    1 
## 3659  217

【P2.1】Based on this new column, how many revisions added a link?

  • 217
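
As a quick sanity check (a sketch), one can look at a couple of the revisions the flag picks up:

head(wiki$Added[wiki2$HTTP == 1], 2)   # example added text containing "http"
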
2.2 Check accuracy again
train2 = subset(wiki2, spl==T)
test2 = subset(wiki2, spl==F)
cart2 = rpart(Vandal~., train2, method="class")
pred2 = predict(cart2,test2,type='class')
table(test2$Vandal, pred2) %>% {sum(diag(.)) / sum(.)} # 0.57524
## [1] 0.57524

【P2.2】What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

  • 0.57524
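
To see where the new HTTP flag enters the model, the updated tree can be plotted the same way as before (a sketch):

prp(cart2)
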
2.3 Total numbers of words added and removed
wiki2$nwAdded = nwAdded
wiki2$nwRemoved = nwRemoved
mean(nwAdded) # 4.0501
## [1] 4.0501

【P2.3】What is the average number of words added?

  • 4.0501
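
As a quick check of why these counts might carry signal (a sketch), compare the average number of words added for non-vandal versus vandal revisions:

tapply(nwAdded, wiki$Vandal, mean)   # mean words added, split by Vandal = 0 / 1
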
2.4 Check accuracy again
train = subset(wiki2, spl)
test = subset(wiki2, !spl)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # 0.6552
## [1] 0.6552

【P2.4】What is the new accuracy of the CART model on the test set?

  • 0.6552


Problem 3 - Using Non-Textual Data

The original data also contains some columns that we have not used yet, so we add them in as well:

wiki3 = wiki2
wiki3$Minor = wiki$Minor
wiki3$Loggedin = wiki$Loggedin
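
Before refitting, a quick look at how these flags relate to vandalism (a sketch):

table(wiki$Loggedin, wiki$Vandal)   # logged-in status vs. vandalism
table(wiki$Minor, wiki$Vandal)      # minor-edit flag vs. vandalism
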
3.1 Check accuracy again
train = subset(wiki3, spl == TRUE)
test = subset(wiki3, spl == FALSE)
cart = rpart(Vandal~., train, method="class")
pred = predict(cart,test,type='class')
table(test$Vandal, pred) %>% {sum(diag(.)) / sum(.)} # .72472
## [1] 0.72472

【P3.1】What is the accuracy of the model on the test set?

  • 0.72472
3.2 The Decision Tree
prp(cart)

【P3.2】How many splits are there in the tree?

  • 3


Discussion:
  ■ Suggest as many ways as possible to further improve the model's accuracy:
    ● Turn the unstructured text into structured features and also bring in the columns that were not used before. This increases the number of variables, but CART and random forests can then be tried again; with too many variables the tree's classification performance may degrade, in which case the tree can be pruned (e.g. with rpart's prune() and the cp parameter). See the sketches after this list.
    ● Use cross-validation.
    ● Tune the model parameters.
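
One concrete way to try the random-forest idea from the first bullet (a sketch; ntree = 200 is an arbitrary choice, and make.names() is applied because the word columns created earlier have spaces in their names):

set.seed(123)
rfData = wiki3
colnames(rfData) = make.names(colnames(rfData), unique = TRUE)  # e.g. "A word" -> "A.word"
rfTrain = subset(rfData, spl == TRUE)
rfTest  = subset(rfData, spl == FALSE)
rf = randomForest(Vandal ~ ., data = rfTrain, ntree = 200)
rfPred = predict(rf, rfTest)
table(rfTest$Vandal, rfPred) %>% {sum(diag(.)) / sum(.)}        # test-set accuracy

For the pruning idea, rpart already runs 10-fold cross-validation internally; the cp value with the lowest cross-validated error can be used to prune the last tree (a sketch):

printcp(cart)                                                   # xerror = cross-validated error
bestCp = cart$cptable[which.min(cart$cptable[, "xerror"]), "CP"]
pruned = prune(cart, cp = bestCp)
prp(pruned)
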