The data for this problem is based on the revision history of the Wikipedia page Language. Wikipedia provides a history for each page, consisting of the state of the page at each revision. Rather than manually inspecting each revision, a script was run that checked whether each edit stayed or was reverted. If a change was eventually reverted, that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common text-processing tasks have already been done, including lower-casing and punctuation removal. Each row of the dataset corresponds to a single revision, and the columns are:

  • Vandal: 1 if the revision was eventually reverted (vandalism), 0 otherwise
  • Minor: 1 if the user marked the edit as a minor edit, 0 otherwise
  • Loggedin: 1 if the edit was made by a logged-in user, 0 otherwise
  • Added: the words added in the revision
  • Removed: the words removed in the revision



# Install any required packages that are not already present
packages = c(
  "dplyr","ggplot2","caTools","tm","SnowballC","ROCR","rpart","rpart.plot","randomForest")
existing = as.character(installed.packages()[,1])
for(pkg in packages[!(packages %in% existing)]) install.packages(pkg)

rm(list=ls(all=TRUE))        # clear the workspace
Sys.setlocale("LC_ALL","C")  # use the C locale for consistent string handling
## [1] "C"
options(digits=5, scipen=10) # show 5 significant digits; discourage scientific notation

library(dplyr)
library(tm)
library(SnowballC)
library(ROCR)
library(caTools)
library(rpart)
library(rpart.plot)
library(randomForest)


Problem 1 - Bags of Words

1.1 The data set

Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame “wiki”.

wiki=read.csv("data/wiki.csv", stringsAsFactors=F)

Convert the “Vandal” column to a factor using the command:

wiki$Vandal=as.factor(wiki$Vandal)
table(wiki$Vandal)
## 
##    0    1 
## 2061 1815

How many cases of vandalism were detected in the history of this page?

  • 1815
1.2 DTM, The Added Words

We will now use the bag-of-words approach to build a model. We have two columns of textual data with different meanings: for example, adding rude words has a different meaning from removing rude words. We’ll start as we did in class by building a document term matrix from the Added column. The text is already lowercase and stripped of punctuation, so to pre-process the data, just complete the following four steps:

  1. Create the corpus for the Added column, and call it “corpusAdded”.

  2. Remove the English-language stopwords.

  3. Stem the words.

  4. Build the DocumentTermMatrix, and call it dtmAdded.

If the code length(stopwords("english")) does not return 174 for you, then please run the line of code provided in the original course file, which stores the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusAdded, removeWords, sw) instead of tm_map(corpusAdded, removeWords, stopwords("english")).

Build a document-term matrix from the Added column

library(tm)
library(SnowballC)

# Create corpus for the Added words
txt = iconv(wiki$Added, to = "utf-8", sub="")   # convert the text to UTF-8, dropping characters that cannot be converted
corpusAdded = Corpus(VectorSource(txt))         # build the corpus
corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpusAdded, removeWords,
## stopwords("english")): transformation drops documents
corpusAdded = tm_map(corpusAdded, stemDocument)
## Warning in tm_map.SimpleCorpus(corpusAdded, stemDocument): transformation
## drops documents
#create a DocumentTermMatrix
dtmAdded = DocumentTermMatrix(corpusAdded) ;dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

【P1.2】How many terms appear in dtmAdded?

  • 6675
1.3 Handle Sparsity

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded.

nwAdded = rowSums(as.matrix(dtmAdded))     # number of words added in each revision
sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

【P1.3】How many terms appear in sparseAdded?

  • 166
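
As an illustrative sanity check (this assumes removeSparseTerms keeps a term exactly when it appears in more than 0.3% of the revisions), the same count can be recovered from the raw document frequencies:

docFreq = colSums(as.matrix(dtmAdded) > 0)    # number of revisions containing each term
sum(docFreq > (1 - 0.997) * nrow(dtmAdded))   # should match the 166 terms reported above
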
1.4 Create Data Frames, wordAdded & wordRemoved

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the commands:

wordsAdded = as.data.frame(as.matrix(sparseAdded))        # convert the sparse DTM to a matrix, then to a data frame
colnames(wordsAdded) = paste("A", colnames(wordsAdded))   # prefix column names with "A" to mark added words

Now repeat all of the steps we’ve done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words data frame called wordsRemoved, except this time prepend all of the words with the letter R:

Build a document-term matrix from the Removed column

txt = iconv(wiki$Removed, to = "utf-8", sub="")   # convert the text to UTF-8, dropping characters that cannot be converted
corpus = Corpus(VectorSource(txt))                # build the corpus
corpus = tm_map(corpus, removeWords, stopwords("english"))
## Warning in tm_map.SimpleCorpus(corpus, removeWords, stopwords("english")):
## transformation drops documents
corpus = tm_map(corpus, stemDocument)
## Warning in tm_map.SimpleCorpus(corpus, stemDocument): transformation drops
## documents
#create a DocumentTermMatrix
dtm = DocumentTermMatrix(corpus) #;dtm
nwRemoved = rowSums(as.matrix(dtm))     # number of words removed in each revision
dtm = removeSparseTerms(dtm,  0.997)
dtm
## <<DocumentTermMatrix (documents: 3876, terms: 162)>>
## Non-/sparse entries: 2552/625360
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
wordsRemoved = as.data.frame(as.matrix(dtm))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

How many words are in the wordsRemoved data frame?

  • 162
1.5 Prepare the Data Frame

Combine the two data frames into a data frame called wikiWords with the following line of code:

wikiWords = cbind(wordsAdded, wordsRemoved)

The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture). Set the random seed to 123 and then split the data set using sample.split from the “caTools” package to put 70% in the training set.
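
As a tiny, hypothetical illustration of what cbind does (toy data, not from the assignment):

x = data.frame(a = 1:3)
y = data.frame(b = c("p", "q", "r"))
cbind(x, y)   # a 3-row data frame with columns a and b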

wikiWords$Vandal = wiki$Vandal

library(caTools)
set.seed(123)
spl = sample.split(wikiWords$Vandal, SplitRatio = 0.7)
tr = subset(wikiWords, spl)
ts = subset(wikiWords, !spl)
table(ts$Vandal) %>% prop.table() # proportion of each outcome in the test set
## 
##       0       1 
## 0.53138 0.46862
618/(618+545)   # 618 non-vandalism revisions out of the 1163 in the test set
## [1] 0.53138

What is the accuracy on the test set of a baseline method that always predicts “not vandalism” (the most frequent outcome)?

  • 0.53138
1.6 CART Model

Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don’t set values for minbucket or cp). What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type="class" when making predictions, the output of predict will automatically use a threshold of 0.5.)

library(rpart)
library(rpart.plot)
cart= rpart(Vandal~., tr, method = "class")
pred = predict(cart, newdata=ts, type= "class")
table(ts$Vandal, pred) %>% {sum(diag(.)) / sum(.)}   # accuracy = correct predictions / total
## [1] 0.54428
  • 0.54428
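
As a sketch of the threshold remark above (reusing the cart and ts objects), predicting class probabilities and cutting them at 0.5 by hand should give the same accuracy, up to ties at exactly 0.5:

predProb = predict(cart, newdata = ts)[, 2]                     # P(Vandal = 1) for each test revision
table(ts$Vandal, predProb >= 0.5) %>% {sum(diag(.)) / sum(.)}   # should reproduce 0.54428
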
1.7 Plot the Decision Tree

Plot the CART tree. How many word stems does the CART model use?

prp(cart)

  • 3
1.8 Predictability of the CART model

【P1.8】Given the performance of the CART model relative to the baseline, what is the best explanation of these results?

  • Although it beats the baseline, bag of words is not very predictive for this problem.
  • The model is not overfitting, and the term matrix was not over-sparsified (see the training-set check below).
  • The accuracy of the models so far is unimpressive, which suggests the Added and Removed words themselves may not be suitable predictors.
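
A quick way to back up the overfitting point is to compare the model's accuracy on the training set with its test-set accuracy (a sketch reusing cart and tr from 1.6):

predTr = predict(cart, newdata = tr, type = "class")
table(tr$Vandal, predTr) %>% {sum(diag(.)) / sum(.)}   # if this were far above 0.54428, overfitting would be a concern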


Problem 2 - Add Features with Problem-specific Knowledge

2.1 Add HTTP column

We weren’t able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.

The key class of words we will use are website addresses. “Website addresses” (also known as URLs, or Uniform Resource Locators) consist of two main parts. An example would be “http://www.google.com”. The first part is the protocol, which is usually “http” (HyperText Transfer Protocol). The second part is the address of the site, e.g. “www.google.com”. We have stripped all punctuation, so links to websites appear in the data as one word, e.g. “httpwwwgooglecom”. Since a lot of vandalism seems to involve adding links to promotional or irrelevant websites, we hypothesize that the presence of a web address is a sign of vandalism.

We can search for the presence of a web address in the words added by searching for “http” in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.
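
For instance, these illustrative calls (not from the original write-up) show how grepl matches substrings:

grepl("cat", "dogs and cats", fixed = TRUE)   # TRUE:  "cat" occurs inside "cats"
grepl("cat", "dogs and rats", fixed = TRUE)   # FALSE: "cat" does not occur anywhere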

* grepl() can be used to check whether a piece of text contains a particular phrase. For example, “cats” and “dogs” on their own refer to the animals, while the phrase “cats and dogs” can also mean heavy rain.

Create a copy of your dataframe from the previous question:

wikiWords2 = wikiWords

Make a new column in wikiWords2 that is 1 if “http” was in Added:

wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
table(wikiWords2$HTTP)
## 
##    0    1 
## 3659  217

Based on this new column, how many revisions added a link?

  • 217
2.2 Check accuracy again

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets, now that the phrase-match (HTTP) indicator has been added:

tr2= subset(wikiWords2, spl)
ts2= subset(wikiWords2, !spl)

Then create a new CART model using this new variable as one of the independent variables.

cart2 = rpart(Vandal~., tr2, method = "class")
pred2 = predict(cart2, newdata=ts2, type= "class")
table(ts2$Vandal, pred2) %>% {sum(diag(.)) / sum(.)}
## [1] 0.57524

What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

  • 0.57524 (accuracy did improve, but only slightly)
2.3 Total numbers of words added and removed

Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).
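
A quick, informal look at this hypothesis is possible with the count vectors already in memory (an illustrative aside, not part of the assignment):

tapply(nwAdded, wiki$Vandal, mean)     # average words added: non-vandal (0) vs vandal (1) revisions
tapply(nwRemoved, wiki$Vandal, mean)   # the same comparison for words removed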

Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:

# The rowSums were already computed above and stored as nwAdded and nwRemoved,
# so the assignment's commands
#   wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
#   wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtm))
# are equivalent to reusing those vectors:
wikiWords2$nwAdded = nwAdded
wikiWords2$nwRemoved = nwRemoved
mean(nwAdded) # 4.0501
## [1] 4.0501

What is the average number of words added?

  • 4.0501
2.4 Check accuracy again

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords2. Create the CART model again (using the training set and the default parameters).

tr3 = subset(wikiWords2, spl)
ts3 = subset(wikiWords2, !spl)
cart3 = rpart(Vandal~., tr3, method = "class")
pred3 = predict(cart3, newdata= ts3, type="class")
table(ts3$Vandal, pred3) %>% {sum(diag(.)) / sum(.)}
## [1] 0.6552

What is the new accuracy of the CART model on the test set?

  • 0.6552
Problem 3 - Using Non-Textual Data

3.1 Check accuracy again

We have two pieces of “metadata” (data about data) that we haven’t yet used. Make a copy of wikiWords2, and call it wikiWords3:

wikiWords3 = wikiWords2

Then add the two original variables Minor and Loggedin to this new data frame:

wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.

Build a CART model using all the training data. What is the accuracy of the model on the test set?

tr4 = subset(wikiWords3, spl)
ts4 = subset(wikiWords3, !spl)
cart4 = rpart(Vandal~., tr4, method="class")
pred4 = predict(cart4, newdata= ts4, type="class")
table(ts4$Vandal, pred4) %>% {sum(diag(.))/sum(.)}
## [1] 0.71883
  • 0.71883
3.2 The Decision Tree

There is a substantial difference in the accuracy of the model using the metadata. Is this because we made a more complicated model?

Plot the CART tree. How many splits are there in the tree?

prp(cart4)

  • 3
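
Instead of counting splits by eye, the fitted object can be queried directly; a sketch assuming the usual rpart convention that leaves are labelled "<leaf>" in the frame component:

sum(cart4$frame$var != "<leaf>")   # number of splits in the tree; expected to be 3 here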

Once the corpus has been cleaned and the document-term matrix built, the remaining work is largely iterative: keep adding or adjusting features and keep whatever improves accuracy.



Discussion topics:
  ■ List as many ways as you can think of to further improve the model's accuracy:
    ● Account for phrases (multi-word expressions), not just individual words
    ● Account for how frequently words appear, e.g. tf-idf weighting (see the sketch at the end)
    ● Account for sentiment, and so on
    ●
    ●
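
As a rough sketch of the word-frequency idea above (weightTfIdf comes with the tm package; corpusAdded is the corpus built in 1.2):

dtmTfIdf = DocumentTermMatrix(corpusAdded,
                              control = list(weighting = weightTfIdf))   # tf-idf weights instead of raw counts
dtmTfIdf = removeSparseTerms(dtmTfIdf, 0.997)                            # then sparsify and re-fit the models as before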