Source: Analytics Edge Unit 5 Homework

Techniques involved: Text mining, CART, grepl

Wikipedia is a free online encyclopedia that anyone can edit and contribute to. One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With so many articles and edits per day, it is difficult for humans to detect all the instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

  Vandal = 1 if this edit was vandalism, 0 if not.

  Minor = 1 if the user marked this edit as a “minor edit”, 0 if not.

  Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.

  Added = The unique words added.

  Removed = The unique words removed.

Notice the repeated use of unique. The data we have available is not the traditional bag of words - rather it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision it will only appear one time in the “Removed” column.
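
As a small illustration (using a made-up string, not the actual data), the set-of-words representation can be mimicked with unique():

# Hypothetical removed text in which "cat" appears twice;
# the set-of-words representation keeps it only once
removedText <- "cat sat cat mat"
paste(unique(strsplit(removedText, " ")[[1]]), collapse = " ")
## [1] "cat sat mat"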

Load the data

setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_5_Text_analytics")
wiki <- read.csv("wiki.csv", stringsAsFactors = FALSE)
str(wiki)
## 'data.frame':    3876 obs. of  7 variables:
##  $ X.1     : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ X       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Vandal  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Minor   : int  1 1 0 1 1 0 0 0 1 0 ...
##  $ Loggedin: int  1 1 1 0 1 1 1 1 1 0 ...
##  $ Added   : chr  "  represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
##  $ Removed : chr  " " " talklanguagetalk" " regarded as technologytechnologies human first" "  represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
table(wiki$Vandal)
## 
##    0    1 
## 2061 1815
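
Roughly 47% of the revisions are labeled as vandalism, so a model that always predicts “not vandalism” will be right only a little over half the time. A quick check of the proportion:

# Proportion of revisions flagged as vandalism
1815 / nrow(wiki)   # about 0.47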

Bag of words approach

We have two columns of textual data with different meanings. For example, adding rude words means something different from removing rude words. We’ll start as we did in class by building a document-term matrix from the Added column. The text is already lowercase and stripped of punctuation, so to pre-process the data we just complete the following four steps:

  1. Create the corpus for the Added column, and call it “corpusAdded”.

  2. Remove the English-language stopwords.

  3. Stem the words.

  4. Build the DocumentTermMatrix, and call it dtmAdded.

Create a corpus

library(tm)
## Loading required package: NLP
corpusAdded <- Corpus(VectorSource(wiki$Added))

Pre-process the corpus

corpusAdded <- tm_map(corpusAdded, removeWords, stopwords("english"))
corpusAdded <- tm_map(corpusAdded, stemDocument)

Build the DocumentTermMatrix

dtmAdded <- DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

Filter out sparse terms

sparseAdded <- removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)
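
As a quick sanity check, the 0.997 cutoff keeps only terms that appear in more than about 0.3% of the 3876 revisions, i.e. in roughly a dozen documents or more:

# Terms appearing in fewer than about this many revisions are dropped
(1 - 0.997) * nrow(dtmAdded)
## [1] 11.628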

Create a data frame

# Prefix the Added terms with "A" so they do not clash with the Removed terms later
wordsAdded <- as.data.frame(as.matrix(sparseAdded))
colnames(wordsAdded) <- paste("A", colnames(wordsAdded))

Create corpus for Removed and repeat all the steps above

corpusRemoved <- Corpus(VectorSource(wiki$Removed))
corpusRemoved <- tm_map(corpusRemoved, removeWords, stopwords("english"))
corpusRemoved <- tm_map(corpusRemoved, stemDocument)
dtmRemoved <- DocumentTermMatrix(corpusRemoved)
sparseRemoved <- removeSparseTerms(dtmRemoved, 0.997)
wordsRemoved <- as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) <- paste("R", colnames(wordsRemoved))

Rebuild the data frame

Combine the two data frames

wikiWords <- cbind(wordsAdded, wordsRemoved)

Add the vandal column

wikiWords$Vandal <- wiki$Vandal

Split the data

library(caTools)
set.seed(123)
split <- sample.split(wikiWords$Vandal, SplitRatio = 0.7)
train <- subset(wikiWords, split == TRUE)
test <- subset(wikiWords, split == FALSE)

Baseline model

table(test$Vandal)
## 
##   0   1 
## 618 545

The baseline model always predicts the most frequent class in the test set, “not vandalism” (Vandal = 0). Baseline model accuracy is 618/nrow(test) = 0.5313.
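
Expressed in code (this just restates the division above):

# Baseline: always predict "not vandalism", the most frequent class in the test set
618/nrow(test)   # = 0.5313...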

CART model

The model below is incorrect: since Vandal is stored as an integer, rpart builds a regression tree by default. The method = "class" argument must be specified to build a classification tree.

library(rpart)
library(rpart.plot)
wikiCART <- rpart(Vandal ~., data = train)
prp(wikiCART)

Correct model

library(rpart)
library(rpart.plot)
wikiCART <- rpart(Vandal ~., data = train, method = "class")
prp(wikiCART)

Evaluate the model

Make predictions

wikiPredict <- predict(wikiCART, newdata = test, type = "class")

Model accuracy is (618+12)/nrow(test) = 0.5417025, computed from the confusion matrix below.

table(test$Vandal, wikiPredict)
##    wikiPredict
##       0   1
##   0 618   0
##   1 533  12
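
Computing the accuracy from this confusion matrix, in the same way as the later sections:

(618 + 12)/nrow(test)
## [1] 0.5417025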

The model barely beats the baseline (0.5417 vs. 0.5313), so the added words by themselves are not very useful predictors.

Alternative approach

We were barely able to improve on the baseline using the raw textual information; the words themselves were not very useful. There are other options, though, and in this section we will try two techniques - identifying a key class of words, and counting words.

The key class of words we will use is website addresses. Website addresses (also known as URLs - Uniform Resource Locators) consist of two main parts. An example is “http://www.google.com”. The first part is the protocol, usually “http” (HyperText Transfer Protocol). The second part is the address of the site, e.g. “www.google.com”. Since all punctuation has been stripped, links to websites appear in the data as one word, e.g. “httpwwwgooglecom”. Because a lot of vandalism seems to involve adding links to promotional or irrelevant websites, we hypothesize that the presence of a web address is a sign of vandalism.

We can search for the presence of a web address in the words added by searching for “http” in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.

grepl("cat","dogs and cats",fixed=TRUE) # TRUE

grepl("cat","dogs and rats",fixed=TRUE) # FALSE

Create a copy of your dataframe from the previous question:

wikiWords2 <- wikiWords

Create a new variable

Make a new column in wikiWords2 that is 1 if “http” was in Added:

wikiWords2$HTTP <- ifelse(grepl("http", wiki$Added, fixed = TRUE),1,0)
table(wikiWords2$HTTP)
## 
##    0    1 
## 3659  217

Re-create training and test set

wikiTrain2 <- subset(wikiWords2, split == TRUE)
wikiTest2 <- subset(wikiWords2, split == FALSE)

Rebuild the CART model

wikiCART2 <- rpart(Vandal ~., data = wikiTrain2, method = "class")
prp(wikiCART2)

Evaluate the model

wikiPredict2 <- predict(wikiCART2, newdata = wikiTest2, type = "class")
table(wikiTest2$Vandal, wikiPredict2)
##    wikiPredict2
##       0   1
##   0 609   9
##   1 488  57

Model accuracy is (609+57)/nrow(wikiTest2) = 0.5726569

(609+57)/nrow(wikiTest2)
## [1] 0.5726569

Word count approach

Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have word counts available in the form of the document-term matrices (DTMs): the row sums of each DTM give the number of (stemmed, non-stopword) words added or removed in each revision.

wikiWords2$NumWordsAdded <- rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved <- rowSums(as.matrix(dtmRemoved))
summary(wikiWords2$NumWordsAdded)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    4.05    3.00  259.00

Recreate training and test set, because we have new variables

wikiTrain2 <- subset(wikiWords2, split == TRUE)
wikiTest2 <- subset(wikiWords2, split == FALSE)

Rebuild the CART model

wikiCART2 <- rpart(Vandal ~., data = wikiTrain2, method = "class")
prp(wikiCART2)

Evaluate the model

wikiPredict2 <- predict(wikiCART2, newdata = wikiTest2, type = "class")
table(wikiTest2$Vandal, wikiPredict2)
##    wikiPredict2
##       0   1
##   0 514 104
##   1 297 248

Model accuracy is 0.6552021

(514+248)/nrow(wikiTest2)
## [1] 0.6552021

We have two pieces of “metadata” (data about data) that we haven’t yet used: whether the edit was marked as minor and whether the user was logged in.

wikiWords3 <- wikiWords2

Introducing two new variables

Then add the two original variables Minor and Loggedin to this new data frame

wikiWords3$Minor <- wiki$Minor
wikiWords3$Loggedin <- wiki$Loggedin

Recreate training and test set

wikiTrain3 <- subset(wikiWords3, split == TRUE)
wikiTest3 <- subset(wikiWords3, split == FALSE)

Build a CART model

wikiCART3 <- rpart(Vandal ~., data = wikiTrain3, method = "class")
prp(wikiCART3)

Evaluate the model

wikiPredict3 <- predict(wikiCART3, newdata = wikiTest3, type = "class")
table(wikiTest3$Vandal, wikiPredict3)
##    wikiPredict3
##       0   1
##   0 595  23
##   1 304 241

Model accuracy is 0.7188306

(595+241)/nrow(wikiTest3)
## [1] 0.7188306

Using the metadata makes a substantial difference: accuracy rises from 0.6552 to 0.7188.