Wikipedia is a free online encyclopedia that anyone can edit and contribute to. One consequence of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With so many articles and edits per day, it is difficult for humans to detect every instance of vandalism and revert (undo) it. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.
The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.
As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:
Vandal = 1 if this edit was vandalism, 0 if not.
Minor = 1 if the user marked this edit as a “minor edit”, 0 if not.
Loggedin = 1 if the user made this edit while using a Wikipedia account, 0 if they did not.
Added = the unique words added.
Removed = the unique words removed.
Notice the repeated use of the word unique. The data we have available is not the traditional bag of words - rather, it is the set of words that were added or removed. For example, if a word was removed multiple times in a revision, it will appear only once in the “Removed” column.
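As a small illustration of what “unique words” means here (the text below is made up and is not from the dataset), collapsing a revision’s words to a set can be sketched as:
# Hypothetical revision text in which the word "spam" was added three times
addedText <- "spam spam spam buy now"
unique(strsplit(addedText, " ")[[1]]) # "spam" "buy" "now"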
setwd("C:/Users/jzchen/Documents/Courses/Analytics Edge/Unit_5_Text_analytics")
wiki <- read.csv("wiki.csv", stringsAsFactors = FALSE)
str(wiki)
## 'data.frame': 3876 obs. of 7 variables:
## $ X.1 : int 1 2 3 4 5 6 7 8 9 10 ...
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Vandal : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Minor : int 1 1 0 1 1 0 0 0 1 0 ...
## $ Loggedin: int 1 1 1 0 1 1 1 1 1 0 ...
## $ Added : chr " represent psycholinguisticspsycholinguistics orthographyorthography help text all actions through human ethnologue relationsh"| __truncated__ " website external links" " " " afghanistan used iran mostly that farsiis is countries some xmlspacepreservepersian parts tajikestan region" ...
## $ Removed : chr " " " talklanguagetalk" " regarded as technologytechnologies human first" " represent psycholinguisticspsycholinguistics orthographyorthography help all actions through ethnologue relationships linguis"| __truncated__ ...
table(wiki$Vandal)
##
## 0 1
## 2061 1815
We have two columns of textual data, with different meanings. For example, adding rude words means something different from removing rude words. We’ll start as we did in class by building a document term matrix from the Added column. The text is already lowercase and stripped of punctuation, so to pre-process the data, just complete the following four steps:
Create the corpus for the Added column, and call it “corpusAdded”.
Remove the English-language stopwords.
Stem the words.
Build the DocumentTermMatrix, and call it dtmAdded.
library(tm)
## Loading required package: NLP
corpusAdded <- Corpus(VectorSource(wiki$Added))
corpusAdded <- tm_map(corpusAdded, removeWords, stopwords("english"))
corpusAdded <- tm_map(corpusAdded, stemDocument)
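To sanity-check the preprocessing, we can inspect a single document after the transformations (output not shown; the exact accessor can vary slightly between tm versions, so treat this as a sketch):
# Look at the first revision's added words after stopword removal and stemming
as.character(corpusAdded[[1]])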
Build the DocumentTermMatrix
dtmAdded <- DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity : 100%
## Maximal term length: 784
## Weighting : term frequency (tf)
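Before filtering, it can be informative to peek at the terms that appear frequently across revisions. The threshold of 50 below is an arbitrary choice for illustration (output omitted):
# Terms that occur at least 50 times in total across all revisions
findFreqTerms(dtmAdded, lowfreq = 50)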
Filter out sparse terms, keeping only terms that appear in at least 0.3% of the revisions (sparsity threshold 0.997)
sparseAdded <- removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity : 100%
## Maximal term length: 28
## Weighting : term frequency (tf)
Create a data frame, prefixing the column names with “A” so they will not clash with the Removed-word columns later
wordsAdded <- as.data.frame(as.matrix(sparseAdded))
colnames(wordsAdded) <- paste("A", colnames(wordsAdded))
Create the corpus for the Removed column and repeat all the steps above, this time prefixing the column names with “R”
corpusRemoved <- Corpus(VectorSource(wiki$Removed))
corpusRemoved <- tm_map(corpusRemoved, removeWords, stopwords("english"))
corpusRemoved <- tm_map(corpusRemoved, stemDocument)
dtmRemoved <- DocumentTermMatrix(corpusRemoved)
sparseRemoved <- removeSparseTerms(dtmRemoved, 0.997)
wordsRemoved <- as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) <- paste("R", colnames(wordsRemoved))
Combine the two data frames
wikiWords <- cbind(wordsAdded, wordsRemoved)
Add the vandal column
wikiWords$Vandal <- wiki$Vandal
library(caTools)
set.seed(123)
split <- sample.split(wikiWords$Vandal, SplitRatio = 0.7)
train <- subset(wikiWords, split == TRUE)
test <- subset(wikiWords, split == FALSE)
table(test$Vandal)
##
## 0 1
## 618 545
The baseline model predicts “not vandalism” for every edit; its accuracy is 618/nrow(test) = 0.5314
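This can be computed directly from the test-set counts above:
# Baseline: always predict the more common class in the test set ("not vandalism")
table(test$Vandal)[1] / nrow(test)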
Below is an incorrect model: because Vandal is stored as a 0/1 integer, rpart will fit a regression tree unless method = "class" is specified.
library(rpart)
library(rpart.plot)
wikiCART <- rpart(Vandal ~., data = train)
prp(wikiCART)
Correct model
library(rpart)
library(rpart.plot)
wikiCART <- rpart(Vandal ~., data = train, method = "class")
prp(wikiCART)
Make predictions
wikiPredict <- predict(wikiCART, newdata = test, type = "class")
Model accuracy is (618 + 12)/nrow(test) = 0.5417025, as the confusion matrix below shows.
table(test$Vandal, wikiPredict)
## wikiPredict
## 0 1
## 0 618 0
## 1 533 12
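Computing the accuracy explicitly from the confusion matrix:
(618+12)/nrow(test)
## [1] 0.5417025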
The model barely improves on the baseline (0.5417 vs. 0.5314).
So we weren’t able to improve much on the baseline using the raw textual information; the words themselves were not very useful. There are other options, though, and in this section we will try two techniques - identifying a key class of words, and counting words.
The key class of words we will use is website addresses. “Website addresses” (also known as URLs - Uniform Resource Locators) are composed of two main parts. An example would be “http://www.google.com”. The first part is the protocol, which is usually “http” (HyperText Transfer Protocol). The second part is the address of the site, e.g. “www.google.com”. Since we have stripped all punctuation, links to websites appear in the data as one word, e.g. “httpwwwgooglecom”. Because a lot of vandalism seems to involve adding links to promotional or irrelevant websites, we hypothesize that the presence of a web address is a sign of vandalism.
We can search for the presence of a web address in the words added by searching for “http” in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.
grepl("cat","dogs and cats",fixed=TRUE) # TRUE
grepl("cat","dogs and rats",fixed=TRUE) # FALSE
Create a copy of your dataframe from the previous question:
wikiWords2 <- wikiWords
Make a new column in wikiWords2 that is 1 if “http” was in Added:
wikiWords2$HTTP <- ifelse(grepl("http", wiki$Added, fixed = TRUE),1,0)
table(wikiWords2$HTTP)
##
## 0 1
## 3659 217
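As a quick informal check of the hypothesis, the new flag can be cross-tabulated against the outcome (output not shown here):
# How often do edits containing "http" turn out to be vandalism?
table(wikiWords2$HTTP, wikiWords2$Vandal)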
Re-create training and test set
wikiTrain2 <- subset(wikiWords2, split == TRUE)
wikiTest2 <- subset(wikiWords2, split == FALSE)
wikiCART2 <- rpart(Vandal ~., data = wikiTrain2, method = "class")
prp(wikiCART2)
wikiPredict2 <- predict(wikiCART2, newdata = wikiTest2, type = "class")
table(wikiTest2$Vandal, wikiPredict2)
## wikiPredict2
## 0 1
## 0 609 9
## 1 488 57
Model accuracy is (609+57)/nrow(wikiTest2) = 0.5726569
(609+57)/nrow(wikiTest2)
## [1] 0.5726569
Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).
wikiWords2$NumWordsAdded <- rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved <- rowSums(as.matrix(dtmRemoved))
summary(wikiWords2$NumWordsAdded)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 0.00 1.00 4.05 3.00 259.00
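The same summary can be run for the removed-word counts (output omitted here):
# Distribution of the number of unique words removed per revision
summary(wikiWords2$NumWordsRemoved)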
Recreate training and test set, because we have new variables
wikiTrain2 <- subset(wikiWords2, split == TRUE)
wikiTest2 <- subset(wikiWords2, split == FALSE)
wikiCART2 <- rpart(Vandal ~., data = wikiTrain2, method = "class")
prp(wikiCART2)
wikiPredict2 <- predict(wikiCART2, newdata = wikiTest2, type = "class")
table(wikiTest2$Vandal, wikiPredict2)
## wikiPredict2
## 0 1
## 0 514 104
## 1 297 248
Model accuracy is 0.6552021
(514+248)/nrow(wikiTest2)
## [1] 0.6552021
We have two pieces of “metadata” (data about data) that we haven’t used yet: whether the edit was marked as minor (Minor) and whether the user was logged in (Loggedin).
wikiWords3 <- wikiWords2
Add the two original variables Minor and Loggedin to this new data frame
wikiWords3$Minor <- wiki$Minor
wikiWords3$Loggedin <- wiki$Loggedin
wikiTrain3 <- subset(wikiWords3, split == TRUE)
wikiTest3 <- subset(wikiWords3, split == FALSE)
wikiCART3 <- rpart(Vandal ~., data = wikiTrain3, method = "class")
prp(wikiCART3)
wikiPredict3 <- predict(wikiCART3, newdata = wikiTest3, type = "class")
table(wikiTest3$Vandal, wikiPredict3)
## wikiPredict3
## 0 1
## 0 595 23
## 1 304 241
Model accuracy is 0.7188306
(595+241)/nrow(wikiTest3)
## [1] 0.7188306
Adding the metadata substantially improves the accuracy of the model, from 0.6552 to 0.7188.
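To wrap up, here is a quick recap of the test-set accuracies computed above (the values are copied from the earlier results; the model names are just labels for this recap):
# Test-set accuracy of each model in this section
accuracies <- c(baseline     = 0.5314,
                addedRemoved = 0.5417,
                plusHTTP     = 0.5727,
                plusCounts   = 0.6552,
                plusMetadata = 0.7188)
accuracies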