Background Information on the Dataset

Wikipedia is a free online encyclopedia that anyone can edit and contribute to. It is available in many languages and is growing all the time. On the English language version of Wikipedia:

One of the consequences of being editable by anyone is that some people vandalize pages. This can take the form of removing content, adding promotional or inappropriate content, or more subtle shifts that change the meaning of the article. With this many articles and edits per day it is difficult for humans to detect all instances of vandalism and revert (undo) them. As a result, Wikipedia uses bots - computer programs that automatically revert edits that look like vandalism. In this assignment we will attempt to develop a vandalism detector that uses machine learning to distinguish between a valid edit and vandalism.

The data for this problem is based on the revision history of the page Language. Wikipedia provides a history for each page that consists of the state of the page at each revision. Rather than manually considering each revision, a script was run that checked whether edits stayed or were reverted. If a change was eventually reverted then that revision is marked as vandalism. This may result in some misclassifications, but the script performs well enough for our needs.

As a result of this preprocessing, some common processing tasks have already been done, including lower-casing and punctuation removal. The columns in the dataset are:

Notice the repeated use of unique. The data we have available is not the traditional bag of words - rather it is the set of words that were removed or added. For example, if a word was removed multiple times in a revision it will only appear one time in the “Removed” column.

R Exercises

Loading the Dataset

Load the data wiki.csv with the option stringsAsFactors=FALSE, calling the data frame “wiki”. Convert the “Vandal” column to a factor using the command wiki$Vandal = as.factor(wiki$Vandal).

# Load the dataset
wiki = read.csv("wiki.csv", stringsAsFactors=FALSE)
# Convert the column to a factor
wiki$Vandal = as.factor(wiki$Vandal)

How many cases of vandalism were detected in the history of this page?

# Tabulates the vandalism cases
table(wiki$Vandal)
## 
##    0    1 
## 2061 1815

There are 1815 observations with value 1, which denotes vandalism.

How many terms appear in dtmAdded?

We will now use the bag of words approach to build a model. We have two columns of textual data, with different meanings. For example, adding rude words has a different meaning to removing rude words. We’ll start like we did in class by building a document term matrix from the Added column. The text already is lowercase and stripped of punctuation. So to pre-process the data, just complete the following four steps:

  1. Create the corpus for the Added column, and call it “corpusAdded”.

  2. Remove the English-language stopwords.

  3. Stem the words.

  4. Build the DocumentTermMatrix, and call it dtmAdded.

If the code length(stopwords(“english”)) does not return 174 for you, then please run the line of code in this file, which will store the standard stop words in a variable called sw. When removing stop words, use tm_map(corpusAdded, removeWords, sw) instead of tm_map(corpusAdded, removeWords, stopwords(“english”)).

sw = c(“i”, “me”, “my”, “myself”, “we”, “our”, “ours”, “ourselves”, “you”, “your”, “yours”, “yourself”, “yourselves”, “he”, “him”, “his”, “himself”, “she”, “her”, “hers”, “herself”, “it”, “its”, “itself”, “they”, “them”, “their”, “theirs”, “themselves”, “what”, “which”, “who”, “whom”, “this”, “that”, “these”, “those”, “am”, “is”, “are”, “was”, “were”, “be”, “been”, “being”, “have”, “has”, “had”, “having”, “do”, “does”, “did”, “doing”, “would”, “should”, “could”, “ought”, “i’m”, “you’re”, “he’s”, “she’s”, “it’s”, “we’re”, “they’re”, “i’ve”, “you’ve”, “we’ve”, “they’ve”, “i’d”, “you’d”, “he’d”, “she’d”, “we’d”, “they’d”, “i’ll”, “you’ll”, “he’ll”, “she’ll”, “we’ll”, “they’ll”, “isn’t”, “aren’t”, “wasn’t”, “weren’t”, “hasn’t”, “haven’t”, “hadn’t”, “doesn’t”, “don’t”, “didn’t”, “won’t”, “wouldn’t”, “shan’t”, “shouldn’t”, “can’t”, “cannot”, “couldn’t”, “mustn’t”, “let’s”, “that’s”, “who’s”, “what’s”, “here’s”, “there’s”, “when’s”, “where’s”, “why’s”, “how’s”, “a”, “an”, “the”, “and”, “but”, “if”, “or”, “because”, “as”, “until”, “while”, “of”, “at”, “by”, “for”, “with”, “about”, “against”, “between”, “into”, “through”, “during”, “before”, “after”, “above”, “below”, “to”, “from”, “up”, “down”, “in”, “out”, “on”, “off”, “over”, “under”, “again”, “further”, “then”, “once”, “here”, “there”, “when”, “where”, “why”, “how”, “all”, “any”, “both”, “each”, “few”, “more”, “most”, “other”, “some”, “such”, “no”, “nor”, “not”, “only”, “own”, “same”, “so”, “than”, “too”, “very”)

library(tm)
# Create the corpus for the Added column, and call it "corpusAdded"
corpusAdded = VCorpus(VectorSource(wiki$Added))
# Remove the English-language stopwords
corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))
# Stem the words
corpusAdded = tm_map(corpusAdded, stemDocument)
# Build the DocumentTermMatrix, and call it dtmAdded
dtmAdded = DocumentTermMatrix(corpusAdded)
dtmAdded
## <<DocumentTermMatrix (documents: 3876, terms: 6675)>>
## Non-/sparse entries: 15368/25856932
## Sparsity           : 100%
## Maximal term length: 784
## Weighting          : term frequency (tf)

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions, and call the new matrix sparseAdded. How many terms appear in sparseAdded?

# Filter out the sparse terms
sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded
## <<DocumentTermMatrix (documents: 3876, terms: 166)>>
## Non-/sparse entries: 2681/640735
## Sparsity           : 100%
## Maximal term length: 28
## Weighting          : term frequency (tf)

6675 terms appear in dtmAdded.

How many words are in the wordsRemoved data frame?

Convert sparseAdded to a data frame called wordsAdded, and then prepend all the words with the letter A, by using the command:

colnames(wordsAdded) = paste(“A”, colnames(wordsAdded))

Now repeat all of the steps we’ve done so far (create a corpus, remove stop words, stem the document, create a sparse document term matrix, and convert it to a data frame) to create a Removed bag-of-words dataframe, called wordsRemoved, except this time, prepend all of the words with the letter R:

colnames(wordsRemoved) = paste(“R”, colnames(wordsRemoved))

#Convert sparseAdded into a dataframe wordAdded
wordsAdded = as.data.frame(as.matrix(sparseAdded))
# Prepend all the words with letter A
colnames(wordsAdded) = paste("A", colnames(wordsAdded))
# Create the corpus for the Added column, and call it "corpusRemoved"
corpusRemoved <- Corpus(VectorSource(wiki$Removed))
# Remove the English-language stopwords
corpusRemoved <- tm_map(corpusRemoved, removeWords, stopwords("english"))
# Stem the words
corpusRemoved <- tm_map(corpusRemoved, stemDocument)
# Build the DocumentTermMatrix, and call it dtmRemoved
dtmRemoved <- DocumentTermMatrix(corpusRemoved)
sparseRemoved <- removeSparseTerms(dtmRemoved, 0.997)
wordsRemoved <- as.data.frame(as.matrix(sparseRemoved))
colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))
ncol(wordsRemoved)
## [1] 162

162 words are in wordsRemoved.

What is the accuracy on the test set of a baseline method that always predicts “not vandalism” (the most frequent outcome)?

Combine the two data frames into a data frame called wikiWords with the following line of code:

wikiWords = cbind(wordsAdded, wordsRemoved)

The cbind function combines two sets of variables for the same observations into one data frame. Then add the Vandal column (HINT: remember how we added the dependent variable back into our data frame in the Twitter lecture). Set the random seed to 123 and then split the data set using sample.split from the “caTools” package to put 70% in the training set.

# Combine the data frame
wikiWords = cbind(wordsAdded, wordsRemoved)
wikiWords$Vandal = wiki$Vandal
# Split the dataset into a training and testing set
library(caTools)
set.seed(123)
spl = sample.split(wikiWords$Vandal, SplitRatio = 0.7)
wikiTrain = subset(wikiWords, spl==TRUE)
wikiTest = subset(wikiWords, spl==FALSE)
# Tabulates the amount of vandalism in cases
a = table(wikiTest$Vandal)
kable(a)
Var1 Freq
0 618
1 545
# Computes the Accuracy
a[1]/(sum(a))
##         0 
## 0.5313844

Testing Set Accuracy = 0.5313844

Build a CART Model

Build a CART model to predict Vandal, using all of the other variables as independent variables. Use the training set to build the model and the default parameters (don’t set values for minbucket or cp).

What is the accuracy of the model on the test set, using a threshold of 0.5? (Remember that if you add the argument type=“class” when making predictions, the output of predict will automatically use a threshold of 0.5.)

# Implement the CART Model
library(rpart)
library(rpart.plot)
wikiCART = rpart(Vandal ~ ., data=wikiTrain, method="class")
prp(wikiCART)

# Make predictions using the CART model
testPredictCART = predict(wikiCART, newdata=wikiTest, type="class")
# Tabulates the testing set vs the predictions
a = table(wikiTest$Vandal, testPredictCART)
kable(a)
0 1
0 614 4
1 526 19
### Computes the Accuracy
sum(diag(a))/(sum(a))
## [1] 0.544282

Test Set Accuracy = 0.544282

New Approach: Identifying a key class of words

We weren’t able to improve on the baseline using the raw textual information. More specifically, the words themselves were not useful. There are other options though, and in this section we will try two techniques - identifying a key class of words, and counting words.

The key class of words we will use are website addresses. “Website addresses” (also known as URLs - Uniform Resource Locators) are comprised of two main parts. An example would be “http://www.google.com”. The first part is the protocol, which is usually “http” (HyperText Transfer Protocol). The second part is the address of the site, e.g. “www.google.com”. We have stripped all punctuation so links to websites appear in the data as one word, e.g. “httpwwwgooglecom”. We hypothesize that given that a lot of vandalism seems to be adding links to promotional or irrelevant websites, the presence of a web address is a sign of vandalism.

We can search for the presence of a web address in the words added by searching for “http” in the Added column. The grepl function returns TRUE if a string is found in another string, e.g.

grepl(“cat”,“dogs and cats”,fixed=TRUE) # TRUE

grepl(“cat”,“dogs and rats”,fixed=TRUE) # FALSE

Create a copy of your dataframe from the previous question:

wikiWords2 = wikiWords

Make a new column in wikiWords2 that is 1 if “http” was in Added:

wikiWords2$HTTP = ifelse(grepl(“http”,wiki$Added,fixed=TRUE), 1, 0)

# Create a copy of your data frame with http added
wikiWords2 = wikiWords
wikiWords2$HTTP = ifelse(grepl("http",wiki$Added,fixed=TRUE), 1, 0)
# Tabulates the amount of words with http
z = table(wikiWords2$HTTP)
kable(z)
Var1 Freq
0 3659
1 217

217 revisions added a link.

CART Model #2, What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

#Subsetting the data into training and test sets
wikiTrain2 = subset(wikiWords2, spl==TRUE)
wikiTest2 = subset(wikiWords2, spl==FALSE)
# Create the CART model
wikiCART2 = rpart(Vandal ~ ., data=wikiTrain2, method="class")
prp(wikiCART2)

# Predict using the test set
testPredictCART2 = predict(wikiCART2, newdata=wikiTest2, type="class")
# Tabulate the predictions vs testing set data
a = table(wikiTest2$Vandal, testPredictCART2)
kable(a)
0 1
0 605 13
1 481 64
# Computes the Accuracy
sum(diag(a))/(sum(a))
## [1] 0.5752365

Accuracy = 0.5752365

New Approach: Counting Words

Another possibility is that the number of words added and removed is predictive, perhaps more so than the actual words themselves. We already have a word count available in the form of the document-term matrices (DTMs).

Sum the rows of dtmAdded and dtmRemoved and add them as new variables in your data frame wikiWords2 (called NumWordsAdded and NumWordsRemoved) by using the following commands:

wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))

wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))

What is the average number of words added?

# Sum the rows and add them as new variables
wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))
wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))
# Output a summary
summary(wikiWords2$NumWordsAdded)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    0.00    1.00    4.05    3.00  259.00

4.05 average number of words are added.

CART Model #3, What is the new accuracy of the CART model on the test set, using a threshold of 0.5?

# Split the data into a training set and testing set
wikiTrain3 = subset(wikiWords2, spl==TRUE)
wikiTest3 = subset(wikiWords2, spl==FALSE)
# Implement the CART Model
wikiCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")
prp(wikiCART3)

# Predict using the testing set
testPredictCART3 = predict(wikiCART3, newdata=wikiTest3, type="class")
# Tabulates the testing set vs the predictions
a = table(wikiTest3$Vandal, testPredictCART3)
kable(a)
0 1
0 514 104
1 297 248
# Computes the Accuracy
sum(diag(a))/(sum(a))
## [1] 0.6552021

Accuracy = 0.6552021 #### Final Approach: Metadata - CART Model #4

We have two pieces of “metadata” (data about data) that we haven’t yet used. Make a copy of wikiWords2, and call it wikiWords3:

wikiWords3 = wikiWords2

Then add the two original variables Minor and Loggedin to this new data frame:

wikiWords3$Minor = wiki$Minor

wikiWords3$Loggedin = wiki$Loggedin

In problem 1.5, you computed a vector called “spl” that identified the observations to put in the training and testing sets. Use that variable (do not recompute it with sample.split) to make new training and testing sets with wikiWords3.

Build a CART model using all the training data. What is the accuracy of the model on the test set?

# Create a copy of the data and then add the two original variables
wikiWords3 = wikiWords2
wikiWords3$Minor = wiki$Minor
wikiWords3$Loggedin = wiki$Loggedin
# Subset the training and test sets
wikiTrain4 = subset(wikiWords3, spl==TRUE)
wikiTest4 = subset(wikiWords3, spl==FALSE)
# Create the CART Model
wikiCART4 = rpart(Vandal ~ ., data=wikiTrain4, method="class")
prp(wikiCART4)

# Make predictions using the CART Model
testPredictCART4 = predict(wikiCART4, newdata=wikiTest4, type="class")
# Tabulate the testing set with the predictions
a = table(wikiTest4$Vandal, testPredictCART4)
kable (a)
0 1
0 595 23
1 304 241
# Computes the accuracy
sum(diag(a))/(sum(a))
## [1] 0.7188306

Accuracy = 0.7188306