Detecting Vandalism on Wikipedia

Reading the data

wiki = read.csv("wiki.csv", stringsAsFactors = FALSE)

summary(wiki)

Converting Vandal to a factor

wiki$Vandal = as.factor(wiki$Vandal)

str(wiki)

Finding how many cases of vandalism were detected in the history of this page

table(wiki$Vandal)
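To see the class balance as shares rather than raw counts, one option (a minimal sketch using the factor created above):

prop.table(table(wiki$Vandal))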

library(tm)

library(SnowballC)

Pre-processing

corpusAdded = VCorpus(VectorSource(wiki$Added))

corpusAdded = tm_map(corpusAdded, removeWords, stopwords("english"))

corpusAdded = tm_map(corpusAdded, stemDocument)

dtmAdded = DocumentTermMatrix(corpusAdded)

Filter out sparse terms by keeping only terms that appear in 0.3% or more of the revisions

sparseAdded = removeSparseTerms(dtmAdded, 0.997)
sparseAdded
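For intuition about the 0.997 threshold: with roughly 3,876 revisions in this dataset (the 1,163-row test set used later is 30% of the data), keeping terms that appear in at least 0.3% of revisions means a term needs to appear in about 12 distinct revisions to survive. A quick check, assuming wiki is loaded as above:

# Minimum document frequency implied by the sparsity cutoff
nrow(wiki) * (1 - 0.997)
# ≈ 11.6, so a kept term appears in at least 12 revisions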

wordsAdded = as.data.frame(as.matrix(sparseAdded))

colnames(wordsAdded) = paste("A", colnames(wordsAdded))

corpusRemoved = VCorpus(VectorSource(wiki$Removed))

corpusRemoved = tm_map(corpusRemoved, removeWords, stopwords("english"))

corpusRemoved = tm_map(corpusRemoved, stemDocument)

dtmRemoved = DocumentTermMatrix(corpusRemoved)

sparseRemoved = removeSparseTerms(dtmRemoved, 0.997)

wordsRemoved = as.data.frame(as.matrix(sparseRemoved))

colnames(wordsRemoved) = paste("R", colnames(wordsRemoved))

Combine the two data frames using the command:

wikiWords = cbind(wordsAdded, wordsRemoved)

And then add the Vandal variable by using the command:

wikiWords$Vandal = wiki$Vandal
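As a quick sanity check that the bind lined up, the combined frame should have one row per revision and one column per retained term plus Vandal (a sketch; the exact column count depends on the sparse-term filters):

dim(wikiWords)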

To split the data, you can use the following commands:

library(caTools)

set.seed(123)

spl = sample.split(wikiWords$Vandal, SplitRatio = 0.7)

wikiTrain = subset(wikiWords, spl==TRUE)

wikiTest = subset(wikiWords, spl==FALSE)

table(wikiTest$Vandal)

Baseline Accuracy

618/(618+545)
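Equivalently, the baseline can be computed directly from the test set by always predicting the majority class, "not vandalism" (a minimal sketch):

# Majority-class baseline: 618 non-vandal revisions out of 1163
max(table(wikiTest$Vandal)) / nrow(wikiTest)
# ≈ 0.5314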

Build the CART model with the following commands (loading rpart first, plus rpart.plot for the prp() plots used below):

library(rpart)

library(rpart.plot)

wikiCART = rpart(Vandal ~ ., data=wikiTrain, method="class")

Make predictions on the test set:

testPredictCART = predict(wikiCART, newdata=wikiTest, type="class")

Compute the accuracy by comparing the actual values to the predicted values:

table(wikiTest$Vandal, testPredictCART)

The accuracy is

(618+12)/(618+533+12)
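Rather than summing cells by hand for every model, note that accuracy is just the diagonal of the confusion matrix over its total. A small helper along these lines (a sketch, not part of the original commands):

# Accuracy from a confusion matrix: correct predictions / all predictions
accuracy = function(actual, predicted) {
  confusion = table(actual, predicted)
  sum(diag(confusion)) / sum(confusion)
}

accuracy(wikiTest$Vandal, testPredictCART)
# ≈ 0.5417, matching the hand computation above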

prp(wikiCART)

Create a copy of your data frame from the previous question:

wikiWords2 = wikiWords

Make a new column in wikiWords2 that is 1 if "http" was in Added:

wikiWords2$HTTP = ifelse(grepl("http", wiki$Added, fixed=TRUE), 1, 0)

table(wikiWords2$HTTP)
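To get a feel for why this engineered variable might help before refitting, cross-tabulate it against the outcome (a quick sketch):

# Rows: whether the revision added a link; columns: vandalism flag
table(wikiWords2$HTTP, wikiWords2$Vandal)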

wikiTrain2 = subset(wikiWords2, spl==TRUE)

wikiTest2 = subset(wikiWords2, spl==FALSE)

Build the new model and make predictions:

wikiCART2 = rpart(Vandal ~ ., data=wikiTrain2, method="class")

testPredictCART2 = predict(wikiCART2, newdata=wikiTest2, type="class")

table(wikiTest2$Vandal, testPredictCART2)

Then the accuracy is

(609+57)/(609+9+488+57)

Add two word-count variables based on the original document-term matrices:

wikiWords2$NumWordsAdded = rowSums(as.matrix(dtmAdded))

wikiWords2$NumWordsRemoved = rowSums(as.matrix(dtmRemoved))

mean(wikiWords2$NumWordsAdded)
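Each row of a document-term matrix counts term occurrences for one revision, so these row sums measure how many (stemmed, stopword-filtered) words each revision added or removed. A quick look at both distributions (a sketch):

summary(wikiWords2$NumWordsAdded)

summary(wikiWords2$NumWordsRemoved)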

To split the data again, use the following commands:

wikiTrain3 = subset(wikiWords2, spl==TRUE)

wikiTest3 = subset(wikiWords2, spl==FALSE)

Compute the accuracy of the new CART model with the following commands:

wikiCART3 = rpart(Vandal ~ ., data=wikiTrain3, method="class")

testPredictCART3 = predict(wikiCART3, newdata=wikiTest3, type="class")

table(wikiTest3$Vandal, testPredictCART3)

The accuracy is

(514+248)/(514+104+297+248)

wikiWords3 = wikiWords2

Then add the two original variables Minor and Loggedin to this new data frame:

wikiWords3$Minor = wiki$Minor

wikiWords3$Loggedin = wiki$Loggedin
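Before refitting, it can help to see why Loggedin carries signal; the intuition is that anonymous edits may be disproportionately vandalism (a quick sketch on the original data frame):

# Rows: logged-in status (1 = logged in); columns: vandalism flag
table(wiki$Loggedin, wiki$Vandal)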

wikiTrain4 = subset(wikiWords3, spl==TRUE)

wikiTest4 = subset(wikiWords3, spl==FALSE)

This model can be built and evaluated using the following commands:

wikiCART4 = rpart(Vandal ~ ., data=wikiTrain4, method="class")

predictTestCART4 = predict(wikiCART4, newdata=wikiTest4, type="class")

table(wikiTest4$Vandal, predictTestCART4)

The accuracy of the model is

(595+241)/(595+23+304+241)

prp(wikiCART4)