Crowd Sourcing and Truth Discovery

Kyle Kirkpatrick

May 13, 2020

Introduction

With any major event, several different data sources (news, social media, blogs, etc.) describe the event from their own viewpoints or biases, causing conflicts to arise. Crowdsourcing is a way to resolve these conflicts and can be broken down into two main methods: truth discovery and crowdsourcing aggregation.

Truth discovery is the process of finding and identifying trustworthy information from multiple sources, and it is important when establishing what actually happened at an event. A story from a single source can be difficult to prove, but when multiple sources provide the same story, the true account is much easier to identify. Crowdsourcing aggregation works similarly: it relies on a large amount of data giving the same answer to find what is true.

Crowdsourcing aggregation is an active process (data taken from people in real time), whereas truth discovery is a passive process (using data that is already available). Metrics need to be defined to help characterize the truth.

The number of reliable sources -> the higher the number, the more trustworthy the data
The similarity between a variety of sources -> more similar stories are more truthful
Reliability of the sources (who and why)
Data that is historically accurate -> data written at a certain point in time doesn’t change

These metrics of truth discovery will be used when finding the data necessary for the scenario.
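The first three metrics can be turned into simple numbers. The following is a minimal sketch in base R, not the method used in this report: the source names, the 0/1 claims, and the reliability weights are all hypothetical values chosen only for illustration.

```r
# Hypothetical claims: 1 = source reports a headbutt, 0 = source does not.
claims <- c(src_a = 1, src_b = 1, src_c = 0, src_d = 1, src_e = 1)

# Hypothetical reliability weights in [0, 1] (metric 3: who wrote it and why).
reliability <- c(src_a = 0.9, src_b = 0.8, src_c = 0.4, src_d = 0.7, src_e = 0.6)

# Metric 1: number of sources available.
n_sources <- length(claims)

# Metric 2: share of sources that tell the same (majority) story.
agreement <- max(table(claims)) / n_sources

# Reliability-weighted support for the claim "a headbutt happened".
weighted_support <- sum(reliability * claims) / sum(reliability)

cat("sources:", n_sources, "\n")
cat("agreement:", agreement, "\n")
cat("weighted support:", round(weighted_support, 3), "\n")
```

Weighting by reliability means that a dissenting but less trusted source lowers the support for a claim less than a dissenting trusted source would.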

The Scenario

March 16, 1996 - The Chicago Bulls are playing the New Jersey Nets as they get ready for playoff basketball and a championship run. Early in the first quarter, Bulls power forward Dennis Rodman receives two technical fouls and is ejected from the game. Before leaving the court, Rodman appears to headbutt the referee during an argument. Did Dennis Rodman actually headbutt the ref? This is the question we are trying to answer.

Data will be collected from news articles written in 1996. The articles have been preserved online and have not been changed since the date published.

Data Collection

First, news articles about the event in 1996 need to be collected and saved as text files. A sample size of five was chosen, mainly due to the difficulty of finding articles actually written (and preserved) in 1996. One article is pulled from the Chicago Tribune and one article is pulled from the New York Times. The other three articles are pulled from random cities to try and remove as much bias as possible.

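# packages assumed from the functions called in this report
library(tm)        # Corpus, VCorpus, tm_map, DocumentTermMatrix
library(corpus)    # term_stats
library(dplyr)     # %>%, inner_join, group_by, top_n, mutate
library(tidytext)  # get_sentiments
library(ggplot2)   # ggplot, geom_col, facet_wrap
library(caret)     # confusionMatrix

# reads in the articles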
article1 <- readLines("C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI/headbutt1.txt")
article2 <- readLines("C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI/headbutt2.txt")
article3 <- readLines("C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI/headbutt3.txt")
article4 <- readLines("C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI/headbutt4.txt")
article5 <- readLines("C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI/headbutt5.txt")

folder <- "C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI"

Creating the Corpora

After the articles are saved as text files, a corpus is created for each file. Each corpus is cleaned by removing punctuation and numbers, converting the text to lowercase, and stripping extra whitespace. The text is then searched for the word “headbutt” in each article to aggregate what happened during the argument between Rodman and the ref.

# cleans up article 1
firstCorpus <- Corpus(VectorSource(article1))
firstCorpus <- tm_map(firstCorpus, removePunctuation)
firstCorpus <- tm_map(firstCorpus, removeNumbers)
firstCorpus <- tm_map(firstCorpus, tolower)
firstCorpus <- tm_map(firstCorpus, stripWhitespace)
tdm1 <- term_stats(firstCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm1

##   term                                                              type1   
## 1 headbutt probably would get you kicked out the door now           headbutt
## 2 headbutt that might just be a suspension pending an investigation headbutt
##   type2    type3 type4 type5 type6  type7      type8   type9 type10        count
## 1 probably would get   you   kicked out        the     door  now               1
## 2 that     might just  be    a      suspension pending an    investigation     1
##   support
## 1       1
## 2       1

The term ‘headbutt’ was found twice in this article. The author explains that Rodman would get kicked out instantly once the headbutt happened and might get a suspension after the investigation.

# cleans up article 2
secondCorpus <- Corpus(VectorSource(article2))
secondCorpus <- tm_map(secondCorpus, removePunctuation)
secondCorpus <- tm_map(secondCorpus, removeNumbers)
secondCorpus <- tm_map(secondCorpus, tolower)
secondCorpus <- tm_map(secondCorpus, stripWhitespace)
tdm2 <- term_stats(secondCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm2

##   term                                                             type1   
## 1 headbutt rodman continued to argue removed his shirt and knocked headbutt
## 2 headbutt was an accident but theyre going to make it             headbutt
##   type2  type3     type4    type5 type6   type7 type8 type9 type10  count
## 1 rodman continued to       argue removed his   shirt and   knocked     1
## 2 was    an        accident but   theyre  going to    make  it          1
##   support
## 1       1
## 2       1

‘Headbutt’ was found twice in this article. The author writes that the headbutt was an accident, but Rodman continued to argue and removed his jersey to make a scene.

# cleans up article 3
thirdCorpus <- Corpus(VectorSource(article3))
thirdCorpus <- tm_map(thirdCorpus, removePunctuation)
thirdCorpus <- tm_map(thirdCorpus, removeNumbers)
thirdCorpus <- tm_map(thirdCorpus, tolower)
thirdCorpus <- tm_map(thirdCorpus, stripWhitespace)
tdm3 <- term_stats(thirdCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm3

##   term                                              type1    type2  type3 type4
## 1 headbutt jordan and be exiled for life or wind up headbutt jordan and   be   
##   type5  type6 type7 type8 type9 type10 count support
## 1 exiled for   life  or    wind  up         1       1

The term ‘headbutt’ was found once in this article and the author explains that Rodman should be exiled for life.

# cleans up article 4
fourthCorpus <- Corpus(VectorSource(article4))
fourthCorpus <- tm_map(fourthCorpus, removePunctuation)
fourthCorpus <- tm_map(fourthCorpus, removeNumbers)
fourthCorpus <- tm_map(fourthCorpus, tolower)
fourthCorpus <- tm_map(fourthCorpus, stripWhitespace)
tdm4 <- term_stats(fourthCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm4

##   term                                                                type1   
## 1 headbutt an official before storming off the court the goldenhaired headbutt
##   type2 type3    type4  type5    type6 type7 type8 type9 type10       count
## 1 an    official before storming off   the   court the   goldenhaired     1
##   support
## 1       1

‘Headbutt’ was found once in the fourth article. Rodman stormed off the court after the argument with the referee.

# cleans up article 5
fifthCorpus <- Corpus(VectorSource(article5))
fifthCorpus <- tm_map(fifthCorpus, removePunctuation)
fifthCorpus <- tm_map(fifthCorpus, removeNumbers)
fifthCorpus <- tm_map(fifthCorpus, tolower)
fifthCorpus <- tm_map(fifthCorpus, stripWhitespace)
tdm5 <- term_stats(fifthCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm5

##   term                                                  type1    type2 type3   
## 1 headbutt an official jordan did it on feb against the headbutt an    official
##   type4  type5 type6 type7 type8 type9   type10 count support
## 1 jordan did   it    on    feb   against the        1       1

The final article also has the word ‘headbutt’ once and the author writes that Rodman did “headbutt an official”.
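The five blocks above repeat the same clean-and-search steps. As a rough sketch only, the same idea can be written once as a helper in base R. The function names clean_text and keyword_context are hypothetical, the sample sentence is made up, and the helper approximates rather than reproduces the tm and term_stats calls used above.

```r
# Lowercase the text and strip punctuation and digits, mirroring the tm steps.
clean_text <- function(lines) {
  text <- tolower(paste(lines, collapse = " "))
  text <- gsub("[[:punct:]]", "", text)
  text <- gsub("[[:digit:]]", "", text)
  gsub("\\s+", " ", trimws(text))
}

# Return each occurrence of `keyword` with up to `n_after` following words.
keyword_context <- function(lines, keyword = "headbutt", n_after = 9) {
  words <- strsplit(clean_text(lines), " ", fixed = TRUE)[[1]]
  hits <- which(words == keyword)
  vapply(hits, function(i) {
    paste(words[i:min(i + n_after, length(words))], collapse = " ")
  }, character(1))
}

# Made-up sentence used only to exercise the helper.
sample_lines <- c("The player appeared to headbutt an official,",
                  "then left the court in 1996.")
keyword_context(sample_lines, n_after = 3)
```

Like the subset used above, the helper matches the exact word only, so forms such as "headbutted" are not counted.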

Hypothesis - Dennis Rodman headbutted the referee, with some head-to-head contact made, based on the context of the articles.

Every article contains the word ‘headbutt’ in a context indicating that Rodman did in fact come head to head with an official, and this is the general consensus among all of the articles. A couple of articles explain that Rodman should receive fines and a suspension, whereas the article from the Chicago Tribune says that it was an accident.

Next, let's analyze the articles all together in one final corpus.

# reads in all of the documents (pattern is a regular expression, not a glob)
allDocs <- VCorpus(DirSource(directory = folder, pattern = "\\.txt$"))

# cleans up the corpus
cleanDocs <- allDocs %>%
             tm_map(removePunctuation) %>%
             tm_map(content_transformer(tolower)) %>%
             tm_map(removeWords, c(stopwords("english"))) %>%
             tm_map(stripWhitespace)

# creates a data frame of the word matrix
dtm <- DocumentTermMatrix(cleanDocs)
df <- as.data.frame(as.matrix(dtm))

# makes all column names readable
colnames(df) <- make.names(colnames(df))

Sentiment Analysis

A sentiment analysis is performed to analyze the ‘positive’ and ‘negative’ context of the corpus. An analysis with a larger number of negative entries suggests that Rodman did make contact with the official’s head.

# outputs the most frequent words
frequent <- colSums(df)
frequent <- sort(frequent, decreasing = TRUE)
frequent[1:25]

##    rodman      said     bulls      game    jordan   chicago    dennis     thorn 
##        65        26        20        19        15        13        12        12 
##      will     court     games      team      last    league   referee   rodmans 
##        12        11        11        11        10        10        10        10 
##       get    season       nba     stern bernhardt    called   ejected      head 
##         9         9         8         8         7         7         7         7 
##  headbutt 
##         7

nrc <- get_sentiments("nrc")

#creates a table for the most frequent words
temp_table <- data.frame(word = names(frequent), 
             word_count = frequent)%>% 
             inner_join(nrc)

## Joining, by = "word"

temp_table %>% 
  group_by(sentiment) %>% # groups the words by sentiment
  top_n(10, word_count) %>%
  ungroup() %>%
  mutate(word = reorder(word,word_count)) %>%
  ggplot(aes(x = word, # creates a plot (number of words vs frequent words used)
             y = word_count, fill = sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, scales = "free")+
  coord_flip() +
  theme(axis.text.y = element_text(size = 5), 
        axis.text.x = element_text(size = 5))

Based on the analysis, the corpus frames the incident in a mostly negative context. The largest sentiment categories are ‘negative’, ‘anger’, and ‘fear’, whereas the smallest include ‘joy’ and ‘sadness’. The authors were able to convey the feeling of the event through their writing, even without being present at the arena.
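The comparison between categories comes down to summing word counts within each sentiment. A minimal base R sketch of that step follows. The word counts and the small lexicon here are toy values invented for illustration; they are not taken from the articles or from the NRC lexicon.

```r
# Toy word counts and a toy lexicon; neither comes from the articles or NRC.
word_counts <- data.frame(
  word = c("ejected", "argue", "fine", "win", "suspension"),
  word_count = c(7, 4, 3, 2, 5)
)
toy_lexicon <- data.frame(
  word = c("ejected", "argue", "argue", "fine", "win", "suspension"),
  sentiment = c("negative", "anger", "negative", "positive", "positive", "negative")
)

# Attach sentiment labels; a word with two labels is counted under both.
labelled <- merge(word_counts, toy_lexicon, by = "word")

# Sum word counts within each sentiment and sort from largest to smallest.
totals <- aggregate(word_count ~ sentiment, data = labelled, FUN = sum)
totals <- totals[order(-totals$word_count), ]
print(totals)
```

Because a word can carry more than one label, the category totals can add up to more than the number of words in the corpus.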

Regardless of whether Rodman did or did not headbutt the referee, people around the U.S. would come to the conclusion that Rodman did come in contact with the official because of how the articles are written. This is why truth discovery is important. Let's revisit the metrics described earlier to see how truthful these articles are.

Each article does talk about Rodman headbutting the official in some way, which relates directly to the second metric: the sources agree that Rodman came head to head with the official. Each of the articles was also published on the same day (March 16, 1996) and archived without being edited (fourth metric). The articles were all written by sports analysts or writers for their respective newspapers. Since only five sources are analyzed, the first metric is not really satisfied. It was difficult to find articles about the event that were actually written in 1996, which is why these texts were chosen.

This analysis indicates that Rodman did in fact headbutt the referee (backing up the hypothesis), or at least that this is what the majority would have believed when these articles were released. Finally, let's create a confusion matrix to show the accuracy of the news article coverage.

Confusion Matrix

# sample data for predicted and actual
pred <- c(0,1,1,0,1)
act <- c(1,1,1,1,1)

# prints the confusion matrix
comb <- union(pred, act)
tab <- table(factor(pred, comb), factor(act, comb))
print(confusionMatrix(tab))

## Confusion Matrix and Statistics
## 
##    
##     0 1
##   0 0 2
##   1 0 3
##                                           
##                Accuracy : 0.6             
##                  95% CI : (0.1466, 0.9473)
##     No Information Rate : 1               
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0               
##                                           
##  Mcnemar's Test P-Value : 0.4795          
##                                           
##             Sensitivity :  NA             
##             Specificity : 0.6             
##          Pos Pred Value :  NA             
##          Neg Pred Value :  NA             
##              Prevalence : 0.0             
##          Detection Rate : 0.0             
##    Detection Prevalence : 0.4             
##       Balanced Accuracy :  NA             
##                                           
##        'Positive' Class : 0               
##

The confusion matrix shows an accuracy of 60% for the articles' coverage of the Rodman headbutt event. Of the five articles, three are trusted and reported the event correctly. However, based on the defined metrics, all of the articles generally tell the same story. Even though a couple of articles could not be trusted, they tell the same story as the trusted authors, which makes the confusion matrix slightly misleading. Note also that the act vector contains only 1s, so there are no actual 0 cases, which is why several statistics such as sensitivity are reported as NA. It is also difficult to construct a matrix from just five articles, and having more sources would have been much more helpful, which relates directly to the first metric.
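The 60% figure can be checked by hand from the same pred and act vectors. The sketch below uses base R only and treats 1 as the positive class. That choice is an assumption made here for readability and differs from the caret output above, where the positive class is 0.

```r
# Same sample vectors as in the confusion matrix section.
pred <- c(0, 1, 1, 0, 1)
act  <- c(1, 1, 1, 1, 1)

# Count each cell, treating 1 as the positive class.
tp <- sum(pred == 1 & act == 1)  # predicted 1, actually 1
fn <- sum(pred == 0 & act == 1)  # predicted 0, actually 1
fp <- sum(pred == 1 & act == 0)  # predicted 1, actually 0
tn <- sum(pred == 0 & act == 0)  # predicted 0, actually 0

accuracy <- (tp + tn) / length(act)
sensitivity <- tp / (tp + fn)    # defined, because actual positives exist
specificity <- tn / (tn + fp)    # NaN: there are no actual negatives

cat("accuracy:", accuracy, " sensitivity:", sensitivity,
    " specificity:", specificity, "\n")
```

With only one class present in act, one of sensitivity or specificity is always undefined, whichever class is treated as positive.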

Conclusion

Based on the text-based searches and analysis, as well as the truth discovery application, Dennis Rodman headbutted the referee. Most of the metrics were satisfied, except for the first, due to the difficulty of finding historically accurate articles. The writing context and the overall negative sentiment both indicate that Rodman did come in contact with the official.

Since crowdsourcing aggregation is an active process, the truth discovery method was performed over archived news articles instead. An example of crowdsourcing aggregation would have been asking each of the fans whether they thought Rodman headbutted the ref right after the event in 1996. Having the input of over 20,000 people would have easily satisfied the first metric and given a better representation of the general consensus.
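As a rough illustration of what crowdsourcing aggregation could look like, the sketch below tallies a small set of answers and takes the most common one as the consensus. The votes are hypothetical and are not real survey data. Simple majority voting is only one possible aggregation rule.

```r
# Hypothetical yes/no answers from fans; these are not real survey data.
votes <- c(rep("yes", 14), rep("no", 5), rep("unsure", 1))

# Tally the answers and convert to shares of all respondents.
tally <- table(votes)
shares <- prop.table(tally)

# The aggregated answer is the most common response.
consensus <- names(tally)[which.max(tally)]

print(round(shares, 2))
cat("consensus:", consensus, "\n")
```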

Finally, here is a video of the event taking place over 20 years ago: https://www.youtube.com/watch?v=niSd-GtPCGk