Kyle Kirkpatrick
May 13, 2020
With any major event, several different data sources (news, social media, blogs, etc.) describe what happened from their own viewpoints or biases, causing conflicts to arise. Crowdsourcing is a way to resolve these conflicts and can be broken down into two main methods: truth discovery and crowdsourcing aggregation.
Truth discovery is the process of finding and identifying trustworthy information from multiple sources, and it is important when establishing what actually happened during an event. A story from a single source can be difficult to prove, but when multiple sources tell the same story, the true account is much easier to identify. Crowdsourcing aggregation helps find what is true in a similar way, by looking for the answer that the largest share of the data agrees on.
Crowdsourcing aggregation is an active process (data is taken from people in real time), whereas truth discovery is a passive process (it uses data that is already available). Metrics need to be defined to help characterize the truth.
These metrics of truth discovery will be used when finding the data necessary for the scenario.
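As a minimal sketch of the aggregation idea described above (many sources giving the same answer), a simple majority vote can be written in base R. The `claims` vector and the `majority_vote` helper are hypothetical and are not part of the analysis that follows; practical truth discovery methods usually also weight each source by how reliable it is.

```r
# Hypothetical claims from five sources about a yes/no question
# (illustrative values only, not taken from the articles)
claims <- c(source1 = "yes", source2 = "yes", source3 = "no",
            source4 = "yes", source5 = "yes")

# Returns the most common answer and the share of sources that gave it
majority_vote <- function(x) {
  counts <- table(x)
  winner <- names(counts)[which.max(counts)]
  list(answer = winner, support = max(counts) / length(x))
}

result <- majority_vote(claims)
result$answer   # most common answer
result$support  # fraction of sources agreeing with it
```

With only a handful of sources, a single dissenting or biased source shifts the support noticeably, which is one reason the number of sources matters.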
March 16, 1996 - The Chicago Bulls are playing the New Jersey Nets as they get ready for playoff basketball and a championship run. Early in the first quarter, Bulls power forward Dennis Rodman receives two technical fouls and gets ejected from the game. Before leaving the court, Rodman appears to headbutt the referee during an argument. Did Dennis Rodman actually headbutt the ref? This is the question we are trying to answer.
Data will be collected from news articles written in 1996. The articles have been preserved online and have not been changed since the date published.
First, news articles about the event need to be collected and saved as text files. A sample size of five was chosen, mainly due to the difficulty of finding articles actually written (and preserved) in 1996. One article is pulled from the Chicago Tribune and one from the New York Times. The other three articles are pulled from random cities to try to remove as much bias as possible.
# loads the packages used throughout the analysis
library(tm)        # Corpus, VCorpus, tm_map, DocumentTermMatrix
library(corpus)    # term_stats
library(dplyr)     # %>%, inner_join, group_by, top_n
library(tidytext)  # get_sentiments
library(ggplot2)   # ggplot
library(caret)     # confusionMatrix
# reads in the articles
folder <- "C:/Users/kylek/OneDrive/Desktop/College/CST-425/CSI"
article1 <- readLines(file.path(folder, "headbutt1.txt"))
article2 <- readLines(file.path(folder, "headbutt2.txt"))
article3 <- readLines(file.path(folder, "headbutt3.txt"))
article4 <- readLines(file.path(folder, "headbutt4.txt"))
article5 <- readLines(file.path(folder, "headbutt5.txt"))
After the articles are saved as text files, a corpus is created for each file. Each corpus is cleaned up by removing punctuation and numbers, converting the text to lowercase, and stripping extra whitespace. The text is then searched for the word “headbutt”, along with the words that follow it, to piece together what each article says happened during the argument between Rodman and the ref.
# cleans up article 1
firstCorpus <- Corpus(VectorSource(article1))
firstCorpus <- tm_map(firstCorpus, removePunctuation)
firstCorpus <- tm_map(firstCorpus, removeNumbers)
firstCorpus <- tm_map(firstCorpus, content_transformer(tolower))
firstCorpus <- tm_map(firstCorpus, stripWhitespace)
tdm1 <- term_stats(firstCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm1
## term type1
## 1 headbutt probably would get you kicked out the door now headbutt
## 2 headbutt that might just be a suspension pending an investigation headbutt
## type2 type3 type4 type5 type6 type7 type8 type9 type10 count
## 1 probably would get you kicked out the door now 1
## 2 that might just be a suspension pending an investigation 1
## support
## 1 1
## 2 1
The term ‘headbutt’ was found twice in this article. The author explains that Rodman would get kicked out instantly once the headbutt happened and might get a suspension after the investigation.
# cleans up article 2
secondCorpus <- Corpus(VectorSource(article2))
secondCorpus <- tm_map(secondCorpus, removePunctuation)
secondCorpus <- tm_map(secondCorpus, removeNumbers)
secondCorpus <- tm_map(secondCorpus, content_transformer(tolower))
secondCorpus <- tm_map(secondCorpus, stripWhitespace)
tdm2 <- term_stats(secondCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm2
## term type1
## 1 headbutt rodman continued to argue removed his shirt and knocked headbutt
## 2 headbutt was an accident but theyre going to make it headbutt
## type2 type3 type4 type5 type6 type7 type8 type9 type10 count
## 1 rodman continued to argue removed his shirt and knocked 1
## 2 was an accident but theyre going to make it 1
## support
## 1 1
## 2 1
‘Headbutt’ was found twice in this article. The author writes that the headbutt was an accident, but Rodman continued to argue and removed his jersey to make a scene.
# cleans up article 3
thirdCorpus <- Corpus(VectorSource(article3))
thirdCorpus <- tm_map(thirdCorpus, removePunctuation)
thirdCorpus <- tm_map(thirdCorpus, removeNumbers)
thirdCorpus <- tm_map(thirdCorpus, content_transformer(tolower))
thirdCorpus <- tm_map(thirdCorpus, stripWhitespace)
tdm3 <- term_stats(thirdCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm3
## term type1 type2 type3 type4
## 1 headbutt jordan and be exiled for life or wind up headbutt jordan and be
## type5 type6 type7 type8 type9 type10 count support
## 1 exiled for life or wind up 1 1
The term ‘headbutt’ was found once in this article and the author explains that Rodman should be exiled for life.
# cleans up article 4
fourthCorpus <- Corpus(VectorSource(article4))
fourthCorpus <- tm_map(fourthCorpus, removePunctuation)
fourthCorpus <- tm_map(fourthCorpus, removeNumbers)
fourthCorpus <- tm_map(fourthCorpus, content_transformer(tolower))
fourthCorpus <- tm_map(fourthCorpus, stripWhitespace)
tdm4 <- term_stats(fourthCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm4
## term type1
## 1 headbutt an official before storming off the court the goldenhaired headbutt
## type2 type3 type4 type5 type6 type7 type8 type9 type10 count
## 1 an official before storming off the court the goldenhaired 1
## support
## 1 1
‘Headbutt’ was found once in the fourth article. Rodman stormed off the court after the argument with the referee.
# cleans up article 5
fifthCorpus <- Corpus(VectorSource(article5))
fifthCorpus <- tm_map(fifthCorpus, removePunctuation)
fifthCorpus <- tm_map(fifthCorpus, removeNumbers)
fifthCorpus <- tm_map(fifthCorpus, content_transformer(tolower))
fifthCorpus <- tm_map(fifthCorpus, stripWhitespace)
tdm5 <- term_stats(fifthCorpus, ngrams = 10, types = TRUE, subset = type1 == "headbutt")
tdm5
## term type1 type2 type3
## 1 headbutt an official jordan did it on feb against the headbutt an official
## type4 type5 type6 type7 type8 type9 type10 count support
## 1 jordan did it on feb against the 1 1
The final article also has the word ‘headbutt’ once and the author writes that Rodman did “headbutt an official”.
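The five searches above all follow the same pattern: clean the text, find the keyword, and read the words that come after it. A minimal base-R sketch of that pattern is shown here. The `keyword_context` helper and the sample text are hypothetical and stand in for the `tm` and `term_stats` calls used above. Unlike the 10-gram search, this sketch also returns shorter contexts when the keyword falls near the end of a line.

```r
# Base-R sketch: cleans lines of text and returns the words that follow a keyword.
# 'keyword_context' is a hypothetical helper, not a function from tm or corpus.
keyword_context <- function(lines, keyword, n_after = 9) {
  clean <- tolower(lines)
  clean <- gsub("[[:punct:]]", "", clean)    # removes punctuation
  clean <- gsub("[[:digit:]]", "", clean)    # removes numbers
  clean <- trimws(gsub("\\s+", " ", clean))  # collapses extra whitespace
  hits <- character(0)
  for (line in clean) {
    words <- strsplit(line, " ", fixed = TRUE)[[1]]
    for (i in which(words == keyword)) {
      last <- min(i + n_after, length(words))
      hits <- c(hits, paste(words[i:last], collapse = " "))
    }
  }
  hits
}

# Made-up sample text (not a quote from any of the five articles)
sample_lines <- c("The player appeared to Headbutt the official, then left.",
                  "No mention of the incident on this line.")
keyword_context(sample_lines, "headbutt", n_after = 3)
```

Wrapping the steps in one function would also avoid repeating the same cleaning code for every article.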
Hypothesis - Dennis Rodman headbutted the referee with some head to head contact made, based on the context of the articles.
Every article contains the word ‘headbutt’, with context indicating that Rodman did in fact come head to head with an official, and this is the general consensus among all of the articles. A couple of articles explain that Rodman should get fines and a suspension, whereas the article from the Chicago Tribune says that it was an accident.
Next, let's analyze all of the articles together in one final corpus.
# reads in all of the documents
allDocs <- VCorpus(DirSource(directory = folder, pattern = "\\.txt$"))
# cleans up the corpus
cleanDocs <- allDocs %>%
  tm_map(removePunctuation) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removeWords, c(stopwords("english"))) %>%
  tm_map(stripWhitespace)
# creates a data frame of the word matrix
dtm <- DocumentTermMatrix(cleanDocs)
df <- as.data.frame(as.matrix(dtm))
# makes all column names readable
colnames(df) <- make.names(colnames(df))
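To make the structure of the document-term matrix concrete, here is a minimal base-R sketch that builds one by hand. The two documents are made up for illustration and `toy_dtm` is a hypothetical name. `DocumentTermMatrix` from `tm` produces the same kind of table (one row per document, one column per term) in a sparse format.

```r
# Two made-up, already-cleaned documents (illustrative only)
docs <- c(doc1 = "rodman argued referee ejected rodman",
          doc2 = "referee called foul rodman")

# Splits each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# Builds the vocabulary and counts each term per document
vocab <- sort(unique(unlist(tokens)))
counts <- vapply(tokens,
                 function(w) as.integer(table(factor(w, levels = vocab))),
                 integer(length(vocab)))

# Transposes so that rows are documents and columns are terms
toy_dtm <- t(counts)
colnames(toy_dtm) <- vocab

toy_dtm           # per-document term counts
colSums(toy_dtm)  # total count of each term across documents
```

Summing the columns, as in the last line, is the same operation used to find the most frequent words in the full corpus.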
A sentiment analysis is performed to analyze the ‘positive’ and ‘negative’ context of the corpus. An analysis with a larger number of negative entries suggests that Rodman did make contact with the official’s head.
# outputs the most frequent words
frequent <- colSums(df)
frequent <- sort(frequent, decreasing = TRUE)
frequent[1:25]
## rodman said bulls game jordan chicago dennis thorn
## 65 26 20 19 15 13 12 12
## will court games team last league referee rodmans
## 12 11 11 11 10 10 10 10
## get season nba stern bernhardt called ejected head
## 9 9 8 8 7 7 7 7
## headbutt
## 7
nrc <- get_sentiments("nrc")
# creates a table for the most frequent words
temp_table <- data.frame(word = names(frequent),
                         word_count = frequent) %>%
  inner_join(nrc)
## Joining, by = "word"
temp_table %>%
  group_by(sentiment) %>%  # groups the words by sentiment
  top_n(10, word_count) %>%
  ungroup() %>%
  mutate(word = reorder(word, word_count)) %>%
  ggplot(aes(x = word,  # creates a plot (number of words vs frequent words used)
             y = word_count, fill = sentiment)) +
  geom_col() +
  facet_wrap(~sentiment, scales = "free") +
  coord_flip() +
  theme(axis.text.y = element_text(size = 5),
        axis.text.x = element_text(size = 5))
Based on the analysis, the corpus skews negative in its coverage of the incident. The categories with the largest sentiment are ‘negative’, ‘anger’, and ‘fear’, whereas the smallest categories include ‘joy’ and ‘sadness’. The authors were able to portray the feeling of the event through their writing, without even being present at the arena.
Regardless of whether Rodman did or did not headbutt the referee, readers around the U.S. would come to the conclusion that Rodman did come in contact with the official because of how the articles are written. This is why truth discovery is important. Let's revisit the metrics described earlier to see how truthful these articles are.
Each article does talk about Rodman headbutting the official in some way, which relates directly to our second metric. The sources are similar in that each says Rodman came head to head with the official. Each of the articles was also posted on the same day (March 16, 1996) and was archived without being edited (fourth metric). The articles were all written by sports analysts or writers for the specific newspaper. Since only five sources are analyzed, the first metric is not really satisfied. Articles written about the event in 1996 were difficult to find, which is why these particular texts were chosen.
This analysis shows that Rodman did in fact headbutt the referee (backing up the hypothesis), or at least that is what the majority of readers would have believed when these articles were released. Finally, let's create a confusion matrix to show the accuracy of the news article coverage.
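The plot shows the top words within each category. One way to compare the categories directly is to total the word counts per sentiment. The following is a minimal base-R sketch of that step, assuming made-up word counts and a small toy lexicon (`toy_lexicon` is hypothetical and is not the NRC lexicon used above).

```r
# Hypothetical word counts and a toy lexicon (NOT the NRC lexicon)
word_counts <- data.frame(word = c("ejected", "argue", "fine", "win", "suspension"),
                          word_count = c(7, 4, 3, 2, 5),
                          stringsAsFactors = FALSE)
toy_lexicon <- data.frame(word = c("ejected", "argue", "argue", "win", "suspension"),
                          sentiment = c("negative", "negative", "anger",
                                        "positive", "negative"),
                          stringsAsFactors = FALSE)

# Joins counts to sentiment labels; words missing from the lexicon are dropped
joined <- merge(word_counts, toy_lexicon, by = "word")

# Sums the word counts within each sentiment category
totals <- aggregate(word_count ~ sentiment, data = joined, FUN = sum)

# Sorts the categories from largest to smallest total
totals[order(-totals$word_count), ]
```

A word can carry more than one label in a lexicon (as "argue" does here), so it contributes to every category it belongs to.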
# sample data for predicted and actual
pred <- c(0,1,1,0,1)
act <- c(1,1,1,1,1)
# prints the confusion matrix
comb <- union(pred, act)
tab <- table(factor(pred, comb), factor(act, comb))
print(confusionMatrix(tab))
## Confusion Matrix and Statistics
##
##
## 0 1
## 0 0 2
## 1 0 3
##
## Accuracy : 0.6
## 95% CI : (0.1466, 0.9473)
## No Information Rate : 1
## P-Value [Acc > NIR] : 1.0000
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 0.4795
##
## Sensitivity : NA
## Specificity : 0.6
## Pos Pred Value : NA
## Neg Pred Value : NA
## Prevalence : 0.0
## Detection Rate : 0.0
## Detection Prevalence : 0.4
## Balanced Accuracy : NA
##
## 'Positive' Class : 0
##
The confusion matrix shows an accuracy of 60% for the articles' coverage of the Rodman headbutt event: of the five articles, three are treated as trusted and as having reported the event correctly. However, based on the defined metrics, all of the articles tell generally the same story. Even though a couple of articles could not be trusted, they tell the same story as the trusted authors, which makes the confusion matrix slightly misleading. It is also difficult to construct a meaningful matrix from just five articles. More sources would have been much more helpful, which relates directly to the first metric.
Based on the text-based searches and analysis, as well as the truth discovery application, Dennis Rodman headbutted the referee. Most of the metrics were satisfied except for the first, due to the difficulty of finding historically accurate articles. The writing context and the overall negative sentiment analysis show that Rodman did come in contact with the official.
Since crowdsourcing aggregation is an active process, the truth discovery method was performed over archived news articles instead. An example of crowdsourcing aggregation would have been asking each of the fans if they thought Rodman headbutted the ref right after the event in 1996. Having the input of over 20,000 people would have easily satisfied the first metric and given a better representation of the general consensus.
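The figures in the output above can be reproduced by hand, which also explains the `NA` entries. This is a minimal base-R sketch using the same sample `pred` and `act` vectors. The variable names `accuracy`, `sensitivity`, and `specificity` are introduced here for illustration.

```r
# Same sample labels as in the confusion matrix above
pred <- c(0, 1, 1, 0, 1)
act  <- c(1, 1, 1, 1, 1)

# Rows are predictions, columns are actual values
tab <- table(predicted = factor(pred, levels = c(0, 1)),
             actual    = factor(act,  levels = c(0, 1)))

# Accuracy: share of articles where prediction and actual agree
accuracy <- sum(diag(tab)) / sum(tab)

# With 0 as the 'positive' class, sensitivity divides by the number of
# actual 0s, which is zero here, so the result is undefined (NaN)
sensitivity <- tab["0", "0"] / sum(tab[, "0"])

# Specificity: share of actual 1s that were also predicted as 1
specificity <- tab["1", "1"] / sum(tab[, "1"])
```

Because every actual value is 1, three matches out of five give the 0.6 accuracy, and any statistic that divides by the number of actual 0s cannot be computed.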
Finally, here is a video of the event taking place over 20 years ago: https://www.youtube.com/watch?v=niSd-GtPCGk