The Usenet data is publicly available at: http://qwone.com/~jason/20Newsgroups/.
The Usenet bulletin boards in this dataset include newsgroups for topics like politics, religion, cars, sports, and cryptography. The text is diverse, rich, and includes many different authors. Thus, analysis of the text is interesting and applicable to current developments in machine learning.
Step 1: Pre-Processing our data
We will read in all the messages from the 20news-bydate folder. The 20news-bydate folder contains sub-folders. We will use the sub-folders to create a training set of data and a test set of data. Read in files with read_lines(), map(), and unnest().
We will be using these packages throughout the analysis.
library(dplyr)
library(purrr)
library(readr)
library(tidyr)
library(ggplot2)
library(stringr)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(topicmodels)
Unzip the tar file with the untar() function:
untar("20news-bydate.tar.gz")
Create a training folder that contains the 20news-bydate-train data, make function that reads folders, and create a data frame to hold the title of the newsgroup, the message id, and the text that goes with it.
training_folder <- "20news-bydate-train/"
# Create a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
data_frame(file = dir(infolder, full.names = TRUE)) %>%
mutate(text = map(file, read_lines)) %>%
transmute(id = basename(file), text) %>%
unnest(text)
}
# Use unnest() and map() to apply read_folder to each subfolder
(raw_text <- data_frame(folder = dir(training_folder, full.names = TRUE)) %>%
unnest(map(folder, read_folder)) %>%
transmute(newsgroup = basename(folder), id, text))
## # A tibble: 511,655 x 3
## newsgroup id
## <chr> <chr>
## 1 alt.atheism 49960
## 2 alt.atheism 49960
## 3 alt.atheism 49960
## 4 alt.atheism 49960
## 5 alt.atheism 49960
## 6 alt.atheism 49960
## 7 alt.atheism 49960
## 8 alt.atheism 49960
## 9 alt.atheism 49960
## 10 alt.atheism 49960
## # ... with 511,645 more rows, and 1 more variables: text <chr>
The newsgroup column identifies which of the 20 newsgroups the article came from, the id column identifies a unique message within that newsgroup, and the text column contains the text associated with the newsgroup and id number (i.e. the message).
Here are the names of all 20 newsgroups and the total number of messages within each newsgroup.
raw_text %>%
group_by(newsgroup) %>%
summarise(messages = n_distinct(id)) %>%
ggplot(aes(x = newsgroup, y = messages)) +
geom_col() +
coord_flip()
Next we will pre-process the text from the messages within the newsgroups (i.e. clean the data). All the messages include a header like this “from:” or “in_reply_to:” and may include email signatures which occur after “–”. Use cumsum() from dplyr and str_detect() from stringr to remove these lines of text.
# remove header and email sig text
cleaned_text <- raw_text %>%
group_by(newsgroup, id) %>%
filter(cumsum(text == "") > 0,
cumsum(str_detect(text, "^--")) == 0) %>%
ungroup()
head(cleaned_text)
## # A tibble: 6 x 3
## newsgroup id text
## <chr> <chr> <chr>
## 1 alt.atheism 49960
## 2 alt.atheism 49960 Archive-name: atheism/resources
## 3 alt.atheism 49960 Alt-atheism-archive-name: resources
## 4 alt.atheism 49960 Last-modified: 11 December 1992
## 5 alt.atheism 49960 Version: 1.0
## 6 alt.atheism 49960
Some text also included quotes by other writers (use regex), and messages 9704 and 9985 contain limited to no textual data so lets remove them too.
cleaned_text <- cleaned_text %>%
filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
!str_detect(text, "writes(:|\\.\\.\\.)$"),
!str_detect(text, "^In article <"),
!id %in% c(9704, 9985))
head(cleaned_text)
## # A tibble: 6 x 3
## newsgroup id text
## <chr> <chr> <chr>
## 1 alt.atheism 49960
## 2 alt.atheism 49960 Archive-name: atheism/resources
## 3 alt.atheism 49960 Alt-atheism-archive-name: resources
## 4 alt.atheism 49960 Last-modified: 11 December 1992
## 5 alt.atheism 49960 Version: 1.0
## 6 alt.atheism 49960
Next we tokenize the data using unnest_tokens() to seperate the messages by each word. Then, we remove stop words from the data.
usenet_words <- cleaned_text %>%
unnest_tokens(word, text) %>%
filter(str_detect(word, "[a-z']$"),
!word %in% stop_words$word)
head(usenet_words)
## # A tibble: 6 x 3
## newsgroup id word
## <chr> <chr> <chr>
## 1 alt.atheism 49960 archive
## 2 alt.atheism 49960 atheism
## 3 alt.atheism 49960 resources
## 4 alt.atheism 49960 alt
## 5 alt.atheism 49960 atheism
## 6 alt.atheism 49960 archive
What are the most common words in the newsgroups?
usenet_words %>%
count(word, sort = TRUE)
## # A tibble: 68,137 x 2
## word n
## <chr> <int>
## 1 people 3655
## 2 time 2705
## 3 god 1626
## 4 system 1595
## 5 program 1103
## 6 bit 1097
## 7 information 1094
## 8 windows 1088
## 9 government 1084
## 10 space 1072
## # ... with 68,127 more rows
What about the most common words in each newsgroup?
words_by_newsgroup <- usenet_words %>%
group_by(newsgroup) %>%
count(word, sort = TRUE) %>%
ungroup()
head(words_by_newsgroup, n = 20)
## # A tibble: 20 x 3
## newsgroup word n
## <chr> <chr> <int>
## 1 soc.religion.christian god 917
## 2 sci.space space 840
## 3 talk.politics.mideast people 728
## 4 sci.crypt key 704
## 5 comp.os.ms-windows.misc windows 625
## 6 talk.politics.mideast armenian 582
## 7 sci.crypt db 549
## 8 talk.politics.mideast turkish 514
## 9 rec.autos car 509
## 10 talk.politics.mideast armenians 509
## 11 comp.sys.ibm.pc.hardware scsi 483
## 12 soc.religion.christian jesus 440
## 13 soc.religion.christian people 436
## 14 talk.politics.mideast israel 428
## 15 talk.politics.guns gun 425
## 16 comp.windows.x file 414
## 17 sci.crypt encryption 410
## 18 talk.politics.misc people 396
## 19 comp.sys.ibm.pc.hardware drive 370
## 20 rec.sport.hockey game 370
We know that each group talks about different topics and uses different words. But, do the some newsgroups use very different words when discussing topics than other newsgroups? We can use tf-idf to find out.Tf-idf is the term frequency inverse document frequency. We use tf-idf to identify words that are important to a document. For example, many use groups use the word “are” but in this context because are is used many times by every newsgroup it isn’t very helpful to know that “are” is a common word in each newsgroup. I won’t elaborate on the math behind tf-idf because it isn’t important to this analysis. With that said, all you need to know and accept is that tf-idf identifies important words in each newsgroup.
tf_idf_all <- words_by_newsgroup %>%
bind_tf_idf(word, newsgroup, n) %>%
arrange(desc(tf-idf))
tf_idf_all
## # A tibble: 173,913 x 6
## newsgroup word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 soc.religion.christian god 917 0.017054754 0 0
## 2 sci.space space 840 0.016093496 0 0
## 3 sci.crypt key 704 0.014013018 0 0
## 4 comp.sys.ibm.pc.hardware drive 370 0.013495277 0 0
## 5 talk.politics.misc people 396 0.010432034 0 0
## 6 rec.sport.hockey game 370 0.010361534 0 0
## 7 talk.politics.mideast people 728 0.010068042 0 0
## 8 alt.atheism people 276 0.010015967 0 0
## 9 talk.religion.misc god 296 0.009877202 0 0
## 10 alt.atheism god 268 0.009725650 0 0
## # ... with 173,903 more rows
We can also view the highest ranking tf-idf words for a specific topic (i.e. talk.something.something).
(talk_tf_idf <- tf_idf_all %>%
filter(str_detect(newsgroup, "^talk\\.")) %>%
group_by(newsgroup) %>%
top_n(10, tf_idf) %>%
ungroup() %>%
mutate(word = reorder(word, tf_idf)))
## # A tibble: 40 x 6
## newsgroup word n tf idf tf_idf
## <chr> <fctr> <int> <dbl> <dbl> <dbl>
## 1 talk.politics.guns gun 425 0.010172575 0.2231436 0.002269944
## 2 talk.politics.misc president 295 0.007771338 0.2231436 0.001734124
## 3 talk.politics.guns guns 209 0.005002513 0.5978370 0.002990688
## 4 talk.politics.guns weapons 159 0.003805740 0.5978370 0.002275212
## 5 talk.religion.misc bible 158 0.005272290 0.6931472 0.003654473
## 6 talk.politics.misc congress 96 0.002528978 0.6931472 0.001752954
## 7 talk.religion.misc jesus 256 0.008542445 0.7985077 0.006821208
## 8 talk.religion.misc ra 96 0.003203417 0.7985077 0.002557953
## 9 talk.politics.guns weapon 128 0.003063740 0.7985077 0.002446420
## 10 talk.politics.misc secretary 79 0.002081138 0.7985077 0.001661805
## # ... with 30 more rows
# and visualize
ggplot(talk_tf_idf, aes(x = word, y = tf_idf, fill = newsgroup)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup, scales = "free") +
ylab("tf-idf") +
coord_flip()
As one could guess, within each newsgroup and topic the highest ranked tf-idf words accurately represent the topic of the newsgroup.
Here is another interesting question: Which newsgroups were similar to each other in terms of text content?
We can use the pairwise_cor() function to find an answer.
(newsgroup_word_cors <- words_by_newsgroup %>%
pairwise_cor(newsgroup, word, n, sort = TRUE))
## # A tibble: 380 x 3
## item1 item2 correlation
## <chr> <chr> <dbl>
## 1 talk.religion.misc soc.religion.christian 0.8347275
## 2 soc.religion.christian talk.religion.misc 0.8347275
## 3 alt.atheism talk.religion.misc 0.7793079
## 4 talk.religion.misc alt.atheism 0.7793079
## 5 alt.atheism soc.religion.christian 0.7510723
## 6 soc.religion.christian alt.atheism 0.7510723
## 7 comp.sys.mac.hardware comp.sys.ibm.pc.hardware 0.6799043
## 8 comp.sys.ibm.pc.hardware comp.sys.mac.hardware 0.6799043
## 9 rec.sport.baseball rec.sport.hockey 0.5770378
## 10 rec.sport.hockey rec.sport.baseball 0.5770378
## # ... with 370 more rows
The newsgroups who most often used the same words are talk.religion.misc and soc.religion.christian. These two newsgroups use the same words about 83% of the time.
Next, we will make a network to that displays strong correlations between newsgroups (i.e. newsgroups that frequently use the same words).
set.seed(1234)
newsgroup_word_cors %>%
filter(correlation > 0.4) %>%
graph_from_data_frame() %>%
ggraph(layout = "fr") +
geom_edge_link(aes(alpha = correlation, width = correlation)) +
geom_node_point(size = 6, color = "lightblue") +
geom_node_text(aes(label = name), repel = TRUE) +
theme_void()
The network suggest there are four main topics: computers/electronics, sports/vehicles, politics, and religion.
Now we will apply latent Dirichlet allocation (LDA) to our newsgroup data. Will will use LDA methods to randomly guess which newsgroup each message belongs to.
First we make a Document-Term Matrix by using the cast_dtm() function from the topicmodels package.
# group by word and add a word total column for each word in each document. Include words used more than 50 times
words_count_newsgroup <- usenet_words %>%
group_by(word) %>%
mutate(word_total = n()) %>%
ungroup() %>%
filter(word_total > 50)
words_count_newsgroup %>%
arrange(desc(word_total))
## # A tibble: 424,097 x 4
## newsgroup id word word_total
## <chr> <chr> <chr> <int>
## 1 alt.atheism 49960 people 3655
## 2 alt.atheism 49960 people 3655
## 3 alt.atheism 49960 people 3655
## 4 alt.atheism 49960 people 3655
## 5 alt.atheism 49960 people 3655
## 6 alt.atheism 49960 people 3655
## 7 alt.atheism 51122 people 3655
## 8 alt.atheism 51124 people 3655
## 9 alt.atheism 51124 people 3655
## 10 alt.atheism 51131 people 3655
## # ... with 424,087 more rows
# df containing only words and their totals that belong to the four science related newsgroups
(word_sci_newsgroups <- usenet_words %>%
filter(str_detect(newsgroup, "^sci")) %>%
group_by(word) %>%
mutate(word_total = n()) %>%
ungroup() %>%
filter(word_total > 50))
## # A tibble: 57,306 x 4
## newsgroup id word word_total
## <chr> <chr> <chr> <int>
## 1 sci.crypt 14147 archive 80
## 2 sci.crypt 14147 faq 173
## 3 sci.crypt 14147 sun 93
## 4 sci.crypt 14147 posting 101
## 5 sci.crypt 14831 archive 80
## 6 sci.crypt 14831 faq 173
## 7 sci.crypt 14831 posting 101
## 8 sci.crypt 14832 archive 80
## 9 sci.crypt 14982 faq 173
## 10 sci.crypt 14982 file 106
## # ... with 57,296 more rows
# DTM of the words_count_newsgroup
(dtm_all_newsgroups <- words_count_newsgroup %>%
unite(document, newsgroup, id, sep = "_") %>%
count(document, word, sort = TRUE) %>%
cast_dtm(document, word, n))
## <<DocumentTermMatrix (documents: 11117, terms: 2757)>>
## Non-/sparse entries: 292189/30357380
## Sparsity : 99%
## Maximal term length: 15
## Weighting : term frequency (tf)
# convert to an LDA model. A quick note, this will take about 1-2 minutes to complete.
(lda_all_newsgroups <- LDA(dtm_all_newsgroups, k = 4, control = list(seed = 1234)))
## A LDA_VEM topic model with 4 topics.
What topics did the LDA model predict? Do the topics match the four main topics we identified earlier (i.e. computers/electronics, sports/vehicles, politics, and religion)? Before we answer that question, we should get a feel for what words were assigned to each topic. Let’s check!
(newsgroup_topics <- tidy(lda_all_newsgroups, matrix = "beta"))
## # A tibble: 11,028 x 3
## topic term beta
## <int> <chr> <dbl>
## 1 1 db 1.095177e-16
## 2 2 db 1.135525e-28
## 3 3 db 7.293096e-03
## 4 4 db 6.922673e-56
## 5 1 mov 6.620114e-47
## 6 2 mov 3.759340e-104
## 7 3 mov 1.678378e-03
## 8 4 mov 6.142452e-102
## 9 1 output 3.797681e-03
## 10 2 output 1.426611e-30
## # ... with 11,018 more rows
# top 10 words assigned to each topic
(top_words_by_topics <- newsgroup_topics %>%
group_by(topic) %>%
top_n(10, beta) %>%
arrange(topic, -beta) %>%
ungroup() %>%
mutate(term = factor(term, rev(unique(term)))))
## # A tibble: 40 x 3
## topic term beta
## <int> <fctr> <dbl>
## 1 1 windows 0.008011024
## 2 1 system 0.007851254
## 3 1 file 0.007506175
## 4 1 mail 0.006725197
## 5 1 drive 0.006720866
## 6 1 program 0.006295376
## 7 1 bit 0.005860197
## 8 1 key 0.005527687
## 9 1 software 0.005521186
## 10 1 card 0.005349582
## # ... with 30 more rows
top_words_by_topics %>%
ggplot(aes(x = term, y = beta, fill = as.factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, ncol = 2, scales = "free") +
ylab("beta") +
coord_flip()
It looks like our LDA model did a pretty good job identifying the four main topics. Based of the barcharts, I think topic 1 is the computers/electronics topics, topic 2 is the religion topic, topic 3 is the sports/vehicles topic, and topic 4 is the politics topic. What do you think?
Let’s see how accurate the LDA topics are by viewing the documents assigned to each topic.
(documents_by_topic <- lda_all_newsgroups %>%
tidy(matrix = "gamma"))
## # A tibble: 44,468 x 3
## document topic gamma
## <chr> <int> <dbl>
## 1 sci.crypt_14991 1 2.947834e-02
## 2 comp.windows.x_64831 1 9.998396e-01
## 3 comp.windows.x_66409 1 9.998441e-01
## 4 sci.electronics_53569 1 8.265134e-01
## 5 talk.politics.mideast_76013 1 9.594500e-05
## 6 comp.windows.x_64830 1 9.892861e-01
## 7 comp.windows.x_66408 1 9.886978e-01
## 8 alt.atheism_53519 1 6.718424e-05
## 9 comp.sys.mac.hardware_51892 1 9.975757e-01
## 10 sci.med_58766 1 1.473460e-02
## # ... with 44,458 more rows
# top 10 docs associated with each topic
(top_docs_by_topic <- documents_by_topic %>%
group_by(topic) %>%
top_n(10, gamma) %>%
arrange(topic, -gamma) %>%
ungroup() %>%
mutate(document = factor(document, rev(unique(document))),
topic = factor(topic,
levels = c("comp/electronics", "religion", "sports/vehicles", "politics"))))
## # A tibble: 40 x 3
## document topic gamma
## <fctr> <fctr> <dbl>
## 1 comp.windows.x_66409 NA 0.9998441
## 2 comp.windows.x_64831 NA 0.9998396
## 3 comp.graphics_38682 NA 0.9993638
## 4 comp.graphics_38726 NA 0.9984200
## 5 comp.sys.ibm.pc.hardware_60143 NA 0.9981591
## 6 comp.sys.ibm.pc.hardware_60392 NA 0.9981183
## 7 comp.sys.ibm.pc.hardware_60275 NA 0.9979262
## 8 comp.os.ms-windows.misc_9741 NA 0.9978215
## 9 comp.sys.ibm.pc.hardware_60381 NA 0.9976434
## 10 comp.sys.mac.hardware_51892 NA 0.9975757
## # ... with 30 more rows
# barchart of the assignment of docs for each topic
documents_by_topic %>%
separate(document, c("newsgroup", "id"), sep = "_", convert = TRUE) %>%
mutate(newsgroup = reorder(newsgroup, topic * gamma)) %>%
ggplot(aes(x = as.factor(topic), y = gamma, fill = as.factor(topic))) +
geom_boxplot() +
facet_wrap(~ newsgroup) +
labs(x = "Topic",
y = "# of messages where this document was the highest % of topic")
This plot shows that not every document was easily assigned to a topic. This may lead us to increase our k-value in our LDA model. However, I believe a k-value of 4 accurately reflects the most frequent topics discussed in each newsgroup.
Which newsgroups tended to use positive words? Which newsgroups tended to us negative words? Lets find out!
First will will load the AFINN sentiment lexicon, which provides a positve rating score for words. Each word is assigned a score ranging between -5 (negative), 0 (neutral), and 5 (positive).
(newsgroup_scores <- words_by_newsgroup %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
group_by(newsgroup) %>%
summarise(score = sum(score * n) / sum(n)))
## # A tibble: 20 x 2
## newsgroup score
## <chr> <dbl>
## 1 alt.atheism -0.20900322
## 2 comp.graphics 0.52917666
## 3 comp.os.ms-windows.misc 0.45899172
## 4 comp.sys.ibm.pc.hardware 0.15653221
## 5 comp.sys.mac.hardware 0.25963489
## 6 comp.windows.x 0.23649010
## 7 misc.forsale 0.67283494
## 8 rec.autos -0.24139514
## 9 rec.motorcycles -0.19439421
## 10 rec.sport.baseball 0.22085890
## 11 rec.sport.hockey 0.25540346
## 12 sci.crypt -0.06614405
## 13 sci.electronics 0.08923077
## 14 sci.med -0.29604079
## 15 sci.space 0.18295652
## 16 soc.religion.christian 0.10290170
## 17 talk.politics.guns -0.83101278
## 18 talk.politics.mideast -0.80498950
## 19 talk.politics.misc -0.20904218
## 20 talk.religion.misc -0.08766152
newsgroup_scores %>%
mutate(newsgroup = reorder(newsgroup, score)) %>%
ggplot(aes(x = newsgroup, y = score, fill = score > 0)) +
geom_col(show.legend = FALSE) +
coord_flip() +
ylab("Average sentiment score")
Interestingly enough, the misc.forsale newsgroup received the highest sentiment score. I wonder if misc.forsale has a lot of positive words because people tend to use words to show that an item is in good condition? Also, I wonder why the talk.politics.guns newsgroup has the most negative sentiment score? Let’s find out by looking at words that impacted sentiment score most.
(contributions <- usenet_words %>%
inner_join(get_sentiments("afinn")) %>%
group_by(word) %>%
summarise(occurances = n(),
contribution = sum(score)))
## Joining, by = "word"
## # A tibble: 1,909 x 3
## word occurances contribution
## <chr> <int> <int>
## 1 abandon 13 -26
## 2 abandoned 19 -38
## 3 abandons 3 -6
## 4 abduction 2 -4
## 5 abhor 4 -12
## 6 abhorred 1 -3
## 7 abhorrent 2 -6
## 8 abilities 16 32
## 9 ability 177 354
## 10 aboard 8 8
## # ... with 1,899 more rows
contributions %>%
top_n(20, abs(contribution)) %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
ylab("Words that had the most impact on the sentiment score") +
coord_flip()
Now, let’s take a look at the top 10 words by contribution to sentiment score within each newsgroup.
%>% top_n(10, contribution) %>% ggplot(aes(x = word, y = contribution, fill = contribution > 0)) + geom_col(show.legend = FALSE) + facet_wrap(~ newsgroup, ncol = 4, scales = “free”) + ylab(“Words within each newsgroup that contributed most to their overal sentiment score”) + coord_flip()
(top_contributions_by_newsgroup <- words_by_newsgroup %>%
inner_join(get_sentiments("afinn"), by = "word") %>%
mutate(contribution = score * n / sum(n)))
## # A tibble: 13,063 x 5
## newsgroup word n score contribution
## <chr> <chr> <int> <int> <dbl>
## 1 soc.religion.christian god 917 1 0.014418012
## 2 soc.religion.christian jesus 440 1 0.006918130
## 3 talk.politics.guns gun 425 -1 -0.006682285
## 4 talk.religion.misc god 296 1 0.004654015
## 5 alt.atheism god 268 1 0.004213770
## 6 soc.religion.christian faith 257 1 0.004040817
## 7 talk.religion.misc jesus 256 1 0.004025094
## 8 talk.politics.mideast killed 202 -3 -0.009528152
## 9 talk.politics.mideast war 187 -2 -0.005880411
## 10 soc.religion.christian true 179 2 0.005628842
## # ... with 13,053 more rows
top_contributions_by_newsgroup %>%
group_by(newsgroup) %>%
top_n(10, abs(contribution)) %>%
ungroup() %>%
mutate(word = reorder(word, contribution)) %>%
ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
geom_col(show.legend = FALSE) +
facet_wrap(~ newsgroup, ncol = 4, scales = "free") +
coord_flip()
It looks like talk.politics.guns contains words about violence, which is obviously associated with a negative sentiment score. It also looks like my guess that misc.forsale contained words that hyped up the quality/condition of an item was true. Also, notice that comp.os.ms-windows had a highly positive sentiment score, however, the word contributing most to their sentiment score was “win,” which is probably an abbreviation of the OS Windows system.
There you have it, we successfully analysed Usenet text! We know which newsgroups had the most words in common, which words were used most frequently, which words had the highest tf-idf scores, how positive or negative each newsgroup is, which words contributed most to a newsgroups sentiment score, and the potential errors that occured in our sentiment analysis.