A start-to-finish analysis of Usenet text data

The Usenet data consists of a set of 20,000 messages sent to 20 Usenet bulletin boards in 1993.

The Usenet data is publicly available at: http://qwone.com/~jason/20Newsgroups/.

The Usenet bulletin boards in this dataset include newsgroups for topics like politics, religion, cars, sports, and cryptography. The text is diverse, rich, and includes many different authors. Thus, analysis of the text is interesting and applicable to current developments in machine learning.

Step 1: Pre-Processing our data

We will read in all the messages from the 20news-bydate folder. The 20news-bydate folder contains sub-folders. We will use the sub-folders to create a training set of data and a test set of data. Read in files with read_lines(), map(), and unnest().

We will be using these packages throughout the analysis.

library(dplyr)
library(purrr)
library(readr)
library(tidyr)
library(ggplot2)
library(stringr)
library(tidytext)
library(widyr)
library(igraph)
library(ggraph)
library(topicmodels)

Unzip the tar file with the untar() function:

untar("20news-bydate.tar.gz")

Create a training folder that contains the 20news-bydate-train data, make function that reads folders, and create a data frame to hold the title of the newsgroup, the message id, and the text that goes with it.

training_folder <- "20news-bydate-train/"

# Create a function to read all files from a folder into a data frame
read_folder <- function(infolder) {
  data_frame(file = dir(infolder, full.names = TRUE)) %>%
    mutate(text = map(file, read_lines)) %>%
    transmute(id = basename(file), text) %>%
    unnest(text)
}

# Use unnest() and map() to apply read_folder to each subfolder
(raw_text <- data_frame(folder = dir(training_folder, full.names = TRUE)) %>%
  unnest(map(folder, read_folder)) %>%
  transmute(newsgroup = basename(folder), id, text))

## # A tibble: 511,655 x 3
##      newsgroup    id
##          <chr> <chr>
##  1 alt.atheism 49960
##  2 alt.atheism 49960
##  3 alt.atheism 49960
##  4 alt.atheism 49960
##  5 alt.atheism 49960
##  6 alt.atheism 49960
##  7 alt.atheism 49960
##  8 alt.atheism 49960
##  9 alt.atheism 49960
## 10 alt.atheism 49960
## # ... with 511,645 more rows, and 1 more variables: text <chr>

The newsgroup column identifies which of the 20 newsgroups the article came from, the id column identifies a unique message within that newsgroup, and the text column contains the text associated with the newsgroup and id number (i.e. the message).

Here are the names of all 20 newsgroups and the total number of messages within each newsgroup.

raw_text %>%
  group_by(newsgroup) %>%
  summarise(messages = n_distinct(id)) %>%
  ggplot(aes(x = newsgroup, y = messages)) +
  geom_col() +
  coord_flip()

Next we will pre-process the text from the messages within the newsgroups (i.e. clean the data). All the messages include a header like this “from:” or “in_reply_to:” and may include email signatures which occur after “–”. Use cumsum() from dplyr and str_detect() from stringr to remove these lines of text.

# remove header and email sig text
cleaned_text <- raw_text %>%
  group_by(newsgroup, id) %>%
  filter(cumsum(text == "") > 0, 
         cumsum(str_detect(text, "^--")) == 0) %>%
  ungroup()
head(cleaned_text)

## # A tibble: 6 x 3
##     newsgroup    id                                text
##         <chr> <chr>                               <chr>
## 1 alt.atheism 49960                                    
## 2 alt.atheism 49960     Archive-name: atheism/resources
## 3 alt.atheism 49960 Alt-atheism-archive-name: resources
## 4 alt.atheism 49960     Last-modified: 11 December 1992
## 5 alt.atheism 49960                        Version: 1.0
## 6 alt.atheism 49960

Some text also included quotes by other writers (use regex), and messages 9704 and 9985 contain limited to no textual data so lets remove them too.

cleaned_text <- cleaned_text %>%
  filter(str_detect(text, "^[^>]+[A-Za-z\\d]") | text == "",
         !str_detect(text, "writes(:|\\.\\.\\.)$"),
         !str_detect(text, "^In article <"),
         !id %in% c(9704, 9985))

head(cleaned_text)

## # A tibble: 6 x 3
##     newsgroup    id                                text
##         <chr> <chr>                               <chr>
## 1 alt.atheism 49960                                    
## 2 alt.atheism 49960     Archive-name: atheism/resources
## 3 alt.atheism 49960 Alt-atheism-archive-name: resources
## 4 alt.atheism 49960     Last-modified: 11 December 1992
## 5 alt.atheism 49960                        Version: 1.0
## 6 alt.atheism 49960

Next we tokenize the data using unnest_tokens() to seperate the messages by each word. Then, we remove stop words from the data.

usenet_words <- cleaned_text %>%
  unnest_tokens(word, text) %>%
  filter(str_detect(word, "[a-z']$"),
         !word %in% stop_words$word)

head(usenet_words)

## # A tibble: 6 x 3
##     newsgroup    id      word
##         <chr> <chr>     <chr>
## 1 alt.atheism 49960   archive
## 2 alt.atheism 49960   atheism
## 3 alt.atheism 49960 resources
## 4 alt.atheism 49960       alt
## 5 alt.atheism 49960   atheism
## 6 alt.atheism 49960   archive

Exploring words in the newsgroups

What are the most common words in the newsgroups?

usenet_words %>%
  count(word, sort = TRUE)

## # A tibble: 68,137 x 2
##           word     n
##          <chr> <int>
##  1      people  3655
##  2        time  2705
##  3         god  1626
##  4      system  1595
##  5     program  1103
##  6         bit  1097
##  7 information  1094
##  8     windows  1088
##  9  government  1084
## 10       space  1072
## # ... with 68,127 more rows

What about the most common words in each newsgroup?

words_by_newsgroup <- usenet_words %>%
  group_by(newsgroup) %>%
  count(word, sort = TRUE) %>%
  ungroup()

head(words_by_newsgroup, n = 20)

## # A tibble: 20 x 3
##                   newsgroup       word     n
##                       <chr>      <chr> <int>
##  1   soc.religion.christian        god   917
##  2                sci.space      space   840
##  3    talk.politics.mideast     people   728
##  4                sci.crypt        key   704
##  5  comp.os.ms-windows.misc    windows   625
##  6    talk.politics.mideast   armenian   582
##  7                sci.crypt         db   549
##  8    talk.politics.mideast    turkish   514
##  9                rec.autos        car   509
## 10    talk.politics.mideast  armenians   509
## 11 comp.sys.ibm.pc.hardware       scsi   483
## 12   soc.religion.christian      jesus   440
## 13   soc.religion.christian     people   436
## 14    talk.politics.mideast     israel   428
## 15       talk.politics.guns        gun   425
## 16           comp.windows.x       file   414
## 17                sci.crypt encryption   410
## 18       talk.politics.misc     people   396
## 19 comp.sys.ibm.pc.hardware      drive   370
## 20         rec.sport.hockey       game   370

We know that each group talks about different topics and uses different words. But, do the some newsgroups use very different words when discussing topics than other newsgroups? We can use tf-idf to find out.Tf-idf is the term frequency inverse document frequency. We use tf-idf to identify words that are important to a document. For example, many use groups use the word “are” but in this context because are is used many times by every newsgroup it isn’t very helpful to know that “are” is a common word in each newsgroup. I won’t elaborate on the math behind tf-idf because it isn’t important to this analysis. With that said, all you need to know and accept is that tf-idf identifies important words in each newsgroup.

tf_idf_all <- words_by_newsgroup %>%
  bind_tf_idf(word, newsgroup, n) %>%
  arrange(desc(tf-idf))

tf_idf_all

## # A tibble: 173,913 x 6
##                   newsgroup   word     n          tf   idf tf_idf
##                       <chr>  <chr> <int>       <dbl> <dbl>  <dbl>
##  1   soc.religion.christian    god   917 0.017054754     0      0
##  2                sci.space  space   840 0.016093496     0      0
##  3                sci.crypt    key   704 0.014013018     0      0
##  4 comp.sys.ibm.pc.hardware  drive   370 0.013495277     0      0
##  5       talk.politics.misc people   396 0.010432034     0      0
##  6         rec.sport.hockey   game   370 0.010361534     0      0
##  7    talk.politics.mideast people   728 0.010068042     0      0
##  8              alt.atheism people   276 0.010015967     0      0
##  9       talk.religion.misc    god   296 0.009877202     0      0
## 10              alt.atheism    god   268 0.009725650     0      0
## # ... with 173,903 more rows

We can also view the highest ranking tf-idf words for a specific topic (i.e. talk.something.something).

(talk_tf_idf <- tf_idf_all %>%
  filter(str_detect(newsgroup, "^talk\\.")) %>%
  group_by(newsgroup) %>%
  top_n(10, tf_idf) %>%
  ungroup() %>%
  mutate(word = reorder(word, tf_idf)))

## # A tibble: 40 x 6
##             newsgroup      word     n          tf       idf      tf_idf
##                 <chr>    <fctr> <int>       <dbl>     <dbl>       <dbl>
##  1 talk.politics.guns       gun   425 0.010172575 0.2231436 0.002269944
##  2 talk.politics.misc president   295 0.007771338 0.2231436 0.001734124
##  3 talk.politics.guns      guns   209 0.005002513 0.5978370 0.002990688
##  4 talk.politics.guns   weapons   159 0.003805740 0.5978370 0.002275212
##  5 talk.religion.misc     bible   158 0.005272290 0.6931472 0.003654473
##  6 talk.politics.misc  congress    96 0.002528978 0.6931472 0.001752954
##  7 talk.religion.misc     jesus   256 0.008542445 0.7985077 0.006821208
##  8 talk.religion.misc        ra    96 0.003203417 0.7985077 0.002557953
##  9 talk.politics.guns    weapon   128 0.003063740 0.7985077 0.002446420
## 10 talk.politics.misc secretary    79 0.002081138 0.7985077 0.001661805
## # ... with 30 more rows

# and visualize
ggplot(talk_tf_idf, aes(x = word, y = tf_idf, fill = newsgroup)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, scales = "free") +
  ylab("tf-idf") +
  coord_flip()

As one could guess, within each newsgroup and topic the highest ranked tf-idf words accurately represent the topic of the newsgroup.

Here is another interesting question: Which newsgroups were similar to each other in terms of text content?

We can use the pairwise_cor() function to find an answer.

(newsgroup_word_cors <- words_by_newsgroup %>%
   pairwise_cor(newsgroup, word, n, sort = TRUE))

## # A tibble: 380 x 3
##                       item1                    item2 correlation
##                       <chr>                    <chr>       <dbl>
##  1       talk.religion.misc   soc.religion.christian   0.8347275
##  2   soc.religion.christian       talk.religion.misc   0.8347275
##  3              alt.atheism       talk.religion.misc   0.7793079
##  4       talk.religion.misc              alt.atheism   0.7793079
##  5              alt.atheism   soc.religion.christian   0.7510723
##  6   soc.religion.christian              alt.atheism   0.7510723
##  7    comp.sys.mac.hardware comp.sys.ibm.pc.hardware   0.6799043
##  8 comp.sys.ibm.pc.hardware    comp.sys.mac.hardware   0.6799043
##  9       rec.sport.baseball         rec.sport.hockey   0.5770378
## 10         rec.sport.hockey       rec.sport.baseball   0.5770378
## # ... with 370 more rows

The newsgroups who most often used the same words are talk.religion.misc and soc.religion.christian. These two newsgroups use the same words about 83% of the time.

Next, we will make a network to that displays strong correlations between newsgroups (i.e. newsgroups that frequently use the same words).

set.seed(1234)

newsgroup_word_cors %>%
  filter(correlation > 0.4) %>%
  graph_from_data_frame() %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = correlation, width = correlation)) +
  geom_node_point(size = 6, color = "lightblue") +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

The network suggest there are four main topics: computers/electronics, sports/vehicles, politics, and religion.

Topic Modeling

Now we will apply latent Dirichlet allocation (LDA) to our newsgroup data. Will will use LDA methods to randomly guess which newsgroup each message belongs to.

First we make a Document-Term Matrix by using the cast_dtm() function from the topicmodels package.

# group by word and add a word total column for each word in each document. Include words used more than 50 times
words_count_newsgroup <- usenet_words %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup() %>%
  filter(word_total > 50)

words_count_newsgroup %>%
  arrange(desc(word_total))

## # A tibble: 424,097 x 4
##      newsgroup    id   word word_total
##          <chr> <chr>  <chr>      <int>
##  1 alt.atheism 49960 people       3655
##  2 alt.atheism 49960 people       3655
##  3 alt.atheism 49960 people       3655
##  4 alt.atheism 49960 people       3655
##  5 alt.atheism 49960 people       3655
##  6 alt.atheism 49960 people       3655
##  7 alt.atheism 51122 people       3655
##  8 alt.atheism 51124 people       3655
##  9 alt.atheism 51124 people       3655
## 10 alt.atheism 51131 people       3655
## # ... with 424,087 more rows

# df containing only words and their totals that belong to the four science related newsgroups
(word_sci_newsgroups <- usenet_words %>%
  filter(str_detect(newsgroup, "^sci")) %>%
  group_by(word) %>%
  mutate(word_total = n()) %>%
  ungroup() %>%
  filter(word_total > 50))

## # A tibble: 57,306 x 4
##    newsgroup    id    word word_total
##        <chr> <chr>   <chr>      <int>
##  1 sci.crypt 14147 archive         80
##  2 sci.crypt 14147     faq        173
##  3 sci.crypt 14147     sun         93
##  4 sci.crypt 14147 posting        101
##  5 sci.crypt 14831 archive         80
##  6 sci.crypt 14831     faq        173
##  7 sci.crypt 14831 posting        101
##  8 sci.crypt 14832 archive         80
##  9 sci.crypt 14982     faq        173
## 10 sci.crypt 14982    file        106
## # ... with 57,296 more rows

# DTM of the words_count_newsgroup
(dtm_all_newsgroups <- words_count_newsgroup %>%
    unite(document, newsgroup, id, sep = "_") %>%
    count(document, word, sort = TRUE) %>%
    cast_dtm(document, word, n))

## <<DocumentTermMatrix (documents: 11117, terms: 2757)>>
## Non-/sparse entries: 292189/30357380
## Sparsity           : 99%
## Maximal term length: 15
## Weighting          : term frequency (tf)

# convert to an LDA model. A quick note, this will take about 1-2 minutes to complete. 
(lda_all_newsgroups <- LDA(dtm_all_newsgroups, k = 4, control = list(seed = 1234)))

## A LDA_VEM topic model with 4 topics.

What topics did the LDA model predict? Do the topics match the four main topics we identified earlier (i.e. computers/electronics, sports/vehicles, politics, and religion)? Before we answer that question, we should get a feel for what words were assigned to each topic. Let’s check!

(newsgroup_topics <- tidy(lda_all_newsgroups, matrix = "beta"))

## # A tibble: 11,028 x 3
##    topic   term          beta
##    <int>  <chr>         <dbl>
##  1     1     db  1.095177e-16
##  2     2     db  1.135525e-28
##  3     3     db  7.293096e-03
##  4     4     db  6.922673e-56
##  5     1    mov  6.620114e-47
##  6     2    mov 3.759340e-104
##  7     3    mov  1.678378e-03
##  8     4    mov 6.142452e-102
##  9     1 output  3.797681e-03
## 10     2 output  1.426611e-30
## # ... with 11,018 more rows

# top 10 words assigned to each topic
(top_words_by_topics <- newsgroup_topics %>%
  group_by(topic) %>%
  top_n(10, beta) %>%
  arrange(topic, -beta) %>%
  ungroup() %>%
  mutate(term = factor(term, rev(unique(term)))))

## # A tibble: 40 x 3
##    topic     term        beta
##    <int>   <fctr>       <dbl>
##  1     1  windows 0.008011024
##  2     1   system 0.007851254
##  3     1     file 0.007506175
##  4     1     mail 0.006725197
##  5     1    drive 0.006720866
##  6     1  program 0.006295376
##  7     1      bit 0.005860197
##  8     1      key 0.005527687
##  9     1 software 0.005521186
## 10     1     card 0.005349582
## # ... with 30 more rows

top_words_by_topics %>%
  ggplot(aes(x = term, y = beta, fill = as.factor(topic))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ topic, ncol = 2, scales = "free") +
  ylab("beta") +
  coord_flip()

It looks like our LDA model did a pretty good job identifying the four main topics. Based of the barcharts, I think topic 1 is the computers/electronics topics, topic 2 is the religion topic, topic 3 is the sports/vehicles topic, and topic 4 is the politics topic. What do you think?

Let’s see how accurate the LDA topics are by viewing the documents assigned to each topic.

(documents_by_topic <- lda_all_newsgroups %>%
   tidy(matrix = "gamma"))

## # A tibble: 44,468 x 3
##                       document topic        gamma
##                          <chr> <int>        <dbl>
##  1             sci.crypt_14991     1 2.947834e-02
##  2        comp.windows.x_64831     1 9.998396e-01
##  3        comp.windows.x_66409     1 9.998441e-01
##  4       sci.electronics_53569     1 8.265134e-01
##  5 talk.politics.mideast_76013     1 9.594500e-05
##  6        comp.windows.x_64830     1 9.892861e-01
##  7        comp.windows.x_66408     1 9.886978e-01
##  8           alt.atheism_53519     1 6.718424e-05
##  9 comp.sys.mac.hardware_51892     1 9.975757e-01
## 10               sci.med_58766     1 1.473460e-02
## # ... with 44,458 more rows

# top 10 docs associated with each topic
(top_docs_by_topic <- documents_by_topic %>%
    group_by(topic) %>%
    top_n(10, gamma) %>%
    arrange(topic, -gamma) %>%
    ungroup() %>%
    mutate(document = factor(document, rev(unique(document))),
           topic = factor(topic, 
                          levels = c("comp/electronics", "religion", "sports/vehicles", "politics"))))

## # A tibble: 40 x 3
##                          document  topic     gamma
##                            <fctr> <fctr>     <dbl>
##  1           comp.windows.x_66409     NA 0.9998441
##  2           comp.windows.x_64831     NA 0.9998396
##  3            comp.graphics_38682     NA 0.9993638
##  4            comp.graphics_38726     NA 0.9984200
##  5 comp.sys.ibm.pc.hardware_60143     NA 0.9981591
##  6 comp.sys.ibm.pc.hardware_60392     NA 0.9981183
##  7 comp.sys.ibm.pc.hardware_60275     NA 0.9979262
##  8   comp.os.ms-windows.misc_9741     NA 0.9978215
##  9 comp.sys.ibm.pc.hardware_60381     NA 0.9976434
## 10    comp.sys.mac.hardware_51892     NA 0.9975757
## # ... with 30 more rows

# barchart of the assignment of docs for each topic
documents_by_topic %>%
  separate(document, c("newsgroup", "id"), sep = "_", convert = TRUE) %>%
  mutate(newsgroup = reorder(newsgroup, topic * gamma)) %>%
  ggplot(aes(x = as.factor(topic), y = gamma, fill = as.factor(topic))) +
  geom_boxplot() +
  facet_wrap(~ newsgroup) +
  labs(x = "Topic",
       y = "# of messages where this document was the highest % of topic")

This plot shows that not every document was easily assigned to a topic. This may lead us to increase our k-value in our LDA model. However, I believe a k-value of 4 accurately reflects the most frequent topics discussed in each newsgroup.

Sentiment Analysis

Which newsgroups tended to use positive words? Which newsgroups tended to us negative words? Lets find out!

First will will load the AFINN sentiment lexicon, which provides a positve rating score for words. Each word is assigned a score ranging between -5 (negative), 0 (neutral), and 5 (positive).

(newsgroup_scores <- words_by_newsgroup %>%
   inner_join(get_sentiments("afinn"), by = "word") %>%
   group_by(newsgroup) %>%
   summarise(score = sum(score * n) / sum(n)))

## # A tibble: 20 x 2
##                   newsgroup       score
##                       <chr>       <dbl>
##  1              alt.atheism -0.20900322
##  2            comp.graphics  0.52917666
##  3  comp.os.ms-windows.misc  0.45899172
##  4 comp.sys.ibm.pc.hardware  0.15653221
##  5    comp.sys.mac.hardware  0.25963489
##  6           comp.windows.x  0.23649010
##  7             misc.forsale  0.67283494
##  8                rec.autos -0.24139514
##  9          rec.motorcycles -0.19439421
## 10       rec.sport.baseball  0.22085890
## 11         rec.sport.hockey  0.25540346
## 12                sci.crypt -0.06614405
## 13          sci.electronics  0.08923077
## 14                  sci.med -0.29604079
## 15                sci.space  0.18295652
## 16   soc.religion.christian  0.10290170
## 17       talk.politics.guns -0.83101278
## 18    talk.politics.mideast -0.80498950
## 19       talk.politics.misc -0.20904218
## 20       talk.religion.misc -0.08766152

newsgroup_scores %>%
  mutate(newsgroup = reorder(newsgroup, score)) %>%
  ggplot(aes(x = newsgroup, y = score, fill = score > 0)) +
  geom_col(show.legend = FALSE) +
  coord_flip() +
  ylab("Average sentiment score")

Interestingly enough, the misc.forsale newsgroup received the highest sentiment score. I wonder if misc.forsale has a lot of positive words because people tend to use words to show that an item is in good condition? Also, I wonder why the talk.politics.guns newsgroup has the most negative sentiment score? Let’s find out by looking at words that impacted sentiment score most.

(contributions <- usenet_words %>%
  inner_join(get_sentiments("afinn")) %>%
  group_by(word) %>%
  summarise(occurances = n(),
            contribution = sum(score)))

## Joining, by = "word"

## # A tibble: 1,909 x 3
##         word occurances contribution
##        <chr>      <int>        <int>
##  1   abandon         13          -26
##  2 abandoned         19          -38
##  3  abandons          3           -6
##  4 abduction          2           -4
##  5     abhor          4          -12
##  6  abhorred          1           -3
##  7 abhorrent          2           -6
##  8 abilities         16           32
##  9   ability        177          354
## 10    aboard          8            8
## # ... with 1,899 more rows

contributions %>%
  top_n(20, abs(contribution)) %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  ylab("Words that had the most impact on the sentiment score") +
  coord_flip()

Now, let’s take a look at the top 10 words by contribution to sentiment score within each newsgroup.

%>% top_n(10, contribution) %>% ggplot(aes(x = word, y = contribution, fill = contribution > 0)) + geom_col(show.legend = FALSE) + facet_wrap(~ newsgroup, ncol = 4, scales = “free”) + ylab(“Words within each newsgroup that contributed most to their overal sentiment score”) + coord_flip()

(top_contributions_by_newsgroup <- words_by_newsgroup %>%
   inner_join(get_sentiments("afinn"), by = "word") %>%
   mutate(contribution = score * n / sum(n)))

## # A tibble: 13,063 x 5
##                 newsgroup   word     n score contribution
##                     <chr>  <chr> <int> <int>        <dbl>
##  1 soc.religion.christian    god   917     1  0.014418012
##  2 soc.religion.christian  jesus   440     1  0.006918130
##  3     talk.politics.guns    gun   425    -1 -0.006682285
##  4     talk.religion.misc    god   296     1  0.004654015
##  5            alt.atheism    god   268     1  0.004213770
##  6 soc.religion.christian  faith   257     1  0.004040817
##  7     talk.religion.misc  jesus   256     1  0.004025094
##  8  talk.politics.mideast killed   202    -3 -0.009528152
##  9  talk.politics.mideast    war   187    -2 -0.005880411
## 10 soc.religion.christian   true   179     2  0.005628842
## # ... with 13,053 more rows

top_contributions_by_newsgroup %>%
  group_by(newsgroup) %>%
  top_n(10, abs(contribution)) %>%
  ungroup() %>%
  mutate(word = reorder(word, contribution)) %>%
  ggplot(aes(x = word, y = contribution, fill = contribution > 0)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~ newsgroup, ncol = 4, scales = "free") +
  coord_flip()

It looks like talk.politics.guns contains words about violence, which is obviously associated with a negative sentiment score. It also looks like my guess that misc.forsale contained words that hyped up the quality/condition of an item was true. Also, notice that comp.os.ms-windows had a highly positive sentiment score, however, the word contributing most to their sentiment score was “win,” which is probably an abbreviation of the OS Windows system.

Summary

There you have it, we successfully analysed Usenet text! We know which newsgroups had the most words in common, which words were used most frequently, which words had the highest tf-idf scores, how positive or negative each newsgroup is, which words contributed most to a newsgroups sentiment score, and the potential errors that occured in our sentiment analysis.

Usenet text Analysis

Kai Middlebrook

July 27, 2017