Milestone Report

Text Mining Exploratory Analysis

The following analysis relies on the tidytext package, which uses the tokenizers package for tokenization and can interoperate with tm objects, as well as dplyr, tidyr, and other tidyverse packages.

Blogs Dataset

The blogs, Twitter, and news datasets have been downloaded and transformed into tibbles for processing with tidytext. Because of the sizes of the datasets, 50,000 observations were randomly selected from each for analysis. Running the same word-frequency and bigram-frequency analysis on the entire blogs dataset produced nearly identical results, so a sample of 50,000 observations appears to be sufficient.
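One way to make the sample-versus-full comparison concrete is to measure how much the top-k word lists overlap. The sketch below is illustrative only: the helper `top_k_overlap`, the seed value, and the toy vocabulary are assumptions rather than part of the original analysis, and both inputs are assumed to have `word` and `n` columns as produced by `count()`.

```r
library(dplyr)

set.seed(2024)  # hypothetical seed; fixing it makes the random sample reproducible

# Share of the k most frequent words in the full data that are also among
# the k most frequent words in the sample (1 means identical top-k sets).
top_k_overlap <- function(full_counts, sample_counts, k = 20) {
  top_full <- full_counts %>%
    slice_max(n, n = k, with_ties = FALSE) %>%
    pull(word)
  top_sample <- sample_counts %>%
    slice_max(n, n = k, with_ties = FALSE) %>%
    pull(word)
  length(intersect(top_full, top_sample)) / length(top_full)
}

# Toy stand-in for tokenized data (hypothetical vocabulary and probabilities)
vocab <- c("love", "time", "day", "people", "life")
full_tokens <- tibble(word = sample(vocab, 10000, replace = TRUE,
                                    prob = c(0.40, 0.30, 0.15, 0.10, 0.05)))
sample_tokens <- sample_n(full_tokens, 1000)

overlap <- top_k_overlap(count(full_tokens, word),
                         count(sample_tokens, word),
                         k = 3)
```

An overlap close to 1 would support the claim that the sample reproduces the most frequent words of the full dataset.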

The exploratory analysis begins with the blogs dataset.

library(tidytext)
library(dplyr)
library(ggplot2)
library(stringr)
library(tidyr)

# Set working directory
setwd("~/Capstone")

# Read in blogs data
blogs_df <- readRDS('./blogsdata.rds') %>%
    sample_n(50000)

The tidytext function unnest_tokens is used to create tokens. This function also converts every word to lower case and removes punctuation. To get a sense of the meaningful words used, stop words are removed. Many non-English words with accent marks appeared among the most common words, so the dataset was filtered for words that contain only the letters A-Z. Numbers also appeared among the common words, so the dataset was filtered to remove words that contain numbers.
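To see what the two character filters do, the following small sketch applies them to a hand-made vector of tokens (the tokens are made up for illustration). It suggests that the letters-only pattern already excludes any token containing a digit, so the separate digit filter is a harmless redundancy.

```r
library(stringr)

# Hypothetical tokens: plain words, an accented word, numbers, an apostrophe
words <- c("love", "caf\u00e9", "2016", "b2b", "o'clock", "time")

has_digit    <- str_detect(words, "[0-9]")        # TRUE for "2016" and "b2b"
letters_only <- str_detect(words, "^[a-zA-Z]+$")  # TRUE only for unaccented A-Z words

kept_both    <- words[!has_digit & letters_only]  # both filters, as in the report
kept_letters <- words[letters_only]               # letters-only filter alone

same_result <- identical(kept_both, kept_letters)
```

Note that the letters-only filter also drops tokens containing an apostrophe, which may or may not be desirable later on.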

# Tokenize by words
blogs_tokens <- blogs_df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    filter(!str_detect(word, "[0-9]"), # remove words that contain numbers
           str_detect(word, "^[a-zA-Z]+$")) # retain words that contain letters only

## Joining, by = "word"

# Get counts
blogs_count <- blogs_tokens %>%
    count(word, sort = TRUE) %>%
    mutate(word = reorder(word, n))

Exploratory Plots

A basic histogram of word frequencies shows that a large number of words appear only a few times, while very few words appear many times. This is as expected.
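The histogram described above might be produced along the following lines. This is a minimal sketch using a toy table in the same shape as blogs_count (a `word` column and a count column `n`). The toy values, the log-scaled axis, and the bin count are assumptions, not the settings used for the report.

```r
library(dplyr)
library(ggplot2)

# Toy counts standing in for blogs_count: a few frequent words, many rare ones
toy_count <- tibble(
  word = paste0("w", 1:200),
  n    = c(500, 200, 90, rep(5, 17), rep(1, 180))
)

# Share of words that occur exactly once in the toy data
share_once <- mean(toy_count$n == 1)

# Histogram of per-word counts; the log scale keeps the long tail readable
p <- ggplot(toy_count, aes(n)) +
    geom_histogram(bins = 30) +
    scale_x_log10() +
    xlab("Occurrences per word (log scale)") +
    ylab("Number of words")
```

Replacing `toy_count` with `blogs_count` should give the corresponding plot for the blogs sample.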

Below is a plot of the 20 most frequently used words. Consistent with the results of the quiz from the first week, the word love is the fourth most common word, whereas hate is not in the top twenty.

Bigrams

The same unnest_tokens function is used to create bigrams. Bigrams are then split into two words, and the same filters on numbers and letters are applied to each word of the bigram, before reuniting the words as bigrams, in order to filter for bigrams that contain only meaningful words.

# Bigram Counts and Graphs
blogs_bigrams <- blogs_df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) 

# Filter bigrams for meaningful words
bbigrams_filtered <- blogs_bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !str_detect(word1, "[0-9]"),
           str_detect(word1, "^[a-zA-Z]+$"),
           !word2 %in% stop_words$word,
           !str_detect(word2, "[0-9]"),
           str_detect(word2, "^[a-zA-Z]+$")) 

# Get counts of most common bigrams
bbigrams_counts <- bbigrams_filtered %>%
    count(word1, word2, sort = TRUE)

# Unite filtered bigrams
bbigrams_united <- bbigrams_filtered %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE) %>%
    mutate(bigram = reorder(bigram, n))

The most common bigrams are seen in the graph below. Several city names, religious terms, and time references appear on the list, as does the self-referential term “blog post”.

ggplot(data = bbigrams_united[1:20, ], aes(bigram, n, fill = bigram)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    ylab("Frequency of Bigrams") +
    xlab(NULL) +
    coord_flip()

Trigrams

The same unnest_tokens function is used to create trigrams. Trigrams are then split into three words, and the same filters on numbers and letters are applied to each word of the trigram, before reuniting the words as trigrams, in order to filter for trigrams that contain only meaningful words.
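Because the split, filter, and reunite steps are the same for bigrams and trigrams, they could be wrapped in one helper. The sketch below is one possible way to do that. The function name `clean_ngrams`, the convention that the input column is called `ngram`, and the toy stop-word vector are all assumptions. With tidytext loaded, `stop_words$word` could be passed as `stop_vec`.

```r
library(dplyr)
library(tidyr)

# Split each n-gram into n words, keep rows where every word is a
# non-stop word made of letters only, then reunite and count.
clean_ngrams <- function(ngrams, n, stop_vec) {
  word_cols <- paste0("word", seq_len(n))
  ngrams %>%
    separate(ngram, word_cols, sep = " ") %>%
    filter(if_all(all_of(word_cols),
                  ~ !.x %in% stop_vec & grepl("^[a-zA-Z]+$", .x))) %>%
    unite(ngram, all_of(word_cols), sep = " ") %>%
    count(ngram, sort = TRUE)
}

# Toy input (hypothetical bigrams and stop words)
toy_bigrams <- tibble(ngram = c("new york", "new york", "of the",
                                "blog post", "route 66", "san jos\u00e9"))
toy_stop <- c("of", "the")

result <- clean_ngrams(toy_bigrams, n = 2, stop_vec = toy_stop)
```

The same call with `n = 3` would handle trigrams, which avoids repeating the filter conditions once per word.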

# Trigram Counts and Graphs
blogs_trigrams <- blogs_df %>%
    unnest_tokens(trigram, text, token = "ngrams", n = 3) 

btrigrams_filtered <- blogs_trigrams %>%
    separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !str_detect(word1, "[0-9]"),
           str_detect(word1, "^[a-zA-Z]+$"),
           !word2 %in% stop_words$word,
           !str_detect(word2, "[0-9]"),
           str_detect(word2, "^[a-zA-Z]+$"),
           !word3 %in% stop_words$word,
           !str_detect(word3, "[0-9]"),
           str_detect(word3, "^[a-zA-Z]+$")) 

btrigrams_counts <- btrigrams_filtered %>%
    count(word1, word2, word3, sort = TRUE)

btrigrams_united <- btrigrams_filtered %>%
    unite(trigram, word1, word2, word3, sep = " ") %>%
    count(trigram, sort = TRUE) %>%
    mutate(trigram = reorder(trigram, n))

The most common trigrams are seen in the graph below. The terms on this graph make less sense to me than do those on previous graphs. The frequency counts are much lower here, with none appearing more than 25 times. One hypothesis is that stop words are so common that it’s somewhat difficult to find regularly occurring trigrams that do not contain stop words. In generating the predictive algorithm later on in the course, including stop words should yield more logical predictions.
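The stop-word hypothesis could be checked by measuring what share of trigrams contain at least one stop word, and by counting trigrams without the stop-word filter. The following is a rough sketch on made-up data. The toy trigrams and the short stop-word vector are assumptions standing in for blogs_trigrams and `stop_words$word`.

```r
library(dplyr)
library(tidyr)

# Toy trigrams and stop words (hypothetical)
toy_trigrams <- tibble(trigram = c("one of the", "one of the", "a lot of",
                                   "new york city", "thanks for the"))
toy_stop <- c("one", "of", "the", "a", "for")

# Flag trigrams in which any of the three words is a stop word
flagged <- toy_trigrams %>%
  separate(trigram, c("word1", "word2", "word3"), sep = " ", remove = FALSE) %>%
  mutate(has_stop = word1 %in% toy_stop |
                    word2 %in% toy_stop |
                    word3 %in% toy_stop)

share_with_stop <- mean(flagged$has_stop)

# Counts with stop words retained, as a prediction model would likely need
top_unfiltered <- count(toy_trigrams, trigram, sort = TRUE)
```

A high share on the real data would explain why so few trigrams survive the filter and why their counts are low.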

ggplot(data = btrigrams_united[1:20, ], aes(trigram, n, fill = trigram)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    ylab("Frequency of Trigrams") +
    xlab(NULL) +
    coord_flip()

Twitter Dataset

For the Twitter dataset, the same analysis is conducted. Some of the most frequent words tweeted also appeared in the list of most frequent blog post words, such as love and people. Time-related words are common once again, such as day, tonight, night, tomorrow, and week. Twitter-specific words such as “rt”, “twitter”, and “follow” also appear.

# Read in Twitter data
twitter_df <- readRDS('./twitterdata.rds') %>%
    sample_n(50000)

# Tokenize by words
twitter_tokens <- twitter_df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    filter(!str_detect(word, "[0-9]"), # remove numbers
           str_detect(word, "^[a-zA-Z]+$")) # only letters

## Joining, by = "word"

# Get counts
twitter_count <- twitter_tokens %>%
    count(word, sort = TRUE) %>%
    mutate(word = reorder(word, n))

# Plot counts for twenty most frequent words
ggplot(data = twitter_count[1:20, ], aes(word, n, fill = word)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    xlab(NULL) +
    ylab("Word Frequency") +
    coord_flip()

Twitter Bigrams

The most common Twitter bigrams are shown below. Again, there are some in common with the blogs dataset, such as “social media”. Moreover, as with the blogs dataset, the most common bigrams appear to be positive words, though a more thorough sentiment analysis was not performed.

# Bigram Counts and Graphs
twitter_bigrams <- twitter_df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) 

tbigrams_filtered <- twitter_bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !str_detect(word1, "[0-9]"),
           str_detect(word1, "^[a-zA-Z]+$"),
           !word2 %in% stop_words$word,
           !str_detect(word2, "[0-9]"),
           str_detect(word2, "^[a-zA-Z]+$")) 

tbigrams_counts <- tbigrams_filtered %>%
    count(word1, word2, sort = TRUE)

# Same results as above - could be useful format down the road
tbigrams_united <- tbigrams_filtered %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE) %>%
    mutate(bigram = reorder(bigram, n))

ggplot(data = tbigrams_united[1:20, ], aes(bigram, n, fill = bigram)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    ylab("Frequency of Bigrams") +
    xlab(NULL) +
    coord_flip()

News Dataset

For the News dataset, the same analysis is conducted. Again, some words appeared in the blogs and/or twitter datasets. Words that refer to time such as time, day, and week are common. Game, season, team, and play suggest there are many sports stories in this dataset. Police, public, and president are words one would expect to appear frequently in news stories.

# Read in news data
news_df <- readRDS('./newsdata.rds') %>%
    sample_n(50000)


# Tokenize by words
news_tokens <- news_df %>%
    unnest_tokens(word, text) %>%
    anti_join(stop_words) %>% # remove stop words
    filter(!str_detect(word, "[0-9]"), # remove numbers
           str_detect(word, "^[a-zA-Z]+$")) # only letters

## Joining, by = "word"

# Get counts
news_count <- news_tokens %>%
    count(word, sort = TRUE) %>%
    mutate(word = reorder(word, n))

# Plot counts for twenty most frequent words
ggplot(data = news_count[1:20, ], aes(word, n, fill = word)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    xlab(NULL) +
    ylab("Word Frequency") +
    coord_flip()

News Bigrams

The most common bigrams appearing in news stories are city names. Political terms and references are also common. Although sports words were common, it appears there are more political bigrams than sports bigrams. The only sports-related bigram in the top twenty is “regular season”, suggesting there may be fewer common two-word sports phrases.

# Bigram Counts and Graphs
news_bigrams <- news_df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) 

nbigrams_filtered <- news_bigrams %>%
    separate(bigram, c("word1", "word2"), sep = " ") %>%
    filter(!word1 %in% stop_words$word,
           !str_detect(word1, "[0-9]"),
           str_detect(word1, "^[a-zA-Z]+$"),
           !word2 %in% stop_words$word,
           !str_detect(word2, "[0-9]"),
           str_detect(word2, "^[a-zA-Z]+$")) 

nbigrams_counts <- nbigrams_filtered %>%
    count(word1, word2, sort = TRUE)

# Same results as above - could be useful format down the road
nbigrams_united <- nbigrams_filtered %>%
    unite(bigram, word1, word2, sep = " ") %>%
    count(bigram, sort = TRUE) %>%
    mutate(bigram = reorder(bigram, n))

ggplot(data = nbigrams_united[1:20, ], aes(bigram, n, fill = bigram)) + 
    geom_bar(stat = "identity") +
    theme(legend.position = "none") + # "none" is the documented value for hiding the legend
    ylab("Frequency of Bigrams") +
    xlab(NULL) +
    coord_flip()

Way Forward

The next step is to develop an algorithm that will read one or two words entered by a user and predict the following word. The approach I’m considering now is essentially storing bigrams and trigrams in data frames and filtering for the user-supplied words. Something I’ll need to think carefully about is the treatment of stop words.
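The lookup idea described above might be sketched as follows. This is only one possible shape for it, not a settled design. The function `predict_next`, the fallback from trigrams to bigrams, and the toy count tables are assumptions. The tables mimic the layout of bbigrams_counts and btrigrams_counts (one column per word plus `n`).

```r
library(dplyr)

# Toy count tables (hypothetical values)
bigram_counts <- tibble(
  word1 = c("new", "new", "blog", "york"),
  word2 = c("york", "year", "post", "city"),
  n     = c(10, 4, 6, 5)
)
trigram_counts <- tibble(
  word1 = c("new", "new"),
  word2 = c("york", "york"),
  word3 = c("city", "times"),
  n     = c(5, 3)
)

# Predict the next word: try the last two words against the trigram table,
# then fall back to the last word against the bigram table.
predict_next <- function(input, bigram_counts, trigram_counts) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  k <- length(words)
  if (k >= 2) {
    hit <- trigram_counts %>%
      filter(word1 == words[k - 1], word2 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word3[1])
  }
  if (k >= 1) {
    hit <- bigram_counts %>%
      filter(word1 == words[k]) %>%
      arrange(desc(n))
    if (nrow(hit) > 0) return(hit$word2[1])
  }
  NA_character_  # no match at either level
}

pred_two  <- predict_next("New York", bigram_counts, trigram_counts)
pred_one  <- predict_next("blog", bigram_counts, trigram_counts)
pred_back <- predict_next("happy new", bigram_counts, trigram_counts)
pred_none <- predict_next("zebra", bigram_counts, trigram_counts)
```

For this lookup to handle everyday input, the count tables would probably need to be built with stop words retained, since user-entered phrases will usually contain them.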