This report presents findings and observations about the three text data sets to be used for this project. The end goal is a Shiny app that will take a word or phrase as input and return a likely next word for the writer. The data, provided by the course, are collected from various news sources, blog posts, and Twitter. They are given as three text files, which we will download, unzip, and read into R after loading the requisite libraries.
# Load packages ----
library(here)
library(quanteda)
library(readtext)
library(beepr)
library(dplyr)
library(tidyr)
library(ngram)
library(ggplot2)
library(cowplot)
The major work of the text analysis is performed with the quanteda, readtext, and ngram packages. The here package is for reproducibility; learn more here.
# Data downloading and extraction ----
dataURL <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
# If the data isn't here, get it.
if(!file.exists("data.zip")){
  download.file(dataURL, "data.zip")
}
# If the data is zipped, unzip it.
if(!dir.exists("final")){
  unzip("data.zip")
}
# Data loading ----
news_text <- readtext(here("final/en_US/en_US.news.txt"))
twit_text <- readtext(here("final/en_US/en_US.twitter.txt"))
blog_text <- readtext(here("final/en_US/en_US.blogs.txt"))
At this point the data are loaded into R. We prepare the data by giving it a common format: making all letters lower case, removing punctuation, and removing numerals. Future work may be able to make use of some of this lost information, but much of it confounds prediction rather than helping it. We also split each source into a character vector with one element per line of text for further analysis, after which we note the number of lines in each source document.
news_text$text <- preprocess(
  news_text$text,
  case = "lower",
  remove.punct = TRUE,
  remove.numbers = TRUE,
  fix.spacing = TRUE
)
news_lines <- news_text$text %>%
  strsplit("\n") %>%
  unlist()
twit_text$text <- preprocess(
  twit_text$text,
  case = "lower",
  remove.punct = TRUE,
  remove.numbers = TRUE,
  fix.spacing = TRUE
)
twit_lines <- twit_text$text %>%
  strsplit("\n") %>%
  unlist()
blog_text$text <- preprocess(
  blog_text$text,
  case = "lower",
  remove.punct = TRUE,
  remove.numbers = TRUE,
  fix.spacing = TRUE
)
blog_lines <- blog_text$text %>%
  strsplit("\n") %>%
  unlist()
The ngram package has functions to generate ngrams and phrasetables. An ngram is a phrase composed of ‘n’ words; for example, “king of the hill” and “can have the best” are 4grams. A phrasetable is an R object created by the ngram package that gives the frequencies and proportions of each ngram. To begin, we will generate the phrasetable for unigrams or 1grams, that is, individual words.
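As a quick illustration of what a phrasetable looks like (the toy string below is made up and is not part of the data), the following returns a data frame with ngrams, freq, and prop columns:
# Toy illustration only: 2grams from a short made-up string
toy_gram <- ngram("the cat sat on the mat and the cat slept", n = 2)
head(get.phrasetable(toy_gram))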
# Create ngrams and phrasetables----
word_count_vec <- function(input_string_vec){
  sapply(input_string_vec, wordcount)
}
news_gram1 <- ngram(news_lines[word_count_vec(news_lines) >= 1], n = 1)
news_phrasetable1 <- get.phrasetable(news_gram1)
twit_gram1 <- ngram(twit_lines[word_count_vec(twit_lines) >= 1], n = 1)
twit_phrasetable1 <- get.phrasetable(twit_gram1)
blog_gram1 <- ngram(blog_lines[word_count_vec(blog_lines) >= 1], n = 1)
blog_phrasetable1 <- get.phrasetable(blog_gram1)
We are now ready to create a table summarizing the dataset.
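The figures in the table can be computed from the objects we already have; one possible way (a sketch, using the fact that summing the freq column of a unigram phrasetable gives a word count) is:
# Sketch: line and word counts per source file
source_summary <- data.frame(
  file  = c("en_US.news.txt", "en_US.twitter.txt", "en_US.blogs.txt"),
  lines = c(length(news_lines), length(twit_lines), length(blog_lines)),
  words = c(sum(news_phrasetable1$freq),
            sum(twit_phrasetable1$freq),
            sum(blog_phrasetable1$freq))
)
source_summary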
| File | Number of Lines | Number of Words |
|---|---|---|
| en_US.news.txt | 1,010,242 | 33,513,338 |
| en_US.twitter.txt | 2,360,148 | 29,417,909 |
| en_US.blogs.txt | 899,288 | 36,871,309 |
So we can see that the twitter file has more lines than the other two files put together. However, news articles and blog posts are longer than tweets, so those files contain more words, with the blogs file having the greatest count at almost 37 million.
Let’s see what words are popular.
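The code behind the word-frequency plots is not reproduced here; a sketch of one way to draw them from the unigram phrasetables follows (the top_words() helper is our own, not from any package):
# Sketch: top ten unigrams for each source, plotted with ggplot2
top_words <- function(pt, source_name, n_top = 10) {
  pt %>%
    arrange(desc(freq)) %>%
    head(n_top) %>%
    mutate(source = source_name)
}
top_unigrams <- bind_rows(
  top_words(news_phrasetable1, "news"),
  top_words(twit_phrasetable1, "twitter"),
  top_words(blog_phrasetable1, "blogs")
)
ggplot(top_unigrams, aes(x = reorder(ngrams, freq), y = freq)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source, scales = "free") +
  labs(x = NULL, y = "Frequency")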
Unsurprisingly, ‘the’ is the most popular word in each source. We can also notice that the pronoun “I” is common in tweets and blogs but not in news articles. We gave some consideration to whether these very common words, or “stop words,” should be included in the analysis, since the installed packages have methods for weeding them out. Removing stop words would make sense if our purpose were sentiment analysis, another popular application of textual data analysis. However, our goal is text prediction, and we want to predict the words likely to follow any given phrase, so here it makes sense to include them.
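For reference only, this is roughly how stop words could be dropped with quanteda; we do not do this in our pipeline:
# Illustration only: removing English stop words with quanteda
toks <- tokens("this is the best of all possible worlds")
tokens_remove(toks, stopwords("en"))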
We want to generate and inspect some bigrams and trigrams. First we will aggregate our data into one body and sample it.
# Bigrams, trigrams, and 4grams ----
textbank <- c(news_lines, twit_lines, blog_lines)
portion <- 0.1
textbank <- sample(textbank, round(length(textbank) * portion))
gram2 <- ngram(textbank[word_count_vec(textbank) >= 2], n = 2)
phrasetable2 <- get.phrasetable(gram2)
gram3 <- ngram(textbank[word_count_vec(textbank) >= 3], n = 3)
phrasetable3 <- get.phrasetable(gram3)
gram4 <- ngram(textbank[word_count_vec(textbank) >= 4], n = 4)
phrasetable4 <- get.phrasetable(gram4)
And corresponding plots.
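As with the unigrams, the plotting code is not reproduced here; a sketch of how the panels could be built from the phrasetables, reusing the hypothetical top_words() helper from the unigram section and cowplot for the layout:
# Sketch: top ten 2grams, 3grams, and 4grams from the sampled text bank
plot_top <- function(pt, title) {
  ggplot(top_words(pt, title), aes(x = reorder(ngrams, freq), y = freq)) +
    geom_col() +
    coord_flip() +
    labs(x = NULL, y = "Frequency", title = title)
}
plot_grid(
  plot_top(phrasetable2, "Bigrams"),
  plot_top(phrasetable3, "Trigrams"),
  plot_top(phrasetable4, "4grams"),
  ncol = 1
)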
So we see that “of the” is the most popular bigram, and it is no surprise that this phrase is also contained in the more popular trigrams and 4grams. We can also note that “thanks for the follow,” a phrase that clearly comes from the Twitter source, was common enough within tweets alone to make the top six 4grams even when sampling all sources. This suggests that some attention should be given in final versions of the prediction engine to source frequency: since the app will be built in Shiny, a phrase common on Twitter will not necessarily be representative of the input users type there.
To predict what word will likely follow an entered phrase, we would like to capitalize on how long the phrase is, to the extent that we can. Thus we intend to implement some version of the Katz back-off method. This method combines, with appropriate weights, the words that most commonly follow the entered phrase’s last word, last bigram, last trigram, and so on, to the extent that this information is known. Due to resource constraints, we will most likely only involve 5grams or smaller. The challenge will be to decide exactly what the weight of each candidate word should be.
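As a rough sketch of the back-off idea only (this is not full Katz back-off: there is no discounting or weighting yet, just “use the longest matching history that has any candidates”), assuming the phrasetables built above; predict_next() is a hypothetical helper:
# Rough sketch of the back-off idea; predict_next() is hypothetical and
# skips the weighting/discounting that a real Katz back-off would apply.
predict_next <- function(phrase) {
  phrase <- preprocess(phrase, case = "lower", remove.punct = TRUE,
                       remove.numbers = TRUE, fix.spacing = TRUE)
  words <- strsplit(phrase, " ")[[1]]
  tables <- list(phrasetable4, phrasetable3, phrasetable2)
  history_sizes <- c(3, 2, 1)
  for (i in seq_along(tables)) {
    k <- history_sizes[i]
    if (length(words) < k) next
    history <- paste(tail(words, k), collapse = " ")
    # Normalize each stored ngram to end in a single space so that
    # "thanks for the " does not also match "thanks for them ..."
    ngrams_norm <- paste0(trimws(tables[[i]]$ngrams), " ")
    hits <- tables[[i]][startsWith(ngrams_norm, paste0(history, " ")), ]
    if (nrow(hits) > 0) {
      best <- hits$ngrams[which.max(hits$freq)]
      # Return the final word of the most frequent matching ngram
      return(tail(strsplit(trimws(best), " ")[[1]], 1))
    }
  }
  NA_character_  # no match at any order; a fuller engine would keep backing off
}
predict_next("thanks for the")
A production version would add the weighting discussed above and fall back to overall word frequencies rather than returning NA when no history matches.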