The goal of this project is to demonstrate that I have become familiar with the data and that I am on track to create my prediction algorithm. This report includes:
- Major identifiable features of the data (size, word counts, line counts)
- Basic summary report of statistics about the data sets (along with supporting tables/figures)
- Interesting preliminary findings (along with supporting tables/figures)
- Plans for creating the prediction algorithm and Shiny app
The data are available in four languages (English [US], German, Finnish, and Russian). For this assignment, I limit the analysis to the English files, which contain three separate sources of text (twitter.txt, blogs.txt, and news.txt). The code below loads the data from the text files, prints each file's size (in bytes), and prints its number of lines.
#Load the packages used throughout this report:
library(tidyverse) #tibble, dplyr, stringr, tidyr, ggplot2
library(tidytext)
#Read in the English data files with a for loop:
for (i in c("twitter", "blogs", "news")) {
  #Build the file name for the current source with paste0():
  filename = paste0("en_US.", i, ".txt")
  print(paste0("Size of ", i, " file = ", file.info(paste0("./", filename))$size, " bytes"))
  #Read the text file once, suppressing warning messages:
  lines = suppressWarnings(readLines(filename))
  assign(i, lines)
  print(paste0("Total number of lines of ", i, " file = ", length(lines)))
  #Transform the data into a tidy data frame (tibble) with two columns:
  #1 "line" (the line number) and 2 "text" (the line's content)
  assign(paste0(i, "_df"), tibble(line = 1:length(lines), text = lines))
}
## [1] "Size of twitter file = 167105338 bytes"
## [1] "Total number of lines of twitter file = 2360148"
## [1] "Size of blogs file = 210160014 bytes"
## [1] "Total number of lines of blogs file = 899288"
## [1] "Size of news file = 205811889 bytes"
## [1] "Total number of lines of news file = 77259"
Although all of the text should be in English, as a precaution I removed any lines containing non-ASCII characters from each data source.
#Remove lines containing non-ASCII characters
#iconv() substitutes the sentinel string "NONASCII" for any character it
#cannot convert to ASCII, so grep() flags the affected lines for removal.
#(Using a distinctive sentinel avoids accidentally removing lines that
#merely contain the source's name, e.g. the word "twitter".)
nonascii = grep("NONASCII", iconv(twitter, "latin1", "ASCII", sub = "NONASCII"))
twitter = twitter[-nonascii]
nonascii = grep("NONASCII", iconv(blogs, "latin1", "ASCII", sub = "NONASCII"))
blogs = blogs[-nonascii]
nonascii = grep("NONASCII", iconv(news, "latin1", "ASCII", sub = "NONASCII"))
news = news[-nonascii]
Next, I drew a random sample of 10,000 lines from each data source (twitter, blogs, and news) and combined them into a single subsample, so that the prediction algorithm can be trained more efficiently in later stages.
#Create a random sample of text across all three sources: twitter, blogs, news
#in order to explore the data and train our prediction algorithm:
twitter_sample = twitter[sample(1:length(twitter), 10000)]
blogs_sample = blogs[sample(1:length(blogs), 10000)]
news_sample = news[sample(1:length(news), 10000)]
subsample = c(twitter_sample, blogs_sample, news_sample)
#Store the combined sample as a tibble with "line" and "text" columns,
#matching the structure of the per-source data frames:
subsample_df = tibble(line = 1:length(subsample), text = subsample)
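Because the subsample is random, the exact counts and figures below will differ slightly between runs of this report; if reproducibility is desired, a seed can be set before drawing the samples (the seed value below is arbitrary):
#Optional: fix the random number generator state before the sample() calls
#so the same 10,000-line samples are drawn each time (seed value is arbitrary)
set.seed(1234)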
An important alternative to the Corpus object has emerged in recent years in the form of tidytext. Instead of storing a group of documents and associated metadata, text in tidy format contains one word per row, along with columns identifying the document the word came from and the position at which it appears. Therefore, the code below removes unwanted characters, restructures the data so that each word is on a separate row, and removes stop words.
#Unwanted characters (digits and assorted symbols):
remove_reg = "[0123456789!@#$%^*+=}{/><]"
subsample_df_word = subsample_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>% ##Remove digits and special characters
unnest_tokens(word, text) %>% ##Restructure the data so that each word is on a separate row
anti_join(stop_words) #Remove stop words: words that are not useful for an analysis,
#typically extremely common words such as "the", "of", "to",
#and so forth in English. The stop_words dataset ships with tidytext,
#and anti_join() drops any word that appears in it.
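To confirm the one-word-per-row structure described above, we can peek at the first few rows of the tokenized data frame (the exact words depend on the random sample):
#Inspect the tidy structure: one word per row, with "line" recording the
#subsample line that each word came from
head(subsample_df_word)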
Now, let's count how often each word occurs in the sample, exclude any profane words, and finally visualize the most common words in the "clean" list:
#We can also use dplyr's count() to find the most common words
subsample_words = subsample_df_word %>%
count(word, sort = TRUE)
#Calculate the total number of words:
total_subsample_words = subsample_words %>%
summarize(total = sum(n))
subsample_words = cbind(subsample_words, total_subsample_words)
#Exclude list of profane words/phrases
#Source of profane word list: https://www.cs.cmu.edu/~biglou/resources/
profane_words = read.delim("./profane_words.txt", header = FALSE, sep="\n") #sep="\n" reads each line as a single entry
#Rename the V1 column as "word" so that anti_join and inner_join work:
profane_words = profane_words %>% rename(word = V1)
#Since this is based on single word, we need to apply it to the word dataset
#where each word is its own separate line:
subsample_words_noprofane = anti_join(subsample_words, profane_words)
subsample_words_profane = inner_join(subsample_words, profane_words)
##Visualize the most common words (from the profanity-free list):
ggplot(subsample_words_noprofane[1:50,], aes(x = reorder(word, -n), y = n, fill = n)) + #reorder() orders the bars by frequency instead of alphabetically
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Most frequent words in sample (excluding stop words and profanity)",
x = "Word", y = "Frequency")
Another way to examine text data is to investigate n-grams instead of single words. N-grams are contiguous sequences of n words. In the code below, I tokenize the data into bigrams (two-word sequences) and trigrams (three-word sequences).
#Evaluate n-grams instead of single words
##Bi-grams (n=2)##
subsample_df_bigram = subsample_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>% ##Remove digits and special characters
unnest_tokens(bigram, text, token = "ngrams", n = 2) ##Restructure the data so that each bigram is on a separate row
##Tri-grams (n=3)##
subsample_df_trigram = subsample_df %>%
mutate(text = str_remove_all(text, remove_reg)) %>% ##Remove digits and special characters
unnest_tokens(trigram, text, token = "ngrams", n = 3) ##Restructure the data so that each trigram is on a separate row
As one might expect, many of the most common bigrams are pairs of common (uninteresting) words, such as "of the" and "to be" (commonly referred to as "stop words"). Therefore, I used tidyr's separate() command to split the n-grams into separate columns: "word1" and "word2" for bigrams, and "word1", "word2", and "word3" for trigrams. I then excluded every n-gram in which any word was a stop word, and plotted the most frequent bigrams and trigrams.
##Bi-grams##
subsample_df_bigram_sep = subsample_df_bigram %>%
separate(bigram, c("word1", "word2"), sep = " ")
subsample_df_bigram_filtered = subsample_df_bigram_sep %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word)
# new bigram counts:
subsample_df_bigram_counts = subsample_df_bigram_filtered %>%
count(word1, word2, sort = TRUE)
#Remove the "NA NA" row
subsample_df_bigram_counts = na.omit(subsample_df_bigram_counts)
#Create a new column that is the non-separated version of the bigram:
subsample_df_bigram_counts$bigram = paste(subsample_df_bigram_counts$word1, subsample_df_bigram_counts$word2, sep=" ")
#Plot the 50 most frequent bigrams (the "NA" rows were already removed above)
ggplot(subsample_df_bigram_counts[1:50,], aes(x = reorder(bigram, -n), y = n, fill = n)) + #reorder() orders the bars by frequency instead of alphabetically
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Most frequent bigrams in sample (excluding stop words)", x = "Bigram", y = "Frequency")
##Tri-grams##
subsample_df_trigram_sep = subsample_df_trigram %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ")
subsample_df_trigram_filtered = subsample_df_trigram_sep %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
filter(!word3 %in% stop_words$word)
# new trigram counts:
subsample_df_trigram_counts = subsample_df_trigram_filtered %>%
count(word1, word2, word3, sort = TRUE)
#Remove the "NA NA NA" row
subsample_df_trigram_counts = na.omit(subsample_df_trigram_counts)
#Create a new column that is the non-separated version of the trigram:
subsample_df_trigram_counts$trigram = paste(subsample_df_trigram_counts$word1, subsample_df_trigram_counts$word2, subsample_df_trigram_counts$word3, sep=" ")
#Plot the 50 most frequent trigrams (the "NA" rows were already removed above)
ggplot(subsample_df_trigram_counts[1:50,], aes(x = reorder(trigram, -n), y = n, fill = n)) + #reorder() orders the bars by frequency instead of alphabetically
geom_col() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title = "Most frequent trigrams in sample (excluding stop words)", x = "Trigram", y = "Frequency")
The next steps of this Capstone project are to develop a prediction model that predicts the next word from the previous one, two, or three words, and to build that model into a Shiny app. The model will also need to handle unseen n-grams, i.e., cases in which a user types a combination of words that does not appear in the corpora.
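As a rough illustration of the direction I plan to take (not the final model), the n-gram counts computed above can already be used for a naive "back-off" lookup: try to match the last two words typed against the trigram table, fall back to the bigram table on the last word, and fall back to the most frequent single word if neither matches. Note that the counts above exclude stop words, so a real model would be trained on counts that keep them; the function below is only a sketch of the idea.
#Naive back-off sketch (illustrative only): predict the next word from the
#last one or two words using the n-gram count tables built above.
#Assumes subsample_df_trigram_counts, subsample_df_bigram_counts, and
#subsample_words_noprofane exist as created earlier in this report.
predict_next_word = function(w1, w2) {
  #1. Try trigrams whose first two words match the input
  tri = subsample_df_trigram_counts %>%
    filter(word1 == w1, word2 == w2) %>%
    arrange(desc(n))
  if (nrow(tri) > 0) return(tri$word3[1])
  #2. Back off to bigrams whose first word matches the last input word
  bi = subsample_df_bigram_counts %>%
    filter(word1 == w2) %>%
    arrange(desc(n))
  if (nrow(bi) > 0) return(bi$word2[1])
  #3. Back off to the single most frequent word in the sample
  subsample_words_noprofane$word[1]
}
#Example usage (the result depends on the random sample):
predict_next_word("happy", "mothers")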