The main goal of this report is to acquire and examine the provided data set and begin to understand the scope of building a predictive text application.
To keep this report concise, I won't be showing all of the R code (i.e. echo = FALSE). You can find the full R Markdown on GitHub: (https://github.com/kerskine/coursera_data_science_capstone/blob/master/capstone.Rmd)
The course mentions using R's text mining package (tm), but the tidyverse community offers an alternative in tidytext. Tidytext relies on the tidy principle; applied in this analysis, that means one token per row. This makes it easy to apply other tools (dplyr, tidyr) when exploring and manipulating the data.
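As a toy illustration (made-up text, not from the course data), unnest_tokens() turns lines of text into one word per row:

library(dplyr)
library(tibble)
library(tidytext)

# Two made-up lines of text, stored one line per row
lines <- tibble(text = c("The quick brown fox", "jumps over the lazy dog"))

# unnest_tokens() splits each line into words, one token per row,
# which makes it easy to count, filter, and join with dplyr/tidyr
lines %>% unnest_tokens(word, text)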
The zip file contains a directory "final" which is 1.4 GB in size. In it are subdirectories for different languages (English, German, Finnish, and Russian), each of which has three text files: news, blogs, and twitter. For this report, I'll be using the English files.
Here is some basic information on the text files I'll be examining:
## # A tibble: 3 x 3
##   file     sizeKB lines_of_text
##   <chr>     <dbl>         <int>
## 1 twitter 167105.       2360148
## 2 blogs   210160.        899288
## 3 news    205812.       1010242
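The code that produced this table isn't shown in the report; a minimal sketch that would yield the same columns (the blogs and news paths are assumed to follow the same pattern as the twitter path used below, and countLines() comes from the R.utils package) is:

library(R.utils)   # for countLines()
library(tibble)

# Sketch: file size on disk (bytes / 1000, i.e. kilobytes) and line counts
files <- c(twitter = "./final/en_US/en_US.twitter.txt",
           blogs   = "./final/en_US/en_US.blogs.txt",
           news    = "./final/en_US/en_US.news.txt")

tibble(file          = names(files),
       sizeKB        = file.size(files) / 1000,
       lines_of_text = sapply(files, countLines))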
Next, we'll need to get each of the text files into a tibble so that tidytext can manipulate the data. I'll do this using a purpose-built function, texttidy(). It takes a text file and a sample size (5%) as inputs. The reason for taking only a sample is the limitations of my current analysis system; I'm making the assumption that the data is sufficiently randomized:
twitter <- texttidy("./final/en_US/en_US.twitter.txt", 0.05)
head(twitter, 3)
## # A tibble: 3 x 1
## text
## <chr>
## 1 How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Lov…
## 2 When you meet someone special... you'll know. Your heart will beat more …
## 3 they've decided its more fun if I don't.
dim(twitter)
## [1] 118007 1
So we can see that 'twitter' is now a tibble containing the lines of text as its only variable, with 118,007 rows (roughly 5% of the 2,360,148 lines in the full file). The blogs and news text files are prepared using the same function.
With the text files now ready for analysis, we can look at the number of distinct words in each tibble. I'll do this with another purpose-built function called wordcount(). It uses the tidytext function unnest_tokens() to tease out each word, and then uses the stop_words data set to eliminate words like is, it, the, etc. (see appendix for source code):
wctwitter <- wordcount(twitter)
wctwitter
## # A tibble: 64,046 x 2
##    word        n
##    <chr>   <int>
##  1 love     5340
##  2 day      4638
##  3 rt       4423
##  4 time     3942
##  5 lol      3427
##  6 3        2739
##  7 people   2607
##  8 follow   2459
##  9 happy    2386
## 10 tonight  2260
## # … with 64,036 more rows
This is interesting: "love" is one of the top ten words! You also see the word "rt", which means "retweet". One of the challenges in predictive text will be determining whether "rt" is something to include in an algorithm; the abbreviation has been used less on Twitter since a number of UI changes were introduced.
What about the other two text files? Are there any words common to all three? Here's a graph of the top 20 words in each file:
We can see some numbers in the top 20 words of each text file. Further analysis is needed to see if they're being used in place of spelled-out words (e.g., the previous sentence uses "top 20 words" instead of "top twenty words").
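As a first step in that follow-up, a sketch like this (not run in the report) would pull the purely numeric tokens out of a word count for inspection:

library(dplyr)
library(stringr)

# Keep only tokens made up entirely of digits, e.g. "3" or "20"
numeric_tokens <- wctwitter %>%
        filter(str_detect(word, "^[0-9]+$")) %>%
        arrange(desc(n))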
A predictive text system not only needs to predict the word being typed, but also what the next word will be. Bigrams, occurrences of two words together, are an important variable in building a good prediction algorithm.
Let's look for bigrams in the Twitter text. I'll use a purpose-built function getngrams() that uses unnest_tokens() to find word combinations (see appendix for source code):
getngrams(twitter, 2)
## # A tibble: 47,929 x 2
##    ngram          n
##    <chr>      <int>
##  1 in the      4018
##  2 for the     3795
##  3 of the      2826
##  4 on the      2391
##  5 to be       2329
##  6 thanks for  2219
##  7 to the      2207
##  8 at the      1823
##  9 i love      1794
## 10 going to    1681
## # … with 47,919 more rows
The top bigrams mostly consist of "stop words", which getngrams() doesn't filter out (unlike wordcount()). At this time, I don't know whether filtering out stop words is important to the eventual prediction product; more thought will be required.
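If filtering does turn out to matter, one common tidytext approach (a sketch, not applied in this report) is to split each bigram into its two words, drop any bigram containing a stop word, and re-unite the rest:

library(dplyr)
library(tidyr)
library(tidytext)

# Sketch: remove bigrams where either word is a stop word
getngrams(twitter, 2) %>%
        separate(ngram, into = c("word1", "word2"), sep = " ") %>%
        filter(!word1 %in% stop_words$word,
               !word2 %in% stop_words$word) %>%
        unite(ngram, word1, word2, sep = " ")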
Let’s look at trigrams:
getngrams(twitter, 3)
## # A tibble: 24,628 x 2
##    ngram                  n
##    <chr>              <int>
##  1 thanks for the      1228
##  2 thank you for        458
##  3 looking forward to   450
##  4 i love you           419
##  5 for the follow       416
##  6 i want to            385
##  7 going to be          379
##  8 can't wait to        374
##  9 a lot of             322
## 10 i need to            306
## # … with 24,618 more rows
Here's a table showing, for each of the sampled text files, the number of lines, words, bigrams, and trigrams:
## # A tibble: 3 x 5
##   File     Lines  Words Bigrams Trigrams
##   <chr>    <int>  <int>   <int>    <int>
## 1 twitter 118007  64046   47929    24628
## 2 blogs    44964  67314   61850    33571
## 3 news     50512  69769   61373    26902
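The code that assembles this summary isn't in the appendix; a sketch consistent with the counts above (row counts of the sampled tibbles and of the wordcount()/getngrams() results) would be:

library(tibble)

# Sketch: one row per source, counting sampled lines, distinct words, bigrams, and trigrams
tibble(File     = c("twitter", "blogs", "news"),
       Lines    = sapply(list(twitter, blogs, news), nrow),
       Words    = sapply(list(wctwitter, wcblogs, wcnews), nrow),
       Bigrams  = sapply(list(twitter, blogs, news), function(x) nrow(getngrams(x, 2))),
       Trigrams = sapply(list(twitter, blogs, news), function(x) nrow(getngrams(x, 3))))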
texttidy <- function(filename, samplesize = 0.05) {
        # This function takes a text file and creates a tibble with one line of text per row
        count <- countLines(filename)                             # countLines() comes from the R.utils package
        ttout <- readLines(filename, n = count * samplesize) %>%  # Read in a sample of the text
                as_tibble()                                       # Turn the output into a tibble
        names(ttout) <- c("text")                                 # Name the variable 'text'
        # Done
        ttout
}
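# Note: wordcount() is referenced above but not reproduced here. A minimal sketch,
# consistent with how it's described in the report (unnest_tokens() to split lines
# into words, then anti_join() against tidytext's stop_words data set), would be:
wordcount <- function(ttout) {
        ttout %>%
                unnest_tokens(word, text) %>%           # one word per row
                anti_join(stop_words, by = "word") %>%  # drop stop words (is, it, the, ...)
                count(word, sort = TRUE)                # count and sort the remaining words
}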
wctwitter <- wordcount(twitter)
wcblogs <- wordcount(blogs)
wcnews <- wordcount(news)

totwords <- rbind(wctwitter %>% top_n(20) %>% mutate(source = "twitter"),
                  wcblogs %>% top_n(20) %>% mutate(source = "blogs"),
                  wcnews %>% top_n(20) %>% mutate(source = "news")) %>%
        group_by(source) %>%
        mutate(word = reorder(word, n))

g <- totwords %>% ggplot(aes(reorder(word, n), n, fill = source))
g + geom_col() + coord_flip() + xlab("Top 20 Words")
getngrams <- function(ttout, combo) {
        # This function takes texttidy() output, plus an integer that sets the ngram length
        # bigrams: combo = 2, trigrams: combo = 3
        ttout <- ttout %>%
                unnest_tokens(ngram, text, token = "ngrams", n = combo) %>%
                count(ngram, sort = TRUE) %>%
                # Keep only ngrams that occur more than three times
                filter(n > 3) %>%
                # Drop NA rows - these appear when looking for trigrams
                drop_na()
        ttout
}