I have been given a large corpus of text for use in building a predictive text algorithm. This algorithm will be used in a web application built with Shiny. The application will be similar to the predictive text you see as you type a text message on your phone: it will use the last few words typed to predict and display the words most likely to be typed next, and the person typing will be able to select one of these words instead of typing it.
I began by collecting some resources to aid me in my project.
Text Mining in R This is a technical description of the tm package, with many examples pertaining to common text mining tasks.
CRAN page for tm The ‘Introduction to the tm package’ vignette is especially helpful and highly readable. It describes how to create a Corpus from a Source, transform it using the tm_map function, filter it using the tm_filter function and manage its metadata. There are also sections on common operations used in text analysis that I may add details on later.
A gentle introduction to text mining using R This blog post is a novice’s guide, and even covers installing R, but eventually does address some deeper aspects of using the tm and SnowballC packages.
CRAN Task View: Natural Language Processing This resource contains brief descriptions of, and links to, many NLP frameworks and resources available in R.
Stanford NLP Course Videos A link to a YouTube playlist for Stanford Professor Dan Jurafsky’s NLP Course, no longer on Coursera.
NLTK Book This is a textbook about text analysis in Python, but the exercises are good for any language.
Text Mining with R, A Tidy Approach Julia Silge and David Robinson’s new book on O’Reilly. In short, the tidyverse meets NLP.
Text Processing in R This is a brief tutorial.
Text Mining, The Tidy Way Julia Silge’s January 14, 2017 presentation at RStudio::Conf advocating the application of tidyverse to text mining tasks.
JHU Data Science Specialization - Capstone Course - Week 1 This gives a good overview of the capstone project’s ‘task 0’ (understanding the problem) and ‘task 1’ (getting and cleaning the data).
JHU Data Science Specialization - Capstone Course - Week 2 This provides an overview of the capstone project’s ‘task 2’ (exploratory data analysis) and ‘task 3’ (modeling).
I downloaded and extracted the data files into my project directory. For brevity’s sake, I am not displaying this code, as there is nothing remarkable about it, but it is part of the source document.
More information on these corpora was located using the Wayback Machine and can be read here: About the Corpora
After considering the many options for reading, cleaning and tokenizing data for analysis, I decided to proceed using an approach as outlined by Hadley Wickham here.
The first step is to read and sample the data. The entire data set is too large for the resources I have available, so I am sampling 5% of the lines of text for my analysis and discarding the remainder, for the time being. I am showing the full code here so that this step can be reproduced if desired. We’ll read, sample, and calculate basic statistics on the three data sources provided.
# This function takes as input the filename of a text file representing a corpus, a percentage of lines to select for a sample, and a tag for the source, and returns a list including the size of the corpus, the sample size, and a data_frame of the selected lines
read_afile <- function(filename, sample_pct, tag) {
if (sample_pct < 0 | sample_pct > 1) {
stop("sample_pct must be between 0 and 1")
}
fullfile <- read_lines(filename, progress=FALSE)
sampled <- data_frame(source = tag,
text = sample(x = fullfile, size = length(fullfile)*sample_pct))
list(sample=sampled, sample_length=nrow(sampled), corpus_length=length(fullfile))
}
# Use the function above to take samples of the input files. The function returns the data and some statistics in a list, which we parse to extract the lines of text and the statistics.
samp_pct = 0.05
blogl <- read_afile("final/en_US/en_US.blogs.txt", sample_pct = samp_pct, tag = "b")
blogs <- blogl[["sample"]]
newsl <- read_afile("final/en_US/en_US.news.txt", sample_pct = samp_pct, tag = "n")
newss <- newsl[["sample"]]
tweetl <- read_afile("final/en_US/en_US.twitter.txt", sample_pct = samp_pct, tag = "t")
tweets <- tweetl[["sample"]]
# Begin to create a structure to contain some key statistics about the corpora.
text_stats <- data_frame(
source=c("b","n","t"),
all_lines=c(blogl[["corpus_length"]],newsl[["corpus_length"]],tweetl[["corpus_length"]]),
sampled_lines=c(blogl[["sample_length"]],newsl[["sample_length"]],tweetl[["sample_length"]]))
Next, I joined these data together into a variable named “sample_text” and removed the temporary variables I used to build it, to free up memory. I also added line numbers to the data for backtracking, if needed.
# Simply join all of the inputs, add a linenum so we don't lose track of where things came from, and re-order the columns.
sample_text <- union_all(blogs,newss) %>%
union_all(tweets) %>%
group_by(source) %>%
mutate(linenum = row_number()) %>%
ungroup() %>%
select(source, linenum, text)
# Clean up duplicate objects consuming memory
rm(list=c("blogl","newsl","tweetl",
"blogs","newss","tweets"))
My first pass at cleaning the data is at the character level and includes four steps: I collapsed the several characters that appear as single quotes in the text down to one, converted underscores to spaces, transliterated other non-ASCII characters to ASCII equivalents where they exist, and then removed any remaining non-ASCII characters (replacing them with spaces).
# Collapse straight and curly single-quote characters to a plain apostrophe
sample_text$text <- stri_replace_all(sample_text$text,"\u0027",regex="[\u0027\u2018\u2019]")
# Convert underscores to spaces
sample_text$text <- stri_replace_all(sample_text$text,"\u0020",regex="[\u005F]")
# Transliterate other non-ASCII characters to ASCII equivalents where they exist
sample_text$text <- stri_trans_general(sample_text$text,"latin-ascii")
# Replace any remaining non-ASCII characters with a space
sample_text$text <- stri_replace_all(sample_text$text, " ", regex = "[^\u0001-\u007F]")
There were several categories of words (stop words, profanity, and words containing numerals) that I wasn’t sure about removing from the sample text prior to exploration. I gathered information about these but am excluding many details for brevity. After some analysis, I decided to use the tidytext “stop words” list and LDNOOBW’s repository of “profane words”.
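The lists themselves are loaded in a setup step not shown above. A minimal sketch of that step is below, assuming a local file named profanity_en.txt holding a copy of the English list from the LDNOOBW repository (the file name is my own).
library(tidytext)  # provides the stop_words data frame
library(readr)
# tidytext's combined stop word lexicons (onix, SMART, and snowball)
data(stop_words)
# Read the profanity list; "profanity_en.txt" is a hypothetical local copy of
# the English list from the LDNOOBW repository
profane_words <- data_frame(word = read_lines("profanity_en.txt"))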
There are 1149 stop words and 377 profane words in these lists. At this point I tokenized the text and added a column for the word number within each line. Ten rows are shown, corresponding to the first ten words in the first line of text in the ‘b’ (blogs) source.
# Tokenize the sample text to words
sample_words <- sample_text %>%
unnest_tokens(word,text) %>%
group_by(source, linenum) %>%
mutate(wordnum = row_number()) %>%
ungroup() %>%
arrange(source, linenum, wordnum)
print(sample_words,n=10)
## # A tibble: 5,136,634 × 4
## source linenum word wordnum
## <chr> <int> <chr> <int>
## 1 b 1 how 1
## 2 b 1 to 2
## 3 b 1 walk 3
## 4 b 1 in 4
## 5 b 1 the 5
## 6 b 1 woods 6
## 7 b 1 without 7
## 8 b 1 being 8
## 9 b 1 in 9
## 10 b 1 the 10
## # ... with 5.137e+06 more rows
To deepen my analysis, I calculated some additional statistics by source: the total word count, stop word count, profane word count, and numeric word count. A table is shown below.
# Count occurrences of word categories that we may exclude for some analyses
total_words <- group_by(sample_words,source) %>%
count() %>%
rename(total_words=n)
numeric <- filter(sample_words, grepl("[0-9]",word)) %>%
group_by(source) %>%
count() %>%
rename(numeric=n)
profanity <- inner_join(sample_words, profane_words, by="word") %>%
group_by(source) %>%
count() %>%
rename(profanity=n)
stop_wds <- inner_join(sample_words, stop_words, by="word") %>%
group_by(source) %>%
count() %>%
rename(stop_wds=n)
# Add them to our text_stats structure
text_stats <- text_stats %>%
inner_join(total_words, by="source") %>%
inner_join(numeric, by="source") %>%
inner_join(profanity, by="source") %>%
inner_join(stop_wds, by="source")
text_stats
## # A tibble: 3 × 7
## source all_lines sampled_lines total_words numeric profanity stop_wds
## <chr> <int> <int> <int> <int> <int> <int>
## 1 b 899288 44964 1895380 23502 1239 3117804
## 2 n 1010242 50512 1735435 46304 554 2450622
## 3 t 2360148 118007 1505819 28317 5284 2249485
I decided that, for this project, I would remove only numeric and profane words.
clean_words <- sample_words %>%
anti_join(profane_words, by="word") %>%
filter(!grepl("[0-9]",word))
word_count <- nrow(clean_words)
I was left with 5031434 words in the sample. Here is a chart showing the 30 most frequent words in each source of the sample.
top30 <- clean_words %>%
count(word, source) %>%
group_by(source) %>%
top_n(30,wt=n) %>%
arrange(source,desc(n))
facet_labels <- c(b="Blogs",n="News",t="Tweets")
g <- ggplot(data=top30, aes(x=word, y=n/100, fill=source))
g <- g + geom_bar(stat="identity")
g <- g + facet_grid(. ~ source, labeller = labeller(source=facet_labels))
g <- g + theme(legend.position="none")
g <- g + xlab(NULL) + ylab("Frequency (hundreds)")
g <- g + coord_flip()
g
Next I looked at word frequencies across the entire sample, regardless of source. Here are the top 30 words.
word_freq <- clean_words %>%
count(word, source) %>%
spread(source,n) %>%
mutate(freq=sum(c(b,n,t),na.rm=TRUE),
pct = freq/word_count) %>%
rename(blogs=b, news=n, tweets=t) %>%
arrange(desc(freq)) %>%
ungroup() %>%
mutate(rank = row_number())
print(word_freq,n=30)
## # A tibble: 120,299 × 7
## word blogs news tweets freq pct rank
## <chr> <int> <int> <int> <int> <dbl> <int>
## 1 the 93443 99096 47068 239607 0.047622010 1
## 2 to 53551 44690 39318 137559 0.027339919 2
## 3 and 55730 44316 21589 121635 0.024175017 3
## 4 a 45182 43819 30378 119379 0.023726635 4
## 5 of 44287 39031 18013 101331 0.020139586 5
## 6 in 30217 34102 19096 83415 0.016578773 6
## 7 i 39095 7884 36234 83213 0.016538625 7
## 8 for 18400 17976 19473 55849 0.011100016 8
## 9 is 21694 14268 17926 53888 0.010710267 9
## 10 that 23186 17379 11651 52216 0.010377956 10
## 11 you 15212 4858 27576 47646 0.009469666 11
## 12 it 20417 11102 14560 46079 0.009158224 12
## 13 on 13883 13671 13943 41497 0.008247549 13
## 14 with 14576 12668 8711 35955 0.007146074 14
## 15 was 13967 11464 6081 31512 0.006263026 15
## 16 my 13795 2051 14591 30437 0.006049369 16
## 17 at 8679 10651 9472 28802 0.005724412 17
## 18 be 10386 7741 9252 27379 0.005441590 18
## 19 this 12941 6023 8048 27012 0.005368648 19
## 20 have 11186 7172 8416 26774 0.005321346 20
## 21 are 9864 6878 8076 24818 0.004932590 21
## 22 but 10423 7576 6364 24363 0.004842158 22
## 23 as 11121 9334 3587 24042 0.004778359 23
## 24 he 7474 11210 2831 21515 0.004276117 24
## 25 we 9610 5045 6680 21335 0.004240342 25
## 26 not 8865 5812 6253 20930 0.004159848 26
## 27 so 8418 2860 8206 19484 0.003872455 27
## 28 from 7384 7629 4256 19269 0.003829723 28
## 29 me 7147 1365 10031 18543 0.003685430 29
## 30 all 7337 3579 6003 16919 0.003362660 30
## # ... with 1.203e+05 more rows
At this point, it is the word frequency that interests me, so I looked at descriptive statistics for this variable. The data is positively skewed, with a high standard deviation and kurtosis. In layman’s terms, there is a big pile-up of words at low frequencies, and a relatively small number of words account for a large portion of the sample. We’ll do more work on that as we start a statistical analysis.
Below is a histogram of word frequencies for all words in the sample that shows this pile-up.
summary(word_freq$freq)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 1.00 1.00 41.82 5.00 239600.00
describe(word_freq$freq)
## vars n mean sd median trimmed mad min max range skew
## X1 1 120299 41.82 1174.49 1 3.39 0 1 239607 239606 118.94
## kurtosis se
## X1 18876.18 3.39
g <- ggplot(word_freq, aes(pct))
g <- g + geom_histogram(bins=20, fill="blue", na.rm=TRUE)
g <- g + xlim(NA,0.0009)
g
Interestingly, this “piled-up” distribution of words in a corpus of text is common and is predicted by Zipf’s Law. This law says that the frequency of any word in a corpus is inversely proportional to its rank, so the most frequent word occurs about twice as often as the second most frequent word, three times as often as the third most frequent word, and so forth. A plot of term frequency vs. rank in our sample demonstrates this.
g <- ggplot(word_freq, aes(x=rank, y=pct))
g <- g + geom_line(size = 1.5, color="darkseagreen")
g <- g + scale_x_log10() + scale_y_log10()
g <- g + xlab("log10 Rank") + ylab("log10 Word Frequency")
g
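As a rough numeric check on this relationship, a straight-line fit on the log-log scale should give a slope near -1. This is only a sketch of a check I may add later; its output is not part of the report above.
# Fit log10(frequency) against log10(rank); under Zipf's Law the slope is near -1
zipf_fit <- lm(log10(freq) ~ log10(rank), data = word_freq)
coef(zipf_fit)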
As I move forward, I will complete a similar analysis for n-grams in the data set to see whether they follow a similar pattern. The unnest_tokens function in tidytext handles n-gram tokenization as well, so that analysis closely mirrors the word-level work above; I omit it here for brevity, aside from the brief sketch below.
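For reference, the bigram version of the tokenization step might look like this (trigrams follow the same pattern with n = 3); the object name sample_bigrams is my own, and the counts are not shown here.
# Tokenize the sample text into bigrams instead of single words and count them
sample_bigrams <- sample_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)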
My plan for moving forward is to determine the smallest set of words and n-grams needed to cover a given percentage of the language by calculating a cumulative distribution function for word and n-gram frequency. This will allow me to remove many low-frequency words from the data and focus on the specific words and phrases that occur most frequently. Then, I want an efficient way to calculate the probability of each of a number of possible next words across the remaining set of common words.
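A sketch of that coverage calculation, using the word_freq table built above, might look like this; the 90% threshold is only an example value.
# Cumulative share of the sample covered as words are added in rank order
coverage <- word_freq %>%
  arrange(rank) %>%
  mutate(cum_pct = cumsum(pct))
# Smallest number of distinct words needed to cover, e.g., 90% of the sampled words
sum(coverage$cum_pct < 0.90) + 1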
There are several models that could be used to model the data effectively. There has been a suggestion that Markov chains may provide an efficient way to store and use the data, and this is my default plan; however, I want to do some additional reading to see what newer techniques may be available.
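As a minimal illustration of the Markov chain idea (a first-order model keyed on the single preceding word, not a final design), the bigram counts from the sketch above could be stored as a simple lookup table.
library(tidyr)  # already attached above for spread()
# Split each bigram into (preceding word, next word) and order candidates by count
bigram_model <- sample_bigrams %>%
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  arrange(word1, desc(n))
# Toy prediction: return the k most frequent words observed after prev_word
predict_next <- function(prev_word, k = 3) {
  candidates <- bigram_model %>%
    filter(word1 == prev_word) %>%
    head(k)
  candidates$word2
}
predict_next("in")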
As a creative direction, I want the product to be intuitive and easy to use, responsive to input from the user, and interesting. I would also like to keep some usage statistics to try to improve the model for a specific user.
I will use Slidify to complete the data product.
I have a few ideas for a product pitch; these remain in development for now.