Data Science Capstone Milestone Report

Guanglai Li
July 25, 2015

Summary of the project

In this project we will build a model that predicts the next word after the user has typed one or more words. Models of this kind are widely used for text input on mobile devices such as cell phones and tablets.

Data: we will use three text files, 'en_US.blogs.txt', 'en_US.news.txt', and 'en_US.twitter.txt', to build models for the English language.
Tools: R is the main tool, combined with Linux commands.

This milestone report summarizes the work done so far toward the final goal of the project.

Summary of the data files

The Linux command wc reports the number of lines, words, and bytes in a file directly. It can be called from R, as in the example below.

system("wc en_US.blogs.txt")
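To use the counts inside R rather than only printing them, the output of wc can be captured and parsed. The following is a minimal sketch, assuming a Unix-like system with wc on the PATH; the helper name count_file is hypothetical and is demonstrated on a small temporary file instead of the real corpus.

```r
# Sketch: capture the output of `wc` from R and parse it into numbers.
# Assumes a Unix-like system where `wc` is available on the PATH.
count_file <- function(path) {
  # intern = TRUE returns the command's standard output as a character vector
  out <- system(paste("wc", shQuote(path)), intern = TRUE)
  # wc prints "lines words bytes filename", possibly with leading spaces
  fields <- strsplit(trimws(out), "\\s+")[[1]]
  c(lines = as.numeric(fields[1]),
    words = as.numeric(fields[2]),
    bytes = as.numeric(fields[3]))
}

# Demonstration on a small temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "one two three"), tmp)
counts <- count_file(tmp)
print(counts)
```

Calling count_file("en_US.blogs.txt") in the data directory would return the same three numbers that the system() call above prints.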

The following table summarizes the number of lines and words in each file.

files	# lines (million)	# words (million)
en_US.blogs.txt	0.90	37.3
en_US.news.txt	1.01	34.3
en_US.twitter.txt	2.36	30.4

Word frequency I

We use en_US.blogs.txt as an example.
The most frequently used words are all so-called stopwords. The top 10 and their percentages are shown below. Stopwords are usually excluded from model building.

[Figure: word frequency, top 10 words in en_US.blogs.txt and their percentages]
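A ranking like this can be computed with base R alone. The sketch below is illustrative rather than the code used for the report: the tokenizer (lowercase, split on anything that is not a letter or apostrophe) and the helper name top_words are assumptions, and the input is a toy sample instead of the blogs file.

```r
# Sketch: top-n words with their share of all tokens, in percent.
# The tokenization rule is an assumption made for illustration.
top_words <- function(lines, n = 10) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens <- tokens[tokens != ""]            # drop empty strings from splitting
  freq <- sort(table(tokens), decreasing = TRUE)
  pct <- 100 * as.numeric(freq) / length(tokens)
  head(data.frame(word = names(freq),
                  percent = round(pct, 2),
                  stringsAsFactors = FALSE), n)
}

# Toy input standing in for readLines("en_US.blogs.txt")
sample_lines <- c("The cat sat on the mat.", "The dog and the cat.")
top <- top_words(sample_lines, n = 3)
print(top)
```

Even in this tiny sample the stopword "the" dominates, which mirrors the pattern described above.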

Word frequency II

55% of distinct words appear only once among the 37 million total words
83% of words appear fewer than 10 times
92% of words appear fewer than 37 times, that is, less than once per million words
95% of words appear fewer than 100 times
98% of words appear fewer than 370 times, that is, less than once per 100,000 words
Put differently, there are 7244 words that appear at least once per 100,000 words
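The thresholds 37 and 370 follow from the corpus size: 37 million words divided by one million and by 100,000. The sketch below shows that arithmetic and how such shares could be computed from a vector of per-word counts. The function name share_below and the toy counts are hypothetical; the real percentages come from the full blogs file.

```r
# Sketch: share of distinct words that occur fewer than k times.
share_below <- function(counts, k) {
  mean(counts < k)   # fraction of distinct words with count below k
}

# Thresholds implied by a corpus of 37 million words
total_words <- 37e6
per_million <- total_words / 1e6        # count equal to once per million words
per_100k    <- total_words / 1e5        # count equal to once per 100,000 words

# Toy per-word counts standing in for table(tokens) on the real file
toy_counts <- c(the = 50, cat = 5, dog = 1, emu = 1)
singletons <- mean(toy_counts == 1)     # share of words seen exactly once
below_ten  <- share_below(toy_counts, 10)
print(c(per_million = per_million, per_100k = per_100k,
        singletons = singletons, below_ten = below_ten))
```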

Outline of model building

Delete the most frequent words that have little predictive value, such as 'a', 'the', 'and' …
Delete rare words that appear less than once per 100,000 words.
Generate bigrams and trigrams of words.
Calculate the probability of each word following a particular word or bigram.
Select the most probable word as the next word.
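The steps above can be sketched in base R for the bigram case. This is a simplified illustration, not the final model: the stopword list, the min_count cutoff, and the function names are assumptions, trigrams are left out, and all lines are joined into one token stream, so some bigrams span line boundaries.

```r
# Sketch of the outline: filter the vocabulary, build bigrams, predict.
# Stopword list, cutoff, and function names are illustrative assumptions.
tokenize <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens[tokens != ""]
}

build_bigram_model <- function(tokens, stopwords = c("a", "the", "and"),
                               min_count = 1) {
  tokens <- tokens[!(tokens %in% stopwords)]      # step 1: drop stopwords
  counts <- table(tokens)
  keep <- names(counts)[counts >= min_count]
  tokens <- tokens[tokens %in% keep]              # step 2: drop rare words
  n <- length(tokens)
  stopifnot(n >= 2)
  # step 3: adjacent pairs form the bigrams
  tab <- table(first = tokens[-n], second = tokens[-1])
  # step 4: row-normalize to get P(second | first)
  prop.table(tab, margin = 1)
}

predict_next <- function(model, word) {
  if (!(word %in% rownames(model))) return(NA_character_)
  # step 5: the most probable follower is the prediction
  colnames(model)[which.max(model[word, ])]
}

text <- "I like green tea and I like green apples and I like black tea"
model <- build_bigram_model(tokenize(text))
print(predict_next(model, "like"))
```

On the real corpus, min_count would be set from the once-per-100,000-words threshold, and a trigram table keyed on the two preceding words could be built the same way.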