Data Science Capstone Milestone Report

Guanglai Li
July 25, 2015

Summary of the project

In this project we will build a model that predicts the next word after one or more words have been typed. Models of this kind are widely used for text input on mobile devices such as cell phones and tablets.

  • data: we will use three text files, 'en_US.blogs.txt', 'en_US.news.txt', and 'en_US.twitter.txt', to build models for the English language.

  • tools: R is the main tool, combined with Linux commands.

This milestone report summarizes the work done so far toward the final goal of the project.

Summary of the data files

The Linux command wc reports the number of lines, words, and bytes in a file directly. It can be called from R, as in the example below.

system("wc en_US.blogs.txt")
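The call above only prints the counts to the console. To use them in R (for example, to build a summary table), the output can be captured and parsed. The following is a minimal sketch, assuming a Unix-like system where wc is on the path; the helper name wc_counts is hypothetical and not part of the original report.

```r
# Hypothetical helper: run `wc` on a file and return its counts as numbers.
wc_counts <- function(path) {
  # stdout = TRUE captures the command output as a character vector
  out <- system2("wc", args = shQuote(path), stdout = TRUE)
  # wc prints: <lines> <words> <bytes> <file name>
  fields <- scan(text = out[1], what = character(), quiet = TRUE)
  counts <- as.numeric(fields[1:3])
  names(counts) <- c("lines", "words", "bytes")
  counts
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("hello world", "a b c"), tmp)
demo_counts <- wc_counts(tmp)
print(demo_counts)
```

For the real data, the same helper could be applied to each of the three files, e.g. wc_counts("en_US.blogs.txt").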

The following table summarizes the number of lines and words in each file.

file                 lines (million)   words (million)
en_US.blogs.txt      0.90              37.3
en_US.news.txt       1.01              34.3
en_US.twitter.txt    2.36              30.4

Word frequency I

  • We use en_US.blogs.txt as an example.
  • The most frequently used words are all so-called stopwords. The top 10 and their percentages are shown below. Stopwords are usually excluded from model building.

[Figure: wordfrequency (top 10 most frequent words and their percentages)]
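A ranking like the one described above could be computed along the following lines. This is only a sketch in base R, not the code used for the report: the helper names tokenize and top_words are hypothetical, the tokenizer is deliberately crude (lower-case letters and apostrophes only), and a toy input stands in for the lines of en_US.blogs.txt.

```r
# Hypothetical helper: split lines of text into lower-case word tokens.
tokenize <- function(lines) {
  # keep letters and apostrophes; everything else acts as a separator
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Hypothetical helper: top n words with their share of all tokens (percent).
top_words <- function(lines, n = 10) {
  counts <- sort(table(tokenize(lines)), decreasing = TRUE)
  top <- head(counts, n)
  data.frame(word = names(top),
             count = as.integer(top),
             percent = round(100 * as.integer(top) / sum(counts), 2),
             stringsAsFactors = FALSE)
}

# Toy input standing in for lines read from en_US.blogs.txt
toy <- c("The cat and the dog", "the bird and a fish")
top_result <- top_words(toy, n = 3)
print(top_result)
```

On the real file, the input would come from something like readLines("en_US.blogs.txt"), possibly on a sample of lines to keep memory use manageable.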

Word frequency II

  • 55% of the distinct words appear only once among the 37 million words in total
  • 83% of the distinct words appear fewer than 10 times
  • 92% appear fewer than 37 times, i.e. less than once per million words
  • 95% appear fewer than 100 times
  • 98% appear fewer than 370 times, i.e. less than once per 100,000 words
  • in other words, only 7244 words appear at least once per 100,000 words
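The thresholds in this list follow from the corpus size: with about 37 million words, once per million corresponds to 37 occurrences and once per 100,000 to 370. The sketch below shows how such shares could be computed from a vector of per-word counts. The helper name share_below and the toy counts are assumptions for illustration; they are not the report's data.

```r
# Hypothetical helper: percent of distinct words whose count is below each threshold.
share_below <- function(word_counts, thresholds) {
  sapply(thresholds, function(k) 100 * mean(word_counts < k))
}

# Toy per-word counts standing in for table(tokens) on the real corpus
toy_counts <- c(the = 50, and = 30, cat = 4, dog = 2,
                bird = 1, fish = 1, emu = 1, yak = 1)
total <- sum(toy_counts)

# Share of distinct words seen only once (count < 2) and fewer than 10 times
shares <- share_below(toy_counts, c(2, 10))

# Words occurring at least once per 10 tokens need at least total / 10 occurrences
min_count <- total / 10
frequent <- names(toy_counts)[toy_counts >= min_count]

print(shares)
print(frequent)
```

For the blogs file, the analogous cut-off would be total / 100000, which is about 370.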

Outline of model building

  • remove the most frequent words, which have little predictive value, such as 'a', 'the', 'and' …
  • remove rare words that appear less than once per 100,000 words
  • generate bigrams and trigrams of words
  • calculate the probability of each word following a particular word or bigram
  • select the most probable word as the next word
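The outline above can be illustrated with a small bigram-only sketch in base R. It is not the final model: the three-word stopword list, the toy sentences, and the helper names clean_tokens, count_bigrams, and predict_next are all assumptions made for illustration, and rare-word filtering and trigrams are left out for brevity.

```r
# A few illustrative stopwords; a real list would be much longer (assumption)
stopwords <- c("a", "the", "and")

# Hypothetical helper: lower-case word tokens of one line, stopwords removed
clean_tokens <- function(line, stopwords) {
  tokens <- unlist(strsplit(tolower(line), "[^a-z']+"))
  tokens[nzchar(tokens) & !(tokens %in% stopwords)]
}

# Hypothetical helper: bigram counts as a data frame (first, second, n).
# Bigrams are formed after stopword removal, so "love the data" yields (love, data).
count_bigrams <- function(lines, stopwords) {
  pairs <- lapply(lines, function(line) {
    tok <- clean_tokens(line, stopwords)
    if (length(tok) < 2) return(NULL)
    data.frame(first = head(tok, -1), second = tail(tok, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  aggregate(n ~ first + second, data = cbind(pairs, n = 1), FUN = sum)
}

# Hypothetical helper: most probable next word, using P(second | first)
predict_next <- function(bigrams, word) {
  cand <- bigrams[bigrams$first == word, ]
  if (nrow(cand) == 0) return(NA_character_)
  cand$prob <- cand$n / sum(cand$n)
  as.character(cand$second[which.max(cand$prob)])
}

# Toy corpus standing in for the cleaned text files
toy <- c("I love the data science", "I love data", "we love the science of data")
bigrams <- count_bigrams(toy, stopwords)
next_word <- predict_next(bigrams, "love")
print(next_word)
```

In this toy corpus "love" is followed by "data" twice and by "science" once, so "data" is returned. Extending the sketch to trigrams would mean conditioning on the two preceding words instead of one.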