Executive Summary

The capstone is the final part of the Data Science Specialization, in which we are asked to apply data science techniques to natural language processing. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The objective is to create an algorithm that predicts the next word from a short phrase. This milestone report covers reading the data, summary statistics, sampling, preprocessing, and tokenizing.

Reading the .txt files

blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")

Summary statistics

File     Size (MB)   Lines (thousands)   Words (thousands)
Blogs    200.42      899.288             37334.131
News     196.28      77.259              2643.969
Twitter  159.36      2360.148            30373.543
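
Figures like these can be computed in base R. The following is only a sketch: the helper name text_stats and its whitespace-based definition of a "word" are assumptions for illustration, so its counts need not match the table exactly.

```r
# Hypothetical helper: size, line count and word count for one text file.
# A "word" is assumed here to be any run of non-whitespace characters.
text_stats <- function(lines, path = NA_character_) {
  words_per_line <- lengths(regmatches(lines, gregexpr("\\S+", lines)))
  data.frame(
    size_mb        = if (is.na(path)) NA_real_ else file.info(path)$size / 1024^2,
    lines_thousand = length(lines) / 1000,
    words_thousand = sum(words_per_line) / 1000
  )
}

# Example call on one of the files read earlier:
# text_stats(blogs, "./final/en_US/en_US.blogs.txt")
```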

Working with a sample from the data

Because the data set is large, we will work with a smaller sample. We will take a random sample of 10% of each .txt file and combine the three samples into one corpus (sample.txt).
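
A minimal sketch of the sampling step, assuming the sample is drawn line by line without replacement. The function name sample_lines and the seed value are illustrative choices, not part of the original analysis.

```r
# Hypothetical helper: keep a random fraction of the lines of a character vector.
sample_lines <- function(lines, fraction = 0.10) {
  keep <- sample.int(length(lines), size = floor(length(lines) * fraction))
  lines[keep]
}

set.seed(1234)  # arbitrary seed, only so that the sample is reproducible

# Combining the three samples into one file, using the vectors read earlier:
# sample_text <- c(sample_lines(blogs), sample_lines(news), sample_lines(twitter))
# writeLines(sample_text, "sample.txt")
```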

Preprocessing

Computers cannot actually read: to R, punctuation and other special characters just look like more words. Using the tm package, we will:

  • remove punctuation
  • remove numbers
  • convert all characters to lower case
  • remove stopwords (a, and, also, the, etc.)
  • remove profanity (we used a .txt file that contains most of the profane words in English)
  • remove common word endings (e.g., “ing”, “es”, “s”)
  • remove unnecessary whitespace

In the end we will have a corpus of plain text only.
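
The steps above map onto standard tm transformations roughly as follows. This is a sketch, not the exact pipeline used for this report: the function name clean_corpus and the profanity argument (a character vector read from the profanity file) are hypothetical, and stemDocument() additionally requires the SnowballC package to be installed.

```r
library(tm)

# Hypothetical wrapper applying the listed cleaning steps in order.
clean_corpus <- function(lines, profanity = character(0)) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  if (length(profanity) > 0) {
    corpus <- tm_map(corpus, removeWords, profanity)
  }
  corpus <- tm_map(corpus, stemDocument)     # strips common word endings
  corpus <- tm_map(corpus, stripWhitespace)  # collapses repeated whitespace
  corpus
}
```

Lower-casing has to happen before stopword removal, because the stopword list returned by stopwords("english") is in lower case.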

Tokenizing

We will create 1-gram, 2-gram, and 3-gram tokenizers and use them to build term-document matrices, which give the frequency of each n-gram in our corpus. From these frequencies we can plot histograms. The wordcloud package also offers a neat visualisation of the most frequent n-grams in the corpus.
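
To make the idea of n-gram frequencies concrete, here is a small base-R sketch. It deliberately uses strsplit() and table() instead of a tm term-document matrix, and the function name ngram_counts is an assumption for illustration. It expects text that has already been cleaned and contains at least one n-gram.

```r
# Hypothetical helper: frequency of every n-gram in a vector of cleaned lines.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(tokens) {
    tokens <- tokens[nzchar(tokens)]          # drop empty tokens
    if (length(tokens) < n) return(character(0))
    starts <- seq_len(length(tokens) - n + 1)
    vapply(starts,
           function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Example: bar chart of the 20 most frequent bigrams in a cleaned sample
# barplot(head(ngram_counts(sample_text, 2), 20), las = 2)
```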

Key points

  • Reading the files and, in particular, creating the term-document matrices for each n-gram takes a long time.
  • The algorithm should have moderate accuracy while being fast. We might need to reduce our sample further to cut the computation in the prediction algorithm.
  • Our algorithm will probably be based on the Stupid Backoff n-gram model, which is fairly simple, but we will see how it performs!
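
For orientation, Stupid Backoff scores a candidate word by its relative frequency after the longest matching context and, when that n-gram was never seen, backs off to a shorter context while multiplying by a fixed factor (0.4 in the original formulation). The sketch below assumes the counts are stored as a list of named numeric vectors, one per n-gram order. The function name and this data layout are hypothetical and not the final implementation.

```r
# counts[[n]] is assumed to be a named numeric vector of n-gram counts,
# e.g. counts[[2]]["a b"] is the number of times the bigram "a b" occurred.
stupid_backoff_score <- function(word, context, counts, alpha = 0.4) {
  # Use at most as many context words as the highest n-gram order allows.
  context <- tail(context, length(counts) - 1)
  penalty <- 1
  while (length(context) > 0) {
    n    <- length(context) + 1
    ctx  <- paste(context, collapse = " ")
    full <- paste(ctx, word)
    num  <- counts[[n]][full]
    if (!is.na(num) && num > 0) {
      return(penalty * unname(num) / unname(counts[[n - 1]][ctx]))
    }
    context <- context[-1]      # back off: drop the earliest context word
    penalty <- penalty * alpha  # and discount the score
  }
  uni <- counts[[1]][word]
  if (is.na(uni)) return(0)     # unseen word
  penalty * unname(uni) / sum(counts[[1]])
}
```

To predict the next word, one would score every candidate word for the given context and return the highest-scoring ones. Note that these scores are not normalised probabilities.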