Executive Summary

The capstone is the final part of the Data Science Specialization, and we are asked to apply data science techniques in the area of natural language processing. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The objective is to create an algorithm that can predict the next word from a short phrase. In this milestone report:

we will describe our first approach to the data,
we will present some simple statistics for the data,
we will do exploratory analysis to gain insight into the data and share our thoughts for next steps,
we will explore a variety of R packages, such as NLP, tm and RWeka, to help us in this project.

Reading the .txt files

blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")

Summary statistics

File_Name	File_Size (MB)	# Lines (thousand)	# Words (thousand)
Blogs	200.42	899.288	37334.131
News	196.28	77.259	2643.969
Twitter	159.36	2360.148	30373.543
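
Counts of this kind can be computed along the following lines. This is a minimal sketch and not necessarily the code behind the table: the helper name summarise_text is hypothetical, and counting words by splitting on whitespace is an assumption.

```r
# Hypothetical helper: line and word counts (in thousands) for a character vector.
# Assumption: a "word" is any run of characters separated by whitespace.
summarise_text <- function(lines) {
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    lines_thousand = length(lines) / 1000,
    words_thousand = sum(words_per_line) / 1000
  )
}

# Tiny stand-in for one of the real files
demo <- c("The quick brown fox", "jumps over", "the lazy dog")
summarise_text(demo)

# For a real file, the size in MB could be obtained with, for example:
# file.size("./final/en_US/en_US.blogs.txt") / 1024^2
```

With the real data, the same helper would be called on the blogs, news and twitter vectors read above.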

Working with a sample from the data

Due to the large size of the data, it will be necessary to work with a smaller sample. We will take a random sample of 10% of each .txt file and combine the three samples into one corpus (sample.txt).
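
The sampling step could look roughly like this. It is a sketch under stated assumptions: sampling is done by line, the seed value and the helper name sample_lines are hypothetical, and small generated vectors stand in for the real files.

```r
set.seed(2017)  # assumed seed, used only to make the sample reproducible

# Hypothetical helper: keep a random fraction of the lines of a character vector.
sample_lines <- function(lines, fraction = 0.10) {
  n_keep <- round(length(lines) * fraction)
  lines[sample.int(length(lines), n_keep)]
}

# Stand-ins for the blogs, news and twitter vectors
blogs_demo   <- paste("blog line", 1:100)
news_demo    <- paste("news line", 1:50)
twitter_demo <- paste("tweet", 1:200)

# Combine the three 10% samples into one character vector
sample_text <- c(sample_lines(blogs_demo),
                 sample_lines(news_demo),
                 sample_lines(twitter_demo))
length(sample_text)

# With the real data, the combined sample could then be written to disk:
# writeLines(sample_text, "sample.txt")
```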

Preprocessing

Computers cannot actually read: to R, punctuation and other special characters look like just more words. So, using the tm package, we will:

remove punctuation,
remove numbers,
convert all characters to lower case,
remove stopwords (a, and, also, the, etc.),
remove profanity (we used a .txt file that contains most of the profane words in English),
remove common word endings (e.g., “ing”, “es”, “s”),
remove unnecessary whitespace.

So in the end we will have a corpus of plain text only.
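
The cleaning steps listed above can be illustrated with a small base R stand-in. This is only a sketch and not the tm pipeline itself: in tm the corresponding transformations (removePunctuation, removeNumbers, removeWords, stemDocument, stripWhitespace) would be applied with tm_map. The word lists below are short hypothetical placeholders, the function name clean_text is an assumption, and stemming is left out because it needs an extra package.

```r
# Hypothetical word lists; the real stopword list and profanity file are not reproduced here.
stop_words    <- c("a", "and", "also", "the")
profane_words <- c("darn")  # placeholder entry

clean_text <- function(x, drop_words = c(stop_words, profane_words)) {
  x <- tolower(x)                          # convert to lower case
  x <- gsub("[[:punct:]]+", "", x)         # remove punctuation
  x <- gsub("[[:digit:]]+", " ", x)        # remove numbers
  pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, " ", x, perl = TRUE)  # remove stopwords and profanity
  x <- gsub("\\s+", " ", x)                # collapse repeated whitespace
  trimws(x)                                # drop leading and trailing spaces
}

clean_text("The 3 dogs, and also a cat!")
# returns "dogs cat"
```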

Tokenizing

We will create 1-gram, 2-gram, and 3-gram tokenizers and use them to build term-document matrices that give the frequency of each n-gram in our corpus. From these frequencies we will be able to plot histograms. The wordcloud package also offers a neat visualisation of the most frequent n-grams in our corpus.
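
To make the idea concrete, here is a minimal base R sketch of n-gram counting. It is not the tokenizer used in the project (packages such as RWeka provide ready-made n-gram tokenizers); the helper names make_ngrams and ngram_counts are hypothetical, and the input is assumed to be already cleaned, space-separated text.

```r
# Hypothetical helper: all n-grams of one cleaned, space-separated string.
make_ngrams <- function(text, n) {
  words <- strsplit(text, " ", fixed = TRUE)[[1]]
  words <- words[nzchar(words)]                    # drop empty tokens
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over a character vector, most frequent first.
ngram_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

demo <- c("i love new york", "i love new ideas", "new york is big")
head(ngram_counts(demo, 2), 3)   # the three most frequent 2-grams

# A bar chart of the top n-grams could then be drawn with, for example:
# barplot(head(ngram_counts(demo, 2), 3), las = 2)
```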

Key points

Computation times are long, in particular for reading the data and for creating the term-document matrix for each n-gram.
The algorithm should have moderate accuracy while being fast. We might need to reduce our sample further to reduce the computation needed by the prediction algorithm.
Our algorithm will probably be based on the Stupid Backoff n-gram model, which is fairly simple, but we will see how it performs!
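
As a first idea of how Stupid Backoff works, here is a minimal sketch in base R. It is an illustration under assumptions and not our final implementation: the toy corpus, the function names and the restriction to at most 3-grams are hypothetical choices, and the backoff factor alpha = 0.4 is the value commonly used with this model. A word is scored by the relative frequency of the longest matching n-gram; when that n-gram was never seen, the score of the shorter context is used instead, multiplied by alpha.

```r
# Count n-grams of order n in a small, already cleaned corpus.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

corpus <- c("i love new york", "i love new ideas", "new york is big")
counts <- lapply(1:3, function(n) count_ngrams(corpus, n))  # 1-, 2-, 3-grams

# Look up the count of an n-gram, returning 0 if it was never seen.
get_count <- function(gram) {
  n   <- length(strsplit(gram, " ", fixed = TRUE)[[1]])
  tab <- counts[[n]]
  if (gram %in% names(tab)) as.numeric(tab[gram]) else 0
}

# Stupid Backoff score of `word` given a context of up to two words.
stupid_backoff <- function(context, word, alpha = 0.4) {
  if (length(context) == 0) {
    return(get_count(word) / sum(counts[[1]]))      # unigram relative frequency
  }
  ctx  <- paste(context, collapse = " ")
  full <- get_count(paste(ctx, word))
  if (full > 0) return(full / get_count(ctx))       # n-gram seen: relative frequency
  alpha * stupid_backoff(context[-1], word, alpha)  # unseen: drop the oldest word
}

# Predict the next word: score every vocabulary word and keep the best one.
predict_next <- function(phrase, alpha = 0.4) {
  context <- tail(strsplit(phrase, " ", fixed = TRUE)[[1]], 2)
  vocab   <- names(counts[[1]])
  scores  <- vapply(vocab, function(w) stupid_backoff(context, w, alpha),
                    numeric(1))
  names(which.max(scores))
}

predict_next("i love")
# returns "new"
```

Scoring every vocabulary word is fine for a toy corpus but would be slow on the real sample, which is one reason we may need to prune rare n-grams or reduce the sample.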

CapStone Project Milestone Report

Thymios C

11 February 2017