The capstone is the final part of the Data Science Specialization, in which we are asked to apply data science techniques to natural language processing. The data come from a corpus called HC Corpora (www.corpora.heliohost.org). The objective is to create an algorithm that predicts the next word from a short phrase. This milestone report loads the data, summarizes the three source files, and outlines the sampling, cleaning, and n-gram exploration steps.
```r
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8")
```
| File | Size (MB) | Lines (thousands) | Words (thousands) |
|---|---|---|---|
| Blogs | 200.42 | 899.288 | 37334.131 |
| News | 196.28 | 77.259 | 2643.969 |
| Twitter | 159.36 | 2360.148 | 30373.543 |
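Summary figures like those in the table could be computed along the following lines. This is a minimal sketch, not necessarily how the numbers above were produced: `summarise_file` is a hypothetical helper, and counting words by splitting on whitespace is an assumption (a different tokenization would give different word counts). It is demonstrated on a small temporary file rather than on the real corpus.

```r
# Hypothetical helper: size on disk, line count and word count for one text file.
# Assumption: a "word" is any run of non-whitespace characters.
summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  # trimws() avoids an empty leading token for lines that start with spaces
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    size_mb = file.info(path)$size / 1024^2,
    lines_k = length(lines) / 1000,
    words_k = sum(words_per_line) / 1000
  )
}

# Toy input standing in for one of the en_US files
tmp <- tempfile(fileext = ".txt")
writeLines(c("one two three", "", "  four five "), tmp)

stats <- summarise_file(tmp)
print(stats)
```

Applied to the real data, the same helper would be called once per file path and the rows bound together with `rbind()`.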
Because the data set is large, it will be necessary to work with a smaller sample. We will take a random 10% sample of the lines in each .txt file and combine the three samples into one corpus (sample.txt).
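The sampling step could look roughly like this. It is a sketch under stated assumptions: `sample_fraction` is a hypothetical helper, the seed value is arbitrary, and the three toy vectors stand in for the `blogs`, `news` and `twitter` vectors read from disk. The output is written to a temporary directory here so that the example does not touch the working directory.

```r
set.seed(1234)  # arbitrary seed (an assumption) so the sample is reproducible

# Hypothetical helper: draw a random fraction of the lines, without replacement
sample_fraction <- function(lines, fraction = 0.10) {
  n <- ceiling(length(lines) * fraction)
  lines[sample.int(length(lines), n)]
}

# Toy stand-ins for the three character vectors loaded with readLines()
toy_blogs   <- paste("blog line", 1:50)
toy_news    <- paste("news line", 1:30)
toy_twitter <- paste("tweet", 1:20)

# 10% of each source, combined into one sample corpus
sample_text <- c(
  sample_fraction(toy_blogs),
  sample_fraction(toy_news),
  sample_fraction(toy_twitter)
)

# Write the combined sample to sample.txt (in a temp directory for this sketch)
out_path <- file.path(tempdir(), "sample.txt")
writeLines(sample_text, out_path, useBytes = TRUE)
```

Sampling each file separately keeps the three sources represented in proportion to their line counts, whereas sampling from the pooled lines would give the same expected proportions but more variable ones.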
Computers cannot actually read: to R, punctuation and other special characters look like just more words. We will therefore use the tm package to clean the sample, so that in the end we have a corpus of plain text only.
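A cleaning pipeline with tm might be sketched as follows. The exact list of transformations is an assumption, since the report only states that punctuation and special characters should go; lower-casing, number removal and whitespace collapsing are common additions, not steps confirmed by the text. The two toy strings stand in for the lines of sample.txt.

```r
library(tm)

# Toy stand-in for the lines of sample.txt
toy_lines <- c("Hello, World!  It's 2015...", "R & text-mining: 100% fun?")

corpus <- VCorpus(VectorSource(toy_lines))

# Assumed cleaning steps, applied document by document
corpus <- tm_map(corpus, content_transformer(tolower))  # lower-case everything
corpus <- tm_map(corpus, removePunctuation)             # drop punctuation characters
corpus <- tm_map(corpus, removeNumbers)                 # drop digits
corpus <- tm_map(corpus, stripWhitespace)               # collapse runs of whitespace

# Pull the cleaned text back out as a plain character vector
cleaned <- trimws(unname(vapply(corpus, function(doc) {
  paste(as.character(doc), collapse = " ")
}, character(1))))
print(cleaned)
```

The order matters slightly: stripping whitespace last tidies up the gaps left behind when punctuation and digits are removed.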
We will create 1-gram, 2-gram, and 3-gram tokenizers and use them to build term-document matrices, which give the frequency of each n-gram in our corpus. From these frequencies we can plot histograms of the most common n-grams. The wordcloud package also offers a neat visualisation of the most frequent n-grams in our corpus.
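One way to build such tokenizers and frequency tables is sketched below. It relies on `ngrams()` and `words()` from the NLP package (attached together with tm) rather than on any particular tokenizer the report may have used. `make_ngram_tokenizer` and `ngram_freq` are hypothetical helper names, and the two-sentence toy corpus stands in for the cleaned sample. Setting `wordLengths = c(1, Inf)` is a deliberate choice so that short words are not dropped by tm's default minimum term length of three characters.

```r
library(tm)  # also attaches NLP, which provides words() and ngrams()

# Hypothetical helper: returns a tokenizer that emits space-joined n-grams
make_ngram_tokenizer <- function(n) {
  function(doc) {
    unlist(lapply(ngrams(words(doc), n), paste, collapse = " "),
           use.names = FALSE)
  }
}

# Hypothetical helper: n-gram frequencies across the corpus, most frequent first
ngram_freq <- function(corpus, n) {
  tdm <- TermDocumentMatrix(
    corpus,
    control = list(tokenize = make_ngram_tokenizer(n),
                   wordLengths = c(1, Inf))  # keep short terms such as "on"
  )
  sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
}

# Toy corpus standing in for the cleaned sample
toy <- VCorpus(VectorSource(c("the cat sat on the mat",
                              "the cat ate the fish")))

unigram_freq <- ngram_freq(toy, 1)
bigram_freq  <- ngram_freq(toy, 2)
trigram_freq <- ngram_freq(toy, 3)

print(head(bigram_freq, 3))
```

The resulting named vectors can be passed to `barplot(head(bigram_freq, 10))` for a frequency plot, or to `wordcloud::wordcloud(names(bigram_freq), bigram_freq)` for a word cloud. On the full sample, `as.matrix()` on a large term-document matrix can exhaust memory, so a sparse row sum (for example `slam::row_sums`) may be preferable there.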