1 Project overview

This project applies natural language processing (NLP): using a computer to learn and understand language and to make predictions about words or sentences. It is also a text mining project. The data come from HC Corpora and comprise three data sets: 1) US blogs, 2) US Twitter, 3) US news.

We take a statistical approach to NLP, using statistical methods (e.g. corpora and Markov models) in the R language environment. The method applied here is inspired by “Text Mining Infrastructure in R” and uses R packages such as tm and RWeka. Online tutorials were also extremely helpful.

2 Basic summary of the data

The data sets are loaded into R separately and each is stored in its own “.RData” file. After that, I looked at some basic features of these data files, summarized in Table 1.

Table 1: Data table
                   blogs       news        twitter
File size (MB)     200         196         159
Number of lines    899288      1010242     2360148
Word count         37334690    34372720    30374206

As Table 1 shows, the Twitter file has the largest number of lines, while the blogs file has the largest file size and the most words but the fewest lines. All three files are large.
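The three quantities in Table 1 can be computed in a single streaming pass over a file. The report's own numbers were produced in R; the Python sketch below is only an illustration, and it assumes that a "word" is any whitespace-delimited token, which may differ from the counting rule actually used. The function name and the tiny demo file are hypothetical.

```python
import os
import tempfile

def summarize_text_file(path):
    """Return size in MB, line count and whitespace-delimited word count."""
    n_lines = 0
    n_words = 0
    # Stream line by line so that a large file never has to fit in memory.
    with open(path, encoding="utf-8", errors="ignore") as handle:
        for line in handle:
            n_lines += 1
            n_words += len(line.split())
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return {"size_mb": size_mb, "lines": n_lines, "words": n_words}

# Tiny stand-in file (invented content, not the HC Corpora data).
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("the quick brown fox\njumps over the lazy dog\n")
    demo_path = tmp.name

demo_summary = summarize_text_file(demo_path)
os.remove(demo_path)
print(demo_summary)
```

Reading line by line keeps memory use flat, which matters for files of roughly 200 MB each.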

3 Features of the data

The large size of the data makes analysis on a personal computer difficult. To get the features of the data, we randomly select a 1/10 subset of each data set for exploratory data analysis. Using the tm framework, we constructed a corpus object, cleaned it up and obtained the term frequency matrix for each subset.
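The sample, clean and count steps can be sketched as follows. This is not the tm pipeline used in the report: it is a standard-library Python stand-in, and the cleaning rule (lowercase, keep only the letters a to z) is an assumption about what the clean-up did. All function names are hypothetical.

```python
import random
import re
from collections import Counter

def sample_lines(lines, one_in=10, seed=42):
    """Draw a random subset containing 1/one_in of the lines."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    if not lines:
        return []
    k = min(len(lines), max(1, len(lines) // one_in))
    return rng.sample(lines, k)

def clean_and_tokenize(line):
    """Lowercase, drop everything except letters and whitespace, then split."""
    line = re.sub(r"[^a-z\s]", "", line.lower())
    return line.split()

def term_frequencies(lines):
    """Count how often each cleaned token occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(clean_and_tokenize(line))
    return counts

# Invented demo input, not taken from the corpus.
all_lines = [f"line number {i}" for i in range(100)]
subset = sample_lines(all_lines)

demo_counts = term_frequencies(
    ["I'm going to the game!", "The game was great, 10/10."]
)
print(len(subset), demo_counts.most_common(2))
```

Under this cleaning rule "I'm" becomes `im`, which is consistent with the token `im` that appears among the frequent Twitter words in this report.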

3.1 Frequency of words in each file

Figure 1: Word frequency distribution

As Fig. 1 shows, the word distributions in the three files have a similar shape with a long tail. This means that the majority of words in each file appear only once, and only a small fraction of words have a high frequency.
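The long-tail claim can be checked numerically by counting how many distinct words occur exactly once. The Python sketch below works on a plain word-to-count mapping; the helper names and the toy sentence are hypothetical and are not part of the report's R code.

```python
from collections import Counter

def hapax_share(term_counts):
    """Fraction of distinct words that occur exactly once."""
    if not term_counts:
        return 0.0
    singletons = sum(1 for count in term_counts.values() if count == 1)
    return singletons / len(term_counts)

def frequency_spectrum(term_counts):
    """Map each frequency to the number of distinct words that have it."""
    return Counter(term_counts.values())

# Toy example (invented text, not corpus data).
toy_counts = Counter("the cat sat on the mat the end".split())
toy_share = hapax_share(toy_counts)
toy_spectrum = frequency_spectrum(toy_counts)
print(toy_share, dict(toy_spectrum))
```

A share well above one half, together with a spectrum that drops quickly as the frequency grows, is what a long-tailed distribution looks like in numbers.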

Figure 2: Top 10 most frequent words

3.2 Frequency of words in all three files (word cloud)

Next, Fig. 2 shows the most frequent words in the three files. Interestingly, will and like are in the top 10 of all three files. Owing to the characteristics of each file type, some top words are specific to one file type. For example, im is the most frequent word in the Twitter file, probably because Twitter is more personal. Likewise, one is the most frequent word in the news file, perhaps because news is supposed to be objective.
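Finding the top words per file, and the words shared by every top list, takes only a few lines once the term frequencies exist. The Python sketch below is an illustration rather than the report's R code; the counts in it are invented toy values chosen so that the result is easy to follow, and the function names are hypothetical.

```python
from collections import Counter

def top_words(term_counts, n=10):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _ in term_counts.most_common(n)]

def shared_top_words(counts_by_source, n=10):
    """Return the words that are in the top n of every source."""
    top_sets = [set(top_words(counts, n)) for counts in counts_by_source.values()]
    return set.intersection(*top_sets)

# Invented toy counts, NOT the real corpus frequencies.
toy_counts = {
    "blogs": Counter({"will": 5, "like": 4, "one": 1}),
    "news": Counter({"one": 6, "will": 5, "like": 2}),
    "twitter": Counter({"im": 7, "will": 3, "like": 2}),
}
toy_shared = shared_top_words(toy_counts, n=2)
print(top_words(toy_counts["news"], n=1), toy_shared)
```

With `n=10` and the real term frequencies, `shared_top_words` would return the words common to all three top 10 lists.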

A word cloud is another way to see which words are most frequent. Below, we show the clouds for all three datasets.

Figure 3: Word clouds of all three datasets combined.

The most frequent words are: of, a, and, to, the, s. If we remove them and replot the word cloud, it becomes more informative.

Figure 4: Word clouds of all three datasets combined (omitting top 5 words)
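The filtering step behind this second cloud can be sketched as dropping the k most frequent entries from the term counts before plotting. The Python snippet below covers only that filtering step, not the plotting; the function name and the toy counts are hypothetical.

```python
from collections import Counter

def drop_top_words(term_counts, k=5):
    """Return a copy of the counts without the k most frequent words."""
    top = {word for word, _ in term_counts.most_common(k)}
    return Counter({w: c for w, c in term_counts.items() if w not in top})

# Invented toy counts, not corpus data.
toy_counts = Counter({"the": 50, "to": 40, "and": 30, "love": 5, "time": 4})
toy_filtered = drop_top_words(toy_counts, k=3)
print(toy_filtered)
```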

4 Plan for next steps

Next, I will briefly describe my plan for the next steps.

  1. Build a basic n-gram model, using a back-off model for smoothing.
  2. Combine these basic models and use a random tree method to improve prediction accuracy.
  3. Put the model in a Shiny app, with documentation of the exact model. The Shiny app will let the user enter any number of words and will predict the next word.
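The first step of the plan can be illustrated with a minimal next-word predictor. The Python sketch below uses a plain back-off rule: try the longest available context and fall back to shorter ones, ending with the overall most frequent word, whenever a context was never seen. It applies no discounting, so it is simpler than Katz back-off and is not necessarily the model the final app will use. The training sentences and all function names are hypothetical.

```python
from collections import Counter, defaultdict

def build_ngram_tables(sentences, max_order=3):
    """tables[k] maps a k-word context tuple to a Counter of following words."""
    tables = {k: defaultdict(Counter) for k in range(max_order)}
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_order):
                if i - k >= 0:  # enough preceding words for a k-word context
                    context = tuple(tokens[i - k:i])
                    tables[k][context][word] += 1
    return tables

def predict_next(tables, text):
    """Predict the next word, backing off to shorter contexts when needed."""
    tokens = text.lower().split()
    longest = max(tables)  # largest context length that was trained
    for k in range(min(longest, len(tokens)), -1, -1):
        context = tuple(tokens[len(tokens) - k:])  # last k words; () when k == 0
        followers = tables[k].get(context)
        if followers:
            return followers.most_common(1)[0][0]
    return None  # only reached when the tables are empty

# Invented training sentences, not corpus data.
toy_sentences = [
    "i like green tea",
    "i like green apples",
    "i like green tea a lot",
    "you like coffee",
]
toy_tables = build_ngram_tables(toy_sentences)
print(predict_next(toy_tables, "i like green"))      # trigram context is known
print(predict_next(toy_tables, "they really like"))  # backs off to the bigram
print(predict_next(toy_tables, "unknown zebra"))     # backs off to the unigram
```

Because `predict_next` accepts input of any length and only looks at the last few words, it matches the planned app behaviour of predicting the next word from an arbitrary number of input words.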