This project is about natural language processing (NLP): using a computer to learn and understand language and to make predictions about words or sentences. It is also a text mining project. The data come from HC Corpora and comprise three data sets: 1) US blogs, 2) US Twitter, and 3) US news.
We take a statistical approach to NLP, using statistical methods (e.g. corpora and Markov models) in the R language environment. The method applied here is inspired by Text Mining Infrastructure in R and uses R packages such as tm and RWeka. Additionally, the online tutorials were extremely helpful.
The data sets are loaded into R separately, and each is stored in its own “.RData” file. After that, I looked at some basic features of these data files, summarized in Table 1.
| | blogs | news | twitter |
|---|---|---|---|
| file size (MB) | 200 | 196 | 159 |
| number of lines | 899288 | 1010242 | 2360148 |
| word count | 37334690 | 34372720 | 30374206 |
As the table shows, the Twitter file has the largest number of lines, while the blogs file has the largest file size and the largest number of words but the smallest number of lines. All three files are large.
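The basic features in Table 1 (file size, line count, word count) could be gathered along the following lines. This is only a minimal sketch in Python rather than the R code used for this report; the function name `basic_stats` is hypothetical, and it assumes that a "word" is any whitespace-delimited token, so the counts may differ slightly from those of other tools.

```python
import os
import tempfile

def basic_stats(path):
    """Return size in MB, line count, and whitespace-delimited word count."""
    lines = 0
    words = 0
    # Stream the file line by line so that a large file never has to fit in memory.
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for line in fh:
            lines += 1
            words += len(line.split())
    size_mb = os.path.getsize(path) / (1024 * 1024)
    return {"size_mb": size_mb, "lines": lines, "words": words}

# Tiny demo on a throwaway file (placeholder text, not the HC Corpora data).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first line of text\nsecond line\n")
    demo_path = tmp.name

demo_stats = basic_stats(demo_path)
os.remove(demo_path)
print(demo_stats["lines"], demo_stats["words"])
```

Reading line by line keeps memory use flat, which matters for files of the size listed in Table 1.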
The large size of the data is a problem for analysis on a personal computer. In order to explore the features of the data, we randomly select a 1/10 subset of each data set for exploratory data analysis. Using the tm framework, we constructed a corpus object, cleaned it up, and obtained the term frequency matrix for each subset.
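The sampling, cleaning, and term-counting steps can be illustrated with a short sketch. It is written in plain Python rather than with the tm framework, and the cleaning rules (lowercasing, dropping everything except the letters a-z and whitespace) are assumptions, since the exact cleaning steps are not listed here. The names `sample_lines`, `clean`, and `term_frequencies` are hypothetical.

```python
import random
import re
from collections import Counter

def sample_lines(lines, fraction=0.1, seed=42):
    """Draw a random subset of lines without replacement (fixed seed for repeatability)."""
    rng = random.Random(seed)
    k = max(1, int(len(lines) * fraction))
    return rng.sample(lines, k)

def clean(text):
    """Lowercase, then strip everything except a-z and whitespace (assumed rules).

    Note that this also removes digits, punctuation, and non-ASCII letters,
    so "I'm" becomes "im".
    """
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return text.split()

def term_frequencies(lines):
    """Count how often each cleaned token occurs across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(clean(line))
    return counts

# Toy input for illustration only.
toy_lines = ["I'm here, 2 times!", "Here we go"]
toy_tf = term_frequencies(toy_lines)
toy_sample = sample_lines([str(i) for i in range(100)])
print(toy_tf.most_common(3), len(toy_sample))
```

A fixed seed makes the 1/10 subset reproducible, which helps when results from different runs need to be compared.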
Distribution of word frequencies
As can be seen from Fig. 1, the word distributions in the three files have a similar shape, with a long tail. This distribution means that the majority of the words in each file appear only once, and only a small fraction of words have a high frequency.
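One way to check the long-tail claim numerically is to count how many distinct words occur exactly once, twice, and so on. The sketch below is a Python illustration under the assumption that term counts are available as a word-to-count mapping; `frequency_spectrum` and `singleton_share` are hypothetical names, and the toy sentence is made up.

```python
from collections import Counter

def frequency_spectrum(term_counts):
    """Map each frequency f to the number of distinct words occurring exactly f times."""
    return Counter(term_counts.values())

def singleton_share(term_counts):
    """Fraction of distinct words that occur exactly once."""
    if not term_counts:
        return 0.0
    return frequency_spectrum(term_counts)[1] / len(term_counts)

# Toy example: "the" occurs three times, every other word once.
toy_counts = Counter("the cat sat on the mat the end".split())
toy_spectrum = frequency_spectrum(toy_counts)
toy_share = singleton_share(toy_counts)
print(dict(toy_spectrum), toy_share)
```

A singleton share above 0.5 on a real sample would support the statement that most distinct words appear only once.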
Top 10 most frequent words
Next, we look at the most frequent words in the three files in Fig. 2. It is interesting to see that “will” and “like” are in the top 10 of all the files. Owing to the characteristics of each file type, some top words are specific to a certain file type. For example, “im” is the most frequent word in the Twitter file, probably because Twitter is more personal. Likewise, “one” is the most frequent word in the news file, perhaps because news is supposed to be objective.
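Finding the words that appear in the top list of every source amounts to intersecting the per-source top-k sets. The following Python sketch shows the idea; the helper names are hypothetical and the counts are invented toy numbers, not the frequencies behind Fig. 2.

```python
from collections import Counter

def top_words(term_counts, k=10):
    """Return the k most frequent words, most frequent first."""
    return [word for word, _ in term_counts.most_common(k)]

def shared_top_words(counts_by_source, k=10):
    """Return the words that are in the top k of every source."""
    tops = [set(top_words(counts, k)) for counts in counts_by_source.values()]
    return set.intersection(*tops) if tops else set()

# Invented toy counts, for illustration only.
toy_sources = {
    "blogs": Counter({"will": 5, "like": 4, "one": 1}),
    "news": Counter({"one": 6, "will": 3, "like": 2}),
    "twitter": Counter({"im": 7, "like": 5, "will": 2}),
}
toy_shared = shared_top_words(toy_sources, k=3)
print(sorted(toy_shared))
```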
Next, I will briefly describe my plan for the next steps.
N-gram model, with a back-off model for smoothing
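To make the back-off idea concrete, here is a minimal Python sketch of a bigram model that falls back to unigram counts when a bigram was never seen. The plan above does not say which back-off variant will be used, so this sketch swaps in a very simple scheme with a fixed back-off factor (often called "stupid backoff"), not Katz back-off. The class name `BackoffBigramModel`, the factor `alpha=0.4`, and the toy sentences are all assumptions for illustration.

```python
from collections import Counter, defaultdict

class BackoffBigramModel:
    """Bigram model that backs off to unigram frequencies (simplified scheme)."""

    def __init__(self, alpha=0.4):
        self.alpha = alpha                    # fixed back-off factor (assumed value)
        self.unigrams = Counter()             # word -> count
        self.bigrams = defaultdict(Counter)   # previous word -> Counter of next words
        self.total = 0                        # total number of tokens seen

    def train(self, sentences):
        """Update counts from an iterable of token lists."""
        for tokens in sentences:
            self.unigrams.update(tokens)
            self.total += len(tokens)
            for prev, nxt in zip(tokens, tokens[1:]):
                self.bigrams[prev][nxt] += 1

    def score(self, prev, word):
        """Relative bigram frequency if seen, else a discounted unigram frequency.

        The scores are not normalized probabilities.
        """
        followers = self.bigrams.get(prev)
        if followers and followers[word] > 0:
            return followers[word] / sum(followers.values())
        if self.total == 0:
            return 0.0
        return self.alpha * self.unigrams[word] / self.total

    def predict(self, prev, k=3):
        """Suggest up to k next words, backing off to the most common unigrams."""
        followers = self.bigrams.get(prev)
        if followers:
            return [w for w, _ in followers.most_common(k)]
        return [w for w, _ in self.unigrams.most_common(k)]

# Toy training data, for illustration only.
model = BackoffBigramModel()
model.train([["i", "like", "tea"], ["i", "like", "coffee"], ["you", "like", "tea"]])
print(model.predict("like", k=1), model.predict("unknown", k=1))
```

The same pattern extends to higher orders: try the longest available context first and move to shorter contexts only when the longer one has no counts.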