This report is the week-two milestone report for the “Data Science Capstone” Coursera course provided by Johns Hopkins University.

Basic summaries of the Corpora

In this step, I first downloaded the corpora files and then read them into R. The three corpora in English (en_US.twitter, en_US.blogs and en_US.news) were analysed. Here is a brief summary of how many lines each file contains and how long the lines are (the mean and maximum number of characters per line):

##              lines mean_nchar max_nchar
## US.twitter 2360148         69       140
## US.blogs    899288        230     40833
## US.news    1010242        201     11384

The twitter file has the most lines, but the lines are short (no more than 140 characters). The blogs and news files contain fewer lines, but their lines can be very long.
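A minimal sketch of how such a summary could be computed in base R; the file paths below are assumptions based on the standard Coursera dataset layout, not the original code:

```r
# Assumed file locations (standard Coursera "final/en_US" layout).
files <- c(US.twitter = "final/en_US/en_US.twitter.txt",
           US.blogs   = "final/en_US/en_US.blogs.txt",
           US.news    = "final/en_US/en_US.news.txt")

summarise_corpus <- function(path) {
  txt <- readLines(path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
  n   <- nchar(txt)
  c(lines = length(txt), mean_nchar = round(mean(n)), max_nchar = max(n))
}

t(sapply(files, summarise_corpus))   # one row per corpus
```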

Sampling, cleaning and profanity filtering

Since an overly large sample would slow down the calculations, I randomly sampled 1/50 of the lines from each of the three corpora and combined them into one sample object.
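A minimal sketch of the sampling step, assuming the three corpora are held in a named list `corpora` of character vectors (the seed value is only an illustration):

```r
set.seed(1234)   # assumed seed, for reproducibility only

sample_fraction <- function(x, fraction = 1/50) {
  x[sample(length(x), size = round(length(x) * fraction))]
}

sampled <- unlist(lapply(corpora, sample_fraction), use.names = FALSE)
```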

For cleaning, I replaced or removed some full-width and other Unicode characters that would not be correctly recognised by the prediction model, and converted all characters to lower case.

Then I removed 377 profanity words/phrases from the sample object to ensure that they will not be predicted by the model.
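A sketch of what the cleaning and profanity filtering might look like; `profanity` is assumed to be a character vector holding the 377 banned words/phrases, and the exact replacements differ from the original:

```r
cleaned <- tolower(sampled)                                         # lower-case everything
cleaned <- iconv(cleaned, from = "UTF-8", to = "ASCII", sub = " ")  # drop full-width / other non-ASCII characters
cleaned <- gsub("\\s+", " ", cleaned)                               # collapse repeated whitespace

# Remove profanity as whole-word matches (assumes the list contains
# plain words/phrases with no regex metacharacters).
profanity_pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
cleaned <- gsub(profanity_pattern, " ", cleaned)
```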

Stemming and tokenization

In this step, I first split each line in the sample into shorter lines by sentence. Because punctuation is not included in this prediction model, a bigram formed from the two words on either side of a punctuation mark would be mostly meaningless.

Punctuation, symbols, numbers, ‘@’, ‘#’ and URLs were removed from the texts, and the remaining text was stemmed.

At this stage, I created only 1-grams, 2-grams and 3-grams to start a basic model.
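The feature/frequency tables below look like quanteda output, so here is a sketch of this step assuming the quanteda package (sentence splitting, token cleaning, stemming, and 1- to 3-gram construction); the object names are illustrative:

```r
library(quanteda)

corp  <- corpus(cleaned)
sents <- corpus_reshape(corp, to = "sentences")        # split each line into sentences

toks <- tokens(sents,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)
toks <- tokens_remove(toks, pattern = c("@*", "#*"))   # drop handles and hashtags
toks <- tokens_wordstem(toks)                          # stem the remaining words

# Document-feature matrices for 1-, 2- and 3-grams ("_" joins the words).
dfm_1 <- dfm(toks)
dfm_2 <- dfm(tokens_ngrams(toks, n = 2))
dfm_3 <- dfm(tokens_ngrams(toks, n = 3))
```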

Statistics

First, let’s start with 1-grams, that is, all the single words that appeared in the sample. According to the data, there are 50613 different 1-grams in the sample, and they appeared 1.94132 × 10^6 times in total. A histogram shows the distribution of how frequently each 1-gram appeared in the sample:

Oops! It seems that the majority of 1-grams appeared fewer than 5000 times, while a few words appeared over 80000 times. So the histogram has to be log-transformed, as follows:
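A sketch of how such a log-scaled histogram could be drawn with ggplot2, assuming the 1-gram frequencies are in a data frame `freq_1` with a `frequency` column (e.g. the output of quanteda.textstats::textstat_frequency()):

```r
library(ggplot2)

ggplot(freq_1, aes(x = frequency)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +                 # log-scale both axes so the long tail is visible
  scale_y_log10() +
  labs(x = "times a 1-gram appears",
       y = "number of 1-grams")
```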

Now it’s much clearer that, among all the 50613 different 1-grams, over 30000 appeared only once, about 10000 appeared only 2~10 times, while fewer than 100 appeared 1000~100000 times each. This frequency distribution can also be displayed in another form:

From the plot above it can be seen that the cumulative coverage curve rises very steeply: only 117 of the 50613 1-grams are enough to cover 50% of the text in the sample, and 3382 1-grams cover 90%. The top-10 1-grams are:

##    feature frequency
## 1      the     94049
## 2       to     54244
## 3      and     47040
## 4        a     46840
## 5       of     39329
## 6        i     32432
## 7       in     32182
## 8       it     24841
## 9     that     21893
## 10     for     21496
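A sketch of how the coverage figures and the top-10 table could be computed, assuming the quanteda objects from the earlier sketch:

```r
library(quanteda.textstats)

freq_1   <- textstat_frequency(dfm_1)        # features sorted by frequency, descending
coverage <- cumsum(freq_1$frequency) / sum(freq_1$frequency)

min(which(coverage >= 0.5))   # number of 1-grams needed to cover 50% of the text
min(which(coverage >= 0.9))   # number of 1-grams needed to cover 90% of the text

head(freq_1[, c("feature", "frequency")], 10)   # top-10 1-grams
```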

As for 2-grams, there are 567608 different 2-grams in the sample, and they appeared 1.708452 × 10^6 times in total; that is far more distinct types than for 1-grams. The histogram, cumulative coverage curve and top-10 are shown below:

##     feature frequency
## 1    of_the      8647
## 2    in_the      7936
## 3    to_the      4268
## 4   for_the      3887
## 5    on_the      3854
## 6     to_be      3217
## 7    at_the      2850
## 8   and_the      2437
## 9      in_a      2262
## 10 with_the      2091

As for 3-grams, there are 1107117 different 3-grams in the sample, and they appeared 1.475584 × 10^6 times in total. The histogram, cumulative coverage curve and top-10 are shown below:

##            feature frequency
## 1       one_of_the       721
## 2         a_lot_of       586
## 3    thank_for_the       459
## 4        i_want_to       443
## 5         go_to_be       351
## 6          to_be_a       346
## 7       out_of_the       316
## 8       the_end_of       313
## 9  look_forward_to       311
## 10       be_abl_to       304

It can be inferred that, for N-grams, the larger N is, the more distinct combinations there are and the less frequently each of them appears; the cumulative coverage curve therefore rises less steeply, so more N-grams are needed to cover a given percentage of the text.

Further plans and confusions

I’ve read about the Katz back-off model and understand how it works, but at present I have no idea how to implement it in R. On my computer, calculating 4-grams is already very time-consuming, so I cannot imagine 5- or 6-grams; how to improve efficiency is a big problem. I also haven’t yet understood how to evaluate the model.

I removed all punctuation from my sample text, but I wonder whether it could be kept and used in a prediction model.

I stemmed the words for building the model, but do I need to consider how to restore them to their normal (unstemmed) forms?

A lot to learn…

If you would like to help clear up any of my confusions, please reply or email me at catreewp@outlook.com. Thank you so much!