This report is a milestone report for the second week of the "Data Science Capstone" Coursera course provided by Johns Hopkins University.
In this step, I first downloaded the corpus files and then read them into R. The three English corpora (en_US.twitter, en_US.blogs and en_US.news) were analysed. Here is a brief summary of how many lines each contains and how long the lines are (the mean and maximum number of characters):
##              lines mean_nchar max_nchar
## US.twitter 2360148         69       140
## US.blogs    899288        230     40833
## US.news    1010242        201     11384
The Twitter file has the most lines, but its lines are short (no more than 140 characters). The blogs and news files contain fewer lines, but individual lines can be very long.
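A minimal sketch of how this summary can be computed (the file paths below are assumptions based on the standard capstone layout):

```r
# Read the three English corpora (paths assumed; adjust as needed)
files <- c(US.twitter = "final/en_US/en_US.twitter.txt",
           US.blogs   = "final/en_US/en_US.blogs.txt",
           US.news    = "final/en_US/en_US.news.txt")
corpora <- lapply(files, readLines, encoding = "UTF-8", skipNul = TRUE)

# Count the lines and the mean/maximum line length in characters
summary_table <- t(sapply(corpora, function(x)
  c(lines = length(x),
    mean_nchar = round(mean(nchar(x))),
    max_nchar = max(nchar(x)))))
summary_table
```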
A sample that is too large would slow down the calculations, so I randomly sampled 1/50 of the lines from each of the three corpora and combined them into one sample object.
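A sketch of the sampling step, assuming the `corpora` list from the sketch above (the seed is arbitrary, only there for reproducibility):

```r
set.seed(1234)  # arbitrary seed, only for reproducibility

# Keep roughly 1/50 of the lines from each corpus and combine them
sampled     <- lapply(corpora, function(x) sample(x, size = length(x) %/% 50))
sample_text <- unlist(sampled, use.names = FALSE)
```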
For cleaning, I replaced or removed some full-width and other Unicode characters that would not be correctly recognised by the prediction model, and turned all the characters into lower case.
Then, I removed 377 profanity words/phrases from the sample object to ensure that they will not be predicted by the model.
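The exact substitutions and the profanity list are not reproduced here, but the cleaning could look roughly like this (the patterns and the file name `profanity.txt` are placeholders, not the exact ones I used):

```r
clean_text <- sample_text

# Normalise a few problematic Unicode characters (illustrative patterns only)
clean_text <- gsub("[\u2018\u2019]", "'", clean_text)        # curly single quotes
clean_text <- gsub("[\u201C\u201D]", '"', clean_text)        # curly double quotes
clean_text <- iconv(clean_text, "UTF-8", "ASCII", sub = " ") # drop what is left
clean_text <- tolower(clean_text)

# Remove profanity; "profanity.txt" stands in for the 377-entry list
profanity  <- readLines("profanity.txt")
clean_text <- gsub(paste0("\\b(", paste(profanity, collapse = "|"), ")\\b"),
                   "", clean_text)
```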
In this step, I first split each line in the sample into sentences. Since punctuation is not kept in this prediction model, a bigram formed from the two words on either side of a punctuation mark would mostly be meaningless.
Punctuation, symbols, numbers, '@' and '#' tokens, and URLs were removed from the texts, and the remaining text was stemmed.
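A sketch of the sentence splitting and tokenisation with the quanteda package (other tokenisers would work just as well); it assumes `clean_text` from the cleaning step above:

```r
library(quanteda)

# Reshape the sample into one document per sentence, so that later n-grams
# never span a sentence boundary
corp      <- corpus(clean_text)
sent_corp <- corpus_reshape(corp, to = "sentences")

# Tokenise, dropping punctuation, symbols, numbers and URLs,
# then drop Twitter handles/hashtags and stem the remaining words
toks <- tokens(sent_corp,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)
toks <- tokens_remove(toks, pattern = c("@*", "#*"))
toks <- tokens_wordstem(toks, language = "english")
```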
At this stage, I only created 1-grams, 2-grams and 3-grams as a starting point for a basic model.
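From the stemmed tokens, the 1-, 2- and 3-grams and their frequencies can be built in one pass; a sketch, again with quanteda (`textstat_frequency()` lives in the quanteda.textstats package in recent versions):

```r
library(quanteda.textstats)

# Build n-gram tokens for n = 1, 2, 3 and count their frequencies
ngram_freq <- lapply(1:3, function(n) {
  ng <- tokens_ngrams(toks, n = n, concatenator = "_")
  textstat_frequency(dfm(ng))
})

head(ngram_freq[[1]][, c("feature", "frequency")], 10)  # top-10 1-grams
```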
First, let's start with 1-grams, i.e. all the single words that appear in the sample. According to the data, there are 50613 different 1-grams in the sample, and they appear 1.94132 × 10^6 times in total. A histogram shows the distribution of how often each 1-gram appears in the sample:
Oops! It seems that the vast majority of 1-grams appear fewer than 5000 times, but some words appear over 80000 times, so the histogram has to be log-transformed as follows:
Now it is much clearer that, among all the 50613 different 1-grams, over 30000 appear only once, about 10000 appear only 2~10 times, while fewer than 100 appear 1000~100000 times each. This frequency distribution can be displayed in another form:
From the plot above it can be seen that the cumulative coverage curve grows very steeply: only 117 of the 50613 1-grams are enough to cover 50% of the text in the sample, and 3382 1-grams cover 90%. The top-10 1-grams are:
##    feature frequency
## 1      the     94049
## 2       to     54244
## 3      and     47040
## 4        a     46840
## 5       of     39329
## 6        i     32432
## 7       in     32182
## 8       it     24841
## 9     that     21893
## 10     for     21496
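The coverage figures above (117 1-grams for 50% of the text, 3382 for 90%) come from the cumulative sum of the sorted frequencies; a sketch, assuming the `ngram_freq` list from the previous step:

```r
# Cumulative share of all word occurrences covered by the top-k 1-grams
freq1    <- ngram_freq[[1]]$frequency   # already sorted, most frequent first
coverage <- cumsum(freq1) / sum(freq1)

min(which(coverage >= 0.5))   # number of 1-grams needed to cover 50% of the text
min(which(coverage >= 0.9))   # number needed to cover 90%
```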
As for 2-grams, there are 567608 different 2-grams in the sample, and they appear 1.708452 × 10^6 times in total; that is more than ten times as many distinct types as the 1-grams. The histogram, cumulative coverage curve and top-10 are shown below:
##     feature frequency
## 1    of_the      8647
## 2    in_the      7936
## 3    to_the      4268
## 4   for_the      3887
## 5    on_the      3854
## 6     to_be      3217
## 7    at_the      2850
## 8   and_the      2437
## 9      in_a      2262
## 10 with_the      2091
As for 3-grams, there are 1107117 different 3-grams in the sample, and they appear 1.475584 × 10^6 times in total. The histogram, cumulative coverage curve and top-10 are shown below:
##            feature frequency
## 1        one_of_the       721
## 2          a_lot_of       586
## 3     thank_for_the       459
## 4         i_want_to       443
## 5          go_to_be       351
## 6           to_be_a       346
## 7        out_of_the       316
## 8        the_end_of       313
## 9  look_forward_to        311
## 10       be_abl_to       304
It can be inferred that, for N-grams, the larger N is, the more distinct combinations there are and the less frequently each one appears, so the cumulative coverage curve grows less steeply and more N-grams are needed to cover a given percentage of the text.
I've read about the Katz back-off model and understand how it works, but I have no idea how to implement it in R at present. On my computer, computing 4-grams is already too time-consuming, so I cannot imagine 5- or 6-grams; how to improve efficiency is a big problem. I also have not yet understood how to evaluate the model.
I removed all the punctuation in my sample text, but I wonder whether it can be kept and used in any prediction models.
I stemmed the words for building the model, but do I need to consider how to change them back to their original (unstemmed) form?
A lot to learn…
If you can help clear up any of my confusions, please reply or email me at catreewp@outlook.com. Thank you so much!