This is an exploratory analysis of the data for the capstone project of the nine-course Coursera data science series from Johns Hopkins University. The goal is to predict the next word after some input words have been typed. The data for this project comes from SwiftKey and consists of three text datasets: blogs, news and Twitter. The text is supposed to be written in English, but it contains non-English words. For prediction, the more training data we use, the better the accuracy should be. Because of memory limits, however, only 1% of the data is used.
Note: The code isn’t shown because of the warning about plagiarism.
The total number of lines in each file is given below. Our analysis is based on only 1% of these datasets because of memory limits. The blogs file, which has the fewest lines, is the heaviest in file size.
en_US.blogs.txt  en_US.news.txt  en_US.twitter.txt
        899,288       1,010,242          2,360,148
The sample taken from the provided datasets consists of:
[1] 42698 lines
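The sampling step can be sketched as follows. This is an illustrative Python sketch, not the code used for this report: the function names, the fixed seed and the choice to round each file's 1% up to a whole line are all assumptions.

```python
import random

def sample_size(n_lines, percent=1):
    """Number of lines in a percent% sample, rounded up (integer arithmetic)."""
    return (n_lines * percent + 99) // 100

def sample_lines(lines, percent=1, seed=2024):
    """Draw a reproducible random sample of the lines, without replacement."""
    rng = random.Random(seed)  # hypothetical seed, fixed only for reproducibility
    return rng.sample(lines, sample_size(len(lines), percent))

# Stand-in corpus; the real input would be the lines of the three en_US files.
corpus = {
    "blogs": [f"blog line {i}" for i in range(1000)],
    "news": [f"news line {i}" for i in range(1500)],
    "twitter": [f"tweet {i}" for i in range(2500)],
}
sample = {name: sample_lines(lines) for name, lines in corpus.items()}
sizes = {name: len(lines) for name, lines in sample.items()}
print(sizes)

# Line counts reported above, with 1% of each file rounded up.
reported = {"blogs": 899288, "news": 1010242, "twitter": 2360148}
print(sum(sample_size(n) for n in reported.values()))
```

With the reported line counts, rounding 1% of each file up gives 8,993 + 10,103 + 23,602 = 42,698 lines, which agrees with the sample size shown above. Whether the original sample was rounded this way is a guess.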
The sample comes from all three sources. Let’s see how some words are distributed across them.
Document-feature matrix of: 3 documents, 10 features.
3 x 10 sparse Matrix of class "dfmSparse"
      features
docs      to    a   is that this any simple type tutorial applies
  blog 10673 8942 4334 4723 2574 386     74   58       12       6
  new   9089 8744 2816 3391 1211 267     36   26        0       1
  twit  7940 5992 3483 2413 1635 250     29   24        1       4
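A matrix like the one above counts how often each selected word (feature) occurs in each document. The idea can be sketched in Python as below. This is an illustration only, not the code behind the output above: the tokenizer, the function names and the toy documents are assumptions.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def feature_matrix(docs, features):
    """Return one row of counts per document, one column per feature."""
    matrix = {}
    for name, text in docs.items():
        counts = Counter(tokenize(text))
        matrix[name] = [counts[f] for f in features]  # missing words count as 0
    return matrix

# Toy documents standing in for the blog, news and Twitter samples.
docs = {
    "blog": "This is a simple tutorial. It applies to any type of text.",
    "new": "That is a report to read.",
    "twit": "this is a tweet to a friend",
}
features = ["to", "a", "is", "that", "this"]
dfm = feature_matrix(docs, features)
for name, row in dfm.items():
    print(name, row)
```

A dense dictionary of lists is enough for a handful of features. For a full vocabulary a sparse representation is preferable, which is what the "sparse Matrix" in the output above indicates.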
The text contains non-English words and some misspelled words. Also, once the text is lowercased, the meaning of certain words is no longer obvious. Let’s look at some content found in the data that should be removed. I’ve identified some Chinese and Arabic text; the sample below also shows Japanese and Tamil characters and runs of underscores.
[1] "<U+793C><U+4E49><U+5EC9><U+803B>" "______________" "<U+867D><U+7136>"
[4] "<U+6211><U+4EEC>" "<U+53EA><U+5F97>" "<U+4E09><U+7B49>"
[7] "<U+5B9D><U+8D1D>" "<U+8868><U+73B0>" "<U+771F><U+7684>"
[10] "<U+5FC3><U+91CC>" "<U+5C31><U+662F>" "<U+7B2C><U+4E00>"
[13] "360º" "________" "350º"
[16] "__" "<U+694A><U+679D>" "<U+3063><U+3066>"
[19] "<U+5C0F><U+3055><U+3044>" "<U+3057><U+305F>" "______"
[22] "<U+0B9A><U+0BBF><U+0BB1><U+0BC1>" "<U+0BA4><U+0BC1><U+0BB3><U+0BBF>" "<U+0BAA><U+0BC6><U+0BB1><U+0BC1>"
[25] "<U+0BB5><U+0BC6><U+0BB3><U+0BCD><U+0BB3><U+0BAE><U+0BCD>" "____" "_____"
[28] "<U+0639><U+064A><U+0634>" "<U+0627><U+0644><U+0623><U+062C><U+0648><U+0627><U+0621>" "___"
[31] "___________" "<U+5B9A><U+98DF>" "<U+697C><U+4E0A>"
[34] "<U+5F00><U+4F1A>" "<U+306A><U+3044>" "____________________"
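One simple way to drop tokens like these is to keep only tokens made of ASCII letters. The Python sketch below illustrates that idea; it is an assumed filter, not the cleaning actually applied in this report, and the pattern and function name are hypothetical.

```python
import re

# Keep tokens made of ASCII letters, optionally joined by inner apostrophes or hyphens.
ENGLISH_LIKE = re.compile(r"^[a-z]+(?:['-][a-z]+)*$")

def keep_token(token):
    """True if the lowercased token looks like a plain English word."""
    return bool(ENGLISH_LIKE.match(token.lower()))

# A few tokens from the listing above, mixed with ordinary English words.
tokens = [
    "the",
    "don't",
    "\u793c\u4e49\u5ec9\u803b",   # CJK characters
    "______",                     # run of underscores
    "360\u00ba",                  # digits plus ordinal sign
    "\u0639\u064a\u0634",         # Arabic characters
    "well-known",
]
cleaned = [t for t in tokens if keep_token(t)]
print(cleaned)
```

This filter is deliberately strict: it also removes numbers and English words written with accented letters, and it does nothing about misspelled words, which would need a dictionary or a frequency threshold.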
Since the aim of this project is to predict words, the text needs to be split into words so that we can see the distribution of words occurring in the training set. We’re going to examine 1-grams, 2-grams and 3-grams.
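Splitting text into words and counting n-grams can be sketched in Python as follows. This is an illustrative sketch with assumed function names and toy input, not the code that produced the counts in this report. N-grams are joined with underscores to match the naming used in the tables.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Join each run of n consecutive tokens with underscores, e.g. 'of_the'."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(lines, n):
    """Count n-grams line by line, so that no n-gram spans two lines."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(tokenize(line), n))
    return counts

# Toy input standing in for the sampled lines.
lines = [
    "One of the best days of the year",
    "It was one of the best",
]
unigrams = ngram_counts(lines, 1)
bigrams = ngram_counts(lines, 2)
trigrams = ngram_counts(lines, 3)
print(bigrams.most_common(2))
print(trigrams.most_common(2))
```

Counting within each line is a design choice: it avoids creating n-grams from the last word of one text and the first word of an unrelated one.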
The most frequent 1-grams (single words) are given below. They are mostly English stop words. This is expected, because these words occur frequently in English.
   the     to    and      a     of      i     in    for     is   that
47,505 27,702 24,306 23,678 20,306 16,479 16,260 11,035 10,633 10,527
A 2-gram is a sequence of two consecutive words. The most frequent 2-grams are given below.
  of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the
   4,319    4,143    2,100    2,006    1,931    1,662    1,444    1,243
    in_a with_the
   1,178    1,056
The most frequent 3-grams (sequences of three consecutive words) are given below.
    one_of_the       a_lot_of thanks_for_the    going_to_be        to_be_a
           360            291            236            181            165
    out_of_the     as_well_as     the_end_of     be_able_to      i_want_to
           157            148            144            144            143
High-frequency items are few, whatever the n-gram order, and the counts decrease from 1-grams to 3-grams. The graph below shows the 10,000 highest-frequency words. The variable gram_i gives the count for the i-gram, with i from 1 to 3; the other variables give the n-gram names. The names of the 1-grams are the row names. So the most frequent 1-gram is "the", followed by "to" then "and". For 2-grams it is "of the", followed by "in the" and "to the". Finally, for 3-grams it is "one of the", followed by "a lot of" and "thanks for the".
     gram_1 gram2_name gram_2     gram3_name gram_3
the   47505     of_the   4319     one_of_the    360
to    27702     in_the   4143       a_lot_of    291
and   24306     to_the   2100 thanks_for_the    236
a     23678    for_the   2006    going_to_be    181
of    20306     on_the   1931        to_be_a    165
i     16479      to_be   1662     out_of_the    157
in    16260     at_the   1444     as_well_as    148
for   11035    and_the   1243     the_end_of    144
is    10633       in_a   1178     be_able_to    144
that  10527   with_the   1056      i_want_to    143
I’m going to use the three basic n-grams (1 to 3) for prediction, and I think I’ll use a Markov chain model. The n-gram frequencies suit this analysis.
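To make the Markov chain idea concrete, here is a minimal Python sketch of a next-word predictor built from 1-gram to 3-gram counts. It is only one possible design, not the model chosen for this project. In particular, falling back to a shorter context when the longer one is unseen is an assumption, and the function names and training sentences are hypothetical.

```python
from collections import Counter, defaultdict

def train(sentences, max_n=3):
    """Map each context of 0 to max_n - 1 words to a Counter of the words that follow it."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                *context, word = tokens[i:i + n]
                model[tuple(context)][word] += 1
    return model

def predict(model, text, max_n=3):
    """Return the most frequent next word for the longest context seen in training."""
    tokens = text.lower().split()
    for k in range(max_n - 1, -1, -1):   # try 2 words of context, then 1, then none
        if k > len(tokens):
            continue                     # not enough typed words for this context length
        context = tuple(tokens[-k:]) if k else ()
        if context in model:
            return model[context].most_common(1)[0][0]
    return None                          # only reached when the model is empty

# Toy training data; the real model would be trained on the sampled corpus.
sentences = [
    "thanks for the help",
    "thanks for the follow",
    "one of the best",
    "thanks for the help again",
]
model = train(sentences)
print(predict(model, "Thanks for the"))
print(predict(model, "a lot of"))
```

Here the prediction depends only on the last one or two words typed, which is the Markov assumption. The sketch applies no smoothing, so every prediction is simply the most frequent continuation observed in the training data.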