Abstract

This is an exploratory analysis of the data for the capstone project of the nine-course Coursera data science series from Johns Hopkins University. The goal is to predict the next word after some input words have been typed. The dataset for this project comes from SwiftKey and consists of three text datasets drawn from blogs, news and Twitter. The text is supposed to be written in English, but it contains non-English words. For prediction, the more training data we use, the higher the accuracy should be. Because of memory constraints, however, only 1% of the data is used.

Note: The code isn’t shown because of the warning about plagiarism.

Exploratory analysis

Number of lines

The total number of lines in each file is given below. Our analysis, however, is based on 1% of these datasets because of memory constraints. The blogs source, which has the fewest lines, is the heaviest.

  en_US.blogs.txt    en_US.news.txt en_US.twitter.txt 
          899,288         1,010,242         2,360,148 

The sample taken from the provided datasets consists of:

[1] 42698 lines
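
The report does not say how the 1% sample was drawn. As a minimal sketch, assuming each line is kept independently with a fixed probability, the sampling could look like the following. The function name `sample_lines` and the seed value are hypothetical, chosen only for illustration.

```python
import random

def sample_lines(lines, fraction=0.01, seed=1234):
    """Keep each line independently with probability `fraction`.

    Assumption: Bernoulli sampling; the report's actual method is not stated.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for the lines of the three source files.
corpus = [f"line {i}" for i in range(10_000)]
sample = sample_lines(corpus, fraction=0.01)
print(len(sample), "lines sampled out of", len(corpus))
```

With this kind of sampling the sample size is only approximately 1% of the input, which is consistent with a sample that is close to, but not exactly, one hundredth of the total line count.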

The sample comes from all three sources. Let’s see how some words are distributed across them.

Document-feature matrix of: 3 documents, 10 features.
3 x 10 sparse Matrix of class "dfmSparse"
      features
docs      to    a   is that this any simple type tutorial applies
  blog 10673 8942 4334 4723 2574 386     74   58       12       6
  new   9089 8744 2816 3391 1211 267     36   26        0       1
  twit  7940 5992 3483 2413 1635 250     29   24        1       4
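
To make the table above concrete, here is a small sketch of how a document-feature matrix can be built: one row per source, one column per word, each cell a count. It uses plain Python rather than the package that produced the output above, and the helper name `document_feature_counts` and the toy texts are assumptions made for illustration.

```python
from collections import Counter

def document_feature_counts(docs, features):
    """Return {doc_name: {feature: count}} for lowercased, whitespace-split text."""
    table = {}
    for name, text in docs.items():
        counts = Counter(text.lower().split())
        # A Counter returns 0 for words that never occur in the document.
        table[name] = {feature: counts[feature] for feature in features}
    return table

# Toy documents standing in for the three sources.
docs = {
    "blog": "This is a simple tutorial to try",
    "news": "That is a report to read",
    "twit": "a tweet is a tweet",
}
table = document_feature_counts(docs, ["to", "a", "is"])
for name, row in table.items():
    print(name, row)
```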

Cleaning the data

The text contains non-English words and some mistyped words. Also, once the text is lowercased, the meaning of certain words is no longer obvious. Let’s see some content found in the data that should be removed. I’ve identified some Chinese and Arabic text, among other non-Latin scripts.

 [1] "<U+793C><U+4E49><U+5EC9><U+803B>" "______________"       "<U+867D><U+7136>"    
 [4] "<U+6211><U+4EEC>"     "<U+53EA><U+5F97>"     "<U+4E09><U+7B49>"    
 [7] "<U+5B9D><U+8D1D>"     "<U+8868><U+73B0>"     "<U+771F><U+7684>"    
[10] "<U+5FC3><U+91CC>"     "<U+5C31><U+662F>"     "<U+7B2C><U+4E00>"    
[13] "360º"                 "________"             "350º"                
[16] "__"                   "<U+694A><U+679D>"     "<U+3063><U+3066>"    
[19] "<U+5C0F><U+3055><U+3044>" "<U+3057><U+305F>"     "______"              
[22] "<U+0B9A><U+0BBF><U+0BB1><U+0BC1>" "<U+0BA4><U+0BC1><U+0BB3><U+0BBF>" "<U+0BAA><U+0BC6><U+0BB1><U+0BC1>"
[25] "<U+0BB5><U+0BC6><U+0BB3><U+0BCD><U+0BB3><U+0BAE><U+0BCD>" "____"                 "_____"               
[28] "<U+0639><U+064A><U+0634>" "<U+0627><U+0644><U+0623><U+062C><U+0648><U+0627><U+0621>" "___"                 
[31] "___________"          "<U+5B9A><U+98DF>"     "<U+697C><U+4E0A>"    
[34] "<U+5F00><U+4F1A>"     "<U+306A><U+3044>"     "____________________"
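
One possible way to drop tokens like those listed above is to keep only tokens that look like English words. The sketch below is an assumption about how such a filter could work, not the cleaning actually used in this report; the pattern `ENGLISH_LIKE` and the helper names are hypothetical.

```python
import re

# Assumption: a token is kept only if it is made of ASCII letters,
# optionally joined by single apostrophes or hyphens ("don't", "well-known").
ENGLISH_LIKE = re.compile(r"[a-z]+(?:['-][a-z]+)*")

def keep_token(token):
    """True if the lowercased token matches the English-like pattern."""
    return ENGLISH_LIKE.fullmatch(token.lower()) is not None

def clean_tokens(tokens):
    """Drop underscore runs, numbers with symbols, and non-Latin tokens."""
    return [token for token in tokens if keep_token(token)]

tokens = ["the", "______", "360º", "\u6211\u4eec", "don't", "\u0639\u064a\u0634"]
print(clean_tokens(tokens))
```

In this toy input only `the` and `don't` survive. The rule is deliberately strict: it also removes plain numbers and accented words such as "café", so it would need loosening if those should be kept.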

Text tokenization

Since the aim of this project is word prediction, the text needs to be split into words to see the distribution of the words occurring in the training set. We’re going to examine 1-grams, 2-grams and 3-grams.
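
As an illustration of what counting n-grams involves, the sketch below splits each line into lowercase words, slides a window of n words over them, and tallies the results. Joining words with an underscore mirrors the labels in the tables of this report; the function names and the toy lines are assumptions, and the whitespace tokenizer is much cruder than a real one.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined with underscores."""
    return ["_".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(lines, n, k=10):
    """Count n-grams line by line (never across lines) and return the k most frequent."""
    counts = Counter()
    for line in lines:
        tokens = line.lower().split()
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

lines = ["One of the best days", "one of the worst days", "a lot of people"]
for n in (1, 2, 3):
    print(n, top_ngrams(lines, n, k=3))
```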

1-gram

The most frequent 1-grams (single words) are given below. They are mostly English stop words, which is expected because these words occur frequently in English.

   the     to    and      a     of      i     in    for     is   that 
47,505 27,702 24,306 23,678 20,306 16,479 16,260 11,035 10,633 10,527 

2-gram

A 2-gram is a sequence of two consecutive words. The most frequent 2-grams are given below.

  of_the   in_the   to_the  for_the   on_the    to_be   at_the  and_the 
   4,319    4,143    2,100    2,006    1,931    1,662    1,444    1,243 
    in_a with_the 
   1,178    1,056 

3-gram

The most frequent 3-grams (sequences of three consecutive words) are given below.

    one_of_the       a_lot_of thanks_for_the    going_to_be        to_be_a 
           360            291            236            181            165 
    out_of_the     as_well_as     the_end_of     be_able_to      i_want_to 
           157            148            144            144            143 

Summary

High-frequency entries are few whatever the n-gram order, and the counts decrease from 1-grams to 3-grams. The graph below shows the 10,000 highest-frequency words. In the table, the variable gram_i holds the frequency of the i-grams, for i from 1 to 3, and the other variables hold the n-gram names; the names of the 1-grams are the row names. The most frequent 1-gram is "the", followed by "to" and "and". For 2-grams it is "of the", followed by "in the" and "to the". Finally, for 3-grams it is "one of the", followed by "a lot of" and "thanks for the".

     gram_1 gram2_name gram_2     gram3_name gram_3
the   47505     of_the   4319     one_of_the    360
to    27702     in_the   4143       a_lot_of    291
and   24306     to_the   2100 thanks_for_the    236
a     23678    for_the   2006    going_to_be    181
of    20306     on_the   1931        to_be_a    165
i     16479      to_be   1662     out_of_the    157
in    16260     at_the   1444     as_well_as    148
for   11035    and_the   1243     the_end_of    144
is    10633       in_a   1178     be_able_to    144
that  10527   with_the   1056      i_want_to    143

Prediction perspective

I’m going to use the three basic n-grams (1-gram to 3-gram) for prediction. I think I’m going to use a Markov chain, in which the next word depends only on the last one or two words typed. The n-gram frequencies computed above suit this approach.
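
To show how n-gram counts can drive a Markov-chain style predictor, here is a minimal sketch. It assumes a simple backoff rule that this report does not specify: try the last two words first, then the last word, then fall back to the most frequent word overall. The names `train` and `predict` and the toy training lines are hypothetical.

```python
from collections import Counter, defaultdict

def train(lines, max_order=3):
    """Map each context of 0 to max_order-1 words to a Counter of next words."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.lower().split()
        for i, word in enumerate(tokens):
            for k in range(max_order):  # k = number of context words
                if i - k < 0:
                    break  # not enough preceding words on this line
                context = tuple(tokens[i - k:i])
                model[context][word] += 1
    return model

def predict(model, text, max_order=3):
    """Return the most frequent next word, backing off to shorter contexts."""
    tokens = text.lower().split()
    for k in range(max_order - 1, -1, -1):  # longest context first
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # empty model

training = ["one of the best", "one of the best", "one of the worst", "a lot of people"]
model = train(training)
print(predict(model, "one of"))
print(predict(model, "a lot of"))
```

A real predictor would add smoothing and return several ranked candidates, but the lookup structure stays the same: counts keyed by the preceding words.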