This is the Week Two Milestone Report for the Johns Hopkins Data Science Capstone course on Coursera.
The report briefly describes the following stages of the project to develop a text prediction algorithm:
Data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The English language data, which we discuss here, consisted of three files containing text extracted from published blogs, news feeds, and twitter feeds respectively.
Rather than using the full downloaded data, a sample was created and stored for use in later stages of the project. The initial sample size was 1% of the full dataset; this will be increased once an initial working model has been created, so that the improvement in performance with increasing sample size can be monitored.
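A minimal sketch of this kind of line-level sampling, assuming the three English files sit in a `final/en_US` directory (the file paths and the `rbinom` call are illustrative, not the exact code used):

```r
set.seed(1234)                                    # make the sample reproducible
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
dir.create("sample", showWarnings = FALSE)

for (f in files) {
  lines <- readLines(file.path("final", "en_US", f),
                     encoding = "UTF-8", skipNul = TRUE)
  keep  <- as.logical(rbinom(length(lines), size = 1, prob = 0.01))  # ~1% of lines
  writeLines(lines[keep], file.path("sample", f))
}
```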
The data was read into memory and stored as a “virtual corpus”.
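A sketch of how the sampled files can be loaded as a tm corpus (the `sample` directory name is assumed from the sketch above):

```r
library(tm)

# Build a volatile ("virtual") corpus, one document per sampled file
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
```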
Here are a few summarized details:
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1579169
We can see that the corpus contains 3 documents and that the document summarised above contains more than 1.5 million characters.
The data was then cleaned and transformed; the main steps were lower-casing the text, removing punctuation, and filtering out stop words.
I have used two approaches: one using the tm package of natural language processing tools, and another based on the tidytext package and other tools from a group of packages collectively known as the “tidyverse”. For future work I will use the tidytext approach as much as possible, for speed and memory efficiency.
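As an illustration (not the exact code used), the tm route applies a chain of transformations to the corpus, while the tidytext route tokenises a data frame of raw lines; `blog_lines` below stands in for a character vector of sampled blog lines and is assumed:

```r
library(tm)
library(dplyr)
library(tidytext)

# tm approach: transform the corpus in place
clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)

# tidytext approach: one row per word, stop words removed with an anti-join
tidy_words <- tibble(document = "en_US.blogs.txt", text = blog_lines) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```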
The tables below list the most frequent terms from each document.
## <<DocumentTermMatrix (documents: 3, terms: 50919)>>
## Non-/sparse entries: 76241/76516
## Sparsity : 50%
## Maximal term length: 58
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just like new now one said time will
## en_US.blogs.txt 739 594 739 763 379 458 922 238 662 866
## en_US.news.txt 456 344 431 393 490 247 612 1870 355 774
## en_US.twitter.txt 893 1089 1482 1141 663 838 801 171 752 913
## $en_US.blogs.txt
## one will like can just time get
## 922 866 763 739 739 662 594
## know now people back day even make
## 469 458 427 412 412 400 399
## also first new love see really well
## 387 383 379 375 370 369 368
## little good much way think going many
## 361 358 356 341 330 325 313
## life things want still itâs two say
## 299 298 290 288 284 281 277
## made work last years something take great
## 272 267 265 263 259 250 248
## year need said got iâm around never
## 240 239 238 237 237 235 235
## right
## 235
##
## $en_US.news.txt
## said will one new can two just
## 1870 774 612 490 456 441 431
## also year like state years first time
## 420 418 393 365 361 359 355
## get people last city make game says
## 344 322 321 275 257 254 248
## now school county going three back million
## 247 247 240 235 235 233 229
## way even many good may team police
## 226 221 217 215 214 209 205
## percent made season work think day since
## 204 202 196 192 191 186 183
## home four president well much know say
## 180 179 178 178 177 176 174
## public
## 171
##
## $en_US.twitter.txt
## just like get love will good can day thanks
## 1482 1141 1089 973 913 900 893 848 841
## now one know time today great new see lol
## 838 801 790 752 698 689 663 643 632
## back got going people follow think right happy need
## 566 555 518 503 489 476 470 469 469
## want much really make come tonight night work thank
## 467 460 419 417 401 397 387 377 374
## last hope well way still best never say life
## 369 352 348 338 323 319 318 317 313
## better first please twitter next
## 308 305 305 302 292
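For reference, a sketch of how a document-term matrix and per-document frequency lists of this kind can be produced with tm (using the `clean` corpus from the sketch above):

```r
dtm <- DocumentTermMatrix(clean)
m   <- as.matrix(dtm)       # fine at this sample size; keep it sparse for larger samples

# Named list of the 50 most frequent terms in each document
top_terms <- lapply(rownames(m), function(doc) sort(m[doc, ], decreasing = TRUE)[1:50])
names(top_terms) <- rownames(m)
```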
Here are some statistics about the number of words in the three documents.
Total number of unique terms: 50919
Total number of words across sources: 435288
Number of words in blogs: 143093
Number of words in news: 140658
Number of words in twitter: 151537
The 984 most common words account for 50% of the total, and the 15,357 most common account for 90% of the total (after stop words have been removed).
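These coverage figures come from a cumulative sum over the sorted term counts; a sketch, assuming `total_counts` is a named vector of counts summed across the three sources:

```r
sorted   <- sort(total_counts, decreasing = TRUE)
coverage <- cumsum(sorted) / sum(sorted)

n50 <- which(coverage >= 0.5)[1]   # number of words covering 50% of all occurrences
n90 <- which(coverage >= 0.9)[1]   # number of words covering 90% of all occurrences
```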
In the next two sections we look at words that appear in all three documents and then those that are unique to each document.
I know that word clouds are viewed with the same disdain by many data scientists as pie charts are, but for a quick impression of the language used I think they can be justified here.
Top 60 words appearing in all three sources:
Top 60 words in blogs:
Top 60 words in news:
It looks like we have a rather local set of news feeds - Cuyahoga is a river and a county in Ohio.
Top 60 words in twitter feeds: (several offensive words removed from here)
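The clouds themselves can be drawn with the wordcloud package; a minimal sketch for the shared-vocabulary cloud, assuming a data frame `common_words` holding the words found in all three sources and their combined counts:

```r
library(wordcloud)

wordcloud(words = common_words$word,
          freq  = common_words$count,
          max.words = 60,                                  # top 60 words only
          colors = RColorBrewer::brewer.pal(8, "Dark2"))
```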
A Venn diagram of the number of words in each category and of the overlaps:

### 4.5 Characteristic Word Frequencies - another approach
Here we are looking for words characteristic of each group, using the tf-idf statistic (see http://tidytextmining.com/tfidf.html).
The table below summarises the news-feed data. It’s clearly compatible with the word cloud above.
## # A tibble: 26,272 × 6
## document term count tf idf
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 en_US.news.txt cuyahoga 27 0.00019195495 1.0986123
## 2 en_US.news.txt spokeswoman 23 0.00016351718 1.0986123
## 3 en_US.news.txt analysts 19 0.00013507941 1.0986123
## 4 en_US.news.txt superintendent 17 0.00012086053 1.0986123
## 5 en_US.news.txt winery 17 0.00012086053 1.0986123
## 6 en_US.news.txt authorities 44 0.00031281548 0.4054651
## 7 en_US.news.txt corp 16 0.00011375108 1.0986123
## 8 en_US.news.txt county's 15 0.00010664164 1.0986123
## 9 en_US.news.txt regulators 15 0.00010664164 1.0986123
## 10 en_US.news.txt trenton 13 0.00009242276 1.0986123
## # ... with 26,262 more rows, and 1 more variables: tf_idf <dbl>
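A sketch of how a table like this can be built with tidytext’s bind_tf_idf(), assuming `word_counts` is a data frame of per-document term counts with columns `document`, `term`, and `count`:

```r
library(dplyr)
library(tidytext)

news_tf_idf <- word_counts %>%
  bind_tf_idf(term, document, count) %>%   # adds tf, idf and tf_idf columns
  filter(document == "en_US.news.txt") %>%
  arrange(desc(tf_idf))
```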
The following charts show the words most characteristic of each source document.
The following two tables show the most frequent bigrams and trigrams.
## <<TermDocumentMatrix (terms: 2590, documents: 3)>>
## Non-/sparse entries: 7770/0
## Sparsity : 0%
## Maximal term length: 21
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## feel like 40 15 59
## high school 19 65 28
## last night 16 9 116
## last year 22 87 30
## looking forward 15 7 110
## new york 43 83 19
## p m 7 152 9
## right now 22 21 176
## u s 28 165 14
## years ago 53 52 17
## <<TermDocumentMatrix (terms: 46, documents: 3)>>
## Non-/sparse entries: 138/0
## Sparsity : 0%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## cinco de mayo 2 1 12
## four years ago 3 4 1
## just around corner 1 1 6
## just wanted say 1 1 6
## let us know 4 1 18
## looking forward seeing 1 2 8
## new york city 10 6 3
## new york times 9 6 1
## two years ago 2 10 2
## will take place 2 4 1
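The matrices above are tm TermDocumentMatrix objects; as an alternative illustration, n-gram counts can also be produced with tidytext, assuming a data frame `docs` with columns `document` and `text`:

```r
library(dplyr)
library(tidytext)

bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(document, bigram, sort = TRUE)

trigrams <- docs %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(document, trigram, sort = TRUE)
```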
The following section attempts to extract a list of potential French words from the text. The method uses a list of French words found on the web, plus a list of English words also found on the web; however, these are just for demonstration purposes and could probably be improved (and properly cited!).
The method is first to find the intersection of our list of words with the French list. This gives us words that could be French, but a large number of strings are valid words in both French and English, so it makes sense to remove all the possibly English ones.
This leaves us with a list of just over 330 words. However, we can’t be sure they are all intended to be French - some could be proper names or from other languages. And, of course, we will have filtered out some genuinely French words because they are identical to an English word.
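A sketch of the set operations involved, assuming character vectors `our_words` (the corpus vocabulary), `french_words`, and `english_words` (the two word lists found on the web):

```r
maybe_french    <- intersect(our_words, french_words)    # could be French...
probably_french <- setdiff(maybe_french, english_words)  # ...and not also valid English

length(probably_french)
sample(probably_french, 50)                              # random sample for inspection
```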
We’ve printed out a sample of the words detected.
## [1] "vida" "alsace" "tempe" "mallette"
## [5] "puy" "banderas" "rais" "ravigote"
## [9] "est" "accoutrements" "lazare" "banc"
## [13] "pers" "zona" "rodas" "blondie"
## [17] "occurences" "cale" "nui" "bayer"
## [21] "tue" "garces" "perron" "fertiliser"
## [25] "anglophone" "sens" "vitale" "injectable"
## [29] "broche" "grue" "las" "bomba"
## [33] "vite" "tourisme" "tel" "mai"
## [37] "sep" "gallo" "ramage" "catalan"
## [41] "maxime" "vas" "brandon" "ravi"
## [45] "tertre" "revit" "ravin" "gis"
## [49] "tris" "protections"
The initial plan is to build a Katz back-off model based on the combined data from all three documents.
It will run as a Shiny app allowing the user to type in a few words and see the prediction for the next one.
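As a very rough illustration of the back-off idea (a simple “stupid back-off” lookup rather than the full Katz model with discounting), assuming named count vectors `trigram_counts`, `bigram_counts`, and `unigram_counts` keyed by space-separated n-grams:

```r
# Predict the next word from the last two words typed: try trigrams first,
# back off to bigrams, and finally to the single most frequent word.
predict_next <- function(w1, w2, trigram_counts, bigram_counts, unigram_counts) {
  prefix3 <- paste(w1, w2, "")                       # "w1 w2 "
  tri <- trigram_counts[startsWith(names(trigram_counts), prefix3)]
  if (length(tri) > 0)
    return(substring(names(tri)[which.max(tri)], nchar(prefix3) + 1))

  prefix2 <- paste(w2, "")                           # "w2 "
  bi <- bigram_counts[startsWith(names(bigram_counts), prefix2)]
  if (length(bi) > 0)
    return(substring(names(bi)[which.max(bi)], nchar(prefix2) + 1))

  names(unigram_counts)[which.max(unigram_counts)]   # last resort: most common word
}
```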
I intend to spend a little more time on data cleaning; for example, I would like to look more closely at the removal of special characters and at the effect of removing stop words. There is one obvious problem with the output above - in some cases it looks like an apostrophe has been replaced by an “a” with a circumflex (â), which points to a character-encoding issue.
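One possible fix (not yet tested) is to declare the encoding when the files are read, or to convert explicitly with iconv; the file path below is illustrative:

```r
lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines <- iconv(lines, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")  # drop/convert non-ASCII
```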
I’ll also create validation and test data sets to investigate the effect of different cleaning processes, and the effect (on accuracy, speed, and memory use) of varying the size of the data set used.