This is the Week Two Milestone Report for the Johns Hopkins Data Science Capstone course on Coursera.
The report briefly describes the following stages of the project to develop a text prediction algorithm:
Data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
The English language data, which we discuss here, consisted of three files containing text extracted from published blogs, news feeds, and twitter feeds respectively.
Rather than using the full downloaded data, a sample was created and stored for use in later stages of the project. The initial sample size was 1% of the full dataset; this will be increased once an initial working model has been created, so that the improvement in performance with increasing sample size can be monitored.
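A minimal sketch of this kind of line-level sampling, assuming the three English files sit in a `final/en_US` directory (the file paths and the `rbinom` call are illustrative, not the exact code used):

```r
set.seed(1234)                                    # make the sample reproducible
files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
dir.create("sample", showWarnings = FALSE)

for (f in files) {
  lines <- readLines(file.path("final", "en_US", f),
                     encoding = "UTF-8", skipNul = TRUE)
  keep  <- as.logical(rbinom(length(lines), size = 1, prob = 0.01))  # ~1% of lines
  writeLines(lines[keep], file.path("sample", f))
}
```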
The data was read into memory and stored as a “virtual corpus”.
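A sketch of how the sampled files can be loaded as a tm corpus (the `sample` directory name is assumed from the sketch above):

```r
library(tm)

# Build a volatile ("virtual") corpus, one document per sampled file
corpus <- VCorpus(DirSource("sample", encoding = "UTF-8"),
                  readerControl = list(language = "en"))
```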
Here are a few summarized details:
## <<VCorpus>>
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 3
## <<PlainTextDocument>>
## Metadata: 7
## Content: chars: 1579169
We can see that the corpus contains 3 documents and that the document summarised above contains more than 1.5 million characters.
The data was then cleaned and transformed; the main steps were lower-casing the text, removing punctuation, and filtering out stop words.
I have used two approaches: one using the tm package of natural language processing tools, and another based on the tidytext package and other tools from a group of packages collectively known as the “tidyverse”. For future work I will use the tidytext approach as much as possible, for speed and memory efficiency.
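As an illustration (not the exact code used), the tm route applies a chain of transformations to the corpus, while the tidytext route tokenises a data frame of raw lines; `blog_lines` below stands in for a character vector of sampled blog lines and is assumed:

```r
library(tm)
library(dplyr)
library(tidytext)

# tm approach: transform the corpus in place
clean <- corpus %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("english")) %>%
  tm_map(stripWhitespace)

# tidytext approach: one row per word, stop words removed with an anti-join
tidy_words <- tibble(document = "en_US.blogs.txt", text = blog_lines) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")
```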
The tables below list the most frequent terms from each document.
## <<DocumentTermMatrix (documents: 3, terms: 50919)>>
## Non-/sparse entries: 76241/76516
## Sparsity : 50%
## Maximal term length: 58
## Weighting : term frequency (tf)
## Sample :
## Terms
## Docs can get just like new now one said time will
## en_US.blogs.txt 739 594 739 763 379 458 922 238 662 866
## en_US.news.txt 456 344 431 393 490 247 612 1870 355 774
## en_US.twitter.txt 893 1089 1482 1141 663 838 801 171 752 913
## $en_US.blogs.txt
## one will like can just time get
## 922 866 763 739 739 662 594
## know now people back day even make
## 469 458 427 412 412 400 399
## also first new love see really well
## 387 383 379 375 370 369 368
## little good much way think going many
## 361 358 356 341 330 325 313
## life things want still itâs two say
## 299 298 290 288 284 281 277
## made work last years something take great
## 272 267 265 263 259 250 248
## year need said got iâm around never
## 240 239 238 237 237 235 235
## right
## 235
##
## $en_US.news.txt
## said will one new can two just
## 1870 774 612 490 456 441 431
## also year like state years first time
## 420 418 393 365 361 359 355
## get people last city make game says
## 344 322 321 275 257 254 248
## now school county going three back million
## 247 247 240 235 235 233 229
## way even many good may team police
## 226 221 217 215 214 209 205
## percent made season work think day since
## 204 202 196 192 191 186 183
## home four president well much know say
## 180 179 178 178 177 176 174
## public
## 171
##
## $en_US.twitter.txt
## just like get love will good can day thanks
## 1482 1141 1089 973 913 900 893 848 841
## now one know time today great new see lol
## 838 801 790 752 698 689 663 643 632
## back got going people follow think right happy need
## 566 555 518 503 489 476 470 469 469
## want much really make come tonight night work thank
## 467 460 419 417 401 397 387 377 374
## last hope well way still best never say life
## 369 352 348 338 323 319 318 317 313
## better first please twitter next
## 308 305 305 302 292
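For reference, a sketch of how a document-term matrix and per-document frequency lists of this kind can be produced with tm (using the `clean` corpus from the sketch above):

```r
dtm <- DocumentTermMatrix(clean)
m   <- as.matrix(dtm)       # fine at this sample size; keep it sparse for larger samples

# Named list of the 50 most frequent terms in each document
top_terms <- lapply(rownames(m), function(doc) sort(m[doc, ], decreasing = TRUE)[1:50])
names(top_terms) <- rownames(m)
```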
Here are some statistics about the number of words in the three documents.
Total number of unique terms: 50919
Total number of words across sources: 435288
Number of words in blogs: 143093
Number of words in news: 140658
Number of words in twitter: 151537
The 984 most common words account for 50% of the total, and the 15,357 most common account for 90% of the total (after stop words have been removed).
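These coverage figures come from a cumulative sum over the sorted term counts; a sketch, assuming `total_counts` is a named vector of counts summed across the three sources:

```r
sorted   <- sort(total_counts, decreasing = TRUE)
coverage <- cumsum(sorted) / sum(sorted)

n50 <- which(coverage >= 0.5)[1]   # number of words covering 50% of all occurrences
n90 <- which(coverage >= 0.9)[1]   # number of words covering 90% of all occurrences
```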
In the next two sections we look at words that appear in all three documents and then those that are unique to each document.
I know that word clouds are viewed with the same disdain by many data scientists as pie charts are, but for a quick impression of the language used I think they can be justified here.
Top 60 words appearing in all three sources:
Top 60 words in blogs:
Top 60 words in news:
It looks like we have a rather local set of news feeds - Cuyahoga is a river and a county in Ohio.
Top 60 words in twitter feeds: (several offensive words removed from here)
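The clouds themselves can be drawn with the wordcloud package; a minimal sketch for the shared-vocabulary cloud, assuming a data frame `common_words` holding the words found in all three sources and their combined counts:

```r
library(wordcloud)

wordcloud(words = common_words$word,
          freq  = common_words$count,
          max.words = 60,                                  # top 60 words only
          colors = RColorBrewer::brewer.pal(8, "Dark2"))
```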
A Venn diagram of the number of words in each category and of the overlaps:

### 4.5 Characteristic Word Frequencies - another approach
Here we are looking for words characteristic of each group, using the tf-idf statistic (see http://tidytextmining.com/tfidf.html).
The table below summarises the news-feed data. It’s clearly compatible with the word cloud above.
## # A tibble: 26,272 × 6
## document term count tf idf
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 en_US.news.txt cuyahoga 27 0.00019195495 1.0986123
## 2 en_US.news.txt spokeswoman 23 0.00016351718 1.0986123
## 3 en_US.news.txt analysts 19 0.00013507941 1.0986123
## 4 en_US.news.txt superintendent 17 0.00012086053 1.0986123
## 5 en_US.news.txt winery 17 0.00012086053 1.0986123
## 6 en_US.news.txt authorities 44 0.00031281548 0.4054651
## 7 en_US.news.txt corp 16 0.00011375108 1.0986123
## 8 en_US.news.txt county's 15 0.00010664164 1.0986123
## 9 en_US.news.txt regulators 15 0.00010664164 1.0986123
## 10 en_US.news.txt trenton 13 0.00009242276 1.0986123
## # ... with 26,262 more rows, and 1 more variables: tf_idf <dbl>
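A sketch of how a table like this can be built with tidytext’s bind_tf_idf(), assuming `word_counts` is a data frame of per-document term counts with columns `document`, `term`, and `count`:

```r
library(dplyr)
library(tidytext)

news_tf_idf <- word_counts %>%
  bind_tf_idf(term, document, count) %>%   # adds tf, idf and tf_idf columns
  filter(document == "en_US.news.txt") %>%
  arrange(desc(tf_idf))
```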
The following charts show the words most characteristic of each source document.
The following two tables show the most frequent bigrams and trigrams.
## <<TermDocumentMatrix (terms: 2590, documents: 3)>>
## Non-/sparse entries: 7770/0
## Sparsity : 0%
## Maximal term length: 21
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## feel like 40 15 59
## high school 19 65 28
## last night 16 9 116
## last year 22 87 30
## looking forward 15 7 110
## new york 43 83 19
## p m 7 152 9
## right now 22 21 176
## u s 28 165 14
## years ago 53 52 17
## <<TermDocumentMatrix (terms: 46, documents: 3)>>
## Non-/sparse entries: 138/0
## Sparsity : 0%
## Maximal term length: 22
## Weighting : term frequency (tf)
## Sample :
## Docs
## Terms en_US.blogs.txt en_US.news.txt en_US.twitter.txt
## cinco de mayo 2 1 12
## four years ago 3 4 1
## just around corner 1 1 6
## just wanted say 1 1 6
## let us know 4 1 18
## looking forward seeing 1 2 8
## new york city 10 6 3
## new york times 9 6 1
## two years ago 2 10 2
## will take place 2 4 1
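The matrices above are tm TermDocumentMatrix objects; as an alternative illustration, n-gram counts can also be produced with tidytext, assuming a data frame `docs` with columns `document` and `text`:

```r
library(dplyr)
library(tidytext)

bigrams <- docs %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(document, bigram, sort = TRUE)

trigrams <- docs %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(document, trigram, sort = TRUE)
```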
The following section attempts to extract a list of potential French words from the text. The method uses a list of French words found on the web, plus a list of English words also found on the web; however, these are just for demonstration purposes and could probably be improved (and properly cited!).
The method is first to find the intersection of our list of words with the French list. This gives us words that could be French, but a large number of strings are valid words in both French and English, so it makes sense to remove all the possibly English ones.
This leaves us with a list of just over 330 words. However, we can’t be sure they are all intended to be French - some could be proper names or from other languages. And, of course, we will have filtered out some genuinely French words because they are identical to an English word.
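A sketch of the set operations involved, assuming character vectors `our_words` (the corpus vocabulary), `french_words`, and `english_words` (the two word lists found on the web):

```r
maybe_french    <- intersect(our_words, french_words)    # could be French...
probably_french <- setdiff(maybe_french, english_words)  # ...and not also valid English

length(probably_french)
sample(probably_french, 50)                              # random sample for inspection
```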
We’ve printed out a sample of the words detected.
## [1] "vida" "alsace" "tempe" "mallette"
## [5] "puy" "banderas" "rais" "ravigote"
## [9] "est" "accoutrements" "lazare" "banc"
## [13] "pers" "zona" "rodas" "blondie"
## [17] "occurences" "cale" "nui" "bayer"
## [21] "tue" "garces" "perron" "fertiliser"
## [25] "anglophone" "sens" "vitale" "injectable"
## [29] "broche" "grue" "las" "bomba"
## [33] "vite" "tourisme" "tel" "mai"
## [37] "sep" "gallo" "ramage" "catalan"
## [41] "maxime" "vas" "brandon" "ravi"
## [45] "tertre" "revit" "ravin" "gis"
## [49] "tris" "protections"
The initial plan is to build a Katz back-off model based on the combined data from all three documents.
It will run as a Shiny app allowing the user to type in a few words and see the prediction for the next one.
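As a very rough illustration of the back-off idea (a simple “stupid back-off” lookup rather than the full Katz model with discounting), assuming named count vectors `trigram_counts`, `bigram_counts`, and `unigram_counts` keyed by space-separated n-grams:

```r
# Predict the next word from the last two words typed: try trigrams first,
# back off to bigrams, and finally to the single most frequent word.
predict_next <- function(w1, w2, trigram_counts, bigram_counts, unigram_counts) {
  prefix3 <- paste(w1, w2, "")                       # "w1 w2 "
  tri <- trigram_counts[startsWith(names(trigram_counts), prefix3)]
  if (length(tri) > 0)
    return(substring(names(tri)[which.max(tri)], nchar(prefix3) + 1))

  prefix2 <- paste(w2, "")                           # "w2 "
  bi <- bigram_counts[startsWith(names(bigram_counts), prefix2)]
  if (length(bi) > 0)
    return(substring(names(bi)[which.max(bi)], nchar(prefix2) + 1))

  names(unigram_counts)[which.max(unigram_counts)]   # last resort: most common word
}
```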
I intend to spend a little more time on data cleaning; for example, I would like to look more closely at the removal of special characters and at the effect of removing stop words. There is one obvious problem with the output above - in some cases it looks like an apostrophe has been replaced by an “a” with a circumflex (â), which points to a character-encoding issue.
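One possible fix (not yet tested) is to declare the encoding when the files are read, or to convert explicitly with iconv; the file path below is illustrative:

```r
lines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines <- iconv(lines, from = "UTF-8", to = "ASCII//TRANSLIT", sub = "")  # drop/convert non-ASCII
```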
I’ll also create validation and test data sets to investigate the effect of different cleaning processes, and the effect (on accuracy, speed, and memory use) of varying the size of the data set used.