For the SwiftKey Capstone project, we must build a model that predicts the next word after a user types one or more words on a mobile device or tablet. This report describes how I investigated the test text data provided by SwiftKey.
I explored one of the English files in the test data. I chose a single file because I first needed to develop techniques for handling files of this size.
I first calculated line counts and word counts for all available English text files.
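As a rough illustration of this step, the sketch below counts lines and whitespace-separated words in a text file using base R only. The helper name `count_lines_words` and the temporary demo file are assumptions made for the example; they are not the procedure actually used for this report, and the real files would be passed in by path instead.

```r
# Hypothetical helper (not from the original report): count the lines and
# whitespace-separated words in a single text file.
count_lines_words <- function(path) {
  lines <- readLines(path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
  # Split each line on runs of whitespace; empty lines yield zero tokens
  tokens <- strsplit(trimws(lines), "\\s+")
  words <- sum(vapply(tokens, function(t) sum(nzchar(t)), integer(1)))
  data.frame(file = basename(path), lines = length(lines), words = words)
}

# Demo on a small temporary file so the sketch runs without the real data
demo_path <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "jumps over", "", "the lazy dog"), demo_path)
demo_counts <- count_lines_words(demo_path)
print(demo_counts)
```

Applying the helper to each English file with `lapply` and combining the results with `rbind` would give one summary row per file.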
Since these files are large, I created an R procedure that randomly extracts approximately 10% of the lines of the en_us_blogs.txt file into a shorter file. I then created plots showing the number of occurrences of the most popular words.
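One possible way to draw such a sample is to keep each line independently with probability 0.1, which yields roughly 10% of the lines. The sketch below assumes this approach; the helper name `sample_lines`, the seed, and the demo file are hypothetical and may differ from the procedure used for this report.

```r
# Hypothetical helper (an assumption, not the report's actual procedure):
# keep each line of in_path with probability `prob` and write the kept
# lines to out_path. Returns the number of lines kept.
sample_lines <- function(in_path, out_path, prob = 0.1, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  lines <- readLines(in_path, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
  keep <- rbinom(length(lines), size = 1, prob = prob) == 1
  writeLines(lines[keep], out_path)
  invisible(sum(keep))
}

# Demo on a generated 1000-line file rather than the real blog data
demo_in <- tempfile(fileext = ".txt")
demo_out <- tempfile(fileext = ".txt")
writeLines(paste("line", seq_len(1000)), demo_in)
n_kept <- sample_lines(demo_in, demo_out, prob = 0.1)
n_written <- length(readLines(demo_out))
```

This version reads the whole file into memory before sampling, which is workable for files of a few hundred megabytes; reading in chunks through a connection would be an alternative for larger inputs.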
Below is a table of all words with a count > 2000. There are also two graphs: the first covers all words with counts <= 500, and the second covers all words with counts > 500.
results %>% filter( Counts > 2000 ) %>% arrange( desc(Counts) )
## Terms Docs Counts
## 1 time short.txt 11713
## 2 make short.txt 8955
## 3 day short.txt 7980
## 4 year short.txt 7486
## 5 love short.txt 7218
## 6 peopl short.txt 6800
## 7 thing short.txt 6798
## 8 work short.txt 6759
## 9 back short.txt 5758
## 10 good short.txt 5706
## 11 book short.txt 4629
## 12 feel short.txt 4441
## 13 life short.txt 4433
## 14 week short.txt 4342
## 15 start short.txt 4311
## 16 made short.txt 3958
## 17 read short.txt 3858
## 18 live short.txt 3802
## 19 great short.txt 3548
## 20 find short.txt 3523
## 21 world short.txt 3430
## 22 friend short.txt 3383
## 23 call short.txt 3318
## 24 home short.txt 3228
## 25 don short.txt 3223
## 26 lot short.txt 3157
## 27 place short.txt 3153
## 28 end short.txt 3123
## 29 show short.txt 3048
## 30 post short.txt 2992
## 31 thought short.txt 2950
## 32 put short.txt 2922
## 33 blog short.txt 2918
## 34 part short.txt 2918
## 35 god short.txt 2879
## 36 person short.txt 2840
## 37 stori short.txt 2838
## 38 long short.txt 2789
## 39 today short.txt 2771
## 40 write short.txt 2762
## 41 give short.txt 2690
## 42 famili short.txt 2586
## 43 play short.txt 2577
## 44 turn short.txt 2499
## 45 set short.txt 2413
## 46 point short.txt 2403
## 47 night short.txt 2394
## 48 hous short.txt 2368
## 49 hope short.txt 2339
## 50 month short.txt 2314
## 51 bit short.txt 2313
## 52 word short.txt 2313
## 53 run short.txt 2294
## 54 hand short.txt 2253
## 55 found short.txt 2227
## 56 school short.txt 2199
## 57 head short.txt 2124
## 58 talk short.txt 2121
## 59 man short.txt 2117
## 60 move short.txt 2102
## 61 chang short.txt 2095
## 62 big short.txt 2051
## 63 state short.txt 2041
## 64 kid short.txt 2021
## 65 want short.txt 2004
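For readers who want to reproduce a table of this general shape, here is a minimal base-R sketch that tabulates term frequencies from a character vector of lines. It is an assumption about one way to do it, not the code behind the table above: the terms shown there appear to be stemmed (for example `peopl`, `stori`) with common stop words removed, and this sketch does neither. The helper name `count_terms` is hypothetical.

```r
# Hypothetical helper (not the report's code): lower-case the text, keep
# only letters and apostrophes, and tabulate how often each word occurs.
# No stemming or stop-word removal is performed here.
count_terms <- function(lines) {
  text <- gsub("[^a-z']+", " ", tolower(lines))
  tokens <- unlist(strsplit(trimws(text), "\\s+"))
  tokens <- tokens[nzchar(tokens)]
  freq <- table(tokens)
  out <- data.frame(Terms = names(freq), Counts = as.integer(freq),
                    stringsAsFactors = FALSE)
  # Most frequent first; ties broken alphabetically
  out[order(-out$Counts, out$Terms), ]
}

demo_terms <- count_terms(c("Time flies", "time and time again", "Make time"))
print(demo_terms)
```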
results0 <- results %>% filter( Counts <= 500 ) %>% arrange( desc(Counts) )
results500 <- results %>% filter( Counts > 500 ) %>% arrange( desc(Counts) )
ggplot(results0) + geom_histogram( aes( x = Counts), binwidth = 50 ) + ggtitle("Histogram of words with counts <= 500") + xlab("Word Counts") + ylab("")
ggplot(results500) + geom_histogram( aes( x = Counts), binwidth = 50 ) + ggtitle("Histogram of words with counts > 500") + xlab("Word Counts") + ylab("")
The next step is to explore all of the files by creating shorter files containing a random subset of lines. I will then build an n-gram (multi-word) model.
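To make the n-gram idea concrete, the sketch below builds the simplest case, a bigram (two-word) frequency table, and predicts the next word as the most frequent follower of the word just typed. It uses base R only; the helper names `build_bigrams` and `predict_next` and the toy sentences are assumptions for illustration, and the eventual model may be built quite differently (for instance with longer n-grams or smoothing).

```r
# Hypothetical helper: count adjacent word pairs within each line.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    clean <- gsub("[^a-z']+", " ", tolower(line))
    words <- unlist(strsplit(trimws(clean), "\\s+"))
    words <- words[nzchar(words)]
    if (length(words) < 2) return(NULL)  # no pair in a one-word line
    data.frame(first = head(words, -1), second = tail(words, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  pairs$n <- 1L
  counts <- aggregate(n ~ first + second, data = pairs, FUN = sum)
  # Within each first word, most frequent follower first
  counts[order(counts$first, -counts$n, counts$second), ]
}

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(bigrams, word) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$n)])
}

demo_bigrams <- build_bigrams(c("I love my dog", "I love my cat", "I love the sea"))
next_after_i <- predict_next(demo_bigrams, "I")
next_after_love <- predict_next(demo_bigrams, "love")
next_after_unseen <- predict_next(demo_bigrams, "zebra")
```

In the toy data, "love" always follows "i" and "my" follows "love" twice out of three times, so those are the predictions; an unseen word returns `NA`, which is where a fallback to overall word frequency could be added.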