Exploration of Word Counts for SwiftKey Capstone project

Introduction

For the SwiftKey Capstone project, we must build a model that predicts the next word after a user types one or more words on a mobile device or tablet. This report describes my initial exploration of the text data provided by SwiftKey.


Description of Technique

I explored one of the English files in the data set. I chose only one file because I first needed to develop techniques for handling files of this size.

I first calculated line counts and word counts for all available English text files.

en_US.blogs.txt – 899,288 lines; 37,334,089 words
en_US.news.txt – 1,010,242 lines; 34,372,530 words
en_US.twitter.txt – 2,360,148 lines; 30,373,582 words
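
The counting code is not shown in this report. A minimal base-R sketch of one way to get such counts follows; the helper name `count_lines_words` is hypothetical, and it treats a word as any whitespace-separated token, which may not match the definition used for the figures above.

```r
# Count lines and whitespace-separated words in a text file.
# `count_lines_words` is a hypothetical helper, not part of the original report.
count_lines_words <- function(path) {
  # Binary mode avoids early truncation on stray control characters
  con <- file(path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # Split each line on runs of whitespace; empty lines yield zero tokens
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  n_words <- sum(vapply(tokens, function(x) sum(nzchar(x)), integer(1)))
  c(lines = length(lines), words = n_words)
}

# Small demonstration on a temporary file
demo <- tempfile(fileext = ".txt")
writeLines(c("the quick brown fox", "", "jumps over"), demo)
counts <- count_lines_words(demo)
print(counts)
```

For the real data, the same function would be called with the path to each file, for example `count_lines_words("en_US.blogs.txt")`, assuming the file is in the working directory.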

Since these files are large, I created an R procedure that randomly selects approximately 10% of the lines of the en_US.blogs.txt file. I then created plots showing the number of occurrences of the most popular words.
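
The sampling procedure itself is not listed in this report. One way it could be sketched in base R is shown below, assuming each line is kept independently with probability 0.10; the function name `sample_lines`, its arguments, and the seed are illustrative assumptions.

```r
# Keep each line independently with probability `p`, so the output holds
# roughly p * 100 percent of the input lines.
# `sample_lines` and its arguments are hypothetical names.
sample_lines <- function(in_path, out_path, p = 0.10, seed = 1234) {
  set.seed(seed)  # fixed seed so the sample is reproducible
  con <- file(in_path, open = "rb")
  on.exit(close(con))
  lines <- readLines(con, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  # One Bernoulli draw per line decides whether the line is kept
  keep <- rbinom(length(lines), size = 1, prob = p) == 1
  writeLines(lines[keep], out_path, useBytes = TRUE)
  invisible(sum(keep))  # number of lines written
}

# Demonstration on a synthetic 1000-line file
src <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), src)
dst <- tempfile(fileext = ".txt")
n_kept <- sample_lines(src, dst, p = 0.10)
```

Because each line is drawn independently, the sample size is only approximately 10% of the input. On the real data the call might look like `sample_lines("en_US.blogs.txt", "short.txt")`.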

Exploration

Below is a table of all words with a count > 2000. The terms are shown in stemmed form, so "people" appears as "peopl" and "story" as "stori". Two graphs follow: the first covers all words with counts <= 500, and the second covers all words with counts > 500.

results %>% filter( Counts > 2000 ) %>% arrange( desc(Counts) )

##      Terms      Docs Counts
## 1     time short.txt  11713
## 2     make short.txt   8955
## 3      day short.txt   7980
## 4     year short.txt   7486
## 5     love short.txt   7218
## 6    peopl short.txt   6800
## 7    thing short.txt   6798
## 8     work short.txt   6759
## 9     back short.txt   5758
## 10    good short.txt   5706
## 11    book short.txt   4629
## 12    feel short.txt   4441
## 13    life short.txt   4433
## 14    week short.txt   4342
## 15   start short.txt   4311
## 16    made short.txt   3958
## 17    read short.txt   3858
## 18    live short.txt   3802
## 19   great short.txt   3548
## 20    find short.txt   3523
## 21   world short.txt   3430
## 22  friend short.txt   3383
## 23    call short.txt   3318
## 24    home short.txt   3228
## 25     don short.txt   3223
## 26     lot short.txt   3157
## 27   place short.txt   3153
## 28     end short.txt   3123
## 29    show short.txt   3048
## 30    post short.txt   2992
## 31 thought short.txt   2950
## 32     put short.txt   2922
## 33    blog short.txt   2918
## 34    part short.txt   2918
## 35     god short.txt   2879
## 36  person short.txt   2840
## 37   stori short.txt   2838
## 38    long short.txt   2789
## 39   today short.txt   2771
## 40   write short.txt   2762
## 41    give short.txt   2690
## 42  famili short.txt   2586
## 43    play short.txt   2577
## 44    turn short.txt   2499
## 45     set short.txt   2413
## 46   point short.txt   2403
## 47   night short.txt   2394
## 48    hous short.txt   2368
## 49    hope short.txt   2339
## 50   month short.txt   2314
## 51     bit short.txt   2313
## 52    word short.txt   2313
## 53     run short.txt   2294
## 54    hand short.txt   2253
## 55   found short.txt   2227
## 56  school short.txt   2199
## 57    head short.txt   2124
## 58    talk short.txt   2121
## 59     man short.txt   2117
## 60    move short.txt   2102
## 61   chang short.txt   2095
## 62     big short.txt   2051
## 63   state short.txt   2041
## 64     kid short.txt   2021
## 65    want short.txt   2004
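
This report does not show how the `results` data frame was built. As a rough illustration only, a data frame with the same three columns could be produced in base R as sketched below. The helper `count_terms` is hypothetical, and it performs no stemming or stop-word removal, so its terms would differ from those in the table.

```r
# Build a word-frequency table from a character vector of lines.
# This base-R sketch lowercases and splits on non-letters only; it does not
# stem words or drop stop words, so its terms will differ from the table.
# `count_terms` is a hypothetical helper name.
count_terms <- function(lines, doc = "short.txt") {
  words <- unlist(strsplit(tolower(lines), "[^a-z]+"))
  words <- words[nzchar(words)]  # drop empty tokens left by the split
  freq <- table(words)
  data.frame(
    Terms  = as.character(names(freq)),
    Docs   = rep(doc, length(freq)),
    Counts = as.integer(freq),
    stringsAsFactors = FALSE
  )
}

# Demonstration on a few toy lines
toy <- c("Time flies", "make time for the day", "Day after day")
results_toy <- count_terms(toy)
print(results_toy[order(-results_toy$Counts), ])
```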

results0 <- results %>% filter( Counts <= 500 ) %>% arrange( desc(Counts) )
results500 <- results %>% filter( Counts > 500 ) %>% arrange( desc(Counts) )
ggplot(results0) + geom_histogram( aes( x = Counts), binwidth = 50 ) + ggtitle("Histogram of words with counts <= 500") + xlab("Word Counts") + ylab("")

ggplot(results500) + geom_histogram( aes( x = Counts), binwidth = 50 ) + ggtitle("Histogram of words with counts > 500") + xlab("Word Counts") + ylab("")

Next Steps

The next step is to explore all of the files by creating shorter files that each contain a random subset of lines. I will then build an n-gram (multi-word) model.
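
To make the n-gram idea concrete, here is a minimal base-R sketch of the simplest case, a bigram (two-word) model that predicts the most frequent follower of a given word. It is an illustration under simplifying assumptions, not the planned implementation: the helper names `build_bigrams` and `predict_next` are hypothetical, there is no smoothing or backoff, and the input is assumed to contain at least one line with two or more words.

```r
# Count adjacent word pairs (bigrams) within each line.
# `build_bigrams` and `predict_next` are hypothetical helper names.
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- unlist(strsplit(tolower(line), "[^a-z']+"))
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)  # a line needs two words to form a pair
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  # Sum a constant 1 per pair to get the count of each (first, second) pair
  counts <- aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
  counts[order(counts$first, -counts$n), ]
}

# Return the word that most often follows `word`, or NA if `word` is unseen.
predict_next <- function(bigrams, word) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  if (nrow(hits) == 0) return(NA_character_)
  as.character(hits$second[which.max(hits$n)])
}

# Demonstration on toy sentences
toy <- c("I love my dog", "I love my cat", "I love the book")
bigrams <- build_bigrams(toy)
print(predict_next(bigrams, "love"))
```

In the toy data "love" is followed by "my" twice and by "the" once, so the sketch predicts "my". Longer n-grams extend the same counting idea to sequences of three or more words.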

Alim Ray
