This is a preliminary report for the Data Science Specialization Capstone on Coursera. The purpose of this report is to explore the data in preparation for predictive modeling. The R Markdown file that generated this report is available on GitHub at https://github.com/jnd18/capstone-milestone.
The data for the capstone project is available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. Inside the zipped archive is a directory containing three English-language text files, which were gathered from public websites by a web crawler.
The three files correspond to three types of text sources: Twitter, news websites, and blogs. The Twitter file is 159 MB, the news file is 196 MB, and the blogs file is 200 MB.
First, some basic descriptions. The Twitter file contains 2,360,148 lines and 30,093,369 words. The news file contains 77,259 lines and 2,674,536 words. The blogs file contains 899,288 lines and 37,546,246 words. Below are tables displaying the ten most frequent words in each source, with common so-called “stop words” removed so that the lists better illustrate what makes each source distinctive. The top words for each source look very different.
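As a rough illustration of how counts like these might be produced, here is a minimal sketch using the tidytext and dplyr packages; the file name is assumed, and this is not the exact code behind this report.

```r
library(dplyr)
library(tidytext)

# Read one source (file name assumed) and tabulate word frequencies
lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

word_counts <- tibble(text = lines) %>%
  unnest_tokens(word, text) %>%          # one lowercased word per row
  anti_join(stop_words, by = "word") %>% # drop common stop words
  count(word, sort = TRUE)               # frequency table, most frequent first

head(word_counts, 10)                    # top ten words for this source
```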
Twitter:

| rank | word | occurrences |
|---|---|---|
| 1 | love | 106721 |
| 2 | day | 91710 |
| 3 | rt | 89537 |
| 4 | time | 76794 |
| 5 | lol | 70133 |
| 6 | 3 | 54940 |
| 7 | people | 52040 |
| 8 | happy | 48998 |
| 9 | follow | 48104 |
| 10 | 2 | 45515 |
News:

| rank | word | occurrences |
|---|---|---|
| 1 | time | 4474 |
| 2 | people | 3673 |
| 3 | 1 | 2994 |
| 4 | city | 2902 |
| 5 | school | 2702 |
| 6 | percent | 2635 |
| 7 | game | 2589 |
| 8 | day | 2477 |
| 9 | home | 2438 |
| 10 | 2 | 2434 |
Blogs:

| rank | word | occurrences |
|---|---|---|
| 1 | time | 90918 |
| 2 | people | 59574 |
| 3 | day | 52372 |
| 4 | love | 45230 |
| 5 | life | 41251 |
| 6 | it’s | 38657 |
| 7 | 1 | 30907 |
| 8 | 2 | 29561 |
| 9 | world | 29305 |
| 10 | i’m | 29189 |
Below, we have histograms of the word frequencies for each source. The histograms show that most words appear only a few times, while a handful of words appear very frequently. Even with the common words removed, the distribution of frequencies is highly right-skewed.
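A plot along these lines could be produced from the `word_counts` table in the sketch above, for example with ggplot2 (again a sketch, not the exact plotting code used here):

```r
library(ggplot2)

# Word frequencies are heavily right-skewed, so use a log scale on the x-axis
ggplot(word_counts, aes(x = n)) +
  geom_histogram(bins = 50) +
  scale_x_log10() +
  labs(x = "number of occurrences", y = "number of distinct words")
```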
Next we have the bigram frequency tables for the three sources, again with common words removed. A bigram is just a pair of adjacent words. The Twitter table shows two forms of “Mother’s Day”; perhaps the data were collected around Mother’s Day. The news table shows many city names. The blog table has a few pairs of numbers, which we believe come from fractions used in recipes, like “1/2 cup”. The histograms of bigram frequencies look almost identical to the ones above, except with even more mass on the left; it makes sense that even more pairs of words appear only once. Thus, we won’t bother to display them.
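The bigram counts can be computed much like the word counts, using the n-gram tokenizer in tidytext; setting `n = 3` gives the trigram counts shown later. This is a sketch that assumes the `lines` and `stop_words` objects from the earlier example.

```r
library(tidyr)

# Count bigrams, dropping any bigram that contains a stop word
bigram_counts <- tibble(text = lines) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%                      # short lines yield NA bigrams
  separate(bigram, into = c("w1", "w2"), sep = " ") %>%
  filter(!w1 %in% stop_words$word,
         !w2 %in% stop_words$word) %>%
  unite(bigram, w1, w2, sep = " ") %>%
  count(bigram, sort = TRUE)
```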
Twitter:

| rank | bigram | occurrences |
|---|---|---|
| 1 | happy birthday | 8389 |
| 2 | social media | 3886 |
| 3 | mother’s day | 2874 |
| 4 | stay tuned | 2657 |
| 5 | mothers day | 2572 |
| 6 | san diego | 2232 |
| 7 | rt rt | 2102 |
| 8 | happy friday | 1952 |
| 9 | 1 2 | 1918 |
| 10 | ice cream | 1899 |
News:

| rank | bigram | occurrences |
|---|---|---|
| 1 | st louis | 701 |
| 2 | los angeles | 436 |
| 3 | san francisco | 381 |
| 4 | 30 p.m | 354 |
| 5 | health care | 317 |
| 6 | 1 2 | 227 |
| 7 | san diego | 219 |
| 8 | vice president | 219 |
| 9 | white house | 179 |
| 10 | 7 p.m | 167 |
Blogs:

| rank | bigram | occurrences |
|---|---|---|
| 1 | 1 2 | 3974 |
| 2 | weeks ago | 1606 |
| 3 | ice cream | 1585 |
| 4 | 1 4 | 1465 |
| 5 | social media | 1342 |
| 6 | jesus christ | 1314 |
| 7 | south africa | 1153 |
| 8 | real life | 1145 |
| 9 | 3 4 | 1109 |
| 10 | 10 minutes | 1072 |
Finally, we have the trigram tables. For the Twitter data, we see many holidays, all of which occur early in the year. We also, amusingly, see “cake cake cake.” For the news data, we see different times of day as well as names and titles. For the blog data, we see many cooking measurements. Interestingly, “world war ii” appears frequently both in the news and on blogs, but not on Twitter. Again, the histograms just look like one huge spike on the left, so we won’t display them.
Twitter:

| rank | trigram | occurrences |
|---|---|---|
| 1 | happy mothers day | 1743 |
| 2 | happy mother’s day | 1582 |
| 3 | cinco de mayo | 1002 |
| 4 | st patrick’s day | 414 |
| 5 | love love love | 412 |
| 6 | ha ha ha | 363 |
| 7 | cake cake cake | 341 |
| 8 | happy valentine’s day | 341 |
| 9 | ralph waldo emerson | 318 |
| 10 | happy valentines day | 298 |
News:

| rank | trigram | occurrences |
|---|---|---|
| 1 | president barack obama | 95 |
| 2 | 7 30 p.m | 77 |
| 3 | st louis county | 76 |
| 4 | gov chris christie | 66 |
| 5 | world war ii | 53 |
| 6 | 11 30 a.m | 49 |
| 7 | 6 30 p.m | 42 |
| 8 | 1 1 2 | 41 |
| 9 | chief financial officer | 40 |
| 10 | 1 2 cup | 39 |
Blogs:

| rank | trigram | occurrences |
|---|---|---|
| 1 | 1 2 cup | 710 |
| 2 | 1 4 cup | 462 |
| 3 | 1 1 2 | 461 |
| 4 | amazon services llc | 427 |
| 5 | world war ii | 310 |
| 6 | 1 2 tsp | 266 |
| 7 | 2 1 2 | 262 |
| 8 | amp amp amp | 250 |
| 9 | lord jesus christ | 219 |
| 10 | amazon eu associates | 213 |
Using the basic techniques we’ve developed for exploring the data, we can create a simple n-gram model. To predict the next word in a sequence, we look at the previous few words and pick the word that most commonly appears after them in the text data. We would not remove stop words in this context. We could use some kind of smoothing, like adding 1 to each word count. We could also use a backoff strategy: for example, when predicting the next word from a two-word context, if that context is common, use the most frequent third word; if it is uncommon, back off and predict based on the most recent word alone.
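To make the backoff idea concrete, here is a minimal sketch of a predictor built on top of trigram and bigram count tables (with stop words kept). The table names and column layout are assumptions for illustration, and smoothing is omitted for brevity.

```r
library(dplyr)

# Assumed inputs:
#   trigrams: data frame with columns w1, w2, w3, n (count of each 3-word sequence)
#   bigrams:  data frame with columns w1, w2, n     (count of each 2-word sequence)
predict_next <- function(prev2, prev1, trigrams, bigrams, min_count = 2) {
  # Try the trigram table first: most frequent w3 following (prev2, prev1)
  hit <- trigrams %>%
    filter(w1 == prev2, w2 == prev1, n >= min_count) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) == 1) return(hit$w3)

  # Back off to the bigram table: most frequent w2 following prev1
  hit <- bigrams %>%
    filter(w1 == prev1) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) == 1) return(hit$w2)

  "the"  # final fallback: an extremely common word
}

# Example call (with hypothetical tables):
# predict_next("thanks", "for", trigrams, bigrams)
```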
With n-gram models, it can be tricky to achieve high predictive performance within reasonable computational constraints. For this reason, we may try the alternative approach of training a recurrent neural network to predict the next character in a sequence of characters. This would front-load the computational burden into training time and, we hope, deliver a more responsive app once the model is trained. Another benefit of this approach is that predicting on the basis of characters allows us to complete partial words, just like a real predictive keyboard.
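For completeness, here is a rough sketch of what such a character-level model might look like using the keras package for R. The vocabulary size, context length, and layer sizes are placeholder values, not a final architecture.

```r
library(keras)

vocab_size  <- 100  # placeholder: number of distinct characters in the corpus
context_len <- 40   # placeholder: characters of context fed to the model

# Map a window of character ids to a distribution over the next character
model <- keras_model_sequential() %>%
  layer_embedding(input_dim = vocab_size, output_dim = 32,
                  input_length = context_len) %>%
  layer_lstm(units = 128) %>%
  layer_dense(units = vocab_size, activation = "softmax")

model %>% compile(
  loss = "sparse_categorical_crossentropy",  # targets are integer character ids
  optimizer = "adam",
  metrics = "accuracy"
)
```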