A. Background

This project aims to create a predictive text application.

For example, when someone types the first five (5) words of a blog post, tweet, or news article, the application will offer three (3) options for what the next word might be.

The app will be built with Shiny, an RStudio package for creating interactive web applications with R.

SwiftKey, our corporate partner in this capstone, provided the training data. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices.
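
To give a sense of the planned delivery, here is a minimal Shiny skeleton; predictNextWords is a hypothetical placeholder for the model that will eventually be trained on these data, and the layout is not the final design.

library(shiny)

## Hypothetical stand-in for the trained model: returns three
## candidate next words for the text typed so far.
predictNextWords = function(text) c("the", "to", "a")

ui = fluidPage(
  textInput("phrase", "Type a phrase:"),
  verbatimTextOutput("suggestions")
)

server = function(input, output) {
  output$suggestions = renderPrint(predictNextWords(input$phrase))
}

shinyApp(ui = ui, server = server)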

B. Loading and Processing the Raw Data

B.1 Load the data

Let us first load the data by source (file), then split each source into training and test sets.

Blogs

blogsCon = file("en_US.blogs.txt", open = "r")  ## open a connection to the raw file
blogsLine = readLines(blogsCon)
close(blogsCon)

set.seed(1)  ## for a reproducible sample
indexes = sample(1:length(blogsLine), size = 0.2*length(blogsLine))
train = blogsLine[indexes]  ## training set: a 20% sample
test = blogsLine[-indexes]  ## held-out test set: the remaining 80%
blogsLine = train

Here are the first two lines of the blogs training set:

[1] "Writing not yet strong enough: if I look at a first page and find five cases of the dreaded double verbing (was running, were singing, am walkingâ<U+0080>¦ try ran, sang, walk etc.), Iâ<U+0080><U+0099>m guessing the rest of the manuscript is like that too. Iâ<U+0080><U+0099>m not saying the occasional double verbing is the end of everything, but thereâ<U+0080><U+0099>s a difference between a double verb every twenty pages and twenty on one page. Oh, and this includes you first person present tense writers. Double verbs are bad. Read Hunger Games and tell me how many double verbs she hasâ<U+0080>¦I think first person present was the biggest addition to my No category for double verbing alone (okay, I think you all get the picture, two verbs are much weaker than one, go forth and disseminate to all your friends). Weak writing was the number one trigger for making me pounce on the no."
[2] "Roerich said â<U+0080><U+009C>through Beauty we pray, with Beauty we conquerâ<U+0080>"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     

News

newsCon = file("en_US.news.txt", open = "r")  ## open a connection to the raw file
newsLine = readLines(newsCon)
close(newsCon)

set.seed(1)  ## for a reproducible sample
indexes = sample(1:length(newsLine), size = 0.2*length(newsLine))
train = newsLine[indexes]  ## training set: a 20% sample
test = newsLine[-indexes]  ## held-out test set: the remaining 80%
newsLine = train

Here are the first two lines of the news training set:

[1] "The writer is a professor of human rights at American University."                                                                                                                           
[2] "On Feb. 1, we ran the bulk online at money.azcentral.com. A number also will be published Feb. 14 in The Arizona Republic. Unfortunately, we can only offer a sampling of what is out there."

Twitter

twitterCon = file("en_US.twitter.txt", open = "r")  ## open a connection to the raw file
twitterLine = readLines(twitterCon)
close(twitterCon)

set.seed(1)  ## for a reproducible sample
indexes = sample(1:length(twitterLine), size = 0.2*length(twitterLine))
train = twitterLine[indexes]  ## training set: a 20% sample
test = twitterLine[-indexes]  ## held-out test set: the remaining 80%
twitterLine = train

Here are the first two lines of the twitter training set:

[1] "When you finally do reach the age where a woman's personality is what attracts you it won't matter- they stop believing that years earlier."
[2] "You'd think that appl could afford a dedicated wifi ap and Internet link for their demos!"                                                  

B.2 Clean the data

Since the analysis will rely heavily on word frequencies, we will:

  • convert all letters to lower case
  • remove periods and other punctuation metacharacters

blogsLine = tolower(blogsLine)  ## convert all letters to lower case
newsLine = tolower(newsLine)
twitterLine = tolower(twitterLine)

blogsLine = gsub(".", "", blogsLine, fixed = TRUE)  ## remove periods
newsLine = gsub(".", "", newsLine, fixed = TRUE)
twitterLine = gsub(".", "", twitterLine, fixed = TRUE)

library(stringr)
blogsLine = str_replace_all(blogsLine, "[[:punct:]]", "")  ## remove other punctuation
newsLine = str_replace_all(newsLine, "[[:punct:]]", "")
twitterLine = str_replace_all(twitterLine, "[[:punct:]]", "")
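
To illustrate, here is a made-up sample line run through the same three steps (the string is hypothetical, chosen only to show the effect):

sample = "Don't worry. It's O.K.!"
sample = tolower(sample)                      ## "don't worry. it's o.k.!"
sample = gsub(".", "", sample, fixed = TRUE)  ## "don't worry it's ok!"
str_replace_all(sample, "[[:punct:]]", "")
[1] "dont worry its ok"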

C. Exploratory Data Analysis

C.1 Blogs

Line count

[1] 179857

Word count

blogsWord = strsplit(blogsLine, " ", perl = TRUE)  ## split each line on spaces
blogsWord = unlist(blogsWord)
length(blogsWord)  ## total word count
[1] 7444634

First 10 words

 [1] "writing" "not"     "yet"     "strong"  "enough"  "if"      "i"      
 [8] "look"    "at"      "a"      

Distribution of Word Frequencies - first 10 rows from highest to lowest frequency

   word   freq CumFreq CumPercentage wordRank
1   the 369234  369234         0.050        1
2   and 216087  585321         0.079        2
3    to 212408  797729         0.107        3
4     a 179038  976767         0.131        4
5    of 174234 1151001         0.155        5
6     i 151608 1302609         0.175        6
7    in 118720 1421329         0.191        7
8  that  91196 1512525         0.203        8
9    is  86131 1598656         0.215        9
10   it  79386 1678042         0.225       10

NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency).

Histogram of the Word Frequency Ranking

  • The 114 most frequent words account for 50% of all word occurrences, i.e. the median of the cumulative distribution

  • The 7771 most frequent words account for 90% of all word occurrences (a sketch of this computation follows below)
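
The code that builds the frequency table and these percentile ranks is not shown in this report; a minimal sketch that reproduces the columns above from blogsWord (and applies unchanged to newsWord and twitterWord) is:

freqTable = sort(table(blogsWord), decreasing = TRUE)  ## counts, highest first
blogsFreq = data.frame(word = names(freqTable),
                       freq = as.integer(freqTable),
                       stringsAsFactors = FALSE)
blogsFreq$CumFreq = cumsum(blogsFreq$freq)
cumShare = blogsFreq$CumFreq/sum(blogsFreq$freq)  ## cumulative share of all occurrences
blogsFreq$CumPercentage = round(cumShare, 3)
blogsFreq$wordRank = 1:nrow(blogsFreq)

which(cumShare >= 0.5)[1]  ## smallest rank covering 50% of occurrences (114)
which(cumShare >= 0.9)[1]  ## smallest rank covering 90% of occurrences (7771)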

C.2 News

Line count

[1] 15451

NOTE: this is far fewer lines than the other two sources. On some platforms readLines stops early on en_US.news.txt at an embedded control character; reading the file through a binary-mode connection (open = "rb") may be needed to load it in full.

Word count

newsWord = strsplit(newsLine, " ", perl = TRUE)  ## split each line on spaces
newsWord = unlist(newsWord)
length(newsWord)  ## total word count
[1] 528151

First 10 words

 [1] "the"       "writer"    "is"        "a"         "professor"
 [6] "of"        "human"     "rights"    "at"        "american" 

Distribution of Word Frequencies - first 10 rows from highest to lowest frequency

   word  freq CumFreq CumPercentage wordRank
1   the 30386   30386         0.058        1
2    to 13898   44284         0.084        2
3   and 13454   57738         0.109        3
4     a 13217   70955         0.134        4
5    of 11734   82689         0.157        5
6    in 10203   92892         0.176        6
7   for  5413   98305         0.186        7
8  that  5349  103654         0.196        8
9    is  4380  108034         0.205        9
10   on  4112  112146         0.212       10

NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency).

Histogram of the Word Frequency Ranking

  • The 216 most frequent words account for 50% of all word occurrences, i.e. the median of the cumulative distribution

  • The 8986 most frequent words account for 90% of all word occurrences

C.3 Twitter

Line count

[1] 472029

Word count

twitterWord = strsplit(twitterLine, " ", perl = TRUE)  ## split each line on spaces
twitterWord = unlist(twitterWord)
length(twitterWord)  ## total word count
[1] 6041948

First 10 words

 [1] "when"    "you"     "finally" "do"      "reach"   "the"     "age"    
 [8] "where"   "a"       "womans" 

Distribution of Word Frequencies - first 10 rows from highest to lowest frequency

   word   freq CumFreq CumPercentage wordRank
1   the 187299  187299         0.031        1
2    to 157703  345002         0.057        2
3     i 143161  488163         0.081        3
4     a 121244  609407         0.101        4
5   you 108688  718095         0.119        5
6   and  86967  805062         0.133        6
7        78108  883170         0.146        7
8   for  77074  960244         0.159        8
9    in  75168 1035412         0.171        9
10   of  71732 1107144         0.183       10

NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency). The blank word at rank 7 is the empty string that strsplit produces wherever a line has leading or consecutive spaces; it could be filtered out in a later cleaning pass.

Histogram of the Word Frequency Ranking

  • The 123 most frequent words account for 50% of all word occurrences, i.e. the median of the cumulative distribution

  • The 6175 most frequent words account for 90% of all word occurrences