This project aims to create an application for predictive texting.
For example, when someone types the first five words of a blog post, tweet, or news article, the application will provide three options for what the next word might be.
The app will be built with Shiny, an RStudio package for creating interactive web applications in R.
SwiftKey, our corporate partner in this capstone, provided the training data. SwiftKey builds a smart keyboard that makes it easier for people to type on their mobile devices.
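To illustrate the goal, here is a minimal sketch of the kind of lookup the finished app will need. It assumes a hypothetical bigram frequency table bigramFreq (columns word1, word2, freq) that would be built later from the training data; it is not part of the analysis below:
predictNext = function(lastWord, bigramFreq, n = 3) {
  ##keep only bigrams that start with the last typed word
  candidates = bigramFreq[bigramFreq$word1 == tolower(lastWord), ]
  ##return the n most frequent continuations (three options by default)
  candidates = candidates[order(candidates$freq, decreasing = TRUE), ]
  head(candidates$word2, n)
}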
Let us first load the data from each source file and then split it into training and test data sets.
Blogs
blogsCon = file("en_US.blogs.txt",open="r") ##load the data
blogsLine = readLines(blogsCon)
close(blogsCon)
set.seed(1)
indexes = sample(1:length(blogsLine), size=0.2*length(blogsLine))
train = blogsLine[indexes] ##training data set
test = blogsLine[-indexes]
blogsLine = train
Here are the first two lines of the blogs training data set:
[1] "Writing not yet strong enough: if I look at a first page and find five cases of the dreaded double verbing (was running, were singing, am walkingâ<U+0080>¦ try ran, sang, walk etc.), Iâ<U+0080><U+0099>m guessing the rest of the manuscript is like that too. Iâ<U+0080><U+0099>m not saying the occasional double verbing is the end of everything, but thereâ<U+0080><U+0099>s a difference between a double verb every twenty pages and twenty on one page. Oh, and this includes you first person present tense writers. Double verbs are bad. Read Hunger Games and tell me how many double verbs she hasâ<U+0080>¦I think first person present was the biggest addition to my No category for double verbing alone (okay, I think you all get the picture, two verbs are much weaker than one, go forth and disseminate to all your friends). Weak writing was the number one trigger for making me pounce on the no."
[2] "Roerich said â<U+0080><U+009C>through Beauty we pray, with Beauty we conquerâ<U+0080>"
News
newsCon = file("en_US.news.txt",open="r")
newsLine = readLines(newsCon)
close(newsCon)
set.seed(1)
indexes = sample(1:length(newsLine), size=0.2*length(newsLine))
train = newsLine[indexes] ##training data set
test = newsLine[-indexes]
newsLine = train
Here are the first two lines of the news training data set:
[1] "The writer is a professor of human rights at American University."
[2] "On Feb. 1, we ran the bulk online at money.azcentral.com. A number also will be published Feb. 14 in The Arizona Republic. Unfortunately, we can only offer a sampling of what is out there."
Twitter
twitterCon = file("en_US.twitter.txt",open="r")
twitterLine = readLines(twitterCon)
close(twitterCon)
set.seed(1)
indexes = sample(1:length(twitterLine), size=0.2*length(twitterLine))
train = twitterLine[indexes] ##training data set
test = twitterLine[-indexes]
twitterLine = train
Here are the first two lines of the Twitter training data set:
head(twitterLine,2)
[1] "When you finally do reach the age where a woman's personality is what attracts you it won't matter- they stop believing that years earlier."
[2] "You'd think that appl could afford a dedicated wifi ap and Internet link for their demos!"
Since the analysis of word frequencies will be central to the model, we convert all text to lower case and remove punctuation:
blogsLine = tolower(blogsLine) ##convert all text to lower case
newsLine = tolower(newsLine)
twitterLine = tolower(twitterLine)
blogsLine = gsub(".","",blogsLine,fixed = TRUE) ##remove periods
newsLine = gsub(".","",newsLine,fixed = TRUE)
twitterLine = gsub(".","",twitterLine,fixed = TRUE)
library(stringr)
blogsLine = str_replace_all(blogsLine, "[[:punct:]]", "") ##remove remaining punctuation
newsLine = str_replace_all(newsLine, "[[:punct:]]", "")
twitterLine = str_replace_all(twitterLine, "[[:punct:]]", "")
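Since the same cleaning steps are applied to each corpus, they could also be wrapped in a small helper. This is just a sketch of an equivalent refactoring (using the stringr package loaded above), not the code that produced the results below:
cleanText = function(x) {
  x = tolower(x) ##lower case
  str_replace_all(x, "[[:punct:]]", "") ##strip all punctuation, periods included
}
## e.g. blogsLine = cleanText(blogsLine), and likewise for newsLine and twitterLine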
Blogs
Line Count
[1] 179857
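This is presumably just the number of elements in the blogs training vector:
length(blogsLine) ##line count of the blogs training set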
Word count
blogsWord = strsplit(blogsLine," ", perl=TRUE) ##to split by word
blogsWord = unlist(blogsWord)
length(blogsWord) ##word count
[1] 7444634
First 10 words
[1] "writing" "not" "yet" "strong" "enough" "if" "i"
[8] "look" "at" "a"
Distribution of Word Frequencies - first 10 rows from highest to lowest frequency
word freq CumFreq CumPercentage wordRank
1 the 369234 369234 0.050 1
2 and 216087 585321 0.079 2
3 to 212408 797729 0.107 3
4 a 179038 976767 0.131 4
5 of 174234 1151001 0.155 5
6 i 151608 1302609 0.175 6
7 in 118720 1421329 0.191 7
8 that 91196 1512525 0.203 8
9 is 86131 1598656 0.215 9
10 it 79386 1678042 0.225 10
NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency).
Histogram of the Word Frequency Ranking
The 114 most frequent words account for 50% of all word occurrences in the blogs training set (the median of the distribution).
The 7771 most frequent words account for 90% of all word occurrences.
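The frequency table and the 50%/90% cut-offs above could be computed along the following lines; this is a sketch using the blogsWord vector from above, and the news and Twitter sections below are analogous:
wordFreq = sort(table(blogsWord), decreasing = TRUE) ##count and rank each word
blogsFreq = data.frame(word = names(wordFreq), freq = as.integer(wordFreq))
blogsFreq$CumFreq = cumsum(blogsFreq$freq) ##running total of word occurrences
blogsFreq$CumPercentage = round(blogsFreq$CumFreq/sum(blogsFreq$freq), 3)
blogsFreq$wordRank = 1:nrow(blogsFreq)
head(blogsFreq, 10) ##ten most frequent words
min(which(blogsFreq$CumPercentage >= 0.5)) ##words needed to cover 50% of occurrences
min(which(blogsFreq$CumPercentage >= 0.9)) ##words needed to cover 90%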
News
Line Count
[1] 15451
Word count
newsWord = strsplit(newsLine," ", perl=TRUE) ##to split by word
newsWord = unlist(newsWord)
length(newsWord) ##word count
[1] 528151
First 10 words
[1] "the" "writer" "is" "a" "professor"
[6] "of" "human" "rights" "at" "american"
Distribution of Word Frequencies - first 10 rows from highest to lowest frequency
word freq CumFreq CumPercentage wordRank
1 the 30386 30386 0.058 1
2 to 13898 44284 0.084 2
3 and 13454 57738 0.109 3
4 a 13217 70955 0.134 4
5 of 11734 82689 0.157 5
6 in 10203 92892 0.176 6
7 for 5413 98305 0.186 7
8 that 5349 103654 0.196 8
9 is 4380 108034 0.205 9
10 on 4112 112146 0.212 10
NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency).
Histogram of the Word Frequency Ranking
The 216 most frequent words account for 50% of all word occurrences in the news training set (the median of the distribution).
The 8986 most frequent words account for 90% of all word occurrences.
Twitter
Line Count
[1] 472029
Word count
twitterWord = strsplit(twitterLine," ", perl=TRUE) ##to split by word
twitterWord = unlist(twitterWord)
length(twitterWord) ##word count
[1] 6041948
First 10 words
[1] "when" "you" "finally" "do" "reach" "the" "age"
[8] "where" "a" "womans"
Distribution of Word Frequencies - first 10 rows from highest to lowest frequency
word freq CumFreq CumPercentage wordRank
1 the 187299 187299 0.031 1
2 to 157703 345002 0.057 2
3 i 143161 488163 0.081 3
4 a 121244 609407 0.101 4
5 you 108688 718095 0.119 5
6 and 86967 805062 0.133 6
7 78108 883170 0.146 7
8 for 77074 960244 0.159 8
9 in 75168 1035412 0.171 9
10 of 71732 1107144 0.183 10
NOTE: wordRank is the corresponding word frequency ranking (from highest to lowest frequency).
Histogram of the Word Frequency Ranking
The 123 most frequent words account for 50% of all word occurrences in the Twitter training set (the median of the distribution).
The 6175 most frequent words account for 90% of all word occurrences.