milestoneproj.Rmd
Prepared by Marcel Merchat
September 4, 2016
The purpose of this milestone report is to demonstrate progress on key project tasks such as the following:
Download the US English data files and import them into the program.
Provide summary statistics about the data sets.
Outline a plan for a prediction algorithm and Shiny Application.
The data is unzipped into a folder called Data.
setwd("~/edu/Data Science/capstone/Project")
## fileUrl <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
## download.file(fileUrl, dest="zipData.zip")
## unzip("zipData.zip", exdir=".")
There are folders for four languages. The English folder contains three files of Twitter chat, news, and blog text as follows:
en_US.twitter.txt en_US.news.txt en_US.blogs.txt
The Twitter data consists of over 2 million lines of chat. There are also about 77,000 lines of news text and about 900,000 lines of blog text, some of which are the length of journal articles.
setwd("~/edu/Data Science/capstone/Project")
ustwitter <- readLines("./Data/en_US/en_US.twitter.txt")
length(ustwitter) #[1] 2360148
## [1] 2360148
usnews <- readLines("./Data/en_US/en_US.news.txt")
length(usnews) #[1] 77259
## [1] 77259
usblogs <- readLines("./Data/en_US/en_US.blogs.txt")
length(usblogs) #[1] 899288
## [1] 899288
To estimate the number of words in each United States file, we construct training data sets by sampling lines from the files. We considered the AppliedPredictiveModeling package, but the us_twitter data file required too much computer memory. Selecting random lines with a TRUE or FALSE vector generated by the R rbinom function worked well. Our analysis and exploration are based on a very small sampling of the data, but the samples still include approximately one million words of text from each of the three files.
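A minimal sketch of this sampling step is shown below. The variable names prob and samples mirror those used in the scaling formulas further down, but the seed and the probability value are assumed examples rather than the settings used for the actual training sets.
set.seed(1234)                           # assumed seed, for reproducibility only
prob <- 0.01                             # assumed fraction of lines to keep
samples <- length(ustwitter)             # number of candidate Twitter lines
keep <- as.logical(rbinom(samples, size = 1, prob = prob))
twitraining <- ustwitter[keep]           # random sample of Twitter lines
## blogtraining and newstraining are drawn the same way from usblogs and usnews,
## likely with different prob values so each sample holds roughly one million words.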
For the United States, we estimate about 30 million words in the Twitter file, about 37 million in the blog file, and about 2.6 million in the news file.
## us_twitter
get_word_count(twitraining)/(prob*samples/twitter_word_count1)
## [1] 29652186
## us_blogs
get_word_count(blogtraining)/(prob*samples/blog_word_count1)
## [1] 37238042
## us_news
get_word_count(newstraining)/(prob*samples/news_word_count1)
## [1] 2602514
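The helper get_word_count is not shown in this report; presumably it counts the words in the sampled training lines, and the denominator rescales that count to estimate the full file (prob, samples, and the *_word_count1 variables come from the full analysis script). A minimal sketch of such a word counter, assuming whitespace-separated tokens, is:
get_word_count <- function(lines) {
    ## Count whitespace-separated tokens across all lines of a character vector.
    sum(lengths(strsplit(lines, "\\s+")))
}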
The histogram below shows that most words are rare and appear only a few times, while a small number of words appear many times. The x-axis represents the popularity of words (occurrences per million words), and the y-axis indicates how many words have a given popularity.
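The 1-gram tables combined below are built with a helper that is not shown in this report. A hedged sketch of one possible implementation follows; the helper name mirrors the 2-gram and 3-gram calls used later, the minimum_word_count value is an assumed example, and only the Freq and Source columns are actually required by the plotting code.
get_1_gram_dictionary <- function(lines, minimum_word_count, source_name) {
    words  <- unlist(strsplit(lines, "\\s+"))         # split each line into tokens
    words  <- words[nzchar(words)]                    # drop empty tokens
    counts <- sort(table(words), decreasing = TRUE)   # count each word
    counts <- counts[counts >= minimum_word_count]    # keep reasonably common words
    data.frame(word = names(counts), Freq = as.integer(counts),
               Source = source_name, stringsAsFactors = FALSE)
}
twitter_1_grams <- get_1_gram_dictionary(twitraining, minimum_word_count=10, "Twitter")
blog_1_grams <- get_1_gram_dictionary(blogtraining, minimum_word_count=10, "US_Blogs")
news_1_grams <- get_1_gram_dictionary(newstraining, minimum_word_count=10, "US_News")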
dfsources <- rbind(twitter_1_grams, blog_1_grams, news_1_grams)
#dfsources <- twitter_1_grams
ggplot(dfsources, aes(x=Freq)) +
    ggtitle("1-Gram Word Count vs Popularity") +
    stat_bin(boundary=20, breaks=seq(20, 200, by=10)) +
    coord_cartesian(xlim = c(20, 200)) +
    scale_x_continuous(name="Popularity per Million Words (Occurrence Rate)",
                       limits=c(20, 200)) +
    scale_y_continuous(name="Number of Words") +
    facet_grid(. ~ Source)
The histograms below show that, compared with single words, fewer 2-gram phrases reach high popularity.
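The get_2_gram_dictionary helper called below is likewise not defined in this report. A minimal sketch of one possible implementation, analogous to the 1-gram sketch above, pairs each word with the one that follows it and counts the pairs; for simplicity it ignores line boundaries.
get_2_gram_dictionary <- function(lines, minimum_word_count, source_name) {
    words  <- unlist(strsplit(lines, "\\s+"))
    words  <- words[nzchar(words)]
    pairs  <- paste(head(words, -1), tail(words, -1))  # adjacent word pairs
    counts <- sort(table(pairs), decreasing = TRUE)
    counts <- counts[counts >= minimum_word_count]
    data.frame(any3 = names(counts), Freq = as.integer(counts),
               Source = source_name, stringsAsFactors = FALSE)
}
The column name any3 matches the head() output shown below for the actual tables.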
twitter_2_grams <- get_2_gram_dictionary(twitraining,
                                         minimum_word_count=10, "Twitter")
blog_2_grams <- get_2_gram_dictionary(blogtraining,
                                      minimum_word_count=10, "US_Blogs")
news_2_grams <- get_2_gram_dictionary(newstraining,
                                      minimum_word_count=10, "US_News")
dfsources2 <- rbind(twitter_2_grams, blog_2_grams, news_2_grams)
ggplot(dfsources2, aes(x=Freq)) +
    ggtitle("2-Gram Phrase Count vs Popularity") +
    stat_bin(boundary=2.5, breaks=seq(5, 1000, by=5)) +
    coord_cartesian(xlim = c(0, 100)) +
    scale_x_continuous(name="Popularity of Phrases per Million Words (Occurrence Rate)",
                       limits=c(2.5, 1000)) +
    scale_y_continuous(name="Number of Phrases") +
    facet_grid(. ~ Source)
head(twitter_2_grams)
## any3 Freq Source
## 1 in the 1292 Twitter
## 2 Thanks for 893 Twitter
## 3 of the 877 Twitter
## 4 for the 806 Twitter
## 5 I love 791 Twitter
## 6 to be 773 Twitter
head(blog_2_grams)
## any3 Freq Source
## 1 of the 2405 US_Blogs
## 2 in the 1835 US_Blogs
## 3 to the 1087 US_Blogs
## 4 on the 974 US_Blogs
## 5 and the 933 US_Blogs
## 6 to be 914 US_Blogs
nums2 <- as.numeric(as.character(twitter_2_grams$Freq))
## Here is a statistics summary of the 2-gram phrase counts.
summary(nums2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   10.00   12.00   17.00   32.74   30.00 1292.00
twitter_3_grams <- get_3_gram_dictionary(twitraining,
                                         minimum_word_count=3, "Twitter")
blog_3_grams <- get_3_gram_dictionary(blogtraining,
                                      minimum_word_count=3, "US_Blogs")
news_3_grams <- get_3_gram_dictionary(newstraining,
                                      minimum_word_count=3, "US_News")
dfsources <- rbind(twitter_3_grams, blog_3_grams, news_3_grams)
ggplot(dfsources, aes(x=Freq)) +
    ggtitle("3-Gram Phrase Count vs Popularity") +
    stat_bin(boundary=2.5, breaks=seq(2.5, 200.5, by=1)) +
    coord_cartesian(xlim = c(0, 23)) +
    scale_x_continuous(name="Popularity of Phrases per Million Words (Occurrence Rate)",
                       limits=c(0, 300)) +
    scale_y_continuous(name="Number of Phrases") +
    facet_grid(. ~ Source)
head(twitter_3_grams)
## any3 Freq Source
## 1 Thanks for the 469 Twitter
## 2 thanks for the 204 Twitter
## 3 Thank you for 183 Twitter
## 4 Looking forward to 175 Twitter
## 5 I want to 161 Twitter
## 6 I love you 131 Twitter
head(blog_3_grams)
## any3 Freq Source
## 1 as well as 111 US_Blogs
## 2 a lot of 106 US_Blogs
## 3 one of the 102 US_Blogs
## 4 I want to 94 US_Blogs
## 5 I have a 90 US_Blogs
## 6 I have been 89 US_Blogs
nums3 <- as.numeric(as.character(twitter_3_grams$Freq))
## Here is a statistics summary of the 3-gram phrase counts.
summary(nums3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   3.000   3.000   4.000   6.093   6.000 469.000
Consider the string “Nothing in the world.” The last three words are “in the world.” Since in our problem the final word is unknown, we start with the string “Nothing in the” and use its last two words, “in the,” to predict the unknown final word.
str <- "Nothing in the"
patternlast <- "\\s([a-zA-Z']{1,50})[\\.]?$"
re <- regexec(patternlast,str)
lastword <- regmatches(str, re) [[1]][2]
lastword
## [1] "the"
The last two words form a 2-gram object, but here we only consider what might be predicted from the 3-gram model given that we know the first two words are “in the.”
patternnexttolast <- "\\s([a-zA-Z']{1,50})\\s[a-zA-Z']{1,50}[\\.]?$"
re <- regexec(patternnexttolast,str)
nexttolast <- regmatches(str, re) [[1]][2]
nexttolast
## [1] "in"
Our plan is an n-gram model that predicts the next word based on the previous one, two, or three words; we will also attempt to handle unseen n-grams. We form a data frame that is suitable as input for a prediction model in the caret package. The first word of the 2-gram model might somehow be added to the same data frame, but here we simply see that there are a few phrases in our 3-gram data from which a correct response can be chosen. Perhaps the most likely choice would be the 3-gram with the highest frequency of occurrence; a sketch of such a lookup follows the matching output below.
col1 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][1]}))
col2 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][2]}))
col3 <- unlist(lapply(as.character(twitter_3_grams[,1]),
                      function(x) {strsplit(x, "\\s+")[[1]][3]}))
data <- data.frame(col1, col2, col3, stringsAsFactors = FALSE)
## Given 2-gram or two word combination: Find possible matching 3-gram from
## the Twitter data set.
paste(nexttolast,lastword)
## [1] "in the"
## Find possible matching 3-gram from the Twitter data set.
data[data[,1]==nexttolast & data[,2] == lastword,]
## col1 col2 col3
## 51 in the world
## 125 in the morning
## 306 in the middle
## 337 in the face
## 831 in the next
## 1246 in the air
## 1247 in the city
## 1248 in the last
## 1620 in the first
## 1621 in the game
## 1622 in the house
## 1623 in the kitchen
## 1624 in the past
## 1625 in the studio
## 1626 in the works
## 2163 in the back
## 2164 in the car
## 2165 in the dark
## 2166 in the day
## 2167 in the fall
## 2168 in the history
## 2169 in the office
## 2170 in the same
## 2171 in the snow
## 2172 in the USA
## 2173 in the way
## 3195 in the afternoon
## 3196 in the area
## 3197 in the fridge
## 3198 in the hospital
## 3200 in the mirror
## 3201 in the mood
## 3202 in the NBA
## 3203 in the neighborhood
## 3204 in the present
## 3205 in the rain
## 3206 in the sky
## 3207 in the store
## 3208 in the street
## 5124 in the ass
## 5125 in the background
## 5126 in the bedroom
## 5127 in the club
## 5128 in the corner
## 5129 in the final
## 5130 in the future
## 5131 in the lab
## 5132 in the mail
## 5133 in the military
## 5134 in the movie
## 5135 in the name
## 5136 in the second
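As suggested above, a simple predictor can return the final word of the matching 3-gram with the highest frequency. The sketch below illustrates that lookup; the helper name predict_next_word and its arguments are illustrations rather than part of the report's code.
predict_next_word <- function(first_word, second_word, grams3 = twitter_3_grams) {
    ## Split each stored 3-gram into its three words.
    parts <- strsplit(as.character(grams3$any3), "\\s+")
    w1 <- vapply(parts, `[`, character(1), 1)
    w2 <- vapply(parts, `[`, character(1), 2)
    w3 <- vapply(parts, `[`, character(1), 3)
    freq <- as.numeric(as.character(grams3$Freq))
    ## Keep only the 3-grams whose first two words match the given 2-gram.
    hits <- which(w1 == first_word & w2 == second_word)
    if (length(hits) == 0) return(NA_character_)  # unseen 2-gram: back off to a shorter model
    ## Return the final word of the most frequent matching 3-gram.
    w3[hits[which.max(freq[hits])]]
}
predict_next_word(nexttolast, lastword)
## This should return "world" if, as the ordering above suggests, "in the world"
## is the most frequent "in the ..." 3-gram in the Twitter sample.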