The goal of the Coursera Data Science Capstone - Milestone Report is to show that I have become familiar with the data and that I am on track to create my prediction algorithm. The report will be submitted on R Pubs (http://rpubs.com/), and it explains my exploratory analysis and my goals for the eventual app and algorithm. The document is meant to be concise, explaining only the major features of the data I have identified and briefly summarizing my plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data scientist manager. I will make use of tables and plots to illustrate important summaries of the data set.
The motivation for this project is to: 1. Demonstrate that I have downloaded the data and successfully loaded it in. 2. Create a basic report of summary statistics about the data sets. 3. Report any interesting findings I have amassed so far. 4. Get feedback on my plans for creating a prediction algorithm and Shiny app.
This report also aims to answer the following questions: 1. Does the link lead to an HTML page describing the exploratory analysis of the training data set? 2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables? 3. Has the data scientist made basic plots, such as histograms, to illustrate features of the data? 4. Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
This document is intended to give an overview of the data and to identify the way the prediction is going to be conducted.
The project data consists of datasets in four separate languages (English, German, Russian and Finnish), of which I will use the English data. For each language, there are three files: blog entries, news entries and Twitter feeds.
Data is downloaded from the following link: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and unzipped into the “data” folder in the local working directory.
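For reference, a minimal sketch of that download step is shown below (it was not re-run for this report; note that the folder layout inside the zip may differ from the data/en_US path used later, so the extracted files may need to be moved):
data_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!dir.exists("data")) {
  # download the archive once and extract it into the local "data" folder
  download.file(data_url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip", exdir = "data")
}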
Afterwards, the data is loaded into R with the following code:
blog_entries<-readLines("data/en_US/en_US.blogs.txt", skipNul = TRUE, warn= FALSE)
news_entries<-readLines("data/en_US/en_US.news.txt", skipNul = TRUE, warn=FALSE)
twitter_feeds<-readLines("data/en_US/en_US.twitter.txt", skipNul = TRUE, warn=FALSE)
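As a side note, the odd characters visible in the first blog entry shown later in this report come from the files' original encoding. An optional variant of the loading step, assuming the files are UTF-8 encoded (this variant was not used for the output below), would declare the encoding explicitly:
# Optional: declare the encoding explicitly to avoid mis-decoded characters
blog_entries <- readLines("data/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)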
The sizes of the English-language files, in megabytes, are computed as follows:
blog_entries_size<-file.info("data/en_US/en_US.blogs.txt")$size/ 1024 ^ 2
news_entries_size<-file.info("data/en_US/en_US.news.txt")$size/ 1024 ^ 2
twitter_feeds_size<-file.info("data/en_US/en_US.twitter.txt")$size/ 1024 ^ 2
We can arrange the file sizes in a table as follows:
eng_data_set_size<-c(blog_entries_size,news_entries_size,twitter_feeds_size)
data_frame_size<-data.frame(eng_data_set_size)
names(data_frame_size)[1] <-"MBs"
row.names(data_frame_size) <- c("Blogs", "News", "Twitter Feeds")
data_frame_size
## MBs
## Blogs 200.4242
## News 196.2775
## Twitter Feeds 159.3641
The line count for each of the files is the following:
blog_entries_line_count<-length(blog_entries)
news_entries_line_count<-length(news_entries)
twitter_feeds_line_count<-length(twitter_feeds)
data_set_length <-c(blog_entries_line_count,news_entries_line_count, twitter_feeds_line_count)
eng_data_frame_line_count <-data.frame(data_set_length)
names(eng_data_frame_line_count)[1] <-"Line Count"
row.names(eng_data_frame_line_count) <- c("Blogs", "News", "Twitter Feeds")
eng_data_frame_line_count
## Line Count
## Blogs 899288
## News 77259
## Twitter Feeds 2360148
Similarly, the word count for each file can be found as follows:
library(ngram)
blog_entries_word_count <-wordcount(blog_entries)
news_entries_word_count <-wordcount(news_entries)
twitter_feeds_word_count <-wordcount(twitter_feeds)
data_set_word_count <-c(blog_entries_word_count, news_entries_word_count, twitter_feeds_word_count)
eng_data_frame_word_count <-data.frame(data_set_word_count)
names(eng_data_frame_word_count)[1] <-"Word Count"
row.names(eng_data_frame_word_count) <- c("Blogs", "News", "Twitter Feeds")
eng_data_frame_word_count
## Word Count
## Blogs 37334131
## News 2643969
## Twitter Feeds 30373583
Finally, we can summarize each of the data sets and get a glimpse of their contents:
summary(blog_entries)
## Length Class Mode
## 899288 character character
head(blog_entries,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan â\200œgodsâ\200\235."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
summary(news_entries)
## Length Class Mode
## 77259 character character
head(news_entries,3)
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
summary(twitter_feeds)
## Length Class Mode
## 2360148 character character
head(twitter_feeds,3)
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
Due to the large number of observations, the exploratory data analysis will be conducted on a representative sample of the data.
For this purpose, I will use only one hundredth of each data set to create subsets, and then combine them into one final training set (for the sole purpose of exploration).
sample_size <- 0.01
blogs_index <- sample(seq_len(blog_entries_line_count),blog_entries_line_count*sample_size)
news_index <- sample(seq_len(length(news_entries)),length(news_entries)*sample_size)
twitter_index <- sample(seq_len(length(twitter_feeds)),length(twitter_feeds)*sample_size)
blogs_sub <- blog_entries[blogs_index]
news_sub <- news_entries[news_index]
twitter_sub <- twitter_feeds[twitter_index]
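Because sample() draws the indices at random, the exact 1% subset changes from run to run. A minimal addition for reproducibility (not used for the run behind this report; the seed value is arbitrary) would be to fix the random seed before sampling:
set.seed(1234)  # any fixed seed makes the sampled indices reproducible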
Next, we use the tm library to remove punctuation, numbers and excess white space, and to convert all text to lower case, turning the data into clean plain text.
library(tm)
## Loading required package: NLP
korpus <- Corpus(VectorSource(c(blogs_sub, news_sub, twitter_sub)),readerControl=list(reader=readPlain,language="en"))
korpus <- Corpus(VectorSource(sapply(korpus, function(row) iconv(row, "latin1", "ASCII", sub=""))))
korpus <- tm_map(korpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(korpus, removePunctuation): transformation
## drops documents
korpus <- tm_map(korpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(korpus, stripWhitespace): transformation
## drops documents
korpus <- tm_map(korpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(korpus, content_transformer(tolower)):
## transformation drops documents
korpus <- tm_map(korpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(korpus, removeNumbers): transformation drops
## documents
korpus <- tm_map(korpus, PlainTextDocument)
## Warning in tm_map.SimpleCorpus(korpus, PlainTextDocument): transformation
## drops documents
korpus <- Corpus(VectorSource(korpus))
head(korpus,5)
## <<SimpleCorpus>>
## Metadata: corpus specific: 1, document level (indexed): 0
## Content: documents: 3
For the graphics part, I will now show the 50 most frequently used words in a histogram created with ggplot2.
library(slam)
library(ggplot2)
## Registered S3 methods overwritten by 'ggplot2':
## method from
## [.quosures rlang
## c.quosures rlang
## print.quosures rlang
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
s_korpus <-TermDocumentMatrix(korpus,control=list(wordLengths=c(1,Inf)))
wordFrequency <-rowapply_simple_triplet_matrix(s_korpus,sum)
wordFrequency <-wordFrequency[order(wordFrequency,decreasing=T)]
mostFrequent50 <-as.data.frame(wordFrequency[1:50])
mostFrequent50 <-data.frame(Words = row.names(mostFrequent50),mostFrequent50)
names(mostFrequent50)[2] = "Frequency"
row.names(mostFrequent50) <-NULL
mf50Plot = ggplot(data=mostFrequent50, aes(x=Words, y=Frequency, fill=Frequency)) + geom_bar(stat="identity") + guides(fill=FALSE) + theme(axis.text.x=element_text(angle=90))
mf50Plot + ggtitle("50 Most Frequently Used Words") + theme(plot.title = element_text(hjust = 0.5))
From the cleaned data subset it is possible to create a pie chart of the ten most frequently used words in the corpus.
library(plotrix)
mostFrequent10 <-head(mostFrequent50,10)
pie3D(mostFrequent10$Frequency, labels = mostFrequent10$Words, main = "Pie of Ten Greatest Word Frequencies", explode=0.2, radius=1.0, labelcex = 1.0, start=0)
As expected, it can be seen that the most frequently occurring words are mostly articles, pronouns and other function words, especially “the” and “and”. For a Shiny app that would be able to predict the next word, I will try to use n-grams (specifically bigrams and trigrams). The final product will be a Shiny app where the user can enter text and will receive the predicted next word.
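To make the plan more concrete, here is a rough sketch of how bigram and trigram frequency tables could be built and queried from the exploration sample. It uses the ngram package loaded earlier; the helper predict_next_word, the simple whitespace tokenization and the use of the raw (uncleaned) sample are my own illustrative assumptions, not the final implementation.
library(ngram)
# Collapse the lower-cased sample into one string for n-gram extraction
# (the real app would use the cleaned corpus instead of the raw sample)
sample_text <- paste(tolower(c(blogs_sub, news_sub, twitter_sub)), collapse = " ")
# Frequency tables of two- and three-word sequences
bigrams  <- get.phrasetable(ngram(sample_text, n = 2))
trigrams <- get.phrasetable(ngram(sample_text, n = 3))
# Hypothetical helper: the most frequent word that follows a given word
# (the word is used as a literal prefix; regex metacharacters are not escaped in this sketch)
predict_next_word <- function(word, table = bigrams) {
  matches <- table[grepl(paste0("^", word, " "), table$ngrams), ]
  if (nrow(matches) == 0) return(NA_character_)
  matches <- matches[order(matches$freq, decreasing = TRUE), ]
  tail(strsplit(trimws(matches$ngrams[1]), " ")[[1]], 1)
}
predict_next_word("the")
A natural way to combine the two tables would be a simple back-off: prefer a trigram match on the last two words and fall back to bigrams when none is found, though the exact strategy is still to be decided.
A bare-bones skeleton of the eventual Shiny app, again only a sketch that assumes a prediction function such as predict_next_word above is available, could then look like this:
library(shiny)
ui <- fluidPage(
  textInput("user_text", "Enter some text:"),
  textOutput("next_word")
)
server <- function(input, output) {
  output$next_word <- renderText({
    # split the entered text into words and predict a continuation of the last one
    words <- strsplit(tolower(trimws(input$user_text)), "\\s+")[[1]]
    if (length(words) == 0) return("")
    predict_next_word(tail(words, 1))
  })
}
shinyApp(ui = ui, server = server)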