Data Science Capstone Project: Week 2 Assignment

The goal of this project is just to display that you’ve gotten used to working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

REVIEW CRITERIA:

1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
2. Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
3. Has the data scientist made basic plots, such as histograms, to illustrate features of the data?
4. Was the report written in a brief, concise style, in a way that a non-data-scientist manager could appreciate?

DOWNLOADING THE DATASET

Download the data from the following link and unzip the files into the current working directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
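The download can also be scripted; a minimal sketch, assuming the archive keeps its usual final/en_US/ layout:

zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"
if (!file.exists(zip_file)) download.file(zip_url, zip_file, mode = "wb")  # "wb" keeps the zip intact on Windows
unzip(zip_file,
      files = paste0("final/en_US/en_US.", c("blogs", "news", "twitter"), ".txt"),
      junkpaths = TRUE)  # junkpaths drops the folder prefix so the files land in the working directory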

LOADING THE DATA AND LIBRARIES IN R

We use only the English data set for demonstration purposes.

library("ggplot2")
library("tm")
library("wordcloud")
library("quanteda")

Check that the en_US files are in the working directory; if not, change the working directory or move the data into it.

blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")
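On some platforms en_US.news.txt is silently truncated when read in text mode (it contains control characters), and curly quotes read with the wrong encoding surface later as â tokens. If that happens, reading in binary mode with an explicit encoding is a common workaround; a sketch with a hypothetical helper:

read_lines_utf8 <- function(path) {
  con <- file(path, open = "rb")  # binary mode avoids truncation at control characters
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}
# e.g. news <- read_lines_utf8("en_US.news.txt")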

Dataset summaries

filestats <- function(text_file, lines){
  filesize <- file.info(text_file)$size/1024^2                # file size in Mb
  maxchars <- max(nchar(lines))                               # length of the longest line
  wordcount <- sum(sapply(strsplit(lines, "\\s+"), length))   # split each line on whitespace and count tokens
  return(c(text_file, format(round(filesize, 2), nsmall = 2), length(lines), maxchars, wordcount))
}
blogstat <- filestats("en_US.blogs.txt", blogs)
newstat <- filestats("en_US.news.txt", news)
twitterstat <- filestats("en_US.twitter.txt", twitter)
testsumm <- c(blogstat, newstat, twitterstat)
df <- data.frame(matrix(unlist(testsumm), nrow = 3, byrow = TRUE))
colnames(df) <- c("text_file", "size_Mb", "line_count", "max_line_length", "word_count")
print(df)
##           text_file size_Mb line_count max_line_length word_count
## 1   en_US.blogs.txt  200.42     899288             ...        ...
## 2    en_US.news.txt  196.28      77259             ...        ...
## 3 en_US.twitter.txt  159.36    2360148             ...        ...
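These counts can be cross-checked with the stringi package, which ships fast text statistics (assuming stringi is installed; it is not used elsewhere in this report):

library(stringi)
stri_stats_general(blogs)      # Lines, LinesNEmpty, Chars, CharsNWhite
sum(stri_count_words(blogs))   # word count under stringi's tokenizer, for comparison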

Cleaning the data

Here we prepare test data from the given corpus, which first requires cleaning. We clean the text, draw a random sample from each source as our test data, and then look at which words occur with the highest frequency.
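Because the samples are drawn at random, fixing the seed makes the results below reproducible (the seed value is arbitrary):

set.seed(1234)   # arbitrary seed; makes the 10% samples drawn below reproducible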

make_corpus <- function(test_file){
  gen_corp <- paste(test_file, collapse = " ")  # collapse all lines into a single document
  gen_corp <- VectorSource(gen_corp)
  gen_corp <- Corpus(gen_corp)
  return(gen_corp)
}

clean_corp <- function(corp_data){
  corp_data <- tm_map(corp_data, removeNumbers)
  corp_data <- tm_map(corp_data, content_transformer(tolower))
  corp_data <- tm_map(corp_data, removeWords, stopwords("english"))  # drop common English stopwords
  corp_data <- tm_map(corp_data, removePunctuation)
  corp_data <- tm_map(corp_data, stripWhitespace)
  return(corp_data)
}
high_freq_words <- function(corp_data){
  term_sparse <- DocumentTermMatrix(corp_data)
  term_matrix <- as.matrix(term_sparse)  # convert the sparse document-term matrix into a dense matrix
  freq_words <- colSums(term_matrix)
  freq_words <- as.data.frame(sort(freq_words, decreasing = TRUE))
  freq_words$words <- rownames(freq_words)
  colnames(freq_words) <- c("frequency", "word")
  return(freq_words)
}

Bar Charts of High-Frequency Words

news_text <- sample(news, round(0.1*length(news)), replace = F)
news_corp <- make_corpus(news_text)
news_corp <- clean_corp(news_corp)
news_high_freq_words <- high_freq_words(news_corp)
news_high_freq_words1 <- news_high_freq_words[1:15, ]

plot_news <- ggplot(data = news_high_freq_words1,
                    aes(x = reorder(word, frequency), y = frequency,
                        fill = factor(reorder(word, -frequency)))) +
  geom_bar(stat = "identity") +
  labs(x = "Words", title = "Most frequent words used in US news") +
  theme(legend.title = element_blank()) +
  coord_flip()
plot_news

# High frequency words in blogs
blogs_text <- sample(blogs, round(0.1*length(blogs)), replace = F)
blogs_corp <- make_corpus(blogs_text)
blogs_corp<- clean_corp(blogs_corp)
blogs_high_freq_words <- high_freq_words(blogs_corp)
blogs_high_freq_words1 <- blogs_high_freq_words[1:15, ]

blog_plot <- ggplot(data = blogs_high_freq_words1,
                    aes(x = reorder(word, frequency), y = frequency,
                        fill = factor(reorder(word, -frequency)))) +
  geom_bar(stat = "identity") +
  labs(x = "Words", title = "Most frequent words used in US blogs") +
  theme(legend.title = element_blank()) +
  coord_flip()
blog_plot

# High frequency words in twitter
twitter_text <- sample(twitter, round(0.1*length(twitter)), replace = F)
twitter_corp <- make_corpus(twitter_text)
twitter_corp<- clean_corp(twitter_corp)
twitter_high_freq_words <- high_freq_words(twitter_corp)
twitter_high_freq_words1 <- twitter_high_freq_words[1:15, ]

twitter_plot <- ggplot(data = twitter_high_freq_words1,
                       aes(x = reorder(word, frequency), y = frequency,
                           fill = factor(reorder(word, -frequency)))) +
  geom_bar(stat = "identity") +
  labs(x = "Words", title = "Most frequent words used in US twitter") +
  theme(legend.title = element_blank()) +
  coord_flip()
twitter_plot
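The three charts could also be combined into a single faceted figure for easier comparison; a sketch, using the frequency tables built above (bars come out alphabetical rather than frequency-ordered, which per-facet reordering, e.g. tidytext::reorder_within, could fix):

all_freq <- rbind(cbind(news_high_freq_words1,    source = "news"),
                  cbind(blogs_high_freq_words1,   source = "blogs"),
                  cbind(twitter_high_freq_words1, source = "twitter"))
ggplot(all_freq, aes(x = word, y = frequency)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~ source, scales = "free") +
  labs(x = "Words", title = "Top 15 words per source")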

Generating the Word Clouds

A word cloud displays the words scaled according to their frequencies in the sample.

## US news word cloud
wordcloud(news_high_freq_words$word[1:100], news_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))

## US blogs word cloud
wordcloud(blogs_high_freq_words$word[1:100], blogs_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))

## US twitter word cloud
wordcloud(twitter_high_freq_words$word[1:100], twitter_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))

Word Analysis

For the analysis of the text documents we create bag-of-words matrices of unigrams, bigrams, and trigrams. These n-gram models are the basis of the prediction algorithm: given the last one or two words, we look up the most frequent continuation. (The â tokens in the outputs below are curly quotes mangled by the file encoding; reading the files as UTF-8, as sketched earlier, avoids them.)
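As a toy illustration of what an n-gram is, tokens_ngrams() joins adjacent words with an underscore:

toy <- tokens("the quick brown fox jumps")
tokens_ngrams(toy, n = 2)   # bigrams: "the_quick" "quick_brown" "brown_fox" "fox_jumps"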

news_text1 <- sample(news_text, round(0.1*length(news_text)), replace = F)
news_tokens <- tokens(news_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
news_tokens <- tokens_tolower(news_tokens)
news_tokens <- tokens_select(news_tokens, stopwords(), selection = "remove")

news_unigram <- tokens_ngrams(news_tokens, n = 1) ## unigrams
news_unigram.dfm <- dfm(news_unigram)   # tokens are already lower-cased, with stopwords and punctuation removed

news_bigram <- tokens_ngrams(news_tokens, n = 2) ## bigrams
news_bigram.dfm <- dfm(news_bigram)

news_trigram <- tokens_ngrams(news_tokens, n = 3) ## trigrams
news_trigram.dfm <- dfm(news_trigram)
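To turn a dfm into a frequency table for the prediction step, textstat_frequency() is convenient (in quanteda 3+ it lives in the quanteda.textstats package):

library(quanteda.textstats)
news_bigram_freq <- textstat_frequency(news_bigram.dfm)   # columns: feature, frequency, rank, docfreq, group
head(news_bigram_freq, 5)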

topfeatures(news_unigram.dfm, 20) ## top 20 unigrams in the news sample
##    said       â     one     new     two       s    like    also    last   state 
##     186     120      68      66      52      49      49      46      44      43 
##     can    time   first     get    year    just   years     u.s percent    good 
##      40      39      38      38      34      33      33      32      32      31
topfeatures(news_bigram.dfm, n= 20) ## top 20 bigrams in the news sample
##       new_york    los_angeles     new_jersey    high_school       st_louis 
##             10              7              7              7              6 
##  united_states           iâ_m          itâ_s      last_week    new_library 
##              5              5              5              5              4 
##     first_time         â_said          heâ_s     last_month      years_ago 
##              4              4              4              4              4 
##         said_â      last_year vice_president    budget_cuts u.s_department 
##              4              4              4              3              3
topfeatures(news_trigram.dfm, n= 20) ## top 20 trigrams in the news sample
##                  weâ_re_going              cents_per_gallon 
##                             3                             3 
##                      â_œheâ_s          hullihen_feet_inches 
##                             2                             2 
##              feet_inches_tall               eyes_brown_hair 
##                             2                             2 
##             tall_pounds_green             pounds_green_eyes 
##                             2                             2 
##                  world_war_ii             dâ_onofrio_argued 
##                             2                             2 
##      arizona_state_university          truth_told_president 
##                             2                             2 
## 7-percentage-point_lead_among       unified_school_district 
##                             2                             2 
##                la_jolla_front              close_budget_gap 
##                             2                             2 
##                    â_œthatâ_s            u.s_district_court 
##                             2                             2 
##               along_santa_ana               santa_ana_river 
##                             2                             2
blogs_text1 <- sample(blogs_text, round(0.1*length(blogs_text)), replace = F)
blogs_tokens <- tokens(blogs_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
blogs_tokens <- tokens_tolower(blogs_tokens)
blogs_tokens <- tokens_select(blogs_tokens, stopwords(), selection = "remove")

blogs_unigram <- tokens_ngrams(blogs_tokens, n = 1) ## unigrams
blogs_unigram.dfm <- dfm(blogs_unigram)

blogs_bigram <- tokens_ngrams(blogs_tokens, n = 2) ## bigrams
blogs_bigram.dfm <- dfm(blogs_bigram)

blogs_trigram <- tokens_ngrams(blogs_tokens, n = 3) ## trigrams
blogs_trigram.dfm <- dfm(blogs_trigram)

topfeatures(blogs_unigram.dfm, 20) ## top 20 unigrams in the blogs sample
##      â      s    one   just   like    can      t   time    get    now    new 
##   2356   1760   1174   1036   1009    977    940    915    650    626    613 
##   also   know people     iâ   even    day  first   make   love 
##    596    590    582    569    534    523    509    500    493
topfeatures(blogs_bigram.dfm, n= 20) ## top 20 bigrams in the blogs sample
##    itâ_s     iâ_m   donâ_t  didnâ_t    iâ_ve  thatâ_s   canâ_t  youâ_re 
##      383      324      278      135      127      126      110       82 
## doesnâ_t      â_â     iâ_d    heâ_s  wasnâ_t   â_œthe new_york thereâ_s 
##       78       73       65       62       61       61       60       59 
##   t_know    iâ_ll  can_see   said_â 
##       57       52       50       49
topfeatures(blogs_trigram.dfm, n= 20) ## top 20 trigrams in the blogs sample
##           donâ_t_know           donâ_t_want            iâ_m_going 
##                    45                    26                    24 
##                 â_â_â             iâ_m_sure          donâ_t_think 
##                    24                    19                    16 
## north_dakota_hospital            itâ_s_just           iâ_ve_never 
##                    14                    12                    11 
##            itâ_s_like               â_itâ_s         doesnâ_t_want 
##                    11                    10                    10 
##         new_york_city          iâ_ve_always               â_œiâ_m 
##                    10                     9                     9 
##            itâ_s_time           donâ_t_need             iâ_d_like 
##                     9                     9                     9 
##           just_donâ_t          didnâ_t_even 
##                     9                     9
twitter_text1 <- sample(twitter_text, round(0.1*length(twitter_text)), replace = F)
twitter_tokens <- tokens(twitter_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
twitter_tokens <- tokens_tolower(twitter_tokens)
twitter_tokens <- tokens_select(twitter_tokens, stopwords(), selection = "remove")

twitter_unigram <- tokens_ngrams(twitter_tokens, n = 1) ## unigrams
twitter_unigram.dfm <- dfm(twitter_unigram)

twitter_bigram <- tokens_ngrams(twitter_tokens, n = 2) ## bigrams
twitter_bigram.dfm <- dfm(twitter_bigram)

twitter_trigram <- tokens_ngrams(twitter_tokens, n = 3) ## trigrams
twitter_trigram.dfm <- dfm(twitter_trigram)

topfeatures(twitter_unigram.dfm, 20) ## top 20 unigrams in the twitter sample
##   just   like    get   good   love     rt thanks    can    day    now    one 
##   1448   1218   1078   1028   1020    921    915    887    850    814    804 
##  great   know   time      u     go    new  today    see    lol 
##    782    768    748    744    734    710    708    687    659
topfeatures(twitter_bigram.dfm, n= 20) ## top 20 bigrams in the twitter sample
##             â_œ       right_now looking_forward      last_night    good_morning 
##             213             174             100              97              89 
##  happy_birthday       feel_like        just_got   thanks_follow       good_luck 
##              88              77              69              65              61 
##        let_know      looks_like         can_get     follow_back       next_week 
##              61              56              52              51              50 
##       great_day   please_follow        wait_see     sounds_like      first_time 
##              49              47              45              45              43
topfeatures(twitter_trigram.dfm, n= 20) ## top 20 trigrams in the twitter sample
##             let_us_know       happy_mothers_day                  rt_â_œ 
##                      22                      19                      16 
##          happy_new_year      happy_mother's_day  looking_forward_seeing 
##                      16                      15                      12 
##                   î_î_î                ðÿ_ðÿ_ðÿ             come_see_us 
##                      11                      11                      10 
##           cinco_de_mayo      good_morning_world          keep_good_work 
##                       9                       7                       6 
##     look_forward_seeing          st_patty's_day           need_get_back 
##                       6                       6                       5 
##   welcome_new_followers      follow_follow_back           cant_wait_see 
##                       5                       5                       5 
## looking_forward_hearing        just_finished_mi 
##                       5                       5

Further steps

Now that we have the n-gram frequency tables and the observations above, we can proceed to build a Shiny application that predicts the next word with an n-gram model: the user types a phrase, the app looks at the last one or two words, and it suggests the most frequent continuations, backing off from trigrams to bigrams when the longer context has not been seen.
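A minimal sketch of that lookup, assuming hypothetical frequency tables bigram_freq and trigram_freq with columns feature and frequency (as returned by textstat_frequency above), sorted by decreasing frequency, with the words of each n-gram joined by "_":

predict_next <- function(phrase, bigram_freq, trigram_freq, k = 3) {
  w <- tolower(unlist(strsplit(phrase, "\\s+")))
  n <- length(w)
  if (n >= 2) {                                   # try the trigram context first
    prefix <- paste0(w[n - 1], "_", w[n], "_")
    hits <- trigram_freq[startsWith(trigram_freq$feature, prefix), ]
    if (nrow(hits) > 0) return(head(sub(prefix, "", hits$feature, fixed = TRUE), k))
  }
  prefix <- paste0(w[n], "_")                     # back off to the bigram context
  hits <- bigram_freq[startsWith(bigram_freq$feature, prefix), ]
  head(sub(prefix, "", hits$feature, fixed = TRUE), k)
}
# e.g. predict_next("happy new", bigram_freq, trigram_freq) should suggest "year" on this data

The real app would precompute these tables from a larger sample, handle unseen words, and possibly add smoothing, but the basic mechanism is the same.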