The goal of this project is to show that you have become comfortable working with the data and that you are on track to create your prediction algorithm. Please submit a report on R Pubs (http://rpubs.com/) that explains your exploratory analysis and your goals for the eventual app and algorithm. This document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set.

The motivation for this project is to:
1. Demonstrate that you have downloaded the data and successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings you have amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

**Review criteria:**
1. Does the link lead to an HTML page describing the exploratory analysis of the training data set?
2. Has the data scientist done basic summaries of the three files: word counts, line counts and basic data tables?
3. Has the data scientist made basic plots, such as histograms, to illustrate features of the data?
4. Was the report written in a brief, concise style, in a way that a non-data-scientist manager could appreciate?
Download the data from the following link and unzip the files into the current working directory: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
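If the archive has not been downloaded yet, this can also be done from R. The snippet below is a minimal sketch; the destination file name, the mode = "wb" choice and the subfolder mentioned in the comment are assumptions rather than requirements.

# download and unpack the SwiftKey corpus once, if it is not already present
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
                  destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}
# the text files may land in a subfolder (e.g. final/en_US); move them next to this script if so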
We use only the English (en_US) data set for demonstration purposes.
library("ggplot2")
library("tm")
library("wordcloud")
library("quanteda")
Check that the en_US files are in the working directory; if not, either change the working directory or move the files into the directory you are working in.
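A quick guard before reading (a minimal sketch, assuming the three files sit directly in the working directory):

# fail early with a clear error if any of the three files is missing
stopifnot(file.exists("en_US.blogs.txt"),
          file.exists("en_US.news.txt"),
          file.exists("en_US.twitter.txt"))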
# read the raw text; UTF-8 encoding and skipNul avoid problems with special characters and embedded nulls
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
filestats <- function(text_file, lines){
    filesize <- file.info(text_file)$size / 1024^2              # file size in MB
    maxchars <- max(nchar(lines))                                # length of the longest line
    wordcount <- sum(sapply(strsplit(lines, "\\s+"), length))    # rough word count (split on whitespace)
    return(c(text_file, format(round(filesize, 2), nsmall = 2), length(lines), maxchars, wordcount))
}
blogstat <- filestats("en_US.blogs.txt", blogs)
newstat <- filestats("en_US.news.txt", news)
twitterstat <- filestats("en_US.twitter.txt", twitter)
testsumm <- c(blogstat, newstat, twitterstat)
df <- data.frame(matrix(unlist(testsumm), nrow = 3, byrow = T))
colnames(df) <- c("text_file", "size_MB", "line_count", "max_line_length", "word_count")
print(df)
## text_file size_MB line_count max_line_length word_count
## 1 en_US.blogs.txt 200.42 899288 483415 899347
## 2 en_US.news.txt 196.28 77259 14556 77260
## 3 en_US.twitter.txt 159.36 2360148 1484357 2360169
Next we build a sample corpus from each of the three data sets. We take a random sample from each file, convert it to a corpus, clean it, and then look at which words occur with the highest frequency.
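Since sample() draws at random, setting a seed first keeps the samples, and therefore the plots and n-gram tables below, reproducible (the seed value here is arbitrary):

set.seed(1234)   # any fixed value works; chosen only for reproducibility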
make_corpus <- function(test_file){
    gen_corp <- paste(test_file, collapse = " ")   # collapse the sampled lines into a single document
    gen_corp <- VectorSource(gen_corp)             # wrap it as a tm source
    gen_corp <- Corpus(gen_corp)                   # and build the corpus
    return(gen_corp)
}
clean_corp <- function(corp_data){
    corp_data <- tm_map(corp_data, removeNumbers)                      # drop digits
    corp_data <- tm_map(corp_data, content_transformer(tolower))       # lower-case everything
    corp_data <- tm_map(corp_data, removeWords, stopwords("english"))  # drop common English stop words
    corp_data <- tm_map(corp_data, removePunctuation)                  # drop punctuation
    corp_data <- tm_map(corp_data, stripWhitespace)                    # collapse repeated whitespace
    return(corp_data)
}
high_freq_words <- function(corp_data){
    term_sparse <- DocumentTermMatrix(corp_data)
    # convert the sparse document-term matrix into a regular (dense) matrix
    term_matrix <- as.matrix(term_sparse)
    freq_words <- colSums(term_matrix)                               # total count of each term
    freq_words <- as.data.frame(sort(freq_words, decreasing = TRUE))
    freq_words$words <- rownames(freq_words)
    colnames(freq_words) <- c("frequency", "word")
    return(freq_words)
}
news_text <- sample(news, round(0.1*length(news)), replace = F)
news_corp <- make_corpus(news_text)
news_corp <- clean_corp(news_corp)
news_high_freq_words <- high_freq_words(news_corp)
news_high_freq_words1 <- news_high_freq_words[1:15, ]
plot_news <- ggplot(data = news_high_freq_words1,
                    aes(x = reorder(word, frequency), y = frequency, fill = factor(reorder(word, -frequency)))) +
    geom_bar(stat = "identity") +
    labs(x = "Words", y = "Frequency", title = "Most frequent words used in US news") +
    theme(legend.title = element_blank()) +
    coord_flip()
plot_news
# High frequency words in blogs
blogs_text <- sample(blogs, round(0.1*length(blogs)), replace = F)
blogs_corp <- make_corpus(blogs_text)
blogs_corp<- clean_corp(blogs_corp)
blogs_high_freq_words <- high_freq_words(blogs_corp)
blogs_high_freq_words1 <- blogs_high_freq_words[1:15, ]
blog_plot <- ggplot(data = blogs_high_freq_words1,
                    aes(x = reorder(word, frequency), y = frequency, fill = factor(reorder(word, -frequency)))) +
    geom_bar(stat = "identity") +
    labs(x = "Words", y = "Frequency", title = "Most frequent words used in US blogs") +
    theme(legend.title = element_blank()) +
    coord_flip()
blog_plot
# High frequency words in twitter
twitter_text <- sample(twitter, round(0.1*length(twitter)), replace = F)
twitter_corp <- make_corpus(twitter_text)
twitter_corp<- clean_corp(twitter_corp)
twitter_high_freq_words <- high_freq_words(twitter_corp)
twitter_high_freq_words1 <- twitter_high_freq_words[1:15, ]
twitter_plot <- ggplot(data = twitter_high_freq_words1,
                       aes(x = reorder(word, frequency), y = frequency, fill = factor(reorder(word, -frequency)))) +
    geom_bar(stat = "identity") +
    labs(x = "Words", y = "Frequency", title = "Most frequent words used in US Twitter") +
    theme(legend.title = element_blank()) +
    coord_flip()
twitter_plot
A word cloud displays the most frequent words, with each word sized according to its frequency.
## US news word cloud
wordcloud(news_high_freq_words$word[1:100], news_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))  # Dark2 provides at most 8 colours
## US blogs word cloud
wordcloud(blogs_high_freq_words$word[1:100], blogs_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))
## US twitter word cloud
wordcloud(twitter_high_freq_words$word[1:100], twitter_high_freq_words$frequency[1:100], colors = brewer.pal(8, "Dark2"))
For the analysis of the text documents we create bag-of-words matrices of unigrams, bigrams and trigrams. These n-gram frequency tables are the basis of the prediction model.
news_text1 <- sample(news_text, round(0.1*length(news_text)), replace = F)
news_tokens <- tokens(news_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
news_tokens <- tokens_tolower(news_tokens)
news_tokens <- tokens_select(news_tokens, stopwords(), selection = "remove")
news_unigram <- tokens_ngrams(news_tokens, n = 1) ## unigram
news_unigram.dfm <- dfm(news_unigram)   ## tokens are already lower-cased and cleaned above
news_bigram <- tokens_ngrams(news_tokens, n = 2) ## bigram
news_bigram.dfm <- dfm(news_bigram)
news_trigram <- tokens_ngrams(news_tokens, n = 3) ## trigram
news_trigram.dfm <- dfm(news_trigram)
topfeatures(news_unigram.dfm, 20) ## 20 top unigram words of news
## said â one new two s like also last state
## 186 120 68 66 52 49 49 46 44 43
## can time first get year just years u.s percent good
## 40 39 38 38 34 33 33 32 32 31
topfeatures(news_bigram.dfm, n= 20) ## 20 top bigram words of news
## new_york los_angeles new_jersey high_school st_louis
## 10 7 7 7 6
## united_states iâ_m itâ_s last_week new_library
## 5 5 5 5 4
## first_time â_said heâ_s last_month years_ago
## 4 4 4 4 4
## said_â last_year vice_president budget_cuts u.s_department
## 4 4 4 3 3
topfeatures(news_trigram.dfm, n= 20) ## 20 top trigram words of news
## weâ_re_going cents_per_gallon
## 3 3
## â_œheâ_s hullihen_feet_inches
## 2 2
## feet_inches_tall eyes_brown_hair
## 2 2
## tall_pounds_green pounds_green_eyes
## 2 2
## world_war_ii dâ_onofrio_argued
## 2 2
## arizona_state_university truth_told_president
## 2 2
## 7-percentage-point_lead_among unified_school_district
## 2 2
## la_jolla_front close_budget_gap
## 2 2
## â_œthatâ_s u.s_district_court
## 2 2
## along_santa_ana santa_ana_river
## 2 2
blogs_text1 <- sample(blogs_text, round(0.1*length(blogs_text)), replace = F)
blogs_tokens <- tokens(blogs_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
blogs_tokens <- tokens_tolower(blogs_tokens)
blogs_tokens <- tokens_select(blogs_tokens, stopwords(), selection = "remove")
blogs_unigram <- tokens_ngrams(blogs_tokens, n = 1) ## unigram
blogs_unigram.dfm <- dfm(blogs_unigram)   ## tokens are already lower-cased and cleaned above
blogs_bigram <- tokens_ngrams(blogs_tokens, n = 2) ## bigram
blogs_bigram.dfm <- dfm(blogs_bigram)
blogs_trigram <- tokens_ngrams(blogs_tokens, n = 3) ## trigram
blogs_trigram.dfm <- dfm(blogs_trigram)
topfeatures(blogs_unigram.dfm, 20) ## 20 top unigram words of blogs
## â s one just like can t time get now new
## 2356 1760 1174 1036 1009 977 940 915 650 626 613
## also know people iâ even day first make love
## 596 590 582 569 534 523 509 500 493
topfeatures(blogs_bigram.dfm, n= 20) ## 20 top bigram words of blogs
## itâ_s iâ_m donâ_t didnâ_t iâ_ve thatâ_s canâ_t youâ_re
## 383 324 278 135 127 126 110 82
## doesnâ_t â_â iâ_d heâ_s wasnâ_t â_œthe new_york thereâ_s
## 78 73 65 62 61 61 60 59
## t_know iâ_ll can_see said_â
## 57 52 50 49
topfeatures(blogs_trigram.dfm, n= 20) ## 20 top trigram words of blogs
## donâ_t_know donâ_t_want iâ_m_going
## 45 26 24
## â_â_â iâ_m_sure donâ_t_think
## 24 19 16
## north_dakota_hospital itâ_s_just iâ_ve_never
## 14 12 11
## itâ_s_like â_itâ_s doesnâ_t_want
## 11 10 10
## new_york_city iâ_ve_always â_œiâ_m
## 10 9 9
## itâ_s_time donâ_t_need iâ_d_like
## 9 9 9
## just_donâ_t didnâ_t_even
## 9 9
twitter_text1 <- sample(twitter_text, round(0.1*length(twitter_text)), replace = F)
twitter_tokens <- tokens(twitter_text1, what = "word", remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE, remove_symbols = TRUE)
twitter_tokens <- tokens_tolower(twitter_tokens)
twitter_tokens <- tokens_select(twitter_tokens, stopwords(), selection = "remove")
twitter_unigram <- tokens_ngrams(twitter_tokens, n = 1) ## unigram
twitter_unigram.dfm <- dfm(twitter_unigram)   ## tokens are already lower-cased and cleaned above
twitter_bigram <- tokens_ngrams(twitter_tokens, n = 2) ## bigram
twitter_bigram.dfm <- dfm(twitter_bigram)
twitter_trigram <- tokens_ngrams(twitter_tokens, n = 3) ## trigram
twitter_trigram.dfm <- dfm(twitter_trigram)
topfeatures(twitter_unigram.dfm, 20) ## 20 top unigram words of twitter
## just like get good love rt thanks can day now one
## 1448 1218 1078 1028 1020 921 915 887 850 814 804
## great know time u go new today see lol
## 782 768 748 744 734 710 708 687 659
topfeatures(twitter_bigram.dfm, n= 20) ## 20 top bigram words of twitter
## â_œ right_now looking_forward last_night good_morning
## 213 174 100 97 89
## happy_birthday feel_like just_got thanks_follow good_luck
## 88 77 69 65 61
## let_know looks_like can_get follow_back next_week
## 61 56 52 51 50
## great_day please_follow wait_see sounds_like first_time
## 49 47 45 45 43
topfeatures(twitter_trigram.dfm, n= 20) ## 20 top trigram words of twitter
## let_us_know happy_mothers_day rt_â_œ
## 22 19 16
## happy_new_year happy_mother's_day looking_forward_seeing
## 16 15 12
## î_î_î ðÿ_ðÿ_ðÿ come_see_us
## 11 11 10
## cinco_de_mayo good_morning_world keep_good_work
## 9 7 6
## look_forward_seeing st_patty's_day need_get_back
## 6 6 5
## welcome_new_followers follow_follow_back cant_wait_see
## 5 5 5
## looking_forward_hearing just_finished_mi
## 5 5
With the n-gram tables and the observations above in hand, we can proceed to build the Shiny application that predicts the next word using an n-gram model.
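As a rough illustration of the idea (not the final algorithm), the bigram counts computed above can already be turned into a simple next-word lookup. predict_next below is a hypothetical helper: it ranks the bigrams that start with a given word and returns the most frequent continuations.

# sketch: most-frequent-continuation lookup built from the news bigram counts
bigram_counts <- colSums(as.matrix(news_bigram.dfm))   # named vector of "word1_word2" counts

predict_next <- function(word, counts = bigram_counts, n = 3) {
    hits <- counts[startsWith(names(counts), paste0(word, "_"))]   # bigrams beginning with 'word'
    if (length(hits) == 0) return(character(0))                    # a real model would back off here
    top <- sort(hits, decreasing = TRUE)[seq_len(min(n, length(hits)))]
    sub("^[^_]*_", "", names(top))                                 # keep only the second word
}

predict_next("new")   # might return, e.g., "york", "jersey", ...

For the real predictor the stop words would probably be kept (they are often exactly the words a keyboard should suggest), and the bigram table would be combined with unigram and trigram tables through a back-off or interpolation rule.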