The goal of this project is simply to demonstrate that you have become familiar with the data and that you are on track to create your prediction algorithm. Please submit a report that explains your exploratory analysis and your goals for the eventual app and algorithm. The document should be concise, explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you've downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
library(tm)
library(slam)
library(xtable)
library(rJava)
library(RWeka)
library(NLP)
library(ngram)
library(ggplot2)
library(wordcloud2)
library(knitr)
library(RColorBrewer)
library(stringi)
library(LaF)
The SwiftKey dataset has been downloaded and unzipped manually from the link below: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Basic information about the three en_US files downloaded (file size, number of lines, longest line, and word count) is collected with the helper function below:
file_information <- function(file_path) {
  # File size in megabytes
  file_size <- file.info(file_path)$size / 1048576
  # Read the whole file
  conn <- file(file_path, "r")
  full_text <- readLines(conn)
  close(conn)
  # Number of lines, length of the longest line, and total word count
  n_lines <- length(full_text)
  max_line <- max(nchar(full_text))
  n_words <- sum(stri_count_words(full_text))
  list(file_size = file_size, n_lines = n_lines, max_line = max_line, n_words = n_words)
}
data_dir <- "/Users/nilsgimpl/Desktop/Coding/R_data/datasciencecoursera/Capstone Project/NLP_capstone_project/en_US/"
info_blog <- file_information(paste0(data_dir,"en_US.blogs.txt"))
info_news <- file_information(paste0(data_dir,"en_US.news.txt"))
info_twitter <- file_information(paste0(data_dir,"en_US.twitter.txt"))
## Warning in readLines(conn): line 167155 appears to contain an embedded nul
## Warning in readLines(conn): line 268547 appears to contain an embedded nul
## Warning in readLines(conn): line 1274086 appears to contain an embedded nul
## Warning in readLines(conn): line 1759032 appears to contain an embedded nul
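These warnings are caused by a few embedded nul characters in the raw text; they do not affect the counts below. If desired, they could be suppressed by passing skipNul = TRUE to readLines(); a minimal sketch (shown for one file as an example):

# Sketch: skipNul = TRUE drops embedded nul characters and suppresses these warnings.
twitter_text <- readLines(paste0(data_dir, "en_US.twitter.txt"), skipNul = TRUE)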
matrix(c(info_blog[1], info_blog[2], info_blog[3], info_blog[4],
         info_news[1], info_news[2], info_news[3], info_news[4],
         info_twitter[1], info_twitter[2], info_twitter[3], info_twitter[4]),
       nrow = 3, ncol = 4, byrow = TRUE,
       dimnames = list(c("Info Blogs:", "Info News:", "Info Twitter:"),
                       c("File Size in MB", "No. of Lines", "Longest Line (No. characters)", "No. of Words")))
##               File Size in MB No. of Lines Longest Line (No. characters) No. of Words
## Info Blogs:          200.4242       899288                         40833     37546239
## Info News:           196.2775      1010242                         11384     34762395
## Info Twitter:        159.3641      2360148                           140     30093372
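Since knitr is already loaded, the same summary could alternatively be rendered as a formatted table with kable(); a minimal sketch (summary_df is an illustrative name, not part of the analysis above):

# Sketch: build a plain data frame from the file_information() results
# and render it with knitr::kable() instead of printing a matrix.
summary_df <- data.frame(
  `File Size in MB` = c(info_blog$file_size, info_news$file_size, info_twitter$file_size),
  `No. of Lines` = c(info_blog$n_lines, info_news$n_lines, info_twitter$n_lines),
  `Longest Line (No. characters)` = c(info_blog$max_line, info_news$max_line, info_twitter$max_line),
  `No. of Words` = c(info_blog$n_words, info_news$n_words, info_twitter$n_words),
  row.names = c("Info Blogs:", "Info News:", "Info Twitter:"),
  check.names = FALSE
)
kable(summary_df)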
Only a portion of the data will be used for this initial analysis, so the first 2,000 lines of each of the three US files (blogs, news, twitter) are read as a sample. A corpus (a collection of documents) is then created from the three samples.
# Read the first 2,000 lines of each file as a working sample
blogs_con <- file(paste0(data_dir, "en_US.blogs.txt"), "r")
news_con <- file(paste0(data_dir, "en_US.news.txt"), "r")
twitter_con <- file(paste0(data_dir, "en_US.twitter.txt"), "r")
blogs_data <- readLines(blogs_con, 2000)
news_data <- readLines(news_con, 2000)
twitter_data <- readLines(twitter_con, 2000)
close(blogs_con)
close(news_con)
close(twitter_con)
# Build a single corpus from the combined samples
corp <- VCorpus(VectorSource(c(blogs_data, news_data, twitter_data)),
                readerControl = list(reader = readPlain, language = "en"))
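Note that readLines(con, 2000) takes the first 2,000 lines of each file rather than a random sample, so the sample may not be fully representative. If a random sample were preferred, the LaF package (loaded above) provides sample_lines(); a minimal sketch:

# Sketch: draw a random sample of 2,000 lines per file instead of the first 2,000.
set.seed(1234)
blogs_sample   <- sample_lines(paste0(data_dir, "en_US.blogs.txt"), n = 2000)
news_sample    <- sample_lines(paste0(data_dir, "en_US.news.txt"), n = 2000)
twitter_sample <- sample_lines(paste0(data_dir, "en_US.twitter.txt"), n = 2000)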
This section uses the text mining library 'tm' (loaded previously) to perform data cleaning tasks that are meaningful in predictive text analytics. The main cleaning steps are:

1. Converting the documents to lowercase
2. Removing punctuation
3. Removing numbers
4. Removing English stop words
5. Removing undesired ("profanity") terms
6. Removing extra whitespace generated by the previous 5 steps

The above can be achieved with some of the tm package functions; let's take a look at each cleaning task individually.
Converting the documents to lowercase, then removing punctuation, numbers, and English stop words:
corp_low <- tm_map(corp, content_transformer(tolower))                                   # lowercase
corp_low_punct <- tm_map(corp_low, removePunctuation)                                    # remove punctuation
corp_low_punct_no <- tm_map(corp_low_punct, removeNumbers)                               # remove numbers
corp_low_punct_no_stop <- tm_map(corp_low_punct_no, removeWords, stopwords("english"))   # remove stop words
Removing undesired terms: in a first exploration of the datasets we could see that they contain many "profanity" words, which would potentially need to be removed. However, since those words could carry some weight in the prediction results, this step is deferred for now and can be added at a later stage, depending on needs.
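If that step is added later, it could reuse the same removeWords transformer; a minimal sketch, assuming a character vector profanity_words loaded from an external word list (the vector below is only a placeholder):

# Sketch: 'profanity_words' is a placeholder; in practice it would be read
# from an external profanity list, e.g. with readLines().
profanity_words <- c("badword1", "badword2")
corp_no_profanity <- tm_map(corp_low_punct_no_stop, removeWords, profanity_words)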
Removing extra whitespace generated by the previous steps:
corp_low_punct_no_stop_white <- tm_map(corp_low_punct_no_stop, stripWhitespace)
The cleaned data is now ready to be analysed. In the next steps, the most frequent unigrams, bigrams, and trigrams are extracted and visualised.
uni_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white)))
uni_gram_sorted <- sort(rowSums(uni_gram), decreasing = TRUE)
uni_gram_data_frame <- data.frame(word = names(uni_gram_sorted), freq = uni_gram_sorted)
uni_gram_data_frame[1:10,]
## word freq
## said said 600
## one one 499
## will will 499
## like like 478
## just just 464
## can can 402
## time time 351
## new new 344
## get get 326
## now now 294
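As a side note, since slam is already loaded, the same frequencies could be computed directly on the sparse term-document matrix, without converting it to a dense matrix (which can be memory-hungry on larger samples); a minimal sketch (variable names are illustrative):

# Sketch: unigram frequencies via slam::row_sums() on the sparse TDM.
tdm <- TermDocumentMatrix(corp_low_punct_no_stop_white)
uni_freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
head(data.frame(word = names(uni_freq), freq = uni_freq), 10)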
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
bi_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white, control = list(tokenize = bigram))))
bi_gram_sorted <- sort(rowSums(bi_gram), decreasing = TRUE)
bi_gram_data_frame <- data.frame(word = names(bi_gram_sorted), freq = bi_gram_sorted)
bi_gram_data_frame[1:10,]
## word freq
## new york new york 44
## last year last year 34
## dont know dont know 32
## high school high school 32
## right now right now 31
## u u u u 26
## last night last night 24
## feel like feel like 22
## new jersey new jersey 21
## years ago years ago 21
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
tri_gram <- as.data.frame(as.matrix(TermDocumentMatrix(corp_low_punct_no_stop_white, control = list(tokenize = trigram))))
tri_gram_sorted <- sort(rowSums(tri_gram), decreasing = TRUE)
tri_gram_data_frame <- data.frame(word = names(tri_gram_sorted), freq = tri_gram_sorted)
tri_gram_data_frame[1:10,]
## word freq
## u u u u u u 17
## pates fountain parks pates fountain parks 11
## classic pates fountain classic pates fountain 8
## cinco de mayo cinco de mayo 7
## new york city new york city 6
## new york times new york times 6
## world war ii world war ii 5
## cricket world cup cricket world cup 4
## four years ago four years ago 4
## osama bin laden osama bin laden 4
uni_gram_plot <- ggplot(uni_gram_data_frame[1:20,], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.7, fill = "steelblue") +
  labs(title = "20 Most Common Unigrams") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.3))
uni_gram_plot
wordcloud_bi_gram <- wordcloud2(bi_gram_data_frame[1:400,], size = 1.0, shape = 'circle')
wordcloud_bi_gram
wordcloud_tri_gram <- wordcloud2(tri_gram_data_frame[1:200,],size=1.0,shape = 'circle')
wordcloud_tri_gram