Executive Summary
This document should be concise and explain only the major features of the data you have identified and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:
1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.
Downloading and reading files
twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
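One thing to watch when loading these files: on some platforms, reading in text mode can stop early at stray control characters, which gives a line count that is too low. The sketch below is one way to guard against that by reading through a connection opened in binary mode. The helper name read_lines_binary is hypothetical and not part of the original report, and the demonstration uses a small temporary file rather than the real data.

```r
# Hypothetical helper (not from the original report): read a text file
# through a connection opened in binary mode ("rb").
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # always release the connection, even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

The same helper could be pointed at the three en_US files in place of the direct readLines() calls, assuming the paths used above.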
Exploratory Data Analysis
library(ngram)  # assumption: wordcount() comes from the ngram package
# number of lines
length_twitter <- length(twitter)
length_blogs <- length(blogs)
length_news <- length(news)
# number of words
word_count_twitter <- wordcount(twitter)
word_count_blogs <- wordcount(blogs)
word_count_news <- wordcount(news)
# file size in MB
file_size_twitter <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024^2
file_size_blogs <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024^2
file_size_news <- file.info("./final/en_US/en_US.news.txt")$size / 1024^2
a <- rbind(length_twitter, length_blogs, length_news)
b <- rbind(word_count_twitter, word_count_blogs, word_count_news)
c <- rbind(file_size_twitter, file_size_blogs, file_size_news)
d <- as.data.frame(cbind(a, b, c))
names(d) <- c("Number of Lines", "Number of Words", "File Size (MB)")
# row names must follow the rbind order above: twitter, blogs, news
rownames(d) <- c("twitter", "blogs", "news")
d
##         Number of Lines Number of Words File Size (MB)
## twitter         2360148        30373583       159.3641
## blogs            899288        37334131       200.4242
## news            1010242        34372530       196.2775
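The files are large, so we take a random sample from each one and combine the samples. The seed is set so that the random sample is reproducible. The code below draws a 0.1% sample of each file (a fraction of .001), which gives the 4269 lines reported in the output.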
set.seed(1845)
twitter_sample <- sample(twitter, length(twitter)*.001)
news_sample <- sample(news, length(news)*.001)
blogs_sample <- sample(blogs, length(blogs)*.001)
##combine the samples
combined_sample <- c(twitter_sample, blogs_sample, news_sample)
combined_sample <- iconv(combined_sample, "UTF-8", "ASCII", sub="")
length(combined_sample)
## [1] 4269
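To make the sampling step easier to reuse with a different fraction, it could be wrapped in a small function. This is only a sketch: the name sample_fraction and the toy input are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: draw a reproducible fraction of the elements of x.
sample_fraction <- function(x, fraction, seed = 1845) {
  set.seed(seed)                     # same seed gives the same sample
  n <- floor(length(x) * fraction)   # make the truncation to a whole number explicit
  sample(x, n)
}

# Toy data standing in for one of the text files
toy_lines <- paste("line", 1:5000)
s1 <- sample_fraction(toy_lines, 0.001)
s2 <- sample_fraction(toy_lines, 0.001)
length(s1)
identical(s1, s2)  # TRUE, because the seed is reset inside the function
```

Because the seed is reset on every call, each file would be sampled with the same random stream; whether that is desirable depends on how the samples are used later.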
Cleaning the data
The data needs to be cleaned before it can be analysed. After some research I found the tm (Text Mining) package, which can strip whitespace, remove punctuation and so on.
Corpus and VCorpus explained: a corpus is a collection of documents containing text. In tm, corpora are represented by the virtual S3 class ‘Corpus’; packages then provide concrete S3 corpus classes that extend this virtual base class, such as VCorpus.
library(tm)
corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, stripWhitespace)
# wrap base functions such as tolower in content_transformer() so that each
# document stays a PlainTextDocument (no separate conversion step is needed)
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
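To see what these cleaning steps do to a single line of text, here is a rough base R imitation. It is an illustration only and not how tm is implemented; the function clean_text and its tiny stop word list are made up for this example. It also shows one side effect of the order of the steps: punctuation is removed before stop words, so a contraction such as "don't" becomes "dont" and no longer matches a stop word list. That is a likely reason why terms such as "dont know" and "cant wait" appear in the bigram table further down.

```r
# Base R illustration only; the report itself uses tm for these steps.
# The stop word list here is a tiny made-up example.
clean_text <- function(x, stop_words = c("the", "a", "i", "to", "do", "not")) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]", "", x)   # "don't" becomes "dont"
  x <- gsub("[[:digit:]]", "", x)   # drop numbers
  # remove stop words that appear as whole words
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

clean_text("I don't know the 3 answers!")
# [1] "dont know answers"
```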
Analysing the Cleaned Data - pt 1
To analyse the data we can look at the most frequently used words and the most frequent 2-grams and 3-grams in the dataset. This is done by tokenizing the data into unigrams (1-grams), bigrams (2-grams) and trigrams (3-grams).
# 1-gram
gram1 <- as.data.frame(as.matrix(TermDocumentMatrix(corpus)))
gram1v <- sort(rowSums(gram1), decreasing = TRUE)
gram1d <- data.frame(word = names(gram1v), freq = gram1v)
gram1d[1:10, ]
## word freq
## will will 313
## one one 304
## just just 281
## said said 278
## like like 270
## can can 264
## get get 222
## new new 212
## time time 208
## day day 197
Analysing the Cleaned Data - pt 2
Two-word repetitions (bigrams)
library(RWeka)  # provides NGramTokenizer() and Weka_control()
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
gram2 <- as.data.frame(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = bigram))))
gram2v <- sort(rowSums(gram2), decreasing = TRUE)
gram2d <- data.frame(word = names(gram2v), freq = gram2v)
gram2d[1:10, ]
## word freq
## last week last week 20
## new york new york 20
## right now right now 19
## dont know dont know 17
## cant wait cant wait 16
## im going im going 14
## im sure im sure 14
## last night last night 14
## high school high school 13
## last year last year 13
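Trigrams are mentioned earlier but not computed in this report. As a sketch of how n-gram counting works in general, the base R code below builds every run of n consecutive words and tabulates them. The helper make_ngrams and the toy sentence are assumptions for illustration; in the real analysis the tokenizer used above could presumably be reused with min and max set to 3.

```r
# Hypothetical helper: build all n-grams from a character vector of words.
make_ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy sentence standing in for one cleaned document
toy_text <- "see you last week see you last night"
words <- strsplit(toy_text, " ", fixed = TRUE)[[1]]

# Count the trigrams and sort by frequency, most frequent first
trigram_counts <- sort(table(make_ngrams(words, 3)), decreasing = TRUE)
trigram_counts
```

In this toy sentence "see you last" is the only trigram that occurs twice, so it comes first in the sorted table.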
Bar chart of the top 30 one-word repetitions
library(ggplot2)
ggplot(gram1d[1:30, ], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", width = 0.5, fill = "tomato2") +
  labs(title = "Unigrams") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x = element_text(angle = 65, vjust = 0.6))
Word cloud of the top 20 two-word repetitions
library(wordcloud2)
wordcloud2(gram2d[1:20, ], size = 1.6, shape = "circle")
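Findings in the Dataset
The blogs file (second in the rbind order twitter, blogs, news) is the largest on disk at about 200 MB, yet it has the fewest lines (899288). Blog entries are therefore much longer on average than news items or tweets.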
Next steps
For this analysis I used only a 0.1% sample because of the sheer size of the dataset. A bigger sample would be better for the prediction algorithm, since it could be split more reliably into training and testing sets.
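A training and testing split could look roughly like the sketch below. The 80/20 ratio, the variable names and the toy data are assumptions for illustration only; nothing here has been decided for the final model.

```r
# Toy data standing in for a larger combined sample
toy_lines <- paste("line", 1:100)

set.seed(1845)  # reproducible split
train_idx <- sample(seq_along(toy_lines), size = floor(0.8 * length(toy_lines)))

train_set <- toy_lines[train_idx]    # 80% used to build the n-gram tables
test_set  <- toy_lines[-train_idx]   # remaining 20% held back for evaluation

length(train_set)
length(test_set)
```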
Once the sample gives a better representation of the data, the prediction algorithm can be implemented and trained.
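One simple form the prediction algorithm might take is a lookup in the bigram frequency table: given the previous word, return the most frequent word that followed it. The sketch below is only an idea of that approach and not a finished design. The function predict_next is hypothetical, and the small table reuses four rows from the bigram output above.

```r
# Small table in the same shape as gram2d (word, freq), using four rows
# taken from the bigram output in this report.
bigram_freq <- data.frame(
  word = c("last week", "last night", "last year", "new york"),
  freq = c(20, 14, 13, 20),
  stringsAsFactors = FALSE
)

# Hypothetical predictor: most frequent second word for a given first word.
predict_next <- function(prev_word, freq_table) {
  parts  <- strsplit(freq_table$word, " ", fixed = TRUE)
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits <- first == tolower(prev_word)
  if (!any(hits)) return(NA_character_)   # unseen word: no prediction
  second[hits][which.max(freq_table$freq[hits])]
}

predict_next("last", bigram_freq)   # "week"
predict_next("zebra", bigram_freq)  # NA
```

A real model would also need a fallback for unseen words, for example backing off from trigrams to bigrams to single-word frequencies.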
The last step would be to put this into a Shiny application.