Milestone - Week 2

Executive Summary

This document should be concise and explain only the major features of the data you have identified, and briefly summarize your plans for creating the prediction algorithm and Shiny app in a way that would be understandable to a non-data-scientist manager. You should make use of tables and plots to illustrate important summaries of the data set. The motivation for this project is to:

1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
2. Create a basic report of summary statistics about the data sets.
3. Report any interesting findings that you amassed so far.
4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

Downloading and reading files

twitter <- readLines("./final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("./final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)

Exploratory Data Analysis

## number of lines
length_twitter <- length(twitter)
length_blogs <- length(blogs)
length_news <- length(news)

## number of words (wordcount() is assumed to come from the ngram package)
library(ngram)
word_count_twitter <- wordcount(twitter)
word_count_blogs <- wordcount(blogs)
word_count_news <- wordcount(news)

## file size in MB
file_size_twitter <- file.info("./final/en_US/en_US.twitter.txt")$size / 1024^2
file_size_blogs <- file.info("./final/en_US/en_US.blogs.txt")$size / 1024^2
file_size_news <- file.info("./final/en_US/en_US.news.txt")$size / 1024^2

## summary table; the rows follow the rbind() order: twitter, blogs, news
a <- rbind(length_twitter, length_blogs, length_news)
b <- rbind(word_count_twitter, word_count_blogs, word_count_news)
c <- rbind(file_size_twitter, file_size_blogs, file_size_news)
d <- as.data.frame(cbind(a, b, c))
names(d) <- c("Number of Lines", "Number of Words", "File Size (MB)")
rownames(d) <- c("twitter", "blogs", "news")
d

##         Number of Lines Number of Words File Size (MB)
## twitter         2360148        30373583       159.3641
## blogs            899288        37334131       200.4242
## news            1010242        34372530       196.2775

The files are large, so we take a random sample of each and combine the samples. The seed is set so that the random draw is reproducible. A 0.1% sample is taken because the files are so large.

set.seed(1845)

twitter_sample <- sample(twitter, length(twitter)*.001)
news_sample <- sample(news, length(news)*.001)
blogs_sample <- sample(blogs, length(blogs)*.001)

##combine the samples

combined_sample <- c(twitter_sample, blogs_sample, news_sample)
combined_sample <- iconv(combined_sample, "UTF-8", "ASCII", sub="")
length(combined_sample)

## [1] 4269
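To check how the sampling step behaves, here is a small sketch on a made-up vector. The names `toy_lines` and `sample_fraction` are hypothetical and only for illustration; the fraction and seed are the ones used above. It shows that resetting the seed gives the same draw again, and that the sample size is rounded down to a whole number of lines.

```r
# Hypothetical stand-in for one of the text files: 2,500 fake lines.
toy_lines <- paste("line", seq_len(2500))

# Illustrative helper: draw a fixed fraction of the lines with a given seed.
sample_fraction <- function(lines, fraction, seed) {
  set.seed(seed)                        # same seed -> same random draw
  n <- floor(length(lines) * fraction)  # round the sample size down explicitly
  sample(lines, n)
}

first_draw  <- sample_fraction(toy_lines, 0.001, seed = 1845)
second_draw <- sample_fraction(toy_lines, 0.001, seed = 1845)

length(first_draw)                  # 2, i.e. floor(2500 * 0.001)
identical(first_draw, second_draw)  # TRUE, the seed makes the draw repeatable
```

A larger fraction only needs a different `fraction` argument, which is convenient when trying out bigger samples later.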

Cleaning the data

The data needs to be cleaned up. After some research I found the Text Mining package (tm), which can strip whitespace, remove punctuation, and so on.

Corpus and VCorpus explained: a corpus is a collection of documents containing text. Corpora are represented via the virtual S3 class ‘Corpus’; packages then provide S3 corpus classes that extend this virtual base class, such as VCorpus.

library(tm)
corpus <- VCorpus(VectorSource(combined_sample))
corpus <- tm_map(corpus, stripWhitespace)
## tolower() is a base R function, so it is wrapped in content_transformer()
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
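To see what these steps do to a single line, the sketch below approximates them in base R. It is not how tm implements them: `clean_text` is a hypothetical helper and the four stop words are a tiny made-up list rather than `stopwords("english")`. The order follows the pipeline above, with punctuation removed before stop words.

```r
# Base-R approximation of the cleaning steps (not tm's own implementation).
# The default stop word list is a small illustrative assumption.
clean_text <- function(x, stop_words = c("the", "a", "is", "to")) {
  x <- tolower(x)                    # lower-case everything
  x <- gsub("[[:punct:]]+", "", x)   # drop punctuation, including apostrophes
  x <- gsub("[[:digit:]]+", "", x)   # drop numbers
  # remove whole stop words only; \\b marks a word boundary
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

cleaned <- clean_text("The  weather is GREAT, don't you think? See you at 10!")
cleaned
# "weather great dont you think see you at"
```

Because the apostrophe is stripped before the stop words are matched, "don't" becomes "dont" and is no longer recognised as a stop word. The same effect would explain forms such as "dont", "cant" and "im" showing up in the frequency tables.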

Analysing the Cleaned Data - pt 1

To analyse the data we can look at the most used words, and at the most used 2-grams and 3-grams in the dataset. This is done by tokenizing the data into unigrams (1-grams), bigrams (2-grams) and trigrams (3-grams).
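As a quick illustration of what tokenizing into n-grams means, here is a base-R sketch. `make_ngrams` is a hypothetical helper written for this example; it is not part of tm or RWeka, and the sentence is made up.

```r
# Illustrative helper: split a cleaned string into words and join every run
# of n consecutive words into one n-gram.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  words <- words[nzchar(words)]                 # drop empty strings
  if (length(words) < n) return(character(0))   # too short for any n-gram
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

sentence <- "cant wait to see you"
make_ngrams(sentence, 1)  # "cant" "wait" "to" "see" "you"
make_ngrams(sentence, 2)  # "cant wait" "wait to" "to see" "see you"
make_ngrams(sentence, 3)  # "cant wait to" "wait to see" "to see you"
```

Counting how often each n-gram occurs across all documents gives frequency tables like the ones in this report.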

## 1-gram

gram1 <- as.data.frame(as.matrix(TermDocumentMatrix(corpus)))

gram1v <- sort(rowSums(gram1), decreasing = TRUE)
gram1d <- data.frame(word = names(gram1v), freq = gram1v)
gram1d[1:10, ]

##      word freq
## will will  313
## one   one  304
## just just  281
## said said  278
## like like  270
## can   can  264
## get   get  222
## new   new  212
## time time  208
## day   day  197

Analysing the Cleaned Data - pt 2

Two-word repetitions (bigrams)

library(RWeka)  ## provides NGramTokenizer() and Weka_control()
bigram <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
gram2 <- as.data.frame(as.matrix(TermDocumentMatrix(corpus, control = list(tokenize = bigram))))
gram2v <- sort(rowSums(gram2), decreasing = TRUE)
gram2d <- data.frame(word = names(gram2v), freq = gram2v)
gram2d[1:10, ]

##                    word freq
## last week     last week   20
## new york       new york   20
## right now     right now   19
## dont know     dont know   17
## cant wait     cant wait   16
## im going       im going   14
## im sure         im sure   14
## last night   last night   14
## high school high school   13
## last year     last year   13

Bar chart of the top 30 one-word repetitions

library(ggplot2)
ggplot(gram1d[1:30,], aes(x=reorder(word,freq), y=freq)) +
  geom_bar(stat="identity", width=0.5, fill="tomato2") +
  labs(title="Unigrams") +
  xlab("Unigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=65,vjust=0.6))

Word cloud of the top 20 two-word repetitions

library(wordcloud2)
wordcloud2(gram2d[1:20, ], size=1.6, shape = "circle")

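Findings in the Dataset

The blogs file (en_US.blogs.txt) has the largest file size, at about 200 MB, but the fewest lines of text (899,288), so blog entries contain more text per line on average than tweets or news items.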

Next steps

For this analysis I used only a 0.1% sample because of the sheer size of the dataset. A bigger sample would be better for the prediction algorithm, so that the data can be split properly into training and testing sets.
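A split into training and testing sets could look roughly like the sketch below. The 80/20 ratio and the `toy_lines` vector are assumptions for illustration only; the real split would be made on the sampled lines.

```r
# Toy stand-in for the sampled lines; the 80/20 ratio is an assumption.
toy_lines <- paste("line", seq_len(100))

set.seed(1845)  # reproducible split
train_idx <- sample(seq_along(toy_lines), size = floor(0.8 * length(toy_lines)))

train_set <- toy_lines[train_idx]   # lines used to build the n-gram tables
test_set  <- toy_lines[-train_idx]  # held-out lines used to check predictions

length(train_set)                       # 80
length(test_set)                        # 20
length(intersect(train_set, test_set))  # 0, the two sets do not overlap
```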

Once the sample gives a better representation of the data, the algorithm can be implemented and trained.
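One possible shape for the prediction step is a simple bigram lookup: given the previous word, suggest the most frequent words that followed it. This is only a sketch of one option, not the final algorithm. `predict_next` is a hypothetical helper, and `bigram_freq` holds a few rows copied from the bigram table in this report.

```r
# A few rows from the bigram frequency table, in the same word/freq layout.
bigram_freq <- data.frame(
  word = c("last week", "last night", "last year", "right now", "new york"),
  freq = c(20, 14, 13, 19, 20),
  stringsAsFactors = FALSE
)

# Illustrative helper: return up to n follow-up words for previous_word,
# ordered by how often the bigram was seen.
predict_next <- function(previous_word, table, n = 3) {
  parts  <- strsplit(table$word, " ")
  first  <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  hits   <- first == tolower(previous_word)
  candidates <- second[hits][order(table$freq[hits], decreasing = TRUE)]
  head(candidates, n)
}

predict_next("last", bigram_freq)   # "week" "night" "year"
predict_next("zebra", bigram_freq)  # character(0), no bigram starts with it
```

A word that never starts a bigram returns nothing, so a real model would need a fallback, for example suggesting the most frequent unigrams.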

The last step would be to put this into a Shiny application.

Kat Downey

11/08/2022