This report summarizes the Exploratory Data Analysis performed on the provided text dataset. The analysis will help us plan the development of an app that predicts the next word in a sentence, much like the autocomplete feature on most mobile phones.
I have already downloaded the files from the link provided, so here we set the working directory and import the data. First we read each file into its own variable and then close the connections to save memory.
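The packages loaded below are inferred from the functions used later in this report, so this list is an assumption about the exact packages rather than part of the original code; adjust it to your own setup (for example, replace_contraction() is available from textclean, and qdap provides an equivalent).
# Packages assumed by the rest of the analysis
library(dplyr)     # bind_rows(), mutate(), count(), anti_join(), the %>% pipe
library(tibble)    # tibble()
library(tidytext)  # unnest_tokens(), stop_words
library(ggplot2)   # plotting the n-gram distributions
library(ngram)     # wordcount()
library(textclean) # replace_contraction()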
setwd("C:\\Users\\Intel\\Documents\\Coursera DS Capstone Project")
# Import all the files and open the connection
fileBlogs <- file(".\\final\\en_US\\en_US.blogs.txt", "rb")
fileNews <- file(".\\final\\en_US\\en_US.news.txt", "rb")
fileTweets <- file(".\\final\\en_US\\en_US.twitter.txt", "rb")
# Read the lines and close the connections
blogs <- readLines(fileBlogs, encoding = "UTF-8", skipNul = TRUE)
close(fileBlogs)
news <- readLines(fileNews, encoding = "UTF-8", skipNul = TRUE)
close(fileNews)
tweets <- readLines(fileTweets, encoding = "UTF-8", skipNul = TRUE)
close(fileTweets)
# Remove the variables from the workspace
rm(fileBlogs, fileNews, fileTweets)
Let's check the memory used by each individual file in order to get some perspective on the space requirements. We will also run the garbage collector to free up memory for R, calling gc() every time we remove large variables.
blogsMem <- object.size(blogs)
format(blogsMem, units = "MB", standard = "legacy")
## [1] "255.4 Mb"
newsMem <- object.size(news)
format(newsMem, units = "MB", standard = "legacy")
## [1] "257.3 Mb"
tweetsMem <- object.size(tweets)
format(tweetsMem, units = "MB", standard = "legacy")
## [1] "319 Mb"
totalMem <- blogsMem + newsMem + tweetsMem
format(totalMem, units = "MB", standard = "legacy")
## [1] "831.7 Mb"
rm(blogsMem, newsMem, tweetsMem, totalMem)
gc() # Garbage Collector
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6072922 324.4 10069112 537.8 7114062 380.0
## Vcells 91949432 701.6 134224930 1024.1 105119992 802.1
So the dataset needs about 832 MB of RAM. The gc() output also shows that the maximum memory used so far is around 1.2 GB (roughly 380 MB of Ncells plus 802 MB of Vcells). This might create problems when building the application, as the free shinyapps.io plan only provides about 1 GB of memory for the app and its data; anything above that requires a paid plan. So it would be better to design the app to take the data in chunks instead of loading the whole file.
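As a rough sketch of that chunked approach (the chunk size and the processing step are hypothetical placeholders, not part of this report's code), a file can be read in fixed-size blocks with readLines(con, n = chunkSize):
# Sketch: process a large file in chunks instead of reading it whole
# (chunkSize and processChunk() are hypothetical placeholders)
con <- file(".\\final\\en_US\\en_US.blogs.txt", "r")
chunkSize <- 50000
repeat {
  chunk <- readLines(con, n = chunkSize, encoding = "UTF-8", skipNul = TRUE)
  if (length(chunk) == 0) break
  # processChunk(chunk)   # e.g. clean, tokenize and update n-gram counts here
}
close(con)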
basicSummary <- data.frame(fileType = c("blogs", "news", "twitter"),
                           nlines = c(length(blogs), length(news), length(tweets)),
                           nwords = c(wordcount(blogs, sep = " "),
                                      wordcount(news, sep = " "),
                                      wordcount(tweets, sep = " ")),
                           longestLine = c(max(nchar(blogs)), max(nchar(news)), max(nchar(tweets))))
basicSummary
## fileType nlines nwords longestLine
## 1 blogs 899288 37334131 40833
## 2 news 1010242 34372530 11384
## 3 twitter 2360148 30373583 140
It should be noted that the longest lines in the Twitter file are 140 characters, which is expected since tweets are limited to 140 characters. It can also be observed that, even though the Twitter dataset is larger in terms of number of lines and memory required, it contains far fewer words than the blogs dataset. This might indicate the use of longer words or, more probably, of special characters such as emojis.
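A quick, rough way to probe that hypothesis is to compare the average words per line and the share of lines containing non-ASCII characters (a crude proxy for emojis) across the three sources. The helper below (checkSource() is an illustrative name, not part of the original analysis) assumes blogs, news and tweets are still the raw character vectors read in above:
# Rough check: average words per line and share of lines with non-ASCII characters
checkSource <- function(x) {
  c(meanWordsPerLine = mean(lengths(strsplit(x, " ", fixed = TRUE))),
    pctLinesNonASCII = mean(grepl("[^\x01-\x7F]", x)) * 100)
}
sapply(list(blogs = blogs, news = news, twitter = tweets), checkSource)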
In this section, we will clean the data and create a smaller sample so that it is easier to work with. The cleaning process will involve the following steps:

- expanding contractions
- converting all characters to lowercase
- removing digits and words containing digits
- removing punctuation
blogs <- tibble(text = blogs)
news <- tibble(text = news)
tweets <- tibble(text = tweets)
Here we combine all three data frames into one using bind_rows() and, with mutate(), add a source column recording which file each row came from. We will then sample from this combined corpus.
corpus <- bind_rows(mutate(blogs, source = "blogs"),
                    mutate(news, source = "news"),
                    mutate(tweets, source = "twitter"))
corpus$source <- as.factor(corpus$source)
#corpus <- rbind(blogs, news, tweets)
rm(blogs, news, tweets)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6114413 326.6 12178840 650.5 12178840 650.5
## Vcells 94170666 718.5 161149916 1229.5 126747347 967.1
Now we will create a sample from the combined data and operate on that. This makes our operations run faster, as we do not have to process the complete dataset. We call set.seed() so that the sample is reproducible. From this point on we work only with the sample, so we remove the original dataset to free up memory.
set.seed(5)
corpusSample <- corpus[sample(nrow(corpus), 10000), ]
rm(corpus)
Here we expand the contractions and remove numbers, punctuation, special characters and emoticons from the sample.
corpusSample$text <- replace_contraction(corpusSample$text)
corpusSample$text <- gsub("\\d", "", corpusSample$text) # Remove Numbers
corpusSample$text <- gsub("[^\x01-\x7F]", "", corpusSample$text) # Remove emoticons
corpusSample$text <- gsub("[^[:alnum:]]", " ", corpusSample$text) # Remove special characters. Adds extra spaces
corpusSample$text <- gsub("\\s+", " ", corpusSample$text) # Remove the extra spaces
Tokenizing means splitting each string into its individual words (tokens). We create two tokenized sets: one that keeps stop words and one where stop words are removed using the stop_words lexicon from tidytext.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(word, text)
data("stop_words")
tidyset_withoutStopWords <- corpusSample %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
Let's see how many unique words there are in both sets (since unique() is applied to the whole data frame, a word is counted once per source it appears in).
keyWithStopwords <- unique(tidyset_withStopWords)
keyWithoutStopwords <- unique(tidyset_withoutStopWords)
dim(keyWithStopwords)
## [1] 35876 2
dim(keyWithoutStopwords)
## [1] 34125 2
Next, we check how many of the most frequent words are needed to cover 50% and 90% of all word occurrences in the sample, both with and without stop words.
coverage50pctWithStopwords <- tidyset_withStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.5)
nrow(coverage50pctWithStopwords)
## [1] 122
coverage50pctWithoutStopwords <- tidyset_withoutStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.5)
nrow(coverage50pctWithoutStopwords)
## [1] 1551
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
nrow(coverage90pctWithStopwords)
## [1] 6041
coverage90pctWithoutStopwords <- tidyset_withoutStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
nrow(coverage90pctWithoutStopwords)
## [1] 13187
Here we plot the distributions: first unigrams, both with and without stop words, and then bigrams and trigrams (with stop words).
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion)) +
  geom_col() +
  xlab("Words") +
  ggtitle("Unigram Distribution for 90% coverage with Stopwords") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
coverage90pctWithoutStopwords %>%
  top_n(20, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion)) +
  geom_col() +
  xlab("Words") +
  ggtitle("Unigram Distribution for 90% coverage without Stopwords") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords, tidyset_withoutStopWords,
   keyWithStopwords, keyWithoutStopwords,
   coverage50pctWithStopwords, coverage50pctWithoutStopwords,
   coverage90pctWithStopwords, coverage90pctWithoutStopwords)
We now tokenize the sample into bigrams and repeat the 90% coverage calculation and plot.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(bigram) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(bigram = reorder(bigram, proportion)) %>%
  ggplot(aes(bigram, proportion)) +
  geom_col() +
  xlab("Bigrams") +
  ggtitle("Bigram Distribution for 90% coverage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords,
   coverage90pctWithStopwords)
The same is done for trigrams.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(trigram) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(trigram = reorder(trigram, proportion)) %>%
  ggplot(aes(trigram, proportion)) +
  geom_col() +
  xlab("Trigrams") +
  ggtitle("Trigram Distribution for 90% coverage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords,
   coverage90pctWithStopwords)