This report summarizes the Exploratory Data Analysis performed on the provided text dataset. The analysis will help us plan the development of an app that predicts the next word in a sentence, much like the autocomplete feature on most mobile phones.
I have already downloaded the files from the link provided, so here we set the working directory and import the data. First we read each file into its own variable and then close the connections to save memory.
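The packages loaded below are inferred from the functions used later in this report, so this list is an assumption about the exact packages rather than part of the original code; adjust it to your own setup (for example, replace_contraction() is available from textclean, and qdap provides an equivalent).
# Packages assumed by the rest of the analysis
library(dplyr)     # bind_rows(), mutate(), count(), anti_join(), the %>% pipe
library(tibble)    # tibble()
library(tidytext)  # unnest_tokens(), stop_words
library(ggplot2)   # plotting the n-gram distributions
library(ngram)     # wordcount()
library(textclean) # replace_contraction()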
setwd("C:\\Users\\Intel\\Documents\\Coursera DS Capstone Project")
# Import all the files and open the connection
fileBlogs <- file(".\\final\\en_US\\en_US.blogs.txt", "rb")
fileNews <- file(".\\final\\en_US\\en_US.news.txt", "rb")
fileTweets <- file(".\\final\\en_US\\en_US.twitter.txt", "rb")
# Read the lines and close the connections
blogs <- readLines(fileBlogs, encoding = "UTF-8", skipNul = TRUE)
close(fileBlogs)
news <- readLines(fileNews, encoding = "UTF-8", skipNul = TRUE)
close(fileNews)
tweets <- readLines(fileTweets, encoding = "UTF-8", skipNul = TRUE)
close(fileTweets)
# Remove the variables from the workspace
rm(fileBlogs, fileNews, fileTweets)
Let's check the memory used by each individual file in order to get some perspective on the space requirements. We will also run the garbage collector to free up memory for R, calling gc() every time we remove large variables.
blogsMem <- object.size(blogs)
format(blogsMem, units = "MB", standard = "legacy")
## [1] "255.4 Mb"
newsMem <- object.size(news)
format(newsMem, units = "MB", standard = "legacy")
## [1] "257.3 Mb"
tweetsMem <- object.size(tweets)
format(tweetsMem, units = "MB", standard = "legacy")
## [1] "319 Mb"
totalMem <- blogsMem + newsMem + tweetsMem
format(totalMem, units = "MB", standard = "legacy")
## [1] "831.7 Mb"
rm(blogsMem, newsMem, tweetsMem, totalMem)
gc() # Garbage Collector
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6072922 324.4 10069112 537.8 7114062 380.0
## Vcells 91949432 701.6 134224930 1024.1 105119992 802.1
So the dataset needs about 832 MB of RAM. The gc() output also shows that the maximum memory used so far is around 1.2 GB (roughly 380 MB of Ncells plus 802 MB of Vcells). This might create problems when building the application, as the free shinyapps.io plan only provides about 1 GB of memory for the app and its data; anything above that requires a paid plan. So it would be better to design the app to take the data in chunks instead of loading the whole file.
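As a rough sketch of that chunked approach (the chunk size and the processing step are hypothetical placeholders, not part of this report's code), a file can be read in fixed-size blocks with readLines(con, n = chunkSize):
# Sketch: process a large file in chunks instead of reading it whole
# (chunkSize and processChunk() are hypothetical placeholders)
con <- file(".\\final\\en_US\\en_US.blogs.txt", "r")
chunkSize <- 50000
repeat {
  chunk <- readLines(con, n = chunkSize, encoding = "UTF-8", skipNul = TRUE)
  if (length(chunk) == 0) break
  # processChunk(chunk)   # e.g. clean, tokenize and update n-gram counts here
}
close(con)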
basicSummary <- data.frame(fileType = c("blogs", "news", "twitter"),
                           nlines = c(length(blogs), length(news), length(tweets)),
                           nwords = c(wordcount(blogs, sep = " "),
                                      wordcount(news, sep = " "),
                                      wordcount(tweets, sep = " ")),
                           longestLine = c(max(nchar(blogs)), max(nchar(news)), max(nchar(tweets))))
basicSummary
## fileType nlines nwords longestLine
## 1 blogs 899288 37334131 40833
## 2 news 1010242 34372530 11384
## 3 twitter 2360148 30373583 140
It should be noted that the longest lines in the Twitter file are 140 characters, which is expected since tweets are limited to 140 characters. It can also be observed that, even though the Twitter dataset is larger in terms of number of lines and memory required, it contains far fewer words than the blogs dataset. This might indicate the use of longer words or, more probably, of special characters such as emojis.
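A quick, rough way to probe that hypothesis is to compare the average words per line and the share of lines containing non-ASCII characters (a crude proxy for emojis) across the three sources. The helper below (checkSource() is an illustrative name, not part of the original analysis) assumes blogs, news and tweets are still the raw character vectors read in above:
# Rough check: average words per line and share of lines with non-ASCII characters
checkSource <- function(x) {
  c(meanWordsPerLine = mean(lengths(strsplit(x, " ", fixed = TRUE))),
    pctLinesNonASCII = mean(grepl("[^\x01-\x7F]", x)) * 100)
}
sapply(list(blogs = blogs, news = news, twitter = tweets), checkSource)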
In this section, we will clean the data and create a smaller sample so that it is easier to work with. The cleaning process will involve the following steps:

- expanding contractions
- converting all characters to lowercase
- removing digits and words containing digits
- removing punctuation
blogs <- tibble(text = blogs)
news <- tibble(text = news)
tweets <- tibble(text = tweets)
Here we combine all three data frames into one using bind_rows() and, with mutate(), add a source column recording which file each row came from. We will then sample from this combined corpus.
corpus <- bind_rows(mutate(blogs, source = "blogs"),
                    mutate(news, source = "news"),
                    mutate(tweets, source = "twitter"))
corpus$source <- as.factor(corpus$source)
#corpus <- rbind(blogs, news, tweets)
rm(blogs, news, tweets)
gc()
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 6114413 326.6 12178840 650.5 12178840 650.5
## Vcells 94170666 718.5 161149916 1229.5 126747347 967.1
Now we will create a sample from the combined data and operate on that. This makes our operations run faster, as we do not have to process the complete dataset. We call set.seed() so that the sample is reproducible. From this point on we work only with the sample, so we remove the original dataset to free up memory.
set.seed(5)
corpusSample <- corpus[sample(nrow(corpus), 10000), ]
rm(corpus)
Here we expand the contractions and remove numbers, punctuation, special characters and emoticons from the sample.
corpusSample$text <- replace_contraction(corpusSample$text)
corpusSample$text <- gsub("\\d", "", corpusSample$text) # Remove Numbers
corpusSample$text <- gsub("[^\x01-\x7F]", "", corpusSample$text) # Remove emoticons
corpusSample$text <- gsub("[^[:alnum:]]", " ", corpusSample$text) # Remove special characters. Adds extra spaces
corpusSample$text <- gsub("\\s+", " ", corpusSample$text) # Remove the extra spaces
Tokenizing means splitting each string into its individual words (tokens). We create two tokenized sets: one that keeps stop words and one where stop words are removed using the stop_words lexicon from tidytext.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(word, text)
data("stop_words")
tidyset_withoutStopWords <- corpusSample %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining, by = "word"
Let's see how many unique words there are in both sets (since unique() is applied to the whole data frame, a word is counted once per source it appears in).
keyWithStopwords <- unique(tidyset_withStopWords)
keyWithoutStopwords <- unique(tidyset_withoutStopWords)
dim(keyWithStopwords)
## [1] 35876 2
dim(keyWithoutStopwords)
## [1] 34125 2
Next, we check how many of the most frequent words are needed to cover 50% and 90% of all word occurrences in the sample, both with and without stop words.
coverage50pctWithStopwords <- tidyset_withStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.5)
nrow(coverage50pctWithStopwords)
## [1] 122
coverage50pctWithoutStopwords <- tidyset_withoutStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.5)
nrow(coverage50pctWithoutStopwords)
## [1] 1551
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
nrow(coverage90pctWithStopwords)
## [1] 6041
coverage90pctWithoutStopwords <- tidyset_withoutStopWords %>%
  count(word) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
nrow(coverage90pctWithoutStopwords)
## [1] 13187
Here we plot the distributions: first unigrams, both with and without stop words, and then bigrams and trigrams (with stop words).
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion)) +
  geom_col() +
  xlab("Words") +
  ggtitle("Unigram Distribution for 90% coverage with Stopwords") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
coverage90pctWithoutStopwords %>%
  top_n(20, proportion) %>%
  mutate(word = reorder(word, proportion)) %>%
  ggplot(aes(word, proportion)) +
  geom_col() +
  xlab("Words") +
  ggtitle("Unigram Distribution for 90% coverage without Stopwords") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords, tidyset_withoutStopWords,
   keyWithStopwords, keyWithoutStopwords,
   coverage50pctWithStopwords, coverage50pctWithoutStopwords,
   coverage90pctWithStopwords, coverage90pctWithoutStopwords)
We now tokenize the sample into bigrams and repeat the 90% coverage calculation and plot.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2)
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(bigram) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(bigram = reorder(bigram, proportion)) %>%
  ggplot(aes(bigram, proportion)) +
  geom_col() +
  xlab("Bigrams") +
  ggtitle("Bigram Distribution for 90% coverage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords,
   coverage90pctWithStopwords)
The same is done for trigrams.
tidyset_withStopWords <- corpusSample %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)
coverage90pctWithStopwords <- tidyset_withStopWords %>%
  count(trigram) %>%
  mutate(proportion = n / sum(n)) %>%
  arrange(desc(proportion)) %>%
  mutate(coverage = cumsum(proportion)) %>%
  filter(coverage <= 0.9)
coverage90pctWithStopwords %>%
  top_n(20, proportion) %>%
  mutate(trigram = reorder(trigram, proportion)) %>%
  ggplot(aes(trigram, proportion)) +
  geom_col() +
  xlab("Trigrams") +
  ggtitle("Trigram Distribution for 90% coverage") +
  theme(plot.title = element_text(hjust = 0.5)) +
  coord_flip()
rm(tidyset_withStopWords,
   coverage90pctWithStopwords)