Data acquisition

Data was downloaded from the provided link. The folder was named “final” and stored within the R project directory.

Data loading in R

The data was then loaded into R using the following code:

# Open connections to the three English-language files
con1 <- file("final/en_US/en_US.blogs.txt")
con2 <- file("final/en_US/en_US.news.txt")
con3 <- file("final/en_US/en_US.twitter.txt")

# Read each file into a character vector, one element per line
blogs <- readLines(con1)
news <- readLines(con2)
twitter <- readLines(con3)
## Warning in readLines(con3): line 167155 appears to contain an embedded nul
## Warning in readLines(con3): line 268547 appears to contain an embedded nul
## Warning in readLines(con3): line 1274086 appears to contain an embedded nul
## Warning in readLines(con3): line 1759032 appears to contain an embedded nul
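
These warnings come from a few Twitter lines that contain embedded nul characters. If desired, they could be avoided with the skipNul argument of readLines; a minimal sketch (not part of the original run):

# Optional: re-read the Twitter file while skipping embedded nul characters
twitter <- readLines("final/en_US/en_US.twitter.txt", skipNul = TRUE)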

Exploratory Data Analysis and Summaries

Basic statistics are calculated for each file: the mean, maximum, and minimum number of characters per line. The total number of lines in each dataset is also calculated.

Blogs

mean(nchar(blogs))
max(nchar(blogs))
min(nchar(blogs))
## [1] 229.987
## [1] 40833
## [1] 1

Similarly, results were computed for News and Twitter.

News
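
Only the output is shown below; the calls simply repeat the pattern used for Blogs (a sketch):

mean(nchar(news))
max(nchar(news))
min(nchar(news))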

## [1] 201.1628
## [1] 11384
## [1] 1

Twitter

Notably, the maximum number of characters here must be 140 (Twitter's limit at the time), so this serves as a check on whether our calculations are right.

## [1] 68.68045
## [1] 140
## [1] 2

The size of the three sets (size = number of lines) is as follows:

## [1] "Number of lines in Blogs"
## [1] 899288
## [1] "Number of lines in News"
## [1] 1010242
## [1] "Number of lines in Twitter"
## [1] 2360148
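
A sketch of calls consistent with the output above (the exact code is not shown here):

print("Number of lines in Blogs")
length(blogs)
print("Number of lines in News")
length(news)
print("Number of lines in Twitter")
length(twitter)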

Basic Plots

Preparing data for plotting

object.size(blogs)
## 267758632 bytes

The amount of memory used is very high, so loading all the data into memory might not be efficient. Therefore, a random 10% sample of each dataset is taken and the further analysis is based on that. The rbinom function was used to create the random sample, and the samples were then written to separate files.

set.seed(1313)

# Number of lines in the blogs data
l1 <- length(blogs) 

# Bernoulli(0.1) indicator for each line; keep the ~10% of lines drawn as 1
select1 <- rbinom(l1, 1, 0.1)
select1 <- (select1 == 1)

# Write the sampled lines to a new file
file.create("en_US/SampleENBlogs.txt")
con1a <- file("en_US/SampleENBlogs.txt")

writeLines(blogs[select1], con1a)
list.files("en_US/")
## [1] "en_US.blogs.txt"     "en_US.news.txt"      "en_US.twitter.txt"  
## [4] "SampleENBlogs.txt"   "SampleENNews.txt"    "SampleENTwitter.txt"
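
For completeness, here is a sketch of how the News and Twitter samples could be written in the same way (select2, select3, con2a and con3a are assumed names; the file names match those listed above):

select2 <- (rbinom(length(news), 1, 0.1) == 1)
select3 <- (rbinom(length(twitter), 1, 0.1) == 1)

file.create("en_US/SampleENNews.txt")
con2a <- file("en_US/SampleENNews.txt")
writeLines(news[select2], con2a)

file.create("en_US/SampleENTwitter.txt")
con3a <- file("en_US/SampleENTwitter.txt")
writeLines(twitter[select3], con3a)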

Thus, three files containing sampled data are created, and further exploration is done using them.

Creation of Corpus

To create the corpus, VCorpus is used: it is a volatile (in-memory) corpus, which allows dynamic processing and thus lets the DocumentTermMatrix be reduced to a smaller, more manageable size, which is the target of the subsequent operations.

To clean the data, the following transformations are applied:

  1. Transformation to lower case
  2. Removing punctuation marks
  3. Removing stopwords from the data
  4. Stripping whitespace

library(tm)

# Re-read the sampled files
con1 <- file("en_US/SampleENBlogs.txt")
con2 <- file("en_US/SampleENNews.txt")
con3 <- file("en_US/SampleENTwitter.txt")

blogs <- readLines(con1)
news <- readLines(con2)
twitter <- readLines(con3)

# Build a volatile corpus from the Blogs sample and apply the cleaning steps
corpus1 <- VCorpus(VectorSource(blogs))
corpus1 <- tm_map(corpus1, content_transformer(tolower))               # lower case
corpus1 <- tm_map(corpus1, FUN = removePunctuation)                    # remove punctuation
corpus1 <- tm_map(corpus1, FUN = removeWords, stopwords(kind = "en"))  # remove stopwords
corpus1 <- tm_map(corpus1, FUN = stripWhitespace)                      # strip extra whitespace
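
The News and Twitter samples go through the same pipeline; below is a sketch using a small helper function (clean_corpus, corpus2 and corpus3 are assumed names, not taken from the original code):

# Helper that builds a volatile corpus and applies the four cleaning steps
clean_corpus <- function(x) {
  cp <- VCorpus(VectorSource(x))
  cp <- tm_map(cp, content_transformer(tolower))
  cp <- tm_map(cp, removePunctuation)
  cp <- tm_map(cp, removeWords, stopwords(kind = "en"))
  tm_map(cp, stripWhitespace)
}

corpus2 <- clean_corpus(news)
corpus3 <- clean_corpus(twitter)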

Calculating Document-Term Matrices from the corpora created

Below is sample code for the Blogs corpus; similar code is written for News and Twitter.

dtm11 <- DocumentTermMatrix(corpus1)        # unigram document-term matrix
dtm11 <- removeSparseTerms(dtm11, 0.995)    # drop terms with sparsity above 99.5%
mat11 <- as.matrix(dtm11)
wordfreq11 <- colSums(mat11)                # total frequency of each term
wordfreq11 <- sort(wordfreq11, decreasing = TRUE)

Removing sparse terms is necessary; otherwise, the matrix is too large and memory gets exhausted. However, setting the sparsity threshold too low leaves only a few terms, which makes the calculations of little use.
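
One way to judge the threshold is to compare term counts at different sparsity levels; a sketch (dtm11_full is an assumed name, nTerms comes from tm):

dtm11_full <- DocumentTermMatrix(corpus1)
nTerms(dtm11_full)                             # all terms, most of them extremely sparse
nTerms(removeSparseTerms(dtm11_full, 0.995))   # terms kept at the 99.5% threshold
nTerms(removeSparseTerms(dtm11_full, 0.90))    # a stricter threshold keeps only a handful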

For the bi-gram tokenizer, the code is as follows:

library(RWeka)
# Tokenizer that produces bi-grams (pairs of consecutive words)
tokenarg <- function(x) { NGramTokenizer(x, Weka_control(min = 2, max = 2)) } 
dtm12 <- DocumentTermMatrix(corpus1, control = list(tokenize = tokenarg))
dtm12 <- removeSparseTerms(dtm12, 0.9995)   # drop bi-grams with sparsity above 99.95%
mat12 <- as.matrix(dtm12)
wordfreq12 <- colSums(mat12)
wordfreq12 <- sort(wordfreq12, decreasing = TRUE)

Similar code was written for the tri-gram tokenizer as well.
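
A sketch of that tri-gram code, following the same pattern (dtm13, mat13 and wordfreq13 match the object names used below; the 0.9999 threshold is inferred from finding 1 at the end):

tokenarg3 <- function(x) { NGramTokenizer(x, Weka_control(min = 3, max = 3)) }
dtm13 <- DocumentTermMatrix(corpus1, control = list(tokenize = tokenarg3))
dtm13 <- removeSparseTerms(dtm13, 0.9999)   # assumed threshold, see finding 1
mat13 <- as.matrix(dtm13)
wordfreq13 <- colSums(mat13)
wordfreq13 <- sort(wordfreq13, decreasing = TRUE)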

Vacating the space

object.size(corpus1)
## 390937512 bytes
object.size(mat11)
## 443259584 bytes
object.size(mat12)
## 496512752 bytes
object.size(mat13)
## 184215632 bytes
# Remove the large intermediate objects; only the word-frequency vectors are kept
rm(corpus1)
rm(mat11)
rm(mat12)
rm(mat13)

TIME TO PLOT
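
The frequency plots themselves are not reproduced in this extract. A minimal sketch of how the top terms could be plotted from the frequency vectors (the exact plotting code is assumed):

par(mar = c(9, 4, 4, 2))   # extra bottom margin for rotated bar labels
barplot(wordfreq11[1:10], las = 2, main = "Top 10 words in the Blogs sample")
barplot(wordfreq12[1:10], las = 2, main = "Top 10 bi-grams in the Blogs sample")
barplot(wordfreq13[1:10], las = 2, main = "Top 10 tri-grams in the Blogs sample")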

Plan for creating Shiny Algorithm

Thus, we get a good sense of the data we are dealing with. Some interesting findings are as follows:

  1. In the 3-gram model, only 248 tri-grams are retained at a sparsity threshold of 99.99%; keeping more would exhaust the memory. In the 2-gram model, 682 bi-grams are retained, which is better.

  2. In the top 10 tri-grams, "new york" appears 3 times, which suggests the dataset is perhaps suited mainly to American applications and may be less useful for other regions.

  3. The corpora and the matrices derived from the DTMs are huge, hundreds of megabytes each. Thus, after calculating the word-frequency data frames, these objects should be deleted to save memory and keep the application fast. Notably, only the word-frequency data frames are needed by the final prediction algorithm (a sketch of such a data frame follows below).
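
A sketch (assumed, not from the original code) of turning a frequency vector into a data frame the prediction algorithm could work with:

# Convert the bi-gram frequency vector into a small lookup data frame
freq_df12 <- data.frame(ngram = names(wordfreq12),
                        count = as.numeric(wordfreq12),
                        stringsAsFactors = FALSE)
head(freq_df12)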

With these takeaways, the plan for a Shiny app moves forward.