Summary

This report describes the steps taken to explore and analyze the training data set using tools available in the R programming language. This data will later be used to train a text-prediction model. The document covers downloading, sampling, cleaning, and analyzing the data; the objective is to calculate the frequency of single words and word pairs in the documents.

The data source

The data source for this project is a collection of plain-text files containing three types of content: blogs, news, and Twitter posts. The text comes in several languages, but only the US English (en_US) files will be processed.

Downloading the data files

The following code downloads the data file and stores it in a temporary folder: it creates a temporary directory and a temporary file, downloads the zip archive with the source data for this project, and extracts its contents into that directory.

library(rio)
## Warning: package 'rio' was built under R version 4.1.1
td <- tempdir()  # create a temporary directory
tf <- tempfile(tmpdir=td, fileext=".zip") # create a temporary file
download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", tf) # download file from internet into temporary location
file_names <- unzip(tf, list=TRUE)  # list zip archive
unzip(tf, exdir=td, overwrite=TRUE)  # extract files from zip file

# unlink(td) # delete the files and directories

File information

The following code obtains information about the data files (size, number of lines, and maximum line length).
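The chunks that read the raw files are not shown; a minimal sketch is given below, assuming the archive was extracted into the temporary directory td created in the previous section. It mirrors the readLines call whose warning appears in the news section.

setwd(td)
# Read each file line by line; skipNul avoids problems with embedded NUL characters.
blogsFileLines <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
newsFileLines <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitterFileLines <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)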

Blog File information:

Blog file lines summary:

summary(blogsFileLines)
##    Length     Class      Mode 
##    899288 character character

Blog file lines character summary:

summary(nchar(blogsFileLines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1      47     156     230     329   40833

Blog file’s size:

setwd(td)
blogsFileinfo <- file.info("final/en_US/en_US.blogs.txt")
bsizeB <- blogsFileinfo$size
bsizeKB <- bsizeB/1024
bsizeMB <-  bsizeKB/1024
bsizeMB
## [1] 200.4242

News File information:

## Warning in readLines(newsFile, encoding = "UTF-8", skipNul = TRUE): incomplete
## final line found on 'final/en_US/en_US.news.txt'

News file lines summary:

summary(newsFileLines)
##    Length     Class      Mode 
##     77259 character character

News file lines character summary:

summary(nchar(newsFileLines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0   111.0   186.0   202.4   270.0  5760.0

News file’s size:

setwd(td)
newsFileinfo <- file.info("final/en_US/en_US.news.txt")
nsizeB <- newsFileinfo$size
nsizeKB <- nsizeB/1024
nsizeMB <-  nsizeKB/1024
nsizeMB
## [1] 196.2775

Twitter File information:

Twitter file lines summary:

summary(twitterFileLines)
##    Length     Class      Mode 
##   2360148 character character

Twitter file lines characters summary:

summary(nchar(twitterFileLines))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   37.00   64.00   68.68  100.00  140.00

Twitter file’s size:

setwd(td)
twitterFileinfo <- file.info("final/en_US/en_US.twitter.txt")
tsizeB <- twitterFileinfo$size
tsizeKB <- tsizeB/1024
tsizeMB <-  tsizeKB/1024
tsizeMB
## [1] 159.3641

Sampling the text files

The data files are large (from about 159 MB to 200 MB) and contain from about 77,259 lines of text in the news file to 2,360,148 lines in the Twitter file. To speed up the data exploration and analysis, the source files are sampled to reduce processing time. Approximately 1% of the lines of each file is selected at random, using the rbinom function to decide which lines to keep. The resulting samples are saved to new files.

#install.packages("stringi")
library(stringi)
## Warning: package 'stringi' was built under R version 4.1.2
setwd(td)

sampleBlogs <- blogsFileLines[as.logical(rbinom(length(blogsFileLines), 1, 0.01))] # Keep each blog line with probability 1% (one Bernoulli draw per line via rbinom).
sampleNews <- newsFileLines[as.logical(rbinom(length(newsFileLines), 1, 0.01))] # Keep each news line with probability 1%.
sampleTwitter <- twitterFileLines[as.logical(rbinom(length(twitterFileLines), 1, 0.01))] # Keep each Twitter line with probability 1%.

sampleBlogs <- stri_replace_all_regex(sampleBlogs, "\u2018|\u2026|\u201c|\u201d|\u2019","") # Remove curly quotation marks (left/right single and double quotes) and the ellipsis character.
# summary(sampleBlogs) # Check the final size of the sampled blog file.
sampleNews <- stri_replace_all_regex(sampleNews, "\u2018|\u2026|\u201c|\u201d|\u2019","") # Same replacement for the news sample.
# summary(sampleNews)  # Check the final size of the sampled news file.
sampleTwitter <- stri_replace_all_regex(sampleTwitter, "\u2018|\u2026|\u201c|\u201d|\u2019","")
# summary(sampleTwitter) # Check the final size of the sampled twitter file.

# write.csv(sampleBlogs, file = 'Sample/sampleBlogs.csv', row.names = FALSE)
dir.create('final/en_US/sample')
write.csv(sampleBlogs, 'final/en_US/sample/sampleBlogs.csv', row.names = FALSE) #Writing the resultant sampled file to a csv file.
write.csv(sampleNews, file = "final/en_US/sample/sampleNews.csv", row.names = FALSE)
write.csv(sampleTwitter, file = "final/en_US/sample/sampleTwitter.csv", row.names = FALSE)

Working with the sampled files makes the subsequent computations considerably faster.

Data Preprocessing

Data pre-processing prepares the text for analysis. The sampled files contain a mix of upper and lower case, punctuation marks, stopwords, numbers and other unwanted terms that should be removed or normalized before any analysis.

First, the sampled files are loaded into memory as a corpus. Once in memory, the documents are pre-processed using the tm package.

Common data cleaning tasks in text mining are:

- Converting the entire document to lower case
- Removing punctuation marks (periods, commas, hyphens, etc.)
- Removing stopwords (extremely common words such as “and”, “or”, “not”, “in”, “is”, etc.)
- Removing numbers
- Filtering out unwanted terms
- Removing extra whitespace

#Corpus
library(tm)
## Warning: package 'tm' was built under R version 4.1.3
## Loading required package: NLP
## Warning: package 'NLP' was built under R version 4.1.1
setwd(td)
textfiles <- Corpus(DirSource("final/en_US/sample"), readerControl = list(reader=readPlain, language="en_US"))
inspect(textfiles)

#Start preprocessing
toSpace <- content_transformer(function(x, pattern) { return(gsub(pattern, " ", x))}) # transformer that replaces a pattern with a space
textfiles <- tm_map(textfiles, toSpace, "-")
textfiles <- tm_map(textfiles, toSpace, ":")
textfiles <- tm_map(textfiles, toSpace, ",")
textfiles <- tm_map(textfiles, toSpace, "'")

#Remove punctuation
textfiles <- tm_map(textfiles, removePunctuation)
#Transform to lower case
textfiles <- tm_map(textfiles, content_transformer(tolower))
#Strip digits
textfiles <- tm_map(textfiles, removeNumbers)
#Remove words from the standard English stopword list
textfiles <- tm_map(textfiles, removeWords, stopwords("english"))
#Strip extra whitespace
textfiles <- tm_map(textfiles, stripWhitespace)
#Stem the documents
textfiles <- tm_map(textfiles, stemDocument)
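As a quick check (a minimal sketch; the document name sampleBlogs.csv is assumed to match the file written in the sampling step), the first lines of one of the cleaned documents can be inspected to confirm that punctuation, numbers, stopwords and extra whitespace were removed:

# Print the first few lines of the cleaned blog sample.
writeLines(head(content(textfiles[["sampleBlogs.csv"]]), 3))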

Data exploration

After removing the unwanted terms and characters, the common approach in text mining is to build a matrix of term frequencies. The TermDocumentMatrix function creates this term-document matrix from the corpus.

The as.matrix function converts the term-document matrix into a regular matrix, and rowSums counts how often each word appears across the documents. Because the matrix contains a large number of terms, only the ten most frequent words are kept (in the topten object) for display.

corpus_tdm <- TermDocumentMatrix(textfiles)
dtm.matrix <- as.matrix(corpus_tdm)
wordcount <- rowSums(dtm.matrix)
topten <- head(sort(wordcount, decreasing=TRUE), 10)

Plotting the most common words from the matrix

Once the most common words have been determined, the plot below shows a bar chart of the ten most frequent words in the corpus.

library(reshape2)
## Warning: package 'reshape2' was built under R version 4.1.3
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
dfplot <- as.data.frame(melt(topten))
dfplot$word <- dimnames(dfplot)[[1]]
dfplot$word <- factor(dfplot$word,
                      levels=dfplot$word[order(dfplot$value,
                                               decreasing=TRUE)])

fig <- ggplot(dfplot, aes(x=word, y=value)) + geom_bar(stat="identity")
fig <- fig + xlab("Word in Corpus")
fig <- fig + ylab("Count")
print(fig)

Next steps

From here, the next step toward predicting text is to generate two-gram and three-gram frequency matrices. The frequencies of these word combinations will then be used to predict the most probable next word, based on the word or words the user has just entered.
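As a sketch of that next step (not run as part of this report), the TermDocumentMatrix call can be reused with a custom bigram tokenizer built from the ngrams and words helpers in the NLP package that tm already loads; the names BigramTokenizer, bigram_tdm and bigramCount are illustrative only.

# Tokenizer that splits each document into overlapping two-word sequences.
BigramTokenizer <- function(x) unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)

# Term-document matrix of bigram frequencies for the cleaned corpus.
bigram_tdm <- TermDocumentMatrix(textfiles, control = list(tokenize = BigramTokenizer))
bigramCount <- rowSums(as.matrix(bigram_tdm))
head(sort(bigramCount, decreasing = TRUE), 10) # ten most frequent word pairs

Replacing the 2 with a 3 in the ngrams call would produce the three-gram counts in the same way.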