Introduction

This is the milestone report for the Coursera Data Science Capstone Project. The goal of the project is to build a predictive text application that predicts the next word as the user types a sentence. This report demonstrates the work done on the exploratory data analysis of the SwiftKey data. Three sources of data are provided for this project, in 4 different languages; for this project we use the English database. The 3 text files used are listed below:

  1. en_US.blogs.txt
  2. en_US.news.txt
  3. en_US.twitter.txt
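
The chunks below rely on several packages that are never loaded explicitly in this report. Assuming they are installed, loading them once here keeps every later chunk runnable as shown:

library(stringi)       # stri_count_words, stri_stats_general, stri_stats_latex
library(ggplot2)       # qplot histograms
library(scales)        # comma label formatting for plot axes
library(tm)            # Corpus, tm_map, DocumentTermMatrix, stopwords
library(slam)          # col_sums on sparse document term matrices
library(wordcloud)     # wordcloud plots
library(RColorBrewer)  # brewer.pal colour palettes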

1) Data Summary

Downloading Raw Data

Data is downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. We assume the R working directory has already been set.

# Check for zip file and download if necessary
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "Coursera-SwiftKey.zip")
}
# Check for data file and unzip if necessary
if (!file.exists("final/en_US/en_US.blogs.txt")) {
    unzip("Coursera-SwiftKey.zip", exdir = ".")
}

Check the file size of each data file

file.size("final/en_US/en_US.blogs.txt")
## [1] 210160014
file.size("final/en_US/en_US.news.txt")
## [1] 205811889
file.size("final/en_US/en_US.twitter.txt")
## [1] 167105338
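
For readability, the same sizes can be expressed in megabytes (a small convenience sketch, not part of the original chunk; the three files come to roughly 159-200 MB each):

# Convert byte counts to megabytes (file.size is vectorised)
files <- c("final/en_US/en_US.blogs.txt", "final/en_US/en_US.news.txt",
           "final/en_US/en_US.twitter.txt")
round(file.size(files) / 1024^2, 1)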

2) Data Analysis

Load the files; the news data is read in binary mode

twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
## News file has special characters so it is better to read it as a binary file
bin.news <- file("final/en_US/en_US.news.txt", open="rb")
news <- readLines(bin.news, encoding="UTF-8")
close(bin.news)
rm(bin.news)

Output a summary of the files

# Compute words per line info on each line for each data type
word_per_line <- lapply(list(blogs,news,twitter),function(x) stri_count_words(x))
# Compute statistics and summary info for each data type
stats <- data.frame(
            File=c("blogs","news","twitter"), 
            t(rbind(sapply(list(blogs,news,twitter),stri_stats_general),
                    TotalWords=sapply(list(blogs,news,twitter),stri_stats_latex)[4,])),
            # Compute words per line summary
            WPL=rbind(summary(word_per_line[[1]]),summary(word_per_line[[2]]),summary(word_per_line[[3]]))
            )
print(stats)
##      File   Lines LinesNEmpty     Chars CharsNWhite TotalWords WPL.Min.
## 1   blogs  899288      899288 206824382   170389539   37570839        0
## 2    news 1010242     1010242 203223154   169860866   34494539        1
## 3 twitter 2360148     2360148 162096241   134082806   30451170        1
##   WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1           9         28 41.75107          60     6726
## 2          19         32 34.40997          46     1796
## 3           7         12 12.75065          18       47

Plot a basic histogram of words per line for each of the 3 data files

# Plot histogram for Blogs data
qplot(word_per_line[[1]],geom="histogram",fill=I("red"),main="Histogram for Blogs",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

# Plot histogram for news data
qplot(word_per_line[[2]],geom="histogram",fill=I("green"),main="Histogram for News",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

# Plot histogram for twitter data
qplot(word_per_line[[3]],geom="histogram",fill=I("blue"),main="Histogram for Twitter",
      xlab="No. of Words",ylab="Frequency",binwidth=10) + scale_y_continuous(labels = comma)

rm(word_per_line)
rm(stats)

From the statistics, we observe that words per line are generally higher for blogs, followed by news and Twitter. We also notice that the words-per-line distributions for all data types are right-skewed. This may be an indication of a general trend towards short and concise communication.
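
To back the skewness observation with a number, here is a minimal moment-based sketch (a hypothetical helper, not in the original report; it must run before the rm(word_per_line) call above, since it reuses that list):

# Moment-based sample skewness; positive values indicate a right skew
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
sapply(word_per_line, skewness)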

3) Data Sampling

Because the full dataset is very large, we perform data sampling before the analysis. We set the sample size to 10,000 lines per source, to be cleaned before the exploratory analysis.

samplesize <- 10000  # Assign sample size
set.seed(1000)  # Ensure reproducibility 
# Create raw data and sample vectors
data <- list(blogs, news, twitter)
sample <- list()
# Iterate over each raw dataset to create a cleaned sample of each
for (i in 1:length(data)) {
    # Create sample dataset
    Filter <- sample(1:length(data[[i]]), samplesize, replace = FALSE)
    sample[[i]] <- data[[i]][Filter]
    # Remove non-ASCII / malformed characters (iconv is vectorised,
    # so the whole sample can be converted in one call)
    sample[[i]] <- iconv(sample[[i]], "latin1", "ASCII", sub = "")
}
rm(blogs)
rm(news)
rm(twitter)
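
As a quick optional check (not in the original report), each element of the sample list should now hold exactly samplesize lines:

sapply(sample, length)  # expected: 10000 for each of the three sources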

4) Creating Corpus and Cleaning Data

Create a corpus for each data type, clean the data, and build a document term matrix to identify term occurrences in the files.

# Create corpus and document term matrix vectors
corpus <- list()
dtMatrix <- list()
# Iterate each sample data to create corpus and DTM for each
for (i in 1:length(sample)) {
    # Create corpus dataset
    corpus[[i]] <- Corpus(VectorSource(sample[[i]]))
    # Cleaning/stemming the data
    corpus[[i]] <- tm_map(corpus[[i]], tolower)
    corpus[[i]] <- tm_map(corpus[[i]], removeNumbers)
    corpus[[i]] <- tm_map(corpus[[i]], removeWords, stopwords("english"))
    corpus[[i]] <- tm_map(corpus[[i]], removePunctuation)
    corpus[[i]] <- tm_map(corpus[[i]], stemDocument)
    corpus[[i]] <- tm_map(corpus[[i]], stripWhitespace)

    # calculate document term frequency for corpus
    dtMatrix[[i]] <- DocumentTermMatrix(corpus[[i]], control = list(wordLengths = c(0, 
        Inf)))
}
## Warning in tm_map.SimpleCorpus(corpus[[i]], tolower): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corpus[[i]], removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corpus[[i]], removeWords, stopwords("english")):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus[[i]], removePunctuation): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus[[i]], stemDocument): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(corpus[[i]], stripWhitespace): transformation
## drops documents
## (The six warnings above are repeated for each of the three corpora;
## duplicate output omitted.)
rm(data)
rm(sample)
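
Before plotting, the document term matrices can be sanity-checked directly. A small optional sketch (not in the original) listing the most frequent stemmed terms in the blogs corpus:

# Top 10 stemmed terms by total frequency in the blogs DTM
head(sort(col_sums(dtMatrix[[1]]), decreasing = TRUE), 10)
# Alternatively, all terms appearing at least 100 times (tm::findFreqTerms)
findFreqTerms(dtMatrix[[1]], lowfreq = 100)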

5) Plotting Data in Word Clouds

Plot a word cloud for each data type. The word cloud is used because it illustrates word frequency effectively: it presents a picture of the most common words, with those used more often displayed larger.

set.seed(3000)
par(mfrow = c(1, 3))  # Establish Plotting Panel
headings <- c("Blogs Word Cloud", "News Word Cloud", "Twitter Word Cloud")
# Iterate each corpus/DTM and plot word cloud for each
for (i in 1:length(corpus)) {
    wordcloud(words = colnames(dtMatrix[[i]]), freq = col_sums(dtMatrix[[i]]), 
        scale = c(3, 1), max.words = 100, random.order = FALSE, rot.per = 0.35, 
        use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
    title(headings[i])
}