Introduction

This is the Milestone Report for week 2 of the capstone project of the Coursera Data Science Specialization. The objective of the report is to provide an exploratory analysis of the SwiftKey data. Data from three different sources (plain text files) are imported into R and analysed. The report also outlines the goals for further development: model creation, prediction and app creation.

Summary of the data

Let us first download the file from the URL provided; a cached copy is used when available. The data come as text files from three different sources: Twitter (en_US.twitter.txt), Blogs (en_US.blogs.txt) and News (en_US.news.txt). As a first analysis, we count the total number of lines and characters in each file.

##      File TotalLines TotalChars
## 1 Twitter    2360148  162096031
## 2   Blogs     899288  206824505
## 3    News      77259   15639408

It can be observed that the Twitter file has the most lines, whereas the Blogs file has the most characters.

Features of data

The data files are very large, so we sample them (around 1% of the data) before cleaning. We create a corpus and then clean it by performing the following operations on the corpus:

1. Remove non-ASCII characters.
2. Remove punctuation and unnecessary white space, and convert to lower case.
3. Remove numbers, as we are not going to predict numbers.
4. Remove profane words (using a bad-words list) and English stop words.

## <<SimpleCorpus>>
## Metadata:  corpus specific: 1, document level (indexed): 0
## Content:  documents: 33365
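
To sanity-check the cleaning, individual documents can be pulled out of the corpus and printed. A minimal sketch (as.character() on a corpus element returns its cleaned text):

# look at the first two cleaned documents
as.character(skCorpus[[1]])
as.character(skCorpus[[2]])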

Now we tokenize the corpus into uni-grams, bi-grams and tri-grams. For each of these, we then look at the top 10 words or word combinations, ranked by their frequency in the corpus. The results are shown as bar graphs, with the frequency of each n-gram plotted against the n-gram itself.
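
As an aside, the behaviour of the tokenizer is easy to see on a toy string (a minimal sketch; the string below is made up and not taken from the corpus):

library(RWeka)
# bi-grams of a toy string; the real analysis runs NGramTokenizer over the whole corpus
NGramTokenizer("thanks for the follow", Weka_control(min = 2, max = 2))
# expected: "thanks for" "for the" "the follow"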

Next Steps

In the above analysis, we created n-grams and converted them to data frames in order to explore the data. Term Document Matrices can be built from this data and used for model creation, and the n-gram data frames can be used to predict the next word in a particular sequence. A Shiny app can then be built around the model; since only a 1% sample is used, the app should be able to produce predictions within an acceptable time. To increase accuracy, we can also look at increasing the sample size to 5% and check how the Shiny app performs.
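
As a rough sketch of that last point, a next-word lookup over a tri-gram frequency table could look something like the function below. Here predictNext and triData are hypothetical names: triData is assumed to be a data frame with the same Word/Freq columns as the n-gram tables built in the Appendix code.

# return the most frequent third word of tri-grams starting with "w1 w2"
# (triData: data frame with columns Word = "w1 w2 w3" and Freq)
predictNext <- function(w1, w2, triData) {
    prefix <- paste(w1, w2)
    matches <- triData[startsWith(as.character(triData$Word), paste0(prefix, " ")), ]
    if (nrow(matches) == 0) return(NA_character_)
    best <- matches[which.max(matches$Freq), ]
    # the prediction is the last token of the best-matching tri-gram
    tail(strsplit(as.character(best$Word), " ")[[1]], 1)
}

# hypothetical usage: predictNext("happy", "new", triData) might return "year"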

Appendix

Code for summaries

if(!file.exists("coursera-swiftkey.zip")){
  download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", "coursera-swiftkey.zip")
  unzip("coursera-swiftkey.zip")
}

con <- file("final/en_US/en_US.twitter.txt", "r")
twitter <- readLines(con, encoding = "UTF-8", warn = FALSE)
close(con)
con <- file("final/en_US/en_US.blogs.txt", "r")
blogs <- readLines(con, encoding = "UTF-8", warn = FALSE)
close(con)
con <- file("final/en_US/en_US.news.txt", "r")
news <- readLines(con, encoding = "UTF-8", warn = FALSE)
close(con)

# per-file summary: total lines and total characters (nchar() counts characters, not words)
fileSummary <- data.frame("File" = c("Twitter", "Blogs", "News"),
    "TotalLines" = sapply(list(twitter, blogs, news), length), 
    "TotalChars" = sapply(list(twitter, blogs, news), function(x){ sum(nchar(x)) }))
fileSummary
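
Note that nchar() counts characters rather than words. If true word counts are wanted, one simple option is to split each line on runs of whitespace; countWords below is a hypothetical helper added only for illustration:

# approximate word counts by splitting each line on whitespace
countWords <- function(x) sum(lengths(strsplit(x, "\\s+")))
sapply(list(twitter, blogs, news), countWords)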

Code for corpus creation

# setting seed for reproducibility
set.seed(25)
sampleSize <- 0.01 # use a 1% sample of each file

# sampling data sets
subTwitter <- twitter[sample(length(twitter), floor(length(twitter) * sampleSize), replace = FALSE)]
subBlogs <- blogs[sample(length(blogs), floor(length(blogs) * sampleSize), replace = FALSE)]
subNews <- news[sample(length(news), floor(length(news) * sampleSize), replace = FALSE)]

# remove objects that are no longer needed
rm(twitter, blogs, news, fileSummary)

# Cleaning (uses the tm text-mining package)
library(tm)
skCorpus <- VCorpus(VectorSource(c(subTwitter, subBlogs, subNews)), readerControl = list(reader = readPlain, language = "en")) # Make corpus
skCorpus <- Corpus(VectorSource(sapply(skCorpus, function(x) iconv(x, "latin1", "ASCII", sub = "")))) # Remove non-ASCII characters
skCorpus <- tm_map(skCorpus, removePunctuation) # Remove punctuation
skCorpus <- tm_map(skCorpus, stripWhitespace) # Remove unnecessary white spaces
skCorpus <- tm_map(skCorpus, content_transformer(tolower)) # Convert to lowercase
skCorpus <- tm_map(skCorpus, removeNumbers) # Remove numbers
badWords <- readLines("final/en_US/badWords.txt") # separately supplied list of profane words (not part of the SwiftKey download)
skCorpus <- tm_map(skCorpus, removeWords, badWords) # Remove profane words
skCorpus <- tm_map(skCorpus, removeWords, stopwords("english")) # Remove stop words
skCorpus

Code for tokenization

# Tokenization: RWeka provides NGramTokenizer, ggplot2 the bar charts
library(RWeka)
library(ggplot2)

ngram <- function(x = "uni"){
    # map the requested n-gram type to its order and plot labels
    num <- switch(x, uni = 1, bi = 2, tri = 3)
    grm <- switch(x, uni = "Uni", bi = "Bi", tri = "Tri")
    titleName <- paste0("Frequencies of Top 10 ", grm, "-grams")

    # tokenize the corpus into n-grams and tabulate their frequencies
    token <- NGramTokenizer(skCorpus, Weka_control(min = num, max = num))
    ngramData <- data.frame(table(token))
    ngramData <- ngramData[order(ngramData$Freq, decreasing = TRUE), ]
    colnames(ngramData) <- c("Word", "Freq")
    gramData <- head(ngramData, 10)

    # bar chart of the 10 most frequent n-grams
    ggplot(gramData, aes(x = reorder(Word, -Freq), y = Freq)) +
        geom_bar(stat = 'identity', fill = 'grey') +
        labs(x = paste0(grm, "-grams"), y = "Frequency", title = titleName) +
        theme(axis.text.x = element_text(angle = 90))
}
ngram(x = "uni")
ngram(x = "bi")
ngram(x = "tri")