Data Science Capstone- SwiftKey Milestone Report

The purpose of this report is to provide a thorough exploratory analysis of the SwiftKey datasets, which are available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip and are used for the capstone project of the Data Science Specialization course.

This dataset will later be used to build a model that predicts the next word a user might type, so understanding the distribution and frequencies of words and word-pair relationships is the essential goal of this milestone report.

The SwiftKey data set contains text in multiple languages, but this report focuses only on the English-language files within it. There are 3 types of files (news, blog, twitter).

The following sections provide a brief analysis of the SwiftKey datasets.

Section 1- Loading the data

A summary of each file is given below (number of lines, class and mode):

## 1. blog_summary:  899288 character character

## 2. twitter_summary:  2360148 character character

## 3. news_summary:  1010242 character character
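
The three values on each line above are what `summary()` reports for a character vector: its length (here, the number of lines read), its class and its mode. A minimal sketch of how such a summary could be produced is shown below. It uses a small in-memory stand-in for a file; the name `blog_lines` and its contents are hypothetical, and the real files would be read with `readLines()` from their paths on disk.

```r
# Hypothetical stand-in for one of the English text files; in practice the
# lines would come from readLines() on the unzipped file.
con <- textConnection(c("first line of text", "second line", "third line"))
blog_lines <- readLines(con)
close(con)

# summary() on a character vector reports Length, Class and Mode.
blog_summary <- summary(blog_lines)
print(blog_summary)
```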

Summary of each sample file (a 1 percent sample of the lines in each file, drawn via a binomial random distribution):

## 1. Sample blog summary:  8914 character character

## 2. Sample twitter summary:  10062 character character

## 3. Sample news summary:  23887 character character
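
The binomial sampling described above could be sketched as follows. This is an illustration under stated assumptions rather than the exact code behind the figures: the function name `sample_lines`, the seed and the synthetic input are all hypothetical.

```r
# Keep each line independently with probability `prob` (a binomial draw
# with size 1). The function name and the seed are illustrative only.
sample_lines <- function(lines, prob = 0.01, seed = 1234) {
  set.seed(seed)  # make the sample reproducible
  keep <- rbinom(n = length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

# Synthetic input standing in for a real file.
all_lines <- paste("line", seq_len(10000))
sampled_lines <- sample_lines(all_lines)
length(sampled_lines)  # roughly 1 percent of the input; varies with the seed
```

Because every line is kept or dropped independently, the sample size is itself random and only approximately 1 percent of the input.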

Section 2- Data cleaning and preprocessing

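## combined sample dataset
library(tm)  # provides Corpus, VectorSource, tm_map and the cleaning transformations
combined_sample <- c(sample_blogs, sample_twitter, sample_news)
# build a corpus from the combined sample
sample_data <- Corpus(VectorSource(list(combined_sample)))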


# cleaning the data: removing English stop words, punctuation, numbers and extra whitespace, then converting all letters to lowercase
# note: stopwords("english") is all lower case and removeWords is case-sensitive, so capitalised stop words (e.g. "I", "The") are not removed at this step
sample_data <- tm_map(sample_data, removeWords, stopwords("english"))
sample_data <- tm_map(sample_data, removePunctuation)
sample_data <- tm_map(sample_data, removeNumbers)
sample_data <- tm_map(sample_data, stripWhitespace)

sample_data <- tm_map(sample_data, content_transformer(tolower))

# Profanity filtering - removing profanity and bad words, which I don't want to predict.
# A text file of very common bad words from http://www.bannedwordlist.com/lists/swearWords.txt is used as the filter list.
badwords <- read.delim("swear_words.txt", sep = ":", header = FALSE, stringsAsFactors = FALSE)
badwords <- badwords[, 1]
sample_data <- tm_map(sample_data, removeWords, badwords)

# write the cleaned corpus to disk and read it back as a plain character vector
writeCorpus(sample_data, filenames = "sample_data.txt")
sample_data <- readLines("sample_data.txt")
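
To show what the cleaning steps do to a single line of text, here is a small base-R sketch. It is an approximation built on `gsub()`, not the `tm` implementation, and the function name `clean_text` and its `drop_words` argument are hypothetical. This sketch lower-cases the text before matching words, so capitalised forms are matched as well; that ordering is a choice of the sketch.

```r
# Illustrative base-R equivalents of the cleaning steps; not the tm code.
clean_text <- function(x, drop_words = character(0)) {
  x <- tolower(x)  # lower-case first so word matching ignores case
  if (length(drop_words) > 0) {
    pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
    x <- gsub(pattern, "", x, perl = TRUE)  # remove whole-word matches only
  }
  x <- gsub("[[:punct:]]+", "", x)  # strip punctuation
  x <- gsub("[[:digit:]]+", "", x)  # strip numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)
}

clean_text("The 3 cats, and the dog!", drop_words = c("the", "and"))
# returns "cats dog"
```

The same `drop_words` argument could take either a stop-word list or a profanity list, since both are removed by whole-word matching.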

Section 3- Tokenization

Tokenization is a method for identifying appropriate tokens such as words, punctuation, and numbers. The task here is to write a function that takes a file as input and returns a tokenized version of it. For this purpose, I am using an efficient n-gram tokenizer in R written by Maciej Szymkiewicz. The sample dataset is tokenized with gram values of 1, 2, 3 and so on to understand the features of each case.
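
As a rough illustration of what an n-gram tokenizer does, the following base-R sketch splits each line on whitespace and counts the resulting n-grams. It is not the tokenizer referenced in this report; the function names `ngram_tokenize` and `ngram_freq` and the toy input are hypothetical.

```r
# Build all n-grams of size n from a character vector of cleaned lines.
# A base-R illustration, not the tokenizer used for the tables in this report.
ngram_tokenize <- function(lines, n = 1) {
  grams <- lapply(lines, function(line) {
    words <- strsplit(trimws(line), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))  # line too short for an n-gram
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  })
  unlist(grams)
}

# Count n-grams and return them sorted by decreasing frequency.
ngram_freq <- function(lines, n = 1) {
  counts <- sort(table(ngram_tokenize(lines, n)), decreasing = TRUE)
  data.frame(word = names(counts), freq = as.integer(counts),
             stringsAsFactors = FALSE)
}

toy_lines <- c("i think i know", "i think so")
ngram_freq(toy_lines, n = 2)  # "i think" comes first with a frequency of 2
```

N-grams are built per line, so no n-gram spans the boundary between two lines.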

Unigram Analysis- putting gram value as 1

## Warning: package 'RWeka' was built under R version 3.2.3

##      word  freq
## 1       i 15144
## 2     the  5294
## 3    will  3158
## 4    said  3130
## 5    just  2961
## 6     one  2937
## 7    like  2553
## 8     can  2412
## 9      im  2399
## 10    get  2276
## 11   time  2127
## 12    new  1922
## 13    now  1848
## 14   good  1753
## 15    day  1736
## 16   know  1669
## 17 people  1603
## 18   love  1541
## 19     us  1478
## 20     it  1470

Bigram Analysis- putting gram value as 2

##           word freq
## 1      i think  672
## 2       i know  486
## 3       i love  473
## 4        i can  382
## 5       i just  376
## 6       i will  338
## 7       i want  298
## 8    right now  251
## 9       i like  224
## 10      i feel  205
## 11    i really  201
## 12    new york  201
## 13       i got  198
## 14       i get  189
## 15      i need  179
## 16   last year  179
## 17      time i  179
## 18   i thought  176
## 19  last night  162
## 20 high school  152

Trigram Analysis- putting gram value as 3

##                 word freq
## 1          i think i   91
## 2           i know i   77
## 3        i feel like   52
## 4           i wish i   48
## 5  happy mothers day   40
## 6        i thought i   37
## 7           i love i   32
## 8        let us know   30
## 9           i knew i   28
## 10     new york city   28
## 11       i dont know   27
## 12      i dont think   25
## 13        i think im   25
## 14       right now i   25
## 15         i guess i   24
## 16       feel like i   23
## 17         i know im   23
## 18      every time i   22
## 19    happy new year   22
## 20     cant wait see   21

Section 4- Conclusion and further plan

The experiments above show that as n increases from 1 to 3, the frequency of each n-gram drops (the top unigram occurs 15144 times, the top bigram 672 times and the top trigram 91 times), while each n-gram carries more context about the likely next word. The frequency of occurrence of n-grams can therefore be used to determine the next word in a sequence. The next step is to develop a Shiny application for the prediction algorithm, which will use n-gram frequency matrices to find associations between words and n-grams and predict the next word.
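
A minimal sketch of frequency-based prediction, assuming a simple bigram lookup, is given below. The counts are copied from the bigram table above; the function name `predict_next` is hypothetical, and the eventual prediction algorithm may well differ.

```r
# A few bigram counts copied from the bigram table in this report.
bigrams <- data.frame(
  word = c("i think", "i know", "i love", "right now",
           "new york", "last year", "last night"),
  freq = c(672, 486, 473, 251, 201, 179, 162),
  stringsAsFactors = FALSE
)

# Return the most frequent word that follows `previous` in the bigram table,
# or NA if the word was never seen as the first element of a bigram.
predict_next <- function(previous, bigrams) {
  parts <- strsplit(bigrams$word, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  candidates <- first == tolower(previous)
  if (!any(candidates)) return(NA_character_)
  second[candidates][which.max(bigrams$freq[candidates])]
}

predict_next("I", bigrams)     # "think"
predict_next("last", bigrams)  # "year"
```

The same lookup extends to trigrams by matching on the first two words instead of one.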

References

Profanity word list (English/American): http://www.bannedwordlist.com/lists/swearWords.txt
SwiftKey datasets for this report: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip
Efficient n-gram tokenizer in R, written by Maciej Szymkiewicz: https://github.com/zero323/r-snippets/blob/master/R/ngram_tokenizer.R
Basic background on n-grams: https://en.wikipedia.org/wiki/N-gram