The purpose of this report is to provide a thorough exploratory analysis of the SwiftKey datasets, available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip, for the capstone project of the Data Science Specialization course.
This dataset will later be used to build a model that predicts the next word a user might type, so understanding the distribution and frequencies of words and word-pair relationships is the essential goal of this milestone report.
The SwiftKey data set contains text in multiple languages, but this report focuses only on the English-language data. The English data consists of 3 types of files (news, blog, twitter).
The following section provides a brief analysis of the SwiftKey datasets.
Summaries of the individual files are given below:
## 1. blog_summary: Length 899288, Class character, Mode character
## 2. twitter_summary: Length 2360148, Class character, Mode character
## 3. news_summary: Length 1010242, Class character, Mode character
Summaries of the individual sample files (a 1 percent sample of each file, with lines selected by draws from a binomial distribution):
## 1. Sample blog summary: Length 8914, Class character, Mode character
## 2. Sample twitter summary: Length 10062, Class character, Mode character
## 3. Sample news summary: Length 23887, Class character, Mode character
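The sampling code itself is not shown in this report. A minimal sketch of how a 1 percent binomial sample could be drawn is given here; the helper name `sample_lines`, the seed value, and the toy input are assumptions for illustration, not the code used to produce the summaries above.

```r
# Hypothetical helper: keep each line independently with probability `prob`.
sample_lines <- function(lines, prob = 0.01, seed = 1234) {
  set.seed(seed)  # fixed seed (assumed value) so the sample is reproducible
  # One Bernoulli draw per line: 1 means keep the line, 0 means drop it
  keep <- rbinom(n = length(lines), size = 1, prob = prob) == 1
  lines[keep]
}

# Toy stand-in for the lines of one file
toy_lines <- paste("line", 1:10000)
toy_sample <- sample_lines(toy_lines)
length(toy_sample)  # roughly 1 percent of 10000; the exact value depends on the seed
```

Because each line is kept or dropped independently, the sample size is only approximately 1 percent of the file rather than an exact count.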
## Combined sample dataset
library(tm)
combined_sample <- c(sample_blogs, sample_twitter, sample_news)
sample_data <- Corpus(VectorSource(list(combined_sample)))
# Clean the data: remove stop words, punctuation, numbers and extra whitespace, then convert all letters to lowercase.
# removeWords is case-sensitive and runs before tolower here, so capitalized stop words such as "I" and "The" can survive this step.
sample_data <- tm_map(sample_data, removeWords, stopwords("english"))
sample_data <- tm_map(sample_data, removePunctuation)
sample_data <- tm_map(sample_data, removeNumbers)
sample_data <- tm_map(sample_data, stripWhitespace)
sample_data <- tm_map(sample_data, content_transformer(tolower))
# Profanity filtering: remove profanity and bad words, which I don't want to predict.
# A text file of bad words from http://www.bannedwordlist.com/lists/swearWords.txt is used to filter very common bad words from the files.
badwords <- read.delim("swear_words.txt", sep = ":", header = FALSE, stringsAsFactors = FALSE)
badwords <- badwords[, 1]
sample_data <- tm_map(sample_data, removeWords, badwords)
# Write the cleaned corpus to disk and read it back as plain lines of text.
writeCorpus(sample_data, filenames = "sample_data.txt")
sample_data <- readLines("sample_data.txt")
Tokenization is the process of identifying appropriate tokens such as words, punctuation, and numbers. This step needs a function that takes a file as input and returns a tokenized version of it. For this purpose, I am using an efficient n-gram tokenizer in R written by Maciej Szymkiewicz. We will tokenize the sample dataset with n-gram sizes 1, 2, 3 and so on to understand the features of each case.
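To illustrate the idea, here is a minimal base R sketch of n-gram tokenization and frequency counting. It is not the tokenizer by Maciej Szymkiewicz used in this report; the helper names `ngram_tokens` and `ngram_freq` and the toy input are assumptions for illustration. The frequency tables shown in this report come from the author's own run, not from this sketch.

```r
# Hypothetical helper: split each element of `text` into n-grams of size n.
# N-grams never cross the boundary between two elements (lines).
ngram_tokens <- function(text, n = 1) {
  one_line <- function(line) {
    words <- strsplit(line, "\\s+")[[1]]
    words <- words[nzchar(words)]  # drop empty strings from leading whitespace
    if (length(words) < n) return(character(0))
    starts <- seq_len(length(words) - n + 1)
    vapply(starts,
           function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }
  as.character(unlist(lapply(text, one_line)))
}

# Hypothetical helper: n-gram frequency table, most frequent first.
ngram_freq <- function(text, n = 1) {
  tokens <- ngram_tokens(text, n)
  if (length(tokens) == 0) {
    return(data.frame(word = character(0), freq = integer(0)))
  }
  counts <- table(tokens)
  freq <- data.frame(word = names(counts), freq = as.integer(counts),
                     stringsAsFactors = FALSE)
  freq <- freq[order(-freq$freq, freq$word), ]  # ties broken alphabetically
  rownames(freq) <- NULL
  freq
}

# Toy input standing in for the cleaned sample text
toy_text <- c("i think i know", "i think i can")
ngram_tokens("a b c", 2)   # "a b" "b c"
ngram_freq(toy_text, 2)    # "i think" and "think i" each occur twice
```

Tokenizing line by line keeps the last word of one line from being paired with the first word of the next, which would otherwise create n-grams that never occurred in the text.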
Top 20 unigrams (single words) by frequency:
##      word  freq
## 1       i 15144
## 2     the  5294
## 3    will  3158
## 4    said  3130
## 5    just  2961
## 6     one  2937
## 7    like  2553
## 8     can  2412
## 9      im  2399
## 10    get  2276
## 11   time  2127
## 12    new  1922
## 13    now  1848
## 14   good  1753
## 15    day  1736
## 16   know  1669
## 17 people  1603
## 18   love  1541
## 19     us  1478
## 20     it  1470
Top 20 bigrams (word pairs) by frequency:
##           word freq
## 1      i think  672
## 2       i know  486
## 3       i love  473
## 4        i can  382
## 5       i just  376
## 6       i will  338
## 7       i want  298
## 8    right now  251
## 9       i like  224
## 10      i feel  205
## 11    i really  201
## 12    new york  201
## 13       i got  198
## 14       i get  189
## 15      i need  179
## 16   last year  179
## 17      time i  179
## 18   i thought  176
## 19  last night  162
## 20 high school  152
Top 20 trigrams (three-word sequences) by frequency:
##                 word freq
## 1          i think i   91
## 2           i know i   77
## 3        i feel like   52
## 4           i wish i   48
## 5  happy mothers day   40
## 6        i thought i   37
## 7           i love i   32
## 8        let us know   30
## 9           i knew i   28
## 10     new york city   28
## 11       i dont know   27
## 12      i dont think   25
## 13        i think im   25
## 14       right now i   25
## 15         i guess i   24
## 16       feel like i   23
## 17         i know im   23
## 18      every time i   22
## 19    happy new year   22
## 20     cant wait see   21
These experiments show that as the n-gram size increases from 1 to 3, the count of each individual n-gram drops sharply: the most frequent unigram occurs 15144 times, the most frequent bigram 672 times, and the most frequent trigram 91 times. At the same time, a longer n-gram carries more context, so it narrows down the possible next word for a sequence. The frequency of occurrence of n-grams can therefore be used to determine the next word in a sequence. The next step is to develop a Shiny application for the prediction algorithm, which will use n-gram frequency matrices to find associations between words and n-grams and to predict the next word.
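As a rough illustration of how n-gram frequencies could drive next-word prediction, the sketch below looks up the last two words in a trigram table and falls back to the last word in a bigram table when no trigram matches. This is one possible approach under stated assumptions, not the planned Shiny implementation; the helper names `best_completion` and `predict_next` are hypothetical, and the two small tables reuse a few counts from the tables in this report.

```r
# Toy frequency tables built from a few counts reported above
bigrams <- data.frame(
  word = c("i think", "i know", "i love", "right now", "new york"),
  freq = c(672, 486, 473, 251, 201),
  stringsAsFactors = FALSE
)
trigrams <- data.frame(
  word = c("i think i", "i think im", "i feel like", "new york city", "right now i"),
  freq = c(91, 25, 52, 28, 25),
  stringsAsFactors = FALSE
)

# Hypothetical helper: last word of the most frequent n-gram starting with `prefix`.
best_completion <- function(prefix, freq_table) {
  hits <- freq_table[startsWith(freq_table$word, paste0(prefix, " ")), ]
  if (nrow(hits) == 0) return(NA_character_)
  top <- hits$word[which.max(hits$freq)]
  sub("^.* ", "", top)  # keep only the final word
}

# Hypothetical helper: try trigrams first, then back off to bigrams.
predict_next <- function(input, bigrams, trigrams) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  k <- length(words)
  if (k >= 2) {
    guess <- best_completion(paste(words[(k - 1):k], collapse = " "), trigrams)
    if (!is.na(guess)) return(guess)
  }
  if (k >= 1) return(best_completion(words[k], bigrams))
  NA_character_
}

predict_next("I think", bigrams, trigrams)      # "i"
predict_next("in new york", bigrams, trigrams)  # "city"
predict_next("hello new", bigrams, trigrams)    # "york" (backs off to bigrams)
```

Backing off to a shorter n-gram trades context for coverage: trigrams are more specific but rarer, so many inputs will only match at the bigram level.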