Description

SwiftKey is a natural language processing (NLP) project that serves as the final capstone of the Coursera Data Science Specialization.

Introduction and Summary

This concise Markdown document describes, step by step, how this NLP project was handled. It is the first milestone report and shows the progress made on the project so far.

The motivation for this project is to:

  1. Demonstrate that you’ve downloaded the data and have successfully loaded it in.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that you amassed so far.
  4. Get feedback on your plans for creating a prediction algorithm and Shiny app.

The tasks below concentrate mostly on the milestone goals and on Task 2 of week 2:

  • Tokenization - identifying appropriate tokens such as words, punctuation, and numbers. Writing a function that takes a file as input and returns a tokenized version of it.
  • Profanity filtering - removing profanity and other words you do not want to predict.
  • Does the link lead to an HTML page describing the exploratory analysis of the training data set?
  • Has the data scientist done basic summaries of the three files? Word counts, line counts and basic data tables?
  • Has the data scientist made basic plots, such as histograms to illustrate features of the data?
  • Was the report written in a brief, concise style, in a way that a non-data scientist manager could appreciate?
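
The tokenization and profanity-filtering tasks listed above can be sketched in base R as follows. This is only a minimal illustration, not the pipeline used later in this report: the helper names `tokenize_lines`, `tokenize_file` and `filter_profanity` are hypothetical, and the block list passed to `banned` is a stand-in for whatever word list one chooses.

```r
# Hypothetical helper: split lines of text into lowercase word tokens.
tokenize_lines <- function(lines) {
  lines <- tolower(lines)
  # Keep letters, digits and apostrophes; everything else acts as a separator.
  tokens <- unlist(strsplit(lines, "[^a-z0-9']+"))
  tokens[nzchar(tokens)]
}

# Hypothetical helper: read a text file and return its tokens.
tokenize_file <- function(path) {
  tokenize_lines(readLines(path, encoding = "UTF-8", warn = FALSE))
}

# Hypothetical helper: drop tokens that appear in a user-supplied block list.
filter_profanity <- function(tokens, banned) {
  tokens[!tokens %in% tolower(banned)]
}

tokens <- tokenize_lines(c("We love you, Mr. Brown!", "It's 3 darn cold days."))
clean  <- filter_profanity(tokens, banned = c("darn"))
print(clean)
```

Keeping the apostrophe inside tokens is a design choice: it leaves contractions such as "it's" whole instead of splitting them into fragments.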

Loading the Data and Data Description

Course dataset

This is the training data to get you started that will be the basis for most of the capstone. You must download the data from the link below and not from external websites to start.

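path1 <- "C:/Public/nlp"

# Unzip every archive in the data folder. full.names = TRUE returns complete
# paths, so the loop works regardless of the current working directory.
for (i in list.files(path = path1, pattern = "\\.zip$", full.names = TRUE)) {
  unzip(i, exdir = path1, overwrite = TRUE)
}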

Getting the file names and the connections used to read the text files

path1 <- "C:/Public/nlp"
filenames_n <- character(0)
for (j in list.dirs(path = path1)) {
  # Locale folders are named like "en_US": two parts separated by "_"
  if (length(strsplit(basename(j), "_")[[1]]) == 2) {
    for (k in list.files(path = j, pattern = "\\.txt$")) {
      # Store the file name without its ".txt" extension
      filenames_n <- c(filenames_n, sub("\\.txt$", "", k))
    }
  }
}

Getting the Data that is of Concern

  • Loading the data in. This dataset is fairly large. We emphasize that you don’t necessarily need to load the entire dataset in to build your algorithms (see point 2 below). At least initially, you might want to use a smaller subset of the data.
  • Sampling. To reiterate, to build models you don’t need to load in and use all of the data.
library(stringi)
countern <- 0
filesize <- c()
for (j in list.dirs(path = path1)) {
  parts <- strsplit(basename(j), "_")[[1]]
  # Only process the English locale folder (e.g. "en_US")
  if (length(parts) == 2 && parts[1] == "en") {
    for (k in list.files(path = j, pattern = "\\.txt$")) {
      countern <- countern + 1
      pathtemp <- file.path(j, k)
      filesize[countern] <- file.size(pathtemp)
      # Name each object after its file, e.g. "en_US.blogs.txt"
      nam <- basename(pathtemp)
      conntemp <- file(pathtemp, open = "r")
      lines <- readLines(conntemp)
      close(conntemp)
      assign(nam, lines)
    }
  }
}
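
The sampling step mentioned in the bullets above is not shown in the code, so here is a minimal sketch of how a 3% sample (the fraction quoted later in this report) could be drawn. The helper name `sample_lines`, the seed and the toy input are assumptions made for illustration only.

```r
# Hypothetical helper: keep a random fraction of the lines of one file.
sample_lines <- function(lines, fraction = 0.03, seed = 1234) {
  set.seed(seed)                                   # make the sample reproducible
  size <- max(1, round(fraction * length(lines)))  # always keep at least one line
  keep <- sample.int(length(lines), size = size)
  lines[sort(keep)]                                # preserve the original line order
}

demo_lines  <- paste("line", 1:1000)   # stand-in for one of the loaded files
demo_sample <- sample_lines(demo_lines, fraction = 0.03)
length(demo_sample)

# Assumption: a sampled vector like this could then be wrapped for the "tm"
# package with tm::VCorpus(tm::VectorSource(demo_sample)).
```

Sorting the kept indices is optional; it only keeps the sampled lines in their original order, which makes spot checks against the raw file easier.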

Data Cleaning and representation of the data

Analyzing the corpus and cleaning it with the “tm” package

  • Checking for punctuation (periods, commas, hyphens, etc.) and removing it
  • Converting the entire document to lower case
  • Removing stopwords (extremely common words such as “and”, “or”, “not”, “in”, “is” etc)
    • This choice matters, because removing stopwords changes which n-grams are formed
    • It is up to us whether to take them out or not; I decided to take them out
  • Removing numbers
  • Filtering out unwanted terms and weird characters
  • Removing extra whitespace
library(tm)
# How to view a document stored in VCorpus format
lapply(c(1:3), function(x) {
 strwrap(corpusblogs[[x]]) 
})
[[1]]
[1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “godsâ€\u009d."

[[2]]
[1] "We love you Mr. Brown."

[[3]]
[1] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox" 
[2] "together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the"  
[3] "money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had"
[4] "enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by"       
[5] "letting her switch out the characters! She loves it almost as much as him."                                                                                  
# Convert to lower case
corpusblogs_proc1 <- tm_map(corpusblogs, content_transformer(tolower))
corpususnews_proc1 <- tm_map(corpususnews, content_transformer(tolower))
corpustwitter_proc1 <- tm_map(corpustwitter, content_transformer(tolower))
# Remove full stops, commas and other punctuation
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, removePunctuation)
corpususnews_proc1 <- tm_map(corpususnews_proc1, removePunctuation)
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, removePunctuation)
#Removing Whitespace
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, stripWhitespace) 
corpususnews_proc1 <- tm_map(corpususnews_proc1, stripWhitespace)
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, stripWhitespace)
#Removing Numbers
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, removeNumbers) 
corpususnews_proc1 <- tm_map(corpususnews_proc1, removeNumbers)
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, removeNumbers)
# Replace mis-encoded character sequences with spaces, then drop any remaining non-ASCII characters
toSpace <- content_transformer(function (x , pattern) gsub(pattern, " ", x))
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, toSpace, "â€") 
corpususnews_proc1 <- tm_map(corpususnews_proc1, toSpace, "â€")
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, toSpace, "â€")
toNormal <- content_transformer(function (x) iconv(x, "latin1", "ASCII", sub=""))
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, toNormal) 
corpususnews_proc1 <- tm_map(corpususnews_proc1, toNormal)
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, toNormal)
# Removing English stop words is optional
corpusblogs_proc1 <- tm_map(corpusblogs_proc1, removeWords, stopwords("english")) 
corpususnews_proc1 <- tm_map(corpususnews_proc1, removeWords, stopwords("english"))
corpustwitter_proc1 <- tm_map(corpustwitter_proc1, removeWords, stopwords("english"))
#strwrap(corpusblogs_proc1[[1]])
#iconv(strwrap(corpusblogs_proc1[[1]]), "latin1", "ASCII", sub = "")  

Build a data table for the text (an obvious choice, since data tables are highly optimized for handling large data)

library(data.table)
library(dplyr)
library(tidytext)
library(ggplot2)
# Blogs
corpusblogs_proc1.dt <- data.table(text = sapply(corpusblogs_proc1, identity), stringsAsFactors = FALSE)
corpusblogs_proc1.dt.tidy <- corpusblogs_proc1.dt %>% unnest_tokens(word, text)
corpusblogs_proc1.dt.tidy %>% count(word, sort = TRUE) %>%
  filter(n > 1500) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab("Words") +
  coord_flip() +
  ggtitle("Word Count - BLOGS")

#usnews
corpususnews_proc1.dt <- data.table(text = sapply(corpususnews_proc1, identity), stringsAsFactors = FALSE)
corpususnews_proc1.dt.tidy<-corpususnews_proc1.dt %>% unnest_tokens(word, text)
corpususnews_proc1.dt.tidy %>% count(word, sort=TRUE) %>%
  filter(n>75) %>%
  mutate(word = reorder(word, n)) %>%
ggplot(aes(word, n)) +
  geom_col()+
  xlab("Words")+
  coord_flip()+
  ggtitle("Word Count - US NEWS")

#Twitter
corpustwitter_proc1.dt <- data.table(text = sapply(corpustwitter_proc1, identity), stringsAsFactors = FALSE)
corpustwitter_proc1.dt.tidy<-corpustwitter_proc1.dt %>% unnest_tokens(word, text)
corpustwitter_proc1.dt.tidy %>% count(word, sort=TRUE) %>%
  filter(n>2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) + 
  geom_col()+
  xlab("Words")+
  coord_flip()+
  ggtitle("Word Count - Twitter")

Tokenizing based on Adjacent words

  • A consecutive sequence of n words is called an “n-gram”
  • Bigrams (n = 2) are counted here
  • Many bigrams are built from contractions such as “dont” and “wont”, i.e. words that end in “t”; these need to be removed.
library(dplyr)
library(tidytext)
library(tidyr)
l <- list(corpusblogs_proc1.dt, corpususnews_proc1.dt, corpustwitter_proc1.dt)
comb.dt<-rbindlist(l)
comb.dt.bigram <-  comb.dt %>% unnest_tokens(bigram, text, token="ngrams", n=2)
# Remove stop words and bigrams whose second word is a lone "t"
comb.dt.bigram_separated<-comb.dt.bigram %>% separate(bigram, c("word1","word2"), sep=" ")
comb.dt.bigram_filtered <- comb.dt.bigram_separated %>%
  filter(!word1 %in% stop_words$word) %>%
  filter(!word2 %in% stop_words$word)
comb.dt.bigram_filtered <- comb.dt.bigram_filtered %>%
  filter(!word2 %in% "t")
comb.dt.bigram_united <- comb.dt.bigram_filtered %>%
  unite(bigram, word1, word2, sep=" ")
comb.dt.bigram_united %>% count(bigram, sort=TRUE) %>%
  filter(n>100) %>%
  mutate(bigram = reorder(bigram, n)) %>%
  ggplot(aes(bigram, n)) +
  geom_col()+
  xlab("bigram")+
  coord_flip() +
  ggtitle("Bigram Count - Combined Dataset")


Trigram Frequency Determination

  • In this case we keep all the stop words and see how the results look
  • A word cloud and the frequency distribution are produced for the trigrams
library(wordcloud)
comb.dt.trigram <- comb.dt %>% unnest_tokens(trigram, text, token="ngrams", n=3)
set.seed(1234)
comb.dt.final <- comb.dt.trigram %>% count(trigram , sort=TRUE)
wordcloud(words = comb.dt.final$trigram, freq = comb.dt.final$n, max.words = 100, colors = brewer.pal(6,"Dark2")) 

# plot of frequencies 
comb.dt.final %>%
  filter(n > 50) %>%
  # Order the bars by count, as in the earlier plots
  mutate(trigram = reorder(trigram, n)) %>%
  ggplot(aes(trigram, n)) +
  geom_col() +
  xlab("trigram") +
  coord_flip() +
  ggtitle("Trigram Count - Combined Dataset")

Plotting the Top 10 Unigrams, Bigrams and Trigrams for the Blogs, News and Twitter Datasets in One Plot

  • In this case I did not do much filtering, as I wanted to check the raw top 10
data.blogs = corpusblogs_proc1.dt.tidy %>% count(word, sort=TRUE) 
data.news = corpususnews_proc1.dt.tidy %>% count(word, sort=TRUE)
data.twitter = corpustwitter_proc1.dt.tidy %>% count(word, sort=TRUE)
ggblog1 <- ggplot(data = head(data.blogs,10)) + geom_bar(aes(x=word, y=n), stat="identity") + coord_flip() + xlab("Unigram Blog Words")
ggnews1 <- ggplot(data = head(data.news,10)) + geom_bar(aes(x=word, y=n), stat="identity") + coord_flip() + xlab("Unigram news Words")
ggtwitter1 <- ggplot(data = head(data.twitter,10)) + geom_bar(aes(x=word, y=n), stat="identity") + coord_flip() + xlab("Unigram twitter Words")
list1 <- list(ggblog1,ggnews1,ggtwitter1)
data.blogs.bigram <-  corpusblogs_proc1.dt %>% unnest_tokens(bigram, text, token="ngrams", n=2) %>% count(bigram, sort=TRUE)
data.news.bigram <-  corpususnews_proc1.dt %>% unnest_tokens(bigram, text, token="ngrams", n=2) %>% count(bigram, sort=TRUE)
data.twitter.bigram <-  corpustwitter_proc1.dt %>% unnest_tokens(bigram, text, token="ngrams", n=2) %>% count(bigram, sort=TRUE)
ggblog2 <- ggplot(data = head(data.blogs.bigram,10)) + geom_bar(aes(x=bigram, y=n), stat="identity") + coord_flip() + xlab("Bigram Blog Words")
ggnews2 <- ggplot(data = head(data.news.bigram,10)) + geom_bar(aes(x=bigram, y=n), stat="identity") + coord_flip() + xlab("Bigram news Words")
ggtwitter2 <- ggplot(data = head(data.twitter.bigram,10)) + geom_bar(aes(x=bigram, y=n), stat="identity") + coord_flip() + xlab("Bigram twitter Words")
list2 <- list(ggblog2,ggnews2,ggtwitter2)
data.blogs.trigram <-  corpusblogs_proc1.dt %>% unnest_tokens(trigram, text, token="ngrams", n=3) %>% count(trigram, sort=TRUE)
data.news.trigram <-  corpususnews_proc1.dt %>% unnest_tokens(trigram, text, token="ngrams", n=3) %>% count(trigram, sort=TRUE)
data.twitter.trigram <-  corpustwitter_proc1.dt %>% unnest_tokens(trigram, text, token="ngrams", n=3) %>% count(trigram, sort=TRUE)
ggblog3 <- ggplot(data = head(data.blogs.trigram,10)) + geom_bar(aes(x=trigram, y=n), stat="identity") + coord_flip() + xlab("Trigram Blog Words")
ggnews3 <- ggplot(data = head(data.news.trigram,10)) + geom_bar(aes(x=trigram, y=n), stat="identity") + coord_flip() + xlab("Trigram news Words")
ggtwitter3 <- ggplot(data = head(data.twitter.trigram,10)) + geom_bar(aes(x=trigram, y=n), stat="identity") + coord_flip() + xlab("Trigram twitter Words")
list3 <- list(ggblog3,ggnews3,ggtwitter3)
library(grid)
library(gridExtra)
grid.arrange(grobs = c(list1,list2,list3),ncol = 3, as.table = FALSE)

How many unique words do you need in a frequency sorted dictionary to cover 50% of all word instances in the language? 90%?

comb.dt.freqdict<-comb.dt %>% unnest_tokens(word, text) %>% count(word, sort=TRUE)
sumofwords <- sum(comb.dt.freqdict$n)
print(paste("The total number of words present here", sumofwords))
[1] "The total number of words present here 2029277"
sumcounter<<-0
counter<<-0
valtemp<-lapply(comb.dt.freqdict$n, function(x){
  if (sumcounter<=0.5*sumofwords) {
   sumcounter<<-sumcounter+x
   counter<<-counter+1
  }
})
print(paste("The number of unique words required to cover 50% of all words in case of a sample size of 3% = ", counter))
[1] "The number of unique words required to cover 50% of all words in case of a sample size of 3% =  1130"
sumcounter<<-0
counter<<-0
valtemp<-lapply(comb.dt.freqdict$n, function(x){
  if (sumcounter<=0.9*sumofwords) {
   sumcounter<<-sumcounter+x
   counter<<-counter+1
  }
})
print(paste("The number of unique words required to cover 90% of all words in case of a sample size of 3% = ", counter))
[1] "The number of unique words required to cover 90% of all words in case of a sample size of 3% =  15052"
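
The same coverage question can also be answered with a cumulative sum instead of a loop with global counters. The sketch below is an alternative formulation, not the code that produced the numbers above; `words_for_coverage` is a hypothetical helper and `toy_counts` is made-up data. It may differ from the loop by one word when the cumulative count lands exactly on the target.

```r
# Hypothetical helper: number of top-ranked words needed to reach a coverage target.
# `counts` must be sorted in decreasing order, as count(word, sort = TRUE) returns them.
words_for_coverage <- function(counts, target) {
  cumulative <- cumsum(counts) / sum(counts)   # share of all word instances covered so far
  which(cumulative >= target)[1]               # first rank at which the target is reached
}

toy_counts <- c(50, 20, 10, 10, 5, 5)   # toy frequencies, not the report's data
words_for_coverage(toy_counts, 0.5)
words_for_coverage(toy_counts, 0.85)
```

With the real data the call would be `words_for_coverage(comb.dt.freqdict$n, 0.5)`, assuming `comb.dt.freqdict` is built as shown above.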

How do you evaluate how many of the words come from foreign languages?

  • I have not dealt with foreign-language words yet. The frequency distributions of the English words suggest that such words are rare in the sample; if present, they would need to be checked against a hash map or dictionary database of some sort.
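
A dictionary lookup of the kind described above could look like the following sketch. The helper `share_outside_dictionary` and both toy vectors are hypothetical; a real run would need a full English word list, which this report does not provide. The result is only an upper bound on the share of foreign words, because typos, names and slang are also missing from a dictionary.

```r
# Hypothetical helper: share of tokens that are missing from a reference word list.
share_outside_dictionary <- function(tokens, dictionary) {
  mean(!tolower(tokens) %in% tolower(dictionary))
}

# Toy inputs for illustration only.
toy_dictionary <- c("we", "love", "the", "kids", "and", "game")
toy_tokens     <- c("we", "love", "the", "kinder", "und", "game")
share_outside_dictionary(toy_tokens, toy_dictionary)
```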

Can you think of a way to increase the coverage – identifying words that may not be in the corpora or using a smaller number of words in the dictionary to cover the same number of phrases?

  • This is an interesting question that involves predicting from a smaller sample. We can keep the higher-order n-grams that occur frequently and remove the low-frequency ones to predict which words are more likely to appear in a larger population.

Summary of Findings

  • Zipf’s law: the frequency with which a word appears is inversely proportional to the rank of the word
  • Initially I planned to build a term-document matrix (TDM) from the corpus data, but its size made it impossible to process, so I decided to go with data tables instead, which, as expected, handle large datasets well
  • Even after a lot of filtering, some words such as won’t, don’t and can’t still need to be filtered out of the bigrams and trigrams, as they produce misleading frequency distributions
  • Sampling only 3% of the data seems low, so I will increase my sample size to 10% in the coming runs of the project
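
One rough way to check the Zipf’s law observation is to regress log frequency on log rank: if frequency is inversely proportional to rank, the slope should be close to -1. The sketch below uses synthetic counts, not the report's data, and `zipf_slope` is a hypothetical helper.

```r
# Hypothetical helper: slope of log(frequency) against log(rank).
zipf_slope <- function(counts) {
  counts <- sort(counts, decreasing = TRUE)
  ranks  <- seq_along(counts)
  fit    <- lm(log(counts) ~ log(ranks))   # straight-line fit on the log-log scale
  unname(coef(fit)[2])
}

toy_counts <- 1000 / (1:200)   # synthetic counts that follow Zipf's law exactly
zipf_slope(toy_counts)
```

For the real data the input would be the `n` column of a frequency table such as `comb.dt.freqdict`; a slope that departs noticeably from -1 would indicate that the sample deviates from the idealized law.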

Future Feedback

  • I plan to use a prediction algorithm; there is discussion of Katz backoff (KBO). Essentially, we need to predict the next word in a sequence of words.
    • A Bayesian approach to the prediction problem is also possible, placing probabilities on each word of an n-gram to predict the (n+1)th word
    • There should be other approaches similar to KBO, which I plan to study and implement as required
  • I would also try changing my sample size to see how that changes my frequency distributions
  • A lot of research (reading papers, tutorials and videos) still needs to be done between Task 2 and Task 3 to reach a more thorough understanding of a better approach
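
To make the backoff idea concrete, here is a deliberately simplified sketch in base R. It is not Katz backoff: it applies no discounting and redistributes no probability mass; it only picks the most frequent continuation of the longest context that has been observed, and falls back to shorter contexts otherwise. The helper names `ngram_counts` and `predict_next` and the toy token vector are hypothetical.

```r
# Hypothetical helper: count n-grams of order n in a token vector.
ngram_counts <- function(tokens, n) {
  stopifnot(length(tokens) >= n)
  starts <- seq_len(length(tokens) - n + 1)
  grams  <- vapply(starts,
                   function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                   character(1))
  sort(table(grams), decreasing = TRUE)
}

# Hypothetical helper: most frequent continuation of `context` (a character vector),
# backing off to a shorter context when the longer one was never observed.
predict_next <- function(tokens, context) {
  # Contexts as long as the training text can never match, so trim them first.
  context <- tail(context, max(length(tokens) - 1, 0))
  while (length(context) > 0) {
    counts <- ngram_counts(tokens, length(context) + 1)
    prefix <- paste0(paste(context, collapse = " "), " ")
    hits   <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(as.vector(hits))]
      return(substring(best, nchar(prefix) + 1))
    }
    context <- context[-1]              # back off: drop the oldest word
  }
  names(ngram_counts(tokens, 1))[1]     # last resort: most frequent unigram
}

toy <- c("i", "love", "you", "so", "much", "and",
         "i", "love", "you", "too", "i", "love", "cake")
predict_next(toy, c("i", "love"))        # trigram context was observed
predict_next(toy, c("really", "love"))   # unseen trigram context, backs off to "love"
```

Recounting the n-grams on every call keeps the sketch short but would be far too slow on the real corpus; precomputed count tables, such as the data tables built earlier in this report, would be the natural replacement.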
