https://rpubs.com/SatKat_2020/625734

Introduction

Large databases comprising text in a target language are commonly used when generating language models for various purposes. The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare to build our first linguistic models. Tasks to accomplish:

1. Getting and cleaning the data
2. Getting a general summary and the unique words of the files
3. Finding foreign-language words
4. Word coverage analysis for the dictionary
5. Profanity filtering
6. Tokenization: build a basic n-gram model
7. Exploratory analysis

Data Acquisition

The data was downloaded from the site provided by Coursera: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip The data comes from a corpus called HC Corpora. For this exercise we use the English database.
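A minimal sketch of the download step (run once); the destination file name and the location of the English files inside the archive are assumptions:

# Download and unpack the SwiftKey data set (destination path is an assumption)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip")
    unzip("Coursera-SwiftKey.zip")
}
list.files("final/en_US")  # English files (assumed location inside the zip)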

Getting a general overview of the files

File Stats

##   FileName FileSize   Lines LinesNEmpty     Chars LMax.Chars LMean.Chars
## 1  twitter 159.3641 2360148     2360148 162096241        140    68.68054
## 2    blogs 200.4242  899288      899288 206824382      40833   229.98695
## 3     news 196.2775 1010242     1010242 203223154      11384   201.16285
##       Words
## 1 134082806
## 2 170389539
## 3 169860866
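A sketch of how such per-file statistics could be computed; the file paths, the use of the stringi package, and reporting FileSize in MB are assumptions:

# Per-file statistics: size, line counts, character counts, word counts (a sketch)
library(stringi)
fileStats <- function(path) {
    lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
    chars <- nchar(lines)
    data.frame(FileName    = basename(path),
               FileSize    = file.size(path) / 1024^2,  # size in MB
               Lines       = length(lines),
               LinesNEmpty = sum(chars > 0),
               Chars       = sum(chars),
               LMax.Chars  = max(chars),
               LMean.Chars = mean(chars),
               Words       = sum(stri_count_words(lines)))
}
# e.g. do.call(rbind, lapply(list.files("final/en_US", full.names = TRUE), fileStats))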

Sampling the data

To reiterate, I do not need to use all of the data to build models. Often a relatively small number of randomly selected rows or chunks is enough to get an accurate approximation of the results that would be obtained using all the data.
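A minimal sampling sketch, assuming the full files have already been read into the character vectors twt, blg and nws, that 20,000 lines are drawn from each, and that the combined sample Scomb used later is built this way:

# Draw a reproducible 20,000-line sample from each source (vector names are assumptions)
set.seed(1234)
ST <- sample(twt, 20000)   # sampled Twitter lines
SB <- sample(blg, 20000)   # sampled blog lines
SN <- sample(nws, 20000)   # sampled news lines
Scomb <- c(ST, SB, SN)     # combined sample used for the corpus below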

Sample Stats

##   FileName Lines   Chars LMax.Chars LMin.Chars LMean.Chars   Words
## 1       ST 20000 1367016        140          4     68.3508 1130682
## 2       SB 20000 4541126       4265          2    227.0565 3742891
## 3       SN 20000 3983561       2900          2    199.1781 3330133

Determining total Unique Words and Foreign Language words

Plotting the unique words
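No code is shown for this step; a rough sketch of how the total and unique word counts behind the plot could be obtained (whitespace tokenisation and the ST/SB/SN sample names are assumptions):

# Count total and unique whitespace-delimited tokens in each sample (a sketch)
uniqueWords <- function(x) {
    tokens <- unlist(strsplit(tolower(x), "\\s+"))
    tokens <- tokens[tokens != ""]
    c(Total = length(tokens), Unique = length(unique(tokens)))
}
sapply(list(ST = ST, SB = SB, SN = SN), uniqueWords)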

Removed extra variables to free up memory.

Cleaning data to build term document matrix

Removing numbers, digits, punctuation, extra strings, whitespace, and any profanity or other words that are not required for the prediction model. Note that the data can contain offensive and profane words, so we want to remove profane words that we do not want to predict. For this, a text file of defined profanity words is downloaded from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt and stored in the working directory.

# Stopword or profanity removal
Path.P<- "~/Desktop/DS notes/NLP capstone/DS capstone/DS capstone/bad-words.txt"
profanity <- as.vector(readLines(Path.P))

# some extra unwanted words
extra <- c( "rt","re","ve","lol","em","im","gr","en","el", "st", "u.s", "p.m", "a.m", "mr", "dr", "ll", "ur", "omg", "co", "oh", "ha", "haha", "ha", "la",letters)

# clear non-English words and characters
Scorpus <- gsub(pattern = "\\W+", " ", Scomb) # 15.3 Mb
Scorpus <- sapply(Scorpus, function(row) iconv(row, "latin1", "ASCII", sub = ""))

# making a sample corpus with the tm package
library(tm)
Scorpus <- VCorpus(VectorSource(Scorpus),
                   readerControl = list(reader = readPlain, language = "en")) # 3.1 Mb

# remove numbers
Scorpus <- tm_map(Scorpus, removeNumbers)
# convert letter to lower case
Scorpus <- tm_map(Scorpus, content_transformer(tolower))
# remove punctuation marks
Scorpus <- tm_map(Scorpus, removePunctuation)
# remove commonly occurring words not useful for prediction
Scorpus <- tm_map(Scorpus, removeWords, stopwords("english"))
# remove potentially offensive words
Scorpus <- tm_map(Scorpus, removeWords, profanity)
Scorpus <- tm_map(Scorpus, removeWords, extra)
# remove extra white space between words, leaving only one space
Scorpus <- tm_map(Scorpus, stripWhitespace)

rm("profanity","extra","Scomb")

Inspecting Scorpus for character counts after cleaning: Twitter content: 872974 chars; blogs content: 2831592 chars; news content: 2689773 chars.
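These counts can be recovered directly from the cleaned corpus, for example (a sketch; it assumes the corpus holds the documents in Twitter/blogs/news order):

# Character count of each cleaned document in the corpus (document order is assumed)
sapply(Scorpus, function(doc) sum(nchar(content(doc))))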

Build term document matrix and n-grams

# 1. creating uni-grams
tdm <- TermDocumentMatrix(Scorpus) # creating term document matrix # 4 Mb
f_tdm <- findFreqTerms(tdm, lowfreq = 20) # keep only terms with frequency >= 20 (drops sparse terms)
#tdm_sparse <- removeSparseTerms(tdm, sparse=0.97)
f_tdm1 <- sort(rowSums(as.matrix(tdm[f_tdm, ])),decreasing = TRUE)
tdm_df<- data.frame(word=names(f_tdm1), frequency=f_tdm1)
# dim(tdm_df)# 3785 obs.2 vars
summary(tdm_df)
##         word        frequency     
##  aaron    :   1   Min.   :  20.0  
##  abandoned:   1   1st Qu.:  28.0  
##  abbey    :   1   Median :  47.0  
##  abc      :   1   Mean   : 115.9  
##  abilities:   1   3rd Qu.: 101.0  
##  ability  :   1   Max.   :6026.0  
##  (Other)  :6443

creating bi-grams and tri-grams

# creating bi-grams and tri-grams
# 2. create term document matrix containing 2-grams and inspect it
biGrams <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
btdm <- TermDocumentMatrix(Scorpus, control = list(tokenize = biGrams))# 35.2 Mb
f_btdm <- findFreqTerms(btdm, lowfreq = 10) # keep only bi-grams with frequency >= 10 (drops sparse terms)
# alternative method
# btdm_sparse <- removeSparseTerms(btdm, sparse=0.98)
f_btdm1 <- sort(rowSums(as.matrix(btdm[f_btdm,])),decreasing = TRUE)
btdm_df<- data.frame(word=names(f_btdm1), frequency=f_btdm1)
#summary(btdm_df)

# 3. create term document matrix containing 3-grams and inspect it
triGrams <- function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
ttdm <- TermDocumentMatrix(Scorpus, control = list(tokenize = triGrams))# 45.3 Mb
f_ttdm <- findFreqTerms(ttdm, lowfreq = 5) # keep only tri-grams with frequency >= 5 (drops sparse terms)
#ttdm_sparse <- removeSparseTerms(ttdm, sparse=0.98)
f_ttdm1 <- sort(rowSums(as.matrix(ttdm[f_ttdm,])),decreasing = TRUE)
ttdm_df<- data.frame(word=names(f_ttdm1), frequency=f_ttdm1)
#summary(ttdm_df)
rm("tdm","btdm","ttdm","Scorpus")
# combining n-grams for top 80 frequent terms
combined_DF<-cbind(tdm_df[1:80,],btdm_df[1:80,],ttdm_df[1:80,])
head(combined_DF,20)

Plotting n-grams to see the most frequent terms
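The plot objects g1, g2 and g3 are built in a hidden chunk; a representative sketch of how the uni-gram panel g1 could be constructed (g2 and g3 would follow the same pattern on btdm_df and ttdm_df):

# Bar chart of the 20 most frequent uni-grams (a sketch of how g1 might be built)
library(ggplot2)
g1 <- ggplot(head(tdm_df, 20), aes(x = reorder(word, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = "Top 20 uni-grams", x = "Term", y = "Frequency")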

gridExtra::grid.arrange(g1,g2,g3, ncol= 3)

Building word clouds for the n-grams
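No code is shown for the word clouds; a minimal sketch using the wordcloud package for the uni-grams (the colour palette and term limit are assumptions):

# Word cloud of the most frequent uni-grams (settings are assumptions)
library(wordcloud)
library(RColorBrewer)
set.seed(42)
wordcloud(words = tdm_df$word, freq = tdm_df$frequency,
          max.words = 100, random.order = FALSE,
          colors = brewer.pal(8, "Dark2"))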

Word Coverage

Determining the number of unique words needed in a frequency-sorted dictionary to cover 50%, 70%, and 90% of all word instances in the language, and how to increase the coverage of the corpus by using fewer words.

# Key words needed to cover 50%, 70% and 90% of all word instances
total_freq <- sum(tdm_df$frequency)       # total term frequency
fifty_freq <- ceiling(total_freq*0.5)     # frequency mass needed for 50% coverage
seventy_freq <- ceiling(total_freq*0.7)   # frequency mass needed for 70% coverage
ninty_freq <- ceiling(total_freq*0.9)     # frequency mass needed for 90% coverage
tdm_df <- tdm_df %>% mutate(cum_freq = cumsum(frequency)) # cumulative frequency of each term (terms sorted by decreasing frequency)
head(tdm_df)
##   word frequency cum_freq
## 1 said      6026     6026
## 2 will      5370    11396
## 3  one      5295    16691
## 4  can      5067    21758
## 5 just      4478    26236
## 6 like      4244    30480
cutoff_50<-min(which(tdm_df$cum_freq >= fifty_freq)) # 445L
cutoff_70<-min(which(tdm_df$cum_freq >= seventy_freq)) # 1068L
cutoff_90<-min(which(tdm_df$cum_freq >= ninty_freq)) # 2439L

# Addressing how to increase coverage: identifying words that may not be in the corpora, or using a smaller number of dictionary words to cover the same number of phrases
total <- max(tdm_df$cum_freq)
tdm_df <- tdm_df %>%
         mutate(Coverage = cum_freq/total) %>%
         mutate(`Number of Words` = 1:nrow(tdm_df))

Plotting the word coverage

ggplot(tdm_df, aes(x = `Number of Words`, y = Coverage)) +
  labs(title = "Uni-grams") +
  geom_line() +
  geom_hline(yintercept = c(0.5, 0.7, 0.9), color = c("red", "green", "blue"), linetype = "dashed")

From the graph above, and the cutoffs computed earlier, around 450, 1,100, and 2,450 unique words are needed in a frequency-sorted dictionary to cover 50%, 70%, and 90% of all word instances, respectively. Word coverage ratios were estimated in the same way for the bi-grams and tri-grams.
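A sketch of the same coverage recipe applied to the bi-gram table:

# Coverage calculation applied to the bi-gram frequency table (a sketch)
library(dplyr)
btdm_df <- btdm_df %>%
    mutate(cum_freq = cumsum(frequency),
           Coverage = cum_freq / sum(frequency),
           `Number of Words` = row_number())
min(which(btdm_df$Coverage >= 0.5))  # number of bi-grams needed for 50% coverage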

Interesting findings in the dataset

I observed many empty lines in the Twitter text compared with the blogs and news, which were a little more compact, so I read the lines with skipNul = TRUE. The data also contain many non-English words that need to be removed; reading with UTF-8 encoding and converting to ASCII is required to handle Unicode characters and keep the text English. The data is quite large, so I checked the time and memory consumed by the major text-processing steps: loading, cleaning, building the document-term matrices, and finally plotting the word clouds. Since I used only 20,000 lines per source for this exercise, I need to balance memory consumption against acceptable performance when running the final modelling program. I also manually selected some stopwords that are not required in the text prediction model.
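The reading step described above might look like this (a sketch; the file path and the choice of source encoding are assumptions):

# Read a raw file while skipping embedded nulls, then strip non-ASCII characters
con <- file("final/en_US/en_US.twitter.txt", open = "rb")   # path is an assumption
twt <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
twt <- iconv(twt, from = "UTF-8", to = "ASCII", sub = "")    # drop non-ASCII characters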

Future Plans

1. Since data cleaning and text mining are an iterative process, my next step will be to evaluate better-refined clean-up methods, including removal of stopwords and profanity, encountered while handling larger training sets.
2. Build and deploy different prediction algorithms and examine their performance on wider samples.
3. Optimize the run time of the best prediction model on the training set.
4. Develop a Shiny app with a simple user interface that accurately predicts the next word based on a word or phrase entered by the user.
5. Modify the code and resolve problems encountered while running the model.