https://rpubs.com/SatKat_2020/625734
Large databases of text in a target language are commonly used when building language models for various purposes. The first step in building a predictive model for text is understanding the distribution of, and relationships between, the words, tokens, and phrases in the text. The goal of this task is to understand the basic relationships observed in the data and to prepare to build our first linguistic models. Tasks to accomplish:
1. Getting and cleaning the data
2. Getting a general summary and the unique words of the files
3. Finding words from foreign languages
4. Word coverage analysis for the dictionary
5. Profanity filtering
6. Tokenization: building basic n-gram models
7. Exploratory analysis
The data was downloaded from the Coursera-provided site: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip The data comes from a corpus called HC Corpora. For this exercise we use the English database.
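A minimal sketch of the download step might look like the following (the extracted folder layout final/en_US/ is taken from the zip; adjust paths as needed):
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
  download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
  unzip("Coursera-SwiftKey.zip")   # extracts final/en_US/ with the English text files
}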
File Stats (file sizes in MB)
##   FileName FileSize   Lines LinesNEmpty     Chars LMax.Chars LMean.Chars     Words
## 1  twitter 159.3641 2360148     2360148 162096241        140    68.68054 134082806
## 2    blogs 200.4242  899288      899288 206824382      40833   229.98695 170389539
## 3     news 196.2775 1010242     1010242 203223154      11384   201.16285 169860866
To reiterate, building the models does not require all of the data. A relatively small number of randomly selected lines or chunks is often enough to get an accurate approximation of the results that would be obtained using all the data.
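The sampling code itself is not shown here; the sketch below illustrates the idea, assuming the full files were already read with readLines() into character vectors named twitter, blogs, and news (these names, the seed, and the collapsing step are assumptions):
set.seed(1234)                                      # make the sample reproducible
sample_lines <- function(x, n = 20000) x[sample(seq_along(x), n)]
ST <- sample_lines(twitter)                         # sampled twitter lines
SB <- sample_lines(blogs)                           # sampled blog lines
SN <- sample_lines(news)                            # sampled news lines
# one collapsed document per source, matching the three-document corpus inspected later
Scomb <- c(paste(ST, collapse = " "),
           paste(SB, collapse = " "),
           paste(SN, collapse = " "))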
Sample Stats
## FileName Lines Chars LMax.Chars LMin.Chars LMean.Chars Words
## 1 ST 20000 1367016 140 4 68.3508 1130682
## 2 SB 20000 4541126 4265 2 227.0565 3742891
## 3 SN 20000 3983561 2900 2 199.1781 3330133
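A sketch of how per-sample statistics like these could be computed (word counts here use a simple whitespace split; the helper name is illustrative):
sample_stats <- function(x, name) {
  nch <- nchar(x)                                   # characters per line
  data.frame(FileName = name, Lines = length(x), Chars = sum(nch),
             LMax.Chars = max(nch), LMin.Chars = min(nch), LMean.Chars = mean(nch),
             Words = sum(lengths(strsplit(x, "\\s+"))))
}
rbind(sample_stats(ST, "ST"), sample_stats(SB, "SB"), sample_stats(SN, "SN"))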
Plotting the unique words in each sample file. Extra variables were then removed to free up memory.
Next I remove numbers, punctuation, extra strings, extra whitespace, and any profanity or other words that are not required for the prediction model. Note that the data can contain words of offensive and profane meaning, so we want to remove profane words that we do not want to predict. For this, a text file of defined profanity words is downloaded from https://www.cs.cmu.edu/~biglou/resources/bad-words.txt and stored in the working directory.
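If the profanity list is not already on disk, a minimal download sketch (assuming the working directory is writable):
bad_words_url <- "https://www.cs.cmu.edu/~biglou/resources/bad-words.txt"
if (!file.exists("bad-words.txt")) {
  download.file(bad_words_url, destfile = "bad-words.txt")
}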
library(tm) # text-mining framework used for the corpus cleaning below
# Stopword or profanity removal
Path.P<- "~/Desktop/DS notes/NLP capstone/DS capstone/DS capstone/bad-words.txt"
profanity <- as.vector(readLines(Path.P))
# some extra unwanted words
extra <- c( "rt","re","ve","lol","em","im","gr","en","el", "st", "u.s", "p.m", "a.m", "mr", "dr", "ll", "ur", "omg", "co", "oh", "ha", "haha", "ha", "la",letters)
# replace runs of non-word characters with a single space
Scorpus <- gsub(pattern = "\\W+", " ", Scomb) # 15.3 Mb
# drop non-ASCII characters to clear non-English words
Scorpus <- sapply(Scorpus, function(row) iconv(row, "latin1", "ASCII", sub = ""))
# making a sample corpus
Scorpus <- VCorpus(VectorSource(Scorpus), readerControl = list(reader = readPlain, language = "english")) # 3.1 Mb
# remove numbers
Scorpus <- tm_map(Scorpus, removeNumbers)
# convert letter to lower case
Scorpus <- tm_map(Scorpus, content_transformer(tolower))
# remove punctuation marks
Scorpus <- tm_map(Scorpus, removePunctuation)
# remove commonly occurring words not useful for prediction
Scorpus <- tm_map(Scorpus, removeWords, stopwords("english"))
# remove potentially offensive words
Scorpus <- tm_map(Scorpus, removeWords, profanity)
Scorpus <- tm_map(Scorpus, removeWords, extra)
# remove extra white space between words, leaving only one space
Scorpus <- tm_map(Scorpus, stripWhitespace)
rm("profanity","extra","Scomb")
Inspecting Scorpus for character counts after cleaning: twitter content: 872974 chars; blogs content: 2831592 chars; news content: 2689773 chars.
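A sketch of how these per-document character counts could be read off the cleaned corpus (the twitter/blogs/news document order is assumed):
sapply(Scorpus, function(doc) sum(nchar(content(doc)))) # characters remaining per document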
# 1.creating uni-gram
tdm <- TermDocumentMatrix(Scorpus) # creating term document matrix # 4 Mb
f_tdm <- findFreqTerms(tdm,lowfreq = 20) # keep terms occurring at least 20 times (drops sparse terms)
#tdm_sparse <- removeSparseTerms(tdm, sparse=0.97)
f_tdm1 <- sort(rowSums(as.matrix(tdm[f_tdm, ])),decreasing = TRUE)
tdm_df<- data.frame(word=names(f_tdm1), frequency=f_tdm1)
# dim(tdm_df)# 3785 obs.2 vars
summary(tdm_df)
## word frequency
## aaron : 1 Min. : 20.0
## abandoned: 1 1st Qu.: 28.0
## abbey : 1 Median : 47.0
## abc : 1 Mean : 115.9
## abilities: 1 3rd Qu.: 101.0
## ability : 1 Max. :6026.0
## (Other) :6443
# creating bi-grams and tri-grams
# 1. create term document matrix containing 2-grams and inspect it
biGrams <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
btdm <- TermDocumentMatrix(Scorpus, control = list(tokenize = biGrams))# 35.2 Mb
f_btdm <- findFreqTerms(btdm,lowfreq = 10) # keep bi-grams occurring at least 10 times (drops sparse terms)
# alternative method
# btdm_sparse <- removeSparseTerms(btdm, sparse=0.98)
f_btdm1 <- sort(rowSums(as.matrix(btdm[f_btdm,])),decreasing = TRUE)
btdm_df<- data.frame(word=names(f_btdm1), frequency=f_btdm1)
#summary(btdm_df)
# 2. create term document matrix containing 3-grams and inspect it
triGrams <- function(x) {
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
ttdm <- TermDocumentMatrix(Scorpus, control = list(tokenize = triGrams))# 45.3 Mb
f_ttdm <- findFreqTerms(ttdm,lowfreq = 5) # keep tri-grams occurring at least 5 times (drops sparse terms)
#ttdm_sparse <- removeSparseTerms(ttdm, sparse=0.98)
f_ttdm1 <- sort(rowSums(as.matrix(ttdm[f_ttdm,])),decreasing = TRUE)
ttdm_df<- data.frame(word=names(f_ttdm1), frequency=f_ttdm1)
#summary(ttdm_df)
rm("tdm","btdm","ttdm","Scorpus")
# combining n-grams for top 80 frequent terms
combined_DF<-cbind(tdm_df[1:80,],btdm_df[1:80,],ttdm_df[1:80,])
head(combined_DF,20)
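# g1, g2 and g3 are used below but not defined in the chunks shown above; the following
# is a sketch of how they could be built as ggplot2 bar charts of the top 20 terms per
# n-gram table (the helper name is an illustrative assumption)
library(ggplot2)
plot_top_terms <- function(df, title, n = 20) {
  top <- head(df[order(-df$frequency), ], n)
  ggplot(top, aes(x = reorder(word, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(title = title, x = NULL, y = "Frequency")
}
g1 <- plot_top_terms(tdm_df, "Top uni-grams")
g2 <- plot_top_terms(btdm_df, "Top bi-grams")
g3 <- plot_top_terms(ttdm_df, "Top tri-grams")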
gridExtra::grid.arrange(g1,g2,g3, ncol= 3)
Determining how many unique words are needed in a frequency-sorted dictionary to cover 50%, 70% and 90% of all word instances in the language, and how to increase the coverage of the corpus while using fewer words.
# Key Words to Cover 50%, 70% and 90% of All Instances
library(dplyr) # provides mutate() and the %>% pipe used below
total_freq<-sum(tdm_df$frequency) # calculating total frequency
fifty_freq<-ceiling(total_freq*0.5) # terms
seventy_freq<-ceiling(total_freq*0.7)
ninty_freq<-ceiling(total_freq*0.9)
tdm_df <- tdm_df %>% mutate(cum_freq = cumsum(frequency)) # cumulative frequency of the terms, which are already sorted by decreasing frequency
head(tdm_df)
## word frequency cum_freq
## 1 said 6026 6026
## 2 will 5370 11396
## 3 one 5295 16691
## 4 can 5067 21758
## 5 just 4478 26236
## 6 like 4244 30480
cutoff_50<-min(which(tdm_df$cum_freq >= fifty_freq)) # 445L
cutoff_70<-min(which(tdm_df$cum_freq >= seventy_freq)) # 1068L
cutoff_90<-min(which(tdm_df$cum_freq >= ninty_freq)) # 2439L
# Addressing how to increase coverage: identifying words that may not be in the corpora, or using a smaller number of dictionary words to cover the same number of phrases
total <- max(tdm_df$cum_freq)
tdm_df <- tdm_df %>%
mutate(Coverage = cum_freq/total) %>%
mutate(`Number of Words` = 1:nrow(tdm_df))
ggplot(tdm_df, aes(x = `Number of Words`, y = Coverage)) + labs(title = "Uni-grams")+
geom_line() + geom_hline(yintercept = c(0.5,0.7, 0.9),color=c("red","green","blue"),linetype="dashed")
From the graph above, about 445, 1068 and 2439 unique words (the cutoffs computed above) are needed in a frequency-sorted dictionary to cover 50%, 70% and 90% of all word instances, respectively. Word coverage ratios were estimated in the same way for bi-grams and tri-grams, as sketched below.
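A sketch of a reusable helper for those n-gram coverage calculations (the function name is illustrative):
coverage_cutoff <- function(df, target = 0.5) {
  freq_sorted <- sort(df$frequency, decreasing = TRUE)       # most frequent n-grams first
  cum_cov <- cumsum(freq_sorted) / sum(freq_sorted)          # running share of all instances
  min(which(cum_cov >= target))                              # n-grams needed to hit the target
}
coverage_cutoff(btdm_df, 0.5)   # bi-grams needed for 50% coverage
coverage_cutoff(ttdm_df, 0.9)   # tri-grams needed for 90% coverage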
I observed many empty lines in the Twitter text compared to the blogs and news files, which were more compact, so I read the lines with skipNul = TRUE. The data also contains many non-English words that need to be removed; reading with "UTF-8" encoding and then converting to ASCII handles the Unicode characters and restricts the text to English. Because the data is quite large, I checked the time consumed and the memory used by the major text-processing steps: loading, cleaning, building the document-term matrices, and finally plotting the word cloud. Since I used only 20,000 lines per file for this exercise, I need to balance memory consumption against acceptable performance when running the final modelling program. I also manually selected some stopwords that are not required in the text-prediction model.
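A sketch of the kind of timing and memory checks mentioned above, using base R only; such a check would sit right after the cleaned corpus is built, while Scorpus still exists:
timing <- system.time(
  tdm_check <- TermDocumentMatrix(Scorpus)          # time one heavy step (uni-gram TDM)
)
timing["elapsed"]                                   # seconds elapsed
format(object.size(tdm_check), units = "Mb")        # approximate in-memory size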
1. Since data cleaning and text mining are iterative processes, my next step will be to evaluate more refined clean-up methods, including the removal of stopwords and profanity, encountered while handling large training sets.
2. Build and deploy different prediction algorithms and examine their performance on wider samples.
3. Optimize the run time of our best prediction model on the training set to an acceptable level.
4. Develop a Shiny app with a simple user interface that accurately predicts the next word based on a word or phrase entered by the user.
5. Modify the code and resolve problems encountered while running the model.