In the context of keyboard typing, predicting the next word is an interesting Natural Language Processing (NLP) problem in data science. The goal of this exercise is to analyze the three data sets provided (one blog, one Twitter, and one news file), see what can be gleaned from the data, and build a dictionary of word frequencies.
This is the beginning of the English text-mining analysis on the provided data. The following steps need to be performed before building the predictive model: 1) download and process the data, 2) perform common exploratory data analysis on the text data, 3) look for anything interesting in the data, and 4) describe the plan for building a predictive model in R.
The next step is to perform some common exploratory analysis on the downloaded text data.
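The counts below assume the three raw files have already been read into the character vectors T_lines, B_lines, and N_lines; a minimal sketch of that loading step (the file names are assumptions following the usual en_US naming) is:
# Read the raw text files into character vectors (paths are assumptions);
# skipNul = TRUE avoids warnings about embedded nulls in the Twitter file
T_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
B_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
N_lines <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)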
library(stringi)
# Total word counts per source
T_words <- sum(stri_count_words(T_lines))
B_words <- sum(stri_count_words(B_lines))
N_words <- sum(stri_count_words(N_lines))
Word_count <- c(Twitter = T_words, Blogs = B_words, News = N_words)
barplot(height = Word_count, col = c("steelblue", "darkgreen", "firebrick"),
        xlab = "Text File", ylab = "Word counts", main = "Word counts plot")
summary(T_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30218125 30218125 30218125 30218125 30218125 30218125
summary(B_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 38154238 38154238 38154238 38154238 38154238 38154238
summary(N_words)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2693898 2693898 2693898 2693898 2693898 2693898
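Note that summary() of a single total simply repeats that number across all quantiles; a more informative check (optional, output not shown here) is the distribution of words per line in each source:
# Words-per-line distribution for each source (optional check)
summary(stri_count_words(T_lines))
summary(stri_count_words(B_lines))
summary(stri_count_words(N_lines))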
This is a crucial step before converting the text data into a corpus for tokenization. In this step, only a 10% sample of the data is processed for further analysis, after removing non-ASCII characters.
# Randomly sample 10% of the data for n-gram analysis
set.seed(1234)  # fixed (arbitrary) seed so the sample is reproducible
T_linescnt <- NROW(T_lines)
B_linescnt <- NROW(B_lines)
N_linescnt <- NROW(N_lines)
T_samp <- sample(T_lines, floor(T_linescnt * 0.1))
B_samp <- sample(B_lines, floor(B_linescnt * 0.1))
N_samp <- sample(N_lines, floor(N_linescnt * 0.1))
# Combine sample datasets for additional pre-processing
Combsamp <- c(T_samp,B_samp,N_samp)
# Locate non-ASCII characters and remove them from the sample dataset
# Convert the strings into a vector of words
CombVsc <- unlist(strsplit(Combsamp, split = " "))
# Locate the words containing non-ASCII characters
ASCII_index <- grep("CombVsc", iconv(CombVsc, "latin1", "ASCII", sub = "CombVsc"))
# Subset the original vector of words, dropping the words with non-ASCII characters
SubVsc <- if (length(ASCII_index) > 0) CombVsc[-ASCII_index] else CombVsc
# Convert it back to a single string
Subsamp <- paste(SubVsc, collapse = " ")
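An alternative, simpler approach to the same cleanup (a sketch, not the method used above) is to let iconv() strip the non-ASCII characters in place rather than dropping the words that contain them:
# One-step alternative: replace non-ASCII bytes with "" instead of removing whole words
Subsamp_alt <- iconv(Combsamp, from = "latin1", to = "ASCII", sub = "")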
The tm package is very useful for converting text data into a corpus. As part of the text-mining process, the tm package has been used to convert characters to lowercase and to remove punctuation, numbers, extra whitespace, email addresses, URLs, Twitter tags, etc. To avoid profanity issues, I have removed the English words listed as profane in the Google word list. Source: https://github.com/RobertJGabriel/Google-profanity-words
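The profanity list is read further below from a local file named bad_word_list.txt; one way to obtain it (the raw-file path within the repository is an assumption) would be:
# Download the profanity list once if it is not already present
# (the exact raw-file path is an assumption about the repository layout)
if (!file.exists("bad_word_list.txt")) {
  download.file("https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt",
                destfile = "bad_word_list.txt")
}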
library(tm)
## Loading required package: NLP
SampCor <- Corpus(VectorSource(Subsamp))
# Converts character to lower case
SampCor <- tm_map(SampCor,tolower)
## Warning in tm_map.SimpleCorpus(SampCor, tolower): transformation drops
## documents
# Remove Punctuation
SampCor <- tm_map(SampCor,removePunctuation)
## Warning in tm_map.SimpleCorpus(SampCor, removePunctuation): transformation
## drops documents
# Remove Numbers
SampCor <- tm_map(SampCor,removeNumbers)
## Warning in tm_map.SimpleCorpus(SampCor, removeNumbers): transformation
## drops documents
# Remove extra whitespace
SampCor <- tm_map(SampCor, stripWhitespace)
## Warning in tm_map.SimpleCorpus(SampCor, stripWhitespace): transformation
## drops documents
# Remove URLs, email addresses, and Twitter artifacts
#URLs
URLpat <- function(x) gsub("(ftp|http)(s?)://\\S+", "", x)
SampCor <- tm_map(SampCor, URLpat)
## Warning in tm_map.SimpleCorpus(SampCor, URLpat): transformation drops
## documents
#Emails
emailpat <- function(x) gsub("\\b[A-Za-z0-9._%-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", x)
SampCor <- tm_map(SampCor, emailpat)
## Warning in tm_map.SimpleCorpus(SampCor, emailpat): transformation drops
## documents
#Twitter tags
tweetpat <- function(x) gsub("\\bRT\\b|\\bvia\\b", "", x)
SampCor <- tm_map(SampCor, tweetpat)
## Warning in tm_map.SimpleCorpus(SampCor, tweetpat): transformation drops
## documents
#Twitter usernames
usernamepat <- function(x) gsub("@[A-Za-z0-9_]{1,15}", "", x)
SampCor <- tm_map(SampCor, usernamepat)
## Warning in tm_map.SimpleCorpus(SampCor, usernamepat): transformation drops
## documents
# Remove profane words
worddat <- read.table("bad_word_list.txt",header=FALSE,sep = "\n",strip.white = TRUE)
names(worddat) <- "Bad Words"
SampCor <- tm_map(SampCor,removeWords,worddat[,1])
## Warning in tm_map.SimpleCorpus(SampCor, removeWords, worddat[, 1]):
## transformation drops documents
SampCor <- tm_map(SampCor, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(SampCor, removeWords, stopwords("en")):
## transformation drops documents
The next step is to perform n-gram analysis on the cleaned corpus and create different n-gram data frames for visualization purposes.
library(quanteda)
## Package version: 1.3.14
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
##
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
##
## as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
##
## View
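As a quick illustration of what the n-gram tokenization below produces (a toy example, not part of the corpus processing), quanteda joins adjacent words with an underscore:
# Toy bigram example; the resulting tokens are "thanks_for", "for_the", "the_follow"
tokens("thanks for the follow", ngrams = 2)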
All_Corpus <- corpus(SampCor)
Corpus_Tokens <- tokens(All_Corpus, remove_numbers = TRUE, remove_punct = TRUE,
                        remove_symbols = TRUE, remove_separators = TRUE)
DFM_Corpus <- dfm(Corpus_Tokens, verbose = FALSE, remove = stopwords("english"))
Corpus_Tokens_2Gram <- tokens(All_Corpus, ngrams = 2, remove_numbers = TRUE, remove_punct = TRUE,
                              remove_symbols = TRUE, remove_separators = TRUE)
DFM_Corpus_2Gram <- dfm(Corpus_Tokens_2Gram, verbose = FALSE, remove = stopwords("english"))
Corpus_Tokens_3Gram <- tokens(All_Corpus, ngrams = 3, remove_numbers = TRUE, remove_punct = TRUE,
                              remove_symbols = TRUE, remove_separators = TRUE)
DFM_Corpus_3Gram <- dfm(Corpus_Tokens_3Gram, verbose = FALSE, remove = stopwords("english"))
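A quick, optional way to sanity-check the resulting document-feature matrices is quanteda's topfeatures(), which lists the most frequent features:
# Inspect the ten most frequent features in each dfm (output not shown)
topfeatures(DFM_Corpus, 10)
topfeatures(DFM_Corpus_2Gram, 10)
topfeatures(DFM_Corpus_3Gram, 10)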
# Build a word-frequency data frame from a TermDocumentMatrix,
# dropping terms sparser than the given threshold
ngram_freqdf <- function(tdm, sparsity){
  freq <- sort(rowSums(as.matrix(removeSparseTerms(tdm, sparsity))), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}
onegram_tdm <- TermDocumentMatrix(SampCor)
onegram_freqdf <- ngram_freqdf(onegram_tdm, 0.99)
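For the higher-order n-grams, equivalent frequency data frames can be built directly from the quanteda dfms; a minimal sketch using topfeatures() (the bigram_freqdf and trigram_freqdf names are introduced here for illustration):
# Top-20 bigram and trigram frequencies as data frames, for plotting
bigram_top  <- topfeatures(DFM_Corpus_2Gram, 20)
trigram_top <- topfeatures(DFM_Corpus_3Gram, 20)
bigram_freqdf  <- data.frame(word = names(bigram_top),  freq = bigram_top)
trigram_freqdf <- data.frame(word = names(trigram_top), freq = trigram_top)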
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(RColorBrewer)
# Counts for the top 20 one-gram words
ggplot(onegram_freqdf[1:20, ], aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Words", y = "Count", title = "Top 20 one-gram words (All)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5))
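The same style of bar chart could be drawn for the bigrams, using the illustrative bigram_freqdf built above:
# Counts for the top 20 two-gram words (sketch, analogous to the plot above)
ggplot(bigram_freqdf, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Bigrams", y = "Count", title = "Top 20 two-gram words (All)") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5))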
library(wordcloud)
textplot_wordcloud(DFM_Corpus, comparison = FALSE, max_words = 300,
color = brewer.pal(8, "Dark2"))
textplot_wordcloud(DFM_Corpus_2Gram, comparison = FALSE, max_words = 50,
color = c("blue", "red"))
textplot_wordcloud(DFM_Corpus_3Gram, comparison = FALSE, max_words = 10,
color = c("blue", "red"))
Based on the bar plot, the top five most frequently used words across the datasets are “just”, “like”, “will”, “one”, and “can”. The word-cloud plots show heavier usage of a few two-gram and three-gram terms such as “cant_wait”, “right_now”, “last_night”, “dont_know”, “cant_wait_see”, “happy_mothers_day”, and “happy_new_year”. This analysis concludes my initial effort to discover the text patterns of the three datasets.
This exploratory data analysis lays the foundation for my future work on the capstone project. My next plan is to gather more insight into R packages such as openNLP, RWeka, and quanteda, which could be beneficial for analyzing the data and building a predictive model for this NLP task. I will also start exploring how to build a Shiny app that leverages some of the key text-mining packages. Furthermore, the data will be split into training and test datasets: the training data will be used to analyze text patterns and build the algorithm, and the test data will be used to validate the algorithm at the final stage of the project. The validation results and performance metrics of the algorithm will be documented and presented accordingly.
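As a rough sketch of the planned training/test split (the proportion and seed below are illustrative assumptions, not final choices):
# Hold out 20% of the sampled lines for testing (illustrative split only)
set.seed(2024)  # assumed seed
train_idx  <- sample(seq_along(Combsamp), size = floor(0.8 * length(Combsamp)))
train_data <- Combsamp[train_idx]
test_data  <- Combsamp[-train_idx]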