Overview

Predicting the next word a user will type on a keyboard is an interesting Natural Language Processing (NLP) problem in data science. The goal of this exercise is to analyze the three data sets provided - one blog file, one Twitter file and one news file - to see what can be gleaned from the data and to build a dictionary of word frequencies.

This report marks the beginning of the English text mining analysis on the provided data. The following steps are performed before building the predictive model: 1) download and process the data; 2) perform common exploratory data analysis on the text data; 3) identify any interesting characteristics of the data; 4) describe the plan for building a predictive model in R.

Exploratory Analysis

The next step is to perform some common exploratory analysis on the downloaded text data.
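
The download and processing step is not shown in this section; the line vectors T_lines, B_lines and N_lines used below are assumed to have been created from the raw text files. A minimal sketch, assuming the standard Coursera SwiftKey file names (an assumption, since the actual paths are not shown):

# Sketch: read the raw files into character vectors (one element per line)
T_lines <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
B_lines <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
N_lines <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)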

Word Count analysis

library(stringi)

# Total word counts per source file
T_words <- sum(stri_count_words(T_lines))
B_words <- sum(stri_count_words(B_lines))
N_words <- sum(stri_count_words(N_lines))

Word_count <- c(Twitter = T_words, Blogs = B_words, News = N_words)
barplot(height = Word_count, col = c("steelblue", "darkorange", "forestgreen"),
        xlab = "Text File", ylab = "Word counts", main = "Word counts plot")

summary(T_words)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 30218125 30218125 30218125 30218125 30218125 30218125
summary(B_words)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 38154238 38154238 38154238 38154238 38154238 38154238
summary(N_words)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 2693898 2693898 2693898 2693898 2693898 2693898

Data sampling

This is a crucial step before converting the text data into a corpus for tokenization. In this step, only a 10% random sample of the data is retained for further analysis, after removing non-ASCII characters.
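
For reproducibility, the random seed can be fixed before sampling (this line is an addition; the seed value is arbitrary):

set.seed(1234)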

#Random sampling 10% of the data for ngram analysis

T_linescnt <- NROW(T_lines)
B_linescnt <- NROW(B_lines)
N_linescnt <- NROW(N_lines)
T_samp <- sample(T_lines, round(T_linescnt * 0.1))
B_samp <- sample(B_lines, round(B_linescnt * 0.1))
N_samp <- sample(N_lines, round(N_linescnt * 0.1))

# Combine sample datasets for additional pre-processing 

Combsamp <- c(T_samp, B_samp, N_samp)

# Locate non-ASCII characters and remove them from the sample dataset

# Convert the sampled strings to a vector of words 

CombVsc <- unlist(strsplit(Combsamp, split = " "))

# Locate the words containing non-ASCII characters
# (iconv substitutes the marker string "CombVsc" for non-ASCII bytes, which grep then finds)

ASCII_index <- grep("CombVsc", iconv(CombVsc, "latin1", "ASCII", sub = "CombVsc"))

# Subset the original vector of words, dropping those with non-ASCII characters

SubVsc <- if (length(ASCII_index) > 0) CombVsc[-ASCII_index] else CombVsc

# Convert it back to a single string

Subsamp <- paste(SubVsc, collapse = " ")
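
As a quick sanity check (an addition to the original code), one can confirm that no word in the cleaned vector still fails ASCII conversion:

# Should be zero: number of words that still cannot be converted to ASCII
sum(is.na(iconv(SubVsc, "latin1", "ASCII")))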

Corpus data analysis using the tm package

The tm package is very useful for converting text data into a corpus. As part of the text mining process, the tm package has been used to convert characters to lowercase and to remove punctuation, numbers, extra whitespace, email addresses, URLs, Twitter tags, etc. To address profanity, I have removed the English words listed as profane in the Google profanity word list. Source: https://github.com/RobertJGabriel/Google-profanity-words
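
The profanity list is assumed to have been saved locally as bad_word_list.txt before the cleaning steps below; a sketch of fetching it from the linked repository (the exact file name inside the repository is an assumption):

# Sketch: download the profanity list; the raw file path is an assumption
download.file("https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt",
              destfile = "bad_word_list.txt")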

library(tm)
## Loading required package: NLP
SampCor <- Corpus(VectorSource(Subsamp))

# Converts character to lower case

SampCor <- tm_map(SampCor,tolower)
## Warning in tm_map.SimpleCorpus(SampCor, tolower): transformation drops
## documents
# Remove Punctuation

SampCor <- tm_map(SampCor,removePunctuation)
## Warning in tm_map.SimpleCorpus(SampCor, removePunctuation): transformation
## drops documents
# Remove Numbers

SampCor <- tm_map(SampCor,removeNumbers)
## Warning in tm_map.SimpleCorpus(SampCor, removeNumbers): transformation
## drops documents
# Remove extra whitespace

SampCor <- tm_map(SampCor, stripWhitespace)
## Warning in tm_map.SimpleCorpus(SampCor, stripWhitespace): transformation
## drops documents
# Remove URLs, email addresses and Twitter artifacts

#URLs
URLpat <- function(x)
          gsub("(ftp|http)(s?)://[^[:space:]]+", "", x)

SampCor <- tm_map(SampCor, URLpat)
## Warning in tm_map.SimpleCorpus(SampCor, URLpat): transformation drops
## documents
#Emails

emailpat <- function(x)
          gsub("\\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}\\b", "", x)
SampCor <- tm_map(SampCor, emailpat)
## Warning in tm_map.SimpleCorpus(SampCor, emailpat): transformation drops
## documents
#Twitter tags

tweetpat <- function(x)
            gsub("\\b(rt|via)\\b", "", x, ignore.case = TRUE)
SampCor <- tm_map(SampCor, tweetpat)
## Warning in tm_map.SimpleCorpus(SampCor, tweetpat): transformation drops
## documents
#Twitter usernames

usernamepat <- function(x)
          gsub("@[A-Za-z0-9_]{1,15}", "", x)
SampCor <- tm_map(SampCor, usernamepat)
## Warning in tm_map.SimpleCorpus(SampCor, usernamepat): transformation drops
## documents
# Remove profane words

worddat <- read.table("bad_word_list.txt",header=FALSE,sep = "\n",strip.white = TRUE)
names(worddat) <- "Bad Words"

SampCor <- tm_map(SampCor,removeWords,worddat[,1])
## Warning in tm_map.SimpleCorpus(SampCor, removeWords, worddat[, 1]):
## transformation drops documents
SampCor <- tm_map(SampCor, removeWords, stopwords("en"))
## Warning in tm_map.SimpleCorpus(SampCor, removeWords, stopwords("en")):
## transformation drops documents

Tokenizing and n-gram analysis

The next step is to perform n-gram analysis on the cleaned corpus data and create separate n-gram data frames for visualization.

library(quanteda)
## Package version: 1.3.14
## Parallel computing: 2 of 4 threads used.
## See https://quanteda.io for tutorials and examples.
## 
## Attaching package: 'quanteda'
## The following objects are masked from 'package:tm':
## 
##     as.DocumentTermMatrix, stopwords
## The following object is masked from 'package:utils':
## 
##     View
All_Corpus <- corpus(SampCor)

# Tokenize once; quanteda uses snake_case argument names
Corpus_Tokens <- tokens(All_Corpus, remove_numbers = TRUE, remove_punct = TRUE,
                        remove_symbols = TRUE, remove_separators = TRUE)

# Unigram document-feature matrix (stopwords were already removed at the corpus stage)
DFM_Corpus <- dfm(Corpus_Tokens, verbose = FALSE)

# Bigram and trigram tokens and document-feature matrices
Corpus_Tokens_2Gram <- tokens_ngrams(Corpus_Tokens, n = 2)
DFM_Corpus_2Gram <- dfm(Corpus_Tokens_2Gram, verbose = FALSE)

Corpus_Tokens_3Gram <- tokens_ngrams(Corpus_Tokens, n = 3)
DFM_Corpus_3Gram <- dfm(Corpus_Tokens_3Gram, verbose = FALSE)
# Helper: build a term-frequency data frame from a term-document matrix,
# keeping only terms below the given sparsity threshold
ngram_freqdf <- function(tdm, sparsity){
  freq <- sort(rowSums(as.matrix(removeSparseTerms(tdm, sparsity))), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

onegram_tdm <- TermDocumentMatrix(SampCor)
onegram_freqdf <- ngram_freqdf(onegram_tdm, 0.99)
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
library(RColorBrewer)

# Counts for the top 20 one-gram words

ggplot(onegram_freqdf[1:20, ], aes(x = reorder(word, -freq), y = freq)) +
    geom_bar(stat = "identity") + 
    labs(x = "Words", y = "Count", title = "Top 20 one-gram words (All)") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1), plot.title = element_text(hjust = 0.5))
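
The most frequent bigram and trigram features can also be read directly off the quanteda document-feature matrices used for the word clouds below; a small sketch (not part of the original analysis):

# Top bigram and trigram features by total count
topfeatures(DFM_Corpus_2Gram, 20)
topfeatures(DFM_Corpus_3Gram, 10)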

Word clouds: top 300 one-gram words, top 50 two-gram words, top 10 three-gram words

library(wordcloud)

textplot_wordcloud(DFM_Corpus, comparison = FALSE, max_words = 300,
                   color = brewer.pal(8, "Dark2"))

textplot_wordcloud(DFM_Corpus_2Gram, comparison = FALSE, max_words = 50,
                   color = c("blue", "red"))

textplot_wordcloud(DFM_Corpus_3Gram, comparison = FALSE, max_words = 10,
                   color = c("blue", "red"))

Conclusion

Based on the bar chart, the top 5 most frequently used words across the datasets are "just", "like", "will", "one" and "can". The word cloud plots show heavy usage of a few two-gram and three-gram terms such as "cant_wait", "right_now", "last_night", "dont_know", "cant_wait_see", "happy_mothers_day", "happy_new_year", etc. This text analysis concludes my initial effort to discover the text patterns of the three datasets.

Future roadmap

This exploratory data analysis lays the foundation for my future work on the capstone project. My next step is to gain more insight into R packages such as openNLP, RWeka and quanteda, which could be beneficial for analyzing the data and building a predictive model for this NLP task. I will also start exploring how to build a Shiny app that leverages some of the key text mining packages and is useful for text mining purposes. Furthermore, the data will be split into training and test datasets. The training data will be used to analyze the text patterns and build the algorithm, and the test data will be used to validate the algorithm at the final stage of this project. The validation results and performance metrics of the algorithm will be documented and presented accordingly.