Cleaning the texts before building our word prediction application.

The original data, as obtained from Coursera-SwiftKey, contains many irregularities that need to be addressed before it is ready for exploratory analysis or modeling. For example, the data contains email addresses, http/https URLs, emojis, contractions such as “don’t”, mixed upper/lower case, and symbols such as & and @, all of which have to be either removed or replaced/expanded (e.g., “don’t” expands into “do not”) before n-grams are created.

We apply the following transformations, mainly using the gsub() function, to the character vectors obtained from readLines() on our text files, in the exact sequence described below:

  1. Turn numbers into the identifier NNUMM. The purpose is that when forming n-grams we discard those formed on the edges of such numbers, as these n-grams are incorrect; without the identifier, incorrect n-grams would be formed later on.

  2. Turn ?, !, and . into an end-of-sentence identifier EEOSS. The purpose is that when forming n-grams we discard those formed on the edges of sentences, which would otherwise combine words from more than one sentence.

  3. Turn abbreviations such as H.S.B.C. into the identifier AABRR. The purpose is that when forming n-grams we discard those formed on the edges of such abbreviations, as these n-grams are incorrect from an information perspective.

  4. Expand haven’t to have not and hadn’t to had not.

  5. Remove email addresses and http/https URLs.

  6. Remove retweet markers, @-mentions of people, and Twitter usernames.

  7. Remove unit abbreviations such as g, mg, and lbs; remove all single letters except “a” and “i”.

  8. Remove punctuation.

Note that removing punctuation also removes hashtags from tweets. I examined the data, and most often the hashtag precedes a word with meaning rather than a person’s name. If it were the latter, it would be logical to remove the whole word preceded by the hashtag; however, that strategy would lose meaning in the sentences and potentially hurt the predictive power of the model. In addition, hashtagged words that don’t have a general meaning should not repeat often and will carry a low weight when calibrating the model via the n-grams. Result: remove only the hashtag symbol, not the word that follows it.
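
For illustration, here is a minimal sketch of this choice on a made-up tweet (the example text is hypothetical):

tweet <- "Just finished the race! #marathon #exhausted"
gsub("#", "", tweet)  # drops only the hashtag symbol, keeps the word
## [1] "Just finished the race! marathon exhausted"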

  9. Replace ’m, ’s, ’re, ’ll, and all other contractions using the textclean package.

  10. Replace @ with at and & with and.

  11. Remove ’s.

  12. Remove profanity words.

File path of the corpora

filePath <- file.path(".", "en_US")
filePath
## [1] "./en_US"

Content of corpora

dir(filePath)
## [1] "en_US.blogs.txt"   "en_US.news.txt"    "en_US.twitter.txt"

Import libraries for text mining
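
The code below assumes the following packages are attached; a minimal sketch of the imports (stringi provides stri_stats_general(), textclean provides replace_contraction(), tm provides removeWords(), quanteda provides corpus() and the tokenizer, and parallel/doParallel back the parallelization helper):

library(stringi)     # stri_stats_general()
library(textclean)   # replace_contraction()
library(tm)          # removeWords()
library(quanteda)    # corpus(), tokenize(), removeFeatures()
library(parallel)    # detectCores(), makeCluster(), stopCluster()
library(doParallel)  # registerDoParallel()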

Summary of the given dataset

Blogs

# read the file line by line and compute basic string statistics
con1 <- file("./en_US/en_US.blogs.txt", "r")
blogs <- readLines(con1, skipNul=TRUE, encoding="UTF-8")
close(con1)
stri_stats_general(blogs)
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539

News

# read the file line by line and compute basic string statistics
con2 <- file("./en_US/en_US.news.txt", "r")
news <- readLines(con2, skipNul=TRUE, encoding="UTF-8")
## Warning in readLines(con2, skipNul = TRUE, encoding = "UTF-8"): incomplete
## final line found on './en_US/en_US.news.txt'
close(con2)
stri_stats_general(news)
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698

Twitter

# read the file line by line and compute basic string statistics
con3 <- file("./en_US/en_US.twitter.txt", "r")
twitter <- readLines(con3, encoding="UTF-8", skipNul = TRUE)
close(con3)
stri_stats_general(twitter)
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806

Sample of the text data

sampleHolderTwitter <- sample(length(twitter), length(twitter) * 0.1)
sampleHolderBlog <- sample(length(blogs), length(blogs) * 0.1)
sampleHolderNews <- sample(length(news), length(news) * 0.1)

US_Twitter_Sample <- twitter[sampleHolderTwitter]
US_Blogs_Sample <- blogs[sampleHolderBlog]
US_News_Sample <- news[sampleHolderNews]
rm(blogs)
rm(news)
rm(twitter)
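
Note that sample() is not seeded above, so a different sample is drawn on each run. For reproducibility one could fix the seed before the sampling; the seed value below is arbitrary:

set.seed(1234)  # arbitrary seed; makes the random sampling reproducible
sampleHolderTwitter <- sample(length(twitter), length(twitter) * 0.1)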

Text preprocessing and cleaning

  1. Turn numbers into the identifier NNUMM. The purpose is that when forming n-grams we discard those formed on the edges of such numbers, as these n-grams are incorrect; without the identifier, incorrect n-grams would be formed later on.
# turn numbers into the identifier NNUMM so that n-grams formed on the
# edges of numbers can be discarded later
Blogs <- gsub("[0-9]+"," NNUMM ",US_Blogs_Sample)
News <- gsub("[0-9]+"," NNUMM ",US_News_Sample)
Twitter <- gsub("[0-9]+"," NNUMM ",US_Twitter_Sample)
  2. Turn ?, !, and . into an end-of-sentence identifier EEOSS. The purpose is that when forming n-grams we discard those formed on the edges of sentences, which would otherwise combine words from more than one sentence.
Blogs <- gsub("\\? |\\?$|\\! |\\!$|\\. |\\.$", " EEOSS ", Blogs)
News <- gsub("\\? |\\?$|\\! |\\!$|\\. |\\.$", " EEOSS ", News)
Twitter <- gsub("\\? |\\?$|\\! |\\!$|\\. |\\.$", " EEOSS ", Twitter)
  3. Turn abbreviations such as H.S.B.C. into the identifier AABRR. The purpose is that when forming n-grams we discard those formed on the edges of such abbreviations, as these n-grams are incorrect from an information perspective.
Blogs <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", Blogs)
News <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", News)
Twitter <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", Twitter)
  4. Expand haven’t to have not and hadn’t to had not.
Blogs <- gsub("haven't", "have not", Blogs)
Blogs <- gsub("hadn't", "had not", Blogs)
News <- gsub("haven't", "have not", News)
News <- gsub("hadn't", "had not", News)
Twitter <- gsub("haven't", "have not", Twitter)
Twitter <- gsub("hadn't", "had not", Twitter)
  5. Remove email addresses and http/https URLs.
# remove email addresses, http(s) URLs, and stray hyphens/slashes
Blogs <- gsub("-|/|\\S+@\\S+|[Hh]ttp([^ ]+)", " ", Blogs)

News <- gsub("-|/|\\S+@\\S+|[Hh]ttp([^ ]+)", " ", News)

Twitter <- gsub("-|/|\\S+@\\S+|[Hh]ttp([^ ]+)", " ", Twitter)
  6. Remove retweet markers, @-mentions of people, and Twitter usernames.
# regex "RT | via " matches retweet markers
# regex "@([^ ]+)" matches @-mentions of people
# regex "[@][a-zA-Z0-9_]{1,15}" matches Twitter usernames
# regex "[^\x01-\x7F]" matches emoji and other non-ASCII characters
Twitter <- gsub("RT | via |@([^ ]+)|[@][a-zA-Z0-9_]{1,15}|[^\x01-\x7F]", " ", Twitter)
  7. Remove unit abbreviations such as g, mg, and lbs; remove all single letters except “a” and “i”.
rem <- function(vectText){
  # digit+unit tokens such as "5g", "10mg", "3lbs" (grams, milligrams, etc.)
  vectText <- gsub(" [0-9]+(g|mg|kg|lbs|s|m|h) ", " ", vectText)
  # standalone unit abbreviations left over after numbers became NNUMM
  vectText <- gsub(" +(g|mg|kg|lbs|s|m|h) ", " ", vectText)
  # all single letters except "a" and "i"
  vectText <- gsub(" [b-hj-z] ", " ", vectText)
  vectText
}
Blogs <- rem(Blogs)
News <- rem(News)
Twitter <- rem(Twitter)
  8. Remove punctuation. As explained above, only the hashtag symbol is removed from tweets, not the word that follows it.
# strip curly quotation marks and apostrophes
Blogs <- gsub("“|”|‘|’", "", Blogs)
News <- gsub("“|”|‘|’", "", News)
Twitter <- gsub("“|”|‘|’", "", Twitter)
  9. Replace ’m, ’s, ’re, ’ll, and all other contractions using the textclean package.
Blogs <- replace_contraction(Blogs)

News <- replace_contraction(News)

Twitter <- replace_contraction(Twitter)
  10. Replace @ with at and & with and.
Blogs <- gsub("@", "at", Blogs)
Blogs <- gsub("&", "and", Blogs)
News <- gsub("@", "at", News)
News <- gsub("&", "and", News)
Twitter <- gsub("@", "at", Twitter)
Twitter <- gsub("&", "and", Twitter)
  11. Remove ’s.
Blogs <- gsub("'s", "", Blogs)

News <- gsub("'s", "", News)

Twitter <- gsub("'s", "", Twitter)
  12. Remove profanity words. To keep the app from predicting profanity, we remove these words from the three texts.
profanityFileName <- "profanity.txt"
profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
Blogs <- removeWords(Blogs, profanity$V1)

News <- removeWords(News, profanity$V1)

Twitter <- removeWords(Twitter, profanity$V1)

A look at the preprocessed texts

Blogs[1:5]
## [1] "One thing that has gone on around the old house place was this weekend’s festivities EEOSS A few months back, The Boss and I were riding around after we had our breakfast at the “Egg and I” and we were talking about the holiday events where we “dote” over our girls EEOSS Then we got to talking about our “towheaded” son’s in law EEOSS I thought and eventually spoke out loud EEOSS "
## [2] "Rousseau’s attempt to justify every form of dominance by the group over individuals then appears as an inherently patriarchal and gendered sort of dominance – and we might imagine what sort of practical policies this might lead to (men must not be sissified, or the state will be weak EEOSS women must not deny their femininity, or we will run out of babies!) EEOSS "                
## [3] "It seems like the whole CJ system is an administrative blip. EEOSS "                                                                                                                                                                                                                                                                                                                           
## [4] "Beckley – Alderson – Lewisburg – Former WWII detention camps that are now converted into active federal prison complexes capable of holding several times their current populations EEOSS Alderson is presently a women’s federal reformatory EEOSS "                                                                                                                                          
## [5] "Two days after, I did write the show a letter …I guess my wanting to desperately meet my family took the overhand !"
News[1:5]
## [1] "   Experience: Avalos has lived in San Francisco  NNUMM  years,  NNUMM  of them in District  NNUMM  EEOSS He worked as a legislative aide to Supervisor Chris Daly from January  NNUMM  to July  NNUMM  EEOSS Previously he was an organizer and political coordinator for the Service Employees International Union and the director of organizing for Coleman Advocates, a child advocacy organization EEOSS He is president of the San Francisco People Organization EEOSS "
## [2] "Stoudemire finished with  NNUMM  points and seven rebounds, including  NNUMM  points in the second half EEOSS He took six shots over the final two quarters compared to eight for Anthony EEOSS "                                                                                                                                                                                                                                                                              
## [3] "Curtis won $ NNUMM , NNUMM , NNUMM  and a two year tour exemption — a more meaningful reward after being relegated to a status so low that this victory came in just the fourth PGA Tour event he managed to get into this year EEOSS "                                                                                                                                                                                                                                        
## [4] "Politics: Hale is treasurer of Sen EEOSS Cardin campaign committee, a position he has held since early  NNUMM  EEOSS "                                                                                                                                                                                                                                                                                                                                                         
## [5] "\"I saw Jeff on the left side of the field, so I started running down the field and there was no one on me,\" Reitz said EEOSS \"I did not think it was going in, but it went in.\""
Twitter[1:5]
## [1] "Watch for the new mixtape \"Sprinkle of Greatness\" Coming in Jan ' NNUMM  EEOSS With Tex killin it on every track EEOSS Free Tex EEOSS Free Tex!"
## [2] "I did not send it to Twitter just now.. EEOSS Idk how it got there now"                                                                           
## [3] "I wana go home EEOSS I have a lot of things to do EEOSS "                                                                                         
## [4] "I F**KING LOVE, CHIPTUNE MUSIC!! EEOSS =D"                                                                                                        
## [5] "Never miss an opportunity to tell someone how much theyean to you EEOSS they might not be here tomorrow EEOSS "

The texts look as we expected!

We will do another cleaning pass when tokenizing.

Tokenizing and building unigrams, bigrams, and trigrams

# Generic function for parallelizing any task (when possible)
library(parallel)
library(doParallel)  # needed for registerDoParallel()
parallelizeTask <- function(task, ...) {
  # use all cores but one
  ncores <- detectCores() - 1
  # initiate the cluster
  cl <- makeCluster(ncores)
  registerDoParallel(cl)
  r <- task(...)
  stopCluster(cl)
  r
}
# Returns a vector of profanity words
getProfanityWords <- function(corpus) {
  profanityFileName <- "profanity.txt"
  if (!file.exists(profanityFileName)) {
    profanity.url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
    download.file(profanity.url, destfile = profanityFileName, method = "curl")
  }
  
  if (sum(ls() == "profanity") < 1) {
    profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
    profanity <- profanity$V1
    profanity <- profanity[1:length(profanity)-1]
  }
  
  profanity
}
# Tokenize into word n-grams and strip profanity
# (tokenize()/removeFeatures() come from the pre-1.0 quanteda API;
# newer quanteda versions use tokens()/tokens_remove() instead)
makeTokens <- function(input, n = 1L) {
  output <- tokenize(input, what = "word", removeNumbers = TRUE,
            removePunct = TRUE, removeSeparators = TRUE,
            removeTwitter = TRUE, removeHyphens = TRUE,
            ngrams = n, simplify = TRUE)
  output <- removeFeatures(output, getProfanityWords())
  output
}

Create a quanteda corpus

master_vector <- c(Twitter, Blogs, News)
corp <- corpus(master_vector)
saveRDS(corp, "corp.RDS")
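
As a sketch of the next step, the helpers above could be combined to build the unigrams, bigrams, and trigrams; texts() is the pre-1.0 quanteda accessor matching the API used above, and the object names are hypothetical:

unigrams <- parallelizeTask(makeTokens, texts(corp), 1)
bigrams  <- parallelizeTask(makeTokens, texts(corp), 2)
trigrams <- parallelizeTask(makeTokens, texts(corp), 3)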