The original data, as obtained from Coursera-SwiftKey, contains many irregularities that need to be addressed before it is ready for exploratory analysis or modeling. For example, the data contains email addresses, http/https URLs, emojis, contractions such as “don’t”, mixed upper/lower case, and symbols such as & and @, all of which must be removed, replaced, or expanded (e.g., “don’t” expands into “do not”) before ngrams are created.
We apply the following transformations, mostly via the gsub() function, to the character vectors obtained from readLines() on our text files, in the exact sequence described below:
turns numbers into an identifier, NNUMM, so that when forming ngrams we can discard those formed across such numbers; ngrams that span a number are incorrect (see the sketch after this list)
turns ?, ! and . into an end-of-sentence identifier, EEOSS, so that when forming ngrams we can discard those that span sentence boundaries and would combine words from more than one sentence
turns abbreviations such as H.S.B.C. into an identifier, AABRR, so that when forming ngrams we can discard those formed on the edges of such abbreviations, which are incorrect from an information perspective
expands haven’t to have not, and hadn’t to had not
removes email addresses and http/https URLs
removes retweet markers, @-mentions, and Twitter usernames
removes unit tokens such as g, mg, lbs, etc., and all single letters except “a” and “i”
removes punctuation
Note that this removes hashtags from tweets. I examined the data, and most often a hashtag precedes a word with general meaning rather than a person’s name. If it were the latter, it would be logical to remove the entire word preceded by the hashtag. However, that strategy would lose meaning in the sentences and could hurt the predictive power of the model. In addition, hashtagged words without a general meaning should not repeat often, so they will carry low weight when calibrating the model via the ngrams. Result: remove only the hashtag character, not the word that carries it.
replaces ’m, ’s, ’re, ’ll, and all other contractions using the textclean package
replaces @ with at and & with and
removes the remaining possessive ’s
removes profanity words
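To make the marker idea concrete, here is a minimal sketch of discarding ngrams that touch NNUMM, EEOSS, or AABRR; the helper name and the sample ngrams are invented for illustration, and the actual filtering happens later in the pipeline:
# keep only ngrams that contain none of the marker tokens
dropMarkerNgrams <- function(ngrams) {
  ngrams[!grepl("NNUMM|EEOSS|AABRR", ngrams)]
}
dropMarkerNgrams(c("won NNUMM games", "games EEOSS he", "the next day"))
## [1] "the next day"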
library(stringi)    # stri_stats_general()
library(textclean)  # replace_contraction()
library(tm)         # removeWords()
library(quanteda)   # corpus(), tokenize(), removeFeatures()
filePath <- file.path(".", "en_US")
filePath
## [1] "./en_US"
dir(filePath)
## [1] "en_US.blogs.txt" "en_US.news.txt" "en_US.twitter.txt"
# read the blogs file and compute basic text statistics
con1 <- file("./en_US/en_US.blogs.txt", "r")
blogs <- readLines(con1, skipNul=TRUE, encoding="UTF-8")
close(con1)
stri_stats_general(blogs)
## Lines LinesNEmpty Chars CharsNWhite
## 899288 899288 206824382 170389539
# read the news file and compute basic text statistics
con2 <- file("./en_US/en_US.news.txt", "r")
news <- readLines(con2, skipNul=TRUE, encoding="UTF-8")
## Warning in readLines(con2, skipNul = TRUE, encoding = "UTF-8"): incomplete
## final line found on './en_US/en_US.news.txt'
close(con2)
stri_stats_general(news)
## Lines LinesNEmpty Chars CharsNWhite
## 77259 77259 15639408 13072698
# read the twitter file and compute basic text statistics
con3 <- file("./en_US/en_US.twitter.txt", "r")
twitter <- readLines(con3, encoding="UTF-8", skipNul = TRUE)
close(con3)
stri_stats_general(twitter)
## Lines LinesNEmpty Chars CharsNWhite
## 2360148 2360148 162096241 134082806
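# sample 10% of each source to keep the working set small; calling set.seed()
# beforehand would make the sampling reproducible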
sampleHolderTwitter <- sample(length(twitter), length(twitter) * 0.1)
sampleHolderBlog <- sample(length(blogs), length(blogs) * 0.1)
sampleHolderNews <- sample(length(news), length(news) * 0.1)
US_Twitter_Sample <- twitter[sampleHolderTwitter]
US_Blogs_Sample <- blogs[sampleHolderBlog]
US_News_Sample <- news[sampleHolderNews]
rm(blogs)
rm(news)
rm(twitter)
# turn numbers into the identifier NNUMM: ngrams formed across such numbers
# are incorrect, and the marker lets us discard them later
Blogs <- gsub("[0-9]+"," NNUMM ",US_Blogs_Sample)
News <- gsub("[0-9]+"," NNUMM ",US_News_Sample)
Twitter <- gsub("[0-9]+"," NNUMM ",US_Twitter_Sample)
Blogs <- gsub("\\? |\\?$|\\! |\\!$ |\\. |\\.$", " EEOSS ", Blogs)
News <- gsub("\\? |\\?$|\\! |\\!$ |\\. |\\.$", " EEOSS ", News)
Twitter <- gsub("\\? |\\?$|\\! |\\!$ |\\. |\\.$", " EEOSS ", Twitter)
Blogs <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", Blogs)
News <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", News)
Twitter <- gsub("[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\.[A-Za-z]\\. |[A-Za-z]\\.[A-Za-z]\\. ", " AABRR ", Twitter)
Blogs <- gsub("haven't", "have not", Blogs)
Blogs <- gsub("hadn't", "had not", Blogs)
News <- gsub("haven't", "have not", News)
News <- gsub("hadn't", "had not", News)
Twitter <- gsub("haven't", "have not", Twitter)
Twitter <- gsub("hadn't", "had not", Twitter)
Blogs <- gsub("-|/|\\S+@\\S+|[Hh}ttp([^ ]+)", " ", Blogs)
News <- gsub("-|/|\\S+@\\S+|[Hh}ttp([^ ]+)", " ", News)
Twitter <- gsub("-|/|\\S+@\\S+|[Hh}ttp([^ ]+)", " ", Twitter)
# Regex "RT | via" for retweets
#Regex "@([^ ]+)") for people
# regex "[@][a - zA - Z0 - 9_]{1,15}") for usernames
# Regex [^\x01-\x7F] to remove emoji
Twitter <- gsub("RT | via |@([^ ]+)|[@][a - zA - Z0 - 9_]{1,15}|[^\x01-\x7F]", " ", Twitter)
rem <- function(vectText){
  vectText <- gsub(" [1-9]+g ", " ", vectText)   # grams
  vectText <- gsub(" [1-9]+mg ", " ", vectText)  # milligrams, etc.
  vectText <- gsub(" [1-9]+kg ", " ", vectText)  # kilograms
  vectText <- gsub(" [1-9]+lbs ", " ", vectText) # pounds
  vectText <- gsub(" [1-9]+s ", " ", vectText)   # seconds, etc.
  vectText <- gsub(" [1-9]+m ", " ", vectText)   # minutes/meters
  vectText <- gsub(" [1-9]+h ", " ", vectText)   # hours
  # stray unit tokens left behind after numbers were replaced by NNUMM
  vectText <- gsub(" +g ", " ", vectText)
  vectText <- gsub(" +mg ", " ", vectText)
  vectText <- gsub(" +kg ", " ", vectText)
  vectText <- gsub(" +lbs ", " ", vectText)
  vectText <- gsub(" +s ", " ", vectText)
  vectText <- gsub(" +m ", " ", vectText)
  vectText <- gsub(" +h ", " ", vectText)
  vectText <- gsub(" [b-hj-z] ", " ", vectText)  # all single letters except a and i
  vectText
}
Blogs <- rem(Blogs)
News <- rem(News)
Twitter <- rem(Twitter)
Blogs <- gsub(""|"|'|'", "", Blogs)
News <- gsub(""|"|'|'", "", News)
Twitter <- gsub(""|"|'|'", "", Twitter)
Blogs <- replace_contraction(Blogs)
News <- replace_contraction(News)
Twitter <- replace_contraction(Twitter)
Blogs <- gsub("@", "at", Blogs)
Blogs <- gsub("&", "and", Blogs)
News <- gsub("@", "at", News)
News <- gsub("&", "and", News)
Twitter <- gsub("@", "at", Twitter)
Twitter <- gsub("&", "and", Twitter)
Blogs <- gsub("'s", "", Blogs)
News <- gsub("'s", "", News)
Twitter <- gsub("'s", "", Twitter)
profanityFileName <- "profanity.txt"
profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
Blogs <- removeWords(Blogs, profanity$V1)
News <- removeWords(News, profanity$V1)
Twitter <- removeWords(Twitter, profanity$V1)
Blogs[1:5]
## [1] "One thing that has gone on around the old house place was this weekends festivities EEOSS A few months back, The Boss and I were riding around after we had our breakfast at the Egg and I and we were talking about the holiday events where we dote over our girls EEOSS Then we got to talking about our towheaded sons in law EEOSS I thought and eventually spoke out loud EEOSS "
## [2] "Rousseaus attempt to justify every form of dominance by the group over individuals then appears as an inherently patriarchal and gendered sort of dominance and we might imagine what sort of practical policies this might lead to (men must not be sissified, or the state will be weak EEOSS women must not deny their femininity, or we will run out of babies!) EEOSS "
## [3] "It seems like the whole CJ system is an administrative blip. EEOSS "
## [4] "Beckley Alderson Lewisburg Former WWII detention camps that are now converted into active federal prison complexes capable of holding several times their current populations EEOSS Alderson is presently a womens federal reformatory EEOSS "
## [5] "Two days after, I did write the show a letter
I guess my wanting to desperately meet my family took the overhand !"
News[1:5]
## [1] " Experience: Avalos has lived in San Francisco NNUMM years, NNUMM of them in District NNUMM EEOSS He worked as a legislative aide to Supervisor Chris Daly from January NNUMM to July NNUMM EEOSS Previously he was an organizer and political coordinator for the Service Employees International Union and the director of organizing for Coleman Advocates, a child advocacy organization EEOSS He is president of the San Francisco People Organization EEOSS "
## [2] "Stoudemire finished with NNUMM points and seven rebounds, including NNUMM points in the second half EEOSS He took six shots over the final two quarters compared to eight for Anthony EEOSS "
## [3] "Curtis won $ NNUMM , NNUMM , NNUMM and a two year tour exemption a more meaningful reward after being relegated to a status so low that this victory came in just the fourth PGA Tour event he managed to get into this year EEOSS "
## [4] "Politics: Hale is treasurer of Sen EEOSS Cardin campaign committee, a position he has held since early NNUMM EEOSS "
## [5] "\"I saw Jeff on the left side of the field, so I started running down the field and there was no one on me,\" Reitz said EEOSS \"I did not think it was going in, but it went in.\""
Twitter[1:5]
## [1] "Watch for the new mixtape \"Sprinkle of Greatness\" Coming in Jan ' NNUMM EEOSS With Tex killin it on every track EEOSS Free Tex EEOSS Free Tex!"
## [2] "I did not send it to Twitter just now.. EEOSS Idk how it got there now"
## [3] "I wana go home EEOSS I have a lot of things to do EEOSS "
## [4] "I F**KING LOVE, CHIPTUNE MUSIC!! EEOSS =D"
## [5] "Never miss an opportunity to tell someone how much theyean to you EEOSS they might not be here tomorrow EEOSS "
The texts look as we expected. We will do another round of cleaning when tokenizing.
# Generic function for parallelizing any task (when possible)
library(parallel)
library(doParallel)  # provides registerDoParallel()
parallelizeTask <- function(task, ...) {
  # calculate the number of cores, leaving one free
  ncores <- detectCores() - 1
  # initiate the cluster
  cl <- makeCluster(ncores)
  registerDoParallel(cl)
  # run the task, stop the cluster, and return the result
  r <- task(...)
  stopCluster(cl)
  r
}
# Returns a vector of profanity words, downloading the list if necessary
getProfanityWords <- function() {
  profanityFileName <- "profanity.txt"
  if (!file.exists(profanityFileName)) {
    profanity.url <- "https://raw.githubusercontent.com/shutterstock/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/master/en"
    download.file(profanity.url, destfile = profanityFileName, method = "curl")
  }
  profanity <- read.csv(profanityFileName, header = FALSE, stringsAsFactors = FALSE)
  profanity <- profanity$V1
  profanity <- profanity[1:(length(profanity) - 1)]  # drop the trailing entry
  profanity
}
makeTokens <- function(input, n = 1L) {
  # tokenize() and removeFeatures() are from the older quanteda API
  # (superseded by tokens() and tokens_remove() in current releases)
  output <- tokenize(input, what = "word", removeNumbers = TRUE,
                     removePunct = TRUE, removeSeparators = TRUE,
                     removeTwitter = TRUE, removeHyphens = TRUE,
                     ngrams = n, simplify = TRUE)
  # drop tokens that appear on the profanity list
  output <- removeFeatures(output, getProfanityWords())
  output
}
# combine the three cleaned samples, build a quanteda corpus, and save it
master_vector <- c(Twitter, Blogs, News)
corp <- corpus(master_vector)
saveRDS(corp, "corp.RDS")
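To tie the pieces together, here is a minimal sketch of how the parallel wrapper and the tokenizer might be combined to produce the ngram sets used for modeling; the variable names are illustrative and this step is not executed here:
# illustrative next step: build unigram, bigram, and trigram tokens from the corpus
tokens1 <- parallelizeTask(makeTokens, corp, 1)
tokens2 <- parallelizeTask(makeTokens, corp, 2)
tokens3 <- parallelizeTask(makeTokens, corp, 3)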