Introduction

The objective of this report is to load, clean and explore the input data that will later be used to predict the next word a user will type, based on the words typed so far.

The data consist of a collection of tweets from Twitter, blog posts and news articles obtained through web crawling by HC Corpora.

This report is part of the Data Science Specialization Capstone project from Coursera, in collaboration with SwiftKey.

Input data description

The input data are downloaded from the provided URL and consist of three text files for each of four languages (English, Russian, German and Finnish). In this report only the English files are studied. Each file contains data extracted from one source of information:

  • Twitter
  • Blogs
  • News
# Script inputs:
urlRawData <-                                   # URL where raw data is located
  "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
rawDataPath <- "./0. Data"                  # Path to store & unzip raw data
rawDataFile <- paste0(rawDataPath,              # Name of the file with raw data
                      "/Coursera-SwiftKey-rawData.zip")
deDEPath <- paste0(rawDataPath,
                   "/final/de_DE/")         # Expected path for German data
enUSPath <- paste0(rawDataPath,
                   "/final/en_US/")         # Expected path for English data
fiFIPath <- paste0(rawDataPath,
                   "/final/fi_FI/")         # Expected path for Finnish data
ruRUPath <- paste0(rawDataPath,
                   "/final/ru_RU/" )        # Expected path for Russian data


# Initial checks
if (!file.exists(rawDataPath)) {
  dir.create(rawDataPath) 
}

# Step 1: Downloading the data (binary mode so the zip file is not corrupted on Windows)
download.file(urlRawData, rawDataFile, mode = "wb")
downloadDate <- date()

# Step 2: Unzipping and loading it in R
unzip(rawDataFile,exdir = rawDataPath)

# Step 3: Reading lines of the data
enUStwitter <- readLines(paste0(enUSPath,list.files(enUSPath,"\\.twitter.")),
                          encoding = "UTF-8", skipNul = TRUE)
enUSblogs <- readLines(paste0(enUSPath,list.files(enUSPath,"\\.blogs.")),
                          encoding = "UTF-8", skipNul = TRUE)
con <- file(paste0(enUSPath,list.files(enUSPath,"\\.news.")), open = "rb",
            encoding = "UTF-8")
enUSnews <- readLines(con)
close(con)

The following table provides basic information about each file:

# Summary of the three files
twitter <- file.info(paste0(enUSPath,list.files(enUSPath,"\\.twitter.")))$size / (1024^2)
blogs <- file.info(paste0(enUSPath,list.files(enUSPath,"\\.blogs.")))$size / (1024^2)
news <- file.info(paste0(enUSPath,list.files(enUSPath,"\\.news.")))$size / (1024^2)
fSize <- c(twitter,blogs,news)

# Line lengths in characters (nchar is vectorised, so no loop is needed)
twitter <- nchar(enUStwitter)
blogs <- nchar(enUSblogs)
news <- nchar(enUSnews)
maxLength <- c(max(twitter), max(blogs), max(news))
minLength <- c(min(twitter), min(blogs), min(news))
nbrLines <- c(length(twitter), length(blogs), length(news))

twitter <- sum(sapply(strsplit(enUStwitter,"\\s+"), length))
blogs <- sum(sapply(strsplit(enUSblogs,"\\s+"), length))
news <- sum(sapply(strsplit(enUSnews,"\\s+"), length))

nbrWords <- c(twitter,blogs,news)
source <- c("Twitter", "Blogs", "News")
att <- c("Source", "Size (MB)", "Nr. Words", "Nr. Lines", "Min. Line Length", "Max. Line Length")

fileDescription <- data.frame(source, fSize, nbrWords, nbrLines, minLength, maxLength)
colnames(fileDescription) <- att
kable(fileDescription)
Source     Size (MB)   Nr. Words   Nr. Lines   Min. Line Length   Max. Line Length
Twitter     159.3641    30373583     2360148                  2                140
Blogs       200.4242    37334131      899288                  1              40833
News        196.2775    34372598     1010242                  1              11384

Cleaning data

This section focuses on cleaning and preparing the data, and on deleting what is not required for this use case:

  • Profanities.
  • Icons such as emojis.
  • Email addresses and webpages.

Finally, only a sample of the three objects is analyzed, to keep the computational load manageable.

# Profanities
profanities <- c("4r5e","5h1t","5hit","a55","anal","anus","ar5e","arrse","arse",
                 "ass","ass-fucker","asses","assfucker","assfukka","asshole",
                 "assholes","asswhole","a_s_s","b!tch","b00bs","b17ch","b1tch",
                 "ballbag","balls","ballsack","bastard","beastial","beastiality",
                 "bellend","bestial","bestiality","bi+ch","biatch","bitch",
                 "bitcher","bitchers","bitches","bitchin","bitching","bloody",
                 "blow job","blowjob","blowjobs","boiolas","bollock","bollok",
                 "boner","boob","boobs","booobs","boooobs","booooobs","booooooobs",
                 "breasts","buceta","bugger","bum","bunny fucker","butt","butthole",
                 "buttmuch","buttplug","c0ck","c0cksucker","carpet muncher",
                 "cawk","chink","cipa","cl1t","clit","clitoris","clits","cnut",
                 "cock","cock-sucker","cockface","cockhead","cockmunch",
                 "cockmuncher","cocks","cocksuck","cocksucked","cocksucker",
                 "cocksucking","cocksucks","cocksuka","cocksukka","cok",
                 "cokmuncher","coksucka","coon","cox","crap","cum","cummer",
                 "cumming","cums","cumshot","cunilingus","cunillingus",
                 "cunnilingus","cunt","cuntlick","cuntlicker","cuntlicking",
                 "cunts","cyalis","cyberfuc","cyberfuck","cyberfucked",
                 "cyberfucker","cyberfuckers","cyberfucking","d1ck","damn",
                 "dick","dickhead","dildo","dildos","dink","dinks","dirsa",
                 "dlck","dog-fucker","doggin","dogging","donkeyribber","doosh",
                 "duche","dyke","ejaculate","ejaculated","ejaculates",
                 "ejaculating","ejaculatings","ejaculation","ejakulate",
                 "f u c k","f u c k e r","f4nny","fag","fagging","faggitt",
                 "faggot","faggs","fagot","fagots","fags","fanny","fannyflaps",
                 "fannyfucker","fanyy","fatass","fcuk","fcuker","fcuking","feck",
                 "fecker","felching","fellate","fellatio","fingerfuck",
                 "fingerfucked","fingerfucker","fingerfuckers","fingerfucking",
                 "fingerfucks","fistfuck","fistfucked","fistfucker","fistfuckers",
                 "fistfucking","fistfuckings","fistfucks","flange","fook","fooker",
                 "fuck","fucka","fucked","fucker","fuckers","fuckhead","fuckheads",
                 "fuckin","fucking","fuckings","fuckingshitmotherfucker","fuckme",
                 "fucks","fuckwhit","fuckwit","fudge packer","fudgepacker","fuk",
                 "fuker","fukker","fukkin","fuks","fukwhit","fukwit","fux","fux0r",
                 "f_u_c_k","gangbang","gangbanged","gangbangs","gaylord","gaysex",
                 "goatse","God","god-dam","god-damned","goddamn","goddamned",
                 "hardcoresex","hell","heshe","hoar","hoare","hoer","homo","hore",
                 "horniest","horny","hotsex","jack-off","jackoff","jap","jerk-off",
                 "jism","jiz","jizm","jizz","kawk","knob","knobead","knobed",
                 "knobend","knobhead","knobjocky","knobjokey","kock","kondum",
                 "kondums","kum","kummer","kumming","kums","kunilingus","l3i+ch",
                 "l3itch","labia","lmfao","lust","lusting","m0f0","m0fo",
                 "m45terbate","ma5terb8","ma5terbate","masochist","master-bate",
                 "masterb8","masterbat*","masterbat3","masterbate","masterbation",
                 "masterbations","masturbate","mo-fo","mof0","mofo","mothafuck",
                 "mothafucka","mothafuckas","mothafuckaz","mothafucked",
                 "mothafucker","mothafuckers","mothafuckin","mothafucking",
                 "mothafuckings","mothafucks","mother fucker","motherfuck",
                 "motherfucked","motherfucker","motherfuckers","motherfuckin",
                 "motherfucking","motherfuckings","motherfuckka","motherfucks",
                 "muff","mutha","muthafecker","muthafuckker","muther",
                 "mutherfucker","n1gga","n1gger","nazi","nigg3r","nigg4h",
                 "nigga","niggah","niggas","niggaz","nigger","niggers","nob",
                 "nob jokey","nobhead","nobjocky","nobjokey","numbnuts",
                 "nutsack","orgasim","orgasims","orgasm","orgasms","p0rn",
                 "pawn","pecker","penis","penisfucker","phonesex","phuck",
                 "phuk","phuked","phuking","phukked","phukking","phuks","phuq",
                 "pigfucker","pimpis","piss","pissed","pisser","pissers","pisses",
                 "pissflaps","pissin","pissing","pissoff","poop","porn","porno",
                 "pornography","pornos","prick","pricks","pron","pube","pusse",
                 "pussi","pussies","pussy","pussys","rectum","retard","rimjaw",
                 "rimming","s hit","s.o.b.","sadist","schlong","screwing","scroat",
                 "scrote","scrotum","semen","sex","sh!+","sh!t","sh1t","shag",
                 "shagger","shaggin","shagging","shemale","shi+","shit","shitdick",
                 "shite","shited","shitey","shitfuck","shitfull","shithead",
                 "shiting","shitings","shits","shitted","shitter","shitters",
                 "shitting","shittings","shitty","skank","slut","sluts","smegma",
                 "smut","snatch","son-of-a-bitch","spac","spunk","s_h_i_t",
                 "t1tt1e5","t1tties","teets","teez","testical","testicle","tit",
                 "titfuck","tits","titt","tittie5","tittiefucker","titties",
                 "tittyfuck","tittywank","titwank","tosser","turd","tw4t","twat",
                 "twathead","twatty","twunt","twunter","v14gra","v1gra","vagina",
                 "viagra","vulva","w00se","wang","wank","wanker","wanky","whoar",
                 "whore","willies","willy","xrated","xxx")

  # Starting to copy raw data:
  tidyEnUStwitter <- enUStwitter
  tidyEnUSblogs <- enUSblogs
  tidyEnUSnews <- enUSnews
  
  # Emoticons typed by users, such as "<3", are replaced by a space
  tidyEnUStwitter <- gsub("<3"," ",tidyEnUStwitter)
  tidyEnUSblogs <- gsub("<3"," ",tidyEnUSblogs)
  tidyEnUSnews <- gsub("<3"," ",tidyEnUSnews)
  
  # Symbol "<" will make HTMLParse to fail so it will be changed by " "
  # Except at the beginning that any space or tab will be deleted as well
  # tidyEnUStwitter[grep("(^[ <>\t]+|[<>]+)",tidyEnUStwitter)]
  tidyEnUStwitter <- gsub("(^[ <>\t]+|[<>]+)","",tidyEnUStwitter)
  # tidyEnUStwitter[grep("(^[ <>\t]+|[<>]+)",tidyEnUStwitter)]
  
  # tidyEnUSblogs[grep("(^[ <>\t]+|[<>]+)",tidyEnUSblogs)]
  tidyEnUSblogs <- gsub("(^[ <>\t]+|[<>]+)","",tidyEnUSblogs)
  # tidyEnUSblogs[grep("(^[ <>\t]+|[<>]+)",tidyEnUSblogs)]
  
  # tidyEnUSnews[grep("(^[ <>\t]+|[<>]+)",tidyEnUSnews)]
  tidyEnUSnews <- gsub("(^[ <>\t]+|[<>]+)","",tidyEnUSnews)
  # tidyEnUSnews[grep("(^[ <>\t]+|[<>]+)",tidyEnUSnews)]
  
  # Decode HTML entities (e.g. "&amp;") by parsing each line as HTML
  for (i in 1:length(tidyEnUStwitter)){
    if (!is.na(tidyEnUStwitter[i])){
      tidyEnUStwitter[i] <- xpathApply(htmlParse(tidyEnUStwitter[i],asText = TRUE,
                                                 encoding =  'UTF-8'),
                                       "//body//text()",xmlValue)[[1]]  
    }
  }
  
  for (i in 1:length(tidyEnUSblogs)){
    if (!is.na(tidyEnUSblogs[i])){
      tidyEnUSblogs[i] <- xpathApply(htmlParse(tidyEnUSblogs[i],asText = TRUE,
                                               encoding =  'UTF-8'),
                                     "//body//text()",xmlValue)[[1]]  
    }
  }
  
  for (i in 1:length(tidyEnUSnews)){
    if (!is.na(tidyEnUSnews[i])){
      tidyEnUSnews[i] <- xpathApply(htmlParse(tidyEnUSnews[i],asText = TRUE,
                                               encoding =  'UTF-8'),
                                     "//body//text()",xmlValue)[[1]]  
    }
  }
  
  # After HTML parsing, lines such as 2255969 and 2273597 in the Twitter data
  # are correctly decoded, and additional unwanted punctuation symbols or
  # emoticons appear in the data. These non-meaningful characters are now deleted:
  
  tidyEnUStwitter <- gsub("<3"," ",tidyEnUStwitter)
  tidyEnUSblogs <- gsub("<3"," ",tidyEnUSblogs)
  tidyEnUSnews <- gsub("<3"," ",tidyEnUSnews)
  
  # Expansion of common English contractions:
  # 'm  -> am
  # n't -> not
  # 've -> have
  # 'll -> will
  # "-" is replaced by a space
  # 'd (would/had) and 's (is/possessive) are ambiguous and are not expanded here
  
  tidyEnUStwitter <- gsub("'m"," am",tidyEnUStwitter)
  tidyEnUStwitter <- gsub("n't"," not",tidyEnUStwitter)
  tidyEnUStwitter <- gsub("'ve"," have",tidyEnUStwitter)
  tidyEnUStwitter <- gsub("'ll"," will",tidyEnUStwitter)
  tidyEnUStwitter <- gsub("-"," ",tidyEnUStwitter)

  tidyEnUSblogs <- gsub("'m"," am",tidyEnUSblogs)
  tidyEnUSblogs <- gsub("n't"," not",tidyEnUSblogs)
  tidyEnUSblogs <- gsub("'ve"," have",tidyEnUSblogs)
  tidyEnUSblogs <- gsub("'ll"," will",tidyEnUSblogs)
  tidyEnUSblogs <- gsub("-"," ",tidyEnUSblogs)
  
  tidyEnUSnews <- gsub("'m"," am",tidyEnUSnews)
  tidyEnUSnews <- gsub("n't"," not",tidyEnUSnews)
  tidyEnUSnews <- gsub("'ve"," have",tidyEnUSnews)
  tidyEnUSnews <- gsub("'ll"," will",tidyEnUSnews)
  tidyEnUSnews <- gsub("-"," ",tidyEnUSnews)

  # Finally, URLs and email addresses are removed
  tidyEnUStwitter <- gsub("https?://[^[:space:]]+","",tidyEnUStwitter)
  tidyEnUSblogs <- gsub("https?://[^[:space:]]+","",tidyEnUSblogs)
  tidyEnUSnews <- gsub("https?://[^[:space:]]+","",tidyEnUSnews)
  
  tidyEnUStwitter <- gsub("www\\.[[:alnum:]._-]+\\.[[:alnum:]._-]+","",tidyEnUStwitter)
  tidyEnUSblogs <- gsub("www\\.[[:alnum:]._-]+\\.[[:alnum:]._-]+","",tidyEnUSblogs)
  tidyEnUSnews <- gsub("www\\.[[:alnum:]._-]+\\.[[:alnum:]._-]+","",tidyEnUSnews)
  
  tidyEnUStwitter <- gsub("[[:alnum:]._-]+@[[:alnum:].-]+","",tidyEnUStwitter)
  tidyEnUSblogs <- gsub("[[:alnum:]._-]+@[[:alnum:].-]+","",tidyEnUSblogs)
  tidyEnUSnews <- gsub("[[:alnum:]._-]+@[[:alnum:].-]+","",tidyEnUSnews)
  
  # Sampling each source and joining them, due to the huge processing time required:
  set.seed(1121983)
  iTwitter <- sample(1:length(tidyEnUStwitter),size = 50000, replace = FALSE)
  set.seed(1121983)
  iBlogs <- sample(1:length(tidyEnUSblogs),size = 50000, replace = FALSE)
  set.seed(1121983)
  iNews <- sample(1:length(tidyEnUSnews),size = 50000, replace = FALSE)
  
  text <- c(tidyEnUStwitter[iTwitter],tidyEnUSblogs[iBlogs],tidyEnUSnews[iNews])
  
  vsText <- VectorSource(text) 
  docs <- Corpus(vsText)
  # inspect(head(docs))
  
  docsClean <- tm_map(docs, content_transformer(function(x) 
                                               iconv(x, to="UTF-8", sub="byte")))
  docsClean <- tm_map(docsClean, 
                      content_transformer(function(x) 
                                          stri_replace_all_fixed(x, "\t", " ")))
  docsClean <- tm_map(docsClean, content_transformer(tolower))
  docsClean <- tm_map(docsClean, removeNumbers)
  docsClean <- tm_map(docsClean, removePunctuation)
  docsClean <- tm_map(docsClean, removeWords, profanities)
  docsClean <- tm_map(docsClean, stripWhitespace)
  docsAnalyze <- tm_map(docsClean, removeWords, stopwords("english"))
  docsAnalyze <- tm_map(docsAnalyze,stemDocument, language="english")

Exploratory analysis

Even though stopwords such as “I”, “me” and “a”, which are very common in English, are important for our use case, this analysis excludes them: they would otherwise dominate the list of most common words, and it is more interesting to see the topics and content words that are actually used.
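
As a quick illustration of which words are excluded (a minimal sketch; the exact list and its length depend on the installed tm version), the stopword list that is later passed to removeWords can be inspected directly:

  library(tm)

  # First entries of the English stopword list removed later with removeWords()
  head(stopwords("english"), 10)
  # e.g. "i" "me" "my" "myself" "we" "our" "ours" "ourselves" "you" "your"
  length(stopwords("english"))   # around 170 entries in tm's Snowball-based list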

Stemming must also be avoided in the prediction algorithm itself, since it needs the complete word form (for instance singular or plural) depending on the previous words. For the analysis of the most frequent terms, however, stemming is applied so that different forms of the same word are counted together and the analysis remains meaningful.
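
For reference, here is a minimal sketch of what stemming does, using SnowballC (the stemmer behind tm's stemDocument); the exact stems depend on the Snowball implementation shipped with the package:

  library(SnowballC)

  # Different surface forms are collapsed to a common root,
  # so they are counted together in the frequency analysis
  wordStem(c("run", "runs", "running"), language = "english")
  # -> "run" "run" "run"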

Finally, this section explores the most common combinations of two words (bigrams) and three words (trigrams) in the data.
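
To make the terminology concrete, here is a small sketch of the RWeka tokenizer used in the bigram and trigram chunks below, applied to a toy sentence:

  library(RWeka)

  # All consecutive pairs of words in the sentence
  NGramTokenizer("the quick brown fox", Weka_control(min = 2, max = 2))
  # -> "the quick" "quick brown" "brown fox"

  # All consecutive triples of words
  NGramTokenizer("the quick brown fox", Weka_control(min = 3, max = 3))
  # -> "the quick brown" "quick brown fox"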

Terms or word combinations that are rare in the data are discarded to avoid memory issues. This also prevents the analysis from being cluttered with idiosyncratic terms or foreign words that may be present in the data.
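
This pruning is done with the bounds option of TermDocumentMatrix, as in the chunks below; the following is a minimal sketch on a toy corpus (the threshold of 2 documents is only illustrative):

  library(tm)

  toy <- Corpus(VectorSource(c("good morning", "good night", "rare word")))

  # Keep only terms that appear in at least 2 of the 3 documents: "good"
  # survives and the rarer terms are dropped, which is what the larger
  # thresholds used below (400, 80 and 20) do at full scale
  tdmToy <- TermDocumentMatrix(toy,
                               control = list(bounds = list(global = c(2, Inf))))
  inspect(tdmToy)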

Frequently used words

Wordclouds and plots are used to identify the most common words used in the sample data.

 # UNIGRAMS
  docsTDMAnalyze <- TermDocumentMatrix(docsAnalyze, control=list(bounds = list(global = c(400,Inf))))
  # Analyzing data after stemming and deleting stop words
  m <- as.matrix(docsTDMAnalyze)
  v <- sort(rowSums(m),decreasing=TRUE)
  d <- data.frame(Word = names(v),Frequency=v)
  d$Word <- factor(d$Word, levels = d$Word[order(d$Frequency)])
  # Wordcloud
  pal <- brewer.pal(9, "BuGn") 
  pal <- pal[-(1:4)] 
  wordcloud(d$Word, d$Frequency, min.freq=4000, colors = pal)

  # 20 most common unigrams
  g <- ggplot(d[1:20,], aes(x = Word, y = Frequency)) + geom_bar(stat = "identity")
  g <- g + coord_flip() + ggtitle("The 20 words most frequently used")
  g

  # Number of distinct words needed to cover 50% and 90% of all word instances
  cum <- cumsum(d$Frequency)
  cut50 <- 0.5 * sum(d$Frequency)
  words50 <- which(cum > cut50)[1]
  cut90 <- 0.9 * sum(d$Frequency)
  words90 <- which(cum > cut90)[1]
  rm(m,v,d)

Since a small set of words is far more common than the rest, 50% of the word instances in the text are covered by just 215 words, while 90% of the instances require 825 of the 1114 words retained in total.

Frequently used bigrams

Wordclouds and plots are used to identify the most common combinations of two words in the data.

  # BIGRAMS
  BigramTokenizer <- function(x) {
    NGramTokenizer(x, Weka_control(min = 2, max = 2))
  }
  docsTDMAnalyzeBigrams <- TermDocumentMatrix(docsAnalyze, 
                                              control = list(tokenize = BigramTokenizer,
                                                             bounds = list(global = c(80,Inf))))
  
  # Analyzing data after stemming and deleting stop words
  m2 <- as.matrix(docsTDMAnalyzeBigrams)
  v2 <- sort(rowSums(m2),decreasing=TRUE)
  d2 <- data.frame(Bigram = names(v2),Frequency=v2)
  d2$Bigram <- factor(d2$Bigram, levels = d2$Bigram[order(d2$Frequency)])
  
  # Wordcloud
  pal <- brewer.pal(9, "BuGn") 
  pal <- pal[-(1:4)] 
  wordcloud(d2$Bigram, d2$Frequency, min.freq=600, colors = pal)

  # 20 most common bigrams
  g <- ggplot(d2[1:20,], aes(x = Bigram, y = Frequency)) + geom_bar(stat = "identity")
  g <- g + coord_flip() + ggtitle("The 20 bigrams most frequently used")
  g

  rm(m2,v2,d2)

Frequently used trigrams

Wordclouds and plots are used to identify the most common combinations of three words in the data.

  # TRIGRAMS
  TrigramTokenizer <- function(x){
    NGramTokenizer(x, Weka_control(min = 3, max = 3))
  } 
  docsTDMAnalyzeTrigrams <- TermDocumentMatrix(docsAnalyze, control = list(tokenize = TrigramTokenizer,
                                                                           bounds = list(global = c(20,Inf))))
  
  # Analyzing data after stemming and deleting stop words
  m3 <- as.matrix(docsTDMAnalyzeTrigrams)
  v3 <- sort(rowSums(m3),decreasing=TRUE)
  d3 <- data.frame(Trigram = names(v3),Frequency=v3)
  d3$Trigram <- factor(d3$Trigram, levels = d3$Trigram[order(d3$Frequency)])
  
  # Wordcloud
  pal <- brewer.pal(9, "BuGn") 
  pal <- pal[-(1:4)] 
  wordcloud(d3$Trigram, d3$Frequency, min.freq=70, colors = pal)
## Warning in wordcloud(d3$Trigram, d3$Frequency, min.freq = 70, colors =
## pal): presid barack obama could not be fit on page. It will not be plotted.

  # 20 most common trigrams
  g <- ggplot(d3[1:20,], aes(x = Trigram, y = Frequency)) + geom_bar(stat = "identity") 
  g <- g + coord_flip() + ggtitle("The 20 trigrams most frequently used")
  g

  rm(m3,v3,d3)

Conclusions and Next Steps

It is important to highlight the following challenges already noticeable in this phase:

  • Limited computing power on a personal laptop.
  • Limited memory available for vector allocation.
  • The “tm” package has an issue with its n-gram tokenization in the latest version (0.7-1); version 0.6-2 worked well.

Even though computational limitations forced the removal of infrequent words, this process also helped to identify words that do not belong to the language, contain typos, or are personal terms known only to the author. Such terms are out of scope for this application, which is intended to be general and used by many people; however, personalizing the application for an individual user is a good option when such data are available.

Stemming considerably reduces the number of distinct words. One way to shrink the dictionary would therefore be to store words in their root form and then, based on the previous words, generate the correct surface form (for instance singular or plural). However, this would make the prediction algorithm more complicated.

The next steps consist of selecting a suitable prediction model; the first approach will keep stopwords and will not apply stemming. This processing will be redone with the quanteda package, as the performance of the tm-based approach was simply unacceptable.
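
As an orientation for that next step, the sketch below shows the equivalent tokenization and n-gram counting with quanteda (function names as in current quanteda releases; this is not the final implementation):

  library(quanteda)

  toy <- c("I am happy to see you", "happy to see you again")

  # Tokenize keeping stopwords and without stemming, as planned for the model
  toks <- tokens(toy, remove_punct = TRUE, remove_numbers = TRUE)
  toks <- tokens_tolower(toks)

  # Count bigrams via a document-feature matrix
  bigrams <- tokens_ngrams(toks, n = 2, concatenator = " ")
  topfeatures(dfm(bigrams), 5)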