1 Introduction

Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. However, typing on mobile devices can be a serious pain.

In this project, we aim to build a predictive text product that makes it easier for people to type. To this end, we start by analyzing a large corpus of text documents to discover the structure in the data and see how words are put together. The process involves cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, we will build a predictive text Shiny app.

This project is the capstone project for the Data Science Specialization offered by Johns Hopkins University. The data for this project has been provided by SwiftKey, a company that has built a smart keyboard for mobile devices. One cornerstone of their smart keyboard is predictive text models. The basic training data can be found here. The data is from a corpus called HC Corpora. We may need to collect/use other data during the project.

There are four different databases, each for one specific language: German, English, Finnish, and Russian. In this project, we deal with the English database, which contains the following text files: en_US.blogs.txt, en_US.twitter.txt, and en_US.news.txt.

The current report addresses the first phase of the project, where we get familiar with the databases and do the necessary cleaning. The following tasks have been accomplished in the current report:

  1. Tokenization: identifying appropriate tokens such as words, punctuation, and numbers, and writing a function that takes a file as input and returns a tokenized version of it (a sketch of such a function is given after this list).
  2. Profanity filtering: removing profanity and other words you do not want to predict.
  3. Some preliminary analysis of the data.
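
A minimal sketch of such a tokenization function, using readr and quanteda, is given below; the function name tokenizeFile and its arguments are illustrative assumptions rather than code from the original analysis:

# Sketch of a tokenization function: read a text file and return a quanteda tokens object
library(readr)
library(quanteda)
tokenizeFile <- function(path, remove_numbers = FALSE, remove_punct = FALSE) {
        txt <- read_lines(path)
        tokens(txt, remove_separators = TRUE,
               remove_numbers = remove_numbers, remove_punct = remove_punct)
}
# Example (hypothetical): tokenizeFile("Data/en_US/en_US.blogs.txt", remove_punct = TRUE)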

The structure of the rest of the report is as follows: In Sec. 2, Sec. 3, and Sec. 4, we deal with the blogs, twitter, and news databases, respectively. In each section, we first load the data and summarize it. Then, we tokenize the data at several levels. We then take a look at the profanity expressions in the data. Finally, we clean the data of uninformative lines and profanity expressions. In Sec. 5, we integrate the data and do some analysis on it.
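
The code chunks in this report assume that the required packages are already loaded; the original setup chunk is not shown. A minimal sketch of such a setup, inferred from the function calls used below (note that, depending on the quanteda version, textplot_wordcloud() may be provided by quanteda itself rather than by quanteda.textplots):

# Packages assumed throughout this report (sketch; the actual setup chunk is not shown)
library(readr)                # read_lines()
library(quanteda)             # tokens(), dfm(), tokens_remove(), tokens_select(), stopwords()
library(quanteda.textplots)   # textplot_wordcloud()
library(dplyr)                # %>%, group_by(), summarize()
library(tidytext)             # tidy() method for dfm objects
library(RColorBrewer)         # brewer.pal()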

2 Blogs

In Sec. 2.1, we read the en_US.blogs.txt file and do some preliminary analysis. Sec. 2.2 tokenizes the data at several levels. Moreover, we do some analysis on the tokens of the data. Sec. 2.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 2.4 cleans up the data.

2.1 Loading

In the following script, we first read the data to get a summary of it:

# Reading the US Blogs data
data_us_blogs <- read_lines("Data/en_US/en_US.blogs.txt")
# Summary of the data
summary(data_us_blogs)
   Length     Class      Mode 
   899288 character character 
length_blogs <- sapply(data_us_blogs, nchar)
#maximum length 
max_length_blogs <- max(length_blogs)
max_length_blogs
[1] 40833
#minimum length
min_length_blogs <- min(length_blogs)
min_length_blogs
[1] 1

As we see above,

  • The blogs dataset includes 899288 lines/observations, and
  • The number of characters in an observation ranges between 1 and 40833.

2.2 Tokens

In the following scripts, we tokenize the blogs dataset at several levels, as follows:

  • all tokens
  • all tokens excluding separators
  • all tokens excluding separators and numbers
  • all tokens excluding separators, numbers, and punctuation
  • all tokens excluding separators, numbers, punctuation, and English stopwords.

2.2.1 All tokens

# all tokens including seperators
tokens_blogs <- tokens(data_us_blogs, remove_separators = FALSE)
num_tokens_blogs <- sapply(tokens_blogs, length)
#Maximum number of tokens in a line of blogs
max_num_tokens_blogs <- max(num_tokens_blogs)
max_num_tokens_blogs
[1] 14068
#Number of lines with maximum number of tokens
length(which(num_tokens_blogs == max_num_tokens_blogs))
[1] 1
#Minimum number of tokens in a line of blogs
min_num_tokens_blogs <- min(num_tokens_blogs)
min_num_tokens_blogs
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_blogs == min_num_tokens_blogs))
[1] 6105

As seen,

  • The number of tokens in an observation/line of the blogs dataset ranges between 1 and 14068.

  • There is only 1 line with maximum number of tokens.

  • There are 6105 lines with a single token. This means that about 0.679% of the observations have only one token (see the sketch below).
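
This share is presumably computed directly from the token counts; a minimal sketch (not code from the original analysis):

#Share of blogs lines consisting of a single token (sketch)
num_single_blogs <- length(which(num_tokens_blogs == min_num_tokens_blogs))
round(100 * num_single_blogs / length(data_us_blogs), 3)
#6105 / 899288 is approximately 0.679, matching the figure above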

2.2.2 All tokens excluding separators

# all tokens excluding seperators
tokens_wosep_blogs <- tokens(data_us_blogs, remove_separators = TRUE)
num_tokens_wosep_blogs <- sapply(tokens_wosep_blogs, length)
#Maximum number of tokens w/o seperations in a line of blogs
max_num_tokens_wosep_blogs <- max(num_tokens_wosep_blogs)
max_num_tokens_wosep_blogs
[1] 7439
#Number of lines with maximum number of tokens w/o seperations
length(which(num_tokens_wosep_blogs == max_num_tokens_wosep_blogs))
[1] 1
#Minimum number of tokens w/o seperations in a line of blogs
min_num_tokens_wosep_blogs <- min(num_tokens_wosep_blogs)
min_num_tokens_wosep_blogs
[1] 0
#Number of lines with minimum number of tokens w/o seperations
length(which(num_tokens_wosep_blogs == min_num_tokens_wosep_blogs))
[1] 14

Therefore, excluding the separators,

  • The number of tokens in an observation/line of the blogs dataset ranges between 0 and 7439.

  • There is only 1 line with maximum number of tokens.

  • There are 14 lines with minimum number of tokens, i.e., 0. This means that, disregarding separators, about 0.002% of the observations are empty lines.

2.2.3 All tokens excluding separators and numbers

# all tokens excluding seperators and numbers
tokens_wosepnum_blogs <- tokens(data_us_blogs, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_blogs <- sapply(tokens_wosepnum_blogs, length)
#Maximum number of tokens w/o seperations and numbers in a line of blogs
max_num_tokens_wosepnum_blogs <- max(num_tokens_wosepnum_blogs)
max_num_tokens_wosepnum_blogs
[1] 7092
#Number of lines with maximum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_blogs == max_num_tokens_wosepnum_blogs))
[1] 1
#Minimum number of tokens w/o seperations and numbers in a line of blogs
min_num_tokens_wosepnum_blogs <- min(num_tokens_wosepnum_blogs)
min_num_tokens_wosepnum_blogs
[1] 0
#Number of lines with minimum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_blogs == min_num_tokens_wosepnum_blogs))
[1] 409

Therefore, excluding the separators and numbers,

  • The number of tokens in an observation/line of the blogs dataset ranges between 0 and 7092.

  • There is only 1 line with maximum number of tokens.

  • There are 409 lines with minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0.045% of the observations are empty lines.

2.2.4 All tokens excluding separators, numbers, and punctuation

# all tokens excluding seperators, numbers, and punctuations
tokens_wosepnumpnct_blogs <- tokens(data_us_blogs, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_blogs <- sapply(tokens_wosepnumpnct_blogs, length)
#Maximum number of tokens w/o seperations, numbers, and punctuations in a line of blogs
max_num_tokens_wosepnumpnct_blogs <- max(num_tokens_wosepnumpnct_blogs)
max_num_tokens_wosepnumpnct_blogs
[1] 6312
#Number of lines with maximum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_blogs == max_num_tokens_wosepnumpnct_blogs))
[1] 1
#Minimum number of tokens w/o seperations, numbers, and punctuations in a line of blogs
min_num_tokens_wosepnumpnct_blogs <- min(num_tokens_wosepnumpnct_blogs)
min_num_tokens_wosepnumpnct_blogs
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_blogs == min_num_tokens_wosepnumpnct_blogs))
[1] 960

Therefore, excluding the separators, numbers, and punctuation,

  • The number of tokens in an observation/line of blogs ranges between 0 and 6312.

  • There is only 1 line with maximum number of tokens.

  • There are 960 lines with minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0.107% of the observations are empty lines.

2.2.5 All tokens excluding separators, numbers, punctuation, and stopwords

# all tokens excluding seperators, numbers, punctuations, and stopwords
tokens_wosepnumpnctstp_blogs <- tokens_remove(tokens_wosepnumpnct_blogs, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_blogs <- sapply(tokens_wosepnumpnctstp_blogs, length)
#Maximum number of tokens w/o seperations, numbers, punctuations, and stopwords in a line of blogs
max_num_tokens_wosepnumpnctstp_blogs <- max(num_tokens_wosepnumpnctstp_blogs)
max_num_tokens_wosepnumpnctstp_blogs
[1] 3893
#Number of lines with maximum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_blogs == max_num_tokens_wosepnumpnctstp_blogs))
[1] 1
#Minimum number of tokens w/o seperations, numbers, punctuations, stopwords in a line of blogs
min_num_tokens_wosepnumpnctstp_blogs <- min(num_tokens_wosepnumpnctstp_blogs)
min_num_tokens_wosepnumpnctstp_blogs
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_blogs == min_num_tokens_wosepnumpnctstp_blogs))
[1] 3193

Therefore, excluding the separators, numbers, punctuation, and English stopwords,

  • The number of tokens in an observation/line in the blogs dataset ranges between 0 and 3893.

  • There is only 1 line with maximum number of tokens.

  • There are 3193 lines with minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.355% of the observations are empty lines.

2.2.6 Wordcloud

Let us now take a look at the word cloud of the tokens in the blogs dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:

#Wordcloud - Blogs
set.seed(100)
blogs_dfm <- dfm(tokens_wosepnumpnctstp_blogs)
textplot_wordcloud(blogs_dfm, min_count = 6, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

2.3 Profanity Expressions

In this section, we identify the observations that contain English profanity words/expressions; these will be removed in Sec. 2.4. We read a list of profanity expressions from a text file.2 Let us first take a look at the profanity expressions in the blogs dataset:

bad_words <- read_lines("bad_words.txt")
tokens_bad_blogs <- tokens_select(tokens_wosepnumpnctstp_blogs, pattern = bad_words, selection = "keep" )
set.seed(101)
blogs_bad_dfm <- dfm(tokens_bad_blogs)
textplot_wordcloud(blogs_bad_dfm, min_count = 1, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

blogs_bad_tab_doc <- as.data.frame(tidy(blogs_bad_dfm))%>% 
        group_by(document) %>% 
        summarize(num_bad = sum(count))
summary(blogs_bad_tab_doc)
   document            num_bad      
 Length:18637       Min.   : 1.000  
 Class :character   1st Qu.: 1.000  
 Mode  :character   Median : 1.000  
                    Mean   : 1.301  
                    3rd Qu.: 1.000  
                    Max.   :21.000  

The number of lines in the blogs dataset that contain some profanity expression is 18637. This implies that about 2.072% of the blogs dataset is toxic.
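
This share can be checked directly from blogs_bad_tab_doc; a minimal sketch (not code from the original analysis):

#Share of blogs lines containing at least one profanity expression (sketch)
round(100 * nrow(blogs_bad_tab_doc) / length(data_us_blogs), 3)
#18637 / 899288 is approximately 2.072, matching the figure above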

2.4 Cleaning

In this section, we are going to clean the blogs dataset in two levels:

  1. Cleaning the data of empty lines, i.e., lines of the blogs dataset that are either empty or contain only punctuation, stopwords, numbers, or separators.
  2. Profanity filtering, i.e., cleaning the data of profanity expressions.

2.4.1 Cleaning data from NULL values

As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 3193 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_blogs_v2. Moreover, we write the result to a new text file called en_us.blogs_nnull.txt.

idx_empty_blogs <- which(num_tokens_wosepnumpnctstp_blogs == 0)
#Removing the empty lines
data_us_blogs_v2 <- data_us_blogs[-idx_empty_blogs]
#Writing the result into a text file
fileConn<-file("en_us.blogs_nnull.txt")
writeLines(data_us_blogs_v2, fileConn)
close(fileConn)

2.4.2 Profanity Filtering

In this section, we are going to remove the lines of the blogs dataset that contain profanity expressions. Correspondingly, we create a new dataset called data_us_blogs_clean. We then write it to a text file named en_us.blogs_clean.txt.

In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:

idx_bad_blogs <- as.numeric(substr(blogs_bad_tab_doc$document, 5, nchar(blogs_bad_tab_doc$document)))
idx_not_blogs <- c(idx_bad_blogs, idx_empty_blogs)
length(idx_not_blogs)
[1] 21830

Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 21830 lines to be removed from the blogs dataset. This is 2.427% of the blogs dataset.

#Removing the empty and bad lines
data_us_blogs_clean <- data_us_blogs[-idx_not_blogs]
#Writing the result into a text file
fileConn<-file("en_us.blogs_clean.txt")
writeLines(data_us_blogs_clean, fileConn)
close(fileConn)

Now, the clean blogs dataset contains 877458 lines/observations.

3 Twitter

In this section, we perform the same tasks done on the blogs dataset, this time on the twitter dataset. In Sec. 3.1, we read the en_US.twitter.txt file and do some preliminary analysis. Sec. 3.2 tokenizes the data at several levels. Moreover, we do some analysis on the tokens of the data. Sec. 3.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 3.4 cleans up the data.

3.1 Loading

Let us first read the en_US.twitter.txt to get a summary of the data:

# Reading the US Twitters data
data_us_twitter <- readLines("Data/en_US/en_US.twitter.txt", skipNul = TRUE)
# Summary of the data
summary(data_us_twitter)
   Length     Class      Mode 
  2360148 character character 
length_twitter <- sapply(data_us_twitter, nchar)
#maximum length 
max_length_twitter <- max(length_twitter)
max_length_twitter
[1] 213
#minimum length
min_length_twitter <- min(length_twitter)
min_length_twitter
[1] 2

As we see above,

  • The twitter dataset includes 2360148 lines/observations, and
  • The number of characters in an observation ranges between 2 and 213.

3.2 Tokens

In the following scripts, we get several kinds of tokens from the twitter data set.

3.2.1 All tokens

# all tokens including seperators
tokens_twitter <- tokens(data_us_twitter, remove_separators = FALSE)
num_tokens_twitter <- sapply(tokens_twitter, length)
#Maximum number of tokens in a line of twitter
max_num_tokens_twitter <- max(num_tokens_twitter)
max_num_tokens_twitter
[1] 151
#Number of lines with maximum number of tokens
length(which(num_tokens_twitter == max_num_tokens_twitter))
[1] 1
#Minimum number of tokens in a line of twitter
min_num_tokens_twitter <- min(num_tokens_twitter)
min_num_tokens_twitter
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_twitter == min_num_tokens_twitter))
[1] 367

As seen,

  • The number of tokens in an observation/line of the twitter dataset ranges between 1 and 151.

  • There is only 1 line with maximum number of tokens.

  • There are 367 lines with a single token. This means that about 0.016% of the observations have only one token.

3.2.2 All tokens excluding separators

# all tokens excluding seperators
tokens_wosep_twitter <- tokens(data_us_twitter, remove_separators = TRUE)
num_tokens_wosep_twitter <- sapply(tokens_wosep_twitter, length)
#Maximum number of tokens w/o seperations in a line of twitts
max_num_tokens_wosep_twitter <- max(num_tokens_wosep_twitter)
max_num_tokens_wosep_twitter
[1] 107
#Number of lines with maximum number of tokens w/o seperators
length(which(num_tokens_wosep_twitter == max_num_tokens_wosep_twitter))
[1] 1
#Minimum number of tokens w/o seperations in a line of twitts
min_num_tokens_wosep_twitter <- min(num_tokens_wosep_twitter)
min_num_tokens_wosep_twitter
[1] 1
#Number of lines with minimum number of tokens w/o seperations
length(which(num_tokens_wosep_twitter == min_num_tokens_wosep_twitter))
[1] 369

Therefore, excluding the separators,

  • The number of tokens in an observation/line in the twitter dataset ranges between 1 and 107.

  • There is only 1 line with maximum number of tokens.

  • There are 369 lines with a single token. This means that, disregarding separators, about 0.016% of the observations have only one token.

3.2.3 All tokens excluding separators and numbers

# all tokens excluding seperators and numbers
tokens_wosepnum_twitter <- tokens(data_us_twitter, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_twitter <- sapply(tokens_wosepnum_twitter, length)
#Maximum number of tokens w/o seperations and numbers in a line of twitts
max_num_tokens_wosepnum_twitter <- max(num_tokens_wosepnum_twitter)
max_num_tokens_wosepnum_twitter
[1] 96
#Number of lines with maximum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_twitter == max_num_tokens_wosepnum_twitter))
[1] 1
#Minimum number of tokens w/o seperations and numbers in a line of twitts
min_num_tokens_wosepnum_twitter <- min(num_tokens_wosepnum_twitter)
min_num_tokens_wosepnum_twitter
[1] 0
#Number of lines with minimum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_twitter == min_num_tokens_wosepnum_twitter))
[1] 1

Therefore, excluding the separators and numbers,

  • The number of tokens in an observation/line in the twitter dataset ranges between 0 and 96.

  • There is only 1 line with maximum number of tokens.

  • There is only 1 line with minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0% of the observations are empty lines.

3.2.4 All tokens excluding separators, numbers, and punctuation

# all tokens excluding seperators, numbers, and punctuations
tokens_wosepnumpnct_twitter <- tokens(data_us_twitter, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_twitter <- sapply(tokens_wosepnumpnct_twitter, length)
#Maximum number of tokens w/o seperations, numbers, and punctuations in a line of twitts
max_num_tokens_wosepnumpnct_twitter <- max(num_tokens_wosepnumpnct_twitter)
max_num_tokens_wosepnumpnct_twitter
[1] 60
#Number of lines with maximum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_twitter == max_num_tokens_wosepnumpnct_twitter))
[1] 1
#Minimum number of tokens w/o seperations, numbers, and punctuations in a line of twitts
min_num_tokens_wosepnumpnct_twitter <- min(num_tokens_wosepnumpnct_twitter)
min_num_tokens_wosepnumpnct_twitter
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_twitter == min_num_tokens_wosepnumpnct_twitter))
[1] 3

Therefore, excluding the separators, numbers, and punctuation,

  • The number of tokens in an observation/line in the twitter dataset ranges between 0 and 60.

  • There is only 1 line with maximum number of tokens.

  • There are 3 lines with minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0% of the observations are empty lines.

3.2.5 All tokens excluding separators, numbers, punctuation, and stopwords

# all tokens excluding seperators, numbers, punctuations, and stopwords
tokens_wosepnumpnctstp_twitter <- tokens_remove(tokens_wosepnumpnct_twitter, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_twitter <- sapply(tokens_wosepnumpnctstp_twitter, length)
#Maximum number of tokens w/o seperations, numbers, punctuations, and stopwords in a line of twitts
max_num_tokens_wosepnumpnctstp_twitter <- max(num_tokens_wosepnumpnctstp_twitter)
max_num_tokens_wosepnumpnctstp_twitter
[1] 48
#Number of lines with maximum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_twitter == max_num_tokens_wosepnumpnctstp_twitter))
[1] 1
#Minimum number of tokens w/o seperations, numbers, punctuations, stopwords in a line of twitts
min_num_tokens_wosepnumpnctstp_twitter <- min(num_tokens_wosepnumpnctstp_twitter)
min_num_tokens_wosepnumpnctstp_twitter
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_twitter == min_num_tokens_wosepnumpnctstp_twitter))
[1] 8049

Therefore, excluding the separators, numbers, punctuation, and English stopwords,

  • The number of tokens in an observation/line in the twitter dataset ranges between 0 and 48.

  • There is only 1 line with maximum number of tokens.

  • There are 8049 lines with minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.341% of the observations are empty lines.

3.2.6 Wordcloud

Let us now take a look at the word cloud of the tokens in the twitter dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:

#Wordcloud - twitter
set.seed(200)
twitter_dfm <- dfm(tokens_wosepnumpnctstp_twitter)
textplot_wordcloud(twitter_dfm, min_count = 6, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

3.3 Profanity Expressions

In this section, we analyze the observations from the twitter dataset that contain English profanity words/expressions. Let us first take a look at the profanity words in the twitter dataset:

#bad_words <- read_lines("bad_words.txt")
tokens_bad_twitter <- tokens_select(tokens_wosepnumpnctstp_twitter, pattern = bad_words, selection = "keep" )
set.seed(102)
twitter_bad_dfm <- dfm(tokens_bad_twitter)
textplot_wordcloud(twitter_bad_dfm, min_count = 1, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

twitter_bad_tab_doc <- as.data.frame(tidy(twitter_bad_dfm))%>% 
        group_by(document) %>% 
        summarize(num_bad = sum(count))
summary(twitter_bad_tab_doc)
   document            num_bad      
 Length:88180       Min.   : 1.000  
 Class :character   1st Qu.: 1.000  
 Mode  :character   Median : 1.000  
                    Mean   : 1.154  
                    3rd Qu.: 1.000  
                    Max.   :35.000  

The number of lines in the twitter dataset that contain some profanity expression is 88180. This implies that about 3.736% of the twitter dataset is toxic.

3.4 Cleaning

In this section, we are going to clean the twitter dataset in two levels:

  1. Cleaning the data of empty lines, i.e., lines of the twitter dataset that are either empty or contain only punctuation, stopwords, numbers, or separators.
  2. Profanity filtering, i.e., cleaning the data of profanity expressions.

3.4.1 Cleaning data from NULL values

As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 8049 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_twitter_v2. Moreover, we save the result in a new text file called en_us.twitter_nnull.txt.

idx_empty_twitter <- which(num_tokens_wosepnumpnctstp_twitter == 0)
#Removing the empty lines
data_us_twitter_v2 <- data_us_twitter[-idx_empty_twitter]
#Writing the result into a text file
fileConn<-file("en_us.twitter_nnull.txt")
writeLines(data_us_twitter_v2, fileConn)
close(fileConn)

3.4.2 Profanity Filtering

In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:

idx_bad_twitter <- as.numeric(substr(twitter_bad_tab_doc$document, 5, nchar(twitter_bad_tab_doc$document)))
idx_not_twitter <- c(idx_bad_twitter, idx_empty_twitter)
length(idx_not_twitter)
[1] 96229

Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 96229 lines to be removed from the twitter dataset. This is 4.077% of the twitter dataset.

In the following script, we eliminate such lines from our dataset. The result is saved in a character vector, called data_us_twitter_clean. Moreover, we save the result in a new text file, called en_us.twitter_clean.txt.

#Removing the empty and bad lines
data_us_twitter_clean <- data_us_twitter[-idx_not_twitter]
#Writing the result into a text file
fileConn<-file("en_us.twitter_clean.txt")
writeLines(data_us_twitter_clean, fileConn)
close(fileConn)

Now, the clean twitter dataset contains 2263919 lines/observations.

4 News

In this section, we perform the same tasks done on the blogs and twitter datasets, this time on the news dataset. In Sec. 4.1, we read the en_US.news.txt file and do some preliminary analysis. Sec. 4.2 tokenizes the data at several levels. Moreover, we do some analysis on the tokens of the data. Sec. 4.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 4.4 cleans up the data.

4.1 Loading

Let us first read the en_US.news.txt to get a summary of the data:

# Reading the US News data
data_us_news <- read_lines("Data/en_US/en_US.news.txt")
# Summary of the data
summary(data_us_news)
   Length     Class      Mode 
  1010242 character character 
length_news <- sapply(data_us_news, nchar)
#maximum length 
max_length_news <- max(length_news)
max_length_news
[1] 11384
#minimum length
min_length_news <- min(length_news)
min_length_news
[1] 1

As we see above,

  • The news dataset includes 1010242 lines/observations, and
  • The number of characters in an observation ranges between 1 and 11384.

4.2 Tokens

In the following scripts, we get several kinds of tokens from the news data set.

4.2.1 All tokens

# all tokens including seperators
tokens_news <- tokens(data_us_news, remove_separators = FALSE)
num_tokens_news <- sapply(tokens_news, length)
#Maximum number of tokens in a line of news
max_num_tokens_news <- max(num_tokens_news)
max_num_tokens_news
[1] 4102
#Number of lines with maximum number of tokens
length(which(num_tokens_news == max_num_tokens_news))
[1] 1
#Minimum number of tokens in a line of news
min_num_tokens_news <- min(num_tokens_news)
min_num_tokens_news
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_news == min_num_tokens_news))
[1] 2540

As seen,

  • The number of tokens in an observation/line in the news dataset ranges between 1 and 4102.

  • There is only 1 line with maximum number of tokens.

  • There are 2540 lines with a single token. This means that about 0.251% of the observations have only one token.

4.2.2 All tokens excluding separators

# all tokens excluding seperators
tokens_wosep_news <- tokens(data_us_news, remove_separators = TRUE)
num_tokens_wosep_news <- sapply(tokens_wosep_news, length)
#Maximum number of tokens w/o seperations in a line of news
max_num_tokens_wosep_news <- max(num_tokens_wosep_news)
max_num_tokens_wosep_news
[1] 2733
#Number of lines with maximum number of tokens w/o seperators
length(which(num_tokens_wosep_news == max_num_tokens_wosep_news))
[1] 1
#Minimum number of tokens w/o seperations in a line of news
min_num_tokens_wosep_news <- min(num_tokens_wosep_news)
min_num_tokens_wosep_news
[1] 0
#Number of lines with minimum number of tokens w/o seperations
length(which(num_tokens_wosep_news == min_num_tokens_wosep_news))
[1] 1

Therefore, excluding the separators,

  • The number of tokens in an observation/line of news ranges between 0 and 2733.

  • There is only 1 line with maximum number of tokens.

  • There is only 1 empty line. This means that, disregarding separators, about 0% of the observations are empty lines.

4.2.3 All tokens excluding separators and numbers

# all tokens excluding seperators and numbers
tokens_wosepnum_news <- tokens(data_us_news, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_news <- sapply(tokens_wosepnum_news, length)
#Maximum number of tokens w/o seperations and numbers in a line of news
max_num_tokens_wosepnum_news <- max(num_tokens_wosepnum_news)
max_num_tokens_wosepnum_news
[1] 2733
#Number of lines with maximum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_news == max_num_tokens_wosepnum_news))
[1] 1
#Minimum number of tokens w/o seperations and numbers in a line of news
min_num_tokens_wosepnum_news <- min(num_tokens_wosepnum_news)
min_num_tokens_wosepnum_news
[1] 0
#Number of lines with minimum number of tokens w/o seperations and numbers
length(which(num_tokens_wosepnum_news == min_num_tokens_wosepnum_news))
[1] 341

Therefore, excluding the separators and numbers,

  • The number of tokens in an observation/line in the news dataset ranges between 0 and 2733.

  • There is only 1 line with maximum number of tokens.

  • There are 341 lines with minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0.034% of the observations are empty lines.

4.2.4 All tokens excluding separators, numbers, and punctuation

# all tokens excluding seperators, numbers, and punctuations
tokens_wosepnumpnct_news <- tokens(data_us_news, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_news <- sapply(tokens_wosepnumpnct_news, length)
#Maximum number of tokens w/o seperations, numbers, and punctuations in a line of news
max_num_tokens_wosepnumpnct_news <- max(num_tokens_wosepnumpnct_news)
max_num_tokens_wosepnumpnct_news
[1] 1370
#Number of lines with maximum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_news == max_num_tokens_wosepnumpnct_news))
[1] 2
#Minimum number of tokens w/o seperations, numbers, and punctuations in a line of news
min_num_tokens_wosepnumpnct_news <- min(num_tokens_wosepnumpnct_news)
min_num_tokens_wosepnumpnct_news
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, and punctuations
length(which(num_tokens_wosepnumpnct_news == min_num_tokens_wosepnumpnct_news))
[1] 797

Therefore, excluding the separators, numbers, and punctuation,

  • The number of tokens in an observation/line of news ranges between 0 and 1370.

  • There are only 2 lines with maximum number of tokens.

  • There are 797 lines with minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0.079% of the observations are empty lines.

4.2.5 All tokens excluding separators, numbers, punctuation, and stopwords

# all tokens excluding seperators, numbers, punctuations, and stopwords
tokens_wosepnumpnctstp_news <- tokens_remove(tokens_wosepnumpnct_news, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_news <- sapply(tokens_wosepnumpnctstp_news, length)
#Maximum number of tokens w/o seperations, numbers, punctuations, and stopwords in a line of news
max_num_tokens_wosepnumpnctstp_news <- max(num_tokens_wosepnumpnctstp_news)
max_num_tokens_wosepnumpnctstp_news
[1] 1315
#Number of lines with maximum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_news == max_num_tokens_wosepnumpnctstp_news))
[1] 1
#Minimum number of tokens w/o seperations, numbers, punctuations, stopwords in a line of news
min_num_tokens_wosepnumpnctstp_news <- min(num_tokens_wosepnumpnctstp_news)
min_num_tokens_wosepnumpnctstp_news
[1] 0
#Number of lines with minimum number of tokens w/o seperations, numbers, punctuations, and stopwords
length(which(num_tokens_wosepnumpnctstp_news == min_num_tokens_wosepnumpnctstp_news))
[1] 1366

Therefore, excluding the separators, numbers, punctuation, and English stopwords,

  • The number of tokens in an observation/line in the news dataset ranges between 0 and 1315.

  • There is only 1 line with maximum number of tokens.

  • There are 1366 lines with minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.135% of the observations are empty lines.

4.2.6 Wordcloud

Let us now take a look at the word cloud of the tokens in the news dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:

#Wordcloud - news
set.seed(400)
news_dfm <- dfm(tokens_wosepnumpnctstp_news)
textplot_wordcloud(news_dfm, min_count = 6, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

4.3 Profanity Expressions

In this section, we analyze the observations from the news dataset that contain English profanity words/expressions. Let us first take a look at the profanity words in the news dataset:

tokens_bad_news <- tokens_select(tokens_wosepnumpnctstp_news, pattern = bad_words, selection = "keep" )
set.seed(102)
news_bad_dfm <- dfm(tokens_bad_news)
textplot_wordcloud(news_bad_dfm, min_count = 1, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

news_bad_tab_doc <- as.data.frame(tidy(news_bad_dfm))%>% 
        group_by(document) %>% 
        summarize(num_bad = sum(count))
summary(news_bad_tab_doc)
   document            num_bad      
 Length:7960        Min.   : 1.000  
 Class :character   1st Qu.: 1.000  
 Mode  :character   Median : 1.000  
                    Mean   : 1.147  
                    3rd Qu.: 1.000  
                    Max.   :12.000  

The number of lines in the news dataset that contain some profanity expression is 7960. This implies that about 0.788% of the news dataset is toxic.

4.4 Cleaning

In this section, we are going to clean the news dataset in two levels:

  1. Cleaning the data of empty lines, i.e., lines of the news dataset that are either empty or contain only punctuation, stopwords, numbers, or separators.
  2. Profanity filtering, i.e., cleaning the data of profanity expressions.

4.4.1 Cleaning data from NULL values

As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 1366 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_news_v2. Moreover, we save the result in a new text file called en_us.news_nnull.txt.

idx_empty_news <- which(num_tokens_wosepnumpnctstp_news == 0)
#Removing the empty lines
data_us_news_v2 <- data_us_news[-idx_empty_news]
#Writing the result into a text file
fileConn<-file("en_us.news_nnull.txt")
writeLines(data_us_news_v2, fileConn)
close(fileConn)

4.4.2 Profanity Filtering

In this section, we are going to remove the lines with profanity expressions from the dataset.

In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:

idx_bad_news <- as.numeric(substr(news_bad_tab_doc$document, 5, nchar(news_bad_tab_doc$document)))
idx_not_news <- c(idx_bad_news, idx_empty_news)
length(idx_not_news)
[1] 9326

Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 9326 lines to be removed from the news dataset. This is 0.923% of the news dataset.

In the following script, we eliminate such lines from our dataset. The result is saved in a character vector, called data_us_news_clean. Moreover, we save the result in a new text file, called en_us.news_clean.txt.

#Removing the empty and bad lines
data_us_news_clean <- data_us_news[-idx_not_news]
#Writing the result into a text file
fileConn<-file("en_us.news_clean.txt")
writeLines(data_us_news_clean, fileConn)
close(fileConn)

Now, the clean news dataset contains 1000916 lines/observations.

5 Integration

5.1 Text Files

In this section, we are going to integrate the following kinds of vectors:

  • data_us_XYZ_v2, and
  • data_us_XYZ_clean

where XYZ \(\in \{\)blogs, twitter, news\(\}\), and write them in the following text files, respectively:

  • en_us_nnull.txt, and
  • en_us_clean.txt

# data_us_v2 & en_us_nnull.txt (excluding lines with null or unimportant info)
#Integrating the corresponding vectors
data_us_v2 <- c(data_us_blogs_v2, data_us_twitter_v2, data_us_news_v2) 
#Writing the result into a text file
fileConn <- file("en_us_nnull.txt")
writeLines(data_us_v2, fileConn)
close(fileConn)
summary(data_us_v2)
   Length     Class      Mode 
  4257070 character character 
length_us <- sapply(data_us_v2, nchar)
#maximum length 
max_length_us <- max(length_us)
max_length_us
[1] 40833
#minimum length
min_length_us <- min(length_us)
min_length_us
[1] 1

Therefore, excluding null lines from the blogs, news, and twitter datasets:

  • The integrated dataset includes 4257070 lines/observations, and
  • The number of characters in an observation in the integrated dataset ranges between 1 and 40833.

# data_us_clean & en_us_clean.txt (excluding lines with null or bad expressions)
#Integrating the corresponding vectors
data_us_clean <- c(data_us_blogs_clean, data_us_twitter_clean, data_us_news_clean)
#Writing the result into a text file
fileConn<-file("en_us_clean.txt")
writeLines(data_us_clean, fileConn)
close(fileConn)
summary(data_us_clean)
   Length     Class      Mode 
  4142293 character character 
length_us_clean <- sapply(data_us_clean, nchar)
#maximum length 
max_length_us_clean <- max(length_us_clean)
max_length_us_clean
[1] 40833
#minimum length
min_length_us_clean <- min(length_us_clean)
min_length_us_clean
[1] 1

Therefore, cleaning up the blogs, news, and twitter datasets:

  • The integrated dataset includes 4142293 lines/observations, and
  • The number of characters in an observation in the integrated dataset ranges between 1 and 40833.

5.2 Tokens

In this section, we are going to integrate the different kinds of tokens that we got for each dataset.

In the following script, we integrate the different kinds of tokens extracted from the three different sources: blogs, twitter, and news. For each type of tokens, we first rename the elements so that we can distinguish between the sources. The names of the elements of a tokens object relevant to blogs (twitter and news, respectively) will start with blog (twitter and news, respectively).3
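
The definition of nameKeys is not listed in this report. Given how it is called below (i = 5, a vector of quanteda's default document names of the form text1, text2, ..., and a source label txt), a plausible implementation consistent with those calls is the following sketch; the actual definition may differ:

# Possible implementation of nameKeys (sketch; the original definition is not shown).
# It drops the first i-1 characters of each default name (e.g., "text" in "text123")
# and prepends the source label, so "text123" becomes, e.g., "blog123".
nameKeys <- function(i, vec, txt) {
        paste0(txt, substr(vec, i, nchar(vec)))
}
# Example: nameKeys(i = 5, vec = c("text1", "text2"), txt = "blog") returns "blog1" "blog2"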

The first script integrates the tokens tokens_blogs, tokens_twitter, and tokens_news into a tokens object named tokens_us. This gives us all the tokens.

#rename the names of the tokens 
names(tokens_blogs) <- nameKeys(i = 5, vec = names(tokens_blogs), txt = "blog")
names(tokens_twitter) <- nameKeys(i = 5, vec = names(tokens_twitter), txt = "twitter")
names(tokens_news) <- nameKeys(i = 5, vec = names(tokens_news), txt = "news")
#Integrate 
tokens_us <- append(append(tokens_blogs, tokens_twitter), tokens_news)
#number of tokens in each observation
num_tokens_us <- sapply(tokens_us, length)

Some observations:

  • The number of tokens in a line from all sources ranges between 1 and 14068.

  • There is only 1 line with maximum number of tokens.

  • There are 9012 lines with a single token.
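
These figures are presumably obtained from num_tokens_us in the same way as in the per-source sections; a minimal sketch (not code from the original analysis):

#Summary of token counts over the integrated tokens object (sketch)
max(num_tokens_us)                                  #14068
length(which(num_tokens_us == max(num_tokens_us)))  #1
min(num_tokens_us)                                  #1
length(which(num_tokens_us == min(num_tokens_us)))  #9012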

The following script integrates the tokens tokens_wosep_blogs, tokens_wosep_twitter, and tokens_wosep_news into a tokens object named tokens_wosep_us. This gives us all the tokens excluding the separators.

#rename the names of the tokens 
names(tokens_wosep_blogs) <- nameKeys(i = 5, vec = names(tokens_wosep_blogs), txt = "blog")
names(tokens_wosep_twitter) <- nameKeys(i = 5, vec = names(tokens_wosep_twitter), txt = "twitter")
names(tokens_wosep_news) <- nameKeys(i = 5, vec = names(tokens_wosep_news), txt = "news")
#Integrate 
tokens_wosep_us <- append(append(tokens_wosep_blogs, tokens_wosep_twitter), tokens_wosep_news)
#number of tokens in each observation
num_tokens_wosep_us <- sapply(tokens_wosep_us, length)

Excluding separators, some observations are as follows:

  • The number of tokens in a line from all sources ranges between 0 and 7439.

  • There is only 1 line with maximum number of tokens.

  • There are 15 lines with no tokens.

The following script integrates the tokens tokens_wosepnum_blogs, tokens_wosepnum_twitter, and tokens_wosepnum_news into a tokens object named tokens_wosepnum_us. This gives us all the tokens excluding the separators and numbers.

#rename the names of the tokens 
names(tokens_wosepnum_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnum_blogs), txt = "blog")
names(tokens_wosepnum_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnum_twitter), txt = "twitter")
names(tokens_wosepnum_news) <- nameKeys(i = 5, vec = names(tokens_wosepnum_news), txt = "news")
#Integrate 
tokens_wosepnum_us <- append(append(tokens_wosepnum_blogs, tokens_wosepnum_twitter), tokens_wosepnum_news)
#number of tokens in each observation
num_tokens_wosepnum_us <- sapply(tokens_wosepnum_us, length)

Excluding separators and numbers, some observations are as follows:

  • The number of tokens in a line from all sources ranges between 0 and 7092.

  • There is only 1 line with maximum number of tokens.

  • There are 751 lines with no tokens.

The following script integrates the tokens tokens_wosepnumpnct_blogs, tokens_wosepnumpnct_twitter, and tokens_wosepnumpnct_news into a tokens object named tokens_wosepnumpnct_us. This gives us all the tokens excluding the separators, numbers, and punctuation.

#rename the names of the tokens 
names(tokens_wosepnumpnct_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_blogs), txt = "blog")
names(tokens_wosepnumpnct_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_twitter), txt = "twitter")
names(tokens_wosepnumpnct_news) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_news), txt = "news")
#Integrate 
tokens_wosepnumpnct_us <- append(append(tokens_wosepnumpnct_blogs, tokens_wosepnumpnct_twitter), tokens_wosepnumpnct_news)
#number of tokens in each observation
num_tokens_wosepnumpnct_us <- sapply(tokens_wosepnumpnct_us, length)

Excluding separators, numbers, and punctuation, some observations are as follows:

  • The number of tokens in a line from all sources ranges between 0 and 6312.

  • There is only 1 line with maximum number of tokens.

  • There are 1760 lines with no tokens.

The following script integrates the tokens tokens_wosepnumpnctstp_blogs, tokens_wosepnumpnctstp_twitter, and tokens_wosepnumpnctstp_news into a tokens object named tokens_wosepnumpnctstp_us. This gives us all the tokens excluding the separators, numbers, punctuation, and English stopwords.

#rename the names of the tokens 
names(tokens_wosepnumpnctstp_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_blogs), txt = "blog")
names(tokens_wosepnumpnctstp_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_twitter), txt = "twitter")
names(tokens_wosepnumpnctstp_news) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_news), txt = "news")
#Integrate 
tokens_wosepnumpnctstp_us <- append(append(tokens_wosepnumpnctstp_blogs, tokens_wosepnumpnctstp_twitter), tokens_wosepnumpnctstp_news)
#number of tokens in each observation
num_tokens_wosepnumpnctstp_us <- sapply(tokens_wosepnumpnctstp_us, length)

Excluding separators, numbers, punctuation, and stopwords, some observations are as follows:

  • The number of tokens in a line from all sources ranges between 0 and 3893.

  • There is only 1 line with maximum number of tokens.

  • There are 12608 lines with no tokens.

Let us now take a look at the word cloud of all tokens in the integrated dataset, excluding punctuation, English stopwords, numbers, and separators:

#Wordcloud - All sources
set.seed(111)
us_dfm <- dfm(tokens_wosepnumpnctstp_us)
textplot_wordcloud(us_dfm, min_count = 6, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

Now, let us take a look at the profanity expressions in all the sources. The following script integrates the tokens tokens_bad_blogs, tokens_bad_twitter, and tokens_bad_news into a tokens object named tokens_bad_us. This gives us all the profanity tokens across the sources.

#rename the names of the tokens 
names(tokens_bad_blogs) <- nameKeys(i = 5, vec = names(tokens_bad_blogs), txt = "blog")
names(tokens_bad_twitter) <- nameKeys(i = 5, vec = names(tokens_bad_twitter), txt = "twitter")
names(tokens_bad_news) <- nameKeys(i = 5, vec = names(tokens_bad_news), txt = "news")
#Integrate 
tokens_bad_us <- append(append(tokens_bad_blogs, tokens_bad_twitter), tokens_bad_news)
set.seed(222)
us_bad_dfm <- dfm(tokens_bad_us)
textplot_wordcloud(us_bad_dfm, min_count = 1, random_order = FALSE,
                    rotation = .25, 
                    color = RColorBrewer::brewer.pal(8,"Dark2"))

us_bad_tab_doc <- as.data.frame(tidy(us_bad_dfm))%>% 
        group_by(document) %>% 
        summarize(num_bad = sum(count))
summary(us_bad_tab_doc)
   document            num_bad      
 Length:114777      Min.   : 1.000  
 Class :character   1st Qu.: 1.000  
 Mode  :character   Median : 1.000  
                    Mean   : 1.178  
                    3rd Qu.: 1.000  
                    Max.   :35.000  

The number of lines across all datasets that contain some profanity expression is 114777.
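
Relating this count to the total number of lines gives the overall share of toxic lines; a minimal sketch of the calculation (the original report does not state this share explicitly):

#Overall share of lines containing profanity, across blogs, twitter, and news (sketch)
total_lines <- length(data_us_blogs) + length(data_us_twitter) + length(data_us_news)
round(100 * nrow(us_bad_tab_doc) / total_lines, 3)
#114777 / 4269678 is approximately 2.688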


  2. We have used the following reference for the profanity expressions in English: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en

  3. To this end, we have defined a function called nameKeys.