Around the world, people are spending an increasing amount of time on their mobile devices for email, social networking, banking and a whole range of other activities. However, typing on mobile devices can be a serious pain.
In this project, we aim to build a predictive text product that makes it easier for people to type. To this end, we start by analyzing a large corpus of text documents to discover the structure in the data and see how words are put together. The process involves cleaning and analyzing text data, then building and sampling from a predictive text model. Finally, we will build a predictive text Shiny app.
This project is the capstone project for the Data Science Specialization offered by Johns Hopkins University. The data for this project has been provided by SwiftKey, a company that has built a smart keyboard for mobile devices. Predictive text models are one cornerstone of their smart keyboard. The basic training data can be found here. The data comes from a corpus called HC Corpora. We may need to collect or use other data during the project.
There are four databases, each for one specific language: German, English, Finnish, and Russian. In this project, we deal with the English database, which contains the following text files:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
The current report addresses the first phase of the project, in which we get familiar with the databases and do the necessary cleaning. For each dataset, the report loads and summarizes the data, tokenizes it at several levels, examines its profanity expressions, and removes empty lines and lines containing profanity.
The structure of the rest of the report is as follows. In Sec. 2, Sec. 3, and Sec. 4, we deal with the blogs, twitter, and news databases, respectively. In each section, we first load the data and summarize it. Then, we tokenize the data at several levels. We then take a look at the profanity expressions in the data. Finally, we clean the data of uninformative lines and profanity expressions. In Sec. 5, we integrate the data and analyze it.
In Sec. 2.1, we read the en_US.blogs.txt file and do some preliminary analysis. Sec. 2.2 tokenizes the data at several levels and analyzes the resulting tokens. Sec. 2.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 2.4 cleans up the data.
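The scripts below assume that a few R packages have already been loaded. The exact split of functions across packages depends on the installed quanteda version (in quanteda 3.x, textplot_wordcloud() lives in the companion package quanteda.textplots); a minimal setup along these lines is assumed:
# Packages assumed to be loaded throughout the report
library(readr)              # read_lines()
library(quanteda)           # tokens(), tokens_remove(), tokens_select(), dfm(), stopwords()
library(quanteda.textplots) # textplot_wordcloud() (part of quanteda itself before version 3)
library(dplyr)              # group_by(), summarize()
library(tidytext)           # tidy() method for dfm objects
# RColorBrewer is used via RColorBrewer::brewer.pal() and only needs to be installed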
In the following script, we first read the data and obtain a summary of it:
# Reading the US Blogs data
data_us_blogs <- read_lines("Data/en_US/en_US.blogs.txt")
# Summary of the data
summary(data_us_blogs)
Length Class Mode
899288 character character
length_blogs <- sapply(data_us_blogs, nchar)
#maximum length
max_length_blogs <- max(length_blogs)
max_length_blogs
[1] 40833
#minimum length
min_length_blogs <- min(length_blogs)
min_length_blogs
[1] 1
As we see above, the length of a line in the blogs dataset ranges between 1 and 40833 characters.
In the following scripts, we tokenize the blogs dataset at several levels, as follows:
# all tokens including separators
tokens_blogs <- tokens(data_us_blogs, remove_separators = FALSE)
num_tokens_blogs <- sapply(tokens_blogs, length)
#Maximum number of tokens in a line of blogs
max_num_tokens_blogs <- max(num_tokens_blogs)
max_num_tokens_blogs
[1] 14068
#Number of lines with maximum number of tokens
length(which(num_tokens_blogs == max_num_tokens_blogs))
[1] 1
#Minimum number of tokens in a line of blogs
min_num_tokens_blogs <- min(num_tokens_blogs)
min_num_tokens_blogs
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_blogs == min_num_tokens_blogs))
[1] 6105
As seen:
The number of tokens in an observation/line of the blogs dataset ranges between 1 and 14068.
There is only 1 line with maximum number of tokens.
There are 6105 lines with a single token. This means that about 0.679% of the observations have only one token.
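The percentages quoted in this report all follow the same pattern: the number of lines with a given property divided by the total number of lines. As a sketch, the 0.679% figure above can be reproduced from the objects already defined:
# Share of blog lines consisting of a single token
n_single_blogs <- length(which(num_tokens_blogs == min_num_tokens_blogs))
round(100 * n_single_blogs / length(data_us_blogs), 3)  # ~0.679, matching the percentage quoted above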
# all tokens excluding separators
tokens_wosep_blogs <- tokens(data_us_blogs, remove_separators = TRUE)
num_tokens_wosep_blogs <- sapply(tokens_wosep_blogs, length)
#Maximum number of tokens w/o separators in a line of blogs
max_num_tokens_wosep_blogs <- max(num_tokens_wosep_blogs)
max_num_tokens_wosep_blogs
[1] 7439
#Number of lines with maximum number of tokens w/o separators
length(which(num_tokens_wosep_blogs == max_num_tokens_wosep_blogs))
[1] 1
#Minimum number of tokens w/o separators in a line of blogs
min_num_tokens_wosep_blogs <- min(num_tokens_wosep_blogs)
min_num_tokens_wosep_blogs
[1] 0
#Number of lines with minimum number of tokens w/o separators
length(which(num_tokens_wosep_blogs == min_num_tokens_wosep_blogs))
[1] 14
Therefore, excluding separators:
The number of tokens in an observation/line of the blogs dataset ranges between 0 and 7439.
There is only 1 line with maximum number of tokens.
There are 14 lines with no tokens. This means that, disregarding separators, about 0.002% of the observations are empty lines.
# all tokens excluding separators and numbers
tokens_wosepnum_blogs <- tokens(data_us_blogs, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_blogs <- sapply(tokens_wosepnum_blogs, length)
#Maximum number of tokens w/o separators and numbers in a line of blogs
max_num_tokens_wosepnum_blogs <- max(num_tokens_wosepnum_blogs)
max_num_tokens_wosepnum_blogs
[1] 7092
#Number of lines with maximum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_blogs == max_num_tokens_wosepnum_blogs))
[1] 1
#Minimum number of tokens w/o separators and numbers in a line of blogs
min_num_tokens_wosepnum_blogs <- min(num_tokens_wosepnum_blogs)
min_num_tokens_wosepnum_blogs
[1] 0
#Number of lines with minimum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_blogs == min_num_tokens_wosepnum_blogs))
[1] 409
Therefore, excluding separators and numbers:
The number of tokens in an observation/line of the blogs dataset ranges between 0 and 7092.
There is only 1 line with maximum number of tokens.
There are 409 lines with the minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0.045% of the observations are empty lines.
# all tokens excluding separators, numbers, and punctuation
tokens_wosepnumpnct_blogs <- tokens(data_us_blogs, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_blogs <- sapply(tokens_wosepnumpnct_blogs, length)
#Maximum number of tokens w/o separators, numbers, and punctuation in a line of blogs
max_num_tokens_wosepnumpnct_blogs <- max(num_tokens_wosepnumpnct_blogs)
max_num_tokens_wosepnumpnct_blogs
[1] 6312
#Number of lines with maximum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_blogs == max_num_tokens_wosepnumpnct_blogs))
[1] 1
#Minimum number of tokens w/o separators, numbers, and punctuation in a line of blogs
min_num_tokens_wosepnumpnct_blogs <- min(num_tokens_wosepnumpnct_blogs)
min_num_tokens_wosepnumpnct_blogs
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_blogs == min_num_tokens_wosepnumpnct_blogs))
[1] 960
Therefore, excluding separators, numbers, and punctuation:
The number of tokens in an observation/line of blogs ranges between 0 and 6312.
There is only 1 line with maximum number of tokens.
There are 960 lines with the minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0.107% of the observations are empty lines.
# all tokens excluding separators, numbers, punctuation, and stopwords
tokens_wosepnumpnctstp_blogs <- tokens_remove(tokens_wosepnumpnct_blogs, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_blogs <- sapply(tokens_wosepnumpnctstp_blogs, length)
#Maximum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of blogs
max_num_tokens_wosepnumpnctstp_blogs <- max(num_tokens_wosepnumpnctstp_blogs)
max_num_tokens_wosepnumpnctstp_blogs
[1] 3893
#Number of lines with maximum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_blogs == max_num_tokens_wosepnumpnctstp_blogs))
[1] 1
#Minimum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of blogs
min_num_tokens_wosepnumpnctstp_blogs <- min(num_tokens_wosepnumpnctstp_blogs)
min_num_tokens_wosepnumpnctstp_blogs
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_blogs == min_num_tokens_wosepnumpnctstp_blogs))
[1] 3193
Therefore, excluding separators, numbers, punctuation, and English stopwords:
The number of tokens in an observation/line in the blogs dataset ranges between 0 and 3893.
There is only 1 line with maximum number of tokens.
There are 3193 lines with the minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.355% of the observations are empty lines.
Let us now take a look at the wordcloud of the tokens in the blogs dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:
#Wordcloud - Blogs
set.seed(100)
blogs_dfm <- dfm(tokens_wosepnumpnctstp_blogs)
textplot_wordcloud(blogs_dfm, min_count = 6, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
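As a complement to the wordcloud, the most frequent features of the same document-feature matrix can be listed directly; a brief sketch (output omitted):
# Ten most frequent tokens in the blogs dfm
topfeatures(blogs_dfm, n = 10)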
In this section, we look at the observations that contain English profanity words/expressions; these will be excluded during cleanup. We read a list of profanity expressions from a text file.2 Let us first take a look at the profanity expressions in the blogs dataset:
bad_words <- read_lines("bad_words.txt")
tokens_bad_blogs <- tokens_select(tokens_wosepnumpnctstp_blogs, pattern = bad_words, selection = "keep" )
set.seed(101)
blogs_bad_dfm <- dfm(tokens_bad_blogs)
textplot_wordcloud(blogs_bad_dfm, min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
blogs_bad_tab_doc <- as.data.frame(tidy(blogs_bad_dfm))%>%
group_by(document) %>%
summarize(num_bad = sum(count))
summary(blogs_bad_tab_doc)
document num_bad
Length:18637 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 1.301
3rd Qu.: 1.000
Max. :21.000
The number of lines in the blogs dataset that contain some profanity expression is 18637. This implies that about 2.072% of the blogs dataset is toxic.
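The 2.072% figure is simply the number of flagged lines relative to the total number of lines in the blogs dataset; as a sketch:
# Share of blog lines containing at least one profanity token
round(100 * nrow(blogs_bad_tab_doc) / length(data_us_blogs), 3)  # ~2.072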
In this section, we are going to clean the blogs dataset at two levels: first, we remove the lines that are empty once separators, numbers, punctuation, and stopwords are disregarded; second, we additionally remove the lines that contain profanity expressions.
As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 3193 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_blogs_v2. Moreover, we write the result to a new text file, called en_us.blogs_nnull.txt.
idx_empty_blogs <- which(num_tokens_wosepnumpnctstp_blogs == 0)
#Removing the empty lines
data_us_blogs_v2 <- data_us_blogs[-idx_empty_blogs]
#Writing the result into a text file
fileConn<-file("en_us.blogs_nnull.txt")
writeLines(data_us_blogs_v2, fileConn)
close(fileConn)
In this section, we are going to remove the lines of the blogs dataset that contain profanity expressions. Correspondingly, we create a new dataset, called data_us_blogs_clean. We then write it to a text file named en_us.blogs_clean.txt.
In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:
idx_bad_blogs <- as.numeric(substr(blogs_bad_tab_doc$document, 5, nchar(blogs_bad_tab_doc$document)))
idx_not_blogs <- c(idx_bad_blogs, idx_empty_blogs)
length(idx_not_blogs)
[1] 21830
Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 21830 lines to be removed from the blogs dataset. This is 2.427% of the blogs dataset.
#Removing the empty and bad lines
data_us_blogs_clean <- data_us_blogs[-idx_not_blogs]
#Writing the result into a text file
fileConn<-file("en_us.blogs_clean.txt")
writeLines(data_us_blogs_clean, fileConn)
close(fileConn)
Now, the clean blogs dataset contains 877458 lines/observations.
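Since a line flagged for profanity necessarily contains at least one token, the empty-line indices and the profanity indices are disjoint, and the size of the cleaned vector can be cross-checked against the counts above:
# Sanity check: 899288 - 21830 = 877458
length(data_us_blogs_clean) == length(data_us_blogs) - length(idx_not_blogs)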
In this section, we carry out the same tasks as for the blogs dataset, this time on the twitter dataset. In Sec. 3.1, we read the en_US.twitter.txt file and do some preliminary analysis. Sec. 3.2 tokenizes the data at several levels and analyzes the resulting tokens. Sec. 3.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 3.4 cleans up the data.
Let us first read the en_US.twitter.txt file to get a summary of the data:
# Reading the US Twitter data
data_us_twitter <- readLines("Data/en_US/en_US.twitter.txt", skipNul = TRUE)
# Summary of the data
summary(data_us_twitter)
Length Class Mode
2360148 character character
length_twitter <- sapply(data_us_twitter, nchar)
#maximum length
max_length_twitter <- max(length_twitter)
max_length_twitter
[1] 213
#minimum length
min_length_twitter <- min(length_twitter)
min_length_twitter
[1] 2
In the following scripts, we get several kinds of tokens from the twitter dataset.
# all tokens including separators
tokens_twitter <- tokens(data_us_twitter, remove_separators = FALSE)
num_tokens_twitter <- sapply(tokens_twitter, length)
#Maximum number of tokens in a line of twitter
max_num_tokens_twitter <- max(num_tokens_twitter)
max_num_tokens_twitter
[1] 151
#Number of lines with maximum number of tokens
length(which(num_tokens_twitter == max_num_tokens_twitter))
[1] 1
#Minimum number of tokens in a line of twitter
min_num_tokens_twitter <- min(num_tokens_twitter)
min_num_tokens_twitter
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_twitter == min_num_tokens_twitter))
[1] 367
The number of tokens in an observation/line of the twitter dataset ranges between 1 and 151.
There is only 1 line with maximum number of tokens.
There are 367 lines with a single token. This means that about 0.016% of the observations have only one token.
# all tokens excluding separators
tokens_wosep_twitter <- tokens(data_us_twitter, remove_separators = TRUE)
num_tokens_wosep_twitter <- sapply(tokens_wosep_twitter, length)
#Maximum number of tokens w/o separators in a line of tweets
max_num_tokens_wosep_twitter <- max(num_tokens_wosep_twitter)
max_num_tokens_wosep_twitter
[1] 107
#Number of lines with maximum number of tokens w/o separators
length(which(num_tokens_wosep_twitter == max_num_tokens_wosep_twitter))
[1] 1
#Minimum number of tokens w/o separators in a line of tweets
min_num_tokens_wosep_twitter <- min(num_tokens_wosep_twitter)
min_num_tokens_wosep_twitter
[1] 1
#Number of lines with minimum number of tokens w/o separators
length(which(num_tokens_wosep_twitter == min_num_tokens_wosep_twitter))
[1] 369
Therefore, excluding separators:
The number of tokens in an observation/line in the twitter dataset ranges between 1 and 107.
There is only 1 line with maximum number of tokens.
There are 369 lines with a single token. This means that, disregarding separators, about 0.016% of the observations have only one token.
# all tokens excluding separators and numbers
tokens_wosepnum_twitter <- tokens(data_us_twitter, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_twitter <- sapply(tokens_wosepnum_twitter, length)
#Maximum number of tokens w/o separators and numbers in a line of tweets
max_num_tokens_wosepnum_twitter <- max(num_tokens_wosepnum_twitter)
max_num_tokens_wosepnum_twitter
[1] 96
#Number of lines with maximum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_twitter == max_num_tokens_wosepnum_twitter))
[1] 1
#Minimum number of tokens w/o separators and numbers in a line of tweets
min_num_tokens_wosepnum_twitter <- min(num_tokens_wosepnum_twitter)
min_num_tokens_wosepnum_twitter
[1] 0
#Number of lines with minimum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_twitter == min_num_tokens_wosepnum_twitter))
[1] 1
Therefore, excluding separators and numbers:
The number of tokens in an observation/line in the twitter dataset ranges between 0 and 96.
There is only 1 line with maximum number of tokens.
There is only 1 line with the minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0% of the observations are empty lines.
# all tokens excluding separators, numbers, and punctuation
tokens_wosepnumpnct_twitter <- tokens(data_us_twitter, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_twitter <- sapply(tokens_wosepnumpnct_twitter, length)
#Maximum number of tokens w/o separators, numbers, and punctuation in a line of tweets
max_num_tokens_wosepnumpnct_twitter <- max(num_tokens_wosepnumpnct_twitter)
max_num_tokens_wosepnumpnct_twitter
[1] 60
#Number of lines with maximum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_twitter == max_num_tokens_wosepnumpnct_twitter))
[1] 1
#Minimum number of tokens w/o separators, numbers, and punctuation in a line of tweets
min_num_tokens_wosepnumpnct_twitter <- min(num_tokens_wosepnumpnct_twitter)
min_num_tokens_wosepnumpnct_twitter
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_twitter == min_num_tokens_wosepnumpnct_twitter))
[1] 3
Therefore, excluding separators, numbers, and punctuation:
The number of tokens in an observation/line in the twitter dataset ranges between 0 and 60.
There is only 1 line with maximum number of tokens.
There are 3 lines with the minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0% of the observations are empty lines.
# all tokens excluding separators, numbers, punctuation, and stopwords
tokens_wosepnumpnctstp_twitter <- tokens_remove(tokens_wosepnumpnct_twitter, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_twitter <- sapply(tokens_wosepnumpnctstp_twitter, length)
#Maximum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of tweets
max_num_tokens_wosepnumpnctstp_twitter <- max(num_tokens_wosepnumpnctstp_twitter)
max_num_tokens_wosepnumpnctstp_twitter
[1] 48
#Number of lines with maximum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_twitter == max_num_tokens_wosepnumpnctstp_twitter))
[1] 1
#Minimum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of tweets
min_num_tokens_wosepnumpnctstp_twitter <- min(num_tokens_wosepnumpnctstp_twitter)
min_num_tokens_wosepnumpnctstp_twitter
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_twitter == min_num_tokens_wosepnumpnctstp_twitter))
[1] 8049
Therefore, excluding separators, numbers, punctuation, and English stopwords:
The number of tokens in an observation/line in the twitter dataset ranges between 0 and 48.
There is only 1 line with maximum number of tokens.
There are 8049 lines with the minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.341% of the observations are empty lines.
Let us now take a look at the wordcloud of the tokens in the twitter dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:
#Wordcloud - Twitter
set.seed(200)
twitter_dfm <- dfm(tokens_wosepnumpnctstp_twitter)
textplot_wordcloud(twitter_dfm, min_count = 6, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
In this section, we will analyze the observations from the twitter dataset that contain English profanity words/expressions. Let us first take a look at the profanity words in the twitter dataset:
#bad_words <- read_lines("bad_words.txt")
tokens_bad_twitter <- tokens_select(tokens_wosepnumpnctstp_twitter, pattern = bad_words, selection = "keep" )
set.seed(102)
twitter_bad_dfm <- dfm(tokens_bad_twitter)
textplot_wordcloud(twitter_bad_dfm, min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
twitter_bad_tab_doc <- as.data.frame(tidy(twitter_bad_dfm))%>%
group_by(document) %>%
summarize(num_bad = sum(count))
summary(twitter_bad_tab_doc)
document num_bad
Length:88180 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 1.154
3rd Qu.: 1.000
Max. :35.000
The number of lines in the twitter dataset that contain some profanity expression is 88180. This implies that about 3.736% of the twitter dataset is toxic.
In this section, we are going to clean the twitter dataset at two levels: first, we remove the lines that are empty once separators, numbers, punctuation, and stopwords are disregarded; second, we additionally remove the lines that contain profanity expressions.
As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 8049 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_twitter_v2. Moreover, we save the result in a new text file, called en_us.twitter_nnull.txt.
idx_empty_twitter <- which(num_tokens_wosepnumpnctstp_twitter == 0)
#Removing the empty lines
data_us_twitter_v2 <- data_us_twitter[-idx_empty_twitter]
#Writing the result into a text file
fileConn<-file("en_us.twitter_nnull.txt")
writeLines(data_us_twitter_v2, fileConn)
close(fileConn)
In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:
idx_bad_twitter <- as.numeric(substr(twitter_bad_tab_doc$document, 5, nchar(twitter_bad_tab_doc$document)))
idx_not_twitter <- c(idx_bad_twitter, idx_empty_twitter)
length(idx_not_twitter)
[1] 96229
Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 96229 lines to be removed from the twitter dataset. This is 4.077% of the twitter dataset.
In the following script, we eliminate such lines from our dataset. The result is saved in a character vector, called data_us_twitter_clean. Moreover, we save the result in a new text file, called en_us.twitter_clean.txt.
#Removing the empty and bad lines
data_us_twitter_clean <- data_us_twitter[-idx_not_twitter]
#Writing the result into a text file
fileConn<-file("en_us.twitter_clean.txt")
writeLines(data_us_twitter_clean, fileConn)
close(fileConn)
Now, the clean twitter dataset contains 2263919 lines/observations.
In this section, we carry out the same tasks as for the blogs and twitter datasets, this time on the news dataset. In Sec. 4.1, we read the en_US.news.txt file and do some preliminary analysis. Sec. 4.2 tokenizes the data at several levels and analyzes the resulting tokens. Sec. 4.3 extracts the profanity expressions from the data and does some preliminary analysis on them. Sec. 4.4 cleans up the data.
Let us first read the en_US.news.txt file to get a summary of the data:
# Reading the US News data
data_us_news <- read_lines("Data/en_US/en_US.news.txt")
# Summary of the data
summary(data_us_news)
Length Class Mode
1010242 character character
length_news <- sapply(data_us_news, nchar)
#maximum length
max_length_news <- max(length_news)
max_length_news
[1] 11384
#minimum length
min_length_news <- min(length_news)
min_length_news
[1] 1
In the following scripts, we get several kinds of tokens from the news dataset.
# all tokens including separators
tokens_news <- tokens(data_us_news, remove_separators = FALSE)
num_tokens_news <- sapply(tokens_news, length)
#Maximum number of tokens in a line of news
max_num_tokens_news <- max(num_tokens_news)
max_num_tokens_news
[1] 4102
#Number of lines with maximum number of tokens
length(which(num_tokens_news == max_num_tokens_news))
[1] 1
#Minimum number of tokens in a line of news
min_num_tokens_news <- min(num_tokens_news)
min_num_tokens_news
[1] 1
#Number of lines with minimum number of tokens
length(which(num_tokens_news == min_num_tokens_news))
[1] 2540
The number of tokens in an observation/line in the news dataset ranges between 1 and 4102.
There is only 1 line with maximum number of tokens.
There are 2540 lines with a single token. This means that about 0.251% of the observations have only one token.
# all tokens excluding separators
tokens_wosep_news <- tokens(data_us_news, remove_separators = TRUE)
num_tokens_wosep_news <- sapply(tokens_wosep_news, length)
#Maximum number of tokens w/o separators in a line of news
max_num_tokens_wosep_news <- max(num_tokens_wosep_news)
max_num_tokens_wosep_news
[1] 2733
#Number of lines with maximum number of tokens w/o separators
length(which(num_tokens_wosep_news == max_num_tokens_wosep_news))
[1] 1
#Minimum number of tokens w/o separators in a line of news
min_num_tokens_wosep_news <- min(num_tokens_wosep_news)
min_num_tokens_wosep_news
[1] 0
#Number of lines with minimum number of tokens w/o separators
length(which(num_tokens_wosep_news == min_num_tokens_wosep_news))
[1] 1
Therefore, excluding separators:
The number of tokens in an observation/line of news ranges between 0 and 2733.
There is only 1 line with maximum number of tokens.
There is only 1 empty line. This means that, disregarding separators, about 0% of the observations are empty.
# all tokens excluding separators and numbers
tokens_wosepnum_news <- tokens(data_us_news, remove_separators = TRUE, remove_numbers = TRUE)
num_tokens_wosepnum_news <- sapply(tokens_wosepnum_news, length)
#Maximum number of tokens w/o separators and numbers in a line of news
max_num_tokens_wosepnum_news <- max(num_tokens_wosepnum_news)
max_num_tokens_wosepnum_news
[1] 2733
#Number of lines with maximum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_news == max_num_tokens_wosepnum_news))
[1] 1
#Minimum number of tokens w/o separators and numbers in a line of news
min_num_tokens_wosepnum_news <- min(num_tokens_wosepnum_news)
min_num_tokens_wosepnum_news
[1] 0
#Number of lines with minimum number of tokens w/o separators and numbers
length(which(num_tokens_wosepnum_news == min_num_tokens_wosepnum_news))
[1] 341
Therefore, excluding separators and numbers:
The number of tokens in an observation/line in the news dataset ranges between 0 and 2733.
There is only 1 line with maximum number of tokens.
There are 341 lines with the minimum number of tokens, i.e., 0. This means that, disregarding separators and numbers, about 0.034% of the observations are empty lines.
# all tokens excluding separators, numbers, and punctuation
tokens_wosepnumpnct_news <- tokens(data_us_news, remove_separators = TRUE, remove_numbers = TRUE, remove_punct = TRUE)
num_tokens_wosepnumpnct_news <- sapply(tokens_wosepnumpnct_news, length)
#Maximum number of tokens w/o separators, numbers, and punctuation in a line of news
max_num_tokens_wosepnumpnct_news <- max(num_tokens_wosepnumpnct_news)
max_num_tokens_wosepnumpnct_news
[1] 1370
#Number of lines with maximum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_news == max_num_tokens_wosepnumpnct_news))
[1] 2
#Minimum number of tokens w/o separators, numbers, and punctuation in a line of news
min_num_tokens_wosepnumpnct_news <- min(num_tokens_wosepnumpnct_news)
min_num_tokens_wosepnumpnct_news
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, and punctuation
length(which(num_tokens_wosepnumpnct_news == min_num_tokens_wosepnumpnct_news))
[1] 797
Therefore, excluding separators, numbers, and punctuation:
The number of tokens in an observation/line of news ranges between 0 and 1370.
There are only 2 lines with maximum number of tokens.
There are 797 lines with the minimum number of tokens. This means that, disregarding separators, numbers, and punctuation, about 0.079% of the observations are empty lines.
# all tokens excluding separators, numbers, punctuation, and stopwords
tokens_wosepnumpnctstp_news <- tokens_remove(tokens_wosepnumpnct_news, pattern = stopwords('en'))
num_tokens_wosepnumpnctstp_news <- sapply(tokens_wosepnumpnctstp_news, length)
#Maximum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of news
max_num_tokens_wosepnumpnctstp_news <- max(num_tokens_wosepnumpnctstp_news)
max_num_tokens_wosepnumpnctstp_news
[1] 1315
#Number of lines with maximum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_news == max_num_tokens_wosepnumpnctstp_news))
[1] 1
#Minimum number of tokens w/o separators, numbers, punctuation, and stopwords in a line of news
min_num_tokens_wosepnumpnctstp_news <- min(num_tokens_wosepnumpnctstp_news)
min_num_tokens_wosepnumpnctstp_news
[1] 0
#Number of lines with minimum number of tokens w/o separators, numbers, punctuation, and stopwords
length(which(num_tokens_wosepnumpnctstp_news == min_num_tokens_wosepnumpnctstp_news))
[1] 1366
Therefore, excluding separators, numbers, punctuation, and English stopwords:
The number of tokens in an observation/line in the news dataset ranges between 0 and 1315.
There is only 1 line with maximum number of tokens.
There are 1366 lines with the minimum number of tokens. This means that, disregarding separators, numbers, punctuation, and stopwords, about 0.135% of the observations are empty lines.
Let us now take a look at the wordcloud of the tokens in the news dataset. Note that we do not consider punctuation, English stopwords, numbers, or separators:
#Wordcloud - News
set.seed(400)
news_dfm <- dfm(tokens_wosepnumpnctstp_news)
textplot_wordcloud(news_dfm, min_count = 6, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
In this section, we will analyze the observations from the news dataset that contain English profanity words/expressions. Let us first take a look at the profanity words in the news dataset:
tokens_bad_news <- tokens_select(tokens_wosepnumpnctstp_news, pattern = bad_words, selection = "keep" )
set.seed(102)
news_bad_dfm <- dfm(tokens_bad_news)
textplot_wordcloud(news_bad_dfm, min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
news_bad_tab_doc <- as.data.frame(tidy(news_bad_dfm))%>%
group_by(document) %>%
summarize(num_bad = sum(count))
summary(news_bad_tab_doc)
document num_bad
Length:7960 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 1.147
3rd Qu.: 1.000
Max. :12.000
The number of lines in the news dataset that contain some profanity expression is 7960. This implies that about 0.788% of the news dataset is toxic.
In this section, we are going to clean the news dataset at two levels: first, we remove the lines that are empty once separators, numbers, punctuation, and stopwords are disregarded; second, we additionally remove the lines that contain profanity expressions.
As we already calculated, disregarding English stopwords, punctuation, separators, and numbers, there are 1366 empty lines. In the following script, we eliminate such lines from our dataset. The result is saved in a character vector called data_us_news_v2. Moreover, we save the result in a new text file, called en_us.news_nnull.txt.
idx_empty_news <- which(num_tokens_wosepnumpnctstp_news == 0)
#Removing the empty lines
data_us_news_v2 <- data_us_news[-idx_empty_news]
#Writing the result into a text file
fileConn<-file("en_us.news_nnull.txt")
writeLines(data_us_news_v2, fileConn)
close(fileConn)
In this section, we are going to remove the lines with profanity expressions from the dataset.
In the following script, we first get the indices of the lines that contain profanity expressions; then, we concatenate them with the indices of the empty lines:
idx_bad_news <- as.numeric(substr(news_bad_tab_doc$document, 5, nchar(news_bad_tab_doc$document)))
idx_not_news <- c(idx_bad_news, idx_empty_news)
length(idx_not_news)
[1] 9326
Therefore, disregarding English stopwords, punctuation, separators, numbers, and profanity expressions, there are 9326 lines to be removed from the news dataset. This is 0.923% of the news dataset.
In the following script, we eliminate such lines from our dataset. The result is saved in a character vector, called data_us_news_clean. Moreover, we save the result in a new text file, called en_us.news_clean.txt.
#Removing the empty and bad lines
data_us_news_clean <- data_us_news[-idx_not_news]
#Writing the result into a text file
fileConn<-file("en_us.news_clean.txt")
writeLines(data_us_news_clean, fileConn)
close(fileConn)
Now, the clean news dataset contains 1000916 lines/observations.
In this section, we are going to integrate the following kinds of vectors:
data_us_XYZ_v2, and
data_us_XYZ_clean,
where XYZ \(\in \{\)blogs, twitter, news\(\}\), and write them into the following text files, respectively:
en_us_nnull.txt, and
en_us_clean.txt
# data_us_v2 & en_us_nnull.txt (excluding lines with null or unimportant info)
#Integrating the corresponding vectors
data_us_v2 <- c(data_us_blogs_v2, data_us_twitter_v2, data_us_news_v2)
#Writing the result into a text file
fileConn <- file("en_us_nnull.txt")
writeLines(data_us_v2, fileConn)
close(fileConn)
summary(data_us_v2)
Length Class Mode
4257070 character character
length_us <- sapply(data_us_v2, nchar)
#maximum length
max_length_us <- max(length_us)
max_length_us
[1] 40833
#minimum length
min_length_us <- min(length_us)
min_length_us
[1] 1
Therefore, excluding the null lines from the blogs, news, and twitter datasets, the integrated dataset contains 4257070 lines, whose lengths range between 1 and 40833 characters.
# data_us_clean & en_us_clean.txt (excluding lines with null or bad expressions)
#Integrating the corresponding vectors
data_us_clean <- c(data_us_blogs_clean, data_us_twitter_clean, data_us_news_clean)
#Writing the result into a text file
fileConn<-file("en_us_clean.txt")
writeLines(data_us_clean, fileConn)
close(fileConn)
summary(data_us_clean)
Length Class Mode
4142293 character character
length_us_clean <- sapply(data_us_clean, nchar)
#maximum length
max_length_us_clean <- max(length_us_clean)
max_length_us_clean
[1] 40833
#minimum length
min_length_us_clean <- min(length_us_clean)
min_length_us_clean
[1] 1
Therefore, after cleaning up the blogs, news, and twitter datasets, the integrated clean dataset contains 4142293 lines, whose lengths range between 1 and 40833 characters.
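These totals can be cross-checked against the per-source vectors; a quick consistency check:
# The integrated vectors should be exactly as long as the sum of their components
length(data_us_v2) == length(data_us_blogs_v2) + length(data_us_twitter_v2) + length(data_us_news_v2)
length(data_us_clean) == length(data_us_blogs_clean) + length(data_us_twitter_clean) + length(data_us_news_clean)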
In this section, we are going to integrate the different kinds of tokens that we obtained for each dataset.
In the following scripts, we integrate the different kinds of tokens extracted from the three sources (blogs, twitter, and news). For each type of token, we first rename the elements so that we can distinguish between the sources. The names of the elements of a tokens object relevant to blogs (twitter and news, respectively) will start with blog (twitter and news, respectively).3
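The nameKeys() helper itself is not listed in this report (see footnote 3). A minimal sketch of what it might look like, assuming the default quanteda document names text1, text2, … and that the argument i marks the position where the numeric suffix starts:
# Hypothetical reconstruction of nameKeys(): replace the "text" prefix of each
# document name with a source label, keeping the numeric suffix
nameKeys <- function(i, vec, txt) {
  paste0(txt, substr(vec, i, nchar(vec)))
}
# Example: nameKeys(i = 5, vec = c("text1", "text2"), txt = "blog") returns "blog1" "blog2"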
The first script integrates the tokens objects tokens_blogs, tokens_twitter, and tokens_news into an object named tokens_us. This gives us all the tokens.
#rename the names of the tokens
names(tokens_blogs) <- nameKeys(i = 5, vec = names(tokens_blogs), txt = "blog")
names(tokens_twitter) <- nameKeys(i = 5, vec = names(tokens_twitter), txt = "twitter")
names(tokens_news) <- nameKeys(i = 5, vec = names(tokens_news), txt = "news")
#Integrate
tokens_us <- append(append(tokens_blogs, tokens_twitter), tokens_news)
#number of tokens in each observation
num_tokens_us <- sapply(tokens_us, length)
Some observations:
The number of tokens in a line from all sources ranges between 1 and 14068.
There is only 1 line with maximum number of tokens.
There are 9012 lines with a single token.
The following script integrates the tokens objects tokens_wosep_blogs, tokens_wosep_twitter, and tokens_wosep_news into an object named tokens_wosep_us. This gives us all the tokens excluding separators.
#rename the names of the tokens
names(tokens_wosep_blogs) <- nameKeys(i = 5, vec = names(tokens_wosep_blogs), txt = "blog")
names(tokens_wosep_twitter) <- nameKeys(i = 5, vec = names(tokens_wosep_twitter), txt = "twitter")
names(tokens_wosep_news) <- nameKeys(i = 5, vec = names(tokens_wosep_news), txt = "news")
#Integrate
tokens_wosep_us <- append(append(tokens_wosep_blogs, tokens_wosep_twitter), tokens_wosep_news)
#number of tokens in each observation
num_tokens_wosep_us <- sapply(tokens_wosep_us, length)
Excluding separators, some observations are as follows:
The number of tokens in a line from all sources ranges between 0 and 7439.
There is only 1 line with maximum number of tokens.
There are 15 lines with no tokens.
The following script integrates the tokens objects tokens_wosepnum_blogs, tokens_wosepnum_twitter, and tokens_wosepnum_news into an object named tokens_wosepnum_us. This gives us all the tokens excluding separators and numbers.
#rename the names of the tokens
names(tokens_wosepnum_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnum_blogs), txt = "blog")
names(tokens_wosepnum_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnum_twitter), txt = "twitter")
names(tokens_wosepnum_news) <- nameKeys(i = 5, vec = names(tokens_wosepnum_news), txt = "news")
#Integrate
tokens_wosepnum_us <- append(append(tokens_wosepnum_blogs, tokens_wosepnum_twitter), tokens_wosepnum_news)
#number of tokens in each observation
num_tokens_wosepnum_us <- sapply(tokens_wosepnum_us, length)
Excluding separators and numbers, some observations are as follows:
The number of tokens in a line from all sources ranges between 0 and 7092.
There is only 1 line with maximum number of tokens.
There are 751 lines with no tokens.
The following script integrates the tokens objects tokens_wosepnumpnct_blogs, tokens_wosepnumpnct_twitter, and tokens_wosepnumpnct_news into an object named tokens_wosepnumpnct_us. This gives us all the tokens excluding separators, numbers, and punctuation.
#rename the names of the tokens
names(tokens_wosepnumpnct_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_blogs), txt = "blog")
names(tokens_wosepnumpnct_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_twitter), txt = "twitter")
names(tokens_wosepnumpnct_news) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnct_news), txt = "news")
#Integrate
tokens_wosepnumpnct_us <- append(append(tokens_wosepnumpnct_blogs, tokens_wosepnumpnct_twitter), tokens_wosepnumpnct_news)
#number of tokens in each observation
num_tokens_wosepnumpnct_us <- sapply(tokens_wosepnumpnct_us, length)
Excluding separators, numbers, and punctuation, some observations are as follows:
The number of tokens in a line from all sources ranges between 0 and 6312.
There is only 1 line with maximum number of tokens.
There are 1760 lines with no tokens.
The following script integrates the tokens objects tokens_wosepnumpnctstp_blogs, tokens_wosepnumpnctstp_twitter, and tokens_wosepnumpnctstp_news into an object named tokens_wosepnumpnctstp_us. This gives us all the tokens excluding separators, numbers, punctuation, and English stopwords.
#rename the names of the tokens
names(tokens_wosepnumpnctstp_blogs) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_blogs), txt = "blog")
names(tokens_wosepnumpnctstp_twitter) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_twitter), txt = "twitter")
names(tokens_wosepnumpnctstp_news) <- nameKeys(i = 5, vec = names(tokens_wosepnumpnctstp_news), txt = "news")
#Integrate
tokens_wosepnumpnctstp_us <- append(append(tokens_wosepnumpnctstp_blogs, tokens_wosepnumpnctstp_twitter), tokens_wosepnumpnctstp_news)
#number of tokens in each observation
num_tokens_wosepnumpnctstp_us <- sapply(tokens_wosepnumpnctstp_us, length)
Excluding separators, numbers, punctuation, and stopwords, some observations are as follows:
The number of tokens in a line from all sources ranges between 0 and 3893.
There is only 1 line with maximum number of tokens.
There are 12608 lines with no tokens.
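The combined statistics above, together with those of the intermediate tokenization levels, can be collected into a single overview table; a sketch using the token-count vectors defined in this section:
# Overview of the combined (blogs + twitter + news) token counts per tokenization level
token_counts <- list(
  all                   = num_tokens_us,
  no_sep                = num_tokens_wosep_us,
  no_sep_num            = num_tokens_wosepnum_us,
  no_sep_num_punct      = num_tokens_wosepnumpnct_us,
  no_sep_num_punct_stop = num_tokens_wosepnumpnctstp_us
)
data.frame(
  level       = names(token_counts),
  min_tokens  = sapply(token_counts, min),
  max_tokens  = sapply(token_counts, max),
  empty_lines = sapply(token_counts, function(x) sum(x == 0))
)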
Let us now take a look at the wordcloud of the tokens from all three datasets, excluding punctuation, English stopwords, numbers, and separators:
#Wordcloud - All datasets
set.seed(111)
us_dfm <- dfm(tokens_wosepnumpnctstp_us)
textplot_wordcloud(us_dfm, min_count = 6, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
Now, let us take a look at the profanity expressions in all the sources. The following script integrates the tokens objects tokens_bad_blogs, tokens_bad_twitter, and tokens_bad_news into an object named tokens_bad_us. This gives us all the profanity tokens across all the sources.
#rename the names of the tokens
names(tokens_bad_blogs) <- nameKeys(i = 5, vec = names(tokens_bad_blogs), txt = "blog")
names(tokens_bad_twitter) <- nameKeys(i = 5, vec = names(tokens_bad_twitter), txt = "twitter")
names(tokens_bad_news) <- nameKeys(i = 5, vec = names(tokens_bad_news), txt = "news")
#Integrate
tokens_bad_us <- append(append(tokens_bad_blogs, tokens_bad_twitter), tokens_bad_news)
set.seed(222)
us_bad_dfm <- dfm(tokens_bad_us)
textplot_wordcloud(us_bad_dfm, min_count = 1, random_order = FALSE,
rotation = .25,
color = RColorBrewer::brewer.pal(8,"Dark2"))
us_bad_tab_doc <- as.data.frame(tidy(us_bad_dfm))%>%
group_by(document) %>%
summarize(num_bad = sum(count))
summary(us_bad_tab_doc)
document num_bad
Length:114777 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 1.178
3rd Qu.: 1.000
Max. :35.000
The number of lines across all datasets that contain some profanity expression is 114777.
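Because the document names were prefixed with their source earlier in this section, this total can be broken down by source; a sketch, assuming the blog/twitter/news prefixes produced by the renaming step, whose three counts should match the per-source figures reported earlier (18637, 88180, and 7960):
# Number of profanity-containing lines per source; the three counts sum to 114777
sum(grepl("^blog",    us_bad_tab_doc$document))  # blogs
sum(grepl("^twitter", us_bad_tab_doc$document))  # twitter
sum(grepl("^news",    us_bad_tab_doc$document))  # news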
2. We have used the following reference for the profanity expressions in English: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en
3. To this end, we have defined a function called nameKeys.