We start by setting up the environment and getting to know our data sets: the three English text files made available at https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip.
if(!file.exists('Coursera-SwiftKey.zip')){
download.file('https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip', 'Coursera-SwiftKey.zip')}
unzip('Coursera-SwiftKey.zip', overwrite = FALSE)
list_of_packages <- c("tm", "filehash", "openNLP", "readr", "SnowballC", "R.utils", "tidyverse", "tidytext", "markovchain", "data.table",
"tau", "plyr", "dplyr", "plotly", "ggplot2")
new_packages <- list_of_packages[!(list_of_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
library(tm)
library(filehash)
library(readr)
library(SnowballC)
library(stringr)
library(R.utils)
# Attach plyr before tidyverse/dplyr so that plyr does not mask dplyr verbs such as summarize()
library(plyr)
library(tidyverse)
library(tidytext)
library(data.table)
library(ggplot2)
library(tau)
library(dplyr)
library(plotly)
n_lines_blogs <- countLines(paste(getwd(), '/final/en_US/en_US.blogs.txt', sep = ""))
n_lines_news <- countLines(paste(getwd(), '/final/en_US/en_US.news.txt', sep = ""))
n_lines_twitter <- countLines(paste(getwd(), '/final/en_US/en_US.twitter.txt', sep = ""))
twitter <- PlainTextDocument(read_lines(file = paste(getwd(), '/final/en_US/en_US.twitter.txt', sep = ""), n_max = -1L, progress = show_progress()),
                             heading = "KJB", id = 'twitter',
                             language = "en", description = "Report File")
blogs <- PlainTextDocument(read_lines(file = paste(getwd(), '/final/en_US/en_US.blogs.txt', sep = ""), n_max = -1L, progress = show_progress()),
                           heading = "KJB", id = 'blogs',
                           language = "en", description = "Report File")
news <- PlainTextDocument(read_lines(file = paste(getwd(), '/final/en_US/en_US.news.txt', sep = ""), n_max = -1L, progress = show_progress()),
                          heading = "KJB", id = 'news',
                          language = "en", description = "Report File")
The line counts tell us that the “blogs” file has 899288 lines, “news” has 1010242 lines, and “twitter” has 2360148 lines. Next, we use the solution proposed by Jason Watts in https://www.codementor.io/@jhwatts2010/counting-words-with-r-ds35hzgmj, with some minor tweaks, to show the top 50 words (stop words excluded) in each file. We rely on the pair of functions defined below.
## CesarTC input: standard cleaning function
std_clean <- function(x){
  if("Corpus" %in% class(x)){
    x <- tm_map(x, removePunctuation, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = FALSE, ucp = FALSE)
    x <- tm_map(x, removeNumbers)
    x <- tm_map(x, str_replace_all, pattern = '”', replacement = '\"')
    x <- tm_map(x, str_replace_all, pattern = '“', replacement = '\"')
    x <- tm_map(x, str_replace_all, pattern = '’', replacement = '\'')
    x <- tm_map(x, str_replace_all, pattern = ' - ', replacement = ' ')
    x <- tm_map(x, str_replace_all, pattern = ' \"', replacement = ' ')
    x <- tm_map(x, str_replace_all, pattern = ' \'', replacement = ' ')
    x <- tm_map(x, str_replace_all, pattern = '\" ', replacement = ' ')
    x <- tm_map(x, str_replace_all, pattern = '\' ', replacement = ' ')
    x <- tm_map(x, str_replace_all, pattern = '\"$', replacement = '')
    x <- tm_map(x, str_replace_all, pattern = '\'$', replacement = '')
    x <- tm_map(x, iconv, from = 'latin1', to = 'ASCII', sub = '') # remove unexpected characters, for example Chinese/Japanese/Korean characters
    x <- tm_map(x, str_replace_all, pattern = ' {2,}', replacement = ' ') # collapse runs of spaces into a single space
  } else {
    x <- removePunctuation(x, preserve_intra_word_contractions = TRUE, preserve_intra_word_dashes = FALSE, ucp = FALSE)
    x <- removeNumbers(x)
    x <- str_replace_all(x, pattern = '”', replacement = '\"')
    x <- str_replace_all(x, pattern = '“', replacement = '\"')
    x <- str_replace_all(x, pattern = '’', replacement = '\'')
    x <- str_replace_all(x, pattern = ' - ', replacement = ' ')
    x <- str_replace_all(x, pattern = ' \"', replacement = ' ')
    x <- str_replace_all(x, pattern = ' \'', replacement = ' ')
    x <- str_replace_all(x, pattern = '\" ', replacement = ' ')
    x <- str_replace_all(x, pattern = '\' ', replacement = ' ')
    x <- str_replace_all(x, pattern = '\"$', replacement = '')
    x <- str_replace_all(x, pattern = '\'$', replacement = '')
    x <- iconv(x, from = 'latin1', to = 'ASCII', sub = '') # remove unexpected characters, for example Chinese/Japanese/Korean characters
    x <- str_replace_all(x, pattern = ' {2,}', replacement = ' ') # collapse runs of spaces into a single space
  }
  return(x)
}
## CesarTC end
## function defined based on Jason Watts' script
tokenize_plot_top50 <- function(data, n_gram){
  # Name of the object passed in ('news', 'twitter' or 'blogs'); drives the cut-off and the plot title
  src <- deparse(substitute(data))
  # Cut-off for textcnt's `lower` argument: 0.1% of the number of lines in the source file (2 for any other input)
  min_count <- switch(src,
                      news    = as.integer(round(n_lines_news / 1000, 0)),
                      twitter = as.integer(round(n_lines_twitter / 1000, 0)),
                      blogs   = as.integer(round(n_lines_blogs / 1000, 0)),
                      2L)
  # Clean, tokenize and count the n-grams
  tkn_data <- tau::textcnt(tm::scan_tokenizer(std_clean(data)), method = "string", n = n_gram,
                           lower = min_count, split = '[[:space:]]+')
  # Change List to Data Frame
  tkn_data <- plyr::ldply(tkn_data, data.frame)
  colnames(tkn_data) <- c("word", "frequency")
  # Remove stop words, keep the 50 most frequent terms and express frequencies in thousands
  Results <- tkn_data %>%
    filter(!(word %in% stopwords())) %>%
    arrange(desc(frequency)) %>%
    head(50) %>%
    mutate(frequency = frequency/1000) %>%
    arrange(word)
  return(list(tokenized = tkn_data,
              plot = ggplot(Results, aes(x=word, y=frequency, fill=word)) +
                geom_bar(width = 0.75, stat = "identity", colour = "black", size = 1) + coord_polar(theta = "x") +
                ggtitle(paste("Word Frequency (times 1,000) -", src, sep = " ")) +
                theme(legend.position = "none") + labs(x = NULL, y = NULL)
  ))
}
tkn_twitter <- tokenize_plot_top50(twitter, 1L)
tkn_blogs <- tokenize_plot_top50(blogs, 1L)
tkn_news <- tokenize_plot_top50(news, 1L)
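To make the counting step concrete, here is a minimal base-R sketch of the same idea on a toy input: split on whitespace, count each token, drop stop words, and keep the most frequent terms. The sample sentences, the short stop-word vector and the helper name top_words are made up for illustration; the report itself relies on tau::textcnt and tm::stopwords() instead.

```r
# Toy illustration only: a base-R stand-in for the textcnt + stopwords steps.
top_words <- function(lines, stop_words, n_top = 3L) {
  # Lower-case and split every line on runs of whitespace
  tokens <- unlist(strsplit(tolower(lines), "[[:space:]]+"))
  tokens <- tokens[nzchar(tokens)]
  # Count occurrences of each distinct token
  counts <- table(tokens)
  # Drop stop words, then sort by decreasing frequency
  counts <- counts[!(names(counts) %in% stop_words)]
  counts <- sort(counts, decreasing = TRUE)
  head(data.frame(word = names(counts),
                  frequency = as.integer(counts),
                  stringsAsFactors = FALSE), n_top)
}

sample_lines <- c("the cat said hello", "the dog said hello again", "a cat said no")
sample_stop_words <- c("the", "a")   # hypothetical, far shorter than tm::stopwords()
res <- top_words(sample_lines, sample_stop_words)
print(res)   # "said" comes first with a frequency of 3
```

Unlike tokenize_plot_top50, this sketch applies no minimum-count cut-off and does no cleaning, so it only mirrors the counting and stop-word filtering.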
Now let’s see the most common words in each file.
tkn_twitter['plot']
## $plot
tkn_blogs['plot']
## $plot
tkn_news['plot']
## $plot
We found it interesting that the most common words were quite similar on twitter and blogs. Informal terms and abbreviations, such as “u”, “lol”, “oh”, and “rt”, made the top 50 only in the twitter file. In the more formal news texts, a different mix of terms emerged, with the word “said” clearly dominating.
Is that the general pattern? Do blogs resemble twitter more than they resemble news texts, as the top 50 cut suggests?
When tokenizing, we chose to drop the words whose frequency was lower than 0.1% of the number of lines in each file. That was an empirical (and admittedly questionable) cut. Still, it left us with 1237 words from the twitter source, 3458 from the blogs source, and 3473 from the news source. For the time being, we are comfortable with those sample sizes, and changing the cut-off would be straightforward.
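As a quick check of what that 0.1% rule means in absolute terms, the short sketch below recomputes the cut-offs from the line counts reported earlier, using the same rounding as the tokenizing function. The vector names n_lines and min_count are introduced here only for illustration.

```r
# Line counts reported earlier in this report
n_lines <- c(blogs = 899288, news = 1010242, twitter = 2360148)
# Cut-off used when tokenizing: 0.1% of the number of lines, rounded to an integer
min_count <- as.integer(round(n_lines / 1000, 0))
names(min_count) <- names(n_lines)   # as.integer() drops the names
print(min_count)   # blogs 899, news 1010, twitter 2360
```

The twitter cut-off is more than twice the others, which helps explain why far fewer twitter words survive it.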
The length of each data set is not the same, but grouping the terms by the sources they appeared in is quite enlightening:
check <- tkn_news$tokenized %>%
  arrange(desc(frequency)) %>%
  full_join(tkn_twitter$tokenized , by = 'word') %>%
  full_join(tkn_blogs$tokenized , by = 'word') %>%
  rename(news_freq = frequency.x,
         twitter_freq = frequency.y,
         blogs_freq = frequency) %>%
  mutate(oc = paste(if_else(is.na(news_freq), "", "news-"),
                    if_else(is.na(twitter_freq), "", "twitter-"),
                    if_else(is.na(blogs_freq), "", "blogs"), sep = "")) %>%
  mutate(oc = str_replace(oc,'-$',''))
check %>% dplyr::group_by(oc) %>% dplyr::summarize(n = dplyr::n()) %>% dplyr::arrange(desc(n))
## # A tibble: 7 x 2
##   oc                     n
##   <chr>              <int>
## 1 news-blogs          1520
## 2 news-twitter-blogs  1038
## 3 news                 902
## 4 blogs                805
## 5 twitter-blogs         95
## 6 twitter               91
## 7 news-twitter          13
As we can see above, words shared only by news and blogs are more numerous than words found in all three sources. That makes sense, given the smaller number of twitter tokens that made the arbitrary cut explained above. The number of words from news and blogs is roughly the same (3473 and 3458), while twitter has only 36% of that count (1237). Still, the number of words unique to each source is not proportional: news has more unique words than any other source (902), while twitter has only 10-12% as many as the others (91). Words that appear only in blogs and twitter (95) are more numerous than those common only to twitter and news (13). That indicates, not surprisingly, that blogs are “closer” to twitter than news texts are, although blogs resemble news much more than they resemble twitter.
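The figures quoted in this paragraph can be re-derived from the group sizes in the tibble printed above. The sketch below does that arithmetic in base R; the object names groups and per_source are introduced here for illustration only.

```r
# Group sizes copied from the tibble printed above
groups <- c("news-blogs" = 1520, "news-twitter-blogs" = 1038, "news" = 902,
            "blogs" = 805, "twitter-blogs" = 95, "twitter" = 91, "news-twitter" = 13)
# Words per source: add every group whose label mentions that source
per_source <- sapply(c(news = "news", blogs = "blogs", twitter = "twitter"),
                     function(src) sum(groups[grepl(src, names(groups), fixed = TRUE)]))
print(per_source)   # news 3473, blogs 3458, twitter 1237
# Twitter vocabulary relative to blogs and news
print(round(per_source["twitter"] / per_source[c("blogs", "news")], 3))
# Words exclusive to twitter relative to those exclusive to news and to blogs
print(round(groups["twitter"] / groups[c("news", "blogs")], 3))
```

The per-source totals match the 3473, 3458 and 1237 words reported after the cut, which is a useful consistency check on the join.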
To advance in this study, we’re planning on exploring 2-gram and 3-gram combinations to learn interesting word sequences. Later, we expect to be able to predict three different words for any given combination of 0-2 words.
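As a rough idea of what the 2-gram and 3-gram step could look like, here is a minimal base-R sketch that slides a window over a token vector. The helper make_ngrams and the sample tokens are hypothetical; the actual study may well use tau::textcnt with a larger n or another tokenizer.

```r
# Hypothetical helper: build all n-grams of a token vector with base R.
make_ngrams <- function(tokens, n) {
  # Not enough tokens to form a single n-gram
  if (length(tokens) < n) return(character(0))
  # One n-gram starts at each of these positions
  starts <- seq_len(length(tokens) - n + 1L)
  vapply(starts,
         function(s) paste(tokens[s:(s + n - 1L)], collapse = " "),
         character(1))
}

tokens <- c("thanks", "for", "the", "follow")
bigrams <- make_ngrams(tokens, 2L)    # "thanks for" "for the" "the follow"
trigrams <- make_ngrams(tokens, 3L)   # "thanks for the" "for the follow"
print(bigrams)
print(trigrams)
```

Counting the resulting strings with table() would give n-gram frequencies comparable to the single-word counts used so far.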