This is the first report in the Data Science Capstone of the Data Science Specialization by Johns Hopkins University. The purpose is to showcase my progress towards building a text prediction model and a Shiny app that lets users type a word and predicts what word comes next.
The motivation for this report is to demonstrate progress towards that goal.
In this report, I download the corpus data provided, sample a small percentage of it, and clean it by removing stop words and other unnecessary text. I then perform a brief exploration of word frequencies using n-gram models to better understand the data and gain insights for building the model.
The data for this project was obtained from the course website through this link; it comes from HC Corpora, a collection of corpora for various languages that is freely available to download. See corpora-info.md for more information about the data.
There were several languages provided, but I only used the English files:
en_US.blogs.txt: blog text
en_US.news.txt: news feed text
en_US.twitter.txt: Twitter text data

I first start by loading the libraries, using the package check function from this article to check whether they are installed.
# list all packages
packages = c("tidyverse", # a collection of R packages designed for data science
"tidytext", # text mining library that follows the tidy data principle
"here",# easy file referencing in project-oriented workflows
"tm", # A framework for text mining applications within R.
"RColorBrewer", # color palette for R plots.
"wordcloud2", # word cloud as html widgets
"htmlwidgets", # html widgets in R
"webshot", # take screenshots of web pages from R
"kableExtra") # simple table generator to make table outputs nicer
## If a package is installed, it will be loaded. If any
## are not, the missing package(s) will be installed
## from CRAN and then loaded.
package.check <- lapply(
packages,
FUN = function(x) {
if (!require(x, character.only = TRUE)) {
install.packages(x, dependencies = TRUE)
library(x, character.only = TRUE)
}
}
)
webshot::install_phantomjs() # to save wordcloud plot
theme_set(theme_classic())
I coded a helper function download_data, which downloads the data from the course website URL based on the given locale and puts it under the data folder in a subfolder with the specified name. It also removes the leftover files from the extracted archive. Please refer to the appendix for the code.
# download_data("en_US", "original")
# save the data path
data_path <- here("10_DataScienceCapstone/data/original")
file_names <- list.files(data_path) # list all files
# save paths for our text data files
blogs_txt_path <- here(data_path, file_names[1])
news_txt_path <- here(data_path, file_names[2])
twitter_txt_path <- here(data_path, file_names[3])
file_names
## [1] "blogs.txt" "news.txt" "twitter.txt"
I also created a function file_info, which takes the file names as an argument and outputs a tibble with the file name, file size, line count, word count, and length of the longest line (in characters).
Refer to the appendix for the code and for a bash alternative.
# show information of our text data.
file_info(file_names) %>%
kbl() %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| file_name | size | line_count | word_count | max_line |
|---|---|---|---|---|
| blogs.txt | 200.4 MB | 899,288 | 37,334,690 | 40,833 |
| news.txt | 196.3 MB | 1,010,242 | 34,372,720 | 11,384 |
| twitter.txt | 159.4 MB | 2,360,148 | 30,374,206 | 140 |
From this, we see that the largest file is the blogs file, followed by news, then Twitter.
Since the data is fairly large, as seen from the file information above, I have decided to sample it to speed up my analysis and the development of the initial model. I have three sources: news, blogs and tweets.
Since news text should generally have better English standards, such as spelling and grammar, I will sample slightly more from news (10%) and 5% each from blogs and Twitter. This can be done by drawing a Bernoulli indicator for each line with rbinom; a small illustration is shown below.
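To illustrate the idea on a toy vector (this is just an illustration, not part of the pipeline): rbinom draws a 0/1 indicator for each line, and only the lines whose draw equals 1 are kept.
# toy example of line sampling with rbinom
lines <- c("line one", "line two", "line three", "line four", "line five")
set.seed(2021)
keep <- rbinom(n = length(lines), size = 1, prob = 0.4) == 1 # 1 = keep the line
lines[keep]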
Code for sample_file is in appendix below.
# prob is probability of sampling
sample_file(file_names[1], blogs_txt_path, prob = 0.05)
sample_file(file_names[2], news_txt_path, prob = 0.1)
sample_file(file_names[3], twitter_txt_path, prob = 0.05)
sample_path <- here("10_DataScienceCapstone/data/sampled")
list.files(sample_path)
## [1] "sub_blogs.txt" "sub_news.txt" "sub_twitter.txt"
The sampled data is written to a new folder called sampled under the data folder, and the text files will have the “sub_” prefix.
With our sampled data, we can now read in the text files individually using a helper function readTxtFile, which uses the readLines function.
The text files are then combined.
sampled_file_names <- list.files(sample_path)
# list.files returns the files alphabetically: sub_blogs, sub_news, sub_twitter
blogs_txt <- readTxtFile(here(sample_path, sampled_file_names[1]))
news_txt <- readTxtFile(here(sample_path, sampled_file_names[2]))
twitter_txt <- readTxtFile(here(sample_path, sampled_file_names[3]))
# combine all three sources into a single character vector
combined_txt <- c(blogs_txt, news_txt, twitter_txt)
Now it’s time to clean the data to prepare it for analysis.
Most of this work (converting to lower case, stripping punctuation and splitting the text into tokens) is already performed by the unnest_tokens function from tidytext, which makes my job easier.
# Load data for stop words
data(stop_words)
# head(stop_words)
# get bad words from github repository
get_bad_words <- function() {
url <-
"https://raw.githubusercontent.com/RobertJGabriel/Google-profanity-words/master/list.txt"
bad_words <- read.delim2(
file = url,
header = F,
sep = "\t",
col.names = "text"
)
return(bad_words)
}
bad_words <- get_bad_words() %>% rename('word' = text)
# Create own stop words here
# my_stop_words <- data.frame(word = c())
With the profanity and stop word data ready, we can begin building our n-gram models. Note that the cleaning happens as the n-gram tibbles are built.
unigram_tb <- tibble(line = 1:(length(combined_txt)), text = combined_txt) %>%
unnest_tokens(word, text) %>% # turn our text file into individual words
anti_join(stop_words, by = "word") %>% # remove stop words
anti_join(bad_words, by = "word") %>% # remove profane words
filter(!str_detect(word, "\\d+")) %>% # filter digits
mutate_at("word", str_replace, "[[:punct:]]", "") # remove punctuation
head(unigram_tb)
## # A tibble: 6 x 2
## line word
## <int> <chr>
## 1 1 peter
## 2 1 schiff
## 3 1 hard
## 4 1 pretty
## 5 1 bad
## 6 1 americans
bigram_tb <- tibble(line = 1:(length(combined_txt)), text = combined_txt) %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
mutate_at("bigram", str_replace, "[[:punct:]]", "") %>%
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% stop_words$word) %>%
filter(!word2 %in% stop_words$word) %>%
unite(bigram, word1, word2, sep = " ") %>%
filter(!str_detect(bigram, "\\d+")) %>%
filter(!str_detect(bigram, "NA")) # remove missing values
head(bigram_tb)
## # A tibble: 6 x 2
## line bigram
## <int> <chr>
## 1 1 peter schiff
## 2 1 schiff hard
## 3 1 pretty bad
## 4 1 buy stuff
## 5 1 individual liberty
## 6 2 neighbor recommended
trigram_tb <- tibble(line = 1:(length(combined_txt)), text = combined_txt) %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
mutate_at("trigram", str_replace, "[[:punct:]]", "") %>%
separate(trigram, c("word1", "word2", "word3"), sep = " ") %>%
filter(!word1 %in% stop_words$word,
!word2 %in% stop_words$word,
!word3 %in% stop_words$word) %>%
unite(trigram, word1, word2, word3, sep = " ") %>%
filter(!str_detect(trigram, "\\d+")) %>%
filter(!str_detect(trigram, "NA"))
head(trigram_tb)
## # A tibble: 6 x 2
## line trigram
## <int> <chr>
## 1 1 peter schiff hard
## 2 2 neighbor recommended chasing
## 3 2 recommended chasing fireflies
## 4 2 charles martin id
## 5 2 neighbors favorite author
## 6 2 favorite author fireflies
With the data cleaned, we can start analyzing the frequency of words.
To visualize the data, I created the helper functions plot_top and wordcloud_plot to plot the top n words and a word cloud.
plot_top(unigram_tb, 15)
file_name <- "unigram_wc.png"
wordcloud_plot(unigram_tb, file_name, 150)
knitr::include_graphics(path.expand(here("10_DataScienceCapstone/figs", file_name)))
For single words, the terms time, people and day are very common. We also see the word ‘rt’, which likely stands for retweet on Twitter and should be added to the stop words. We also see the words game, team and night, which all point towards sports games. Then there’s the word ‘lol’, which I believe comes from Twitter as well.
The word cloud shows us more words, and can prove useful for identifying stop words that we should add later on.
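As a possible follow-up, the corpus-specific tokens noted above could be added to the custom stop word placeholder from the cleaning section and removed with the same anti_join step. A sketch with illustrative values (not run in this report):
# sketch: extend the stop word list with corpus-specific tokens
my_stop_words <- tibble(word = c("rt", "lol"), lexicon = "custom")
stop_words_extended <- bind_rows(stop_words, my_stop_words)
# unigram_tb %>% anti_join(stop_words_extended, by = "word")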
plot_top(bigram_tb, 15)
file_name <- "bigram_wc.png"
wordcloud_plot(bigram_tb, file_name, 150)
knitr::include_graphics(path.expand(here("10_DataScienceCapstone/figs", file_name)))
For bigrams, we see that some US locations are very common, along with terms related to the government (white house, vice president, supreme court), which possibly came from the news corpus. We also see the terms happy birthday and mothers day, which most likely belong to the Twitter corpus.
plot_top(trigram_tb, 15)
file_name <- "trigram_wc.png"
wordcloud_plot(trigram_tb, file_name, 100)
knitr::include_graphics(path.expand(here("10_DataScienceCapstone/figs", file_name)))
Analysing trigrams, we see that the previous term “mothers day” is now connected to “happy mothers day”, which shows the relationship between the terms. Holidays like Cinco de Mayo and St. Patrick’s Day are also popping up, along with names like Gov. Chris Christie, President Barack Obama, and Martin Luther King.
For computers to understand our data, we need to convert it into a machine-readable form. In natural language processing (NLP), one common technique is TF-IDF, which stands for term frequency-inverse document frequency.
The text is converted into a representation where each line is treated as a document and the words in it are the tokens, giving a DocumentTermMatrix (DTM) or TermDocumentMatrix (TDM), depending on whether documents correspond to rows or columns. TF-IDF then provides a weighting that measures how important each word is to a document within the collection.
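As a quick illustration of the weighting itself, here is a sketch using tidytext’s bind_tf_idf on the unigram counts built earlier (an aside, not part of the tm-based pipeline below):
# sketch: tf-idf weights for single words, treating each line as a document
unigram_tfidf <- unigram_tb %>%
  count(line, word, sort = TRUE) %>% # term counts per document (line)
  bind_tf_idf(word, line, n) %>%     # adds tf, idf and tf_idf columns
  arrange(desc(tf_idf))              # highest-weighted terms first
head(unigram_tfidf)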
Using cast_dtm from tidytext, I can cast my data frame into a DocumentTermMatrix from the tm package.
my_dtm <- tibble(line = 1:(length(combined_txt)), text = combined_txt) %>%
unnest_tokens(word, text) %>%
count(line, word) %>%
cast_dtm(line, word, n)
my_dtm
## <<DocumentTermMatrix (documents: 265350, terms: 153707)>>
## Non-/sparse entries: 5813631/40780338819
## Sparsity : 100%
## Maximal term length: 109
## Weighting : term frequency (tf)
Our DTM has a total of 265,350 documents (lines) and 153,707 terms. It is also essentially 100% sparse, which can cause problems for our model and will have to be addressed later on; one option is sketched below.
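One possible way to tackle the sparsity later on is tm’s removeSparseTerms, which drops terms whose sparsity exceeds a threshold. A sketch (not run in this report; the 0.999 cutoff is an illustrative assumption):
# sketch: keep only terms that appear in at least ~0.1% of documents
my_dtm_small <- removeSparseTerms(my_dtm, sparse = 0.999)
my_dtm_small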
The analysis helped me understand what kind of information my sampled data captures. With a DTM ready, the next step is to get more data for testing and validation, then build the model. After that, I will start building the Shiny app so users can interact with the data product. Throughout the process, I will be studying more from the book Tidy Text Mining and researching suitable algorithms to use.
The steps are summarized below:
Prepare train, test and validation datasets
Train and evaluate the text prediction model
Build the Shiny app
Create the slide deck
I plan to generate another random sample from the news dataset to validate my prediction model. I choose the news dataset because it should contain the most proper English text.
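A rough sketch of how this could be done, reusing readTxtFile and the news file path from above. It re-draws the training mask with the same seed used in sample_file so those lines can be excluded; the validation fraction (5%) and the second seed are illustrative assumptions:
# sketch: draw a validation sample from the news file, disjoint from the training sample
news_all <- readTxtFile(news_txt_path)
set.seed(2021) # same seed as sample_file, so this reproduces the training draw
train_mask <- rbinom(n = length(news_all), size = 1, prob = 0.1) == 1
held_out <- news_all[!train_mask]
set.seed(2022) # a different seed for the validation draw (illustrative)
validation_lines <- sample(held_out, size = round(0.05 * length(held_out)))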
# locale options: en_US, de_DE, ru_RU and fi_FI
# outdir = directory name
download_data <- function(locale, outdir) {
here::i_am("10_DataScienceCapstone/rmd/milestone-report.Rmd")
data_path <- here("10_DataScienceCapstone/data")
if (dir.exists(here(data_path, outdir))) {
    print("directory already exists")
    return(invisible(NULL)) # stop early: nothing to download or rename
  } else {
options(timeout = 200) # to prevent timeout error
# download data into temp file
temp <- tempfile()
download.file(url = "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip",
temp)
# unzip tempfile and remove
unzip(zipfile = temp, exdir = data_path)
unlink(temp)
}
# save directory of extracted zip file
final_path <- here(data_path, "final")
# create outdir directory
dir.create(here(data_path, outdir))
# grabs files with en_US
old_names <-
list.files(
path = final_path,
pattern = paste0("^", locale),
recursive = TRUE
)
# provide new names
new_names <- c("blogs.txt", "news.txt", "twitter.txt")
# rename and remove old ones.
file.rename(from = file.path(final_path, old_names),
to = file.path(here(data_path, outdir), new_names))
# remove final folder from zip file
unlink(here(data_path, "final"), recursive = TRUE)
}
# create txt file
readTxtFile <- function(path) {
con <- file(path, "r")
text <- readLines(con, skipNul = T)
close(con)
return(text)
}
# Creates a table given the text files
file_info <- function(names) {
# file size
size <- file.info(here(data_path, names))$size / (2 ** 20)
# word count
(total_words_bash <-
system("wc -w ../data/original/*.txt", intern = TRUE))
regexp <- "[[:digit:]]+"
word_count <-
unlist(str_split(str_extract(total_words_bash, regexp), " ")[1:3])
line_count <- c()
max_line <- c()
for (name in names) {
file <- readTxtFile(here(data_path, name))
num_lines <- length(file)
longest_line <- as.numeric(summary(nchar(file))["Max."])
line_count <- c(line_count, num_lines)
max_line <- c(max_line, longest_line) # append so values stay aligned with file order
}
tb <- tibble(
"file_name" = names,
"size" = paste(round(size, 1), "MB"),
"line_count" = line_count,
"word_count" = as.integer(word_count),
"max_line" = as.integer(max_line)
) %>%
mutate_if(is.numeric, list( ~ prettyNum(., big.mark = ",")))
return(tb)
}
# -w gives word count
# -c gives byte count
# -l gives line count
echo " lines words bytes"
wc -l -w -c ../data/original/*
sample_file <- function(filename, filepath, prob) {
set.seed(2021)
con <- file(filepath, "r")
file <- readLines(con, skipNul = T)
len <- length(file)
sub_file <- file[rbinom(n = len, size = 1, prob = prob) == 1]
close(con)
sample_path <- here("10_DataScienceCapstone/data/sampled")
if (!dir.exists(sample_path)) {
dir.create(sample_path)
}
new_file_path <- paste0(sample_path, "/sub_", filename)
if (!file.exists(new_file_path)) {
out <- file(new_file_path, "w")
writeLines(sub_file, con = out)
close(out)
}
}
# plots top n words
plot_top <- function(tibble, top_num) {
tibble %>%
rename(ngram = colnames(tibble)[2]) %>%
count(ngram, sort = TRUE) %>%
slice(1:top_num) %>%
mutate(ngram = reorder(ngram, n)) %>%
ggplot(aes(n, ngram)) +
geom_col() +
labs(y = NULL)
}
# word cloud plots top n words
wordcloud_plot <- function(tibble, file_name, top_num=100) {
wordcloud <- tibble %>%
rename(ngram = colnames(tibble)[2]) %>%
count(ngram, sort = TRUE) %>%
slice(1:top_num) %>%
wordcloud2(size=0.7, color='random-dark', minRotation = 0, maxRotation = 0)
saveWidget(wordcloud, "tmp.html", selfcontained = F)
webshot("tmp.html", here("10_DataScienceCapstone/figs", file_name), delay = 5, vwidth = 1000, vheight = 800)
unlink(here("10_DataScienceCapstone/rmd", "tmp_files"), recursive = TRUE)
unlink(here("10_DataScienceCapstone/rmd", "tmp.html"))
}