Introduction

This is the capstone project for the Data Science Specialization.

In this report, we demonstrate some basic features of the dataset and perform some exploratory analysis on the unigrams in the data.

Loading Data

Here we load the data from the original dataset and count the lines, words, and characters of each file with the wc command.
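The counts in the table below were produced with wc. The following is a minimal sketch of that counting step in R, assuming the three text files exist at the paths shown (the paths are placeholders) and that the wc utility is available on the system PATH.

# Sketch only: count lines, words and characters of text files with wc.
# The file paths are placeholders; wc -m reports characters (wc -c would report bytes).
countFile <- function(filepath) {
  out <- system2("wc", args = c("-lwm", shQuote(filepath)), stdout = TRUE)
  counts <- as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:3])
  data.frame(file = basename(filepath),
             lines = counts[1], words = counts[2], characters = counts[3])
}

do.call(rbind, lapply(c("sample/blogs_sample.txt",
                        "sample/news_sample.txt",
                        "sample/twitter_sample.txt"), countFile))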

File Sizes

File Name            Line Count   Word Count   Character Count
blogs_sample.txt         180171      7490703          41988757
news_sample.txt           15303       521131           3105656
twitter_sample.txt       200000      2579440          13985117
Total                    395474     10591274          59079530

Since the dataset is large, we draw a sample consisting of 20% of the records, and save the samples.

library(dplyr)

set.seed(5832)

# Read a text file and return a random sample of sampleProp of its lines
processFile <- function(filepath, sampleProp) {
  df <- data.frame(text = readLines(filepath), stringsAsFactors = FALSE)
  sample_frac(df, sampleProp)
}


dir.create(file.path(".", "sample"), showWarnings = FALSE)
if(!file.exists("sample/twitter_sample.txt")){
  twsample <- processFile("final/en_US/en_US.twitter.txt", 0.2)
  write.table(twsample, "sample/twitter_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if(!file.exists("sample/blogs_sample.txt")){
  blogsample <- processFile("final/en_US/en_US.blogs.txt",  0.2)
  write.table(blogsample, "sample/blogs_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if(!file.exists("sample/news_sample.txt")){
  newsample <- processFile("final/en_US/en_US.news.txt", 0.2)
  write.table(newsample, "sample/news_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}

Here we divide the data into three subsets of 60% / 20% / 20%, and use them as training, testing, and validation sets respectively.

if(!file.exists("data/training.rds")){
  tw <- readLines("sample/twitter_sample.txt")
  bl <- readLines("sample/blogs_sample.txt")
  nw <- readLines("sample/news_sample.txt")
  text = c(tw, bl, nw)
  Encoding(text) <- "UTF-8"
  docs <- data_frame(text)
  set.seed(4869)
  intrain <- sample(nrow(docs), 0.6 * nrow(docs))
  training <- docs[intrain,]
  dir.create(file.path(".", "data"), showWarnings = FALSE)
  saveRDS(training, "data/training.rds")
  testing <- docs[-intrain, ]
  invalid <- sample(nrow(testing), 0.5 * nrow(testing))
  validating <- testing[invalid,]
  testing <- testing[-invalid,]
  saveRDS(validating, "data/validating.rds")
  saveRDS(testing, "data/testing.rds")
} else{
  training <- readRDS("data/training.rds")
}

Bad Words

We do not want our app to produce profane words. Therefore we read in a list of bad words we want to avoid.

# List of profane words to exclude, one word per line
bad.words <- read.csv("bad-words.txt", header = FALSE, col.names = "word",
                      stringsAsFactors = FALSE)

Unigram Exploration

We tokenize the text samples into words, remove the bad words and stop words, and do some exploratory analysis.

library(tidytext)

unigram <- training %>%
  unnest_tokens(word, text) %>%
  filter(!grepl("[+-]?([0-9]*[.])?[0-9]+", word)) %>%  # drop tokens that contain numbers
  count(word) %>%
  anti_join(bad.words) %>%
  arrange(desc(n))
## Joining, by = "word"
data(stop_words)
tidystop <- unigram %>% anti_join(stop_words)
## Joining, by = "word"

Now we have a tidy unigram sample. We can find the most frequent words in the sample, plot the top 30, and form a word cloud with them.

library(ggplot2)

tidystop %>%
  slice(1:30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

library(wordcloud)
## Loading required package: RColorBrewer
tidystop %>%
  with(wordcloud(word, n, max.words = 120))

References

tidytext
ngram
Katz's back-off
bad words list
Smoothing and discounting, CS498JH: Introduction to NLP