Introduction

This is the capstone project for the Data Science Specialization.

In this report, we demonstrate some basic features of the dataset and perform some exploratory analysis on the unigrams in the data.

Loading Data

Here we load the data from the original dataset and count the lines, words, and characters of each file with the wc command.
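The counts in the table below were produced with wc. The following is a minimal sketch of that counting step in R, assuming the three text files exist at the paths shown (the paths are placeholders) and that the wc utility is available on the system PATH.

# Sketch only: count lines, words and characters of text files with wc.
# The file paths are placeholders; wc -m reports characters (wc -c would report bytes).
countFile <- function(filepath) {
  out <- system2("wc", args = c("-lwm", shQuote(filepath)), stdout = TRUE)
  counts <- as.numeric(strsplit(trimws(out), "\\s+")[[1]][1:3])
  data.frame(file = basename(filepath),
             lines = counts[1], words = counts[2], characters = counts[3])
}

do.call(rbind, lapply(c("sample/blogs_sample.txt",
                        "sample/news_sample.txt",
                        "sample/twitter_sample.txt"), countFile))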

File Sizes

File Name            Line Count   Word Count   Character Count
blogs_sample.txt         180171      7490703          41988757
news_sample.txt           15303       521131           3105656
twitter_sample.txt       200000      2579440          13985117
Total                    395474     10591274          59079530

Since the dataset is large, we draw a sample consisting of 20% of the records, and save the samples.

library(dplyr)

set.seed(5832)

# Read a text file and return a random sample of sampleProp of its lines
processFile <- function(filepath, sampleProp) {
  df <- data.frame(text = readLines(filepath), stringsAsFactors = FALSE)
  sample_frac(df, sampleProp)
}


dir.create(file.path(".", "sample"), showWarnings = FALSE)
if(!file.exists("sample/twitter_sample.txt")){
  twsample <- processFile("final/en_US/en_US.twitter.txt", 0.2)
  write.table(twsample, "sample/twitter_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if(!file.exists("sample/blogs_sample.txt")){
  blogsample <- processFile("final/en_US/en_US.blogs.txt",  0.2)
  write.table(blogsample, "sample/blogs_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if(!file.exists("sample/news_sample.txt")){
  newsample <- processFile("final/en_US/en_US.news.txt", 0.2)
  write.table(newsample, "sample/news_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}

Here we divide the data into three subsets of 60% / 20% / 20%, and use them as training, testing, and validation sets respectively.

if(!file.exists("data/training.rds")){
  tw <- readLines("sample/twitter_sample.txt")
  bl <- readLines("sample/blogs_sample.txt")
  nw <- readLines("sample/news_sample.txt")
  text = c(tw, bl, nw)
  Encoding(text) <- "UTF-8"
  docs <- data_frame(text)
  set.seed(4869)
  intrain <- sample(nrow(docs), 0.6 * nrow(docs))
  training <- docs[intrain,]
  dir.create(file.path(".", "data"), showWarnings = FALSE)
  saveRDS(training, "data/training.rds")
  testing <- docs[-intrain, ]
  invalid <- sample(nrow(testing), 0.5 * nrow(testing))
  validating <- testing[invalid,]
  testing <- testing[-invalid,]
  saveRDS(validating, "data/validating.rds")
  saveRDS(testing, "data/testing.rds")
} else{
  training <- readRDS("data/training.rds")
}

Bad Words

We do not want our app to produce profane words. Therefore we read in a list of bad words we want to avoid.

# List of profane words to exclude, one word per line
bad.words <- read.csv("bad-words.txt", header = FALSE, col.names = "word",
                      stringsAsFactors = FALSE)

Unigram Exploration

We tokenize the text samples into words, remove the bad words and stop words, and do some exploratory analysis.

library(tidytext)

unigram <- training %>%
  unnest_tokens(word, text) %>%
  filter(!grepl("[+-]?([0-9]*[.])?[0-9]+", word)) %>%  # drop tokens that contain numbers
  count(word) %>%
  anti_join(bad.words) %>%
  arrange(desc(n))
## Joining, by = "word"
data(stop_words)
tidystop <- unigram %>% anti_join(stop_words)
## Joining, by = "word"

Now we have a tidy unigram sample. We can find the most frequent words in the sample, plot the top 30, and form a word cloud with them.

library(ggplot2)

tidystop %>%
  slice(1:30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

library(wordcloud)
## Loading required package: RColorBrewer
tidystop %>%
  with(wordcloud(word, n, max.words = 120))

References

tidytext
ngram
Katz's back-off
bad words list
Smoothing and discounting, CS498JH: Introduction to NLP