This is the capstone project for the Data Science Specialization.
In this report, we summarize some basic features of the dataset and perform an exploratory analysis of the unigrams in the data.
Here we count the lines, words, and characters in the data files with the wc command; a short sketch of how these counts can be reproduced from R follows the table.
| File Name | Line Count | Word Count | Character Count |
|---|---|---|---|
| blogs_sample.txt | 180171 | 7490703 | 41988757 |
| news_sample.txt | 15303 | 521131 | 3105656 |
| twitter_sample.txt | 200000 | 2579440 | 13985117 |
| Total | 395474 | 10591274 | 59079530 |
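These counts can be reproduced by shelling out to wc from R. Below is a minimal sketch; the file paths are illustrative, so point them at whichever files you want to measure. Note that wc -c reports bytes, so use wc -m instead if a multibyte-aware character count is needed.
# Minimal sketch: run `wc -lwc` on each file and collect the counts.
# File paths are illustrative; adjust them to the files being measured.
files <- c("sample/blogs_sample.txt", "sample/news_sample.txt", "sample/twitter_sample.txt")
counts <- t(sapply(files, function(f) {
  fields <- strsplit(trimws(system(paste("wc -lwc", f), intern = TRUE)), "\\s+")[[1]]
  as.numeric(fields[1:3])  # wc prints lines, words, bytes, then the file name
}))
colnames(counts) <- c("lines", "words", "characters")
counts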
Since the dataset is large, we draw a sample consisting of 20% of the records from each file and save the samples to disk.
library(dplyr)

set.seed(5832)

# Read a text file and return a random sample of proportion sampleProp of its lines.
processFile <- function(filepath, sampleProp) {
  df <- data.frame(text = readLines(filepath), stringsAsFactors = FALSE)
  sample_frac(df, sampleProp)
}

dir.create(file.path(".", "sample"), showWarnings = FALSE)

if (!file.exists("sample/twitter_sample.txt")) {
  twsample <- processFile("final/en_US/en_US.twitter.txt", 0.2)
  write.table(twsample, "sample/twitter_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if (!file.exists("sample/blogs_sample.txt")) {
  blogsample <- processFile("final/en_US/en_US.blogs.txt", 0.2)
  write.table(blogsample, "sample/blogs_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
if (!file.exists("sample/news_sample.txt")) {
  newsample <- processFile("final/en_US/en_US.news.txt", 0.2)
  write.table(newsample, "sample/news_sample.txt", row.names = FALSE, col.names = FALSE, quote = FALSE)
}
Here we divide the sampled data into three subsets with a 60% / 20% / 20% split and use them as the training, testing, and validation sets, respectively.
library(tibble)

if (!file.exists("data/training.rds")) {
  # Combine the three samples into one corpus and mark the encoding as UTF-8.
  tw <- readLines("sample/twitter_sample.txt")
  bl <- readLines("sample/blogs_sample.txt")
  nw <- readLines("sample/news_sample.txt")
  text <- c(tw, bl, nw)
  Encoding(text) <- "UTF-8"
  docs <- tibble(text = text)

  # 60% of the lines form the training set.
  set.seed(4869)
  intrain <- sample(nrow(docs), 0.6 * nrow(docs))
  training <- docs[intrain, ]
  dir.create(file.path(".", "data"), showWarnings = FALSE)
  saveRDS(training, "data/training.rds")

  # Split the remaining 40% evenly into validation and testing sets.
  testing <- docs[-intrain, ]
  invalid <- sample(nrow(testing), 0.5 * nrow(testing))
  validating <- testing[invalid, ]
  testing <- testing[-invalid, ]
  saveRDS(validating, "data/validating.rds")
  saveRDS(testing, "data/testing.rds")
} else {
  # The sets have already been created; only the training set is needed below.
  training <- readRDS("data/training.rds")
}
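As a quick sanity check, the three pieces should hold roughly 60%, 20%, and 20% of the combined sample. The sketch below assumes the split chunk above has just been run, so that all three objects are still in memory; otherwise only the training set is reloaded from disk.
# Compare the sizes of the three splits (assumes training, validating and
# testing all exist in the current session).
sapply(list(training = training, validating = validating, testing = testing), nrow)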
We do not want our app to suggest profane words, so we read in a list of bad words to filter out.
# One profanity per line; name the single column "word" so it can be joined on below.
bad.words <- read.csv("bad-words.txt", header = FALSE, col.names = c("word"), stringsAsFactors = FALSE)
We tokenize the text samples into words, remove the bad words and stop words, and do some exploratory analysis.
library(tidytext)

# Tokenize into unigrams, drop tokens that contain digits, count word
# frequencies, and remove the profane words.
unigram <- training %>%
  unnest_tokens(word, text) %>%
  filter(!grepl("[+-]?([0-9]*[.])?[0-9]+", word)) %>%
  count(word) %>%
  anti_join(bad.words) %>%
  arrange(desc(n))
## Joining, by = "word"
# Also build a version with common English stop words removed.
data(stop_words)
tidystop <- unigram %>% anti_join(stop_words)
## Joining, by = "word"
Now that we have a tidy unigram table, we can look at the most frequent words in our sample and draw a word cloud from them.
library(ggplot2)

# Bar chart of the 30 most frequent words (stop words excluded).
tidystop %>%
  head(30) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
library(wordcloud)
## Loading required package: RColorBrewer
# Word cloud of the 120 most frequent words.
tidystop %>%
  with(wordcloud(word, n, max.words = 120))