This is the milestone report for the Data Science Capstone project. The goal of this report is to acquire the dataset and perform exploratory analysis that will guide the final prediction app.
The dataset is downloaded directly from the course website. It consists of text entries collected from blogs, newspapers, and Twitter in four languages: English, Finnish, German, and Russian. For this project, we will only use the English files.
##Download the dataset
corpora_url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if(!file.exists("corpora")){
dir.create("corpora")
}
if(!file.exists("corpora/final/en_US")){
temp_zip <- tempfile()
download.file(corpora_url, destfile=temp_zip, mode="wb")
unzip(temp_zip, exdir="corpora")
unlink(temp_zip)
}
##Read the files
con_blogs <- file("corpora/final/en_US/en_US.blogs.txt", open="r")
US_blogs <- readLines(con_blogs, encoding="UTF-8", skipNul=TRUE)
close(con_blogs)
##the news file contains embedded nul characters, so skipNul=TRUE avoids garbled lines
con_news <- file("corpora/final/en_US/en_US.news.txt", open="r")
US_news <- readLines(con_news, encoding="UTF-8", skipNul=TRUE)
close(con_news)
con_twitter <- file("corpora/final/en_US/en_US.twitter.txt", open="r")
US_twitter <- readLines(con_twitter, encoding="UTF-8", skipNul=TRUE)
close(con_twitter)
Once the corpus is downloaded and read into R, we examine some basic statistics of the three files, such as the numbers of lines, words, and characters.
##count lines for each file:
lines_blogs <- length(US_blogs)
lines_news <- length(US_news)
lines_twitter <- length(US_twitter)
##breakdown each file by words:
words_blogs <- unlist(strsplit(US_blogs, "\\s+"))
words_news <- unlist(strsplit(US_news, "\\s+"))
words_twitter <- unlist(strsplit(US_twitter, "\\s+"))
##count words in each file:
wordcount_blogs <- length(words_blogs)
wordcount_news <- length(words_news)
wordcount_twitter <- length(words_twitter)
##calculate average numbers of words per line
##split each line into words (strsplit already returns one word vector per line)
WPL_blogs <- strsplit(US_blogs, "\\s+")
WPL_news <- strsplit(US_news, "\\s+")
WPL_twitter <- strsplit(US_twitter, "\\s+")
lineLength_blogs <- mean(lengths(WPL_blogs))
lineLength_news <- mean(lengths(WPL_news))
lineLength_twitter <- mean(lengths(WPL_twitter))
##count characters in each file:
chr_blogs <- sum(nchar(US_blogs))
chr_news <- sum(nchar(US_news))
chr_twitter <- sum(nchar(US_twitter))
##calculate average word length
wordlength_blogs <- mean(nchar(words_blogs))
wordlength_news <- mean(nchar(words_news))
wordlength_twitter <- mean(nchar(words_twitter))
##summary table
library(knitr)
files_summary <- data.frame(
source_file= c("blogs", "news", "twitter"),
line_count = c(lines_blogs, lines_news, lines_twitter),
word_count = c(wordcount_blogs, wordcount_news, wordcount_twitter),
mean_line_length = c(lineLength_blogs, lineLength_news, lineLength_twitter),
character_count = c(chr_blogs, chr_news, chr_twitter),
mean_word_length = c(wordlength_blogs, wordlength_news, wordlength_twitter)
)
kable(files_summary, align="c", caption = "Summary Statistics of English blogs, news and twitter files")
| source_file | line_count | word_count | mean_line_length | character_count | mean_word_length |
|---|---|---|---|---|---|
| blogs | 899288 | 37334131 | 41.51521 | 206824505 | 4.563911 |
| news | 1010242 | 34372530 | 34.02406 | 203223159 | 4.941762 |
| twitter | 2360148 | 30373543 | 12.86934 | 203223159 | 4.414455 |
From the summary table above, it is notable that blog texts have the largest mean number of words per line, followed by news, with Twitter texts having the fewest. This is expected and consistent with the nature of these types of texts. Mean word lengths are relatively similar, at between 4 and 5 characters per word, with news texts having the largest mean word length.
As we are interested in predicting subsequent words based on the words entered, we need to inspect how frequently words and phrases appear together.
To perform n-gram analysis on the texts, we first need to clean and standardize the data by converting all letters to lower case and removing punctuation, numbers, extra whitespace, and stop words. Due to the size of the files, we will only analyze a 1% sample of the original files. We take 1% from each text source and combine the samples into a single corpus.
library(tm)
library(quanteda)
##Create samples of 1/100 of the original texts
set.seed(1234)
sample_blogs <- sample(US_blogs, lines_blogs/100)
sample_news <- sample(US_news, lines_news/100)
sample_twitter <- sample(US_twitter, lines_twitter/100)
##Create combined corpus
BNT_corpus <- Corpus(VectorSource(c(sample_blogs, sample_news, sample_twitter)))
##converting to lower case
BNT_corpus <- tm_map(BNT_corpus, content_transformer(tolower))
##removing punctuation
BNT_corpus <- tm_map(BNT_corpus, removePunctuation)
##removing numbers
BNT_corpus <- tm_map(BNT_corpus, removeNumbers)
##removing stop words
BNT_corpus <- tm_map(BNT_corpus, removeWords, stopwords("en"))
##removing extra whitespaces
BNT_corpus <- tm_map(BNT_corpus, stripWhitespace)
##convert back to plain text
BNT_tidy <- sapply(BNT_corpus, as.character)
We can see an example of what the texts look like before and after pre-processing:
Original Texts:
sample_blogs[1:3]
## [1] "He looked back at me, his eyes were as dark as coal,"
## [2] "You've set up a problem without stakes. Why does she care who the voice on the phone is? Why would she even listen to him past \"hello?\""
## [3] "Yvonne Strahovski … Peg Mooring"
Texts after pre-processing:
BNT_tidy[1:3]
## [1] " looked back eyes dark coal"
## [2] "youve set problem without stakes care voice phone even listen past hello"
## [3] "yvonne strahovski … peg mooring"
After sampling and cleaning our texts, we can proceed to tokenization and n-gram analysis.
##Tokenization
BNT_tokens <- tokens(BNT_tidy, what = "word", remove_punct = TRUE)
##create unigrams
BNT_uni <- tokens_ngrams(BNT_tokens, n=1)
##convert to DFM
BNT_uniDFM <- dfm(BNT_uni)
##Get top 10 unigrams
BNT_uniTop <- topfeatures(BNT_uniDFM, n=10)
##create a summary dataframe for plotting
uni_Top10 <- data.frame(
unigram = names(BNT_uniTop),
frequency=BNT_uniTop)
##make barplot
library(ggplot2)
ggplot(uni_Top10, aes(x=unigram, y=frequency))+
geom_bar(stat = "identity") +
labs(title = "Top 10 Unigrams",
x = "unigram", y = "frequency")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
BNT_bi <- tokens_ngrams(BNT_tokens, n=2)
BNT_biDFM <- dfm(BNT_bi)
BNT_biTop <- topfeatures(BNT_biDFM, n=10)
bi_Top10 <- data.frame(
bigram = names(BNT_biTop),
frequency = BNT_biTop
)
ggplot(bi_Top10, aes(x=bigram, y=frequency))+
geom_bar(stat = "identity") +
labs(title = "Top 10 Bigrams",
x = "bigram", y = "frequency")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
BNT_tri <- tokens_ngrams(BNT_tokens, n=3)
BNT_triDFM <- dfm(BNT_tri)
BNT_triTop <- topfeatures(BNT_triDFM, n=10)
tri_Top10 <- data.frame(trigram = names(BNT_triTop), frequency = BNT_triTop)
ggplot(tri_Top10, aes(x=trigram, y=frequency))+
geom_bar(stat = "identity") +
labs(title = "Top 10 Trigrams", x = "trigram", y = "frequency")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
We notice that profanity appears with high frequency in these texts. We do not remove it from the analysis at this step of the project, but will develop a plan to handle it in our final model.
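As a rough preview of that plan, one option would be to filter profanity at the token level with quanteda's tokens_remove(). The sketch below assumes a suitable word list is available; the profanity_words vector is a hypothetical placeholder, not an actual list used in this report.
##sketch of token-level profanity filtering (profanity_words is a hypothetical placeholder)
profanity_words <- c("badword1", "badword2")   ##to be replaced with a published profanity list
BNT_tokens_clean <- tokens_remove(BNT_tokens, pattern = profanity_words)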
In this step of the project, we downloaded the dataset, examined the files, and performed exploratory analysis. We also pre-processed the data, including sampling, combining, and text transformation, to create a corpus ready for further analysis. Our next step is to build a prediction model based on n-gram analysis.
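To illustrate the direction of that next step (a sketch only, not the final model), the bigram document-feature matrix built above already contains the counts needed for a simple frequency-based lookup: split each bigram on the underscore separator that tokens_ngrams() uses, then return the most frequent second words observed after a given first word. The predict_next() helper below is hypothetical.
##minimal sketch of a frequency-based next-word lookup built from BNT_biDFM
##(predict_next is a hypothetical helper, not part of this milestone's analysis)
bi_freq <- data.frame(
  bigram = featnames(BNT_biDFM),
  frequency = colSums(BNT_biDFM)
)
parts <- strsplit(bi_freq$bigram, "_", fixed = TRUE)
bi_freq$first <- sapply(parts, `[`, 1)
bi_freq$second <- sapply(parts, `[`, 2)
predict_next <- function(word, n = 3){
  matches <- bi_freq[bi_freq$first == tolower(word), ]
  matches <- matches[order(-matches$frequency), ]
  head(matches$second, n)
}
##returns the n most frequent words observed after "happy" in the sample
predict_next("happy")
In the final model this lookup could be extended to trigrams, with back-off to shorter n-grams when no match is found.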