This is the milestone report for the Data Science Capstone project. The report is divided into three parts. First, we load and clean the data from SwiftKey. Second, we perform an exploratory data analysis. Finally, we describe the future work of this project.
In this report we use the English documents of the SwiftKey dataset: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt. To load and analyse these documents with R [1], we use the “tm” [2] and “quanteda” [3] packages. As a first step, we load the three documents and show a short summary.
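The code in this report assumes the following setup. The packages are the ones cited above; the BASE_PATH location and the DO_SAMPLING flag are placeholder values to be adapted to the local directory layout.

library(tm)        # VCorpus, tm_map, TermDocumentMatrix
library(quanteda)  # corpus, tokenize, dfm, topfeatures
library(ggplot2)   # bar plots and histogram

BASE_PATH <- "./"     # placeholder: folder containing en_US/ and en_US_sample/
DO_SAMPLING <- TRUE   # placeholder: set to FALSE to reuse an existing sample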
texts <- VCorpus(DirSource(paste(BASE_PATH,"en_US",sep=""), encoding = "UTF-8"))  # complete (unsampled) documents
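The short summary can be produced, for example, with summary(), which lists the three documents together with their length and class:

summary(texts)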
In this section, we create a function to sample the documents. This function takes an argument that gives the fraction of lines from the original document to include in the sampled document; each line is kept independently with that probability.
sample_document <- function(path_input_doc, path_output_doc, percent_sampling_lines){
  con_in <- file(path_input_doc, open="rb")
  # start from a clean output file
  if (file.exists(path_output_doc)){
    file.remove(path_output_doc)
  }
  con_out <- file(path_output_doc, open="w")
  text_lines <- readLines(con_in, encoding="UTF-8")
  # keep each line independently with probability percent_sampling_lines
  for(i in 1:length(text_lines)){
    if(rbinom(1, 1, percent_sampling_lines) == 1) {
      write(text_lines[i], con_out, append=TRUE)
    }
  }
  close(con_in)
  close(con_out)
}
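# The sampling relies on rbinom, so the selected lines change on every run.
# Setting a seed (the value below is arbitrary) makes the sample reproducible.
set.seed(1234)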
if(DO_SAMPLING){
  percent_sampling_lines <- 0.01
  sample_document("./en_US/en_US.blogs.txt", "./en_US_sample/en_US.blogs.txt", percent_sampling_lines)
  sample_document("./en_US/en_US.news.txt", "./en_US_sample/en_US.news.txt", percent_sampling_lines)
  sample_document("./en_US/en_US.twitter.txt", "./en_US_sample/en_US.twitter.txt", percent_sampling_lines)
}
sample_texts <- VCorpus(DirSource(paste(BASE_PATH,"en_US_sample",sep=""), encoding = "UTF-8"))
In this section, we tokenize the texts and we remove the profane words.
We use the following function to split the text into tokens and to remove punctuation, numbers and extra whitespace, to lowercase the text, and to drop English stop words. We also stem the words of the documents so that we do not distinguish between the singular and plural forms of the same word, between different conjugations of the same verb, and so on.
tokenize_file <- function(text_file){
  text_trans <- tm_map(text_file, content_transformer(tolower))
  text_trans <- tm_map(text_trans, removeWords, stopwords("english"))
  text_trans <- tm_map(text_trans, removePunctuation)
  text_trans <- tm_map(text_trans, removeNumbers)
  text_trans <- tm_map(text_trans, stemDocument)
  text_trans <- tm_map(text_trans, stripWhitespace)
  text_trans
}
This function removes the profane words from the text. The profane-word list was taken from [4].
SWEAR_WORDS <- scan(paste(BASE_PATH,"google_bad_words_utf.txt",sep=""), what="", sep="\n")
filter_profane_words <- function(text_file){
  text_trans <- tm_map(text_file, removeWords, SWEAR_WORDS)
  text_trans <- tm_map(text_trans, stripWhitespace)
  text_trans  # return the filtered corpus explicitly
}
We now execute the actions described in the previous sections: first we tokenize the data, and second we filter the profane words. We use the sampled data in the following sections.
sample_texts <- tokenize_file(sample_texts)
sample_texts <- filter_profane_words(sample_texts)
In the following table and plots, we compute basic metrics for the three files and compare their sizes in terms of lines, words and characters. These metrics are computed over the complete files, not over the sampled files. As the plots show, although the Twitter file has many more lines, the other two files contain more words and more characters.
get_summary_stats <- function(text_file){
  id <- text_file$meta$id
  text_content <- text_file$content
  total_lines <- length(text_content)
  words_per_line <- sapply(strsplit(text_content, "\\s+"), length)
  words_longest_line <- max(words_per_line)
  total_words <- sum(words_per_line)
  chars_per_line <- sapply(text_content, nchar)
  chars_longest_line <- max(chars_per_line)
  total_chars <- sum(chars_per_line)
  c(id, total_lines, total_words, total_chars, words_longest_line, chars_longest_line)
}
summary_tab <- data.frame(rbind(get_summary_stats(texts[[1]]),
get_summary_stats(texts[[2]]),
get_summary_stats(texts[[3]])))
colnames(summary_tab) <- c("file","lines","words","chars","words longest line","chars longest line")
summary_tab$lines <- as.numeric(as.character(summary_tab$lines))
summary_tab$words <- as.numeric(as.character(summary_tab$words))
summary_tab$chars <- as.numeric(as.character(summary_tab$chars))
summary_tab$"words longest line" <- as.numeric(as.character(summary_tab$"words longest line"))
summary_tab$"chars longest line" <- as.numeric(as.character(summary_tab$"chars longest line"))
ggplot(data=summary_tab, aes(x=file, y=lines, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
ggplot(data=summary_tab, aes(x=file, y=words, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
ggplot(data=summary_tab, aes(x=file, y=chars, fill=file)) +
geom_bar(stat="identity", position=position_dodge(), colour="black")
For this project, we want to predict the next word that a user could write in a sentence. Therefore, it is important to give a preliminary analysis of word frequencies. In this section, we use the sampled data, which gives a good picture of the corpus while reducing the computing requirements. For this task we mainly use the TermDocumentMatrix function from the “tm” [2] package, which builds a matrix that stores, for each document, the number of occurrences of each word.
tdm <- TermDocumentMatrix(sample_texts)
m <- as.matrix(tdm)
col_names <- colnames(m)
m <- cbind(m,rowSums(m))
colnames(m) <- c(col_names,"frequency")
freq_df <- data.frame(m)
freq_df <- freq_df[order(freq_df$frequency,decreasing = TRUE),]
In the following table, we show the 10 most frequent words across the three documents.
head(freq_df,10)
## en_US.blogs.txt en_US.news.txt en_US.twitter.txt frequency
## one 1375 908 855 3138
## will 1098 1126 882 3106
## just 956 587 1512 3055
## like 1124 606 1285 3015
## get 928 585 1474 2987
## said 378 2370 174 2922
## can 1033 606 969 2608
## time 973 664 860 2497
## day 705 430 1104 2239
## year 622 1117 389 2128
In the following histogram, we count the number of times (the frequency) that each word appears in the whole sample, restricting the plot to frequencies from 1 to 10. We see that the most common situation is a word that appears only once.
ggplot(data=freq_df, aes(x=frequency)) + geom_histogram() +
  scale_x_continuous(breaks=1:10, limits=c(0, 10))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
However, most of the text consists of words that are repeated many times. In the following code, we compute the fraction of the text that is covered by the most frequent words.
freq_df$cum_freq <- 0
num_total_words <- sum(freq_df$frequency)
# walk from the least to the most frequent word so that, for each row i,
# cum_freq holds the number of word occurrences covered by rows 1..i
i <- nrow(freq_df)
while(i > 0){
  freq_df[i,]$cum_freq <- num_total_words
  num_total_words <- num_total_words - freq_df[i,]$frequency
  i <- i - 1
}
num_total_words <- sum(freq_df$frequency)
freq_df$percent_total_words <- freq_df$cum_freq/num_total_words
freq_df_50 <- freq_df[freq_df$percent_total_words<0.5,]
freq_df_90 <- freq_df[freq_df$percent_total_words<0.9,]
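The loop above is simply a cumulative sum over the frequency-ordered words. An equivalent, more direct way to obtain the same columns and the word counts reported below (assuming freq_df is still sorted by decreasing frequency) would be:

freq_df$cum_freq <- cumsum(freq_df$frequency)
freq_df$percent_total_words <- freq_df$cum_freq / sum(freq_df$frequency)
nrow(freq_df[freq_df$percent_total_words < 0.5,])  # words covering 50% of the text
nrow(freq_df[freq_df$percent_total_words < 0.9,])  # words covering 90% of the text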
We see that 580 words cover 50% of the total sampled text and 7176 words cover 90% of it.
In this section, we compute the frequencies of the bigrams and trigrams of the sampled corpus. Bigrams are sequences of two consecutive words in the corpus; similarly, trigrams are sequences of three consecutive words.
corp_blogs <- corpus(sample_texts[[1]]$content)
## Non-UTF-8 encoding (possibly) detected :ISO-8859-9.
corp_news <- corpus(sample_texts[[2]]$content)
corp_twitter <- corpus(sample_texts[[3]]$content)
## Non-UTF-8 encoding (possibly) detected :ISO-8859-2.
corp_all <- corp_blogs + corp_news + corp_twitter
NUM_TOP_EXPRESSIONS <- 10
Frequencies of the 10 most frequent bigrams in the complete sampled corpus:
bigram_all <- tokenize(corp_all, ngrams = 2)
bigram_all <- unlist(bigram_all, use.names = FALSE)
dfm_bigram_all <- dfm(bigram_all)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 506,610 documents
## ... indexing features: 397,651 feature types
## ... created a 506610 x 397651 sparse dfm
## ... complete.
## Elapsed time: 9.648 seconds.
top_bigram_all <- topfeatures(dfm_bigram_all, n=NUM_TOP_EXPRESSIONS)
top_bigram_all
## right_now last_year new_york look_like year_ago
## 271 225 214 205 170
## look_forward feel_like last_night last_week high_school
## 168 165 163 138 130
Frequencies of the 10 most frequent trigrams in the complete sampled corpus:
trigram_all <- tokenize(corp_all, ngrams = 3)
trigram_all <- unlist(trigram_all, use.names = FALSE)
dfm_trigram_all <- dfm(trigram_all)
##
## ... lowercasing
## ... tokenizing
## ... indexing documents: 465,254 documents
## ... indexing features: 459,277 feature types
## ... created a 465254 x 459277 sparse dfm
## ... complete.
## Elapsed time: 8.627 seconds.
top_trigram_all <- topfeatures(dfm_trigram_all, n=NUM_TOP_EXPRESSIONS)
top_trigram_all
## new_york_citi happi_mother_day presid_barack_obama
## 31 28 25
## happi_new_year let_us_know look_forward_see
## 20 19 17
## two_year_ago world_war_ii cinco_de_mayo
## 16 15 15
## _s
## 15
The main outcome of this project will be a predictive application in which the user types words and the application suggests the next word. The next steps are to use the n-gram frequencies computed above to build a next-word prediction model and to integrate that model into an application for users.
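As a rough illustration of the prediction step (a minimal sketch: the helper predict_next_word is hypothetical and works on a named bigram-frequency vector such as the one returned by topfeatures above, with the two words joined by “_”), the model could look up the most frequent bigrams starting with the user’s last word:

# Hypothetical helper: suggest up to n likely next words after `word`,
# given a named vector of bigram frequencies (names like "right_now").
predict_next_word <- function(word, bigram_freqs, n = 3){
  pattern <- paste0("^", word, "_")
  candidates <- bigram_freqs[grepl(pattern, names(bigram_freqs))]
  if (length(candidates) == 0) return(character(0))
  candidates <- sort(candidates, decreasing = TRUE)
  sub(pattern, "", names(head(candidates, n)))
}

# Example: suggestions after the (stemmed) word "new"
predict_next_word("new", topfeatures(dfm_bigram_all, n = 5000))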
[1]R Core Team, R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing, 2015 [Online]. Available: https://www.R-project.org/
[2]I. Feinerer and K. Hornik, Tm: Text mining package. 2015 [Online]. Available: http://CRAN.R-project.org/package=tm
[3]K. Benoit and P. Nulty, Quanteda: Quantitative analysis of textual data. 2015 [Online]. Available: https://CRAN.R-project.org/package=quanteda
[4]“Full list of bad words banned by Google,” freewebheaders.com. [Online]. Available: http://www.freewebheaders.com/full-list-of-bad-words-banned-by-google/