This document summarizes the exploration of the dataset provided by Coursera for the Data Science Capstone project in the Specialization course.
Three different groups of data files are available for the analysis. I have chosen to work with the English dataset: three files containing text samples from blogs, news and Twitter.
#Read data files
data_twitter<-scan("../Coursera-SwiftKey/final/en_US/en_US.twitter.txt", what="character",sep="\n")
data_blogs<-scan("../Coursera-SwiftKey/final/en_US/en_US.blogs.txt", what="character",sep="\n")
data_news<-scan("../Coursera-SwiftKey/final/en_US/en_US.news.txt", what="character",sep="\n")
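Although not part of the original exploration, the size of each file on disk can be checked with base R before loading; a quick sketch, assuming the same relative paths as above:
#Sketch: file sizes in MB (not part of the original analysis)
round(file.size("../Coursera-SwiftKey/final/en_US/en_US.twitter.txt")/1024^2, 1)
round(file.size("../Coursera-SwiftKey/final/en_US/en_US.blogs.txt")/1024^2, 1)
round(file.size("../Coursera-SwiftKey/final/en_US/en_US.news.txt")/1024^2, 1)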
A basic description of each file follows:
The number of text samples in each file ranges from approximately 77K to 2.3M:
#Number of rows
length(data_twitter)
## [1] 2360148
length(data_blogs)
## [1] 899288
length(data_news)
## [1] 77259
Text samples coming from the blogs are in general the longest, while, as expected, those coming from Twitter are the shortest. In fact, the Twitter texts are limited to 213 characters in these examples. Blog and news texts have similar average lengths, but the longest text in the blogs file is much larger.
#Summary of texts per source
summary(sapply(data_twitter,nchar,simplify=T))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.0 37.0 64.0 68.8 100.0 213.0
summary(sapply(data_blogs,nchar,simplify=T))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.0 47.0 157.0 231.7 331.0 40835.0
summary(sapply(data_news,nchar,simplify=T))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 111 186 203 270 5760
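As a side note, nchar() is already vectorized, so the same summaries can be obtained without sapply(); a minimal equivalent form:
#Equivalent, vectorized form (same results as above)
summary(nchar(data_twitter))
summary(nchar(data_blogs))
summary(nchar(data_news))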
An interesting question, which will determine the best approach for our predictive algorithm, is what the word frequency distribution looks like. I will follow these steps to analyze the data:
#Merge of 3 samples of 10,000 texts into a single vector
text_sample=c(
data_twitter[sample(1:length(data_twitter),10000,replace = FALSE)],
data_blogs[sample(1:length(data_blogs),10000,replace = FALSE)],
data_news[sample(1:length(data_news),10000,replace = FALSE)]
)
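Note that no random seed is fixed, so the exact counts reported below may change slightly between runs. A reproducible variant of the sampling step would look like the following sketch (the seed value is arbitrary):
#Sketch: reproducible version of the sampling above (seed value is arbitrary)
set.seed(1234)
text_sample<-c(
data_twitter[sample(length(data_twitter),10000)],
data_blogs[sample(length(data_blogs),10000)],
data_news[sample(length(data_news),10000)]
)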
#Unique characters
unique_characters <- unique(strsplit(toString(text_sample),"")[[1]])
#Number of different characters
length(unique_characters)
## [1] 180
#Display characters
unique_characters[order(unique_characters)]
## [1] "Â" "'" "-" "–" "—" " " " " "!" "\"" "#" "$" "%" "&" "("
## [15] ")" "*" "," "." "/" ":" ";" "?" "@" "[" "\\" "]" "^" "ˆ"
## [29] "_" "`" "{" "|" "}" "~" "¡" "¦" "¨" "¯" "´" "¸" "¿" "˜"
## [43] "‘" "’" "‚" "“" "”" "„" "‹" "›" "¢" "£" "¤" "¥" "€" "+"
## [57] "<" "=" ">" "±" "«" "»" "§" "©" "¬" "®" "°" "µ" "¶" "·"
## [71] "Â…" "†" "‡" "•" "‰" "Â" "Â" "Â" "Â" "Â" "0" "¼" "½" "1"
## [85] "¹" "2" "²" "3" "³" "4" "5" "6" "7" "8" "9" "a" "A" "ª"
## [99] "á" "â" "Â" "Ä" "ã" "Ã" "å" "æ" "b" "B" "c" "C" "ç" "d"
## [113] "D" "ð" "e" "E" "é" "ê" "ë" "Ë" "f" "F" "ƒ" "g" "G" "h"
## [127] "H" "i" "I" "Ã" "Ã" "ì" "î" "ï" "Ã" "j" "J" "k" "K" "l"
## [141] "L" "m" "M" "n" "N" "o" "O" "º" "Ø" "œ" "Œ" "p" "P" "q"
## [155] "Q" "r" "R" "s" "S" "š" "Š" "t" "T" "™" "u" "U" "Ù" "v"
## [169] "V" "w" "W" "x" "X" "y" "Y" "Ÿ" "z" "Z" "ž" "Ž"
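Many of these symbols fall outside plain ASCII and several look like encoding artifacts. A quick sketch (not part of the original analysis) to isolate them, which motivates the cleaning rules below:
#Sketch: characters outside the printable ASCII range
non_ascii <- unique_characters[grepl("[^ -~]", unique_characters)]
length(non_ascii)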
After inspecting the characters present in the data, I will apply the following rules:
#Common contractions
contractions<- c("aren't","can't","couldn't","didn't","doesn't","hadn't","hasn't","haven't","he'll","he's","I'd","I'll","I'm","I've","isn't","let's","mightn't","shan't","she'd","she's","shouldn't","that's","there's","they'd","they'll","they're","we'd","we're","we've","weren't","what'll","what're","what's","what've","where's","who'd","who'll","who're","who's","who've","won't","wouldn't","you'd","you'll","you're","you've")
#Create a vector with all the possible words
words<-unlist(lapply(text_sample,strsplit,"[ .,;:)(?!]"))
#Eliminate empty words
words<-words[words!=""]
#Keep only tokens that contain at least one valid character (letter or apostrophe)
words <- words[ grepl("[a-zA-Z']",words) ]
#Eliminate words containing "'" that are not frequent contractions
words <- words[(words %in% contractions) | !grepl("'",words)]
#Number of valid words
length(words)
## [1] 853888
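A quick sanity check can confirm that no empty tokens and no letter-free tokens survive; given the rules above, both checks in this sketch should return FALSE:
#Sanity check (not in the original analysis): both should be FALSE
any(words == "")
any(!grepl("[a-zA-Z]", words))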
After this cleaning process, I can evaluate the most common words.
#Load the packages required below (data.table for the frequency table, ggplot2 for the plots)
library(data.table)
library(ggplot2)
word_freq<-as.data.table(table(words))
names(word_freq)<-c("word","freq")
word_freq<-word_freq[order(-freq),]
word_freq[,freq_perc:=freq/sum(freq)]
word_freq[,freq_acum_perc:=cumsum(freq)/sum(freq)]
word_freq[,word:=factor(word,levels=word)]
gnews<-ggplot(data=word_freq[1:50,],aes(x=word,y=freq_perc)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle(label="Most frequent words")
gnews
num_words<-nrow(word_freq)
percent_80 <-nrow(word_freq[freq_acum_perc<0.8,])
percent_95 <- nrow(word_freq[freq_acum_perc<0.95,])
percent_99 <- nrow(word_freq[freq_acum_perc<0.99,])
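The figures computed above can be printed together to quantify how concentrated the vocabulary is, i.e. how many distinct words are needed to cover 80%, 95% and 99% of all word occurrences; a short sketch (the labels are only for display):
#Number of distinct words needed to reach each coverage level
c(total=num_words, cover_80=percent_80, cover_95=percent_95, cover_99=percent_99)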
I see that “the”, “to” and “and” are the three most frequent words. I can also see how diverse the resulting dictionary is.
The predictive algorithm will certainly use information about how often one word follows another. In this section I show the 50 pairs of words that appear most often.
To do so, I extract pairs of consecutive words from each text.
#Split each text in words, following the previous criteria
words_text<-sapply(text_sample,strsplit," |[.,;:)(?!]")
#Process data, cleaning invalid characters and allowing for frequent contractions, as before
words_text<-sapply(words_text, function(e) {
e<-e[e!=""]
e <- e[ grepl("[a-zA-Z']",e) ]
e <- e[(e %in% contractions) | !grepl("'",e)]
})
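A quick spot check on the first processed text helps to verify that the per-text tokenization behaves as intended before building the pairs:
#Spot check: first tokens of the first sampled text after cleaning
head(words_text[[1]], 10)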
#Create pairs of words
word_pairs<-sapply(words_text, function(e) {
if (length(e)>1) {
e1<-e[-length(e)] #all words except the last one
e2<-e[-1]         #all words except the first one
paste(e1,e2,sep="-")
}
})
#Calculate frequency of each pair of words
word_pairs<-unlist(word_pairs)
word_pairs_freq<-as.data.table(table(word_pairs))
names(word_pairs_freq)<-c("word","freq")
word_pairs_freq<-word_pairs_freq[order(-freq),]
word_pairs_freq[,freq_perc:=freq/sum(freq)]
word_pairs_freq[,freq_acum_perc:=cumsum(freq)/sum(freq)]
word_pairs_freq[,word:=factor(word,levels=word)]
gnews<-ggplot(data=word_pairs_freq[1:50,],aes(x=word,y=freq_perc)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle(label="Most frequent pairs of words")
gnews
num_word_pairs<-nrow(word_pairs_freq)
percent_80_pairs <-nrow(word_pairs_freq[freq_acum_perc<0.8,])
percent_95_pairs <- nrow(word_pairs_freq[freq_acum_perc<0.95,])
percent_99_pairs <- nrow(word_pairs_freq[freq_acum_perc<0.99,])
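As with single words, the coverage figures for word pairs can be reported in a single vector:
#Number of distinct word pairs needed to reach each coverage level
c(total=num_word_pairs, cover_80=percent_80_pairs, cover_95=percent_95_pairs, cover_99=percent_99_pairs)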
It seems clear that some pairs of words are really frequent, so this property could be used to predict the next word that a user is going to type.
I see that “of-the”, “in-the” and “to-the” are the three most frequent pairs of words. I can also see how diverse the resulting set of word pairs is.
Next steps of this project: