Summary

This document summarizes the exploratory analysis of the dataset provided by Coursera for the Data Science Capstone project of the Specialization.

Dataset

Several groups of data files, in different languages, are available for the analysis. I have chosen to work with the English (en_US) dataset: three files with text samples coming from blogs, news and Twitter.

#Read data files
data_twitter<-scan("../Coursera-SwiftKey/final/en_US/en_US.twitter.txt", what="character",sep="\n")
data_blogs<-scan("../Coursera-SwiftKey/final/en_US/en_US.blogs.txt", what="character",sep="\n")
data_news<-scan("../Coursera-SwiftKey/final/en_US/en_US.news.txt", what="character",sep="\n")
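
Note that some of these files may contain embedded control characters that, on some platforms, can make scan() stop before the end of the file. A possible workaround, shown here only as a sketch (the analysis below keeps the scan() results), is to open the file in binary mode and read it with readLines, skipping embedded nuls:

#Sketch only: binary-mode read that is not stopped by embedded control characters
con <- file("../Coursera-SwiftKey/final/en_US/en_US.news.txt", open="rb")
data_news_full <- readLines(con, encoding="UTF-8", skipNul=TRUE)
close(con)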

A basic description of each file follows:

The number of text samples in each file ranges from approximately 77K to 2.4M:

#Number of rows
length(data_twitter)
## [1] 2360148
length(data_blogs)
## [1] 899288
length(data_news)
## [1] 77259

Text samples coming from the blogs are in general the longest, while, as expected, those coming from Twitter are the shortest. In fact, the longest Twitter text in this sample has only 213 characters. Blog and news texts are similar in average length, but the longest text in the blogs file is much larger.

#Summary of text lengths (in characters) per source
summary(nchar(data_twitter))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     2.0    37.0    64.0    68.8   100.0   213.0
summary(nchar(data_blogs))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     1.0    47.0   157.0   231.7   331.0 40835.0
summary(nchar(data_news))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       2     111     186     203     270    5760

Word analysis

An interesting question, which will determine the best approach for our predictive algorithm, is what the word frequency distribution looks like. I will follow these steps to analyze the data:

  1. I get a small sample of each dataset to reduce the time required for the analysis: 10,000 texts from each type of source.
  2. I merge the 3 sources into a single vector of texts.
  3. I identify the unique characters that appear in the texts and decide how I am going to define what a word is (which characters are allowed and how I treat special characters like commas, periods, etc.).
#Merge of 3 samples of 10,000 texts into a single vector
text_sample<-c(
  data_twitter[sample(1:length(data_twitter),10000,replace = FALSE)],
  data_blogs[sample(1:length(data_blogs),10000,replace = FALSE)],
  data_news[sample(1:length(data_news),10000,replace = FALSE)]
)
#Unique characters
unique_characters <- unique(strsplit(toString(text_sample),"")[[1]])
#Number of different characters
length(unique_characters)
## [1] 180
#Display characters
unique_characters[order(unique_characters)]
##   [1] "­"  "'"  "-"  "–"  "—"  " "  " "  "!"  "\"" "#"  "$"  "%"  "&"  "(" 
##  [15] ")"  "*"  ","  "."  "/"  ":"  ";"  "?"  "@"  "["  "\\" "]"  "^"  "ˆ" 
##  [29] "_"  "`"  "{"  "|"  "}"  "~"  "¡"  "¦"  "¨"  "¯"  "´"  "¸"  "¿"  "˜" 
##  [43] "‘"  "’"  "‚"  "“"  "”"  "„"  "‹"  "›"  "¢"  "£"  "¤"  "¥"  "€"  "+" 
##  [57] "<"  "="  ">"  "±"  "«"  "»"  "§"  "©"  "¬"  "®"  "°"  "µ"  "¶"  "·" 
##  [71] "…"  "†"  "‡"  "•"  "‰"  ""  ""  ""  ""  ""  "0"  "¼"  "½"  "1" 
##  [85] "¹"  "2"  "²"  "3"  "³"  "4"  "5"  "6"  "7"  "8"  "9"  "a"  "A"  "ª" 
##  [99] "á"  "â"  "Â"  "Ä"  "ã"  "Ã"  "å"  "æ"  "b"  "B"  "c"  "C"  "ç"  "d" 
## [113] "D"  "ð"  "e"  "E"  "é"  "ê"  "ë"  "Ë"  "f"  "F"  "ƒ"  "g"  "G"  "h" 
## [127] "H"  "i"  "I"  "í"  "Í"  "ì"  "î"  "ï"  "Ï"  "j"  "J"  "k"  "K"  "l" 
## [141] "L"  "m"  "M"  "n"  "N"  "o"  "O"  "º"  "Ø"  "œ"  "Œ"  "p"  "P"  "q" 
## [155] "Q"  "r"  "R"  "s"  "S"  "š"  "Š"  "t"  "T"  "™"  "u"  "U"  "Ù"  "v" 
## [169] "V"  "w"  "W"  "x"  "X"  "y"  "Y"  "Ÿ"  "z"  "Z"  "ž"  "Ž"

After inspecting the characters present in the data, I will apply the following rules:

  1. I consider as valid only words that contain alphabetic characters, both capital (A,B,C…) and lower-case (a,b,c…). I also consider the character “’” as a valid character. By doing so, I am implicitly deciding that contractions such as “I’ve”, “I’m” and “You’re” will be treated as words. This seems easier than dealing with each part of a contraction separately, but I will then need to remove words with this character that are not contractions.
  2. I split the texts in the sample using the blank space as a separator, as well as other potential separators: .,;:)(?!.
  3. I eliminate words that do not contain any of the valid characters (for example, tokens made up only of digits or symbols).
  4. I also eliminate words containing the special character “’” that are not in a list of frequent contractions.
#Common contractions
contractions<- c("aren't","can't","couldn't","didn't","doesn't","hadn't","hasn't","haven't","he'll","he's","I'd","I'll","I'm","I've","isn't","let's","mightn't","shan't","she'd","she's","shouldn't","that's","there's","they'd","they'll","they're","we'd","we're","we've","weren't","what'll","what're","what's","what've","where's","who'd","who'll","who're","who's","who've","won't","wouldn't","you'd","you'll","you're","you've")
#Create a vector with all the possible words
words<-unlist(lapply(text_sample,strsplit,"[ .,;:)(?!]"))
#Eliminate empty words
words<-words[words!=""]
#Keep only words that contain at least one valid character (a letter or ')
words <- words[ grepl("[a-zA-Z']",words) ]
#Eliminate words containing "'" that are not frequent contractions
words <- words[(words %in% contractions) | !grepl("'",words)]
#Number of valid words
length(words)
## [1] 853888

After this cleaning process, I can evaluate the most common words.

#Load the packages used for the frequency tables and plots
library(data.table)
library(ggplot2)
#Word frequency table
word_freq<-as.data.table(table(words))
names(word_freq)<-c("word","freq")
word_freq<-word_freq[order(-freq),]
word_freq[,freq_perc:=freq/sum(freq)]
word_freq[,freq_acum_perc:=cumsum(freq)/sum(freq)]
word_freq[,word:=factor(word,levels=word)]
gnews<-ggplot(data=word_freq[1:50,],aes(x=word,y=freq_perc)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle(label="Most frequent words")
gnews

#Dictionary size and number of distinct words below the 80%, 95% and 99% cumulative frequency thresholds
num_words<-nrow(word_freq)
percent_80<-nrow(word_freq[freq_acum_perc<0.8,])
percent_95<-nrow(word_freq[freq_acum_perc<0.95,])
percent_99<-nrow(word_freq[freq_acum_perc<0.99,])

I see that “the”, “to” and “and” are the top 3 most frequent words. The counts computed above also show how large the dictionary is and how many distinct words fall below the 80%, 95% and 99% cumulative frequency thresholds, which indicates how diverse the vocabulary is.
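
For reference, these counts can be displayed side by side; the snippet below is only a minimal sketch using the variables computed above (the column names are illustrative):

#Display dictionary size and coverage counts together (variables computed above)
data.frame(total_unique_words=num_words,
           words_below_80_percent=percent_80,
           words_below_95_percent=percent_95,
           words_below_99_percent=percent_99)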

Pairs of words

The predictive algorithm will certainly use information about how often one word follows another. In this section I show the 50 pairs of words that appear most often.

To do so, I extract pairs of consecutive words from each text.

#Split each text into words, following the previous criteria
words_text<-sapply(text_sample,strsplit," |[.,;:)(?!]")
#Clean the tokens as before: drop empty ones, keep those with valid characters, allow listed contractions
words_text<-sapply(words_text, function(e) {
  e<-e[e!=""]
  e <- e[ grepl("[a-zA-Z']",e) ]
  e <- e[(e %in% contractions) | !grepl("'",e)]  
  })
#Create pairs of words
word_pairs<-sapply(words_text, function(e) {
  if (length(e)>1) {
    e1<-e[1:(length(e)-1)]
    e2<-e[2:length(e)]
    paste(e1,e2,sep="-")
  }
  })
#Calculate frequency of each pair of words
word_pairs<-unlist(word_pairs)
word_pairs_freq<-as.data.table(table(word_pairs))
names(word_pairs_freq)<-c("word","freq")
word_pairs_freq<-word_pairs_freq[order(-freq),]
word_pairs_freq[,freq_perc:=freq/sum(freq)]
word_pairs_freq[,freq_acum_perc:=cumsum(freq)/sum(freq)]
word_pairs_freq[,word:=factor(word,levels=word)]
gnews<-ggplot(data=word_pairs_freq[1:50,],aes(x=word,y=freq_perc)) + 
  geom_bar(stat="identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle(label="Most frequent pair of words")
gnews

#Number of distinct pairs and pairs below the 80%, 95% and 99% cumulative frequency thresholds
num_word_pairs<-nrow(word_pairs_freq)
percent_80_pairs<-nrow(word_pairs_freq[freq_acum_perc<0.8,])
percent_95_pairs<-nrow(word_pairs_freq[freq_acum_perc<0.95,])
percent_99_pairs<-nrow(word_pairs_freq[freq_acum_perc<0.99,])

It seems clear that some pairs of words are very frequent, so I could use this property to predict the next word that a user is going to type.

I see that “of-the”, “in-the” and “to-the” are the top 3 most frequent pairs of words. As with single words, the counts computed above show how many distinct pairs fall below the 80%, 95% and 99% cumulative frequency thresholds, i.e. how diverse the set of pairs is.

Next steps of this project:

  1. I could group together all the pairs of words that start with the same word (in-the, in-a, …) and prioritize the most probable continuation; a small sketch of this idea follows this list.
  2. I can extend the pairs to triplets, in order to be more accurate.
  3. I can use some contextual information to better select which pair of words to offer when someone types a word.
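
The sketch below shows how the word_pairs_freq table from the previous section could already be used as a simple next-word lookup. The function name predict_next_word is a hypothetical helper, and because pairs are stored as “first-second” strings, the lookup is ambiguous for hyphenated words; it is only a rough outline of the idea.

#Sketch: most frequent continuation of a given first word, based on word_pairs_freq
predict_next_word <- function(w, pair_freq = word_pairs_freq) {
  #Pairs starting with "w-"; word_pairs_freq is already sorted by decreasing frequency
  hits <- pair_freq[startsWith(as.character(word), paste0(w, "-"))]
  if (nrow(hits) == 0) return(NA_character_)
  #Drop the first word and the dash to recover the predicted next word
  substring(as.character(hits$word[1]), nchar(w) + 2)
}
predict_next_word("in")  #for this sample, expected to return "the"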