This is the milestone report for the Coursera Data Science Capstone project. Its goal is to show that I have become familiar with the data and that I am on track to create a prediction algorithm.

1. Dataset

The data come from a corpus called HC Corpora, which was collected in four languages (English, German, Russian and Finnish); I use only the American English text. It comes from three sources (news, blogs and Twitter messages). The files have already been downloaded and are stored on my personal PC.

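# Packages used throughout the report
library(tokenizers)  # tokenize_words(), tokenize_ngrams()
library(stopwords)   # stopwords()
library(dplyr)       # %>%, group_by(), sample_n()
library(ggplot2)     # ggplot()

blogs_raw <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")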
news_raw <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter_raw <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")

data_stat<-data.frame(Lines = c( length(blogs_raw),
                                 length(news_raw),
                                 length(twitter_raw) ),
                      Characters = c( sum(nchar(blogs_raw)),
                                     sum(nchar(news_raw)),
                                     sum(nchar(twitter_raw))),
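                      # tokenize_words() returns one character vector per line, so
                      # length() alone counts lines; sum(lengths()) counts words.
                      # The table printed below was produced with length() and
                      # therefore repeats the line counts in its Words column.
                      Words = c( sum(lengths(tokenize_words(blogs_raw))),
                                 sum(lengths(tokenize_words(news_raw))),
                                 sum(lengths(tokenize_words(twitter_raw)))))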
row.names(data_stat)<-c("Blogs","News","Twitter")
data<-data.frame(type=c(rep('b',length(blogs_raw)),rep('n',length(news_raw)),rep('t',length(twitter_raw))),
                   text=c(blogs_raw,news_raw,twitter_raw))
rm(blogs_raw,news_raw,twitter_raw)
print(data_stat)
##           Lines Characters   Words
## Blogs    899288  206824505  899288
## News      77259   15639408   77259
## Twitter 2360148  162096031 2360148
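
To make the three statistics concrete, here is a minimal sketch on a toy corpus. It assumes base R only: `strsplit()` stands in for `tokenize_words()`, and `toy_lines` is a made-up example, not part of the dataset. A tokenizer that returns one vector per line gives a list whose length is the number of lines, while the total number of words is the sum of the lengths of its elements.

```r
# Toy stand-in for the corpus (hypothetical data).
toy_lines <- c("the quick brown fox", "jumps over", "the lazy dog")

# strsplit() returns a list with one character vector per input line,
# which is the same shape tokenize_words() produces.
tokens <- strsplit(toy_lines, "\\s+")

n_lines <- length(tokens)         # number of list elements = number of lines
n_words <- sum(lengths(tokens))   # total number of tokens over all lines
n_chars <- sum(nchar(toy_lines))  # total number of characters

print(c(Lines = n_lines, Characters = n_chars, Words = n_words))
```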

2. Data selection

After loading the data, I drew a random sample of 3000 lines from each source, since testing the algorithms does not require much computational effort at this stage. I kept the three sources balanced, but this can easily be changed later on. From here on, I analyze only the training set stored in the “data_train” data frame.

set.seed(0)
data_train<-data %>% group_by(type) %>% sample_n(size = 3000)
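
For readers who want to see what the balanced selection does, the following is a rough base R equivalent on a toy data frame. The names `toy`, `per_group` and `toy_train` are hypothetical, and the sample size is 3 rather than 3000 so that the result is easy to inspect.

```r
set.seed(0)

# Toy data frame with unequal group sizes (hypothetical data).
toy <- data.frame(type = rep(c("b", "n", "t"), times = c(5, 8, 6)),
                  text = paste("line", 1:19),
                  stringsAsFactors = FALSE)

per_group <- 3  # the report uses 3000

# Split the row indices by source, then draw the same number of rows
# from each group without replacement.
idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$type),
                     function(rows) rows[sample.int(length(rows), per_group)]))

toy_train <- toy[idx, ]
print(table(toy_train$type))
```

Sampling the same number of rows per group keeps the sources balanced regardless of how many lines each file contains.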

3. Data cleaning

A series of substitutions removes characters and symbols that are not useful for prediction. The comments describe each removal.

# set lowercase
data_train$text <- tolower(data_train$text)
# mentions
data_train$text <- gsub("@\\w+", "", data_train$text)
# urls (up to the next whitespace, so the rest of the line is kept)
data_train$text <- gsub("https?://\\S+", "", data_train$text)
# numbers and tokens starting with a digit
data_train$text <- gsub("\\d+\\w*\\d*", "", data_train$text)
# hashtags
data_train$text <- gsub("#\\w+", "", data_train$text)
# non-ASCII characters (including emojis)
data_train$text <- gsub("[^\x01-\x7F]", "", data_train$text)
# punctuation
data_train$text <- gsub("[[:punct:]]", " ", data_train$text)
# spaces and new lines
data_train$text <- gsub("\n", " ", data_train$text)
data_train$text <- gsub("^\\s+", "", data_train$text)
data_train$text <- gsub("\\s+$", "", data_train$text)
data_train$text <- gsub("[ \t]+", " ", data_train$text)
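
The effect of these steps is easier to check on a single string. The sketch below wraps them in a helper, `clean_text()`, which is a hypothetical name introduced only for this illustration. As an assumption of the sketch, the URL pattern stops at the first whitespace, so text that follows a link is kept, and the whitespace steps are condensed into one `gsub()` plus `trimws()`.

```r
# Hypothetical helper that applies the cleaning steps to a character vector.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("@\\w+", "", x)             # mentions
  x <- gsub("https?://\\S+", "", x)     # URLs, up to the next whitespace
  x <- gsub("\\d+\\w*\\d*", "", x)      # numbers and tokens starting with a digit
  x <- gsub("#\\w+", "", x)             # hashtags
  x <- gsub("[^\x01-\x7F]", "", x)      # non-ASCII characters
  x <- gsub("[[:punct:]]", " ", x)      # punctuation becomes a space
  x <- gsub("\\s+", " ", x)             # collapse runs of whitespace
  trimws(x)                             # drop leading and trailing spaces
}

example <- "Hello @bob, see https://example.com now! #fun 2nd place, 100 points."
cleaned <- clean_text(example)
print(cleaned)
# prints "hello see now place points"
```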

4. Data exploration

The first step in building a predictive model for text is understanding the distribution of words and the relationships between them. The following three sections show the 20 most frequent items of each kind: single words (1-grams), bigrams (2-grams) and trigrams (3-grams).

Before computing each statistic, I remove stopwords, i.e. words that carry little meaning on their own. This is done with a commonly distributed stopword list (there is a different one for each language).
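
As a small illustration of the filtering and counting idea, the sketch below uses a toy word vector and a two-word stopword list. Both `toy_words` and `toy_stop` are made up for the example; the report itself relies on the `stopwords` package.

```r
toy_words <- c("the", "cat", "sat", "on", "the", "mat", "the", "cat")
toy_stop  <- c("the", "on")  # hypothetical stopword list

# Keep only the words that are not in the stopword list.
kept <- toy_words[!(toy_words %in% toy_stop)]

# Count occurrences and sort from most to least frequent.
freq <- sort(table(kept), decreasing = TRUE)

# The n most frequent items (n = 2 here, 20 in the report).
top <- head(freq, 2)
print(top)
```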

4.1 Distribution of Word frequencies (1-gram)

words<-unlist(tokenize_words(data_train$text))
words<-words[!(words %in% stopwords(source = "stopwords-iso"))]
ord<-order(table(words),decreasing=TRUE)
words20<-as.data.frame(table(words)[ord][1:20])

ggplot(words20,aes(words,Freq)) +
  geom_bar(stat="identity",fill="red") +
  ggtitle("Top-20 Single word Frequencies") +
  xlab("Words") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

4.2 Distribution of Bigrams frequencies (2-gram)

bigrams<-unlist(tokenize_ngrams(data_train$text, n = 2))
term1<-sapply(bigrams, function(x) unlist(tokenize_words(x))[1])
term2<-sapply(bigrams, function(x) unlist(tokenize_words(x))[2])
fil<-(term1 %in% stopwords(source = "stopwords-iso"))|(term2 %in% stopwords(source = "stopwords-iso"))
bigrams<-bigrams[!fil]
ord<-order(table(bigrams),decreasing=TRUE)
bigrams20<-as.data.frame(table(bigrams)[ord][1:20])

ggplot(bigrams20,aes(bigrams,Freq)) +
  geom_bar(stat="identity",fill="blue") +
  ggtitle("Top-20 Bigrams Frequencies") +
  xlab("Bigrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

4.3 Distribution of Threegrams frequencies (3-gram)

threegrams<-unlist(tokenize_ngrams(data_train$text, n = 3))
term1<-sapply(threegrams, function(x) unlist(tokenize_words(x))[1])
term2<-sapply(threegrams, function(x) unlist(tokenize_words(x))[2])
term3<-sapply(threegrams, function(x) unlist(tokenize_words(x))[3])
fil<-(term1 %in% stopwords(source = "stopwords-iso"))|
     (term2 %in% stopwords(source = "stopwords-iso"))|
     (term3 %in% stopwords(source = "stopwords-iso"))
threegrams<-threegrams[!fil]
ord<-order(table(threegrams),decreasing=TRUE)
threegrams20<-as.data.frame(table(threegrams)[ord][1:20])

ggplot(threegrams20,aes(threegrams,Freq)) +
  geom_bar(stat="identity",fill="yellow") +
  ggtitle("Top-20 Threegrams Frequencies") +
  xlab("Threegrams") + ylab("Frequency") +
  theme(axis.text.x=element_text(angle=45, hjust=1))
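
The n-gram construction and the stopword filter used in sections 4.2 and 4.3 can be sketched in base R as follows. `make_ngrams()`, `tokens` and `toy_stop` are hypothetical names for this example, and the splitting is done once per n-gram with `strsplit()` rather than with `tokenize_words()`, which is an assumption of the sketch and not the code used in the report.

```r
# Build all n-grams of consecutive tokens (hypothetical helper).
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens   <- c("new", "york", "is", "a", "big", "city")
toy_stop <- c("is", "a")  # hypothetical stopword list

bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Drop every n-gram in which at least one term is a stopword.
has_stop <- vapply(strsplit(bigrams, " ", fixed = TRUE),
                   function(terms) any(terms %in% toy_stop),
                   logical(1))
kept_bigrams <- bigrams[!has_stop]
print(kept_bigrams)
```

Because the filter tests all terms of an n-gram at once, the same `has_stop` expression works unchanged for trigrams.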

5. Conclusion

After this exploratory analysis and these summary statistics, I am ready to choose among different algorithms for the main goal of the project, i.e. predicting the most probable next word in a sequence of words.
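
To show how n-gram counts could feed such a prediction, here is a minimal sketch of one possible baseline: look up the word that most often follows the previous word. This is only an illustration under stated assumptions, not the algorithm chosen for the project. `toy_text`, `pairs` and `predict_next()` are hypothetical names.

```r
toy_text <- c("i love new york", "i love pizza",
              "new york is big", "i love new shoes")
tokens <- strsplit(toy_text, " ", fixed = TRUE)

# Collect (previous word, next word) pairs within each line.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Return the most frequent follower of `word`, or NA if it was never seen.
predict_next <- function(word) {
  followers <- pairs$nxt[pairs$prev == word]
  if (length(followers) == 0) return(NA_character_)
  names(which.max(table(followers)))
}

print(predict_next("i"))    # "love"
print(predict_next("new"))  # "york"
```

A lookup like this has no answer for unseen words, which is one reason to compare several algorithms before settling on one.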