This is a milestone report for the Coursera Data Science Capstone project. Its goal is to show that I am comfortable working with the data and that I am on track to build a prediction algorithm.
The data come from a corpus called HC Corpora. The corpus was collected in four languages (English, German, Russian and Finnish), but I will only use the American English texts, drawn from three sources (news, blogs and Twitter messages). The files have already been downloaded and stored on my PC.
library(dplyr)       # group_by, sample_n
library(tokenizers)  # tokenize_words, tokenize_ngrams
library(stopwords)   # stopword lists
library(ggplot2)     # plots

blogs_raw <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8")
news_raw <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8")
twitter_raw <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8")
data_stat <- data.frame(Lines = c(length(blogs_raw),
                                  length(news_raw),
                                  length(twitter_raw)),
                        Characters = c(sum(nchar(blogs_raw)),
                                       sum(nchar(news_raw)),
                                       sum(nchar(twitter_raw))),
                        # total number of word tokens per source
                        Words = c(sum(lengths(tokenize_words(blogs_raw))),
                                  sum(lengths(tokenize_words(news_raw))),
                                  sum(lengths(tokenize_words(twitter_raw)))))
row.names(data_stat)<-c("Blogs","News","Twitter")
data <- data.frame(type = c(rep('b', length(blogs_raw)),
                            rep('n', length(news_raw)),
                            rep('t', length(twitter_raw))),
                   text = c(blogs_raw, news_raw, twitter_raw))
rm(blogs_raw,news_raw,twitter_raw)
print(data_stat)
## Lines Characters Words
## Blogs 899288 206824505 899288
## News 77259 15639408 77259
## Twitter 2360148 162096031 2360148
After loading the data, I drew a random sample of 3000 lines from each source, since testing the algorithms does not require the full corpus and a smaller sample keeps the computational effort low. I preferred to keep the three sources balanced, but this can easily be changed later. From here on, I analyse only this training set, stored in the “data_train” data frame.
set.seed(0)
data_train<-data %>% group_by(type) %>% sample_n(size = 3000)
A series of substitutions is then applied to remove characters and symbols that are not useful for prediction. The comments describe each removal.
# set lowercase
data_train$text <- tolower(data_train$text)
# mentions
data_train$text <- gsub("@\\w+", "", data_train$text)
# urls
data_train$text <- gsub("https?://.+", "", data_train$text)
# numbers
data_train$text <- gsub("\\d+\\w*\\d*", "", data_train$text)
# hashtags
data_train$text <- gsub("#\\w+", "", data_train$text)
# non-ASCII characters (including emojis)
data_train$text <- gsub("[^\x01-\x7F]", "", data_train$text)
# punctuation
data_train$text <- gsub("[[:punct:]]", " ", data_train$text)
# spaces and new lines
data_train$text <- gsub("\n", " ", data_train$text)
data_train$text <- gsub("^\\s+", "", data_train$text)
data_train$text <- gsub("\\s+$", "", data_train$text)
data_train$text <- gsub("[ \t]+", " ", data_train$text)
The first step in building a predictive text model is understanding the distribution of the words and their frequencies. In the following three sections I show the frequencies of the 20 most frequent single words, bigrams and trigrams.
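For readers who are not familiar with n-grams, here is a quick illustration, on a made-up sentence, of what the tokenizers produce:

tokenize_ngrams("the quick brown fox", n = 2)[[1]]
## [1] "the quick"   "quick brown" "brown fox"
tokenize_ngrams("the quick brown fox", n = 3)[[1]]
## [1] "the quick brown" "quick brown fox"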
Before computing each statistic, I remove words that are not significant for the meaning of the text. This is done using a commonly distributed and classified stopword set (a different set exists for each language).
words<-unlist(tokenize_words(data_train$text))
words<-words[!(words %in% stopwords(source = "stopwords-iso"))]
ord<-order(table(words),decreasing=TRUE)
words20<-as.data.frame(table(words)[ord][1:20])
ggplot(words20,aes(words,Freq)) +
geom_bar(stat="identity",fill="red") +
ggtitle("Top-20 Single word Frequencies") +
xlab("Words") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
bigrams<-unlist(tokenize_ngrams(data_train$text, n = 2))
term1<-sapply(bigrams, function(x) unlist(tokenize_words(x))[1])
term2<-sapply(bigrams, function(x) unlist(tokenize_words(x))[2])
fil<-(term1 %in% stopwords(source = "stopwords-iso"))|(term2 %in% stopwords(source = "stopwords-iso"))
bigrams<-bigrams[!fil]
ord<-order(table(bigrams),decreasing=TRUE)
bigrams20<-as.data.frame(table(bigrams)[ord][1:20])
ggplot(bigrams20,aes(bigrams,Freq)) +
geom_bar(stat="identity",fill="blue") +
ggtitle("Top-20 Bigrams Frequencies") +
xlab("Bigrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
threegrams<-unlist(tokenize_ngrams(data_train$text, n = 3))
term1<-sapply(threegrams, function(x) unlist(tokenize_words(x))[1])
term2<-sapply(threegrams, function(x) unlist(tokenize_words(x))[2])
term3<-sapply(threegrams, function(x) unlist(tokenize_words(x))[3])
fil<-(term1 %in% stopwords(source = "stopwords-iso"))|
(term2 %in% stopwords(source = "stopwords-iso"))|
(term3 %in% stopwords(source = "stopwords-iso"))
threegrams<-threegrams[!fil]
ord<-order(table(threegrams),decreasing=TRUE)
threegrams20<-as.data.frame(table(threegrams)[ord][1:20])
ggplot(threegrams20,aes(threegrams,Freq)) +
geom_bar(stat="identity",fill="yellow") +
ggtitle("Top-20 Threegrams Frequencies") +
xlab("Threegrams") + ylab("Frequency") +
theme(axis.text.x=element_text(angle=45, hjust=1))
After this exploratory analysis, I am now ready to compare different algorithms for the main goal of the project, i.e. predicting the most probable next word in a sequence of text.
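As a first idea (a minimal sketch of one candidate approach, not the final algorithm), the bigram counts computed above already allow a naive next-word lookup: given the last observed word, return the second term of the most frequent bigram that starts with it. The predict_next function below is hypothetical and only illustrates the principle.

# naive bigram lookup; assumes data_train$text has been cleaned as above
bigrams_all <- unlist(tokenize_ngrams(data_train$text, n = 2))
bigram_freq <- sort(table(bigrams_all), decreasing = TRUE)

predict_next <- function(word) {
  # keep only bigrams whose first term is `word`
  hits <- bigram_freq[startsWith(names(bigram_freq), paste0(tolower(word), " "))]
  if (length(hits) == 0) return(NA_character_)   # no continuation observed
  # second term of the most frequent matching bigram
  strsplit(names(hits)[1], " ")[[1]][2]
}

predict_next("happy")   # returns whichever word most often follows "happy" in the sample

A real model would extend this idea to trigrams, add a backoff or smoothing scheme for unseen words, and be trained on a larger sample.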