This report presents some features of the data we have for our capstone project. The data comes from three main sources: Twitter, blogs and news articles. Here we inspect some basic characteristics of the given data and outline a plan for the future prediction algorithm.
Loading data
# open connections to the three English text files and read them line by line
con.twitter <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
con.blogs <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
con.news <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
twitter <- readLines(con.twitter)
blogs <- readLines(con.blogs)
news <- readLines(con.news)
close(con.twitter)
close(con.blogs)
close(con.news)
Now we need to find out the main characteristics of our texts. Since we have three files, a good idea is to write a function that computes all the necessary statistics from our variables and returns a summary table.
require(dplyr)
require(tokenizers)
text.features <- function(text) {
    # convert the text to a common encoding (non-ASCII characters are dropped)
    text <- iconv(text, from = "utf-8", to = "ascii", sub = "")
    # calculate basic features of the data
    lines.total <- length(text)
    text <- na.omit(text)
    sent <- tokenizers::count_sentences(text)
    word <- tokenizers::count_words(text)
    char <- tokenizers::count_characters(text)
    sent.total <- sum(sent)
    word.total <- sum(word)
    char.total <- sum(char)
    char.in.word <- mean(char/word, na.rm = TRUE)
    word.in.sent <- mean(word/sent, na.rm = TRUE)
    min.word.length <- min(char/word, na.rm = TRUE)
    min.sent.length <- min(word/sent, na.rm = TRUE)
    max.word.length <- max(char/word, na.rm = TRUE)
    max.sent.length <- max(word/sent, na.rm = TRUE)
    result <- c(lines.total, sent.total, word.total, char.total, char.in.word, word.in.sent,
                min.word.length, min.sent.length, max.word.length, max.sent.length)
    result
}
twitter.features <- text.features(twitter)
blogs.features <- text.features(blogs)
news.features <- text.features(news)
table <- rbind(twitter.features, blogs.features, news.features)
colnames(table) <- c("Lines.sum","Sentences", "Words", "Characters", "Avg.char.in.word", "Avg.word.in.sentence", "Min.word.length", "Min.sent.length", "Max.word.length", "Max.sent.length")
rownames(table) <- c("Twiter", "Blogs", "News")
table
##         Lines.sum Sentences    Words Characters Avg.char.in.word
## Twitter   2360148   3760191 30088564  161961345         5.431806
## Blogs      899288   2375588 37510168  206043906              Inf
## News        77259    155332  2673480   15615538         5.876399
##         Avg.word.in.sentence Min.word.length Min.sent.length
## Twitter             8.885926             1.5             0.5
## Blogs              15.110613             1.0             0.0
## News               18.741522             1.0             1.0
##         Max.word.length Max.sent.length
## Twitter             140              47
## Blogs               Inf             906
## News                  34            1123
We can now see that two of the files, Twitter and blogs, are of a comparable order of magnitude in terms of sentences and words, while the news file is the smallest, with roughly ten times fewer words than either of the other two.
As we can see, the average word in our corpus is about 5.5 characters long, which is reasonable for English. The minimum and maximum values are quite extreme, which is caused by some untidiness of the text: for instance, lines that contain characters but no recognised words make the char/word ratio infinite, which is where the Inf values come from.
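One quick, purely illustrative way to see this is to look for lines with no recognised words at all (the variable name below is just a placeholder for this check):
# lines of the blogs file that contain no recognised words (illustrative check)
no.word.lines <- which(tokenizers::count_words(blogs) == 0)
length(no.word.lines)
head(blogs[no.word.lines])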
One interesting feature is that Twitter messages have, on average, sentences almost half as long as those in blogs and news.
After these overall text statistics it is worth taking a quick look at the words, bigrams and trigrams in our corpora.
tokens.stat <- function(text) {
    # unigram counts with cumulative coverage (percent of all word occurrences)
    words <- unlist(tokenizers::tokenize_words(text))
    words <- data_frame(words = words)
    words <- words %>% count(words, sort = TRUE)
    words <- words %>% mutate(sum = 100 * cumsum(n) / sum(n))
    words.full <- nrow(words)
    words.90 <- nrow(words[words$sum < 90, ])
    words.50 <- nrow(words[words$sum < 50, ])
    # bigram counts with cumulative coverage
    bigrams <- unlist(tokenizers::tokenize_ngrams(text, n = 2))
    bigrams <- data_frame(bigrams = bigrams)
    bigrams <- bigrams %>% count(bigrams, sort = TRUE)
    bigrams <- bigrams %>% mutate(sum = 100 * cumsum(n) / sum(n))
    bigrams.full <- nrow(bigrams)
    bigrams.90 <- nrow(bigrams[bigrams$sum < 90, ])
    bigrams.50 <- nrow(bigrams[bigrams$sum < 50, ])
    # trigram counts with cumulative coverage
    trigrams <- unlist(tokenizers::tokenize_ngrams(text, n = 3))
    trigrams <- data_frame(trigrams = trigrams)
    trigrams <- trigrams %>% count(trigrams, sort = TRUE)
    trigrams <- trigrams %>% mutate(sum = 100 * cumsum(n) / sum(n))
    trigrams.full <- nrow(trigrams)
    trigrams.90 <- nrow(trigrams[trigrams$sum < 90, ])
    trigrams.50 <- nrow(trigrams[trigrams$sum < 50, ])
    stat <- c(words.full, words.90, words.50, bigrams.full, bigrams.90, bigrams.50,
              trigrams.full, trigrams.90, trigrams.50)
    list(stat = stat, unigrams = words, bigrams = bigrams, trigrams = trigrams)
}
# compute token statistics on the first part of each file
twitter.tokens <- tokens.stat(twitter[1:100000])
blogs.tokens <- tokens.stat(blogs[1:50000])
news.tokens <- tokens.stat(news[1:10000])
tokens.table <- rbind(twitter.tokens$stat, blogs.tokens$stat, news.tokens$stat)
colnames(tokens.table) <- c("Unique.words", "90%.W.Variability", "50%.W.Variability", "Unique.bigrams", "90%.B.Variability", "50%.B.Variability", "Unique.trigrams", "90%.T.Variability", "50%.T.Variability")
rownames(tokens.table) <- c("Twitter", "Blogs", "News")
tokens.table
##         Unique.words 90%.W.Variability 50%.W.Variability Unique.bigrams
## Twitter        59921              5462               131         485394
## Blogs          77721              6920               113         757795
## News           31524              7787               217         198146
##         90%.B.Variability 50%.B.Variability Unique.trigrams
## Twitter            367594             24072          855030
## Blogs              551878             28251         1544311
## News               164095             31687          299974
##         90%.T.Variability 50%.T.Variability
## Twitter            746942            314592
## Blogs             1343193            538723
## News               266909            134653
From the table we can see that each sample contains from about 30,000 to almost 80,000 unique words. However, 90% of all word occurrences are covered by roughly a tenth of the unique words in the Twitter and blogs samples (about a quarter in the news sample), and 50% are covered by only a couple of hundred of the most frequent words. The numbers of unique bigrams and trigrams are much larger than the number of unique words.
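These coverage figures come straight from the cumulative-percentage column computed in tokens.stat(); for example (coverage() below is just a throwaway helper introduced for illustration):
coverage <- function(tbl, pct) nrow(tbl[tbl$sum < pct, ])
coverage(twitter.tokens$unigrams, 50)   # about 131 unique words cover half of all word occurrences
coverage(twitter.tokens$unigrams, 90)   # about 5462 unique words cover 90%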
Now let’s look at the most frequent words in all files.
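The frequency tables returned by tokens.stat() already hold this information, so printing their first rows is one simple way to see the top words (shown here as plain counts):
head(twitter.tokens$unigrams, 10)
head(blogs.tokens$unigrams, 10)
head(news.tokens$unigrams, 10)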
It is interesting that the ten most frequent words are the same in the blogs and the Twitter messages, while in the news they are slightly different: the word “I” does not appear among the ten most popular words in the news, and the word “the” is much more frequent than the other words.
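The overlap can be checked directly from the same tables (top10() is just a throwaway helper for this check):
top10 <- function(tbl) head(tbl$words, 10)
# the Twitter and blogs top tens share the same words
intersect(top10(twitter.tokens$unigrams), top10(blogs.tokens$unigrams))
# words in the news top ten that do not appear in the Twitter top ten
setdiff(top10(news.tokens$unigrams), top10(twitter.tokens$unigrams))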
And the best way to present the frequency of words is to create a word cloud. The word cloud below presents word frequencies across all three files (Twitter, blogs and news).
require(dplyr)
require(wordcloud2)
# combine the unigram tables from the three sources and sum the counts per word
merged <- rbind(twitter.tokens$unigrams, blogs.tokens$unigrams, news.tokens$unigrams)
merged <- merged %>% group_by(words) %>% summarise(freq = sum(n)) %>% arrange(desc(freq))
merged$words <- as.character(merged$words)
wordcloud2::wordcloud2(merged[1:500, ], color = "random-light")
For the future analysis and for building the prediction algorithm we will need to apply Kneser-Ney smoothing, which is a powerful method for language prediction. In order to apply this method we first need to clean our text: get rid of unnecessary symbols, numbers, punctuation, profanities and other unwanted words. The implementation of the algorithm will be presented in the next report.
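As a rough sketch of that cleaning step (the function below and its profanity argument are placeholders for illustration, not the final implementation):
clean.text <- function(text, profanity = character(0)) {
    # normalise the encoding in the same way as text.features() above
    text <- iconv(text, from = "utf-8", to = "ascii", sub = "")
    # drop digits; tokenize_words() already lower-cases and strips punctuation
    text <- gsub("[0-9]+", " ", text)
    words <- tokenizers::tokenize_words(text)
    # remove profanities and other unwanted tokens (placeholder word list)
    words <- lapply(words, function(w) w[!w %in% profanity])
    # put the cleaned tokens back together into one line per input line
    vapply(words, paste, character(1), collapse = " ")
}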