Summary

This report explores the data set for our capstone project. The data comes from three sources: Twitter messages, blogs, and news articles. We inspect some basic features of the data and outline a plan for the future prediction algorithm.

Data

Loading data

con.twitter <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "r")
con.blogs <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "r")
con.news <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", "r")
twitter <- readLines(con.twitter)
blogs <- readLines(con.blogs)
news <- readLines(con.news)
close(con.twitter)
close(con.blogs)
close(con.news)
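
Note that readLines() can stop early on en_US.news.txt on some platforms, because the file contains embedded null characters. If the news line count looks unexpectedly low, a common workaround (a sketch assuming the same file path) is to open the connection in binary mode and pass skipNul = TRUE:

con.news <- file("C:/R/Capstone/Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
news <- readLines(con.news, encoding = "UTF-8", skipNul = TRUE)
close(con.news)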

Now we need to find out the main characteristics of our texts. Because we have three files, a good idea is to write a function that computes all the necessary statistics for a given text and returns them as a vector that we can combine into a summary table.

require(dplyr)
require(tokenizers)
text.features <- function(text) {
  # converting text into same encoding
  text <- iconv(text, from =  "utf-8", to = "ascii", sub = "")
  # Calculating basic features of the data
  lines.total <- length(text)
  text <- na.omit(text)
  sent <- tokenizers::count_sentences(text)
  word <- tokenizers::count_words(text)
  char <- tokenizers::count_characters(text)
  
  sent.total <- sum(sent)
  word.total <- sum(word)
  char.total <- sum(char)
  
  char.in.word <- mean(char/word, na.rm = TRUE)
  word.in.sent <- mean(word/sent, na.rm = TRUE)
  
  min.word.length <- min(char/word, na.rm = TRUE)
  min.sent.length <- min(word/sent, na.rm = TRUE)
  
  max.word.length <- max(char/word, na.rm = TRUE)
  max.sent.length <- max(word/sent, na.rm = TRUE)
  result <- c(lines.total, sent.total, word.total, char.total, char.in.word, word.in.sent, min.word.length, min.sent.length, max.word.length, max.sent.length)
  result
}

twitter.features <- text.features(twitter)
blogs.features <- text.features(blogs)
news.features <- text.features(news)

table <- rbind(twitter.features, blogs.features, news.features)
colnames(table) <- c("Lines.sum","Sentences", "Words", "Characters", "Avg.char.in.word", "Avg.word.in.sentence", "Min.word.length", "Min.sent.length", "Max.word.length", "Max.sent.length")
rownames(table) <- c("Twitter", "Blogs", "News")
table
##         Lines.sum Sentences    Words Characters Avg.char.in.word
## Twitter   2360148   3760191 30088564  161961345         5.431806
## Blogs      899288   2375588 37510168  206043906              Inf
## News        77259    155332  2673480   15615538         5.876399
##         Avg.word.in.sentence Min.word.length Min.sent.length
## Twitter             8.885926             1.5             0.5
## Blogs              15.110613             1.0             0.0
## News               18.741522             1.0             1.0
##         Max.word.length Max.sent.length
## Twitter             140              47
## Blogs               Inf             906
## News                 34            1123

Now we see that the Twitter and blogs files are of comparable overall size, containing tens of millions of words each, while the news file is by far the smallest, with roughly ten times fewer words.

As we can see, the average word in our corpora is about 5.5 letters long, which is reasonable for English. The maximum and minimum values are quite extreme (including Inf for the blogs file), which is caused by some untidiness in the text.
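
For example, the Inf values in the blogs row appear because some lines contain no recognizable words after conversion to ASCII, so the characters-per-word ratio divides by zero. Such lines can be located with a quick check (a sketch that applies the same conversion as text.features):

blogs.ascii <- na.omit(iconv(blogs, from = "utf-8", to = "ascii", sub = ""))
empty.lines <- which(tokenizers::count_words(blogs.ascii) == 0)
length(empty.lines)             # how many word-free lines there are
head(blogs.ascii[empty.lines])  # what a few of them look like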

One interesting feature is that Twitter messages have, on average, sentences almost half as long as those in blogs and news.

After these text statistics, it is worth taking a quick look at the words, bigrams, and trigrams in our corpora.

tokens.stat <- function(text) {
  # Unigrams: count unique words and the cumulative share of all occurrences they cover
  words <- unlist(tokenizers::tokenize_words(text))
  words <- tibble(words = words)
  words <- words %>% count(words, sort = TRUE)
  words <- words %>% mutate(sum = 100*cumsum(n)/sum(n))
  words.full <- nrow(words)
  words.90 <- nrow(words[words$sum < 90, ])
  words.50 <- nrow(words[words$sum < 50, ])
  # Bigrams: the same statistics for two-word sequences
  bigrams <- unlist(tokenizers::tokenize_ngrams(text, n = 2))
  bigrams <- tibble(bigrams = bigrams)
  bigrams <- bigrams %>% count(bigrams, sort = TRUE)
  bigrams <- bigrams %>% mutate(sum = 100*cumsum(n)/sum(n))
  bigrams.full <- nrow(bigrams)
  bigrams.90 <- nrow(bigrams[bigrams$sum < 90, ])
  bigrams.50 <- nrow(bigrams[bigrams$sum < 50, ])
  # Trigrams: the same statistics for three-word sequences
  trigrams <- unlist(tokenizers::tokenize_ngrams(text, n = 3))
  trigrams <- tibble(trigrams = trigrams)
  trigrams <- trigrams %>% count(trigrams, sort = TRUE)
  trigrams <- trigrams %>% mutate(sum = 100*cumsum(n)/sum(n))
  trigrams.full <- nrow(trigrams)
  trigrams.90 <- nrow(trigrams[trigrams$sum < 90, ])
  trigrams.50 <- nrow(trigrams[trigrams$sum < 50, ])
  
  stat <- c(words.full, words.90, words.50, bigrams.full, bigrams.90, bigrams.50, trigrams.full, trigrams.90, trigrams.50)
  list(stat = stat, unigrams = words, bigrams = bigrams, trigrams = trigrams)
}
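# n-gram counting is memory-intensive, so the statistics below are computed on a sample of each file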
twitter.tokens <- tokens.stat(twitter[1:100000])
blogs.tokens <- tokens.stat(blogs[1:50000])
news.tokens <- tokens.stat(news[1:10000])

tokens.table <- rbind(twitter.tokens$stat, blogs.tokens$stat, news.tokens$stat)

colnames(tokens.table) <- c("Unique.words", "90%.W.Variability", "50%.W.Variability", "Unique.bigrams", "90%.B.Variability", "50%.B.Variability", "Unique.trigrams", "90%.T.Variability", "50%.T.Variability")
rownames(tokens.table) <- c("Twitter", "Blogs", "News")
tokens.table
##         Unique.words 90%.W.Variability 50%.W.Variability Unique.bigrams
## Twitter        59921              5462               131         485394
## Blogs          77721              6920               113         757795
## News           31524              7787               217         198146
##         90%.B.Variability 50%.B.Variability Unique.trigrams
## Twitter            367594             24072          855030
## Blogs              551878             28251         1544311
## News               164095             31687          299974
##         90%.T.Variability 50%.T.Variability
## Twitter            746942            314592
## Blogs             1343193            538723
## News               266909            134653

From the table we can see that our samples contain from about 30,000 to almost 80,000 unique words. However, roughly 10% of the unique words cover 90% of all word occurrences (the 90% variability columns), and only 100-200 words cover 50%. The number of unique bigrams and trigrams is much greater than the number of unique words.
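
One way to visualize this coverage is to plot the cumulative percentages already stored in the sum column of the unigram tables (a minimal sketch using base graphics):

# cumulative share of all word occurrences covered by the top-ranked words
plot(twitter.tokens$unigrams$sum, type = "l", log = "x",
     xlab = "Word rank (log scale)", ylab = "Coverage, %",
     main = "Cumulative word coverage")
lines(blogs.tokens$unigrams$sum, col = "blue")
lines(news.tokens$unigrams$sum, col = "red")
abline(h = c(50, 90), lty = 2)
legend("bottomright", legend = c("Twitter", "Blogs", "News"),
       col = c("black", "blue", "red"), lty = 1)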

Now let’s look at the most frequent words in all files.
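
A quick way to pull them from the unigram tables computed above (a sketch):

lapply(list(Twitter = twitter.tokens$unigrams,
            Blogs   = blogs.tokens$unigrams,
            News    = news.tokens$unigrams),
       head, n = 10)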

It’s interesting that the ten most frequent words are the same in blogs and Twitter messages, while in news they are slightly different: the word “I” does not appear among the ten most popular words in news, and the word “the” is much more frequent than the other words.

A good way to present word frequencies is a word cloud. The word cloud below shows word frequencies across all three files (Twitter, blogs, and news).

require(dplyr)
require(wordcloud2)
merged <- rbind(twitter.tokens$unigrams, blogs.tokens$unigrams, news.tokens$unigrams)
merged <- merged %>% group_by(words) %>% summarise(freq = sum(n)) %>% arrange(desc(freq))
merged$words <- as.character(merged$words)
wordcloud2::wordcloud2(merged[1:500, ], color = "random-light")

For future analysis and the creation of the prediction algorithm, we plan to apply Kneser-Ney smoothing, which is a powerful method for language modelling. Before applying this method we first need to clean the text: remove unnecessary symbols, numbers, punctuation, profanities, and other unwanted tokens. The implementation of the algorithm will be presented in the next report.
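
A minimal cleaning sketch is shown below; the regular expressions and the bad.words argument are placeholders rather than the final cleaning rules:

clean_text <- function(text, bad.words = character(0)) {
  text <- iconv(text, from = "utf-8", to = "ascii", sub = "")  # drop non-ASCII characters
  text <- tolower(text)                                        # normalize case
  text <- gsub("[0-9]+", " ", text)                            # remove numbers
  text <- gsub("[[:punct:]]+", " ", text)                      # remove punctuation
  if (length(bad.words) > 0) {
    # remove profanities supplied by the caller (whole words only)
    text <- gsub(paste0("\\b(", paste(bad.words, collapse = "|"), ")\\b"), " ", text)
  }
  gsub("\\s+", " ", trimws(text))                              # collapse whitespace
}

# "badword1" and "badword2" are placeholders for a real profanity list
twitter.clean <- clean_text(twitter, bad.words = c("badword1", "badword2"))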