Introduction

In this project we download and clean the data, perform exploratory data analysis (EDA), and find the 10 most frequent words across all three files.

Loading and Sampling the Data

We’ll take a 1% random sample of the lines in each file (sample_size is set to 0.01 below).

set.seed(123)

blogs <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 167155
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 268547
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1274086
## appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line 1759032
## appears to contain an embedded nul
sample_size <- 0.01

blogs <- sample(blogs, length(blogs) * sample_size)
news <- sample(news, length(news) * sample_size)
twitter <- sample(twitter, length(twitter) * sample_size)
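The embedded nul warnings above come from stray zero bytes in the Twitter file. One possible way to avoid them, shown here as a minimal sketch rather than as part of the original analysis, is the skipNul argument of readLines(). The temporary file and the variable names below are made up for illustration.

```r
# Write a small temporary file that contains an embedded nul byte (0x00):
# the bytes are "a", nul, "b", newline, "c", newline.
tmp <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x61, 0x00, 0x62, 0x0a, 0x63, 0x0a)), tmp)

# skipNul = TRUE drops nul bytes while reading instead of warning about them.
clean_lines <- readLines(tmp, encoding = "UTF-8", skipNul = TRUE)
print(clean_lines)

# Remove the temporary file.
unlink(tmp)
```

The same argument could be passed when reading en_US.twitter.txt, assuming the nul bytes carry no information worth keeping.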

Exploratory Data Analysis

basic_data_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  # Size of each full source file on disk, in MB
  "File Size (MB)" = c(file.info("en_US.blogs.txt")$size / 1024^2,
                       file.info("en_US.news.txt")$size / 1024^2,
                       file.info("en_US.twitter.txt")$size / 1024^2),
  # The remaining columns describe the sampled lines, not the full files
  "Length (number of rows)" = c(length(blogs),
                                length(news),
                                length(twitter)),
  "Number of words" = c(sum(sapply(strsplit(blogs, "\\s+"), length)),
                        sum(sapply(strsplit(news, "\\s+"), length)),
                        sum(sapply(strsplit(twitter, "\\s+"), length))),
  "Number of characters" = c(sum(nchar(blogs)),
                             sum(nchar(news)),
                             sum(nchar(twitter))),
  # Keep the column names as written instead of converting them to syntactic names
  check.names = FALSE
)
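The word counts above split each line on whitespace. As a small illustration, a line that starts with whitespace yields an extra empty token under this approach. The helper count_words() below is a hypothetical name, not part of the original report, and trimming the lines first is only one possible fix.

```r
# Hypothetical helper: count whitespace-separated words in a character vector.
# trimws() removes leading and trailing whitespace so that a line such as
# "  hello world " does not produce an empty first token.
count_words <- function(x) {
  sum(lengths(strsplit(trimws(x), "\\s+")))
}

toy <- c("a b c", "  hello world ")

# Count without trimming, as in the table above, for comparison.
untrimmed <- sum(sapply(strsplit(toy, "\\s+"), length))
trimmed <- count_words(toy)

print(c(untrimmed = untrimmed, trimmed = trimmed))
```

On real text the difference is likely to be small, but it can matter for short lines such as tweets.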

Top 10 Most Frequent Words

  1. Making one dataset for the analysis
data <- c(blogs, news, twitter)
  2. Converting all text to lowercase; splitting it into words and combining the words from all lines into one vector; removing empty strings and numbers; removing stop words (pronouns, articles, etc.)
library(tm)
## Loading required package: NLP
data_words <- unlist(strsplit(tolower(data), "\\W+"))
data_words <- data_words[data_words != ""]
data_words <- data_words[!grepl("^[0-9]+$", data_words)]
data_words <- data_words[!data_words %in% stopwords("en")]
  3. Removing leftover tokens produced by splitting contractions (e.g., “I’m”, “don’t”)
data_words <- data_words[!data_words %in% c(
  "s", "t", "m", "ll", "ve", "re", "d"
)]
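To see where these leftover tokens come from, the sketch below splits a toy sentence in two ways. The first split uses the same "\\W+" pattern as above. The second is an alternative that keeps apostrophes inside tokens; it is an assumption for illustration, not the method used in this report, and it only handles the plain ASCII apostrophe.

```r
text <- "I'm sure we don't"

# Splitting on every non-word character breaks contractions apart,
# which produces fragments such as "m" and "t".
split_all <- unlist(strsplit(tolower(text), "\\W+"))
print(split_all)

# Alternative sketch: split on anything that is not a lowercase letter
# or an ASCII apostrophe, so contractions stay in one piece.
keep_apostrophes <- unlist(strsplit(tolower(text), "[^a-z']+"))
print(keep_apostrophes)
```

With the second pattern, whole contractions could be matched against a stop word list directly, provided that list contains them in the same form.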
  4. Building a table of words and their frequencies, sorted in decreasing order
word_freq <- sort(table(data_words), decreasing = TRUE)
  5. Converting the table into a user-friendly data frame and keeping the top 10 words
top10 <- data.frame(
  Word = names(word_freq)[1:10],
  Frequency = as.numeric(word_freq[1:10])
)

top10
##    Word Frequency
## 1   can      3215
## 2  will      3167
## 3  just      3109
## 4  said      3084
## 5   one      3071
## 6  like      2686
## 7   get      2304
## 8  time      2202
## 9   new      1949
## 10  now      1814
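The counting and ranking steps can be checked on a toy vector. This is a minimal sketch with made-up data. It also guards against the case where there are fewer distinct words than requested, which indexing with 1:10 would fill with NA.

```r
# Toy input: "b" appears 3 times, "a" twice, "c" once.
words <- c("b", "a", "b", "c", "b", "a")

# Count each word and sort the counts from most to least frequent.
freq <- sort(table(words), decreasing = TRUE)

# Keep at most 10 entries, or fewer if there are not that many distinct words.
top <- freq[seq_len(min(10, length(freq)))]

top_words <- data.frame(
  Word = names(top),
  Frequency = as.numeric(top),
  stringsAsFactors = FALSE
)
print(top_words)
```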
  6. Plotting the table
library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
ggplot(top10,
       aes(x = reorder(Word, Frequency),
           y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

Future Shiny App

A Shiny app for next-word prediction will be built in the future: it will analyze the user's input and suggest the three most likely next words. The prediction function will be based on n-gram models.
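As a rough idea of what an n-gram predictor could look like, the sketch below uses the simplest case, a bigram model in base R. The function names build_bigrams() and predict_next() and the toy sentence are hypothetical, and the final app may well use longer n-grams, smoothing, or a different data structure.

```r
# Hypothetical helper: pair every token with the token that follows it.
build_bigrams <- function(tokens) {
  data.frame(
    first = head(tokens, -1),
    second = tail(tokens, -1),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: return up to n words that most often follow `word`.
predict_next <- function(bigrams, word, n = 3) {
  followers <- bigrams$second[bigrams$first == tolower(word)]
  if (length(followers) == 0) {
    return(character(0))  # unseen word: no suggestion
  }
  counts <- sort(table(followers), decreasing = TRUE)
  head(names(counts), n)
}

# Toy corpus, tokenized the same way as in the analysis above.
toy_text <- "the cat sat on the mat and the cat ran to the cat"
tokens <- unlist(strsplit(tolower(toy_text), "\\W+"))
bigrams <- build_bigrams(tokens)

suggestions <- predict_next(bigrams, "the")
print(suggestions)
```

In this toy corpus "the" is followed by "cat" three times and by "mat" once, so only two suggestions are available. A real model would also need a fallback for words it has never seen, for example suggesting the most frequent words overall.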