This project downloads and cleans the data, performs exploratory data analysis (EDA), and reports the 10 most frequent words across all three files.
We take a 1% random sample of the lines in each file (sample_size = 0.01 in the code below).
set.seed(123)
blogs <- readLines("en_US.blogs.txt", encoding="UTF-8")
news <- readLines("en_US.news.txt", encoding="UTF-8")
twitter <- readLines("en_US.twitter.txt", encoding="UTF-8")
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 167155 appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 268547 appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 1274086 appears to contain an embedded nul
## Warning in readLines("en_US.twitter.txt", encoding = "UTF-8"): line
## 1759032 appears to contain an embedded nul
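The warnings above are caused by embedded nul characters in the Twitter file. A minimal sketch of one way to avoid them, assuming it is acceptable to simply drop those bytes: `readLines()` has a `skipNul` argument, and opening the file in binary mode is a common companion to it. The helper name `read_corpus` and the temporary demo file are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (not in the original report): read a text file through a
# binary-mode connection and skip embedded nul bytes instead of warning.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration: write a tiny file that contains a nul byte
# in the middle of its first line, then read it back.
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first li"), as.raw(0), charToRaw("ne\nsecond line\n")), tmp)
demo_lines <- read_corpus(tmp)
print(demo_lines)
unlink(tmp)
```

In the report itself this would amount to calling something like `read_corpus("en_US.twitter.txt")` in place of the plain `readLines()` call, if dropping the nul bytes is the desired behavior.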
sample_size <- 0.01
blogs <- sample(blogs, length(blogs) * sample_size)
news <- sample(news, length(news) * sample_size)
twitter <- sample(twitter, length(twitter) * sample_size)
# Summary table. File size refers to the full file on disk; the row, word,
# and character counts refer to the 1% sample taken above.
basic_data_table <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  "File Size (MB)" = c(file.info("en_US.blogs.txt")$size / 1024^2,
                       file.info("en_US.news.txt")$size / 1024^2,
                       file.info("en_US.twitter.txt")$size / 1024^2),
  "Length (number of rows)" = c(length(blogs),
                                length(news),
                                length(twitter)),
  "Number of words" = c(sum(sapply(strsplit(blogs, "\\s+"), length)),
                        sum(sapply(strsplit(news, "\\s+"), length)),
                        sum(sapply(strsplit(twitter, "\\s+"), length))),
  "Number of characters" = c(sum(nchar(blogs)),
                             sum(nchar(news)),
                             sum(nchar(twitter))),
  check.names = FALSE  # keep the column names exactly as written
)
data <- c(blogs, news, twitter)
library(tm)
## Loading required package: NLP
data_words <- unlist(strsplit(tolower(data), "\\W+"))
data_words <- data_words[data_words != ""]
data_words <- data_words[!grepl("^[0-9]+$", data_words)]
data_words <- data_words[!data_words %in% stopwords("en")]
data_words <- data_words[!data_words %in% c(
"s", "t", "m", "ll", "ve", "re", "d"
)]
word_freq <- sort(table(data_words), decreasing = TRUE)
top10 <- data.frame(
Word = names(word_freq)[1:10],
Frequency = as.numeric(word_freq[1:10])
)
top10
## Word Frequency
## 1 can 3215
## 2 will 3167
## 3 just 3109
## 4 said 3084
## 5 one 3071
## 6 like 2686
## 7 get 2304
## 8 time 2202
## 9 new 1949
## 10 now 1814
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
ggplot(top10,
       aes(x = reorder(Word, Frequency),
           y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 10 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )
A Shiny app for next-word prediction will be built in the future: it will analyze the input and suggest the three most likely next words. The prediction function will be based on n-gram models.
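To illustrate the idea behind an n-gram predictor, here is a minimal sketch of a bigram (2-gram) model in base R. It is not the planned implementation: the function names `tokenize`, `build_bigrams`, and `predict_next`, the toy corpus, and the simple "most frequent follower" rule are all assumptions made for illustration. A real model would likely use higher-order n-grams with some form of backoff or smoothing, and a more memory-efficient counting method than a full contingency table.

```r
# Illustrative sketch only: a toy bigram model in base R.
# All function names here are hypothetical.

# Lowercase and split each line into words; lines are tokenized separately
# so that bigrams never span two lines.
tokenize <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  lapply(tokens, function(x) x[x != ""])
}

# Count how often each word (second) directly follows another word (first).
build_bigrams <- function(lines) {
  tokens <- tokenize(lines)
  first  <- unlist(lapply(tokens, function(w) head(w, -1)))
  second <- unlist(lapply(tokens, function(w) tail(w, -1)))
  counts <- as.data.frame(table(first = first, second = second),
                          stringsAsFactors = FALSE)
  counts <- counts[counts$Freq > 0, ]
  # Most frequent continuation first; ties are broken alphabetically.
  counts[order(counts$first, -counts$Freq, counts$second), ]
}

# Return up to k of the most frequent continuations of the last input word.
predict_next <- function(model, input, k = 3) {
  words <- tokenize(input)[[1]]
  if (length(words) == 0) return(character(0))
  last <- words[length(words)]
  head(as.character(model$second[model$first == last]), k)
}

toy_corpus <- c("I like green tea", "I like green apples",
                "I like coffee", "you like green tea")
model <- build_bigrams(toy_corpus)
prediction <- predict_next(model, "They really like")
print(prediction)
```

With this toy corpus, the words that follow "like" are ranked by how often they were seen after it. A last word that never appears in the corpus returns an empty result, which is the case a backoff strategy would need to handle.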