In this milestone report I examined the English text datasets provided
for this project.
Before building the prediction model, I wanted to look at the data and
understand its basic properties.
The dataset contains text from three sources: blogs, news and
Twitter.
Line counts, word counts, file sizes and a few graphs
give a basic picture of the dataset.
This step is mainly useful for getting familiar with the data before the actual prediction work.
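First I loaded the stringi package and the three text files into R.
# stringi provides stri_count_words(), which is used for the word counts below
library(stringi)
blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")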
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
After running this code, each file is stored as a character vector (blogs, news and twitter) with one element per line.
Then I counted the lines and words in each dataset.
line_count <- c(length(blogs), length(news), length(twitter))
word_count <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)
avg_words <- round(word_count / line_count, 2)
summary_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = line_count,
  Words = word_count,
  Avg_Words_Per_Line = avg_words
)
summary_table
##   Dataset   Lines    Words Avg_Words_Per_Line
## 1   Blogs  899288 37546806              41.75
## 2    News 1010206 34761151              34.41
## 3 Twitter 2360148 30096649              12.75
This table shows total lines, total words and average words per line for each dataset.
I also checked how big each file is.
sizes <- file.info(c("en_US.blogs.txt",
                     "en_US.news.txt",
                     "en_US.twitter.txt"))$size
sizes_mb <- round(sizes / 1024^2, 2)
size_table <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Size_MB = sizes_mb
)
size_table
##   Dataset Size_MB
## 1   Blogs  200.42
## 2    News  196.28
## 3 Twitter  159.36
This table shows the size of each file in MB (bytes divided by 1024^2).
Since the full datasets are very large, I took smaller samples for checking.
set.seed(123)
sample_blogs <- sample(blogs, 1000)
sample_news <- sample(news, 1000)
sample_twitter <- sample(twitter, 1000)
Here I randomly selected 1000 lines from each dataset; set.seed(123) makes the samples reproducible.
To understand line length in blogs, I counted the words in each sampled line.
blog_words <- stri_count_words(sample_blogs)
hist(blog_words,
     main = "words per line in blogs sample",
     xlab = "number of words",
     col = "lightblue")
This graph shows that many blog lines are long compared with the other two sources.
Next I counted the words in each sampled news line.
news_words <- stri_count_words(sample_news)
hist(news_words,
     main = "words per line in news sample",
     xlab = "number of words",
     col = "lightgreen")
This graph shows how line length is distributed in the news sample.
Then I counted the words in each sampled Twitter line.
twitter_words <- stri_count_words(sample_twitter)
hist(twitter_words,
     main = "words per line in twitter sample",
     xlab = "number of words",
     col = "lightpink")
This graph shows that Twitter lines are shorter in general.
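The three histograms can also be compared with a couple of summary numbers. The following is only a minimal sketch: it assumes word-count vectors like blog_words, news_words and twitter_words from above, but uses short made-up vectors as stand-ins so that it runs by itself. The name length_summary is chosen just for illustration.

```r
# Made-up stand-ins for the real word-count vectors (illustration only)
blog_words    <- c(5, 40, 62, 18, 110)
news_words    <- c(12, 30, 35, 28, 41)
twitter_words <- c(3, 9, 14, 7, 21)

samples <- list(Blogs = blog_words, News = news_words, Twitter = twitter_words)

# Median and 90th percentile of words per line for each source
length_summary <- data.frame(
  Dataset = names(samples),
  Median  = sapply(samples, median),
  P90     = sapply(samples, quantile, probs = 0.9, names = FALSE),
  row.names = NULL
)
length_summary
```

With the real sampled vectors in place of the stand-ins, the median is less affected by a few very long lines than the mean, so it may describe a typical line better.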
Below is a simple bar plot comparing total words in the three datasets.
barplot(word_count,
        names.arg = c("Blogs", "News", "Twitter"),
        col = "lightyellow",
        main = "total words in each dataset",
        ylab = "number of words")
This plot makes it easy to compare the total amount of text in the three datasets.
After checking the data, I noticed the following points.
Blogs have longer lines than the other two datasets.
Twitter has the most lines, but most of them are short because tweets
are short messages.
The summary table confirms this: blogs average 41.75 words per line,
news 34.41 and Twitter 12.75, so news falls between the other two.
Each dataset contains more than 30 million words, so the data appears sufficient for
building a next-word prediction model.
In the next step, I will prepare the text for building the prediction model.
First, the text will be cleaned by converting it to lowercase
and removing punctuation.
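The cleaning step described above could look roughly like this. It is a minimal sketch using base R only; clean_text is a hypothetical helper name, not code from the project, and the final cleaning may well use a text-mining package instead.

```r
# Hypothetical helper: lowercase the text, drop punctuation, tidy up spaces
clean_text <- function(x) {
  x <- tolower(x)                      # convert everything to lowercase
  x <- gsub("[[:punct:]]", "", x)      # remove punctuation characters
  x <- gsub("[[:space:]]+", " ", x)    # collapse repeated whitespace
  trimws(x)                            # strip leading and trailing spaces
}

clean_text("Hello, World!  It's a TEST.")
# [1] "hello world its a test"
```

Note that this simple rule also removes apostrophes, so "it's" becomes "its". Whether that is acceptable depends on how contractions should be treated in the model.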
After that, I will create n-grams such as two-word and three-word
combinations.
These combinations can help in guessing the next word from the words a user has already typed.
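To show how n-grams can suggest a next word, here is a small sketch in base R. The helper names make_ngrams and predict_next and the toy sentences are assumptions made for illustration; the real model would be built from the cleaned datasets and will probably need something more efficient.

```r
# Hypothetical helper: build all n-word sequences from one cleaned line
make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  words <- words[words != ""]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Toy input (made up), already lowercased and without punctuation
toy_lines <- c("i love new york", "i love new shoes", "i love new york pizza")

# Count how often each three-word combination occurs
trigrams <- unlist(lapply(toy_lines, make_ngrams, n = 3))
trigram_counts <- table(trigrams)

# Hypothetical lookup: most frequent word that follows a two-word prefix
predict_next <- function(prefix, counts) {
  hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  if (length(hits) == 0) return(NA_character_)
  best <- names(hits)[which.max(hits)]
  sub(".* ", "", best)   # keep only the last word of the best trigram
}

predict_next("love new", trigram_counts)
# [1] "york"
```

In the toy data "love new york" occurs twice and "love new shoes" once, so "york" is suggested. A prefix that never occurs returns NA, which is where falling back to two-word combinations could help.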
Finally, I will build a Shiny app in which the user enters a phrase and the app tries to suggest the next word.