Introduction

In this milestone report I explored the English text datasets provided for this project.
Before building the prediction model, I wanted to look at the data and understand its basic structure.

The dataset contains text from three sources: blogs, news and twitter.
Line counts, word counts, file sizes and a few plots give a basic picture of the dataset.

This step is mainly useful to get familiar with the data before doing the actual prediction part.

Loading the data

First I loaded the three text files into R.

blogs <- readLines("en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

After running this code, each file is stored as a character vector (blogs, news and twitter) with one element per line.
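If a line count looks too low after loading, or R warns about embedded nuls, one thing worth trying is to read through a binary-mode connection. The sketch below is only an illustration: read_corpus is a hypothetical helper name that is not part of the original code, and the temporary file stands in for the real data files.

```r
# Hypothetical helper (not in the original report): read a text file through a
# binary-mode connection so control characters are less likely to end the read early.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul = TRUE skips embedded nul bytes instead of truncating the line
  readLines(con, warn = FALSE, encoding = "UTF-8", skipNul = TRUE)
}

# Small temporary file standing in for a file such as en_US.news.txt
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)

demo_lines <- read_corpus(tmp)
length(demo_lines)
```

With the real files, the call would be read_corpus("en_US.news.txt"), and the result can be compared against the readLines() result above.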

Basic information about the data

Then I checked how many lines and words each dataset contains.

library(stringi)  # provides stri_count_words()

line_count <- c(length(blogs), length(news), length(twitter))

word_count <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)

avg_words <- round(word_count / line_count, 2)

summary_table <- data.frame(
  Dataset = c("Blogs","News","Twitter"),
  Lines = line_count,
  Words = word_count,
  Avg_Words_Per_Line = avg_words
)

summary_table
##   Dataset   Lines    Words Avg_Words_Per_Line
## 1   Blogs  899288 37546806              41.75
## 2    News 1010206 34761151              34.41
## 3 Twitter 2360148 30096649              12.75

This table shows total lines, total words and average words per line for each dataset.
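As a quick cross-check, the averages can be recomputed from the printed totals, and the same totals show how differently lines and words are spread across the three sources. This is a small sketch using only the numbers from the table above; the variable names n_lines and n_words are introduced here for illustration.

```r
# Totals copied from the summary table above
n_lines <- c(Blogs = 899288, News = 1010206, Twitter = 2360148)
n_words <- c(Blogs = 37546806, News = 34761151, Twitter = 30096649)

# Average words per line, which should match the Avg_Words_Per_Line column
avg_check <- round(n_words / n_lines, 2)
avg_check

# Each source's share of all lines and of all words, in percent
line_share <- round(100 * n_lines / sum(n_lines), 1)
word_share <- round(100 * n_words / sum(n_words), 1)
rbind(line_share, word_share)
```

Comparing the two rows makes it visible that a source can contribute a large share of the lines while contributing a smaller share of the words.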

File sizes

I also checked how big each file is.

sizes <- file.info(c("en_US.blogs.txt",
                     "en_US.news.txt",
                     "en_US.twitter.txt"))$size

sizes_mb <- round(sizes / 1024^2, 2)

size_table <- data.frame(
  Dataset = c("Blogs","News","Twitter"),
  Size_MB = sizes_mb
)

size_table
##   Dataset Size_MB
## 1   Blogs  200.42
## 2    News  196.28
## 3 Twitter  159.36

This table shows the size of each file in MB, computed as bytes divided by 1024^2.

Sampling the data

Since the full datasets are very large, I took smaller random samples for the plots below.

set.seed(123)

sample_blogs <- sample(blogs, 1000)
sample_news <- sample(news, 1000)
sample_twitter <- sample(twitter, 1000)

Here I randomly selected 1000 lines from each dataset. Setting the seed makes the samples reproducible.
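A fixed count of 1000 lines is roughly 0.1% of the blogs and news files but only about 0.04% of the twitter file. An alternative, sketched below, is to sample the same fraction from every source. The helper name sample_fraction and the toy vector are assumptions made for this illustration, not part of the original analysis.

```r
# Hypothetical helper: draw a fixed fraction of the elements of x, without replacement.
sample_fraction <- function(x, fraction) {
  n <- max(1, round(length(x) * fraction))  # always keep at least one line
  sample(x, n)
}

set.seed(123)

# Toy stand-in for a corpus vector such as blogs
toy <- paste("line", 1:5000)

toy_sample <- sample_fraction(toy, 0.01)
length(toy_sample)
```

With the real data this would be called as sample_fraction(blogs, 0.01), and likewise for news and twitter.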

Histogram for blogs

To understand line length in blogs, I counted the words in each sampled line.

blog_words <- stri_count_words(sample_blogs)

hist(blog_words,
     main = "words per line in blogs sample",
     xlab = "number of words",
     col = "lightblue")

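This graph shows the spread of line lengths in the blogs sample. Many blog lines are long, in line with the average of 41.75 words per line in the summary table.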

Histogram for news

Next I checked the number of words in each sampled news line.

news_words <- stri_count_words(sample_news)

hist(news_words,
     main = "words per line in news sample",
     xlab = "number of words",
     col = "lightgreen")

This graph shows how line lengths are distributed in the news sample. For the full news dataset the average is 34.41 words per line.

Histogram for twitter

Then I checked the number of words in each sampled twitter line.

twitter_words <- stri_count_words(sample_twitter)

hist(twitter_words,
     main = "words per line in twitter sample",
     xlab = "number of words",
     col = "lightpink")

This graph shows that twitter lines are shorter in general, in line with the average of 12.75 words per line in the summary table.
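The three histograms can also be compared side by side with a few quantiles. The sketch below uses made-up word counts purely for illustration (they are not real results), and compare_lengths is a hypothetical helper name.

```r
# Hypothetical helper: one row per source, one column per quantile.
compare_lengths <- function(counts) {
  t(sapply(counts, quantile, probs = c(0.25, 0.5, 0.75, 0.95)))
}

# Illustrative values only, NOT taken from the real samples
toy_counts <- list(
  Blogs   = c(5, 20, 40, 60, 150),
  News    = c(10, 25, 30, 45, 70),
  Twitter = c(3, 8, 12, 15, 25)
)

length_table <- compare_lengths(toy_counts)
length_table
```

In this report the input would be list(Blogs = blog_words, News = news_words, Twitter = twitter_words).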

Word count comparison

Below is a simple bar plot comparing total words in the three datasets.

barplot(word_count,
        names.arg = c("Blogs","News","Twitter"),
        col = "lightyellow",
        main = "total words in each dataset",
        ylab = "number of words")

This plot makes it easy to compare the total amount of text in the three datasets.

Observations

After checking the data, I noticed a few points.

Blogs have longer lines than the other two datasets. Twitter has the most lines, but most of them are short because tweets are short messages.
News sits between the two: its lines are longer than tweets and somewhat shorter than blog lines on average (34.41 words per line).

The summary table confirms this: average words per line is highest for blogs (41.75) and lowest for twitter (12.75).
Each dataset contains more than 30 million words, so the data looks sufficient for building a next-word prediction model.

Plan for next step

In the next step, I will prepare the text for building the prediction model.

First, the text will be cleaned by converting it to lowercase and removing punctuation.
After that, I will create n-grams such as two-word and three-word combinations.
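The cleaning and n-gram steps could look roughly like the sketch below. It uses base R only. The helper names clean_text and make_ngrams are hypothetical, and the final model may well use a dedicated text-mining package instead.

```r
# Hypothetical helpers for illustration; base R only.

# Lowercase the text and replace punctuation with spaces.
# Simplification: this also splits contractions such as "don't" into "don t".
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  trimws(gsub("[[:space:]]+", " ", x))  # collapse repeated whitespace
}

# Build all n-word combinations from one cleaned line.
make_ngrams <- function(line, n) {
  tokens <- strsplit(line, " ", fixed = TRUE)[[1]]
  tokens <- tokens[nzchar(tokens)]
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

cleaned  <- clean_text("Hello, World! This is a test.")
bigrams  <- make_ngrams(cleaned, 2)
trigrams <- make_ngrams(cleaned, 3)
bigrams
```

Applied to a whole sample, make_ngrams would be called once per line (for example with lapply) and the results combined before counting.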

These combinations can help in guessing the next word when a user types some words.
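One simple way to use two-word combinations for guessing is to look up which word most often followed the last typed word. The sketch below shows that idea on a tiny made-up corpus. The names pairs and predict_next are hypothetical, and this is not necessarily the model that will be built.

```r
# Toy corpus, assumed to be already cleaned (illustrative only)
corpus <- c("i love data science", "i love r", "i like data")

# Table of (previous word, next word) pairs taken from every line
pairs <- do.call(rbind, lapply(strsplit(corpus, " ", fixed = TRUE), function(tokens) {
  if (length(tokens) < 2) return(NULL)
  data.frame(prev = head(tokens, -1),
             nxt  = tail(tokens, -1),
             stringsAsFactors = FALSE)
}))

# Suggest the word that most often followed the last word of the phrase.
# Ties go to the alphabetically first candidate; unseen words give NA.
predict_next <- function(phrase, pairs) {
  tokens <- strsplit(tolower(phrase), " ", fixed = TRUE)[[1]]
  last_word <- tail(tokens, 1)
  candidates <- pairs$nxt[pairs$prev == last_word]
  if (length(candidates) == 0) return(NA_character_)
  names(which.max(table(candidates)))
}

predict_next("I", pairs)
```

A three-word version would key on the last two words instead and could fall back to this two-word lookup when the longer context was never seen.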

After that, I will build a Shiny app where the user enters a phrase and the app tries to suggest the next word.