Introduction

This report presents an exploratory analysis of three large text datasets: blogs, news articles, and Twitter posts. The objective is to understand the basic characteristics of each corpus, including line counts, word distributions, and frequency patterns. This analysis serves as a foundational milestone for building a predictive text model that can suggest the next word based on user input.

The project context involves natural language processing and text mining techniques applied to real-world English language data. Understanding these datasets will inform the development of efficient algorithms for text prediction, which has applications in mobile keyboards, search engines, and writing assistants.

Data Loading

The following code loads the three text datasets from the working directory:

# Load the three datasets
blogs <- readLines('en_US.blogs.txt', encoding = 'UTF-8')
news <- readLines('en_US.news.txt', encoding = 'UTF-8')
twitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8')

Summary Statistics

Below is a summary table showing the number of lines in each dataset and the average words per line, estimated from a random sample of 10,000 lines per dataset:

# Calculate line counts
blogs_lines <- length(blogs)
news_lines <- length(news)
twitter_lines <- length(twitter)

# Sample and calculate average words per line
set.seed(42)
blogs_sample <- blogs[sample(length(blogs), 10000)]
news_sample <- news[sample(length(news), 10000)]
twitter_sample <- twitter[sample(length(twitter), 10000)]

blogs_wc <- sapply(strsplit(blogs_sample, '\\s+'), length)
news_wc <- sapply(strsplit(news_sample, '\\s+'), length)
twitter_wc <- sapply(strsplit(twitter_sample, '\\s+'), length)

blogs_avg <- round(mean(blogs_wc), 2)
news_avg <- round(mean(news_wc), 2)
twitter_avg <- round(mean(twitter_wc), 2)

# Create summary table
summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(blogs_lines, news_lines, twitter_lines),
  Avg_Words_Per_Line = c(blogs_avg, news_avg, twitter_avg)
)

knitr::kable(summary_df, caption = "Dataset Summary Statistics")
Dataset Summary Statistics

Dataset      Lines   Avg_Words_Per_Line
Blogs       899288                42.19
News       1010242                34.63
Twitter    2360148                12.80

Sample Lines from Each Dataset

Here are random samples from each dataset to illustrate the type of content:

set.seed(123)
cat("Sample from Blogs:\n", blogs[sample(length(blogs), 1)], "\n\n")
## Sample from Blogs:
##  The bruschetta however, missed the mark. Instead of manageable two-bite crostini, these were huge slices of grilled bread and heaped with toppings of tomato, cannellini beans and roasted peppers with goat cheese.
cat("Sample from News:\n", news[sample(length(news), 1)], "\n\n")
## Sample from News:
##  A four-star lineman, the 6-foot-4, 250-pound son of Greyhounds coach Biff Poggi earned first-team All-Metro honors last fall after making 49 tackles, 11 for a loss, and finishing the season with 10 sacks.
cat("Sample from Twitter:\n", twitter[sample(length(twitter), 1)], "\n")
## Sample from Twitter:
##  I tell ion gaf so why test my tolerance?

Line Length Distributions

The histograms below show the distribution of line lengths (in characters) for the 10,000-line samples drawn from each dataset:

par(mfrow=c(1,3))

hist(nchar(blogs_sample), 
     main='Blogs Line Length', 
     xlab='Characters', 
     col='lightblue', 
     breaks=50)

hist(nchar(news_sample), 
     main='News Line Length', 
     xlab='Characters', 
     col='lightgreen', 
     breaks=50)

hist(nchar(twitter_sample), 
     main='Twitter Line Length', 
     xlab='Characters', 
     col='salmon', 
     breaks=50)

Word Frequency Analysis

This section analyzes the most frequent words in the blogs sample after removing a small hand-defined list of common stopwords:

# Define stopwords
stopwords_en <- c('the','and','for','are','but','not','you','all','can','her',
                  'was','one','our','out','day','get','has','him','his','how',
                  'its','may','new','now','old','see','two','who','boy','did',
                  'let','put','say','she','too','use','with','from','have','this',
                  'will','your','what','that','been','into','when','make','like',
                  'time','just','than','them','only','more','some','said','each',
                  'which','their','there','would','other','about','after','first',
                  'could','where','these','being','before','through','because',
                  'between','without','against','during','another','himself','herself')

# Process blogs text
blogs_words <- tolower(unlist(strsplit(paste(blogs_sample, collapse=' '), '[^a-z]+')))
blogs_words <- blogs_words[nchar(blogs_words) > 0 & !blogs_words %in% stopwords_en]
blogs_top10 <- sort(table(blogs_words), decreasing=TRUE)[1:10]

# Display top 10 words
print(blogs_top10)
## blogs_words
##    to     a    of    in    is    it     s    he     t    on 
## 11947  9851  9803  6287  4787  4323  3947  3676  3462  3084
# Create bar plot
barplot(blogs_top10, 
        main='Top 10 Most Frequent Words - Blogs', 
        col='steelblue', 
        las=2, 
        cex.names=0.8,
        ylab='Frequency')
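
The top-10 list above highlights the limits of this first pass: the hand-defined stopword list omits very common function words (“to”, “a”, “of”, “in”), contractions split on the apostrophe leave single-letter fragments (“s”, “t”), and because the split on [^a-z]+ happens before lowercasing, capitalized words such as “The” contribute the fragment “he”. A more careful pass is sketched below; it lowercases the text before splitting, keeps apostrophes so contractions stay whole, and borrows the English stopword list from the tm package (an assumption for illustration only, since tm is not used elsewhere in this report):

# More careful tokenization of the same blogs sample:
# lowercase first, split on anything that is not a letter or apostrophe,
# then drop short tokens and a fuller stopword list.
library(tm)  # assumed installed; provides stopwords("en")

blogs_text   <- tolower(paste(blogs_sample, collapse = ' '))
blogs_words2 <- unlist(strsplit(blogs_text, "[^a-z']+"))
blogs_words2 <- blogs_words2[nchar(blogs_words2) > 1 &
                             !blogs_words2 %in% stopwords("en")]

sort(table(blogs_words2), decreasing = TRUE)[1:10]

With this ordering, “The” becomes “the” and is removed by the stopword list, and the single-letter fragments are dropped by the nchar filter, so the top of the list is no longer dominated by the fragments and function words shown above.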

Interpretation of Results

The analysis reveals several interesting patterns across the three datasets:

  1. Dataset Size: The Twitter dataset is substantially larger with over 2.3 million lines, compared to approximately 1 million for news and 900,000 for blogs. This reflects Twitter’s high-volume, short-form nature.

  2. Content Length: Blogs contain the longest content per line (averaging around 42 words), followed by news articles (around 35 words), while Twitter posts are notably shorter (around 13 words). This aligns with Twitter’s historical character limits.

  3. Line Length Distribution: The histograms show that blogs have a wider distribution with longer tails, indicating more variable content length. Twitter shows a tighter distribution clustered at lower character counts, which is expected given platform constraints.

  4. Word Frequency: Even with the stopword filter, the most frequent words in the blogs sample are still function words such as “to”, “a”, and “of”, along with single-letter fragments (“s”, “t”) left behind when contractions are split on non-letter characters. This reflects the limitations of the hand-defined stopword list and simple tokenizer noted above; a fuller stopword list and more careful tokenization will be applied during preprocessing.

  5. Data Quality: Some warnings appeared during Twitter data loading regarding embedded null characters, which is common with large social media datasets. This will need to be addressed during preprocessing.
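
For reference, the null-character warnings mentioned in item 5 can also be avoided at read time using the skipNul argument of base R’s readLines; whether to do this or to strip the nulls later during preprocessing is an open choice. A minimal sketch:

# Re-read the Twitter file, skipping embedded null characters so that
# readLines() no longer emits warnings about them.
twitter <- readLines('en_US.twitter.txt', encoding = 'UTF-8', skipNul = TRUE)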

Next Steps and Project Goals

The ultimate goal of this project is to develop a predictive text application that can suggest the next word(s) based on the user’s input. To achieve this, the following steps are planned:

  1. Text Preprocessing: Clean the data by removing special characters, numbers, profanity, and handling contractions. Normalize text to lowercase.

  2. Tokenization: Create n-grams (bigrams, trigrams, and quadgrams) to capture word sequences and context.

  3. Model Development: Build a statistical language model using n-gram frequencies and implement smoothing or backoff techniques (such as Katz backoff or interpolation) to handle unseen word combinations; a simplified sketch of the n-gram counting and backoff idea follows this list.

  4. Optimization: Given the large dataset size, implement sampling strategies and data structures that enable fast lookup and prediction while managing memory constraints.

  5. Application Development: Create a Shiny web application that provides real-time word prediction as users type, similar to smartphone keyboard suggestions.

  6. Evaluation: Test the model’s accuracy and speed, adjusting parameters to balance prediction quality with computational efficiency.
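
To make steps 2 and 3 concrete, the sketch below counts bigrams and trigrams in the blogs sample and wires them into a simplified frequency-based backoff lookup (trigram, then bigram, then the overall most frequent word). It illustrates the shape of the model rather than the final implementation: the function name predict_next, the reuse of blogs_sample, and the plain frequency backoff standing in for Katz backoff or interpolation are all choices made for this example.

# Tokenize the blogs sample: lowercase, keep letters and apostrophes only
tokens <- unlist(strsplit(tolower(paste(blogs_sample, collapse = ' ')),
                          "[^a-z']+"))
tokens <- tokens[nchar(tokens) > 0]

# Frequency tables for unigrams, bigrams and trigrams
unigrams <- table(tokens)
bigrams  <- table(paste(head(tokens, -1), tail(tokens, -1)))
trigrams <- table(paste(head(tokens, -2),
                        tokens[2:(length(tokens) - 1)],
                        tail(tokens, -2)))

# Simplified backoff: try trigrams keyed on the last two words,
# then bigrams keyed on the last word, then the most common unigram.
predict_next <- function(phrase) {
  words <- unlist(strsplit(tolower(phrase), "[^a-z']+"))
  words <- words[nchar(words) > 0]
  n <- length(words)
  if (n >= 2) {
    hits <- trigrams[grep(paste0('^', words[n - 1], ' ', words[n], ' '),
                          names(trigrams))]
    if (length(hits) > 0) return(sub('.* ', '', names(which.max(hits))))
  }
  if (n >= 1) {
    hits <- bigrams[grep(paste0('^', words[n], ' '), names(bigrams))]
    if (length(hits) > 0) return(sub('.* ', '', names(which.max(hits))))
  }
  names(which.max(unigrams))
}

predict_next("thank you for")

The grep-based lookup is far too slow for interactive use; the optimization step would replace it with n-gram tables pre-split into prefix and next-word columns and indexed for fast lookup (for example with data.table keys), which also makes it easier to prune rare n-grams and keep memory in check.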

This exploratory analysis has provided valuable insights into the structure and characteristics of the text data, which will inform the design decisions for the prediction algorithm and ensure it performs well across different types of text input.