Analyzing Real-World Text Data for Predictive Language Modeling

Introduction

The purpose of this analysis is to understand the structure and behavior of natural language in the SwiftKey dataset. Before building a predictive text model, it is important to study how people actually write across different platforms. The dataset contains English (en_US) text collected from blogs, news articles, and Twitter posts. These sources differ in writing style, sentence structure, and vocabulary, all of which directly influence how a prediction model should be designed.

Loading Libraries

library(stringi)
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Loading the Data

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Basic Summary

summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  Size_MB = c(file.info("final/en_US/en_US.blogs.txt")$size,
              file.info("final/en_US/en_US.news.txt")$size,
              file.info("final/en_US/en_US.twitter.txt")$size) / (1024^2)
)

summary_data

##    Source   Lines    Words  Size_MB
## 1   Blogs  899288 37546250 200.4242
## 2    News 1010242 34762395 196.2775
## 3 Twitter 2360148 30093413 159.3641
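The table above already hints at how the sources differ in length per entry. As a small sketch, the average number of words per line can be derived directly from the reported counts; the column name `Words_per_Line` is introduced here for illustration only.

```r
# Figures copied from the summary table above.
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(899288, 1010242, 2360148),
  Words  = c(37546250, 34762395, 30093413)
)

# Average number of words per line for each source, rounded to one decimal.
# Words_per_Line is an illustrative column name, not part of the original table.
summary_data$Words_per_Line <- round(summary_data$Words / summary_data$Lines, 1)

summary_data
```

Twitter lines come out far shorter on average than blog or news lines, which is worth keeping in mind when the samples are pooled later.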

Sampling the Data

set.seed(123)

sample_blogs <- sample(blogs, 3000)
sample_news <- sample(news, 3000)
sample_twitter <- sample(twitter, 3000)

sample_text <- c(sample_blogs, sample_news, sample_twitter)

Sentence Length Distribution

sentence_lengths <- stri_count_words(sample_text)

# Each element of sample_text is one line of the source files,
# so sentence_lengths holds the word count of each sampled line.
ggplot(data.frame(words = sentence_lengths), aes(x = words)) +
  geom_histogram(bins = 50) +
  ggtitle("Distribution of Sentence Lengths") +
  xlab("Number of Words in a Sentence") +
  ylab("Frequency")

Word Frequency and Vocabulary Richness

all_words <- unlist(strsplit(tolower(sample_text), "\\W+"))
all_words <- all_words[all_words != ""]

word_table <- table(all_words)
word_freq <- sort(word_table, decreasing = TRUE)

length(word_freq)

## [1] 25419

The 9,000 sampled lines contain 25,419 unique words after lowercasing and splitting on non-word characters.
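One caveat about the tokenization step: splitting on `\\W+` treats apostrophes as separators, so contractions are broken into fragments that then count as separate words. The sketch below shows the effect on a made-up sentence and one possible alternative pattern. The alternative is an assumption for illustration, not the method used in this report, and it only handles the plain ASCII apostrophe.

```r
# A made-up example sentence (not taken from the dataset).
x <- tolower("Don't stop, it's a well-known trick")

# Splitting on non-word characters, as in the code above,
# breaks contractions apart ("don't" becomes "don" and "t").
split_tokens <- strsplit(x, "\\W+")[[1]]
split_tokens <- split_tokens[split_tokens != ""]

# Alternative (an assumption, not the report's method): match runs of
# letters that may contain internal ASCII apostrophes.
match_tokens <- regmatches(x, gregexpr("[a-z]+(?:'[a-z]+)*", x, perl = TRUE))[[1]]

split_tokens
match_tokens
```

Switching tokenizers would change the vocabulary size and coverage figures reported here, so the numbers in this report apply only to the `\\W+` split.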

Coverage Analysis (Important for Prediction Model)

coverage <- cumsum(word_freq) / sum(word_freq)

words_50 <- which(coverage >= 0.5)[1]
words_90 <- which(coverage >= 0.9)[1]

words_50

## state 
##   137

words_90

## ricotta 
##    6394

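The 137 most frequent words account for 50% of all word occurrences in the sample, and 6,394 words (about a quarter of the 25,419 unique words) account for 90%. The names printed with each result ("state", "ricotta") are the words that sit at those ranks.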
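To make the coverage calculation easier to follow, here is a minimal sketch of the same logic on a ten-word toy corpus. The helper `words_for_coverage` and the toy data are hypothetical and only illustrate the cumulative-sum approach used above.

```r
# Hypothetical helper: number of most-frequent words needed to reach
# a target share of all word occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  coverage <- cumsum(freq) / sum(freq)
  unname(which(coverage >= target)[1])
}

# Toy corpus: "the" x4, "of" x3, "cat" x2, "dog" x1 (10 tokens in total).
toy_words <- c("the", "the", "the", "the", "of", "of", "of", "cat", "cat", "dog")
toy_freq <- table(toy_words)

# Cumulative coverage by rank is 0.4, 0.7, 0.9, 1.0.
words_for_coverage(toy_freq, 0.5)
words_for_coverage(toy_freq, 0.8)
```

Wrapping the result in `unname()` returns just the rank, without the word label that appears in the output above.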

Most Frequent Words

top_words <- head(word_freq, 20)

barplot(top_words, las = 2,
        main = "Most Frequent Words",
        ylab = "Frequency")

Observations

The three sources differ noticeably in writing patterns. From the summary table, Twitter averages about 13 words per line, against roughly 42 for blogs and 34 for news, which is consistent with short, informal tweets and longer, more structured blog and news text. A small portion of the vocabulary accounts for most word usage: about a quarter of the unique words in the sample cover 90% of all occurrences. This is useful when designing a predictive text model, because the model can leave out rare words and still cover most of the text it is likely to see.
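One way to act on the coverage observation is to keep only the most frequent words and map everything else to a placeholder token. The sketch below does this on toy data; the `<unk>` token, the 0.75 target, and the variable names are assumptions chosen for illustration.

```r
# Toy tokens standing in for all_words (illustrative only).
toy_words <- c("the", "the", "the", "the", "of", "of", "of", "cat", "cat", "dog")
toy_freq <- sort(table(toy_words), decreasing = TRUE)

# Keep the smallest set of top-ranked words that reaches the target coverage.
target <- 0.75
keep_n <- which(cumsum(toy_freq) / sum(toy_freq) >= target)[1]
vocab <- names(toy_freq)[seq_len(keep_n)]

# Replace every out-of-vocabulary word with a placeholder token.
pruned <- ifelse(toy_words %in% vocab, toy_words, "<unk>")

vocab
table(pruned)
```

The trade-off is that any word outside the kept vocabulary can no longer be predicted, so the target coverage would need tuning against real data.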

Plan for Prediction Model

The insights from this analysis will be used to build n-gram models that predict the next word based on previous words. Since only a small vocabulary covers most of the text, the model can be optimized for speed and memory usage. Different writing styles observed across sources also suggest that the model should be flexible enough to handle both formal and informal text. A Shiny application will be developed where users can input text and receive real-time next-word predictions.
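As a rough illustration of the planned approach, the sketch below builds a bigram (2-gram) count table from a three-line toy corpus and predicts the most frequent follower of a word. The corpus, the helper `predict_next`, and the decision to return `NA` for unseen words are all assumptions; a real model would likely use longer n-grams and some form of backoff or smoothing.

```r
# Toy corpus standing in for the sampled lines (illustrative only).
corpus <- c("i love new york", "i love new shoes", "new york is big")

# Tokenize each line and collect adjacent word pairs (bigrams).
tokens <- strsplit(tolower(corpus), "\\W+")
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  w <- w[w != ""]
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count how often each (first, second) pair occurs and drop unseen pairs.
bigram_counts <- as.data.frame(table(bigrams$first, bigrams$second),
                               stringsAsFactors = FALSE)
names(bigram_counts) <- c("first", "second", "n")
bigram_counts <- bigram_counts[bigram_counts$n > 0, ]

# Hypothetical helper: most frequent follower of `word`, or NA if unseen.
predict_next <- function(word, counts) {
  candidates <- counts[counts$first == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

predict_next("love", bigram_counts)
predict_next("new", bigram_counts)
predict_next("zebra", bigram_counts)
```

Here "new" is followed by "york" twice and "shoes" once, so "york" is returned; a word that never appears as the first element of a bigram yields `NA`.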

This exploration provides a deeper understanding of how natural language appears in real data. These findings form the foundation for building an efficient and accurate predictive text system.