The purpose of this analysis is to understand the structure and behavior of the natural language in the SwiftKey dataset. Before building a predictive text model, it is important to study how people actually write on different platforms. The dataset contains text collected from blogs, news articles, and Twitter posts. These sources differ in writing style, sentence structure, and vocabulary, and those differences directly influence how a prediction model should be designed.
library(stringi)
library(ggplot2)
library(dplyr)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter))),
  Size_MB = c(file.info("final/en_US/en_US.blogs.txt")$size,
              file.info("final/en_US/en_US.news.txt")$size,
              file.info("final/en_US/en_US.twitter.txt")$size) / (1024^2)
)
summary_data
##    Source   Lines    Words  Size_MB
## 1   Blogs  899288 37546250 200.4242
## 2    News 1010242 34762395 196.2775
## 3 Twitter 2360148 30093413 159.3641
set.seed(123)
sample_blogs <- sample(blogs, 3000)
sample_news <- sample(news, 3000)
sample_twitter <- sample(twitter, 3000)
sample_text <- c(sample_blogs, sample_news, sample_twitter)
sentence_lengths <- stri_count_words(sample_text)
# qplot() is deprecated since ggplot2 3.4.0, so the histogram is built with ggplot() directly
ggplot(data.frame(words = sentence_lengths), aes(x = words)) +
  geom_histogram(bins = 50) +
  ggtitle("Distribution of Sentence Lengths") +
  xlab("Number of Words in a Sentence") +
  ylab("Frequency")
all_words <- unlist(strsplit(tolower(sample_text), "\\W+"))
all_words <- all_words[all_words != ""]
word_table <- table(all_words)
word_freq <- sort(word_table, decreasing = TRUE)
length(word_freq)
## [1] 25419
The 9,000 sampled lines contain 25,419 unique words after lowercasing and splitting on non-word characters.
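One caveat about the tokenization above: the pattern `\\W+` treats apostrophes as separators, so contractions are split into two tokens and inflate the count slightly. The sketch below contrasts that behavior with an alternative pattern that keeps contractions whole. The alternative regular expression and the example sentence are illustrative assumptions, not part of the original analysis, and the ASCII-only pattern would drop accented letters and curly apostrophes.

```r
# Example sentence (made up for illustration)
text <- tolower("I don't think it's over")

# Tokenization as used in the report: split on runs of non-word characters
split_tokens <- unlist(strsplit(text, "\\W+"))
split_tokens <- split_tokens[split_tokens != ""]

# Assumed alternative: extract letter runs, allowing internal apostrophes
kept_tokens <- unlist(regmatches(
  text,
  gregexpr("[a-z]+(?:'[a-z]+)*", text, perl = TRUE)
))

print(split_tokens)  # "don't" becomes "don" and "t"
print(kept_tokens)   # "don't" stays a single token
```

Which variant is preferable depends on whether the prediction model should suggest contractions as single words.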
coverage <- cumsum(word_freq) / sum(word_freq)
words_50 <- which(coverage >= 0.5)[1]
words_90 <- which(coverage >= 0.9)[1]
words_50
## state
## 137
words_90
## ricotta
## 6394
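The 137 most frequent words cover 50% of all word occurrences in the sample, and 6,394 words cover 90%. The label printed above each number (state, ricotta) is the word at which the threshold is reached, because which() keeps the names of the frequency table.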
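The coverage calculation can be wrapped in a small reusable function. The following is a minimal sketch: the helper name `words_for_coverage` and the toy counts are assumptions made for illustration and do not come from the corpus.

```r
# Hypothetical helper: number of most-frequent words needed to reach
# a given share of all word occurrences.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)    # most frequent first
  coverage <- cumsum(freq) / sum(freq)     # cumulative share of occurrences
  unname(which(coverage >= target)[1])     # drop the word label, keep the rank
}

# Toy counts (invented): cumulative shares are 0.40, 0.65, 0.80, 0.92, 1.00
toy_freq <- c(the = 40, of = 25, and = 15, ricotta = 12, zebra = 8)

n_50 <- words_for_coverage(toy_freq, 0.5)  # 2 words reach 50%
n_90 <- words_for_coverage(toy_freq, 0.9)  # 4 words reach 90%
print(c(n_50, n_90))
```

Calling `unname()` returns the bare rank, which avoids the labelled output seen above when only the number is of interest.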
top_words <- head(word_freq, 20)
barplot(top_words, las = 2,
main = "Most Frequent Words",
ylab = "Frequency")
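The three sources show noticeable differences in writing patterns. Dividing the word counts by the line counts in the summary table gives an average of about 42 words per line for blogs, 34 for news, and 13 for Twitter, so Twitter text is much shorter and tends to be more informal, while blogs and news contain longer and more structured text. A small portion of the vocabulary accounts for a large portion of total word usage: 137 of the 25,419 unique words in the sample cover half of all word occurrences. This is very useful when designing a predictive text model, because the model does not need to store every word to make accurate predictions.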
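The difference in line length between sources can also be checked per source rather than on the combined sample. The sketch below is one way to do that under stated assumptions: the helper names and toy lines are hypothetical, and words are counted by splitting on whitespace in base R, which can differ slightly from `stri_count_words()`.

```r
# Hypothetical helper: whitespace-based word count for each line
words_per_line <- function(lines) {
  lengths(strsplit(trimws(lines), "\\s+"))
}

# Hypothetical helper: mean and median words per line for each source
summarise_lengths <- function(sources) {
  data.frame(
    Source = names(sources),
    Mean_Words = unname(vapply(sources, function(x) mean(words_per_line(x)), numeric(1))),
    Median_Words = unname(vapply(sources, function(x) median(words_per_line(x)), numeric(1))),
    stringsAsFactors = FALSE
  )
}

# Toy lines (invented); in the report one would pass
# list(Blogs = sample_blogs, News = sample_news, Twitter = sample_twitter)
toy_sources <- list(
  Blogs = c("a long and winding blog entry about nothing in particular",
            "another fairly long entry"),
  Twitter = c("so tired lol", "good morning")
)

length_summary <- summarise_lengths(toy_sources)
print(length_summary)
```

Reporting the median alongside the mean is a deliberate choice here, since a few very long blog or news lines can pull the mean upward.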
The insights from this analysis will be used to build n-gram models that predict the next word from the preceding words. Since about 6,400 words cover 90% of the sampled text, the vocabulary can be restricted to frequent words to reduce memory use and speed up lookups. The different writing styles observed across sources also suggest that the model should be flexible enough to handle both formal and informal text. A Shiny application will be developed in which users can enter text and receive real-time next-word predictions.
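To make the n-gram idea concrete, here is a minimal bigram sketch in base R. It is not the planned implementation: the function names `build_bigrams` and `predict_next` and the toy lines are assumptions for illustration, and a real model would likely add higher-order n-grams, vocabulary pruning, and a fallback for unseen words.

```r
# Hypothetical helper: count adjacent word pairs within each line
build_bigrams <- function(lines) {
  tokens_by_line <- strsplit(tolower(lines), "\\W+")
  pairs <- unlist(lapply(tokens_by_line, function(tok) {
    tok <- tok[tok != ""]
    if (length(tok) < 2) return(character(0))
    paste(head(tok, -1), tail(tok, -1))   # "word1 word2"
  }))
  counts <- sort(table(pairs), decreasing = TRUE)
  parts <- strsplit(names(counts), " ", fixed = TRUE)
  data.frame(
    first = vapply(parts, `[`, character(1), 1),
    second = vapply(parts, `[`, character(1), 2),
    n = as.integer(counts),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: the k most frequent followers of a word
predict_next <- function(bigrams, word, k = 3) {
  hits <- bigrams[bigrams$first == tolower(word), ]
  head(hits$second, k)   # rows are already sorted by decreasing count
}

# Toy lines (invented for illustration)
toy_lines <- c("thank you so much", "thank you for coming", "thank goodness")
bigrams <- build_bigrams(toy_lines)
suggestions <- predict_next(bigrams, "thank")
print(suggestions)  # "you" (seen twice) ranks before "goodness" (seen once)
```

Pairs are built within each line only, so the last word of one line is never paired with the first word of the next.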
This exploration shows how unevenly words are distributed in real text and how much the three sources differ in length and style. These findings form the foundation for building an efficient and accurate predictive text system.