The goal of this project is to build a predictive text model capable of suggesting the next word based on user input. The data used in this analysis comes from blogs, news articles, and Twitter posts contained in the HC Corpora dataset.
blogs <- readLines("final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8",
                   skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt",
                  encoding = "UTF-8",
                  skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8",
                     skipNul = TRUE)
stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs),
            length(news),
            length(twitter))
)
stats
##   Dataset   Lines
## 1   Blogs  899288
## 2    News 1010206
## 3 Twitter 2360148
set.seed(123)
sampleData <- c(
  sample(blogs, 1000),
  sample(news, 1000),
  sample(twitter, 1000)
)
length(sampleData)
## [1] 3000
library(tm)
## Loading required package: NLP
corpus <- Corpus(VectorSource(sampleData))
corpus <- tm_map(corpus, content_transformer(tolower))
## Warning in tm_map.SimpleCorpus(corpus, content_transformer(tolower)):
## transformation drops documents
corpus <- tm_map(corpus, removePunctuation)
## Warning in tm_map.SimpleCorpus(corpus, removePunctuation): transformation drops
## documents
corpus <- tm_map(corpus, removeNumbers)
## Warning in tm_map.SimpleCorpus(corpus, removeNumbers): transformation drops
## documents
corpus <- tm_map(corpus, stripWhitespace)
## Warning in tm_map.SimpleCorpus(corpus, stripWhitespace): transformation drops
## documents
tdm <- TermDocumentMatrix(corpus)
m <- as.matrix(tdm)
freq <- sort(rowSums(m),
             decreasing = TRUE)
head(freq, 20)
##   the   and  that   for   you  with   was  have  this   are   but   not  from
##  4319  2205   990   953   723   599   599   465   448   417   412   376   354
##  said  will   his   one  they about   all
##   291   287   279   268   264   264   256
barplot(freq[1:20],
        las = 2,
        main = "Top 20 Most Frequent Words")
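The Twitter file contains the most lines (2,360,148), followed by news (1,010,206) and blogs (899,288). Line counts alone do not measure document length; blog entries are expected to be longer than tweets, but this should be confirmed with word or character counts per line. In the 3,000-line sample, a small number of words accounts for a large share of total word usage, which is consistent with typical natural language datasets. Note that TermDocumentMatrix keeps only terms of at least three characters by default, so very common short words such as "to", "of", and "in" are absent from the frequency table, and stop words were not removed.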
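The claim that a few words dominate usage can be quantified as vocabulary coverage: how many of the most frequent words are needed to account for a given share of all word occurrences. The following is a minimal sketch in base R. The helper name words_needed and the toy counts are assumptions made for illustration; they are not values from the corpus.

```r
# Hypothetical toy counts (not taken from the corpus); they sum to 100.
toy_freq <- c(the = 50, and = 25, that = 10, you = 8,
              with = 4, was = 2, said = 1)

# Number of top-ranked words needed to cover `target` (a proportion
# between 0 and 1) of all word occurrences.
words_needed <- function(freq, target) {
  coverage <- cumsum(sort(freq, decreasing = TRUE)) / sum(freq)
  unname(which(coverage >= target)[1])
}

words_needed(toy_freq, 0.5)
words_needed(toy_freq, 0.9)
```

Applied to the freq vector computed above, the same helper would report how many distinct words cover, say, 50% or 90% of the sample.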
The final prediction algorithm will use an n-gram language model with a backoff strategy. The model will first look for trigrams that begin with the last two words of the input. If none match, it will fall back to bigrams that begin with the last word, and finally to the most frequent unigrams. This approach balances prediction accuracy with computational efficiency.
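The backoff lookup described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the final model: the function names (tokenize, count_ngrams, predict_next) and the toy sentences are hypothetical, candidates are ranked by raw counts only, and no smoothing or discounting is applied.

```r
# Split text into lowercase word tokens (letters and apostrophes only).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count n-grams over a character vector of lines, most frequent first.
# N-grams never cross line boundaries.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts,
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Suggest up to k next words: try trigrams, then bigrams, then unigrams.
predict_next <- function(input, tri, bi, uni, k = 3) {
  w <- tokenize(input)
  n <- length(w)
  # Final words of all n-grams that start with `prefix`, in count order.
  complete <- function(counts, prefix) {
    nm <- as.character(names(counts))
    hits <- nm[startsWith(nm, paste0(prefix, " "))]
    sub(".* ", "", hits)
  }
  if (n >= 2) {
    out <- complete(tri, paste(w[n - 1], w[n]))
    if (length(out) > 0) return(head(out, k))
  }
  if (n >= 1) {
    out <- complete(bi, w[n])
    if (length(out) > 0) return(head(out, k))
  }
  head(names(uni), k)
}

# Hypothetical toy corpus, used only to exercise the functions.
toy <- c("i love new york", "i love new york city", "i love new shoes",
         "we love old books", "new york is big")
uni <- count_ngrams(toy, 1)
bi  <- count_ngrams(toy, 2)
tri <- count_ngrams(toy, 3)

predict_next("I love new", tri, bi, uni)   # trigram match on "love new"
predict_next("brand new", tri, bi, uni)    # backs off to bigrams on "new"
predict_next("hello zebra", tri, bi, uni)  # backs off to unigrams
```

Storing each n-gram as a single string keeps the sketch short. For the full corpus, a keyed table of prefix and next word would likely be faster to query.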
The Shiny application will allow users to enter text and receive next-word predictions in real time. The application will display the most likely predicted word along with alternative suggestions.
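A minimal sketch of such an interface is shown below, assuming the shiny package is installed. The predict_next function here is a hypothetical placeholder that returns fixed words; in the real application it would be replaced by the n-gram model. The input and output IDs (phrase, best, alternatives) are likewise illustrative.

```r
library(shiny)

# Hypothetical placeholder for the real model: returns fixed suggestions.
predict_next <- function(text, k = 3) {
  head(c("the", "and", "that"), k)
}

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Enter text:"),
  strong("Most likely next word:"),
  textOutput("best"),
  strong("Alternatives:"),
  textOutput("alternatives")
)

server <- function(input, output, session) {
  # Recomputed whenever the text box changes; skipped while it is empty.
  suggestions <- reactive({
    req(nzchar(trimws(input$phrase)))
    predict_next(input$phrase)
  })
  output$best <- renderText(suggestions()[1])
  output$alternatives <- renderText(
    paste(suggestions()[-1], collapse = ", ")
  )
}

app <- shinyApp(ui, server)
# In an interactive session, launch with: runApp(app)
```

Keeping the prediction logic in a plain function, separate from the UI, makes it possible to test the model without starting the application.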
This exploratory analysis successfully loaded and examined the HC Corpora dataset. The data has been sampled, cleaned, and analyzed to identify common word patterns. Future work will focus on developing an efficient next-word prediction model and deploying it through a Shiny application.