1. Introduction

The goal of this project is to build a predictive text model that can suggest the next word based on previously typed words. This is a foundational step in developing a text prediction application similar to mobile keyboard autocomplete systems.

The final objective is to implement this model in a Shiny application.


2. Libraries

The following libraries were used for text processing, analysis, and visualization:

library(tm)
library(stringi)
library(ggplot2)
library(dplyr)
library(tidytext)
library(data.table)
library(quanteda)
library(tidyr)
library(quanteda.textstats)

3. Data Loading

The dataset consists of three text sources:

- Blogs
- News articles
- Twitter posts

# Read each source line by line, skipping embedded NUL characters
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
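
Before sampling, the size of each source can be summarized with stringi (already loaded above). This summary table is a sketch added for orientation; the counts depend on the downloaded files:

# Line and word counts per source
data.frame(
  source = c("blogs", "news", "twitter"),
  lines  = c(length(blogs), length(news), length(twitter)),
  words  = c(sum(stri_count_words(blogs)),
             sum(stri_count_words(news)),
             sum(stri_count_words(twitter)))
)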

4. Sample Data (1%)

Because the full dataset is large, a 1% random sample of each source was used for efficiency.

set.seed(123)

# Draw a 1% random sample from each source (floor() keeps the sizes integral)
sample_data <- c(
  sample(blogs,   floor(length(blogs)   * 0.01)),
  sample(news,    floor(length(news)    * 0.01)),
  sample(twitter, floor(length(twitter) * 0.01))
)
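
A quick check on the result (the exact count depends on the corpus files):

length(sample_data)   # number of lines retained in the 1% sample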

5. Data Cleaning

The text was cleaned by:

- converting to lowercase
- removing numbers
- removing punctuation
- removing extra whitespace

# Normalize case, then strip numbers, punctuation, and extra whitespace
clean_text <- tolower(sample_data)
clean_text <- removeNumbers(clean_text)
clean_text <- removePunctuation(clean_text)
clean_text <- stripWhitespace(clean_text)
clean_text <- clean_text[clean_text != ""]   # drop lines emptied by cleaning
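
As a quick illustration of the pipeline on a made-up toy string (not drawn from the corpus):

example <- "Hello, World!! It's 2024 --  what   comes next?"
stripWhitespace(removePunctuation(removeNumbers(tolower(example))))
## [1] "hello world its what comes next"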

6. Tokenization

The cleaned text was tokenized into individual words with quanteda:

tokens_data <- tokens(clean_text)
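
The tokens object can be inspected directly; for example (output depends on the sample):

# Total token count across the sample, and a peek at the first line's tokens
sum(ntoken(tokens_data))
head(tokens_data[[1]])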

7. Exploratory Analysis & Unigrams

dfm_uni <- dfm(tokens_data)
freq_unigram <- textstat_frequency(dfm_uni)

top_unigrams <- head(freq_unigram, 20)

ggplot(top_unigrams, aes(x = reorder(feature, frequency), y = frequency)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Unigrams", x = "Words", y = "Frequency")
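
A common follow-up question (a sketch, not part of the original analysis) is how few distinct words cover most of the text. Because textstat_frequency() returns features sorted by descending frequency, a cumulative sum answers this directly:

# Share of all token instances covered by the top-ranked words
coverage <- cumsum(freq_unigram$frequency) / sum(freq_unigram$frequency)
which(coverage >= 0.5)[1]   # distinct words needed for 50% coverage
which(coverage >= 0.9)[1]   # distinct words needed for 90% coverage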

8. Bigrams

bigrams <- tokens_ngrams(tokens_data, n = 2)
dfm_bigrams <- dfm(bigrams)
freq_bigram <- textstat_frequency(dfm_bigrams)

bigram_df <- as.data.frame(freq_bigram)[, c("feature", "frequency")]
colnames(bigram_df) <- c("ngram", "freq")

# tokens_ngrams() joins words with "_" by default, so split on "_" (not a space)
bigram_df <- bigram_df %>%
  separate(ngram, into = c("w1", "w2"), sep = "_")

9. Trigrams

trigrams <- tokens_ngrams(tokens_data, n = 3)
dfm_trigrams <- dfm(trigrams)
freq_trigram <- textstat_frequency(dfm_trigrams)

trigram_df <- as.data.frame(freq_trigram)[, c("feature", "frequency")]
colnames(trigram_df) <- c("ngram", "freq")

# Again split on the "_" concatenator used by tokens_ngrams()
trigram_df <- trigram_df %>%
  separate(ngram, into = c("w1", "w2", "w3"), sep = "_")

10. N-gram Prediction Model

A predictive model was built using n-grams:

- Unigrams: single words
- Bigrams: word pairs
- Trigrams: word triplets

These models estimate the probability of the next word from the frequencies of the preceding words; the estimate behind these lookups is sketched after the two functions below.

11. Prediction Functions

Bigram prediction

# Return the three most frequent continuations of `word` from the bigram table
predict_bigram <- function(word) {
  result <- bigram_df %>%
    filter(w1 == word) %>%
    arrange(desc(freq)) %>%
    head(3)

  return(result)
}
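
For example (the word "of" is chosen purely for illustration; actual output depends on the sample):

predict_bigram("of")   # three most frequent words seen after "of"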

Trigram prediction

# Return the three most frequent words following the pair (word1, word2)
predict_trigram <- function(word1, word2) {
  result <- trigram_df %>%
    filter(w1 == word1, w2 == word2) %>%
    arrange(desc(freq)) %>%
    head(3)

  return(result)
}
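
The counts behind these lookups can be turned into explicit probabilities. As a minimal sketch (not part of the original model; the prefix "one of" is chosen purely for illustration), the maximum-likelihood estimate divides each trigram count by the total count of its two-word prefix:

# Maximum-likelihood estimate: P(w3 | w1, w2) = count(w1 w2 w3) / count(w1 w2)
trigram_probs <- trigram_df %>%
  group_by(w1, w2) %>%
  mutate(prob = freq / sum(freq)) %>%
  ungroup()

# Top continuations for an illustrative prefix, now with probabilities
trigram_probs %>%
  filter(w1 == "one", w2 == "of") %>%
  arrange(desc(prob)) %>%
  head(3)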

12. Backoff Model

To handle unseen word combinations, a backoff strategy was implemented: the model first looks for a trigram match on the last two words and, if none exists, backs off to a bigram match on the most recent word.

predict_next_word <- function(word1, word2) {

  # First try the trigram table: words seen after the exact pair (word1, word2)
  result <- trigram_df %>%
    filter(w1 == word1, w2 == word2) %>%
    arrange(desc(freq)) %>%
    head(1)

  if (nrow(result) > 0) {
    return(result)
  }

  # Back off to the bigram table, keyed on the most recent word only
  result <- bigram_df %>%
    filter(w1 == word2) %>%
    arrange(desc(freq)) %>%
    head(1)

  return(result)
}
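
If the bigram lookup also comes up empty, predict_next_word() returns a zero-row data frame. A minimal sketch of a further fallback, assuming freq_unigram from Section 7 (the function name predict_with_fallback is hypothetical):

# Fall back to the most frequent single words when both lookups fail
predict_with_fallback <- function(word1, word2) {
  result <- predict_next_word(word1, word2)
  if (nrow(result) > 0) {
    return(result)
  }
  # Note: columns here are feature/frequency rather than w1/w2/freq
  head(as.data.frame(freq_unigram)[, c("feature", "frequency")], 3)
}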

13. Conclusion

A basic n-gram predictive text model with a trigram-to-bigram backoff was successfully developed. This model forms the foundation for the real-time Shiny text prediction application described in the introduction.