Introduction

The purpose of this milestone report is to explore the SwiftKey text data that will be used to build a predictive text model. This report summarizes the characteristics of the three English-language data sets, presents exploratory analyses, and outlines plans for developing the prediction algorithm and Shiny application.

Load Required Packages

library(stringi)
library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(knitr)

Load the Data

blogsFile <- "en_US.blogs.txt"
newsFile <- "en_US.news.txt"
twitterFile <- "en_US.twitter.txt"

blogs <- readLines(blogsFile,
                   encoding = "UTF-8",
                   skipNul = TRUE)

news <- readLines(newsFile,
                  encoding = "UTF-8",
                  skipNul = TRUE)

twitter <- readLines(twitterFile,
                     encoding = "UTF-8",
                     skipNul = TRUE)
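
On some platforms, notably Windows, reading a file in text mode can stop early if the file contains an embedded control character. The line counts reported below suggest the files were read in full here, so the following is only a defensive sketch: it opens the file through a binary connection before calling readLines. The helper name read_text_file and the temporary demo file are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: read a UTF-8 text file through a binary connection
# so that embedded control characters are not treated as end-of-file.
read_text_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Demonstration on a small temporary file (not the SwiftKey data).
demo_path <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), demo_path)
demo_lines <- read_text_file(demo_path)
length(demo_lines)
```

With the report's own variables, the call would look like read_text_file(blogsFile).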

Summary Statistics

summaryData <- data.frame(

  File = c("Blogs",
           "News",
           "Twitter"),

  File_Size_MB = round(file.info(c(
    blogsFile,
    newsFile,
    twitterFile
  ))$size / 1024^2, 2),

  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),

  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),

  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)

kable(summaryData,
      caption = "Summary Statistics for the SwiftKey Data Sets")

Summary Statistics for the SwiftKey Data Sets
File	File_Size_MB	Lines	Words	Characters
Blogs	200.42	899288	37546250	206824505
News	196.28	1010242	34762395	203223159
Twitter	159.36	2360148	30093413	162096241

Observations

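The Twitter data set contains the most lines (about 2.36 million), more than twice as many as either the Blogs or the News file.
Blog and News entries are substantially longer, averaging about 42 and 34 words per line, compared with about 13 words per line for Twitter.
All three files provide a large body of natural language suitable for predictive modeling.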
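
Average entry length can be derived directly from the summary table. The sketch below re-enters the reported counts by hand (the data frame name summary_counts is an assumption for illustration) and divides words and characters by lines.

```r
# Counts copied from the summary statistics table in this report.
summary_counts <- data.frame(
  File       = c("Blogs", "News", "Twitter"),
  Lines      = c(899288, 1010242, 2360148),
  Words      = c(37546250, 34762395, 30093413),
  Characters = c(206824505, 203223159, 162096241)
)

# Average length of one entry (one line) in each file.
summary_counts$Words_Per_Line <-
  round(summary_counts$Words / summary_counts$Lines, 1)
summary_counts$Chars_Per_Line <-
  round(summary_counts$Characters / summary_counts$Lines, 1)

print(summary_counts)
```

In the report itself, the same two columns could be added to summaryData before it is passed to kable.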

Sampling the Data

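Because the complete data set contains millions of lines of text, a random sample of 5,000 lines from each file (15,000 lines in total) is used for exploratory analysis. Sampling greatly reduces computation time while still capturing the most common language patterns, and a fixed seed makes the sample reproducible.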

set.seed(12345)

sampleBlogs <- sample(blogs, 5000)
sampleNews <- sample(news, 5000)
sampleTwitter <- sample(twitter, 5000)

sampleText <- c(
  sampleBlogs,
  sampleNews,
  sampleTwitter
)
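
Drawing the same number of lines from each file means the sources are sampled at different rates, because the Twitter file has far more lines than the other two. One alternative, shown here only as a sketch on toy data, is to sample the same fraction from every source. The helper sample_fraction and the toy vectors are assumptions for illustration, not part of the original analysis.

```r
# Hypothetical helper: draw a fixed fraction of lines without replacement,
# always returning at least one line.
sample_fraction <- function(lines, fraction) {
  stopifnot(fraction > 0, fraction <= 1)
  n <- max(1, round(length(lines) * fraction))
  sample(lines, n)
}

set.seed(12345)

# Toy stand-ins for two sources of different sizes.
toy_blogs   <- paste("blog entry", seq_len(1000))
toy_twitter <- paste("tweet", seq_len(3000))

toy_sample <- c(
  sample_fraction(toy_blogs, 0.01),
  sample_fraction(toy_twitter, 0.01)
)
length(toy_sample)
```

Equal counts keep the three sources balanced in the sample, while equal fractions keep the sample proportional to the full data. Which is preferable depends on how the final model should weight each source.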

Text Cleaning

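# Build a corpus with one document per sampled line.
corpus <- Corpus(VectorSource(sampleText))

# Convert all text to lower case so that "The" and "the" are counted together.
corpus <- tm_map(corpus,
                 content_transformer(tolower))

# Remove punctuation and digits.
# Note: punctuation is removed before stop words, so contractions such as
# "don't" become "dont" and are not matched by the stop-word list.
corpus <- tm_map(corpus,
                 removePunctuation)

corpus <- tm_map(corpus,
                 removeNumbers)

# Remove common English stop words (e.g. "the", "and").
corpus <- tm_map(corpus,
                 removeWords,
                 stopwords("english"))

# Collapse the runs of spaces left behind by the previous steps.
corpus <- tm_map(corpus,
                 stripWhitespace)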
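
The order of the cleaning steps matters for contractions. The sketch below uses base R only, with a three-word stop list invented for illustration, to compare removing punctuation first with removing stop words first. It mimics the idea behind tm's removePunctuation and removeWords but does not call them, so it should be read as an illustration of the effect rather than a reproduction of tm's behaviour.

```r
# Tiny illustrative stop-word list (an assumption, not tm's full list).
stop_words <- c("don't", "i'm", "the")

# Remove whole words that appear in `words`.
remove_words <- function(x, words) {
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  gsub(pattern, "", x, perl = TRUE)
}

# Remove every punctuation character, including apostrophes.
remove_punct <- function(x) gsub("[[:punct:]]+", "", x)

# Collapse repeated whitespace and trim the ends.
squish <- function(x) trimws(gsub("\\s+", " ", x))

text <- tolower("I'm sure the users don't mind.")

# Punctuation first: "i'm" and "don't" lose their apostrophes and survive.
punct_first <- squish(remove_words(remove_punct(text), stop_words))
punct_first  # "im sure users dont mind"

# Stop words first: the contractions are matched and removed.
words_first <- squish(remove_punct(remove_words(text, stop_words)))
words_first  # "sure users mind"
```

For a next-word prediction model, stop words are usually part of what must be predicted, so the final pipeline may skip stop-word removal altogether.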

Word Frequency Analysis

dtm <- DocumentTermMatrix(corpus)

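# Sum term counts on the sparse matrix directly; converting a
# 15,000-document matrix to dense form with as.matrix() can exhaust memory.
# slam is installed as a dependency of tm.
freq <- slam::col_sums(dtm)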
freq <- sort(freq, decreasing = TRUE)

wordFreq <- data.frame(
  Word = names(freq),
  Frequency = unname(freq)
)

top20 <- head(wordFreq, 20)

kable(top20,
      caption = "Top 20 Most Frequent Words")

Top 20 Most Frequent Words
Word	Frequency
said	1544
will	1434
one	1188
just	1159
like	1019
can	1015
time	866
get	846
new	776
people	698
also	696
now	660
day	630
good	624
first	603
know	592
back	586
last	553
see	519
make	517

Bar Chart of Most Frequent Words

ggplot(top20,
       aes(
         x = reorder(Word, Frequency),
         y = Frequency
       )) +

  geom_col(fill = "steelblue") +

  coord_flip() +

  labs(
    title = "Top 20 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

Word Cloud

wordcloud(
  words = wordFreq$Word,
  freq = wordFreq$Frequency,
  max.words = 100,
  random.order = FALSE,
  colors = rainbow(8)
)

Findings

Several characteristics of the data were observed during the exploratory analysis:

Twitter contains the largest number of text entries, but each entry is relatively short.
Blog and News entries are much longer than tweets, which suggests they may also offer greater vocabulary diversity, although diversity was not measured directly in this analysis.
A relatively small number of words account for a large proportion of the text.
Sampling provides an efficient way to analyze language patterns while reducing computational requirements.
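
The observation that a small number of words accounts for a large share of the text can be quantified with a coverage calculation. The sketch below is a minimal version on a toy frequency vector. The helper name words_for_coverage and the toy counts are assumptions for illustration, and no coverage figures for the SwiftKey sample are claimed here.

```r
# Hypothetical helper: smallest number of top-ranked words whose combined
# frequency reaches `target` (a proportion) of all word instances.
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative <- cumsum(freq) / sum(freq)
  unname(which(cumulative >= target)[1])
}

# Toy frequencies (100 word instances in total).
toy_freq <- c(the = 50, of = 20, and = 15, cat = 10, zebra = 5)

words_for_coverage(toy_freq, 0.5)  # 1 word covers 50%
words_for_coverage(toy_freq, 0.9)  # 4 words are needed to cover 90%
```

Applied to the report's freq vector, the same call would give the number of distinct words needed to cover a chosen share of the cleaned sample. Because stop words were removed during cleaning, that figure would understate how concentrated the raw text is.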

Future Plans

The final predictive text application will use Natural Language Processing (NLP) techniques to predict the next word in a sequence.

The planned workflow includes:

Cleaning and preprocessing the text.
Tokenizing the text into words.
Building unigram, bigram, trigram, and four-gram models.
Calculating n-gram frequencies.
Using a back-off prediction algorithm that falls back to progressively shorter n-grams when the longest available context has no match.
Developing an interactive Shiny application that predicts the next word as users type.
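
The tokenizing, n-gram counting, and back-off steps above can be sketched in base R as follows. This is a simplified count-based back-off with no smoothing or discounting, run on a toy sentence. The function names (tokenize, ngram_counts, predict_next) and the toy text are assumptions for illustration and do not represent the final algorithm.

```r
# Split lower-cased text into word tokens (letters and apostrophes only).
# For simplicity, the whole input is treated as one running text.
tokenize <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# Count every n-gram in a token vector, most frequent first.
# Names of the result are the n-grams joined by single spaces.
ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) return(NULL)
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}

# Try the longest n-gram model first, then back off to shorter ones.
# `models` is a list where models[[n]] holds the n-gram counts.
predict_next <- function(context, models) {
  words <- tokenize(context)
  for (n in rev(seq_along(models)[-1])) {
    counts <- models[[n]]
    if (is.null(counts) || length(words) < n - 1) next
    prefix <- paste(tail(words, n - 1), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # keep only the final word
    }
  }
  names(models[[1]])[1]  # no context matched: most frequent unigram
}

toy_text <- "the cat sat on the mat and the cat sat on the sofa"
toy_tokens <- tokenize(toy_text)
models <- lapply(1:4, function(n) ngram_counts(toy_tokens, n))

predict_next("the cat", models)       # trigram match: "sat"
predict_next("a dog sat", models)     # backs off to bigram: "on"
predict_next("purple zebra", models)  # backs off to unigram: "the"
```

On the full data set, n-gram tables built this way would be large, so the final application will likely need a dedicated tokenizer and some pruning of rare n-grams.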

Conclusion

This exploratory analysis demonstrates that the SwiftKey data set has been successfully loaded and summarized. Basic statistics and visualizations provide insight into the structure of the text corpus and establish a strong foundation for building a predictive text algorithm and Shiny application.

Data Science Capstone Milestone Report

LaKeya King

2026-06-16