Introduction

The purpose of this milestone report is to explore the SwiftKey text data that will be used to build a predictive text model. This report summarizes the characteristics of the three English-language data sets, presents exploratory analyses, and outlines plans for developing the prediction algorithm and Shiny application.

Load Required Packages

library(stringi)
library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(knitr)

Load the Data

blogsFile <- "en_US.blogs.txt"
newsFile <- "en_US.news.txt"
twitterFile <- "en_US.twitter.txt"

blogs <- readLines(blogsFile,
                   encoding = "UTF-8",
                   skipNul = TRUE)

news <- readLines(newsFile,
                  encoding = "UTF-8",
                  skipNul = TRUE)

twitter <- readLines(twitterFile,
                     encoding = "UTF-8",
                     skipNul = TRUE)

Summary Statistics

summaryData <- data.frame(

  File = c("Blogs",
           "News",
           "Twitter"),

  File_Size_MB = round(file.info(c(
    blogsFile,
    newsFile,
    twitterFile
  ))$size / 1024^2, 2),

  Lines = c(
    length(blogs),
    length(news),
    length(twitter)
  ),

  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),

  Characters = c(
    sum(nchar(blogs)),
    sum(nchar(news)),
    sum(nchar(twitter))
  )
)

kable(summaryData,
      caption = "Summary Statistics for the SwiftKey Data Sets")

Summary Statistics for the SwiftKey Data Sets

File      File_Size_MB    Lines      Words   Characters
Blogs           200.42   899288   37546250    206824505
News            196.28  1010242   34762395    203223159
Twitter         159.36  2360148   30093413    162096241

Observations

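  • The Twitter data set contains the largest number of lines (2,360,148), more than the blog and news files combined.
  • Blog and news entries are substantially longer: about 42 and 34 words per line on average, compared with about 13 for Twitter.
  • Together the three files contain roughly 102 million words, a rich source of natural language suitable for predictive modeling.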
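
The difference in entry length can be checked directly from the summary table. The following is a minimal sketch that recomputes average words per line from the reported counts; the data frame name summaryCounts and the column Words_Per_Line are introduced here for illustration only.

```r
# Line and word counts copied from the summary table above.
summaryCounts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413)
)

# Average number of words per line for each source, rounded to one decimal.
summaryCounts$Words_Per_Line <- round(summaryCounts$Words / summaryCounts$Lines, 1)

print(summaryCounts)
```

Blog lines come out roughly three times as long as tweets, which is consistent with the length limit on Twitter messages.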

Sampling the Data

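Because the complete data set contains about 4.27 million lines of text, exploratory analysis uses a random sample of 5,000 lines from each file (15,000 lines in total, roughly 0.35% of all lines). Sampling greatly reduces computation time, and a simple random sample of this size should still reflect the most common language patterns.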

set.seed(12345)

sampleBlogs <- sample(blogs, 5000)
sampleNews <- sample(news, 5000)
sampleTwitter <- sample(twitter, 5000)

sampleText <- c(
  sampleBlogs,
  sampleNews,
  sampleTwitter
)

Text Cleaning

corpus <- Corpus(VectorSource(sampleText))

corpus <- tm_map(corpus,
                 content_transformer(tolower))

corpus <- tm_map(corpus,
                 removePunctuation)

corpus <- tm_map(corpus,
                 removeNumbers)

corpus <- tm_map(corpus,
                 removeWords,
                 stopwords("english"))

corpus <- tm_map(corpus,
                 stripWhitespace)
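
To make the effect of these steps concrete, the sketch below applies roughly equivalent transformations to a single made-up sentence using base R only. It is an approximation, not the tm implementation: the regular expressions, the short list stopWordsDemo (a stand-in for stopwords("english")), and the final trimws() call are assumptions made for illustration.

```r
# Toy input; not taken from the SwiftKey files.
x <- "I can't WAIT for the 2 new phones!!   See you at 5pm."

x <- tolower(x)                      # convert to lower case
x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
x <- gsub("[[:digit:]]+", "", x)     # remove digits

# Hypothetical mini stop-word list standing in for stopwords("english").
stopWordsDemo <- c("i", "for", "the", "you", "at")
pattern <- paste0("\\b(", paste(stopWordsDemo, collapse = "|"), ")\\b")
x <- gsub(pattern, "", x, perl = TRUE)   # remove whole-word matches only

x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
x <- trimws(x)                       # trim the ends (added here for display only)

print(x)
# [1] "cant wait new phones see pm"
```

Note that order matters: because punctuation is removed before stop words, a contraction such as "can't" becomes "cant" and would no longer match a stop-word entry that contains an apostrophe.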

Word Frequency Analysis

dtm <- DocumentTermMatrix(corpus)

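# Sum term counts on the sparse matrix directly; as.matrix(dtm) would build a
# dense documents-by-terms matrix, which can exhaust memory on larger samples.
# slam is installed as a dependency of tm.
freq <- slam::col_sums(dtm)
freq <- sort(freq, decreasing = TRUE)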

wordFreq <- data.frame(
  Word = names(freq),
  Frequency = unname(freq)
)

top20 <- head(wordFreq, 20)

kable(top20,
      caption = "Top 20 Most Frequent Words")

Top 20 Most Frequent Words

Word      Frequency
said           1544
will           1434
one            1188
just           1159
like           1019
can            1015
time            866
get             846
new             776
people          698
also            696
now             660
day             630
good            624
first           603
know            592
back            586
last            553
see             519
make            517

Bar Chart of Most Frequent Words

ggplot(top20,
       aes(
         x = reorder(Word, Frequency),
         y = Frequency
       )) +

  geom_col(fill = "steelblue") +

  coord_flip() +

  labs(
    title = "Top 20 Most Frequent Words",
    x = "Word",
    y = "Frequency"
  )

Word Cloud

# Fix the seed so the word placement is reproducible between runs.
set.seed(12345)

wordcloud(
  words = wordFreq$Word,
  freq = wordFreq$Frequency,
  max.words = 100,
  random.order = FALSE,
  colors = rainbow(8)
)

Findings

Several characteristics of the data were observed during the exploratory analysis:

Future Plans

The final predictive text application will use Natural Language Processing (NLP) techniques to predict the next word in a sequence.
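
The report does not state which technique will be used. One common baseline for next-word prediction is a bigram model, which suggests the word that most often followed the previous word in the training text. The sketch below illustrates that idea in base R on a made-up four-line corpus; trainText, bigrams, and the helper predictNext() are hypothetical names, and this is not presented as the planned algorithm.

```r
# Toy training text; a real model would be trained on the cleaned corpus.
trainText <- c(
  "i love new york",
  "i love pizza",
  "i love new music",
  "you love new york"
)

# Split each line into words and collect adjacent word pairs (bigrams).
tokens <- strsplit(trainText, " ", fixed = TRUE)
bigrams <- do.call(rbind, lapply(tokens, function(w) {
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Hypothetical helper: return the word that most often follows `word`,
# or NA if `word` never appears as the first word of a bigram.
predictNext <- function(word, bigrams) {
  candidates <- bigrams$second[bigrams$first == word]
  if (length(candidates) == 0) return(NA_character_)
  counts <- table(candidates)
  names(counts)[which.max(counts)]
}

predictNext("love", bigrams)   # "new"  (follows "love" 3 times out of 4)
predictNext("new", bigrams)    # "york" (follows "new" 2 times out of 3)
```

An unseen word returns NA in this sketch; a fuller model would typically need some fallback for that case, such as suggesting the most frequent words overall.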

The planned workflow includes:

Conclusion

This exploratory analysis demonstrates that the SwiftKey data set has been successfully loaded and summarized. Basic statistics and visualizations provide insight into the structure of the text corpus and establish a strong foundation for building a predictive text algorithm and Shiny application.