The purpose of this milestone report is to explore the SwiftKey text data that will be used to build a predictive text model. This report summarizes the characteristics of the three English-language data sets, presents exploratory analyses, and outlines plans for developing the prediction algorithm and Shiny application.
library(stringi)
library(dplyr)
library(ggplot2)
library(tm)
library(wordcloud)
library(knitr)
# Paths to the three English-language source files
blogsFile <- "en_US.blogs.txt"
newsFile <- "en_US.news.txt"
twitterFile <- "en_US.twitter.txt"
# Read each file as UTF-8, skipping embedded NUL characters
blogs <- readLines(blogsFile, encoding = "UTF-8", skipNul = TRUE)
news <- readLines(newsFile, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(twitterFile, encoding = "UTF-8", skipNul = TRUE)
# Summarize file size, line, word, and character counts for each file
summaryData <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  File_Size_MB = round(
    file.info(c(blogsFile, newsFile, twitterFile))$size / 1024^2, 2
  ),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  ),
  Characters = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
)
kable(summaryData, caption = "Summary Statistics for the SwiftKey Data Sets")
| File | File_Size_MB | Lines | Words | Characters |
|---|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546250 | 206824505 |
| News | 196.28 | 1010242 | 34762395 | 203223159 |
| Twitter | 159.36 | 2360148 | 30093413 | 162096241 |
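
The counts in the summary table can be combined into per-line averages, which makes the difference between the three sources easier to see. The sketch below re-enters the reported values by hand; the names `summary_counts`, `Words_Per_Line`, and `Chars_Per_Line` are illustrative and are not part of the original analysis.

```r
# Values copied by hand from the summary table (illustration only)
summary_counts <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413),
  Characters = c(206824505, 203223159, 162096241)
)

# Derived averages: words and characters per line for each source
summary_counts$Words_Per_Line <-
  round(summary_counts$Words / summary_counts$Lines, 2)
summary_counts$Chars_Per_Line <-
  round(summary_counts$Characters / summary_counts$Lines, 2)

print(summary_counts)
```

On these figures, Twitter lines average roughly 13 words, compared with roughly 42 for blogs and 34 for news.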
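Because the complete data set contains more than four million lines of text, a random sample of 5,000 lines from each file (15,000 lines in total, less than 1% of each file) is used for exploratory analysis. Sampling greatly reduces computation time while still capturing the most common language patterns, and a fixed seed makes the sample reproducible.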
# Draw 5,000 lines from each source with a fixed seed for reproducibility
set.seed(12345)
sampleBlogs <- sample(blogs, 5000)
sampleNews <- sample(news, 5000)
sampleTwitter <- sample(twitter, 5000)
sampleText <- c(sampleBlogs, sampleNews, sampleTwitter)
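
The sample uses a fixed number of lines per file. If a sample proportional to each file's length were preferred instead, a helper along the following lines could be used. This is only a sketch: `sample_fraction` is a hypothetical name, and the toy vector stands in for `blogs`, `news`, or `twitter`.

```r
# Hypothetical helper: draw a given fraction of lines without replacement
sample_fraction <- function(lines, fraction, seed = 12345) {
  stopifnot(fraction > 0, fraction <= 1)
  set.seed(seed)
  # Always keep at least one line, even for very small inputs
  n <- max(1, round(length(lines) * fraction))
  sample(lines, size = n)
}

# Toy input standing in for one of the real files
toy_lines <- paste("line", seq_len(1000))
toy_sample <- sample_fraction(toy_lines, fraction = 0.01)
length(toy_sample)
```

A fixed count keeps the three sources equally weighted in the sample, whereas a fixed fraction would let the much longer Twitter file dominate, so the choice depends on which balance the analysis needs.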
# Build a corpus from the sample and normalize the text
corpus <- Corpus(VectorSource(sampleText))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
# Count term frequencies on the sparse matrix; slam::col_sums() avoids the
# large dense copy that colSums(as.matrix(dtm)) would allocate
dtm <- DocumentTermMatrix(corpus)
freq <- slam::col_sums(dtm)
freq <- sort(freq, decreasing = TRUE)
wordFreq <- data.frame(Word = names(freq), Frequency = unname(freq))
top20 <- head(wordFreq, 20)
kable(top20, caption = "Top 20 Most Frequent Words")
| Word | Frequency |
|---|---|
| said | 1544 |
| will | 1434 |
| one | 1188 |
| just | 1159 |
| like | 1019 |
| can | 1015 |
| time | 866 |
| get | 846 |
| new | 776 |
| people | 698 |
| also | 696 |
| now | 660 |
| day | 630 |
| good | 624 |
| first | 603 |
| know | 592 |
| back | 586 |
| last | 553 |
| see | 519 |
| make | 517 |
# Horizontal bar chart of the 20 most frequent words
ggplot(top20, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency")
# Word cloud of the 100 most frequent words
wordcloud(
  words = wordFreq$Word,
  freq = wordFreq$Frequency,
  max.words = 100,
  random.order = FALSE,
  colors = rainbow(8)
)
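
A natural follow-up to the frequency table is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below is an assumption about how that could be computed and is not part of the original analysis; `words_for_coverage` and the toy counts are hypothetical, and in this report `wordFreq$Frequency` would take the place of `toy_freq`.

```r
# Hypothetical helper: smallest number of top-ranked words whose counts
# add up to at least `target` (a share between 0 and 1) of all occurrences
words_for_coverage <- function(freq, target) {
  freq <- sort(freq, decreasing = TRUE)
  cumulative_share <- cumsum(freq) / sum(freq)
  unname(which(cumulative_share >= target)[1])
}

# Toy frequency counts (illustrative only, not taken from the data)
toy_freq <- c(the = 50, of = 30, cat = 10, sat = 5, mat = 5)
words_for_coverage(toy_freq, 0.60)
words_for_coverage(toy_freq, 0.85)
```

A small number of words covering a large share of occurrences would suggest that a prediction model can stay compact by dropping rare terms.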
Several characteristics of the data were observed during the exploratory analysis:
The final predictive text application will use Natural Language Processing (NLP) techniques to predict the next word in a sequence.
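
The report does not state which model will be used. One common approach to next-word prediction is an n-gram model, and the following minimal sketch assumes a bigram version of it: count which word follows which, then suggest the most frequent follower. The toy sentences and the names `make_bigrams`, `bigram_counts`, and `predict_next` are hypothetical and serve only to illustrate the idea.

```r
# Toy corpus (hypothetical sentences, not taken from the SwiftKey data)
toy_text <- c(
  "i like green tea",
  "i like black coffee",
  "i like green apples",
  "you like green tea"
)

# Split each sentence into lowercase word tokens
tokens <- strsplit(tolower(toy_text), "\\s+")

# Build all (first word, following word) pairs within one sentence
make_bigrams <- function(words) {
  if (length(words) < 2) {
    return(data.frame(first = character(0), second = character(0),
                      stringsAsFactors = FALSE))
  }
  data.frame(first = head(words, -1), second = tail(words, -1),
             stringsAsFactors = FALSE)
}
bigrams <- do.call(rbind, lapply(tokens, make_bigrams))

# Count how often each pair occurs and drop pairs that never occur
bigram_counts <- as.data.frame(
  table(first = bigrams$first, second = bigrams$second),
  stringsAsFactors = FALSE
)
bigram_counts <- bigram_counts[bigram_counts$Freq > 0, ]

# Suggest the most frequent follower of `word`, or NA if the word is unseen
predict_next <- function(word, counts) {
  candidates <- counts[counts$first == tolower(word), ]
  if (nrow(candidates) == 0) {
    return(NA_character_)
  }
  as.character(candidates$second[which.max(candidates$Freq)])
}

predict_next("like", bigram_counts)
predict_next("zebra", bigram_counts)
```

A real model would likely need longer n-grams and a fallback for unseen words, which this sketch handles only by returning `NA`.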
The planned workflow includes:
This exploratory analysis demonstrates that the three SwiftKey data sets have been successfully loaded and summarized. The summary statistics and word-frequency visualizations describe the structure of the text corpus and provide a starting point for building the predictive text algorithm and Shiny application.