Introduction

The primary goal of this report is to show the current progress made in exploring the SwiftKey data set. English was chosen as the language for analysis, since it is the most convenient language for the writer. Because the full corpus is large enough to strain the device's CPU and memory, the extended analysis focuses on the en_US.twitter.txt data.

Data Loading

blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

The three files are loaded with readLines() as shown above, keeping the UTF-8 encoding and skipping embedded null characters.
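The code in the following sections relies on a few additional packages for string handling, tokenization, and plotting. A minimal setup sketch (assuming stringr, dplyr, tidytext, and ggplot2 are installed; knitr is called via knitr::kable()) is:

# Packages used throughout the analysis
library(stringr)   # str_count() for word counts
library(dplyr)     # data-manipulation verbs and the %>% pipe
library(tidytext)  # unnest_tokens() and the stop_words lexicon
library(ggplot2)   # bigram frequency plot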

Basic TXT Statistics

get_summary <- function(data) {
  # Count lines, whitespace-delimited words, and total characters in a text vector
  lines <- length(data)
  words <- sum(str_count(data, "\\S+"))
  characters <- sum(nchar(data))
  list(lines = lines, words = words, characters = characters)
}
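As a quick check, the helper can be applied to a single dataset, for example:

# Example: summary statistics for the Twitter data alone
get_summary(twitter)

The data frame below gathers the same statistics for all three files so they can be displayed with knitr::kable().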

summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\S+")), sum(str_count(news, "\\S+")), sum(str_count(twitter, "\\S+"))),
  Characters = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter)))
)

knitr::kable(summary_df, caption = "Basic Summary of Text Datasets")
Basic Summary of Text Datasets

Dataset    Lines     Words      Characters
Blogs      899288    37334131   206824505
News       1010206   34371031   203214543
Twitter    2360148   30373583   162096241

Above are the basic summary statistics for each of the acquired text files. In the subsequent analyses, only the Twitter text is examined in depth, in order to keep runtime and CPU usage manageable.
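If runtime were still a concern, a common option (sketched here only as an illustration, not part of the analysis above) is to explore a random sample of the Twitter lines instead of the full file:

# Optional: work with a 10% random sample of the Twitter lines
set.seed(123)
twitter_sample <- sample(twitter, size = floor(0.1 * length(twitter)))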

Distribution of Line Lengths

twitter_lengths <- nchar(twitter)

hist(twitter_lengths, breaks = 50,
     main = "Line Lengths in Twitter Dataset",
     xlab = "Characters per Line", col = "lightblue", border = "white")

The distribution of line lengths is slightly left-skewed, and interestingly there is a significant number of lines with close to 140 characters, which matches Twitter's character limit at the time the data was collected.
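The spike near the limit can be confirmed directly from the twitter_lengths vector computed above (a quick check, not shown in the original output):

# Five-number summary and the share of lines at or above 140 characters
summary(twitter_lengths)
sum(twitter_lengths >= 140)
mean(twitter_lengths >= 140)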

Most Common/Frequent Words in Twitter Data

text_df <- tibble(line = twitter)

# Tokenize each line into single words and count occurrences
word_counts <- text_df %>%
  unnest_tokens(word, line) %>%
  count(word, sort = TRUE)

# Load the stop-word lexicon shipped with tidytext
data("stop_words")

# Drop stop words so content words rise to the top
filtered_counts <- word_counts %>%
  anti_join(stop_words, by = "word")

head(filtered_counts, 10)
## # A tibble: 10 × 2
##    word        n
##    <chr>   <int>
##  1 love   106732
##  2 day     91748
##  3 rt      89601
##  4 time    76803
##  5 lol     70162
##  6 3       54940
##  7 people  52047
##  8 happy   49009
##  9 follow  48108
## 10 2       45515

First, the code removes stop words such as “the”, “is”, and “I”, because they would otherwise dominate the frequency counts: regardless of the corpus, stop words are the most used words in ordinary English. Once they are removed, it is interesting that many colloquial words (“lol”, “rt”) appear near the top, reflecting the casual nature of language on social media.
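How much the filter removes can be quantified from the objects defined above (a small sketch, not part of the original output):

# Proportion of total word occurrences accounted for by stop words
total_words   <- sum(word_counts$n)
nonstop_words <- sum(filtered_counts$n)
1 - nonstop_words / total_words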

Common Bigrams in the Data

# Tokenize each line into bigrams (pairs of consecutive words)
bigrams <- text_df %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2)

bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE)

# Plot the ten most frequent bigrams
bigram_counts %>%
  filter(n > 5) %>%
  top_n(10) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top Bigrams in Twitter Data", x = "Bigram", y = "Frequency")
## Selecting by n

Finally, as a preview of the prediction model, bigrams were analyzed; as expected, most of the frequent bigrams are standard patterns in English.
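As a rough sketch of how these counts could feed the eventual prediction model (the separate()/predict_next() code below is illustrative only, not part of the analysis above), the bigrams can be split into a leading word and a following word, and the most frequent followers returned as a naive next-word prediction:

library(tidyr)

# Split each bigram into its leading and following word
bigram_split <- bigram_counts %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Naive prediction: the k most frequent followers of a given word
predict_next <- function(prev_word, k = 3) {
  bigram_split %>%
    filter(word1 == prev_word) %>%
    slice_head(n = k) %>%   # bigram_counts is already sorted by frequency
    pull(word2)
}

predict_next("happy")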

Conclusion

This report demonstrates successful data acquisition and an initial exploratory data analysis. Overall, the bigrams indicate that the most frequent phrases are in line with what would be expected of a normal English sentence. The data required stop-word removal, and future cleaning may also include removing foreign-language terms and words that would not be appropriate in a more formal setting.
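One way such cleaning could look, sketched only as an illustration (the profanity file name is a placeholder, not an actual file in this project):

# Drop tokens containing non-ASCII characters (rough foreign-text filter)
clean_counts <- filtered_counts %>%
  filter(!grepl("[^\\x01-\\x7F]", word, perl = TRUE))

# Placeholder: a profanity list would be read from an external file
# profanity <- readLines("profanity_list.txt")
# clean_counts <- clean_counts %>% filter(!word %in% profanity)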