The primary goal of this presentation is to show the current progress made toward exploring the Swiftkey data set. English was chosen for the analysis because it is the language the writer is most comfortable working in. Because the full corpus is large enough to strain the device’s memory and CPU, the detailed exploratory analysis concentrates on the en_US.twitter.txt data.
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
The three files are loaded with readLines() as shown above, reading each as UTF-8 and skipping embedded null characters.
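To make the size concern from the introduction concrete, the raw file sizes can be inspected before loading; the sketch below assumes the files sit in the working directory, as the readLines() calls above already do.

# Approximate size of each source file in megabytes (assumes files are in the working directory)
round(file.size(c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) / 1024^2, 1)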
library(stringr)   # str_count()

# Helper that returns line, word, and character counts for a character vector of text
get_summary <- function(data) {
  lines <- length(data)
  words <- sum(str_count(data, "\\S+"))
  characters <- sum(nchar(data))
  list(lines = lines, words = words, characters = characters)
}

blogs_summary   <- get_summary(blogs)
news_summary    <- get_summary(news)
twitter_summary <- get_summary(twitter)

summary_df <- data.frame(
  Dataset    = c("Blogs", "News", "Twitter"),
  Lines      = c(blogs_summary$lines, news_summary$lines, twitter_summary$lines),
  Words      = c(blogs_summary$words, news_summary$words, twitter_summary$words),
  Characters = c(blogs_summary$characters, news_summary$characters, twitter_summary$characters)
)
knitr::kable(summary_df, caption = "Basic Summary of Text Datasets")
| Dataset | Lines | Words | Characters |
|---|---|---|---|
| Blogs | 899288 | 37334131 | 206824505 |
| News | 1010206 | 34371031 | 203214543 |
| Twitter | 2360148 | 30373583 | 162096241 |
The table above gives the basic summary statistics for each of the acquired text files. To keep runtime and CPU usage manageable, the subsequent analyses focus on the Twitter data.
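If runtime were still a concern, a further option, not applied to the results shown in this report, would be to work from a random sample of the Twitter lines; the 10% fraction below is an arbitrary, illustrative choice.

# Optional: reproducible 10% sample of the Twitter lines (illustrative only; not used below)
set.seed(123)
twitter_sample <- sample(twitter, size = round(length(twitter) * 0.10))
length(twitter_sample)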
# Distribution of characters per line in the Twitter data
twitter_lengths <- nchar(twitter)
hist(twitter_lengths, breaks = 50,
main = "Line Lengths in Twitter Dataset",
xlab = "Characters per Line", col = "lightblue", border = "white")
The distribution of line lengths is slightly left-skewed, and, interestingly, there is a pronounced spike of lines close to 140 characters long, which corresponds to Twitter’s character limit at the time the data were collected.
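A quick numeric summary of the line lengths is a simple way to confirm that the longest lines sit near that limit.

# Five-number summary (plus mean) of characters per line in the Twitter data
summary(twitter_lengths)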
library(dplyr)      # count(), anti_join(), filter(), %>%
library(tibble)     # tibble()
library(tidytext)   # unnest_tokens(), stop_words

# Tokenize the Twitter lines into single words and count word frequencies
text_df <- tibble(line = twitter)
word_counts <- text_df %>%
  unnest_tokens(word, line) %>%
  count(word, sort = TRUE)

# Remove common English stop words before ranking
data("stop_words")
filtered_counts <- word_counts %>%
  anti_join(stop_words, by = "word")
head(filtered_counts, 10)
## # A tibble: 10 × 2
## word n
## <chr> <int>
## 1 love 106732
## 2 day 91748
## 3 rt 89601
## 4 time 76803
## 5 lol 70162
## 6 3 54940
## 7 people 52047
## 8 happy 49009
## 9 follow 48108
## 10 2 45515
The code first removes stop words such as “the”, “is”, and “I”, because they would otherwise dominate the frequency ranking: regardless of the text source, stop words are the most common words in ordinary English. Once they are removed, it is interesting to see how many colloquial terms (“lol”, “rt”) appear near the top, reflecting the casual register of social media language.
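The effect of that filter is easy to see by looking at the unfiltered counts, whose top entries are exactly the stop words just described.

# Top of the unfiltered word counts: dominated by stop words
head(word_counts, 5)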
library(ggplot2)

# Tokenize into two-word sequences (bigrams) and count them
bigrams <- text_df %>%
  unnest_tokens(bigram, line, token = "ngrams", n = 2)

bigram_counts <- bigrams %>%
  count(bigram, sort = TRUE)

# Plot the ten most frequent bigrams
bigram_counts %>%
  filter(n > 5) %>%
  top_n(10, n) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top Bigrams in Twitter Data", x = "Bigram", y = "Frequency")
Finally, as a preview of the prediction model, bigrams were counted, and as expected, the most frequent bigrams are standard patterns in English.
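As a rough sketch of how these counts could feed the eventual prediction model, the bigrams can be split into a leading word and a following word, and the most frequent continuations looked up for a given input. The sketch below assumes the tidyr package for separate(), which is not otherwise used in this report, and predict_next() is a hypothetical helper rather than part of the planned model.

library(tidyr)   # separate()

# Split each bigram into its first and second word
bigram_model <- bigram_counts %>%
  separate(bigram, into = c("word1", "word2"), sep = " ")

# Illustrative lookup: the most frequent words observed after a given word
predict_next <- function(first_word, top = 3) {
  bigram_model %>%
    filter(word1 == first_word) %>%
    slice_max(n, n = top) %>%
    pull(word2)
}

predict_next("happy")   # output depends on the counts above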
This report demonstrates successful data acquisition and an initial exploratory data analysis. Overall, the bigram results show that most frequent phrases follow ordinary English sentence patterns, but the data required stop-word removal and will likely also need further cleaning, such as removing foreign-language terms and words that are not appropriate in a more formal setting.
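A possible starting point for that future cleaning step is sketched below: keeping only plain alphabetic tokens (a rough heuristic that also drops numbers and non-ASCII words) and filtering against a word list of terms to exclude. The badwords.txt file name is a placeholder for a user-supplied list, not a file that ships with the data set.

# Rough cleaning sketch: keep only lowercase ASCII word tokens (unnest_tokens lowercases by default)
cleaned_counts <- filtered_counts %>%
  filter(grepl("^[a-z']+$", word))

# Placeholder exclusion list; "badwords.txt" is assumed, not provided with the data
# profanity <- readLines("badwords.txt", skipNul = TRUE)
# cleaned_counts <- cleaned_counts %>% filter(!word %in% profanity)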