This report presents an exploratory analysis of the Twitter dataset from the SwiftKey corpus.
The goal is to demonstrate that the data has been successfully loaded and understood, and to outline a clear path toward building a predictive text algorithm and Shiny app. The report is written for a general audience and highlights only the most important findings.
```r
# Packages used throughout this report
library(readr)        # read_lines()
library(stringi)      # stri_count_words()
library(dplyr)        # tibble()
library(knitr)        # kable()
library(tm)           # corpus cleaning, DocumentTermMatrix()
library(ggplot2)      # frequency plot
library(wordcloud)    # word cloud
library(RColorBrewer) # brewer.pal()

# Read the full Twitter file line by line
twitter_path <- "D:/NGP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
twitter_lines <- read_lines(twitter_path)
```
Note: `read_lines()` raises a warning about parsing issues on a small number of lines; these can be inspected with `problems()`. The file loads in full regardless, as the line counts below show.
```r
word_counts <- stri_count_words(twitter_lines)

# Summary statistics
summary_table <- tibble(
  File            = "en_US.twitter.txt",
  Lines           = length(twitter_lines),
  TotalWords      = sum(word_counts),
  AvgWordsPerLine = round(mean(word_counts), 2)
)

kable(summary_table, caption = "Summary Statistics of Twitter Dataset")
```
| File | Lines | TotalWords | AvgWordsPerLine |
|---|---:|---:|---:|
| en_US.twitter.txt | 2360148 | 30096649 | 12.75 |
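The mean alone says little about a skewed distribution, so a quick supplementary check of the quantiles of `word_counts` (not part of the summary table above) gives a fuller picture of typical tweet lengths:

```r
# Distribution of words per tweet; the character limit keeps the upper tail short
quantile(word_counts, probs = c(0.25, 0.50, 0.75, 0.95))
```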
```r
set.seed(123)
sample_twitter <- sample(twitter_lines, 10000)  # 10,000-tweet sample keeps tm fast

# Build the corpus and normalize it: lowercase, then strip punctuation,
# numbers, common English stopwords, and extra whitespace
corpus <- VCorpus(VectorSource(sample_twitter)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stripWhitespace)
```
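To verify the pipeline behaves as intended, it helps to compare one tweet before and after cleaning; a minimal spot check (document 1 is an arbitrary choice):

```r
# Print a raw tweet next to its cleaned counterpart
cat("Raw:    ", sample_twitter[1], "\n")
cat("Cleaned:", content(corpus[[1]]), "\n")
```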
```r
dtm <- DocumentTermMatrix(corpus)

# Term frequencies across the sample, most frequent first
freq    <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = freq)

ggplot(freq_df[1:20, ], aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words in Twitter Sample",
       x = "Words", y = "Frequency")
```
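One caveat on scaling: `as.matrix(dtm)` converts the sparse document-term matrix to a dense one, which is fine at 10,000 tweets but grows quickly with larger samples. A sketch of a sparse alternative, assuming the slam package (installed as a dependency of tm):

```r
library(slam)

# Same frequencies, computed directly on the sparse matrix; no dense copy is made
freq <- sort(col_sums(dtm), decreasing = TRUE)
```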
```r
set.seed(123)
wordcloud(words = freq_df$word, freq = freq_df$freq,
          min.freq = 50, max.words = 100,
          colors = brewer.pal(8, "Dark2"))

kable(head(freq_df, 10), caption = "Top 10 Most Frequent Words")
```
| word | freq |
|---|---:|
| just | 618 |
| like | 522 |
| get | 463 |
| love | 448 |
| good | 438 |
| dont | 409 |
| will | 409 |
| can | 399 |
| day | 362 |
| know | 361 |
The Twitter dataset contains roughly 2.36 million lines and about 30 million words, averaging 12.75 words per tweet. Tweets are short and informal, often featuring slang, abbreviations, and emoji. After sampling 10,000 tweets and cleaning them, the most frequent content words include "just", "like", "love", "good", and "day"; the prominence of words such as "love" and "good" suggests a generally positive tone.
The frequency bar chart and word cloud highlight the dominance of everyday expressions and emotional language. These patterns will inform the design of the prediction algorithm, which will rely on frequent word sequences (n-grams) to suggest likely next words; a first sketch of that step follows.
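As a preview of the n-gram step, the sketch below counts bigrams (two-word sequences) in the same 10,000-tweet sample. It assumes the tokenizers package, which is one of several reasonable choices and not necessarily what the final app will use:

```r
library(tokenizers)

# Split each sampled tweet into overlapping two-word sequences (bigrams)
bigrams <- unlist(tokenize_ngrams(sample_twitter, n = 2))

# Rank bigrams by frequency; table() silently skips any NA from very short tweets.
# Given the first word of a pair, the most frequent second word is the prediction.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)
```

A natural extension is to count trigrams the same way and fall back to shorter n-grams when a longer word history has not been seen, which forms the core of the planned Shiny app's prediction logic.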