Introduction

This report presents an exploratory analysis of the Twitter dataset from the SwiftKey corpus.

Goal

The goal is to demonstrate that the data has been successfully loaded and understood, and to outline a clear path toward building a predictive text algorithm and Shiny app. The report is written for a general audience and highlights only the most important findings.

Setup and Data Loading

Load Twitter data

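# Packages the code in this report relies on (assumed to be loaded in a setup chunk not shown)
library(readr)         # read_lines()
library(stringi)       # stri_count_words()
library(tibble)        # tibble()
library(knitr)         # kable()
library(dplyr)         # %>% pipe
library(tm)            # VCorpus(), tm_map(), DocumentTermMatrix()
library(ggplot2)       # ggplot()
library(wordcloud)     # wordcloud()
library(RColorBrewer)  # brewer.pal()

twitter_path <- "D:/NGP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"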
twitter_lines <- read_lines(twitter_path)
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
word_counts <- stri_count_words(twitter_lines)

# Summary statistics
summary_table <- tibble(
  File = "en_US.twitter.txt",
  Lines = length(twitter_lines),
  TotalWords = sum(word_counts),
  AvgWordsPerLine = round(mean(word_counts), 2)
)

kable(summary_table, caption = "Summary Statistics of Twitter Dataset")
Summary Statistics of Twitter Dataset
File                Lines     TotalWords  AvgWordsPerLine
en_US.twitter.txt   2360148   30096649    12.75

Sample and clean

set.seed(123)
sample_twitter <- sample(twitter_lines, 10000)

corpus <- VCorpus(VectorSource(sample_twitter)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stripWhitespace)
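
The order of the cleaning steps matters for contractions. The sketch below uses base R only, with a hypothetical three-word stop list and two hypothetical helpers (`strip_punct`, `drop_stops`) standing in for `removePunctuation` and `removeWords`. It is meant to illustrate why a token such as "dont" can survive when punctuation is stripped before stop words are removed; it is not the pipeline itself.

```r
# Hypothetical mini stop-word list; the report itself uses tm's stopwords("en")
stop_words <- c("don't", "the", "i")

# Hypothetical stand-ins for removePunctuation() and removeWords()
strip_punct <- function(x) gsub("[[:punct:]]", "", x)
drop_stops <- function(x, stops) {
  words <- strsplit(x, " +")[[1]]
  paste(words[!words %in% stops], collapse = " ")
}

tweet <- "i don't like the rain"

# Same order as the pipeline above: punctuation first, then stop words.
# "don't" has already become "dont", which no longer matches the stop list.
punct_first <- drop_stops(strip_punct(tweet), stop_words)

# Alternative order: stop words first, then punctuation.
stops_first <- strip_punct(drop_stops(tweet, stop_words))

print(punct_first)
print(stops_first)
```

With this toy input, the first ordering keeps "dont" while the second drops the contraction entirely. Whether the real corpus behaves the same way depends on how apostrophes are encoded in the tweets, so treat this as a hint to check rather than a diagnosis.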

Document-Term Matrix and frequency table

dtm <- DocumentTermMatrix(corpus)
freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = freq)


ggplot(freq_df[1:20,], aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words in Twitter Sample",
       x = "Words", y = "Frequency")

Word Cloud

set.seed(123)
wordcloud(words = freq_df$word, freq = freq_df$freq,
          min.freq = 50, max.words = 100,
          colors = brewer.pal(8, "Dark2"))

# row.names = FALSE keeps the word from being printed twice (once as a row name)
kable(head(freq_df, 10), row.names = FALSE, caption = "Top 10 Most Frequent Words")
Top 10 Most Frequent Words
word   freq
just   618
like   522
get    463
love   448
good   438
dont   409
will   409
can    399
day    362
know   361

Summary and Next Steps

The Twitter dataset contains about 2.36 million lines and roughly 30.1 million words, an average of 12.75 words per tweet. Tweets are short and informal, often featuring slang, abbreviations, and emojis. In a cleaned random sample of 10,000 tweets, the ten most frequent words after stop-word removal are led by “just”, “like”, and “get”, and also include “love”, “good”, and “day”. The prominence of “love” and “good” hints at a generally positive tone, although word counts alone do not establish sentiment.

The bar chart and word cloud highlight the dominance of everyday expressions and emotional language. These patterns will inform the design of the prediction algorithm, which will rely on frequent word sequences (n-grams) to suggest likely next words.
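
As a rough illustration of the n-gram idea, the sketch below counts bigrams in three made-up lines and suggests the most frequent followers of a word. It uses base R only; `toy_lines`, `make_bigrams`, and `predict_next` are hypothetical names introduced here, and the eventual algorithm may well use longer n-grams, smoothing, or a back-off scheme instead.

```r
# Toy, already-cleaned lines standing in for the sampled tweets (hypothetical data)
toy_lines <- c("i love this day", "i love good food", "have a good day")

# Pair every word in a line with the word that follows it
make_bigrams <- function(line) {
  words <- strsplit(line, " +")[[1]]
  if (length(words) < 2) return(NULL)
  data.frame(first = head(words, -1), second = tail(words, -1),
             stringsAsFactors = FALSE)
}
bigrams <- do.call(rbind, lapply(toy_lines, make_bigrams))

# Count how often each distinct (first, second) pair occurs
bigram_counts <- aggregate(n ~ first + second,
                           data = cbind(bigrams, n = 1), FUN = sum)

# Return up to k of the most frequent followers of `word`
predict_next <- function(word, counts, k = 3) {
  hits <- counts[counts$first == word, ]
  hits <- hits[order(-hits$n), ]
  as.character(head(hits$second, k))
}

print(predict_next("i", bigram_counts))
```

In the toy data "i" is followed by "love" in two of the three lines, so "love" is the only suggestion returned for it. A word that never appears as the first element of a bigram yields an empty result, which is one of the cases a back-off strategy would need to cover.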

Next Steps: