This report presents an exploratory analysis of the Twitter dataset from the SwiftKey corpus.
The goal is to demonstrate that the data has been successfully loaded and understood, and to outline a clear path toward building a predictive text algorithm and Shiny app. The report is written for a general audience and highlights only the most important findings.
```r
# Packages used throughout this report
library(readr)        # read_lines()
library(stringi)      # stri_count_words()
library(dplyr)        # tibble()
library(knitr)        # kable()
library(tm)           # corpus cleaning, DocumentTermMatrix()
library(ggplot2)      # frequency plot
library(wordcloud)    # word cloud
library(RColorBrewer) # brewer.pal()

# Read the full Twitter file line by line
twitter_path <- "D:/NGP/Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
twitter_lines <- read_lines(twitter_path)
```
Note: `read_lines()` raises a warning about parsing issues on a small number of lines; these can be inspected with `problems()`. The file loads in full regardless, as the line counts below show.
```r
word_counts <- stri_count_words(twitter_lines)

# Summary statistics
summary_table <- tibble(
  File            = "en_US.twitter.txt",
  Lines           = length(twitter_lines),
  TotalWords      = sum(word_counts),
  AvgWordsPerLine = round(mean(word_counts), 2)
)

kable(summary_table, caption = "Summary Statistics of Twitter Dataset")
```
| File | Lines | TotalWords | AvgWordsPerLine |
|---|---:|---:|---:|
| en_US.twitter.txt | 2360148 | 30096649 | 12.75 |
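The mean alone says little about a skewed distribution, so a quick supplementary check of the quantiles of `word_counts` (not part of the summary table above) gives a fuller picture of typical tweet lengths:

```r
# Distribution of words per tweet; the character limit keeps the upper tail short
quantile(word_counts, probs = c(0.25, 0.50, 0.75, 0.95))
```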
```r
set.seed(123)
sample_twitter <- sample(twitter_lines, 10000)  # 10,000-tweet sample keeps tm fast

# Build the corpus and normalize it: lowercase, then strip punctuation,
# numbers, common English stopwords, and extra whitespace
corpus <- VCorpus(VectorSource(sample_twitter)) %>%
  tm_map(content_transformer(tolower)) %>%
  tm_map(removePunctuation) %>%
  tm_map(removeNumbers) %>%
  tm_map(removeWords, stopwords("en")) %>%
  tm_map(stripWhitespace)
```
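To verify the pipeline behaves as intended, it helps to compare one tweet before and after cleaning; a minimal spot check (document 1 is an arbitrary choice):

```r
# Print a raw tweet next to its cleaned counterpart
cat("Raw:    ", sample_twitter[1], "\n")
cat("Cleaned:", content(corpus[[1]]), "\n")
```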
```r
dtm <- DocumentTermMatrix(corpus)

# Term frequencies across the sample, most frequent first
freq    <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)
freq_df <- data.frame(word = names(freq), freq = freq)

ggplot(freq_df[1:20, ], aes(x = reorder(word, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words in Twitter Sample",
       x = "Words", y = "Frequency")
```
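One caveat on scaling: `as.matrix(dtm)` converts the sparse document-term matrix to a dense one, which is fine at 10,000 tweets but grows quickly with larger samples. A sketch of a sparse alternative, assuming the slam package (installed as a dependency of tm):

```r
library(slam)

# Same frequencies, computed directly on the sparse matrix; no dense copy is made
freq <- sort(col_sums(dtm), decreasing = TRUE)
```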
```r
set.seed(123)
wordcloud(words = freq_df$word, freq = freq_df$freq,
          min.freq = 50, max.words = 100,
          colors = brewer.pal(8, "Dark2"))

kable(head(freq_df, 10), caption = "Top 10 Most Frequent Words")
```
| word | freq |
|---|---:|
| just | 618 |
| like | 522 |
| get | 463 |
| love | 448 |
| good | 438 |
| dont | 409 |
| will | 409 |
| can | 399 |
| day | 362 |
| know | 361 |
The Twitter dataset contains roughly 2.36 million lines and about 30 million words, averaging 12.75 words per tweet. Tweets are short and informal, often featuring slang, abbreviations, and emoji. After sampling 10,000 tweets and cleaning them, the most frequent content words include "just", "like", "love", "good", and "day"; the prominence of words such as "love" and "good" suggests a generally positive tone.
The frequency bar chart and word cloud highlight the dominance of everyday expressions and emotional language. These patterns will inform the design of the prediction algorithm, which will rely on frequent word sequences (n-grams) to suggest likely next words; a first sketch of that step follows.
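As a preview of the n-gram step, the sketch below counts bigrams (two-word sequences) in the same 10,000-tweet sample. It assumes the tokenizers package, which is one of several reasonable choices and not necessarily what the final app will use:

```r
library(tokenizers)

# Split each sampled tweet into overlapping two-word sequences (bigrams)
bigrams <- unlist(tokenize_ngrams(sample_twitter, n = 2))

# Rank bigrams by frequency; table() silently skips any NA from very short tweets.
# Given the first word of a pair, the most frequent second word is the prediction.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
head(bigram_freq, 10)
```

A natural extension is to count trigrams the same way and fall back to shorter n-grams when a longer word history has not been seen, which forms the core of the planned Shiny app's prediction logic.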