This report presents an exploratory data analysis of three English text corpora: Twitter, Blogs, and News. It computes descriptive statistics and identifies recurring patterns that will inform the design of a predictive text model.
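# Load the packages the code below relies on (inferred from the functions it calls)
library(tidyverse)            # stringr, purrr, tibble, dplyr, ggplot2
library(knitr)                # kable()
library(kableExtra)           # kable_styling()
library(quanteda)             # tokens(), dfm(), tokens_ngrams(), topfeatures()
library(quanteda.textstats)   # textstat_frequency() (quanteda 3.0 and later)
# Define file paths
files <- list(
  twitter = "data/final/en_US/en_US.twitter.txt",
  blogs = "data/final/en_US/en_US.blogs.txt",
  news = "data/final/en_US/en_US.news.txt"
)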
# Function to load a file and draw a random sample of its lines
load_sample_text <- function(file, sample_size = 0.004) {
  text <- readLines(file, warn = FALSE, encoding = "UTF-8")
  sampled_text <- sample(text, size = max(1, round(length(text) * sample_size)), replace = FALSE)
  list(
    text = sampled_text,
    lines = length(sampled_text),
    words = sum(str_count(sampled_text, "\\S+")),
    characters = sum(nchar(sampled_text))
  )
}
# Load and sample each file once; fix the seed (123 is an arbitrary choice) so the sample is reproducible
set.seed(123)
stats <- map(files, load_sample_text)
samples <- stats  # reuse the same samples in the later sections instead of drawing new ones
stats_df <- tibble(
  Dataset = names(stats),
  Lines = map_int(stats, "lines"),
  Words = map_int(stats, "words"),
  Characters = map_int(stats, "characters")
)
# Show statistics in table
kable(stats_df, caption = "Data Set Statistics (Sample 0.4%)", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))
| Dataset | Lines | Words | Characters |
|---|---|---|---|
| twitter | 9441 | 121347 | 647143 |
| blogs | 3597 | 150946 | 835418 |
| news | 309 | 10051 | 59261 |
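This analysis uses a random 0.4% sample of the lines in each dataset (Twitter, Blogs, and News). Sampling keeps memory use and computation time manageable while still giving an overview of the full datasets. The samples differ greatly in size, however: the News sample contains only 309 lines, compared with 9,441 for Twitter and 3,597 for Blogs, so statistics for News rest on much less text. It is worth verifying that the News file was read in full, since on Windows a text-mode readLines call can stop early at an embedded control character.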
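As a rough illustration of the sampling step, the following base-R sketch draws a reproducible 0.4% sample from a toy vector. The helper name `sample_lines`, its `seed` argument, and the toy data are assumptions made for this example; they are not part of the analysis code.

```r
# Draw a reproducible random sample of lines (illustrative helper, not the report's code).
sample_lines <- function(lines, fraction = 0.004, seed = 123) {
  set.seed(seed)  # same seed -> same sample on every run
  n <- max(1, round(length(lines) * fraction))  # always keep at least one line
  sample(lines, size = n, replace = FALSE)
}

toy <- sprintf("line %d", 1:1000)  # stand-in for the lines of one file
first_draw <- sample_lines(toy)
second_draw <- sample_lines(toy)

length(first_draw)                   # 4, i.e. 0.4% of 1000 lines
identical(first_draw, second_draw)   # TRUE, because the seed is fixed
```

Fixing the seed inside the helper is only one possible design; setting it once at the top of the report has the same effect as long as the sampling calls run in the same order.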
# Function to tokenize text and return the 10 most frequent n-grams
extract_ngrams <- function(text, n) {
  # Named toks to avoid shadowing quanteda's tokens() function
  toks <- tokens(text, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
  dfm_ngrams <- dfm(tokens_ngrams(toks, n = n))
  textstat_frequency(dfm_ngrams, n = 10) %>%
    select(feature, frequency)
}
# Obtain bigrams and trigrams from the sample
all_text <- unlist(map(stats, "text"))
bigrams <- extract_ngrams(all_text, 2)
trigrams <- extract_ngrams(all_text, 3)
# Replace the underscore symbol with a space in the features
bigrams$feature <- gsub("_", " ", bigrams$feature)
trigrams$feature <- gsub("_", " ", trigrams$feature)
# Show tables
kable(bigrams, caption = "Most Frequent Bigrams (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| feature | frequency |
|---|---|
| of the | 975 |
| in the | 891 |
| for the | 584 |
| to the | 541 |
| on the | 521 |
| to be | 490 |
| at the | 341 |
| i have | 323 |
| i was | 318 |
| it was | 313 |
kable(trigrams, caption = "Most Frequent Trigrams (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| feature | frequency |
|---|---|
| thanks for the | 96 |
| one of the | 76 |
| a lot of | 70 |
| looking forward to | 58 |
| to be a | 53 |
| going to be | 51 |
| it was a | 48 |
| i want to | 45 |
| i need to | 42 |
| some of the | 42 |
The most frequent bigrams and trigrams are built from common English function words: prepositions such as “of”, “in”, and “to”, articles such as “the”, and auxiliary verbs such as “be”, “was”, and “have”.
High-frequency phrases such as “thanks for the” or “one of the” reflect the conversational style of platforms like Twitter and blogs.
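To make the notion of bigram and trigram counts concrete, here is a minimal base-R sketch on three toy sentences. It is a simplified stand-in for the quanteda pipeline used in the analysis, not a reimplementation of it: the function `count_ngrams`, its cleaning rule, and the toy data are assumptions for illustration.

```r
# Count word n-grams in a character vector using base R only (illustrative sketch).
count_ngrams <- function(text, n) {
  # Lowercase and replace everything except letters, apostrophes and spaces.
  clean <- gsub("[^a-z' ]", " ", tolower(text))
  grams <- unlist(lapply(strsplit(clean, " +"), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of n words across the line; n-grams never span two lines.
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

toy <- c("Thanks for the follow", "thanks for the tip!", "one of the best")
bigrams_toy <- count_ngrams(toy, 2)
trigrams_toy <- count_ngrams(toy, 3)

trigrams_toy["thanks for the"]   # counted twice in the toy data
```

Because the window is applied per line, short lines contribute fewer n-grams, which is one reason tweets and long blog entries can weigh differently in the combined counts.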
# Count characters that are neither alphanumeric nor whitespace (punctuation, symbols, emoji)
total_special_chars <- map_int(stats, ~sum(str_count(.x$text, "[^[:alnum:][:space:]]")))
# Create table
symbol_counts <- tibble(
Dataset = names(stats),
SpecialChars = total_special_chars
)
# Display in table
kable(symbol_counts, caption = "Number of Strange Symbols (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Dataset | SpecialChars |
|---|---|
| twitter | 32374 |
| blogs | 26393 |
| news | 1985 |
Twitter has the most strange symbols, which is expected given its informal style and heavy use of emoticons, hashtags, and other special characters. The raw counts are not directly comparable, however, because the samples differ in size. Relative to the total characters in each sample, special characters make up about 5.0% of the Twitter text, 3.2% of the Blogs text, and 3.3% of the News text. Twitter therefore has the highest density of symbols, while the low raw count for News mostly reflects its much smaller sample; its rate is close to that of Blogs and consists largely of ordinary punctuation.
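The pattern used in the code above, `[^[:alnum:][:space:]]`, matches any character that is neither a letter, a digit, nor whitespace. The sketch below shows what it counts on two toy strings and divides by string length to get a rate that can be compared across texts of different sizes. It uses base R instead of stringr, and the function name `count_special` and the toy strings are assumptions for illustration.

```r
# Count characters that are neither alphanumeric nor whitespace, per string.
count_special <- function(text) {
  matches <- regmatches(text, gregexpr("[^[:alnum:][:space:]]", text))
  lengths(matches)  # one count per input string; 0 when nothing matches
}

toy <- c(tweet = "#rstats :) it's 100% fine!", news = "The council met on Monday")
counts <- count_special(toy)
rates <- counts / nchar(toy)  # share of special characters in each string

counts  # 6 for the tweet (# : ) ' % !), 0 for the news sentence
```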
# Function to count the strange characters and return the most frequent ones
get_most_frequent_special_chars <- function(text) {
special_chars <- str_extract_all(text, "[^[:alnum:][:space:]]") %>%
unlist() %>%
table() %>%
as.data.frame()
colnames(special_chars) <- c("Character", "Frequency")
special_chars <- special_chars %>% arrange(desc(Frequency)) %>% head(10) # Top 10 most frequent
return(special_chars)
}
# Apply the function to each dataset (full element names, no reliance on partial matching)
special_chars_twitter <- get_most_frequent_special_chars(paste(samples$twitter$text, collapse = " "))
special_chars_blog <- get_most_frequent_special_chars(paste(samples$blogs$text, collapse = " "))
special_chars_news <- get_most_frequent_special_chars(paste(samples$news$text, collapse = " "))
# Join the three tables into one; merge() is an inner join by default,
# so only characters that appear in the top 10 of all three datasets are kept
combined_special_chars <- merge(special_chars_twitter, special_chars_blog, by = "Character")
combined_special_chars <- merge(combined_special_chars, special_chars_news, by = "Character")
# Rename the columns for each data set
colnames(combined_special_chars) <- c("Character", "Twitter Frequency", "Blog Frequency", "News Frequency")
# Show the final table
kable(combined_special_chars, caption = "Common Strange Characters on Twitter, Blog and News", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))
| Character | Twitter Frequency | Blog Frequency | News Frequency |
|---|---|---|---|
| ’ | 3486 | 1538 | 185 |
|  | 1246 | 1340 | 217 |
| ” | 1028 | 581 | 212 |
| ) | 797 | 733 | 27 |
| , | 3009 | 6983 | 618 |
| . | 10431 | 8737 | 668 |
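# Get the number of words per line in the sample; each line is one tweet, blog entry,
# or news item, so this approximates text length rather than true sentence length
sentence_lengths <- unlist(map(stats, ~str_count(.x$text, "\\S+")))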
# Plot distribution
ggplot(data.frame(Length = sentence_lengths), aes(x = Length)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
labs(title = "Sentence Length Distribution (Sample 0.4%)", x = "Number of Words", y = "Frequency")
# Create list of tokens from samples
tokens_list <- lapply(samples, function(s) tokens(s$text, what = "word", remove_punct = TRUE, remove_numbers = TRUE))
# Frequency of common words
word_freq <- lapply(tokens_list, function(t) {
dfm_word <- dfm(t)
topfeatures(dfm_word, 10)
})
# Convert to a data frame
word_freq_df <- bind_rows(lapply(word_freq, function(x) data.frame(Word = names(x), Count = x)), .id = "Source")
# Comparative bar chart with pastel colors
ggplot(word_freq_df, aes(x = reorder(Word, Count), y = Count, fill = Source)) +
geom_bar(stat = "identity", position = "dodge") +
coord_flip() +
labs(title = "Comparison of Most Common Words between Twitter, Blogs and News",
x = "Word",
y = "Frequency") +
scale_fill_manual(values = c("#FFB3BA", "#FFDFBA", "#B3E0FF")) + # Pastel color palette
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Additional Functionality: The planned app will include a text input field where users type their message. As they type, the system will offer three suggestions for the next word, based on the preceding context.
Users can select any of these suggestions, and the chosen word is appended to their input, which reduces the amount of typing required. Recording which suggestions users accept could also provide feedback for improving the prediction model over time.
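One simple way such a suggestion feature could work is a bigram lookup: for the last word typed, return the words that most often followed it in the training text. The base-R sketch below illustrates that idea only. The report does not specify the final model, so the functions `build_bigram_model` and `suggest_next`, the choice of a plain bigram model, and the toy training text are all assumptions.

```r
# Build a lookup: for each word, a sorted table of the words that follow it.
build_bigram_model <- function(text) {
  clean <- gsub("[^a-z' ]", " ", tolower(text))
  pairs <- do.call(rbind, lapply(strsplit(clean, " +"), function(words) {
    words <- words[nzchar(words)]
    if (length(words) < 2) return(NULL)
    data.frame(prev = head(words, -1), nxt = tail(words, -1), stringsAsFactors = FALSE)
  }))
  # Group the following words by the preceding word and count each group.
  lapply(split(pairs$nxt, pairs$prev), function(x) sort(table(x), decreasing = TRUE))
}

# Suggest up to k next words based on the last word of the input.
suggest_next <- function(model, input, k = 3) {
  words <- strsplit(gsub("[^a-z' ]", " ", tolower(input)), " +")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return(character(0))
  candidates <- model[[words[length(words)]]]
  if (is.null(candidates)) return(character(0))  # unseen word: no suggestion
  head(names(candidates), k)
}

toy <- c("thanks for the follow", "thanks for the tip", "thanks for the follow back",
         "one of the best", "looking forward to the weekend")
model <- build_bigram_model(toy)
suggest_next(model, "Thanks for the")  # "follow" comes first, as the most frequent follower
```

A model of this kind returns nothing for unseen words; a fuller version would probably try a trigram match first and fall back to shorter contexts or to the most frequent words overall.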