Exploratory Data Analysis for Predictive Text Model

Introduction

This report presents an exploratory data analysis of three English text datasets: Twitter, Blogs, and News. Descriptive statistics are calculated and key patterns are identified to inform the design of a predictive text model.

Loading and Preprocessing Data

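library(tidyverse)            # stringr, purrr, tibble, dplyr, ggplot2
library(quanteda)             # tokens(), tokens_ngrams(), dfm(), topfeatures()
library(quanteda.textstats)   # textstat_frequency() (quanteda >= 3)
library(knitr)                # kable()
library(kableExtra)           # kable_styling()

# Define file paths
files <- list(
  twitter = "data/final/en_US/en_US.twitter.txt",
  blogs = "data/final/en_US/en_US.blogs.txt",
  news = "data/final/en_US/en_US.news.txt"
)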

# Function to load and sample data
load_sample_text <- function(file, sample_size = 0.004) {
  text <- readLines(file, warn = FALSE, encoding = "UTF-8")
  sampled_text <- sample(text, size = max(1, round(length(text) * sample_size)), replace = FALSE)
  list(
    text = sampled_text,
    lines = length(sampled_text),
    words = sum(str_count(sampled_text, "\\S+")),
    characters = sum(nchar(sampled_text))
  )
}

# Draw the samples once so that every section analyzes the same text
samples <- lapply(files, load_sample_text)

# Reuse the same samples for the summary statistics
stats <- samples
stats_df <- tibble(
 Dataset = names(stats),
 Lines = map_int(stats, "lines"),
 Words = map_int(stats, "words"),
 Characters = map_int(stats, "characters")
)

# Show statistics in table
kable(stats_df, caption = "Data Set Statistics (Sample 0.4%)", format = "html") %>%
 kable_styling(bootstrap_options = c("striped", "hover"))

Data Set Statistics (Sample 0.4%)
Dataset	Lines	Words	Characters
twitter	9441	121347	647143
blogs	3597	150946	835418
news	309	10051	59261

In this analysis, a sample of 0.4% of each dataset (Twitter, Blogs, and News) is used for processing. This sample size is chosen to make the analysis more manageable while still providing a representative overview of the larger datasets. By working with a smaller sample, we can efficiently analyze the data without overwhelming memory or computational resources.
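
The code shown here does not set a random seed, so each run draws a different 0.4% sample and the tables change slightly between runs. A minimal sketch of reproducible sampling follows; the helper `sample_lines()` and the seed value are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: draw a reproducible random sample of lines.
# The `seed` argument is an assumption; the report itself does not set one.
sample_lines <- function(lines, fraction = 0.004, seed = 1234) {
  set.seed(seed)  # fix the random number generator so reruns match
  n <- max(1, round(length(lines) * fraction))
  sample(lines, size = n, replace = FALSE)
}

# Toy input standing in for the lines of one file
toy_lines <- paste("line", 1:1000)

first_draw  <- sample_lines(toy_lines)
second_draw <- sample_lines(toy_lines)

length(first_draw)                  # 4 lines, i.e. 0.4% of 1000
identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

With a fixed seed, the statistics, n-gram tables, and plots can be regenerated exactly.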

Frequency Analysis: Common Words and Phrases

# Function to tokenize text and return the 10 most frequent n-grams
extract_ngrams <- function(text, n) {
  toks <- tokens(text, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
  dfm_ngrams <- dfm(tokens_ngrams(toks, n = n))
  textstat_frequency(dfm_ngrams, n = 10) %>%
    select(feature, frequency)
}

# Obtain bigrams and trigrams from the sample
all_text <- unlist(map(stats, "text"))
bigrams <- extract_ngrams(all_text, 2)
trigrams <- extract_ngrams(all_text, 3)

# Replace the underscore symbol with a space in the features
bigrams$feature <- gsub("_", " ", bigrams$feature)
trigrams$feature <- gsub("_", " ", trigrams$feature)

# Show tables
kable(bigrams, caption = "Most Frequent Bigrams (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Most Frequent Bigrams (Sample 0.4%)
feature	frequency
of the	975
in the	891
for the	584
to the	541
on the	521
to be	490
at the	341
i have	323
i was	318
it was	313

kable(trigrams, caption = "Most Frequent Trigrams (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Most Frequent Trigrams (Sample 0.4%)
feature	frequency
thanks for the	96
one of the	76
a lot of	70
looking forward to	58
to be a	53
going to be	51
it was a	48
i want to	45
i need to	42
some of the	42

The most frequent bigrams and trigrams are common grammatical structures of English built from function words: prepositions such as “of”, “in”, and “to”, and auxiliary verbs such as “be”, “was”, and “have”.

High-frequency phrases such as “thanks for the” or “looking forward to” reflect the conversational style of sources like Twitter and Blogs.
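
To make the counting step concrete, here is a minimal base R sketch of how n-gram frequencies can be obtained by sliding a window over the words of each line. It is an illustration on an invented toy corpus, not the quanteda implementation used above; the function names `line_ngrams()` and `count_ngrams()` are hypothetical.

```r
# Toy corpus; the sentences are invented for illustration only.
toy_text <- c("thanks for the follow", "thanks for the help", "one of the best")

# Build all n-grams of one line by sliding a window of n words over it.
line_ngrams <- function(line, n) {
  words <- strsplit(tolower(line), "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "), character(1))
}

# Count n-grams across all lines and sort by decreasing frequency.
count_ngrams <- function(text, n) {
  grams <- unlist(lapply(text, line_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

trigram_counts <- count_ngrams(toy_text, 3)
as.integer(trigram_counts["thanks for the"])  # 2
```

N-grams are built within each line only, so no n-gram spans two separate lines.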

Identifying Strange Symbols

# Count characters that are neither alphanumeric nor whitespace (this includes ordinary punctuation)
total_special_chars <- map_int(stats, ~sum(str_count(.x$text, "[^[:alnum:][:space:]]")))

# Create table
symbol_counts <- tibble(
Dataset = names(stats),
SpecialChars = total_special_chars
)

# Display in table
kable(symbol_counts, caption = "Number of Strange Symbols (Sample 0.4%)", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Number of Strange Symbols (Sample 0.4%)
Dataset	SpecialChars
twitter	32374
blogs	26393
news	1985

Twitter has the most special characters, both in absolute terms and relative to its size: about 5.0% of its sampled characters, compared with about 3.2% for Blogs and 3.3% for News. This is expected given its informal style and frequent use of emoticons, hashtags, and similar symbols. The pattern also matches ordinary punctuation, so a large share of these counts is periods, commas, and quotation marks rather than noise. The low raw count for News mainly reflects its small sample (309 lines); its rate is close to that of Blogs.
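
Because the three samples differ in size, the raw counts are easier to compare as a share of all sampled characters. The sketch below uses the counts reported in the two tables of this report; the column name `SpecialPct` is introduced here for illustration only.

```r
# Counts copied from the statistics and special-character tables (0.4% sample).
sample_counts <- data.frame(
  Dataset      = c("twitter", "blogs", "news"),
  Characters   = c(647143, 835418, 59261),
  SpecialChars = c(32374, 26393, 1985)
)

# Share of characters that are neither alphanumeric nor whitespace, in percent.
sample_counts$SpecialPct <- round(
  100 * sample_counts$SpecialChars / sample_counts$Characters, 1
)

sample_counts
```

On these figures the rate is roughly 5.0% for Twitter, 3.2% for Blogs, and 3.3% for News.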

Identifying Strange Characters

# Function to count the strange characters and return the most frequent ones
get_most_frequent_special_chars <- function(text) {
  special_chars <- str_extract_all(text, "[^[:alnum:][:space:]]") %>%
    unlist() %>%
    table() %>%
    as.data.frame()
  colnames(special_chars) <- c("Character", "Frequency")
  # Keep the 10 most frequent characters
  special_chars %>%
    arrange(desc(Frequency)) %>%
    head(10)
}

# Apply the function to each dataset
special_chars_twitter <- get_most_frequent_special_chars(paste(samples$twitter$text, collapse = " "))
special_chars_blog <- get_most_frequent_special_chars(paste(samples$blogs$text, collapse = " "))
special_chars_news <- get_most_frequent_special_chars(paste(samples$news$text, collapse = " "))

# Inner-join the three top-10 tables: only characters in the top 10 of all three sources are kept
combined_special_chars <- merge(special_chars_twitter, special_chars_blog, by = "Character")
combined_special_chars <- merge(combined_special_chars, special_chars_news, by = "Character")

# Rename the columns for each data set
colnames(combined_special_chars) <- c("Character", "Twitter Frequency", "Blog Frequency", "News Frequency")

# Show the final table
kable(combined_special_chars, caption = "Common Strange Characters on Twitter, Blog and News", format = "html") %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Common Strange Characters on Twitter, Blog and News
Character	Twitter Frequency	Blog Frequency	News Frequency
’	3486	1538	185
	1246	1340	217
”	1028	581	212
)	797	733	27
,	3009	6983	618
.	10431	8737	668

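Periods, commas, and typographic quotation marks and apostrophes (’ and ”) are common in all three datasets. Twitter has the most periods and apostrophes, which is consistent with its conversational writing style, while commas are most frequent in Blogs (6,983 versus 3,009 on Twitter).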
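
The combined table has fewer than ten rows because `merge()` performs an inner join by default: a character is kept only if it appears in the top-10 list of every source. The sketch below contrasts this with a full join (`all = TRUE`) on two invented tables; the data and the `_a`/`_b` suffixes are assumptions for illustration.

```r
# Invented top-3 tables for two sources (illustration only).
top_a <- data.frame(Character = c(".", ",", "#"), Frequency = c(10, 6, 4))
top_b <- data.frame(Character = c(".", ",", ";"), Frequency = c(9, 8, 2))

# Default merge() is an inner join: "#" and ";" are dropped.
inner <- merge(top_a, top_b, by = "Character")

# all = TRUE keeps every character and fills missing counts with NA.
outer <- merge(top_a, top_b, by = "Character", all = TRUE,
               suffixes = c("_a", "_b"))

nrow(inner)  # 2
nrow(outer)  # 4
```

A full join would show characters that are frequent in only one source, such as hashtags on Twitter, at the cost of some missing values.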

Visualizing Sentence Length Distribution

# Count words per line in the sample (each line is treated as one sentence)
sentence_lengths <- unlist(map(stats, ~ str_count(.x$text, "\\S+")))

# Plot distribution
ggplot(data.frame(Length = sentence_lengths), aes(x = Length)) +
geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
labs(title = "Sentence Length Distribution (Sample 0.4%)", x = "Number of Words", y = "Frequency")

Visualizing Differences

# Create list of tokens from samples
tokens_list <- lapply(samples, function(s) tokens(s$text, what = "word", remove_punct = TRUE, remove_numbers = TRUE))

# Frequency of common words
word_freq <- lapply(tokens_list, function(t) {
dfm_word <- dfm(t)
topfeatures(dfm_word, 10)
})

# Convert to a data frame
word_freq_df <- bind_rows(lapply(word_freq, function(x) data.frame(Word = names(x), Count = x)), .id = "Source")

# Comparative bar chart with pastel colors
ggplot(word_freq_df, aes(x = reorder(Word, Count), y = Count, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Comparison of Most Common Words between Twitter, Blogs and News",
       x = "Word",
       y = "Frequency") +
  scale_fill_manual(values = c("#FFB3BA", "#FFDFBA", "#B3E0FF")) +  # Pastel color palette
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

The loading and analysis of the three data sets has been carried out.
A sample of 0.4% was extracted to avoid memory problems.
The most common words in each source and the most common bigrams and trigrams in the combined sample were identified.
Strange characters that could indicate noise in the data were detected.
Comparative visualizations between Twitter, Blogs and News were created.

Plan for the Prediction Algorithm and Shiny App

Prediction Algorithm: An n-gram model with Laplace (add-one) smoothing. The model will learn from n-grams extracted from the three datasets and, later, from user interactions.
Shiny App: The interactive app will allow users to enter text in real-time and receive predictions on the next word, similar to text prediction apps like SwiftKey.
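
As a sketch of the planned approach, Laplace smoothing for a bigram model estimates P(next | prev) as (count(prev next) + 1) / (count(prev) + V), where V is the vocabulary size, so that unseen word pairs still receive a small non-zero probability. The toy corpus and the function name `laplace_prob()` below are assumptions for illustration; the final model may use higher-order n-grams or a different design.

```r
# Toy training text, invented for illustration.
toy_text <- c("thanks for the follow", "thanks for the help",
              "looking forward to the weekend")

words_by_line <- strsplit(tolower(toy_text), "\\s+")
vocab <- sort(unique(unlist(words_by_line)))

# Collect (previous word, next word) pairs within each line.
word_pairs <- do.call(rbind, lapply(words_by_line, function(w) {
  data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
}))

# Bigram count matrix over the full vocabulary (rows: previous word).
bigram_counts <- table(factor(word_pairs$prev, levels = vocab),
                       factor(word_pairs$nxt,  levels = vocab))

# Add-one (Laplace) estimate of P(nxt | prev).
laplace_prob <- function(prev, nxt) {
  (bigram_counts[prev, nxt] + 1) / (sum(bigram_counts[prev, ]) + length(vocab))
}

laplace_prob("the", "help")    # seen pair:   (1 + 1) / (3 + 9)
laplace_prob("the", "thanks")  # unseen pair: (0 + 1) / (3 + 9)
```

For a fixed previous word, the smoothed probabilities over the whole vocabulary still sum to 1.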

Additional Functionality: The app will include a text input field where users type their message, and the system will offer three suggestions for the next word, based on the preceding context and the model’s ongoing learning.

Users will be able to select any of these suggestions, and the selected word will be appended to their input automatically. This reduces the number of keystrokes and gives a more fluid experience, since users do not have to type every word in full. The selections could also serve as feedback for improving the model’s predictions.
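
The three-suggestion feature could be sketched as a lookup of the most frequent continuations of the last word typed. The table `bigram_freq`, its counts, and the function `suggest_next()` are hypothetical and serve only to illustrate the idea; the actual app may rank candidates with smoothed probabilities and longer contexts.

```r
# Hypothetical bigram frequency table; counts are invented for illustration.
bigram_freq <- data.frame(
  prev  = c("thanks", "for", "for", "for", "for", "of"),
  nxt   = c("for", "the", "a", "your", "me", "the"),
  count = c(96, 80, 40, 25, 10, 76),
  stringsAsFactors = FALSE
)

# Return up to `k` most frequent continuations of the last word typed.
suggest_next <- function(input, freq = bigram_freq, k = 3) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  last_word <- tail(words, 1)
  candidates <- freq[freq$prev == last_word, ]
  candidates <- candidates[order(-candidates$count), ]
  head(candidates$nxt, k)
}

suggest_next("Thanks for")  # "the" "a" "your"
```

When the last word is not in the table, the function returns an empty vector, so a real app would need a fallback such as the most frequent words overall.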

Personalization: The application could store what each user types and use it to adapt future predictions to that user’s style and vocabulary, making the suggestions more accurate and useful over time.