Introduction

This report presents an exploratory data analysis of three English text corpora: Twitter, Blogs, and News. It computes descriptive statistics and identifies key patterns as groundwork for building a predictive text model.

Loading and Preprocessing Data

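# Load the packages used throughout this report
library(stringr)             # str_count, str_extract_all
library(purrr)               # map, map_int
library(dplyr)               # select, arrange, bind_rows, %>%
library(tibble)              # tibble
library(quanteda)            # tokens, tokens_ngrams, dfm, topfeatures
library(quanteda.textstats)  # textstat_frequency (quanteda >= 3)
library(ggplot2)             # plots
library(knitr)               # kable
library(kableExtra)          # kable_styling

# Defining file paths
files <- list(
  twitter = "data/final/en_US/en_US.twitter.txt",
  blogs = "data/final/en_US/en_US.blogs.txt",
  news = "data/final/en_US/en_US.news.txt"
)

# Function to load a file and draw a random sample of its lines
load_sample_text <- function(file, sample_size = 0.004) {
  text <- readLines(file, warn = FALSE, encoding = "UTF-8")
  sampled_text <- sample(text, size = max(1, round(length(text) * sample_size)), replace = FALSE)
  list(
    text = sampled_text,
    lines = length(sampled_text),
    words = sum(str_count(sampled_text, "\\S+")),
    characters = sum(nchar(sampled_text))
  )
}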

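# Draw the samples once, with a fixed seed so the results are reproducible
set.seed(1234)
samples <- lapply(files, load_sample_text)

# Reuse the same samples for the summary statistics
# (sampling a second time would produce a different random subset)
stats <- samples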
stats_df <- tibble(
 Dataset = names(stats),
 Lines = map_int(stats, "lines"),
 Words = map_int(stats, "words"),
 Characters = map_int(stats, "characters")
)

# Show statistics in table
kable(stats_df, caption = "Data Set Statistics (Sample 0.4%)", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Data Set Statistics (Sample 0.4%)

Dataset   Lines    Words   Characters
twitter    9441   121347       647143
blogs      3597   150946       835418
news        309    10051        59261

This analysis uses a random 0.4% sample of the lines in each dataset (Twitter, Blogs, and News). The small sample keeps memory use and run time manageable while still giving a reasonable overview of the full datasets. The samples differ considerably in size: the News sample has only 309 lines, against 9,441 for Twitter and 3,597 for Blogs, so statistics for News are the least stable.
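Because the sample is random, repeated runs give different numbers unless the random seed is fixed. The following is a minimal sketch of seeded sampling on a toy vector; the helper name `sample_lines` and the seed value are assumptions made for illustration and are not part of the original analysis.

```r
# Hypothetical helper: draw a reproducible fraction of lines from a character vector
sample_lines <- function(lines, fraction = 0.004, seed = 1234) {
  set.seed(seed)                                # fix the RNG state before sampling
  n <- max(1, round(length(lines) * fraction))  # always keep at least one line
  sample(lines, size = n, replace = FALSE)
}

# Toy input standing in for a real corpus file
toy_lines <- paste("line", seq_len(1000))

first_draw  <- sample_lines(toy_lines)
second_draw <- sample_lines(toy_lines)

length(first_draw)                  # 4 lines, i.e. 0.4% of 1000
identical(first_draw, second_draw)  # TRUE: same seed, same sample
```

With a fixed seed, the tables and plots in a report like this one can be regenerated exactly.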

Frequency Analysis: Common Words and Phrases

# Function to tokenize and extract the 10 most frequent n-grams
extract_ngrams <- function(text, n) {
  toks <- tokens(text, what = "word", remove_punct = TRUE, remove_numbers = TRUE)
  dfm_ngrams <- dfm(tokens_ngrams(toks, n = n))
  textstat_frequency(dfm_ngrams, n = 10) %>%
    select(feature, frequency)
}

# Obtain bigrams and trigrams from the sample
all_text <- unlist(map(stats, "text"))
bigrams <- extract_ngrams(all_text, 2)
trigrams <- extract_ngrams(all_text, 3)

# Replace the underscore symbol with a space in the features
bigrams$feature <- gsub("_", " ", bigrams$feature)
trigrams$feature <- gsub("_", " ", trigrams$feature)

# Show tables
kable(bigrams, caption = "Most Frequent Bigrams (Sample 0.4%)", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Most Frequent Bigrams (Sample 0.4%)

feature   frequency
of the          975
in the          891
for the         584
to the          541
on the          521
to be           490
at the          341
i have          323
i was           318
it was          313

kable(trigrams, caption = "Most Frequent Trigrams (Sample 0.4%)", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Most Frequent Trigrams (Sample 0.4%)

feature              frequency
thanks for the              96
one of the                  76
a lot of                    70
looking forward to          58
to be a                     53
going to be                 51
it was a                    48
i want to                   45
i need to                   42
some of the                 42

The most frequent bigrams and trigrams are built from common English function words: prepositions such as “of”, “in”, and “to”, the article “the”, and forms of “be” and “have”.

High-frequency phrases such as “thanks for the” and “looking forward to” reflect the conversational style of platforms like Twitter and Blogs.
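To make the counting step concrete, here is a small base R sketch of n-gram extraction on toy text. It is an illustration only, not the quanteda pipeline used above; the helper name `count_ngrams` and the toy sentences are assumptions. N-grams are formed within each line and never across line boundaries.

```r
# Hypothetical helper: count word n-grams in a character vector using base R only
count_ngrams <- function(text, n = 2, top = 3) {
  grams <- unlist(lapply(text, function(line) {
    # Lowercase, drop punctuation and digits, then split on whitespace
    cleaned <- gsub("[[:punct:][:digit:]]+", "", tolower(line))
    words <- strsplit(trimws(cleaned), "\\s+")[[1]]
    words <- words[nzchar(words)]
    if (length(words) < n) return(character(0))
    # Slide a window of width n over the words of this line
    starts <- seq_len(length(words) - n + 1)
    vapply(starts, function(i) paste(words[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  # Frequency table, most frequent first, truncated to the top entries
  head(sort(table(grams), decreasing = TRUE), top)
}

toy_text <- c("Thanks for the follow!", "thanks for the help", "one of the best")

count_ngrams(toy_text, n = 2)           # "thanks for" and "for the" each occur twice
count_ngrams(toy_text, n = 3, top = 1)  # "thanks for the" occurs twice
```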

Identifying Strange Symbols

# Count characters that are neither alphanumeric nor whitespace
total_special_chars <- map_int(stats, ~sum(str_count(.x$text, "[^[:alnum:][:space:]]")))

# Create table
symbol_counts <- tibble(
  Dataset = names(stats),
  SpecialChars = total_special_chars
)

# Display in table
kable(symbol_counts, caption = "Number of Strange Symbols (Sample 0.4%)", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Number of Strange Symbols (Sample 0.4%)

Dataset   SpecialChars
twitter          32374
blogs            26393
news              1985

Twitter has the highest raw count of strange symbols, which is expected given its informal style and frequent use of emoticons, hashtags, and other special characters. Because the samples differ greatly in size, raw counts are not directly comparable. Relative to the total number of characters in each sample, symbols make up about 5.0% of Twitter, 3.2% of Blogs, and 3.3% of News. Blogs and News are therefore similar in symbol density, and the low raw count for News mainly reflects its small sample.
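The comparison is easier to read once the counts are normalized by sample size. The sketch below computes the share of special characters per dataset, using the figures copied from the two tables above; the column name `SpecialPct` is an assumption introduced for illustration.

```r
# Counts copied from the statistics and symbol tables above (0.4% sample)
counts <- data.frame(
  Dataset      = c("twitter", "blogs", "news"),
  Characters   = c(647143, 835418, 59261),
  SpecialChars = c(32374, 26393, 1985)
)

# Share of characters that are neither alphanumeric nor whitespace, in percent
counts$SpecialPct <- round(100 * counts$SpecialChars / counts$Characters, 2)

counts  # SpecialPct: twitter 5.00, blogs 3.16, news 3.35
```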

Identifying Strange Characters

# Function to count the strange characters and return the most frequent ones
get_most_frequent_special_chars <- function(text) {
  special_chars <- str_extract_all(text, "[^[:alnum:][:space:]]") %>%
    unlist() %>%
    table() %>%
    as.data.frame()
  colnames(special_chars) <- c("Character", "Frequency")
  special_chars %>%
    arrange(desc(Frequency)) %>%
    head(10)  # Top 10 most frequent
}

# Apply the function to each dataset
special_chars_twitter <- get_most_frequent_special_chars(paste(samples$twitter$text, collapse = " "))
special_chars_blog <- get_most_frequent_special_chars(paste(samples$blogs$text, collapse = " "))
special_chars_news <- get_most_frequent_special_chars(paste(samples$news$text, collapse = " "))

# Join the three tables into one (inner join: keeps only characters in the top 10 of all three)
combined_special_chars <- merge(special_chars_twitter, special_chars_blog, by = "Character")
combined_special_chars <- merge(combined_special_chars, special_chars_news, by = "Character")

# Rename the columns for each data set
colnames(combined_special_chars) <- c("Character", "Twitter Frequency", "Blog Frequency", "News Frequency")

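# Show the final table
kable(combined_special_chars, caption = "Common Strange Characters on Twitter, Blog and News", format = "html") %>%
  kable_styling(bootstrap_options = c("striped", "hover"))

Common Strange Characters on Twitter, Blog and News

Character   Twitter Frequency   Blog Frequency   News Frequency
                         3486             1538              185
                         1246             1340              217
                         1028              581              212
)                         797              733               27
,                        3009             6983              618
.                       10431             8737              668

Only characters that appear in the top 10 of all three sources survive the joins, so the table has six rows. Symbols such as quotation marks (‘ ’), commas, and periods are common in all datasets. Periods are most frequent in the Twitter sample (10,431), while commas are most frequent in Blogs (6,983), which is consistent with longer, more heavily punctuated sentences. These are raw counts from samples of different sizes, so they should not be compared directly across sources.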
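Since `merge()` performs an inner join by default, characters that are frequent in only one or two sources drop out of the combined table. One possible alternative, sketched here on made-up toy tables, is a full outer join with `all = TRUE`; the data frames and counts below are invented for illustration and do not come from the corpora.

```r
# Toy top-character tables standing in for the three per-source results
top_twitter <- data.frame(Character = c(".", "!", "#"), Frequency = c(100, 40, 30))
top_blogs   <- data.frame(Character = c(".", ",", "!"), Frequency = c(90, 70, 10))
top_news    <- data.frame(Character = c(".", ",", "$"), Frequency = c(20, 15, 5))

# Rename the count column per source before merging to avoid name clashes
names(top_twitter)[2] <- "Twitter"
names(top_blogs)[2]   <- "Blogs"
names(top_news)[2]    <- "News"

# all = TRUE requests a full outer join; missing combinations become NA
combined <- Reduce(function(a, b) merge(a, b, by = "Character", all = TRUE),
                   list(top_twitter, top_blogs, top_news))

# For comparison, the default inner join keeps only characters present in all three
inner <- Reduce(function(a, b) merge(a, b, by = "Character"),
                list(top_twitter, top_blogs, top_news))

nrow(combined)  # 5 distinct characters
nrow(inner)     # 1 character (".") is shared by all three toy tables
```

The outer join keeps source-specific characters such as hashtags visible, at the cost of `NA` cells that need handling when the table is displayed.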

Visualizing Sentence Length Distribution

# Count words per line; each line of the sample is treated as one sentence
sentence_lengths <- unlist(map(stats, ~str_count(.x$text, "\\S+")))

# Plot distribution
ggplot(data.frame(Length = sentence_lengths), aes(x = Length)) +
  geom_histogram(binwidth = 5, fill = "steelblue", color = "black", alpha = 0.7) +
  labs(title = "Sentence Length Distribution (Sample 0.4%)", x = "Number of Words", y = "Frequency")
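The lengths plotted above are word counts per line, and a single line may contain several sentences. If true sentence lengths are wanted, one rough option is to split each line at sentence-ending punctuation first. The sketch below uses base R only; the helper name `sentence_word_counts` is an assumption, and the splitting rule is a simple heuristic that will mis-split abbreviations such as "Mr.".

```r
# Hypothetical helper: split lines into sentences and count the words in each
sentence_word_counts <- function(lines) {
  # Split on whitespace that follows ., ! or ? (a rough heuristic)
  sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  sentences <- sentences[nzchar(trimws(sentences))]
  lengths(strsplit(trimws(sentences), "\\s+"))
}

toy_lines <- c("It was a good day. I want to go again!", "Thanks for the follow")

sentence_word_counts(toy_lines)  # 5 5 4, whereas per-line counts would be 10 and 4
```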

Visualizing Differences

# Create list of tokens from samples
tokens_list <- lapply(samples, function(s) tokens(s$text, what = "word", remove_punct = TRUE, remove_numbers = TRUE))

# Frequency of common words
word_freq <- lapply(tokens_list, function(t) {
  dfm_word <- dfm(t)
  topfeatures(dfm_word, 10)
})

# Convert to a data frame
word_freq_df <- bind_rows(lapply(word_freq, function(x) data.frame(Word = names(x), Count = x)), .id = "Source")

# Comparative bar chart with pastel colors
ggplot(word_freq_df, aes(x = reorder(Word, Count), y = Count, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  coord_flip() +
  labs(title = "Comparison of Most Common Words between Twitter, Blogs and News",
       x = "Word",
       y = "Frequency") +
  scale_fill_manual(values = c("#FFB3BA", "#FFDFBA", "#B3E0FF")) +  # Pastel color palette
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Conclusions

  1. The three data sets were loaded and analyzed.
  2. A sample of 0.4% was extracted to avoid memory problems.
  3. The most common words in each source, and the most common bigrams and trigrams across the combined sample, were identified.
  4. Strange characters that could indicate noise in the data were detected.
  5. Comparative visualizations between Twitter, Blogs and News were created.

Plan for the Prediction Algorithm and Shiny App

  • Prediction Algorithm: Based on n-grams with Laplace smoothing. The model would learn from n-grams extracted from datasets and user interactions.
  • Shiny App: The interactive app will allow users to enter text in real-time and receive predictions on the next word, similar to text prediction apps like SwiftKey.

Additional Functionality: The app will include a text input field where users type their message, and the system will offer three suggestions for the next word, based on the preceding context and the model’s ongoing learning.

Users can select any of these suggestions, which is then appended to their input, reducing typing effort. Recording which suggestions are accepted could also provide feedback for improving the model over time, and the result should be a more fluid typing experience.
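As a rough illustration of the plan above, the following sketch builds a bigram model with add-one (Laplace) smoothing on a toy corpus and returns the three most probable next words. It is a sketch under stated assumptions, not the final algorithm: the corpus, the helper name `predict_next`, and the choice to return nothing for unseen words are all made up for this example.

```r
# Toy training corpus; a real model would use the sampled Twitter, Blogs and News text
corpus <- c("i want to go", "i want to sleep", "i need to go", "we want to eat")

# Tokenize each line and collect (previous word, next word) pairs
tokens_by_line <- strsplit(tolower(corpus), "\\s+")
pairs <- do.call(rbind, lapply(tokens_by_line, function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(prev = head(w, -1), nxt = tail(w, -1))
}))

vocab <- sort(unique(unlist(tokens_by_line)))
V <- length(vocab)  # vocabulary size used in the smoothing denominator

# Bigram count matrix: rows are the previous word, columns the next word
bigram_counts <- table(factor(pairs$prev, levels = vocab),
                       factor(pairs$nxt,  levels = vocab))

# Hypothetical helper: top-k next words under add-one (Laplace) smoothing,
# P(next | prev) = (count(prev, next) + 1) / (count(prev) + V)
predict_next <- function(prev, k = 3) {
  if (!prev %in% vocab) return(numeric(0))  # unseen context: no suggestion in this sketch
  row <- bigram_counts[prev, ]
  probs <- (row + 1) / (sum(row) + V)
  head(sort(probs, decreasing = TRUE), k)
}

predict_next("to")  # "go" ranks first with probability 0.25
```

Add-one smoothing gives every word in the vocabulary a non-zero probability, so rare continuations are still available as suggestions. A production model would likely also need a backoff to shorter contexts for unseen words.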

  • Personalization: The application could store what each user types and use it to refine future predictions, producing a model tailored to each person’s style and vocabulary.
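One simple way to picture the personalization idea is a per-user table of bigram counts that grows as the user types. The sketch below is only an assumption about how such a store might look; the helper name `update_user_counts` and the named-vector representation are hypothetical.

```r
# Hypothetical per-user store: a named numeric vector of bigram counts
update_user_counts <- function(user_counts, typed_text) {
  words <- strsplit(tolower(trimws(typed_text)), "\\s+")[[1]]
  if (length(words) < 2) return(user_counts)  # no bigram to record
  bigrams <- paste(head(words, -1), tail(words, -1))
  for (bg in bigrams) {
    # Start from zero for bigrams this user has not typed before
    current <- if (bg %in% names(user_counts)) user_counts[[bg]] else 0
    user_counts[[bg]] <- current + 1
  }
  user_counts
}

user_counts <- c("see you" = 1)
user_counts <- update_user_counts(user_counts, "See you soon")
user_counts <- update_user_counts(user_counts, "see you later")

user_counts  # "see you" = 3, "you soon" = 1, "you later" = 1
```

These user-specific counts could then be added to the counts from the general corpus before smoothing, so that phrases a person types often rise in their suggestions.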