This report presents an exploratory analysis of text data sourced from three platforms: Twitter, Blogs, and News articles. The overarching goal of the project is to use these datasets to build a predictive text algorithm and deploy it in a Shiny app. Predictive text models of this kind improve the user experience by offering autocomplete suggestions that speed up digital communication. The algorithm will predict the next word(s) from the user’s input, using natural language processing (NLP) techniques to capture language patterns across diverse text sources. This document highlights key features of the data, presents initial findings, and outlines a preliminary plan for developing the prediction algorithm and the Shiny app.
The dataset consists of text corpora in English from Twitter, Blogs, and News sources. Each data source presents unique linguistic patterns, providing valuable diversity for building a robust prediction model. Twitter data, for instance, features brevity due to character limits, whereas Blogs and News data offer longer, more formal language structures.
Data was loaded into R using read_lines() to handle the files efficiently. Given the substantial size of each dataset, initial preprocessing focused on encoding to UTF-8 to ensure data integrity during analysis.
# readr provides read_lines(); tidyverse supplies tibble, stringr, dplyr and ggplot2
library(readr)
library(tidyverse)
library(knitr)
# Paths to the English (en_US) corpus files
file_path_twitter <- "Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
file_path_blogs <- "Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
file_path_news <- "Coursera-SwiftKey/final/en_US/en_US.news.txt"
# Read each file into a character vector, one element per line
twitter_data <- read_lines(file_path_twitter)
blogs_data <- read_lines(file_path_blogs)
news_data <- read_lines(file_path_news)
# Count lines and whitespace-delimited tokens for each source
data_summary <- tibble(
  Source = c("Twitter", "Blogs", "News"),
  Lines = c(length(twitter_data), length(blogs_data), length(news_data)),
  Words = c(
    sum(str_count(twitter_data, "\\S+")),
    sum(str_count(blogs_data, "\\S+")),
    sum(str_count(news_data, "\\S+"))
  )
)
kable(data_summary, caption = "Summary of Data Sources")
Source | Lines | Words |
---|---|---|
Twitter | 2360148 | 30373543 |
Blogs | 899288 | 37334131 |
News | 1010242 | 34372530 |
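The loading code relies on read_lines() reading text as UTF-8 by default and does not show an explicit encoding step. A minimal sketch of one way to enforce valid UTF-8 after loading is given here, assuming base R's iconv() is acceptable for the job; the helper name clean_utf8 and the example strings are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: coerce a character vector to valid UTF-8.
clean_utf8 <- function(x) {
  # sub = "" asks iconv() to drop bytes that cannot be converted
  iconv(x, from = "UTF-8", to = "UTF-8", sub = "")
}

# Toy input: plain ASCII, a valid accented character, and a stray invalid byte
example_lines <- c("plain ascii", "caf\u00e9", "bad byte: \xff here")
cleaned <- clean_utf8(example_lines)

# validUTF8() reports, per element, whether the string is valid UTF-8
print(validUTF8(cleaned))
```

In the report's workflow, the same helper could be applied to twitter_data, blogs_data, and news_data before any counting or tokenization.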
At roughly 0.9 to 2.4 million lines and 30 to 37 million words per source, the full corpora are expensive to process in memory. We therefore work with random samples during exploration and plan to cache intermediate results, so that memory use stays manageable without compromising predictive performance.
We drew a random sample of 10,000 lines per source, which is roughly 0.4% of the Twitter lines and about 1% of the Blogs and News lines. A fixed seed makes the samples reproducible. Samples of this size keep exploratory analysis fast while preserving enough linguistic diversity to inform model development; a larger sample will likely be needed when the model itself is trained.
# Fix the seed so the samples are reproducible
set.seed(52)
# Read a file and draw n lines at random, without replacement
read_sample <- function(file_path, n = 10000) {
  lines <- read_lines(file_path)
  sample(lines, n)
}
twitter_sample <- read_sample(file_path_twitter)
blogs_sample <- read_sample(file_path_blogs)
news_sample <- read_sample(file_path_news)
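Because the full files were already read into memory for the summary table, the sampling step could also work directly on those vectors instead of reading each file a second time. The sketch below illustrates that variant on toy data; the function name sample_lines and the toy vector are assumptions made for illustration, and the guard against requesting more lines than exist is an addition that the original read_sample() does not have.

```r
# Hypothetical variant of read_sample(): sample from a vector already in memory.
sample_lines <- function(lines, n = 10000) {
  # Never request more lines than are available
  n <- min(n, length(lines))
  lines[sample.int(length(lines), n)]
}

set.seed(52)
toy_lines <- paste("line", 1:50)            # stand-in for twitter_data etc.
toy_sample <- sample_lines(toy_lines, n = 10)
print(toy_sample)
```

With the real data this would be called as, for example, sample_lines(twitter_data), which avoids a second pass over the large text files.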
We calculated basic statistics for each sample, including total line count, total word count, and average words per line. These metrics offer a foundational understanding of the structure of each data source, providing insights into typical content length and word density.
# Line count, word count and mean words per line for one sample
summary_stats <- function(text_sample) {
  tibble(
    Total_Lines = length(text_sample),
    Total_Words = sum(str_count(text_sample, "\\S+")),
    Avg_Words_Per_Line = mean(str_count(text_sample, "\\S+"))
  )
}
# Stack the per-source summaries; .id stores the list names in a Source column
summary_table <- bind_rows(
  Twitter = summary_stats(twitter_sample),
  Blogs = summary_stats(blogs_sample),
  News = summary_stats(news_sample),
  .id = "Source"
)
kable(summary_table, caption = "Summary Statistics for Sampled Data")
Source | Total_Lines | Total_Words | Avg_Words_Per_Line |
---|---|---|---|
Twitter | 10000 | 128529 | 12.8529 |
Blogs | 10000 | 417160 | 41.7160 |
News | 10000 | 336796 | 33.6796 |
To understand line length distribution across datasets, we visualized the number of words per line in each source. Twitter’s distribution skews towards shorter lines due to character constraints, while Blogs and News data exhibit broader variations in line lengths, reflecting their content’s longer and more detailed nature.
# Words per line for every sampled line, labelled by source
line_lengths <- data.frame(
  Source = rep(c("Twitter", "Blogs", "News"),
               times = c(length(twitter_sample), length(blogs_sample), length(news_sample))),
  Line_Length = c(str_count(twitter_sample, "\\S+"),
                  str_count(blogs_sample, "\\S+"),
                  str_count(news_sample, "\\S+"))
)
# One histogram panel per source; free y-axes because the counts differ in scale
ggplot(line_lengths, aes(x = Line_Length, fill = Source)) +
  geom_histogram(binwidth = 5, position = "dodge") +
  facet_wrap(~ Source, scales = "free_y") +
  labs(title = "Distribution of Line Lengths", x = "Words per Line", y = "Frequency")
The histogram reveals platform-specific differences: Twitter lines are short because of the character limit, while Blogs and News lines vary much more in length. These differences help set parameters such as text chunk size and inform how the predictive model should handle inputs of varied length.
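A numeric complement to the histogram can make it easier to pick concrete thresholds. The sketch below computes a few quantiles of words per line for each source, using base R only and toy sentences in place of the real samples; the helper count_words, the toy data, and the chosen quantile levels are all assumptions for illustration.

```r
# Count whitespace-delimited tokens per line (same rule as str_count(x, "\\S+"))
count_words <- function(x) lengths(regmatches(x, gregexpr("\\S+", x)))

# Toy stand-ins for twitter_sample and news_sample
toy_samples <- list(
  Twitter = c("short tweet", "another short tweet here"),
  News = c("a somewhat longer sentence taken from a news article",
           "officials said the vote is expected next week")
)

# Median, 90th and 99th percentile of words per line, one column per source
length_quantiles <- sapply(
  toy_samples,
  function(x) quantile(count_words(x), probs = c(0.5, 0.9, 0.99))
)
print(length_quantiles)
```

Applied to the real samples, an upper quantile per source would give a defensible cap on line length if very long lines need to be truncated or split.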
To capture the most frequent words in each dataset, we calculated word frequencies after lowercasing the text, filtering out common English stop words, and dropping tokens of one or two characters. Each dataset shows word usage patterns that reflect the typical style and content of its source. For instance, Twitter often features more colloquial terms, while News articles use more formal language.
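# tm supplies the English stop-word list via stopwords("en")
library(tm)
# Return the n most frequent non-stop-words in a sample as a data frame
top_words_df <- function(text_sample, n = 10) {
  # Note: "\\W+" also splits on apostrophes, so contractions are broken into fragments
  words <- unlist(str_split(tolower(text_sample), "\\W+"))
  words <- words[!words %in% stopwords("en") & words != "" & nchar(words) > 2]
  word_freq <- sort(table(words), decreasing = TRUE)
  # Guard against samples with fewer than n distinct words
  word_freq <- word_freq[seq_len(min(n, length(word_freq)))]
  data.frame(word = names(word_freq), frequency = as.numeric(word_freq))
}
twitter_top <- top_words_df(twitter_sample)
blogs_top <- top_words_df(blogs_sample)
news_top <- top_words_df(news_sample)
# Stack the three top-10 tables, keeping the source as a column
top_words <- bind_rows(
  Twitter = twitter_top,
  Blogs = blogs_top,
  News = news_top,
  .id = "Source"
)
# Horizontal bar chart with one panel per source
ggplot(top_words, aes(x = reorder(word, frequency), y = frequency, fill = Source)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Source, scales = "free") +
  coord_flip() +
  labs(title = "Top 10 Words by Frequency", x = "Words", y = "Frequency")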
This bar chart of the top 10 words highlights differences in word usage across platforms. Knowing which terms dominate each source helps when deciding how to build and weight n-grams.
Summary Statistics: In the samples, blog entries average the most words per line (about 41.7), followed by news articles (about 33.7), with Twitter entries the shortest (about 12.9) because of the character limit.
Line Length Distribution: The distribution of line lengths shows Twitter’s skew towards shorter lines, whereas Blogs and News data have wider variations in line lengths. This information is useful for tailoring the model to each data source.
Common Words: Frequently used words vary across datasets, with Twitter displaying more colloquial terms and News articles featuring more formal language. These differences may affect how well the model's predictions carry over from one source to another.
Our strategy for the prediction algorithm involves:
Text Preprocessing: Convert text to lowercase, remove punctuation, and filter out stop words. We may also use stemming or lemmatization to help the model generalize.
Tokenization and N-grams: Construct n-grams (e.g., bigrams, trigrams) to identify common word sequences and focus on n-grams that capture contextual nuances.
Model Selection: An n-gram model, which is a Markov chain over word sequences, will likely be used to predict the next word from the preceding words. Backoff and smoothing techniques will help handle unseen word combinations and improve prediction accuracy.
Performance Considerations: Managing memory and runtime will be critical. We will optimize memory usage to ensure efficient performance in the Shiny app.
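The strategy above can be sketched in a few lines of base R. The example below is a minimal illustration under stated assumptions, not the planned implementation: it builds unigram, bigram, and trigram counts from four toy sentences and predicts the next word by trying the longest matching context first and backing off to shorter ones. The function names (tokenize, make_ngrams, build_model, predict_next) and the toy corpus are hypothetical. The sketch also keeps stop words, on the assumption that they are often the very words a user expects to be suggested; that choice differs from the preprocessing listed above and would need to be evaluated.

```r
# Lowercase, replace anything other than letters, apostrophes and spaces, then split
tokenize <- function(text) {
  text <- gsub("[^a-z' ]+", " ", tolower(text))
  tokens <- strsplit(trimws(text), "\\s+")[[1]]
  tokens[tokens != ""]
}

# All n-grams of order n in a token vector, as space-joined strings
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency tables for orders 1..max_n, built per line so n-grams never span lines
build_model <- function(lines, max_n = 3) {
  lapply(seq_len(max_n), function(n) {
    grams <- as.character(unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n))))
    sort(table(grams), decreasing = TRUE)
  })
}

# Simple backoff: longest context first, then shorter contexts, then top unigrams
predict_next <- function(model, phrase, k = 3) {
  tokens <- tokenize(phrase)
  order_n <- min(length(model), length(tokens) + 1)
  while (order_n >= 2) {
    context <- paste(tail(tokens, order_n - 1), collapse = " ")
    counts <- model[[order_n]]
    hits <- counts[startsWith(names(counts), paste0(context, " "))]
    if (length(hits) > 0) {
      # Keep only the final word of each matching n-gram
      return(head(sub(".* ", "", names(hits)), k))
    }
    order_n <- order_n - 1
  }
  head(names(model[[1]]), k)
}

toy_corpus <- c("i love new york", "i love new music",
                "i love new york city", "we love old music")
model <- build_model(toy_corpus)
print(predict_next(model, "I love new"))
print(predict_next(model, "zebra", k = 1))
```

The linear scan with startsWith() is fine for a toy corpus but would be too slow on millions of n-grams; a keyed lookup table (for example, one indexed by context) is the kind of optimization the performance point above refers to.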
The Shiny app will provide:
Input Field: Allow users to type a phrase or sentence.
Prediction Display: Show predictions for the next word as the user types.
User Interface: Offer a simple and intuitive design accessible to non-technical users.
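The three elements listed above map onto a very small Shiny app. The sketch below is one possible layout, assuming the shiny package is installed; the predictor predict_stub and its lookup table are placeholders invented for illustration and would be replaced by the real model. The app is only constructed if shiny is available, and it is not launched automatically.

```r
# Placeholder predictor (hypothetical): suggests words based on the last word typed
lookup <- list(the = c("first", "same", "best"), of = c("the", "a", "this"))

predict_stub <- function(phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  preds <- lookup[[words[length(words)]]]
  if (is.null(preds)) character(0) else head(preds, k)
}

if (requireNamespace("shiny", quietly = TRUE)) {
  # Input field plus a text output that updates as the user types
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next-word prediction (sketch)"),
    shiny::textInput("phrase", "Type a phrase:"),
    shiny::textOutput("prediction")
  )

  server <- function(input, output, session) {
    output$prediction <- shiny::renderText({
      preds <- predict_stub(input$phrase)
      if (length(preds) == 0) "No suggestion yet" else paste(preds, collapse = ", ")
    })
  }

  app <- shiny::shinyApp(ui, server)
  # In an interactive session, shiny::runApp(app) would launch the app
}
```

Keeping the predictor as a plain function, separate from the UI code, makes it possible to test and time the model without starting the app.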
This report provides an initial exploratory analysis of text data from Twitter, Blogs, and News sources, establishing a foundation for developing a predictive text model. Future work will include refining the prediction algorithm, optimizing it for performance, and deploying it as an interactive Shiny app. Key challenges anticipated are managing response times and handling uncommon words, which we will address through model optimization and algorithm testing.