Introduction

This report presents an exploratory analysis of text data from three sources: Twitter, Blogs, and News articles. Predictive text models, such as the one we aim to develop, improve the user experience by offering autocomplete suggestions that speed up digital communication. The overarching goal of this project is to use these datasets to build a predictive text algorithm that will be deployed in a Shiny app. The algorithm will predict the next word(s) from the user’s input, using natural language processing (NLP) techniques to capture language patterns across diverse text sources. This document highlights key features of the data, reports initial findings, and outlines a preliminary plan for developing the prediction algorithm and Shiny app.

Data Loading and Summary

The dataset consists of text corpora in English from Twitter, Blogs, and News sources. Each data source presents unique linguistic patterns, providing valuable diversity for building a robust prediction model. Twitter data, for instance, features brevity due to character limits, whereas Blogs and News data offer longer, more formal language structures.

The data were loaded into R with read_lines() from the readr package, which returns each file as a character vector with one element per line. read_lines() assumes UTF-8 encoding by default, which keeps non-ASCII characters intact during analysis.

library(readr)
library(tidyverse)
library(knitr)

file_path_twitter <- "Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
file_path_blogs <- "Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
file_path_news <- "Coursera-SwiftKey/final/en_US/en_US.news.txt"

twitter_data <- read_lines(file_path_twitter)
blogs_data <- read_lines(file_path_blogs)
news_data <- read_lines(file_path_news)

data_summary <- tibble(
        Source = c("Twitter", "Blogs", "News"),
        Lines = c(length(twitter_data), length(blogs_data), length(news_data)),
        Words = c(sum(str_count(twitter_data, "\\S+")),
                  sum(str_count(blogs_data, "\\S+")),
                  sum(str_count(news_data, "\\S+"))
                  )
        )

kable(data_summary, caption = "Summary of Data Sources")

Summary of Data Sources
Source	Lines	Words
Twitter	2360148	30373543
Blogs	899288	37334131
News	1010242	34372530

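Each source contains between 30 and 37 million words, and Twitter alone has more than 2.3 million lines, so the full corpora are slow to process and memory-intensive. We will therefore work with random samples and cache intermediate results, aiming to keep memory usage low without materially reducing predictive performance.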
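One way to make the sampling-and-caching idea concrete is to draw a sample once and store it on disk, so that later sessions do not need to re-read the full file. The sketch below is an assumption about how this could be done, not the approach used in this report; `cached_sample()`, its arguments, and the cache path are hypothetical names introduced for illustration.

```r
# Hypothetical helper: draw a reproducible sample once, then reuse it from disk.
cached_sample <- function(lines, cache_file, n = 10000, seed = 1) {
        if (file.exists(cache_file)) {
                return(readRDS(cache_file))      # reuse the stored sample
        }
        set.seed(seed)                           # make the draw reproducible
        smp <- sample(lines, min(n, length(lines)))
        saveRDS(smp, cache_file)                 # store for later sessions
        smp
}

# Toy usage with a stand-in corpus (a real call would pass a loaded text vector)
toy_corpus <- paste("line", 1:100)
cache_path <- tempfile(fileext = ".rds")
first  <- cached_sample(toy_corpus, cache_path, n = 10)  # samples and writes
second <- cached_sample(toy_corpus, cache_path, n = 10)  # reads from the cache
```

Storing the sample rather than the full corpus keeps the cache small. The trade-off is that changing `n` or `seed` requires deleting the cache file first.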

Data Sampling

Because each dataset is large, we drew a random sample of 10,000 lines per source to limit memory usage and speed up processing. A sample of this size keeps the exploratory analysis fast while retaining enough linguistic diversity to inform model development.

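set.seed(52)  # fixed seed so the samples are reproducible

# Draw n lines at random (without replacement) from a corpus that is
# already in memory, instead of reading each file from disk a second time
sample_lines <- function(lines, n = 10000) {
        sample(lines, min(n, length(lines)))
        }

twitter_sample <- sample_lines(twitter_data)
blogs_sample <- sample_lines(blogs_data)
news_sample <- sample_lines(news_data)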
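To put the sample size in perspective, the share of each corpus covered by 10,000 lines can be computed from the line counts in the "Summary of Data Sources" table. This is a small illustrative calculation; `total_lines`, `n_sampled`, and `sample_pct` are names introduced here.

```r
# Line counts taken from the "Summary of Data Sources" table
total_lines <- c(Twitter = 2360148, Blogs = 899288, News = 1010242)
n_sampled <- 10000

# Percentage of each corpus included in the sample
sample_pct <- round(100 * n_sampled / total_lines, 2)
sample_pct
# Twitter 0.42, Blogs 1.11, News 0.99
```

The sample covers about 0.4% of the Twitter lines and about 1% of the Blogs and News lines, so rare words and word sequences will be under-represented relative to the full corpora.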

Exploratory Analysis

Summary Statistics

We calculated basic statistics for each sample, including total line count, total word count, and average words per line. These metrics offer a foundational understanding of the structure of each data source, providing insights into typical content length and word density.

summary_stats <- function(text_sample) {
        tibble(
                Total_Lines = length(text_sample),
                Total_Words = sum(str_count(text_sample, "\\S+")),
                Avg_Words_Per_Line = mean(str_count(text_sample, "\\S+"))
                )
        }

summary_table <- bind_rows(
        Twitter = summary_stats(twitter_sample),
        Blogs = summary_stats(blogs_sample),
        News = summary_stats(news_sample),
        .id = "Source"
        )

kable(summary_table, caption = "Summary Statistics for Sampled Data")

Summary Statistics for Sampled Data
Source	Total_Lines	Total_Words	Avg_Words_Per_Line
Twitter	10000	128529	12.8529
Blogs	10000	417160	41.7160
News	10000	336796	33.6796

Distribution of Line Lengths

To understand line length distribution across datasets, we visualized the number of words per line in each source. Twitter’s distribution skews towards shorter lines due to character constraints, while Blogs and News data exhibit broader variations in line lengths, reflecting their content’s longer and more detailed nature.

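# One row per sampled line; group sizes are taken from the samples
# themselves rather than hard-coded
line_lengths <- data.frame(
        Source = rep(c("Twitter", "Blogs", "News"),
                     times = c(length(twitter_sample),
                               length(blogs_sample),
                               length(news_sample))),
        Line_Length = c(str_count(twitter_sample, "\\S+"),
                        str_count(blogs_sample, "\\S+"),
                        str_count(news_sample, "\\S+"))
        )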

ggplot(line_lengths, aes(x = Line_Length, fill = Source)) +
        geom_histogram(binwidth = 5, position = "dodge") +
        facet_wrap(~ Source, scales = "free_y") +
        labs(title = "Distribution of Line Lengths", x = "Words per Line", y = "Frequency")

The histograms reveal platform-specific differences: Twitter lines are short because of the character limit, while Blogs and News lines vary much more in length. Understanding these differences helps set parameters for text chunks and informs how the predictive model should handle inputs of varied length.
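The visual comparison can be complemented with numeric quantiles of words per line. The sketch below uses base R and a tiny made-up corpus, since the real samples are not reproduced here; `toy_samples` and `words_per_line()` are hypothetical names, and the counting rule (runs of non-whitespace characters) is the same `"\\S+"` pattern used elsewhere in this report.

```r
# Toy stand-ins for the sampled text vectors (made-up lines, not real data)
toy_samples <- list(
        Twitter = c("short tweet here", "lol", "see you soon friend"),
        Blogs   = c("a longer blog sentence with quite a few more words in it",
                    "another post")
)

# Count whitespace-delimited tokens in each line
words_per_line <- function(x) lengths(regmatches(x, gregexpr("\\S+", x)))

# Median and 95th percentile of words per line for each source
sapply(toy_samples, function(x) quantile(words_per_line(x), c(0.5, 0.95)))
```

Applied to the real samples, the upper quantiles would give a direct handle on how long the longest typical lines are in each source.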

Word Frequency Analysis

To capture the most frequent words in each dataset, we calculated word frequencies after filtering out common stop words. Each dataset shows unique word usage patterns that reflect the typical style and content of each source. For instance, Twitter often features more colloquial terms, while News articles use more formal language.

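library(tm)  # provides stopwords()

# Return the n most frequent non-stop-words in a character vector
top_words_df <- function(text_sample, n = 10) {
        # Lowercase, then split on runs of non-word characters
        words <- unlist(str_split(tolower(text_sample), "\\W+"))
        # Drop stop words, empty strings, and tokens of one or two characters
        words <- words[!words %in% stopwords("en") & words != "" & nchar(words) > 2]
        # head() avoids NA entries when fewer than n distinct words remain
        word_freq <- head(sort(table(words), decreasing = TRUE), n)
        data.frame(word = names(word_freq), frequency = as.numeric(word_freq))
        }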

twitter_top <- top_words_df(twitter_sample)
blogs_top <- top_words_df(blogs_sample)
news_top <- top_words_df(news_sample)

top_words <- bind_rows(
        Twitter = twitter_top, 
        Blogs = blogs_top, 
        News = news_top, 
        .id = "Source"
        )

ggplot(top_words, aes(x = reorder(word, frequency), y = frequency, fill = Source)) +
        geom_bar(stat = "identity", position = "dodge") +
        facet_wrap(~ Source, scales = "free") +
        coord_flip() +
        labs(title = "Top 10 Words by Frequency", x = "Words", y = "Frequency")

This bar chart of the top 10 words per source highlights differences in word usage across platforms. It informs the n-gram modeling by showing which terms dominate each source.
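One detail of the tokenization is worth checking when reading these counts: splitting on non-word characters also splits contractions at the apostrophe. The base-R illustration below shows the effect on a made-up phrase and one possible alternative pattern that keeps straight apostrophes inside tokens. The alternative is an assumption for illustration only; it also discards digits and non-ASCII letters, and it does not handle curly apostrophes.

```r
phrase <- tolower("Don't stop")

# Splitting on runs of non-word characters breaks the contraction apart
split_nonword <- strsplit(phrase, "\\W+")[[1]]
split_nonword
# "don" "t" "stop"

# Alternative (illustrative): split on anything except a-z and the apostrophe
split_keep_apostrophe <- strsplit(phrase, "[^a-z']+")[[1]]
split_keep_apostrophe
# "don't" "stop"
```

With the first pattern, fragments such as "don" may no longer match contraction entries in a stop-word list and can therefore appear among the counted words.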

Findings and Observations

Summary Statistics: In the samples, blog entries average the most words per line (about 41.7), followed by news articles (about 33.7). Twitter entries are the shortest (about 12.9), consistent with the platform’s character limit.
Line Length Distribution: Twitter line lengths are concentrated at the short end, whereas Blogs and News lines vary much more widely. This information is useful for tailoring the model to each data source.
Common Words: The most frequent words differ across datasets, with Twitter showing more colloquial terms and News articles more formal language. These differences may affect prediction accuracy, depending on how closely a user’s input resembles each source.

Plan for Prediction Algorithm and Shiny App

Prediction Algorithm

Our strategy for the prediction algorithm involves:

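Text Preprocessing: Convert text to lowercase and remove punctuation. Stop words were filtered out for the frequency analysis, but because they are often the most likely next word, they will probably need to be retained for prediction. We may also use stemming or lemmatization to help the model generalize.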
Tokenization and N-grams: Construct n-grams (e.g., bigrams, trigrams) and count their frequencies to identify common word sequences that provide context for the next word.
Model Selection: A Markov chain or n-gram model will likely be used to predict the next word from the preceding words. Backoff and smoothing techniques will help handle word combinations not seen in the training data and improve prediction accuracy.
Performance Considerations: Managing memory and runtime will be critical. We will optimize memory usage to ensure efficient performance in the Shiny app.
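The steps above can be sketched in a few base-R functions. This is a minimal illustration under stated assumptions, not the final algorithm. All function names (`tokenize()`, `make_ngrams()`, `count_ngrams()`, `predict_next()`) are hypothetical, and the toy corpus is made up. The backoff is a plain count-based fallback from trigrams to bigrams to single words, without the discounting used in methods such as Katz backoff. Stop words are kept in this sketch, and the tokenizer drops digits and non-ASCII letters for simplicity.

```r
# Lowercase, replace everything except a-z, apostrophes and spaces, then split
tokenize <- function(text) {
        text <- gsub("[^a-z' ]+", " ", tolower(text))
        tokens <- strsplit(trimws(text), "\\s+")[[1]]
        tokens[tokens != ""]
}

# Turn a token vector into n-gram strings ("w1 w2 ... wn")
make_ngrams <- function(tokens, n) {
        if (length(tokens) < n) return(character(0))
        idx <- seq_len(length(tokens) - n + 1)
        vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
               character(1))
}

# Count n-grams over all lines, most frequent first
count_ngrams <- function(lines, n) {
        grams <- unlist(lapply(lines, function(l) make_ngrams(tokenize(l), n)))
        sort(table(grams), decreasing = TRUE)
}

# Predict up to k next words: try the two-word context in the trigram table,
# back off to the one-word context in the bigram table, then to unigrams
predict_next <- function(input, tri, bi, uni, k = 3) {
        tokens <- tokenize(input)
        for (ctx_len in c(2, 1)) {
                if (length(tokens) >= ctx_len) {
                        ctx <- paste(tail(tokens, ctx_len), collapse = " ")
                        counts <- if (ctx_len == 2) tri else bi
                        hits <- counts[startsWith(names(counts), paste0(ctx, " "))]
                        if (length(hits) > 0) {
                                # hits stay sorted by count; keep only the last word
                                return(sub(".* ", "", names(head(hits, k))))
                        }
                }
        }
        names(head(uni, k))  # no context matched: most frequent single words
}

# Toy corpus (made up for illustration)
toy_lines <- c("I love natural language processing",
               "I love natural language models",
               "I love natural language processing a lot",
               "natural language processing is fun")
uni <- count_ngrams(toy_lines, 1)
bi  <- count_ngrams(toy_lines, 2)
tri <- count_ngrams(toy_lines, 3)

predict_next("natural language", tri, bi, uni, k = 1)   # "processing"
predict_next("unknown language", tri, bi, uni, k = 1)   # backs off to the bigram table
```

Scanning the n-gram names with `startsWith()` is simple but linear in the table size. For the full corpora, a keyed lookup (for example, a table indexed by context) would likely be needed to meet the response-time goals of the app.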

Shiny App

The Shiny app will provide:

Input Field: Allow users to type a phrase or sentence.
Prediction Display: Show predictions for the next word as the user types.
User Interface: Offer a simple and intuitive design accessible to non-technical users.
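A minimal skeleton of such an app might look like the sketch below. It is illustrative only: the input and output ids (`phrase`, `suggestions`) are made up, and `predict_next_stub()` is a hypothetical placeholder that returns fixed words where the real app would call the prediction model.

```r
library(shiny)

# Hypothetical placeholder: a real app would query the n-gram model here
predict_next_stub <- function(phrase) {
        if (!nzchar(trimws(phrase))) return(character(0))
        c("the", "and", "to")
}

ui <- fluidPage(
        titlePanel("Next-Word Prediction"),
        textInput("phrase", "Type a phrase:", value = ""),
        h4("Suggested next words"),
        textOutput("suggestions")
)

server <- function(input, output, session) {
        # Re-evaluated automatically whenever the text input changes
        output$suggestions <- renderText({
                preds <- predict_next_stub(input$phrase)
                if (length(preds) == 0) {
                        "Start typing to see suggestions."
                } else {
                        paste(preds, collapse = " | ")
                }
        })
}

# Build the app object; launch it interactively with runApp(app)
app <- shinyApp(ui = ui, server = server)
```

Because the output is reactive, suggestions update as the user types without a submit button, which matches the "as the user types" behavior described above.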

Conclusion

This report provides an initial exploratory analysis of text data from Twitter, Blogs, and News sources, establishing a foundation for developing a predictive text model. Future work will include refining the prediction algorithm, optimizing it for performance, and deploying it as an interactive Shiny app. Key challenges anticipated are managing response times and handling uncommon words, which we will address through model optimization and algorithm testing.

Exploratory Analysis and Roadmap for Predictive Text Model Using Social Media and News Data

Artem Paprocki

November 07, 2024