Introduction

This report presents an exploratory analysis of text data sourced from three platforms: Twitter, Blogs, and News articles. Predictive text models, such as the one we aim to develop, improve the user experience by offering autocomplete suggestions that speed up digital communication. The overarching goal of this project is to use these datasets to build a predictive text algorithm that will be deployed in a Shiny app. The algorithm will predict the next word(s) from the user’s input, using techniques from natural language processing (NLP) to capture language patterns across diverse text sources. This document highlights key features of the data, presents initial findings, and outlines a preliminary plan for developing the prediction algorithm and Shiny app.

Data Loading and Summary

The dataset consists of English-language text corpora from Twitter, Blogs, and News sources. Each source has its own linguistic patterns, which provides useful diversity for building a robust prediction model. Twitter posts, for instance, are brief because of the platform’s character limit, whereas Blogs and News entries are longer and more formal.

The data were loaded into R with read_lines() from the readr package, which reads each file into a character vector with one element per line. read_lines() assumes UTF-8 encoding by default, which helps preserve non-ASCII characters during analysis.

library(readr)
library(tidyverse)
library(knitr)

file_path_twitter <- "Coursera-SwiftKey/final/en_US/en_US.twitter.txt"
file_path_blogs <- "Coursera-SwiftKey/final/en_US/en_US.blogs.txt"
file_path_news <- "Coursera-SwiftKey/final/en_US/en_US.news.txt"

twitter_data <- read_lines(file_path_twitter)
blogs_data <- read_lines(file_path_blogs)
news_data <- read_lines(file_path_news)

data_summary <- tibble(
        Source = c("Twitter", "Blogs", "News"),
        Lines = c(length(twitter_data), length(blogs_data), length(news_data)),
        Words = c(sum(str_count(twitter_data, "\\S+")),
                  sum(str_count(blogs_data, "\\S+")),
                  sum(str_count(news_data, "\\S+"))
                  )
        )

kable(data_summary, caption = "Summary of Data Sources")

Table: Summary of Data Sources

| Source  |   Lines |    Words |
|:--------|--------:|---------:|
| Twitter | 2360148 | 30373543 |
| Blogs   |  899288 | 37334131 |
| News    | 1010242 | 34372530 |

The size of these datasets affects both processing time and memory use. We will rely on sampling and caching to keep memory usage manageable while retaining enough data for accurate predictions.
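One way to implement the caching step is to store each loaded corpus as an .rds file so that later sessions skip parsing the raw text. The following is a minimal sketch under stated assumptions: the helper name cached_read() and the cache directory are hypothetical and not part of the original analysis, and base readLines() stands in for read_lines() only to keep the example free of package dependencies.

```r
# Hypothetical helper: read a text file once, then reuse a cached .rds copy.
# Assumption: the raw file does not change between runs; delete the cache
# file to force a fresh read.
cached_read <- function(file_path, cache_dir = "cache") {
        dir.create(cache_dir, showWarnings = FALSE, recursive = TRUE)
        cache_file <- file.path(cache_dir, paste0(basename(file_path), ".rds"))
        if (file.exists(cache_file)) {
                # Cache hit: load the stored character vector
                return(readRDS(cache_file))
        }
        # Cache miss: read the raw text and store it for next time
        lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE)
        saveRDS(lines, cache_file)
        lines
}
```

A call such as cached_read(file_path_twitter) would then be slow only on the first run.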

Data Sampling

Because each dataset is large, we drew a random sample of 10,000 lines per source to limit memory usage and speed up processing. This corresponds to roughly 0.4% of the Twitter lines and about 1% of the Blogs and News lines. The sample size was chosen to keep exploratory analysis fast while preserving enough linguistic diversity to inform model development.

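# Fix the random seed so the samples are reproducible
set.seed(52)

# Read a file and draw n lines at random, without replacement
read_sample <- function(file_path, n = 10000) {
        lines <- read_lines(file_path)
        sample(lines, n)
        }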

twitter_sample <- read_sample(file_path_twitter)
blogs_sample <- read_sample(file_path_blogs)
news_sample <- read_sample(file_path_news)

Exploratory Analysis

Summary Statistics

We calculated basic statistics for each sample, including total line count, total word count, and average words per line. These metrics offer a foundational understanding of the structure of each data source, providing insights into typical content length and word density.

summary_stats <- function(text_sample) {
        tibble(
                Total_Lines = length(text_sample),
                Total_Words = sum(str_count(text_sample, "\\S+")),
                Avg_Words_Per_Line = mean(str_count(text_sample, "\\S+"))
                )
        }

summary_table <- bind_rows(
        Twitter = summary_stats(twitter_sample),
        Blogs = summary_stats(blogs_sample),
        News = summary_stats(news_sample),
        .id = "Source"
        )

kable(summary_table, caption = "Summary Statistics for Sampled Data")
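
Table: Summary Statistics for Sampled Data

| Source  | Total_Lines | Total_Words | Avg_Words_Per_Line |
|:--------|------------:|------------:|-------------------:|
| Twitter |       10000 |      128529 |            12.8529 |
| Blogs   |       10000 |      417160 |            41.7160 |
| News    |       10000 |      336796 |            33.6796 |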

Distribution of Line Lengths

To understand line length distribution across datasets, we visualized the number of words per line in each source. Twitter’s distribution skews towards shorter lines due to character constraints, while Blogs and News data exhibit broader variations in line lengths, reflecting their content’s longer and more detailed nature.

line_lengths <- data.frame(
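        # Repeat each label once per sampled line rather than hard-coding 10000
        Source = rep(c("Twitter", "Blogs", "News"),
                     times = c(length(twitter_sample), length(blogs_sample), length(news_sample))),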
        Line_Length = c(str_count(twitter_sample, "\\S+"),
                        str_count(blogs_sample, "\\S+"),
                        str_count(news_sample, "\\S+"))
        )

ggplot(line_lengths, aes(x = Line_Length, fill = Source)) +
        geom_histogram(binwidth = 5, position = "dodge") +
        facet_wrap(~ Source, scales = "free_y") +
        labs(title = "Distribution of Line Lengths", x = "Words per Line", y = "Frequency")

The histograms show platform-specific differences: Twitter lines are short because of the character limit, while Blogs and News lines vary much more in length. These differences inform how text should be chunked and how the predictive model should handle inputs of varying length.
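To turn the visual impression into concrete cutoffs, one option is to tabulate quantiles of words per line for each source. This is only a sketch: the helper name length_quantiles() is hypothetical, and the toy vector stands in for the real samples.

```r
# Hypothetical helper: quantiles of words per line for one character vector.
length_quantiles <- function(text_sample, probs = c(0.5, 0.9, 0.99)) {
        # Count whitespace-delimited tokens, the same "\\S+" rule used above
        matches <- regmatches(text_sample, gregexpr("\\S+", text_sample))
        words_per_line <- lengths(matches)
        quantile(words_per_line, probs = probs)
}

# Toy input for illustration only; in the analysis this would be
# twitter_sample, blogs_sample, or news_sample
toy <- c("short line", "a slightly longer line of text", "one")
length_quantiles(toy)
```

Upper quantiles such as the 99th percentile could serve as a practical maximum chunk length per source.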

Word Frequency Analysis

To capture the most frequent words in each dataset, we calculated word frequencies after removing common English stop words and words shorter than three characters. Each dataset shows word usage patterns that reflect the typical style and content of its source. For instance, Twitter often features more colloquial terms, while News articles use more formal language.

library(tm)

top_words_df <- function(text_sample, n = 10) {
        words <- unlist(str_split(tolower(text_sample), "\\W+"))
        words <- words[!words %in% stopwords("en") & words != "" & nchar(words) > 2]
        # head() avoids NA entries when fewer than n distinct words remain
        word_freq <- head(sort(table(words), decreasing = TRUE), n)
        data.frame(word = names(word_freq), frequency = as.numeric(word_freq))
        }

twitter_top <- top_words_df(twitter_sample)
blogs_top <- top_words_df(blogs_sample)
news_top <- top_words_df(news_sample)

top_words <- bind_rows(
        Twitter = twitter_top, 
        Blogs = blogs_top, 
        News = news_top, 
        .id = "Source"
        )

ggplot(top_words, aes(x = reorder(word, frequency), y = frequency, fill = Source)) +
        geom_bar(stat = "identity", position = "dodge") +
        facet_wrap(~ Source, scales = "free") +
        coord_flip() +
        labs(title = "Top 10 Words by Frequency", x = "Words", y = "Frequency")

The bar chart of the top 10 words per source highlights differences in word usage across platforms and indicates which terms the n-gram model will encounter most often in each source.

Findings and Observations

  1. Summary Statistics: On average, blog entries have the most words per line (about 41.7), followed by news articles (about 33.7), with Twitter entries the shortest (about 12.9) because of the character limit.

  2. Line Length Distribution: The distribution of line lengths shows Twitter’s skew towards shorter lines, whereas Blogs and News data have wider variations in line lengths. This information is useful for tailoring the model to each data source.

  3. Common Words: Frequently used words differ across datasets, with Twitter displaying more colloquial terms and News articles featuring more formal language. These differences may affect the model’s prediction accuracy, depending on the kind of text a user types.

Plan for Prediction Algorithm and Shiny App

Prediction Algorithm

Our strategy for the prediction algorithm involves:

  • Text Preprocessing: Convert text to lowercase, remove punctuation, and filter out stop words. We may also use stemming or lemmatization to help the model generalize.

  • Tokenization and N-grams: Construct n-grams (e.g., bigrams, trigrams) to identify common word sequences and focus on n-grams that capture contextual nuances.

  • Model Selection: An n-gram model, which treats text as a Markov chain over words, will likely be used to predict the next word from the preceding words. Backoff to shorter n-grams, combined with smoothing, will help handle unseen word combinations and improve prediction accuracy.

  • Performance Considerations: Managing memory and runtime will be critical. We will optimize memory usage to ensure efficient performance in the Shiny app.
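The steps above can be sketched in base R as follows. This is an illustrative sketch, not the final algorithm: the function names tokenize(), ngram_table(), and predict_next() are hypothetical, the corpus is a toy string, and the backoff is the simplest possible kind (fall back to a shorter context when the longer one is unseen, with no probability discounting). Stop-word filtering and stemming are omitted to keep the example short.

```r
# Lowercase the text, drop punctuation (keeping apostrophes), split into words.
tokenize <- function(text) {
        tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
        tokens[tokens != ""]
}

# Frequency table of n-grams; each name is n words joined by single spaces.
ngram_table <- function(tokens, n) {
        if (length(tokens) < n) return(table(character(0)))
        starts <- seq_len(length(tokens) - n + 1)
        grams <- vapply(starts,
                        function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                        character(1))
        sort(table(grams), decreasing = TRUE)
}

# Predict the next word: try the longest available context, then back off.
# `tables` is a list where tables[[n]] holds the n-gram counts.
predict_next <- function(input, tables) {
        context <- tokenize(input)
        for (n in rev(seq_along(tables))) {
                # Last resort: the most frequent single word overall
                if (n == 1) return(names(tables[[1]])[1])
                if (length(context) < n - 1) next
                prefix <- paste(tail(context, n - 1), collapse = " ")
                is_hit <- startsWith(names(tables[[n]]), paste0(prefix, " "))
                hits <- tables[[n]][is_hit]
                if (length(hits) > 0) {
                        best <- names(hits)[which.max(hits)]
                        return(tail(strsplit(best, " ")[[1]], 1))
                }
        }
}

# Toy corpus for illustration only
corpus <- tokenize("The cat sat on the mat. The cat sat on the sofa. The dog ate the fish.")
tables <- lapply(1:3, function(n) ngram_table(corpus, n))

predict_next("the cat", tables)   # matched by a trigram
predict_next("my dog", tables)    # backs off to a bigram
predict_next("hello", tables)     # backs off to the most frequent word
```

Scanning n-gram names with startsWith() is linear in the table size, so a real implementation would likely need an indexed lookup structure to meet the memory and runtime goals listed above.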

Shiny App

The Shiny app will provide:

  1. Input Field: Allow users to type a phrase or sentence.

  2. Prediction Display: Show predictions for the next word as the user types.

  3. User Interface: Offer a simple and intuitive design accessible to non-technical users.
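A minimal skeleton for such an app might look like the sketch below. The input and output IDs ("phrase", "prediction") and the predict_next_stub() function are hypothetical placeholders; the stub would be replaced by the real prediction model once it exists.

```r
library(shiny)

# Hypothetical placeholder for the real model: returns a fixed word
# for any non-empty input and an empty string otherwise.
predict_next_stub <- function(input_text) {
        if (!nzchar(trimws(input_text))) return("")
        "the"
}

# Input field plus a text area that shows the current prediction
ui <- fluidPage(
        titlePanel("Next Word Prediction"),
        textInput("phrase", "Type a phrase:", value = ""),
        strong("Predicted next word:"),
        textOutput("prediction")
)

# renderText() re-runs whenever input$phrase changes, so the
# prediction updates as the user types
server <- function(input, output, session) {
        output$prediction <- renderText({
                predict_next_stub(input$phrase)
        })
}

app <- shinyApp(ui = ui, server = server)
# Launch interactively with: runApp(app)
```

Keeping the predictor behind a single function makes it straightforward to swap in the n-gram model without touching the interface code.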

Conclusion

This report provides an initial exploratory analysis of text data from Twitter, Blogs, and News sources, establishing a foundation for developing a predictive text model. Future work will include refining the prediction algorithm, optimizing it for performance, and deploying it as an interactive Shiny app. Key challenges anticipated are managing response times and handling uncommon words, which we will address through model optimization and algorithm testing.