Executive Summary

This report presents an exploratory analysis of text data from three sources (blogs, news articles, and Twitter) to support the development of a text prediction application. The goal is to build a Shiny app that predicts the next word a user is likely to type, similar to smartphone keyboard suggestions. This analysis examines the basic characteristics of our training data and outlines the approach for building the prediction algorithm.


1. Data Overview

The training data consists of three text files containing English-language text from different sources:

  • Blogs: Text from blog posts
  • News: Text from news articles
  • Twitter: Text from Twitter messages

These files provide a diverse corpus of natural language that will be used to train our prediction model.

1.1 Loading the Data

# Load required libraries
library(tidyverse)
library(stringr)
library(knitr)
library(kableExtra)

# Define data file paths (adjust path as needed)
data_dir <- "final/en_US"
if (!dir.exists(data_dir)) {
  # Try alternative common locations
  data_dir <- "../final/en_US"
  if (!dir.exists(data_dir)) {
    data_dir <- "."
  }
}

# File paths
blog_file <- file.path(data_dir, "en_US.blogs.txt")
news_file <- file.path(data_dir, "en_US.news.txt")
twitter_file <- file.path(data_dir, "en_US.twitter.txt")

# Function to read file and return basic info
read_file_info <- function(file_path, source_name) {
  if (!file.exists(file_path)) {
    return(data.frame(
      Source = source_name,
      File_Exists = "No",
      Lines = NA,
      Characters = NA,
      Words = NA,
      Avg_Chars_Per_Line = NA,
      Avg_Words_Per_Line = NA
    ))
  }
  
  # Read file; skipNul avoids warnings and truncated lines caused by
  # embedded nul characters (warn = FALSE suppresses incomplete-final-line warnings)
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Calculate statistics
  num_lines <- length(lines)
  num_chars <- sum(nchar(lines))
  num_words <- sum(str_count(lines, "\\S+"))
  avg_chars_per_line <- round(num_chars / num_lines, 2)
  avg_words_per_line <- round(num_words / num_lines, 2)
  
  return(data.frame(
    Source = source_name,
    File_Exists = "Yes",
    Lines = num_lines,
    Characters = num_chars,
    Words = num_words,
    Avg_Chars_Per_Line = avg_chars_per_line,
    Avg_Words_Per_Line = avg_words_per_line
  ))
}

# Read all files
blog_info <- read_file_info(blog_file, "Blogs")
news_info <- read_file_info(news_file, "News")
twitter_info <- read_file_info(twitter_file, "Twitter")

# Combine into summary table
file_summary <- rbind(blog_info, news_info, twitter_info)

1.2 Basic File Statistics

The following table provides a summary of the three data files:

Summary Statistics for Training Data Files

Source    File Exists       Lines    Characters        Words   Avg Chars/Line   Avg Words/Line
Blogs     Yes             899,288   206,824,505   37,334,131           229.99            41.52
News      Yes           1,010,242   203,223,159   34,372,530           201.16            34.02
Twitter   Yes           2,360,148   162,096,031   30,373,543            68.68            12.87

Key Observations:

  • All three files loaded successfully
  • The files contain substantial amounts of text, suitable for training
  • Blog lines are the longest on average (most characters and words per line)
  • Twitter messages are the shortest, as expected given the platform’s character limit
  • News articles fall in between


2. Detailed Data Analysis

2.1 Line Length Distribution

To understand the structure of our data, we examine the distribution of line lengths (in characters) for each source:

# Function to analyze line lengths
analyze_line_lengths <- function(file_path, source_name, sample_size = 10000) {
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), Length = numeric()))
  }
  
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  
  line_lengths <- nchar(lines)
  
  return(data.frame(
    Source = rep(source_name, length(line_lengths)),
    Length = line_lengths
  ))
}

# Analyze all files; analyze_line_lengths() already returns an empty data
# frame for a missing file, so no separate existence checks are needed
set.seed(1234)  # make the random samples reproducible
blog_lengths    <- analyze_line_lengths(blog_file, "Blogs", 10000)
news_lengths    <- analyze_line_lengths(news_file, "News", 10000)
twitter_lengths <- analyze_line_lengths(twitter_file, "Twitter", 10000)

# Combine data
all_lengths <- rbind(blog_lengths, news_lengths, twitter_lengths)

2.2 Histogram: Line Length Distribution
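
The histogram can be reproduced with ggplot2 (loaded via tidyverse above). This is a minimal sketch using the all_lengths data frame assembled in section 2.1; the bin count and log scale are illustrative choices.

# Faceted histogram of line lengths by source
ggplot(all_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(bins = 50, show.legend = FALSE) +
  scale_x_log10() +                        # log scale tames the long tails
  facet_wrap(~ Source, scales = "free_y") +
  labs(title = "Distribution of Line Lengths by Source",
       x = "Characters per line (log scale)",
       y = "Number of lines")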

Observations:

  • Blogs: a wide range of line lengths, with many long lines (typical of blog posts)
  • News: moderate line lengths with a relatively consistent distribution
  • Twitter: concentrated at short lengths due to the character limit, with a long tail

2.3 Word Count Distribution

# Function to analyze word counts per line
analyze_word_counts <- function(file_path, source_name, sample_size = 10000) {
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), WordCount = numeric()))
  }
  
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  
  word_counts <- str_count(lines, "\\S+")
  
  return(data.frame(
    Source = rep(source_name, length(word_counts)),
    WordCount = word_counts
  ))
}

# Analyze all files; as before, missing files yield empty data frames
set.seed(1234)  # reproducible sampling
blog_words    <- analyze_word_counts(blog_file, "Blogs", 10000)
news_words    <- analyze_word_counts(news_file, "News", 10000)
twitter_words <- analyze_word_counts(twitter_file, "Twitter", 10000)

# Combine data
all_words <- rbind(blog_words, news_words, twitter_words)

2.4 Histogram: Words Per Line Distribution
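
A matching histogram of words per line can be drawn from the all_words data frame built in section 2.3; a minimal sketch with the same illustrative choices as above:

# Faceted histogram of word counts per line by source
ggplot(all_words, aes(x = WordCount, fill = Source)) +
  geom_histogram(bins = 50, show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free") +
  labs(title = "Distribution of Words per Line by Source",
       x = "Words per line",
       y = "Number of lines")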


3. Data Quality and Characteristics

3.1 Summary Statistics Table

Detailed Summary Statistics by Source (computed on the 10,000-line samples)

Source    Mean Chars   Median Chars   Max Chars   Min Chars   Mean Words   Median Words   Max Words
Blogs         227.66            152        2690           1        41.61             28         602
News          201.61            184        1505           1        34.14             31         289
Twitter        68.86             65         140           5        12.82             12          36
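
These figures can be reproduced from the sampled data frames built in section 2; a sketch (column names match the table above):

# Summary statistics computed on the 10,000-line samples
length_stats <- all_lengths %>%
  group_by(Source) %>%
  summarise(Mean_Chars   = round(mean(Length), 2),
            Median_Chars = median(Length),
            Max_Chars    = max(Length),
            Min_Chars    = min(Length))

word_stats <- all_words %>%
  group_by(Source) %>%
  summarise(Mean_Words   = round(mean(WordCount), 2),
            Median_Words = median(WordCount),
            Max_Words    = max(WordCount))

summary_stats <- left_join(length_stats, word_stats, by = "Source")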

3.2 Interesting Findings

  1. Data Volume: The corpus contains millions of words across all three sources, providing a rich foundation for training a prediction model.

  2. Source Diversity: Each source has distinct characteristics:

    • Blogs: Longer, more conversational text
    • News: Formal, structured language
    • Twitter: Short, informal messages with abbreviations and hashtags

  3. Line Length Variation: The wide variation in line lengths suggests we’ll need to handle different text patterns in our prediction algorithm.

  4. Data Quality: The data appears clean and well-formatted, which will simplify preprocessing steps.


4. Plans for Prediction Algorithm

4.1 Algorithm Approach

The text prediction algorithm will use an n-gram model approach:

  1. N-gram Extraction: Break down the text into sequences of 1-4 words (unigrams, bigrams, trigrams, and 4-grams)

  2. Frequency Analysis: Count how often each n-gram appears in the training data

  3. Prediction Strategy:

    • When a user types text, the algorithm looks at the last one to three words
    • It matches these against the stored n-grams
    • It suggests the most likely next words based on frequency

  4. Smoothing: Use techniques such as backoff or interpolation to handle cases where a specific n-gram hasn’t been seen before (see the sketch after this list)
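
As an illustration of the backoff idea, below is a minimal "stupid backoff" lookup sketch. It assumes hypothetical objects that are not built in this report: a list ngram_tables indexed by context length, where element n is a frequency table with columns prefix (the n preceding words), word, and count, plus a character vector top_unigrams of the most frequent words overall.

# Back off from the longest matching context to shorter ones
predict_next_word <- function(input, ngram_tables, top_unigrams, k = 3) {
  tokens <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  for (n in rev(seq_along(ngram_tables))) {     # longest context first
    if (length(tokens) >= n) {
      context <- paste(tail(tokens, n), collapse = " ")
      matches <- ngram_tables[[n]] %>%
        filter(prefix == context) %>%
        arrange(desc(count)) %>%
        head(k)
      if (nrow(matches) > 0) return(matches$word)
    }
  }
  head(top_unigrams, k)   # no context matched: most frequent words overall
}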

4.2 Shiny App Features

The Shiny application will provide the following (a minimal skeleton appears after this list):

  1. Text Input Box: Users can type text and see predictions in real-time

  2. Prediction Display: Show the top 3-5 most likely next words as buttons or suggestions

  3. Visualization: Optional charts showing prediction confidence or n-gram frequencies

  4. User-Friendly Interface: Clean, intuitive design that works on desktop and mobile devices
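
A minimal skeleton of such an app is sketched below, assuming the hypothetical predict_next_word() from section 4.1 and its ngram_tables / top_unigrams objects are available in the app's environment.

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type your text:", width = "100%"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    req(nzchar(input$user_text))   # wait until the user has typed something
    words <- predict_next_word(input$user_text, ngram_tables, top_unigrams)
    # Render each predicted word as a clickable button
    lapply(words, function(w) actionButton(paste0("btn_", w), w))
  })
}

# shinyApp(ui, server)   # uncomment to run locally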

4.3 Implementation Steps

  1. Data Preprocessing: Clean and tokenize the text data
  2. N-gram Generation: Create frequency tables for different n-gram sizes (a sketch follows this list)
  3. Model Optimization: Balance accuracy with app performance (speed and memory)
  4. Shiny App Development: Build the user interface and server logic
  5. Testing and Refinement: Test with various inputs and optimize predictions
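
For step 2, a self-contained sketch of n-gram frequency-table construction using only stringr and dplyr (both loaded above); clean_lines stands in for the preprocessed text produced by step 1.

# Build a frequency table of n-grams, split into prefix + final word
build_ngram_table <- function(clean_lines, n) {
  tokens_per_line <- str_split(str_to_lower(str_trim(clean_lines)), "\\s+")
  ngrams <- unlist(lapply(tokens_per_line, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tibble(ngram = ngrams) %>%
    count(ngram, name = "count", sort = TRUE) %>%
    mutate(prefix = str_replace(ngram, "\\s*\\S+$", ""),  # all but the last word
           word   = str_extract(ngram, "\\S+$"))          # the last word
}

# Example: a trigram table from preprocessed lines
# trigrams <- build_ngram_table(clean_lines, 3)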

5. Next Steps

  1. Complete data preprocessing and n-gram extraction
  2. Build and test the prediction algorithm
  3. Develop the Shiny app interface
  4. Optimize for performance and user experience
  5. Deploy the application for public use

Conclusion

This exploratory analysis demonstrates that we have successfully loaded and examined the training data. The three text sources provide a diverse corpus suitable for building a text prediction model. The data shows expected patterns (longer blog posts, shorter tweets) and contains sufficient volume for training.

The next phase will focus on implementing the n-gram prediction algorithm and building the Shiny application to provide an interactive text prediction experience.


Report generated on 2025-11-29