Executive Summary

This report presents an exploratory analysis of text data from three sources (blogs, news articles, and Twitter) to support the development of a text prediction application. The goal is to build a Shiny app that predicts the next word a user is likely to type, similar to smartphone keyboard suggestions. This analysis examines the basic characteristics of our training data and outlines the approach for building the prediction algorithm.


1. Data Overview

The training data consists of three text files containing English-language text from different sources:

  • Blogs: Text from blog posts
  • News: Text from news articles
  • Twitter: Text from Twitter messages

These files provide a diverse corpus of natural language that will be used to train our prediction model.

1.1 Loading the Data

# Load required libraries
library(tidyverse)
library(stringr)
library(knitr)
library(kableExtra)

# Define data file paths (adjust path as needed)
data_dir <- "final/en_US"
if (!dir.exists(data_dir)) {
  # Try alternative common locations
  data_dir <- "../final/en_US"
  if (!dir.exists(data_dir)) {
    data_dir <- "."
  }
}

# File paths
blog_file <- file.path(data_dir, "en_US.blogs.txt")
news_file <- file.path(data_dir, "en_US.news.txt")
twitter_file <- file.path(data_dir, "en_US.twitter.txt")

# Function to read file and return basic info
read_file_info <- function(file_path, source_name) {
  if (!file.exists(file_path)) {
    return(data.frame(
      Source = source_name,
      File_Exists = "No",
      Lines = NA,
      Characters = NA,
      Words = NA,
      Avg_Chars_Per_Line = NA,
      Avg_Words_Per_Line = NA
    ))
  }
  
  # Read file; skipNul avoids warnings and truncated lines caused by
  # embedded nul characters (warn = FALSE suppresses incomplete-final-line warnings)
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Calculate statistics
  num_lines <- length(lines)
  num_chars <- sum(nchar(lines))
  num_words <- sum(str_count(lines, "\\S+"))
  avg_chars_per_line <- round(num_chars / num_lines, 2)
  avg_words_per_line <- round(num_words / num_lines, 2)
  
  return(data.frame(
    Source = source_name,
    File_Exists = "Yes",
    Lines = num_lines,
    Characters = num_chars,
    Words = num_words,
    Avg_Chars_Per_Line = avg_chars_per_line,
    Avg_Words_Per_Line = avg_words_per_line
  ))
}

# Read all files
blog_info <- read_file_info(blog_file, "Blogs")
news_info <- read_file_info(news_file, "News")
twitter_info <- read_file_info(twitter_file, "Twitter")

# Combine into summary table
file_summary <- rbind(blog_info, news_info, twitter_info)

1.2 Basic File Statistics

The following table provides a summary of the three data files:

Summary Statistics for Training Data Files

Source    File Exists       Lines    Characters        Words   Avg Chars/Line   Avg Words/Line
Blogs     Yes             899,288   206,824,505   37,334,131           229.99            41.52
News      Yes           1,010,242   203,223,159   34,372,530           201.16            34.02
Twitter   Yes           2,360,148   162,096,031   30,373,543            68.68            12.87

Key Observations:

  • All three files loaded successfully
  • The files contain substantial amounts of text, suitable for training
  • Blog lines are the longest on average (most characters and words per line)
  • Twitter messages are the shortest, as expected given the platform’s character limit
  • News articles fall in between


2. Detailed Data Analysis

2.1 Line Length Distribution

To understand the structure of our data, we examine the distribution of line lengths (in characters) for each source:

# Function to analyze line lengths
analyze_line_lengths <- function(file_path, source_name, sample_size = 10000) {
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), Length = numeric()))
  }
  
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  
  line_lengths <- nchar(lines)
  
  return(data.frame(
    Source = rep(source_name, length(line_lengths)),
    Length = line_lengths
  ))
}

# Analyze all files; analyze_line_lengths() already returns an empty data
# frame for a missing file, so no separate existence checks are needed
set.seed(1234)  # make the random samples reproducible
blog_lengths    <- analyze_line_lengths(blog_file, "Blogs", 10000)
news_lengths    <- analyze_line_lengths(news_file, "News", 10000)
twitter_lengths <- analyze_line_lengths(twitter_file, "Twitter", 10000)

# Combine data
all_lengths <- rbind(blog_lengths, news_lengths, twitter_lengths)

2.2 Histogram: Line Length Distribution
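
The histogram can be reproduced with ggplot2 (loaded via tidyverse above). This is a minimal sketch using the all_lengths data frame assembled in section 2.1; the bin count and log scale are illustrative choices.

# Faceted histogram of line lengths by source
ggplot(all_lengths, aes(x = Length, fill = Source)) +
  geom_histogram(bins = 50, show.legend = FALSE) +
  scale_x_log10() +                        # log scale tames the long tails
  facet_wrap(~ Source, scales = "free_y") +
  labs(title = "Distribution of Line Lengths by Source",
       x = "Characters per line (log scale)",
       y = "Number of lines")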

Observations:

  • Blogs: a wide range of line lengths, with many long lines (typical of blog posts)
  • News: moderate line lengths with a relatively consistent distribution
  • Twitter: concentrated at short lengths due to the character limit, with a long tail

2.3 Word Count Distribution

# Function to analyze word counts per line
analyze_word_counts <- function(file_path, source_name, sample_size = 10000) {
  if (!file.exists(file_path)) {
    return(data.frame(Source = character(), WordCount = numeric()))
  }
  
  lines <- readLines(file_path, encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
  
  # Sample if file is very large
  if (length(lines) > sample_size) {
    lines <- sample(lines, sample_size)
  }
  
  word_counts <- str_count(lines, "\\S+")
  
  return(data.frame(
    Source = rep(source_name, length(word_counts)),
    WordCount = word_counts
  ))
}

# Analyze all files; as before, missing files yield empty data frames
set.seed(1234)  # reproducible sampling
blog_words    <- analyze_word_counts(blog_file, "Blogs", 10000)
news_words    <- analyze_word_counts(news_file, "News", 10000)
twitter_words <- analyze_word_counts(twitter_file, "Twitter", 10000)

# Combine data
all_words <- rbind(blog_words, news_words, twitter_words)

2.4 Histogram: Words Per Line Distribution
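
A matching histogram of words per line can be drawn from the all_words data frame built in section 2.3; a minimal sketch with the same illustrative choices as above:

# Faceted histogram of word counts per line by source
ggplot(all_words, aes(x = WordCount, fill = Source)) +
  geom_histogram(bins = 50, show.legend = FALSE) +
  facet_wrap(~ Source, scales = "free") +
  labs(title = "Distribution of Words per Line by Source",
       x = "Words per line",
       y = "Number of lines")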


3. Data Quality and Characteristics

3.1 Summary Statistics Table

Detailed Summary Statistics by Source (computed on the 10,000-line samples)

Source    Mean Chars   Median Chars   Max Chars   Min Chars   Mean Words   Median Words   Max Words
Blogs         227.66            152        2690           1        41.61             28         602
News          201.61            184        1505           1        34.14             31         289
Twitter        68.86             65         140           5        12.82             12          36
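
These figures can be reproduced from the sampled data frames built in section 2; a sketch (column names match the table above):

# Summary statistics computed on the 10,000-line samples
length_stats <- all_lengths %>%
  group_by(Source) %>%
  summarise(Mean_Chars   = round(mean(Length), 2),
            Median_Chars = median(Length),
            Max_Chars    = max(Length),
            Min_Chars    = min(Length))

word_stats <- all_words %>%
  group_by(Source) %>%
  summarise(Mean_Words   = round(mean(WordCount), 2),
            Median_Words = median(WordCount),
            Max_Words    = max(WordCount))

summary_stats <- left_join(length_stats, word_stats, by = "Source")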

3.2 Interesting Findings

  1. Data Volume: The corpus contains millions of words across all three sources, providing a rich foundation for training a prediction model.

  2. Source Diversity: Each source has distinct characteristics:

    • Blogs: Longer, more conversational text
    • News: Formal, structured language
    • Twitter: Short, informal messages with abbreviations and hashtags

  3. Line Length Variation: The wide variation in line lengths suggests we’ll need to handle different text patterns in our prediction algorithm.

  4. Data Quality: The data appears clean and well-formatted, which will simplify preprocessing steps.


4. Plans for Prediction Algorithm

4.1 Algorithm Approach

The text prediction algorithm will use an n-gram model approach:

  1. N-gram Extraction: Break down the text into sequences of 1-4 words (unigrams, bigrams, trigrams, and 4-grams)

  2. Frequency Analysis: Count how often each n-gram appears in the training data

  3. Prediction Strategy:

    • When a user types text, the algorithm looks at the last one to three words
    • It matches these against the stored n-grams
    • It suggests the most likely next words based on frequency

  4. Smoothing: Use techniques such as backoff or interpolation to handle cases where a specific n-gram hasn’t been seen before (see the sketch after this list)
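
As an illustration of the backoff idea, below is a minimal "stupid backoff" lookup sketch. It assumes hypothetical objects that are not built in this report: a list ngram_tables indexed by context length, where element n is a frequency table with columns prefix (the n preceding words), word, and count, plus a character vector top_unigrams of the most frequent words overall.

# Back off from the longest matching context to shorter ones
predict_next_word <- function(input, ngram_tables, top_unigrams, k = 3) {
  tokens <- str_split(str_to_lower(str_trim(input)), "\\s+")[[1]]
  for (n in rev(seq_along(ngram_tables))) {     # longest context first
    if (length(tokens) >= n) {
      context <- paste(tail(tokens, n), collapse = " ")
      matches <- ngram_tables[[n]] %>%
        filter(prefix == context) %>%
        arrange(desc(count)) %>%
        head(k)
      if (nrow(matches) > 0) return(matches$word)
    }
  }
  head(top_unigrams, k)   # no context matched: most frequent words overall
}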

4.2 Shiny App Features

The Shiny application will provide the following (a minimal skeleton appears after this list):

  1. Text Input Box: Users can type text and see predictions in real-time

  2. Prediction Display: Show the top 3-5 most likely next words as buttons or suggestions

  3. Visualization: Optional charts showing prediction confidence or n-gram frequencies

  4. User-Friendly Interface: Clean, intuitive design that works on desktop and mobile devices
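
A minimal skeleton of such an app is sketched below, assuming the hypothetical predict_next_word() from section 4.1 and its ngram_tables / top_unigrams objects are available in the app's environment.

library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type your text:", width = "100%"),
  uiOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderUI({
    req(nzchar(input$user_text))   # wait until the user has typed something
    words <- predict_next_word(input$user_text, ngram_tables, top_unigrams)
    # Render each predicted word as a clickable button
    lapply(words, function(w) actionButton(paste0("btn_", w), w))
  })
}

# shinyApp(ui, server)   # uncomment to run locally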

4.3 Implementation Steps

  1. Data Preprocessing: Clean and tokenize the text data
  2. N-gram Generation: Create frequency tables for different n-gram sizes (a sketch follows this list)
  3. Model Optimization: Balance accuracy with app performance (speed and memory)
  4. Shiny App Development: Build the user interface and server logic
  5. Testing and Refinement: Test with various inputs and optimize predictions
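
For step 2, a self-contained sketch of n-gram frequency-table construction using only stringr and dplyr (both loaded above); clean_lines stands in for the preprocessed text produced by step 1.

# Build a frequency table of n-grams, split into prefix + final word
build_ngram_table <- function(clean_lines, n) {
  tokens_per_line <- str_split(str_to_lower(str_trim(clean_lines)), "\\s+")
  ngrams <- unlist(lapply(tokens_per_line, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  tibble(ngram = ngrams) %>%
    count(ngram, name = "count", sort = TRUE) %>%
    mutate(prefix = str_replace(ngram, "\\s*\\S+$", ""),  # all but the last word
           word   = str_extract(ngram, "\\S+$"))          # the last word
}

# Example: a trigram table from preprocessed lines
# trigrams <- build_ngram_table(clean_lines, 3)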

5. Next Steps

  1. Complete data preprocessing and n-gram extraction
  2. Build and test the prediction algorithm
  3. Develop the Shiny app interface
  4. Optimize for performance and user experience
  5. Deploy the application for public use

Conclusion

This exploratory analysis demonstrates that we have successfully loaded and examined the training data. The three text sources provide a diverse corpus suitable for building a text prediction model. The data shows expected patterns (longer blog posts, shorter tweets) and contains sufficient volume for training.

The next phase will focus on implementing the n-gram prediction algorithm and building the Shiny application to provide an interactive text prediction experience.


Report generated on 2025-11-29