1. Introduction

This report documents the initial exploratory analysis of text data for building a text prediction application. The goal is to create an app that suggests the next word as users type, similar to smartphone keyboard predictions.

2. Data Loading and Overview

The data consists of three English text files from SwiftKey:

# Set working directory
setwd("C:/Users/purni/Desktop/Coursera-SwiftKey/final/en_US")

# Load the data files
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE, skipNul = TRUE)

cat("✅ Data successfully loaded!\n")
## ✅ Data successfully loaded!

Files successfully loaded:

  - en_US.blogs.txt
  - en_US.news.txt
  - en_US.twitter.txt

3. Basic Summary Statistics

# Calculate basic statistics
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Total_Words = c(
    sum(sapply(strsplit(blogs, "\\s+"), length)),
    sum(sapply(strsplit(news, "\\s+"), length)),
    sum(sapply(strsplit(twitter, "\\s+"), length))
  ),
  Avg_Characters_Per_Line = round(c(
    mean(nchar(blogs)),
    mean(nchar(news)),
    mean(nchar(twitter))
  ), 1)
)

# Display the table
knitr::kable(summary_data, caption = "Summary Statistics of Text Files")
Summary Statistics of Text Files

Source       Lines   Total_Words   Avg_Characters_Per_Line
Blogs       899288      37334131                     230.0
News       1010206      34371031                     201.2
Twitter    2360148      30373583                      68.7

4. Visualizations

library(ggplot2)

# Plot 1: Comparison of file sizes
ggplot(summary_data, aes(x = Source, y = Lines/1000, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Lines in Each Text Source",
       y = "Thousands of Lines",
       x = "Data Source") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

# Plot 2: Average line length comparison
ggplot(summary_data, aes(x = Source, y = Avg_Characters_Per_Line, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Characters Per Line",
       y = "Characters",
       x = "Data Source") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

5. Interesting Findings

From the initial analysis:

  1. Twitter has the most lines (2.36 million), but Blogs contribute the most words overall (about 37.3 million)
  2. Blog entries are much longer on average (230 characters per line) than tweets (69 characters)
  3. News articles sit in between, averaging roughly 201 characters per line
  4. The combined dataset contains over 100 million words, providing a rich source for training a prediction algorithm

# Show sample content from each source
cat("### Sample from Blogs:\n")
## ### Sample from Blogs:
cat(substr(blogs[1], 1, 100), "...\n\n")
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. ...
cat("### Sample from News:\n")
## ### Sample from News:
cat(substr(news[1], 1, 100), "...\n\n")
## He wasn't home alone, apparently. ...
cat("### Sample from Twitter:\n")
## ### Sample from Twitter:
cat(substr(twitter[1], 1, 100), "...")
## How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way  ...

6. Plans for Prediction Algorithm

Phase 1: Data Preparation

  • Clean text (remove special characters, numbers, convert to lowercase)
  • Tokenize text into words and sentences
  • Create n-gram models (1-gram, 2-gram, 3-gram, 4-gram); see the sketch below
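
As a rough illustration of this phase, the base R sketch below cleans a character vector and builds n-grams of a given order. The helper names clean_text() and make_ngrams() are hypothetical, and the final pipeline may use a text-mining package such as tm or quanteda instead.

# Minimal Phase 1 sketch (base R, illustration only): clean_text() and
# make_ngrams() are hypothetical helpers, not the final pipeline.
clean_text <- function(lines) {
  lines <- tolower(lines)                # convert to lowercase
  lines <- gsub("[0-9]+", " ", lines)    # remove numbers
  lines <- gsub("[^a-z' ]", " ", lines)  # remove special characters
  gsub("\\s+", " ", trimws(lines))       # collapse repeated whitespace
}

# Build n-grams of order n from a character vector of cleaned lines
make_ngrams <- function(lines, n = 2) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Example: first few bigrams from a small sample of the blog data
head(make_ngrams(clean_text(blogs[1:1000]), n = 2))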

Phase 2: Algorithm Development

  • Build frequency tables for n-grams
  • Implement a backoff strategy (if no 4-gram match is found, try the 3-gram, and so on); see the sketch after this list
  • Apply smoothing techniques to handle unseen words
  • Optimize for speed and memory efficiency
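
The sketch below illustrates the intended backoff logic under simple assumptions: it builds trigram and bigram frequency tables from a small sample using the Phase 1 helpers above, then falls back from the trigram table to the bigram table when a prefix is unseen. The function name predict_next() is hypothetical, and no smoothing is applied yet.

# Minimal Phase 2 sketch (illustration only, no smoothing): build frequency
# tables keyed by prefix, then back off from trigrams to bigrams.
build_freq <- function(lines, n) {
  ngrams <- make_ngrams(clean_text(lines), n)          # reuses the Phase 1 helpers
  tab <- sort(table(ngrams), decreasing = TRUE)
  data.frame(prefix = sub(" \\S+$", "", names(tab)),   # all but the last word
             word   = sub("^.* ", "", names(tab)),     # the last word
             count  = as.integer(tab),
             stringsAsFactors = FALSE)
}

sample_lines <- blogs[1:10000]            # small sample keeps the sketch fast
freq3 <- build_freq(sample_lines, 3)
freq2 <- build_freq(sample_lines, 2)

# Back off to the bigram table when the trigram prefix is unseen
predict_next <- function(phrase, k = 3) {
  words <- strsplit(clean_text(phrase), " ")[[1]]
  hits <- freq3$word[freq3$prefix == paste(tail(words, 2), collapse = " ")]
  if (length(hits) == 0) hits <- freq2$word[freq2$prefix == tail(words, 1)]
  head(hits, k)
}

predict_next("thanks for the")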

Phase 3: Shiny App Development

  • Create a user-friendly interface with a text input box (a minimal sketch follows this list)
  • Display top 3-5 word predictions
  • Add options for different n-gram models
  • Include sample text for testing
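
A bare-bones Shiny sketch of the planned interface is shown below; it simply wires a text input to the hypothetical predict_next() helper from the Phase 2 sketch and prints the top suggestions. The final app will add model options and sample text.

# Minimal Phase 3 sketch (illustration only): relies on the hypothetical
# predict_next() helper defined in the Phase 2 sketch above.
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction (prototype)"),
  textInput("phrase", "Type a phrase:", value = "thanks for the"),
  h4("Top suggestions"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase, k = 5)   # show the top 5 candidate words
  })
}

shinyApp(ui, server)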

7. Next Steps

  1. Weeks 1-2: Complete text cleaning and n-gram generation
  2. Weeks 3-4: Build and test the prediction algorithm
  3. Week 5: Develop Shiny app interface
  4. Week 6: Optimize performance and finalize app

8. Conclusion

This exploratory analysis confirms we have sufficient high-quality text data to build an effective prediction algorithm. The diversity of sources (blogs, news, tweets) will help create a robust model that handles various writing styles.

For the next milestone, I will present the cleaned n-gram models and a prototype prediction function.