1. Introduction

This report documents the initial exploratory analysis of text data for building a text prediction application. The goal is to create an app that suggests the next word as users type, similar to smartphone keyboard predictions.

2. Data Loading and Overview

The data consists of three English text files from SwiftKey:

# Set working directory
setwd("C:/Users/purni/Desktop/Coursera-SwiftKey/final/en_US")

# Load the data files
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE, skipNul = TRUE)

cat("✅ Data successfully loaded!\n")
## ✅ Data successfully loaded!

Files successfully loaded:

  - en_US.blogs.txt
  - en_US.news.txt
  - en_US.twitter.txt

3. Basic Summary Statistics

# Calculate basic statistics
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Total_Words = c(
    sum(sapply(strsplit(blogs, "\\s+"), length)),
    sum(sapply(strsplit(news, "\\s+"), length)),
    sum(sapply(strsplit(twitter, "\\s+"), length))
  ),
  Avg_Characters_Per_Line = round(c(
    mean(nchar(blogs)),
    mean(nchar(news)),
    mean(nchar(twitter))
  ), 1)
)

# Display the table
knitr::kable(summary_data, caption = "Summary Statistics of Text Files")
Summary Statistics of Text Files

Source       Lines   Total_Words   Avg_Characters_Per_Line
Blogs       899288      37334131                     230.0
News       1010206      34371031                     201.2
Twitter    2360148      30373583                      68.7

4. Visualizations

library(ggplot2)

# Plot 1: Comparison of file sizes
ggplot(summary_data, aes(x = Source, y = Lines/1000, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Number of Lines in Each Text Source",
       y = "Thousands of Lines",
       x = "Data Source") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set2")

# Plot 2: Average line length comparison
ggplot(summary_data, aes(x = Source, y = Avg_Characters_Per_Line, fill = Source)) +
  geom_bar(stat = "identity") +
  labs(title = "Average Characters Per Line",
       y = "Characters",
       x = "Data Source") +
  theme_minimal() +
  scale_fill_brewer(palette = "Set3")

5. Interesting Findings

From the initial analysis:

  1. Twitter has the most lines (2.36 million), but Blogs contribute the most words overall (about 37.3 million)
  2. Blog entries are much longer on average (230 characters per line) than tweets (69 characters)
  3. News articles sit in between, averaging roughly 201 characters per line
  4. The combined dataset contains over 100 million words, providing a rich source for training a prediction algorithm

# Show sample content from each source
cat("### Sample from Blogs:\n")
## ### Sample from Blogs:
cat(substr(blogs[1], 1, 100), "...\n\n")
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. ...
cat("### Sample from News:\n")
## ### Sample from News:
cat(substr(news[1], 1, 100), "...\n\n")
## He wasn't home alone, apparently. ...
cat("### Sample from Twitter:\n")
## ### Sample from Twitter:
cat(substr(twitter[1], 1, 100), "...")
## How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way  ...

6. Plans for Prediction Algorithm

Phase 1: Data Preparation

  • Clean text (remove special characters, numbers, convert to lowercase)
  • Tokenize text into words and sentences
  • Create n-gram models (1-gram, 2-gram, 3-gram, 4-gram); see the sketch below
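
As a rough illustration of this phase, the base R sketch below cleans a character vector and builds n-grams of a given order. The helper names clean_text() and make_ngrams() are hypothetical, and the final pipeline may use a text-mining package such as tm or quanteda instead.

# Minimal Phase 1 sketch (base R, illustration only): clean_text() and
# make_ngrams() are hypothetical helpers, not the final pipeline.
clean_text <- function(lines) {
  lines <- tolower(lines)                # convert to lowercase
  lines <- gsub("[0-9]+", " ", lines)    # remove numbers
  lines <- gsub("[^a-z' ]", " ", lines)  # remove special characters
  gsub("\\s+", " ", trimws(lines))       # collapse repeated whitespace
}

# Build n-grams of order n from a character vector of cleaned lines
make_ngrams <- function(lines, n = 2) {
  tokens <- strsplit(lines, " ", fixed = TRUE)
  unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Example: first few bigrams from a small sample of the blog data
head(make_ngrams(clean_text(blogs[1:1000]), n = 2))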

Phase 2: Algorithm Development

  • Build frequency tables for n-grams
  • Implement a backoff strategy (if no 4-gram match is found, try the 3-gram, and so on); see the sketch after this list
  • Apply smoothing techniques to handle unseen words
  • Optimize for speed and memory efficiency
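
The sketch below illustrates the intended backoff logic under simple assumptions: it builds trigram and bigram frequency tables from a small sample using the Phase 1 helpers above, then falls back from the trigram table to the bigram table when a prefix is unseen. The function name predict_next() is hypothetical, and no smoothing is applied yet.

# Minimal Phase 2 sketch (illustration only, no smoothing): build frequency
# tables keyed by prefix, then back off from trigrams to bigrams.
build_freq <- function(lines, n) {
  ngrams <- make_ngrams(clean_text(lines), n)          # reuses the Phase 1 helpers
  tab <- sort(table(ngrams), decreasing = TRUE)
  data.frame(prefix = sub(" \\S+$", "", names(tab)),   # all but the last word
             word   = sub("^.* ", "", names(tab)),     # the last word
             count  = as.integer(tab),
             stringsAsFactors = FALSE)
}

sample_lines <- blogs[1:10000]            # small sample keeps the sketch fast
freq3 <- build_freq(sample_lines, 3)
freq2 <- build_freq(sample_lines, 2)

# Back off to the bigram table when the trigram prefix is unseen
predict_next <- function(phrase, k = 3) {
  words <- strsplit(clean_text(phrase), " ")[[1]]
  hits <- freq3$word[freq3$prefix == paste(tail(words, 2), collapse = " ")]
  if (length(hits) == 0) hits <- freq2$word[freq2$prefix == tail(words, 1)]
  head(hits, k)
}

predict_next("thanks for the")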

Phase 3: Shiny App Development

  • Create a user-friendly interface with a text input box (a minimal sketch follows this list)
  • Display top 3-5 word predictions
  • Add options for different n-gram models
  • Include sample text for testing
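
A bare-bones Shiny sketch of the planned interface is shown below; it simply wires a text input to the hypothetical predict_next() helper from the Phase 2 sketch and prints the top suggestions. The final app will add model options and sample text.

# Minimal Phase 3 sketch (illustration only): relies on the hypothetical
# predict_next() helper defined in the Phase 2 sketch above.
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction (prototype)"),
  textInput("phrase", "Type a phrase:", value = "thanks for the"),
  h4("Top suggestions"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderPrint({
    req(input$phrase)
    predict_next(input$phrase, k = 5)   # show the top 5 candidate words
  })
}

shinyApp(ui, server)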

7. Next Steps

  1. Weeks 1-2: Complete text cleaning and n-gram generation
  2. Weeks 3-4: Build and test the prediction algorithm
  3. Week 5: Develop Shiny app interface
  4. Week 6: Optimize performance and finalize app

8. Conclusion

This exploratory analysis confirms we have sufficient high-quality text data to build an effective prediction algorithm. The diversity of sources (blogs, news, tweets) will help create a robust model that handles various writing styles.

For the next milestone, I will present the cleaned n-gram models and a prototype prediction function.