This report documents the initial exploratory analysis of text data for building a text prediction application. The goal is to create an app that suggests the next word as users type, similar to smartphone keyboard predictions.
The data consists of three English text files from SwiftKey:
# Set working directory
setwd("C:/Users/purni/Desktop/Coursera-SwiftKey/final/en_US")
# Load the data files
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
cat("✅ Data successfully loaded!\n")
## ✅ Data successfully loaded!
Files successfully loaded:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
# Calculate basic statistics
summary_data <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Lines = c(length(blogs), length(news), length(twitter)),
Total_Words = c(
sum(sapply(strsplit(blogs, "\\s+"), length)),
sum(sapply(strsplit(news, "\\s+"), length)),
sum(sapply(strsplit(twitter, "\\s+"), length))
),
Avg_Characters_Per_Line = round(c(
mean(nchar(blogs)),
mean(nchar(news)),
mean(nchar(twitter))
), 1)
)
# Display the table
knitr::kable(summary_data, caption = "Summary Statistics of Text Files")
| Source | Lines | Total_Words | Avg_Characters_Per_Line |
|---|---|---|---|
| Blogs | 899288 | 37334131 | 230.0 |
| News | 1010206 | 34371031 | 201.2 |
| Twitter | 2360148 | 30373583 | 68.7 |
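The table reports characters per line but not words per line, which is the more relevant quantity for an n-gram model. As a rough sketch, the figure can be derived from the counts above. The data frame is re-created here from the table values so the snippet runs on its own, and the column name `Avg_Words_Per_Line` is an illustrative choice, not part of the original analysis.

```r
# Counts copied from the summary table (whitespace-delimited word totals)
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Total_Words = c(37334131, 34371031, 30373583)
)

# Derived column (hypothetical name): average words per line, one decimal
summary_data$Avg_Words_Per_Line <-
  round(summary_data$Total_Words / summary_data$Lines, 1)

print(summary_data)
```

Under these counts, blog lines average roughly three times as many words as tweets, which matches the character-length gap in the table.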
library(ggplot2)
# Plot 1: Comparison of file sizes
ggplot(summary_data, aes(x = Source, y = Lines/1000, fill = Source)) +
geom_col() +
labs(title = "Number of Lines in Each Text Source",
y = "Thousands of Lines",
x = "Data Source") +
theme_minimal() +
scale_fill_brewer(palette = "Set2")
# Plot 2: Average line length comparison
ggplot(summary_data, aes(x = Source, y = Avg_Characters_Per_Line, fill = Source)) +
geom_col() +
labs(title = "Average Characters Per Line",
y = "Characters",
x = "Data Source") +
theme_minimal() +
scale_fill_brewer(palette = "Set3")
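From the initial analysis:
- Twitter has the most lines (2,360,148) but the shortest ones, averaging 68.7 characters.
- Blogs have the fewest lines (899,288) but the longest, averaging 230.0 characters.
- Total word counts are of similar magnitude across the three sources (roughly 30 to 37 million each).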
# Show sample content from each source
cat("### Sample from Blogs:\n")
## ### Sample from Blogs:
cat(substr(blogs[1], 1, 100), "...\n\n")
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”. ...
cat("### Sample from News:\n")
## ### Sample from News:
cat(substr(news[1], 1, 100), "...\n\n")
## He wasn't home alone, apparently. ...
cat("### Sample from Twitter:\n")
## ### Sample from Twitter:
cat(substr(twitter[1], 1, 100), "...")
## How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way ...
This exploratory analysis shows that the corpus is large: about 4.27 million lines and roughly 102 million whitespace-delimited words in total, which should be enough to train a next-word prediction algorithm. The diversity of sources (blogs, news, tweets) should help the model handle a range of writing styles.
For the next milestone, I will present the cleaned n-gram models and a prototype prediction function.
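To make the planned next step concrete, here is a minimal sketch of a bigram-based next-word predictor in base R. It is one possible shape for the prototype, not the final method. The toy corpus, the cleaning rule, and the function names `tokenize`, `build_bigrams`, and `predict_next` are all assumptions made for illustration.

```r
# Toy corpus standing in for the cleaned text (hypothetical example lines)
toy_lines <- c(
  "thanks for the follow",
  "thanks for the RT",
  "thanks for the follow back",
  "love the weather today"
)

# Lowercase, replace anything but letters, apostrophes and spaces,
# then split on whitespace (a deliberately simple cleaning rule)
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Count adjacent word pairs within each line
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tokens <- tokenize(line)
    if (length(tokens) < 2) return(NULL)
    data.frame(first = head(tokens, -1),
               second = tail(tokens, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  counts <- aggregate(n ~ first + second,
                      data = cbind(pairs, n = 1L), FUN = sum)
  # Most frequent continuation first within each leading word
  counts[order(counts$first, -counts$n), ]
}

# Return up to k most frequent words seen after `word`
predict_next <- function(bigrams, word, k = 3) {
  candidates <- bigrams[bigrams$first == tolower(word), ]
  head(as.character(candidates$second), k)
}

bigrams <- build_bigrams(toy_lines)
predict_next(bigrams, "the")
```

A word that never appears as the first element of a bigram yields an empty result, so a real prototype would need a fallback such as backing off to the most frequent words overall.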