Exploratory Data Analysis for Text Prediction Project
Brief overview:
“This report summarizes initial exploration of the SwiftKey dataset (blogs, news, and Twitter texts). The goal is to understand the structure and content of the text data in preparation for building a text prediction algorithm and deploying it through a Shiny application.”
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
Show that the data has been loaded, e.g.:
lines_blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines_news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
lines_twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
Table of summary stats (example):
| File | Size (MB) | Line Count | Word Count | Max Line Length |
|---|---|---|---|---|
| Blogs | 210 | 899,288 | 37,334,690 | 40,835 |
| News | 205 | 1,010,242 | 34,372,720 | 11,384 |
| Twitter | 167 | 2,360,148 | 30,373,583 | 140 |
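One way the line count, word count, and maximum line length columns could be computed is sketched below in base R. The helper name `summarize_lines` and the whitespace-based word definition are assumptions made for illustration, and the toy input stands in for the real files. File size would come separately, for example from `file.size("en_US.blogs.txt") / 1024^2`.

```r
# Hypothetical helper: summarize a character vector of text lines.
# Words are defined here as whitespace-separated tokens (an assumption).
summarize_lines <- function(lines) {
  tokens <- strsplit(trimws(lines), "\\s+")
  # Count non-empty tokens per line; empty lines contribute zero words
  word_counts <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))
  data.frame(
    line_count = length(lines),
    word_count = sum(word_counts),
    max_line_length = max(nchar(lines))
  )
}

# Toy input standing in for lines_blogs, lines_news, or lines_twitter
sample_lines <- c("the quick brown fox", "hello world", "")
stats <- summarize_lines(sample_lines)
print(stats)
```

Applying the same helper to each of the three loaded vectors and row-binding the results would give one row per file.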
Twitter has the most lines but the shortest ones, capped at 140 characters.
Blogs contain longer, more expressive text; good for complex phrasing.
News text is formal, with a maximum line length between those of Twitter and blogs.
Bar plot of top 10 most common words
Optional: Word cloud or bigram frequency chart
Use ggplot2 or base R plotting
hist(nchar(lines_twitter), breaks = 50, main = "Line Length Distribution - Twitter")
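The bar plot of the most common words could be produced along the following lines with base R plotting. This is a minimal sketch: the helper `top_words` and its tokenization rule (lowercase, split on anything that is not a letter or apostrophe) are assumptions, and the two toy sentences stand in for the real data.

```r
# Hypothetical helper: return the n most frequent words in a set of lines.
top_words <- function(lines, n = 10) {
  # Lowercase, then split on runs of characters other than a-z and apostrophe
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[nzchar(words)]
  freq <- sort(table(words), decreasing = TRUE)
  head(freq, n)
}

# Toy input; in the report this would be one of the loaded vectors
sample_lines <- c("the cat sat on the mat", "the dog sat")
top <- top_words(sample_lines, n = 3)

# Base R bar plot of the word counts
barplot(top, las = 2, main = "Top words (toy sample)", ylab = "Count")
```

On the full corpus, common stop words such as "the" will likely dominate, so it may be worth plotting the top words both with and without them.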
“I plan to use an n-gram model (likely tri-gram) with a back-off strategy to predict the next word based on user input.”
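To make the plan concrete, here is a minimal base R sketch of a trigram model with a simplified back-off: it tries the longest available context first and falls back to shorter contexts, then to the most frequent single word. It is not Katz back-off (there is no discounting or probability smoothing), and the names `ngrams`, `build_model`, and `predict_next`, as well as the toy corpus, are assumptions made for illustration.

```r
# Build all n-grams of order n from a vector of tokens
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  idx <- seq_len(length(tokens) - n + 1)
  vapply(idx, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams of order 1..max_n; returns a list of frequency tables
build_model <- function(lines, max_n = 3) {
  token_list <- strsplit(tolower(lines), "[^a-z']+")
  token_list <- lapply(token_list, function(x) x[nzchar(x)])
  lapply(seq_len(max_n), function(n) {
    table(unlist(lapply(token_list, ngrams, n = n)))
  })
}

# Predict the next word, backing off from trigrams to bigrams to unigrams
predict_next <- function(model, input) {
  tokens <- strsplit(tolower(input), "[^a-z']+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  for (n in rev(seq_along(model)[-1])) {
    k <- n - 1                       # number of context words needed
    if (length(tokens) < k) next
    prefix <- paste(c(tail(tokens, k), ""), collapse = " ")  # e.g. "w1 w2 "
    counts <- model[[n]]
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))
    }
  }
  # Final fallback: the most frequent single word
  names(model[[1]])[which.max(model[[1]])]
}

# Toy corpus standing in for the cleaned SwiftKey text
corpus <- c("i love data science", "i love data analysis",
            "i love data science a lot", "we love pizza")
model <- build_model(corpus)
print(predict_next(model, "I love data"))  # uses the trigram context
print(predict_next(model, "they love"))    # backs off to the bigram context
```

Scanning n-gram names with `startsWith` is fine for a toy corpus but would be slow at full scale; a keyed lookup table is a more likely choice for the real model.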
Data cleaning:
“All text will be lowercased and cleaned of punctuation, profanity, and numbers.”
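The cleaning steps described above might look roughly like this in base R. The function name `clean_text` is hypothetical, the profanity list is supplied by the caller (no particular word list is assumed, and entries are assumed to be plain words without regex metacharacters), and apostrophes are kept so that contractions survive.

```r
# Hypothetical cleaning helper: lowercase, strip numbers, punctuation,
# and any words in a caller-supplied profanity list.
clean_text <- function(lines, profanity = character(0)) {
  x <- tolower(lines)
  x <- gsub("[0-9]+", " ", x)       # remove numbers
  x <- gsub("[^a-z' ]+", " ", x)    # remove punctuation and other symbols
  if (length(profanity) > 0) {
    # Match listed words only as whole words
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)         # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text("Hello, World! It's 2024... darn it", profanity = "darn")
print(cleaned)
```

Note that this pattern also drops non-ASCII letters and curly apostrophes, which may or may not be desirable for the Twitter and blog text.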
Shiny app functionality:
“The user will input text, and the app will suggest the next word based on the trained model.”
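A minimal sketch of how such an app could be wired up with the shiny package is shown below. The input and output IDs and the placeholder function `suggest_next_word` are assumptions; in the real app the placeholder would be replaced by a call into the trained model.

```r
library(shiny)

# Placeholder predictor (hypothetical): the trained model would go here
suggest_next_word <- function(phrase) {
  "the"
}

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output, session) {
  output$suggestion <- renderText({
    req(input$phrase)  # wait until the user has typed something
    suggest_next_word(input$phrase)
  })
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app locally
```

Because `renderText` is reactive, the suggestion updates as the user types, without a separate submit button.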
Data is loaded and summarized.
Patterns and structure of the datasets are understood.
The foundation for building a prediction model is in place.
✅ Evaluation Checklist (Scoring Template)
| Requirement | Met? (Y/N) | Notes |
|---|---|---|
| Link leads to valid, public RPubs HTML page | | |
| Shows data successfully loaded (e.g. uses readLines, shows samples) | | |
| Summarizes all 3 files (line count, word count, etc.) | | |
| Includes basic plots (histograms, bar plots, word cloud optional) | | |
| Written concisely and understandably for non-technical audience | | |
| Describes future plan for prediction model (e.g. n-gram) | | |
| Describes functionality of planned Shiny app | | |