Exploratory Data Analysis for Text Prediction Project

  1. Introduction

Goal: Show that you are comfortable with the data and are making progress toward a prediction algorithm.

Brief overview:

“This report summarizes initial exploration of the SwiftKey dataset (blogs, news, and Twitter texts). The goal is to understand the structure and content of the text data in preparation for building a text prediction algorithm and deploying it through a Shiny application.”

  2. Data Loading and Summary

Mention the three data files:

en_US.blogs.txt

en_US.news.txt

en_US.twitter.txt

Show that the data has been loaded, e.g.:

    # Read each corpus file as UTF-8, skipping embedded NUL characters
    lines_blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
    lines_news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
    lines_twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
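
One way to make the loading step reusable is a small helper that opens each file through a connection and always closes it afterwards. This is only a sketch under stated assumptions: `read_corpus` is a hypothetical name, and opening the file in binary mode is an assumption meant to guard against stray control characters that can cut a text-mode read short on some platforms. The demonstration uses a temporary file so it runs without the real data.

```r
# Hypothetical helper: read one corpus file into a character vector.
read_corpus <- function(path) {
  con <- file(path, open = "rb")      # binary mode is an assumption, see above
  on.exit(close(con), add = TRUE)     # release the connection even on error
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a temporary file standing in for a real corpus file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)

length(demo_lines)    # number of lines read
head(demo_lines, 2)   # a short sample, as evidence that the load worked
```

Printing a line count and a few sample lines for each file is usually enough to show a reader that the data loaded.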

Table of summary stats (example):

File      Size (MB)   Line Count   Word Count   Max Line Length (chars)
Blogs     210         899,288      37,334,690   40,835
News      205         1,010,242    34,372,720   11,384
Twitter   167         2,360,148    30,373,583   140
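
The columns in a table like this can be computed with a few lines of base R. The sketch below is illustrative, not the method used to produce the example figures: `summarise_corpus` is a hypothetical helper, and counting "words" as whitespace-separated tokens is an assumption (other tokenizers will give different counts).

```r
# Hypothetical helper: one row of summary statistics for one corpus.
summarise_corpus <- function(name, lines, path = NA_character_) {
  # Approximate word count: split each trimmed line on runs of whitespace.
  words_per_line <- lengths(strsplit(trimws(lines), "\\s+"))
  data.frame(
    file       = name,
    size_mb    = if (is.na(path)) NA_real_ else round(file.size(path) / 1024^2, 1),
    line_count = length(lines),
    word_count = sum(words_per_line),
    max_chars  = max(nchar(lines)),
    stringsAsFactors = FALSE
  )
}

# Toy input so the sketch runs without the real files.
toy <- c("the quick brown fox", "hello world", "")
toy_summary <- summarise_corpus("toy", toy)
print(toy_summary)
```

With the real data, calling the helper once per file (for example `summarise_corpus("Blogs", lines_blogs, "en_US.blogs.txt")`) and combining the rows with `rbind` would give one table for all three sources.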

  3. Interesting Findings

Examples:

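Twitter has the most lines (2,360,148) but the shortest ones (at most 140 characters).

Blogs contain the longest lines (up to 40,835 characters) and more expressive text, which makes them useful for modeling complex phrasing.

News text is formal, with line lengths that fall between those of Twitter and blogs.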

  4. Plots

Histogram of line lengths or word frequencies

Bar plot of top 10 most common words

Optional: Word cloud or bigram frequency chart

Use ggplot2 or base R plotting

    # Distribution of characters per line in the Twitter data
    hist(nchar(lines_twitter), breaks = 50,
         main = "Line Length Distribution - Twitter", xlab = "Characters per line")
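
The bar plot of the most common words can be sketched in base R as well. The example below is a minimal illustration: `sample_lines` is a toy stand-in for the real corpus, and the tokenizer (lowercase, split on anything other than letters and apostrophes) is an assumption rather than a prescribed choice.

```r
# Toy stand-in for the real corpus so the sketch runs on its own.
sample_lines <- c("the cat sat on the mat", "the dog sat", "a cat and a dog")

# Lowercase, split on anything that is not a letter or apostrophe, drop empties.
tokens <- unlist(strsplit(tolower(sample_lines), "[^a-z']+"))
tokens <- tokens[nzchar(tokens)]

# Count each word and keep (up to) the ten most frequent.
word_counts <- sort(table(tokens), decreasing = TRUE)
top_words <- head(word_counts, 10)

barplot(top_words, las = 2,
        main = "Top 10 Most Common Words", ylab = "Frequency")
```

On the full datasets it may be more practical to run this on a random sample of lines, since tokenizing every line at once can be slow and memory-hungry.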

  5. Plans for the Prediction Algorithm and Shiny App

Language model:

“I plan to use an n-gram model (most likely a trigram model) with a back-off strategy to predict the next word from the user’s input.”
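
To make the back-off idea concrete, here is a minimal sketch in base R. It assumes the simplest form of back-off: look for the most frequent trigram that continues the last two words, fall back to bigrams if nothing matches, and finally fall back to the most frequent single word. It does no discounting or probability smoothing, so it is not Katz back-off, and `count_ngrams` and `predict_next` are hypothetical names.

```r
# Count n-grams (as space-separated strings) in a vector of tokens.
count_ngrams <- function(tokens, n) {
  starts <- seq_len(max(length(tokens) - n + 1, 0))
  grams <- vapply(starts,
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  table(grams)
}

# Most frequent final word among n-grams that begin with `prefix`, or NULL.
best_match <- function(prefix, counts) {
  nm <- as.character(names(counts))
  hits <- counts[startsWith(nm, paste0(prefix, " "))]
  if (length(hits) == 0) return(NULL)
  sub(".* ", "", names(hits)[which.max(hits)])
}

# Back off from trigram to bigram to unigram until something matches.
predict_next <- function(input, uni, bi, tri) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  guess <- NULL
  if (length(words) >= 2) {
    guess <- best_match(paste(tail(words, 2), collapse = " "), tri)
  }
  if (is.null(guess) && length(words) >= 1) {
    guess <- best_match(tail(words, 1), bi)
  }
  if (is.null(guess)) guess <- names(uni)[which.max(uni)]
  guess
}

# Toy training data standing in for the cleaned corpus.
toy_tokens <- c("i", "love", "new", "york", "i", "love", "new", "shoes",
                "i", "love", "new", "york")
uni <- count_ngrams(toy_tokens, 1)
bi  <- count_ngrams(toy_tokens, 2)
tri <- count_ngrams(toy_tokens, 3)

predict_next("I love new", uni, bi, tri)  # matched at the trigram level
predict_next("old new", uni, bi, tri)     # no trigram match, backs off to bigrams
```

Scanning the names of a count table like this is fine for a toy example; for the full corpus a keyed lookup structure would likely be needed to keep predictions fast.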

Data cleaning:

“All text will be lowercased and cleaned of punctuation, profanity, and numbers.”
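
The cleaning steps described above could look roughly like this in base R. Treat it as a sketch: `clean_text` and `banned_words` are hypothetical, the two-word list is a harmless placeholder for a real profanity list, and keeping apostrophes (so contractions survive) is an assumption that goes slightly beyond "remove punctuation". The character class also drops non-ASCII letters, which may or may not be desirable.

```r
# Hypothetical placeholder list; a real run would load a fuller word list.
banned_words <- c("darn", "heck")

clean_text <- function(lines, banned = banned_words) {
  x <- tolower(lines)
  x <- gsub("[0-9]+", " ", x)       # remove numbers
  x <- gsub("[^a-z' ]+", " ", x)    # remove punctuation and other symbols
  if (length(banned) > 0) {
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)   # remove listed words
  }
  trimws(gsub("\\s+", " ", x))      # collapse repeated whitespace
}

cleaned <- clean_text("Well, DARN it: I paid $20 for 3 apples!")
print(cleaned)
```

Running the same function over the training data and over the user's input keeps the two consistent, which matters because the model can only match n-grams in the form it was trained on.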

Shiny app functionality:

“The user will input text, and the app will suggest the next word based on the trained model.”
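
A minimal version of such an app might be structured as follows. This is a sketch only: `suggest_next_word` is a hypothetical placeholder that would be replaced by the trained model, and `build_app` is a hypothetical wrapper that lets the file be sourced without launching anything. The `shiny` calls used (`fluidPage`, `titlePanel`, `textInput`, `textOutput`, `renderText`, `shinyApp`) are part of the shiny package, which must be installed to run the app.

```r
# Placeholder predictor (hypothetical): a real app would call the trained model.
suggest_next_word <- function(text) {
  if (!nzchar(trimws(text))) return("")
  "the"
}

# Build the app inside a function so nothing runs until it is called.
build_app <- function() {
  ui <- shiny::fluidPage(
    shiny::titlePanel("Next Word Prediction"),
    shiny::textInput("phrase", "Type a phrase:"),
    shiny::textOutput("suggestion")
  )
  server <- function(input, output, session) {
    # Recomputed automatically whenever the text box changes.
    output$suggestion <- shiny::renderText({
      suggest_next_word(input$phrase)
    })
  }
  shiny::shinyApp(ui = ui, server = server)
}

# To launch interactively (requires the shiny package):
# shiny::runApp(build_app())
```

Keeping the prediction logic in a plain function, separate from the UI code, makes it possible to test the model without starting the app.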

  6. Conclusion

Reiterate that:

Data is loaded and summarized.

Patterns and structure of the datasets are understood.

The foundation for building a prediction model is in place.

✅ Evaluation Checklist (Scoring Template)

Requirement                                                              Met? (Y/N)   Notes
Link leads to a valid, public RPubs HTML page
Shows data successfully loaded (e.g., uses readLines, shows samples)
Summarizes all 3 files (line count, word count, etc.)
Includes basic plots (histograms, bar plots; word cloud optional)
Written concisely and understandably for a non-technical audience
Describes future plan for the prediction model (e.g., n-gram)
Describes functionality of the planned Shiny app