Exploratory Data Analysis for Text Prediction Project

Introduction

Goal: Show you’re comfortable with the data and are progressing toward building a prediction algorithm.

Brief overview:

“This report summarizes initial exploration of the SwiftKey dataset (blogs, news, and Twitter texts). The goal is to understand the structure and content of the text data in preparation for building a text prediction algorithm and deploying it through a Shiny application.”

Data Loading and Summary

Mention the 3 data files:

en_US.blogs.txt

en_US.news.txt

en_US.twitter.txt

Show that the data has been loaded, e.g.:

lines_blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
lines_news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
lines_twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

Table of summary stats (example):

File	Size (MB)	Line Count	Word Count	Max Line Length (characters)
Blogs	210	899,288	37,334,690	40,835
News	205	1,010,242	34,372,720	11,384
Twitter	167	2,360,148	30,373,583	140
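A table like the one above could be computed along these lines. This is a minimal sketch in base R, not a definitive method: `summarize_lines` is a hypothetical helper name, the toy vectors stand in for `lines_blogs`, `lines_news`, and `lines_twitter`, and words are counted as whitespace-delimited tokens, so totals may differ slightly from other counting methods.

```r
# Hypothetical helper: summary statistics for one character vector of lines
summarize_lines <- function(lines) {
  # Split each line on runs of whitespace and count the non-empty tokens
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- vapply(tokens, function(x) sum(nzchar(x)), integer(1))
  data.frame(
    line_count      = length(lines),
    word_count      = sum(words_per_line),
    max_line_length = max(nchar(lines))
  )
}

# Toy stand-ins for lines_blogs, lines_news, and lines_twitter
toy <- list(
  Blogs   = c("A long blog post about the weekend", "Another entry"),
  News    = c("The council voted on Tuesday"),
  Twitter = c("so tired", "lol", "see you soon")
)

# One row per source, with the source name as the row name
summary_table <- do.call(rbind, lapply(toy, summarize_lines))
print(summary_table)
```

File sizes can be added with base R's `file.size()`, which returns bytes for a given path.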

Interesting Findings

Examples:

Twitter has the most lines but the shortest ones, capped at 140 characters.

Blogs contain longer, more expressive text, which is useful for learning complex phrasing.

News text is formal, with line lengths between those of Twitter and blogs.

Plots

Histogram of line lengths or word frequencies

Bar plot of top 10 most common words

Optional: Word cloud or bigram frequency chart

Use ggplot2 or base R plotting

hist(nchar(lines_twitter), breaks = 50, main = "Line Length Distribution - Twitter")
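The bar plot of the 10 most common words listed above could be sketched as follows with base R plotting. The `lines_sample` vector is a toy stand-in for one of the `lines_*` objects, and the tokenization rule (split on anything other than letters and apostrophes) is an assumption rather than a fixed choice.

```r
# Toy stand-in for one of the lines_* vectors
lines_sample <- c("the cat sat on the mat",
                  "the dog ate the bone",
                  "a cat and a dog")

# Lowercase, split on anything that is not a letter or apostrophe,
# and drop empty tokens
words <- unlist(strsplit(tolower(lines_sample), "[^a-z']+"))
words <- words[nzchar(words)]

# Frequency table sorted from most to least common; keep the top 10
word_freq <- sort(table(words), decreasing = TRUE)
top_words <- head(word_freq, 10)

barplot(top_words, las = 2,
        main = "Top 10 Most Common Words", ylab = "Count")
```

On the full dataset, the top of this ranking is usually dominated by stop words such as "the" and "to", which is worth noting in the report.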

Plans for the Prediction Algorithm and Shiny App

Language model:

“I plan to use an n-gram model (likely tri-gram) with a back-off strategy to predict the next word based on user input.”
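To make the idea concrete, here is a minimal sketch of a trigram model with a simple back-off, using base R only. It tries the two-word context first, then the one-word context, then falls back to the most frequent word. It applies no discounting or smoothing, so it is simpler than Katz back-off. The function names `count_ngrams` and `predict_next` and the toy corpus are illustrative assumptions.

```r
# Toy training corpus, assumed to be already cleaned and lowercased
corpus <- c("i love new york", "i love new ideas",
            "i love new york pizza", "we love pizza")

# Frequency table of all n-grams of order n, most frequent first
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    starts <- seq_len(length(w) - n + 1)
    vapply(starts, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# counts[[n]] holds the n-gram table for n = 1, 2, 3
counts <- list(count_ngrams(corpus, 1),
               count_ngrams(corpus, 2),
               count_ngrams(corpus, 3))

# Predict the next word: longest context first, then back off
predict_next <- function(input, counts) {
  words <- strsplit(trimws(tolower(input)), "\\s+")[[1]]
  for (n in 3:2) {
    k <- n - 1                                   # context length in words
    if (length(words) < k) next
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- counts[[n]][startsWith(names(counts[[n]]), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(prefix, "", best, fixed = TRUE))
    }
  }
  names(counts[[1]])[1]                          # most common single word
}

predict_next("I love new", counts)   # matched at the trigram level
predict_next("they love", counts)    # backs off to bigrams
predict_next("zebra", counts)        # backs off to the most common word
```

On the full dataset, the n-gram tables would likely need to be built from a sample and pruned of rare entries to fit in memory.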

Data cleaning:

“All text will be lowercased and cleaned of punctuation, profanity, and numbers.”
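The cleaning steps could look roughly like this in base R. This is a sketch under stated assumptions: `clean_text` is a hypothetical helper, and the two-word `profanity` vector is a placeholder for a real word list loaded from a file.

```r
# Placeholder profanity list; a real list would be loaded from a file
profanity <- c("darn", "heck")

# Hypothetical helper: lowercase, then strip numbers, punctuation,
# and banned words
clean_text <- function(lines, banned = profanity) {
  x <- tolower(lines)
  x <- gsub("[0-9]+", " ", x)            # remove numbers
  x <- gsub("[[:punct:]]+", " ", x)      # remove punctuation
  if (length(banned) > 0) {
    # Remove banned words only when they appear as whole words
    pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
    x <- gsub(pattern, " ", x, perl = TRUE)
  }
  x <- gsub("\\s+", " ", x)              # collapse repeated whitespace
  trimws(x)
}

cleaned <- clean_text("Well, DARN it: 3 cats ate 12 fish!")
print(cleaned)
```

Removing all punctuation also splits contractions ("don't" becomes "don t"), so a fuller version might keep apostrophes inside words.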

Shiny app functionality:

“The user will input text, and the app will suggest the next word based on the trained model.”
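A bare-bones version of such an app might be structured as below, assuming the `shiny` package is installed. The `suggest_next_word` function is a hypothetical placeholder backed by a tiny lookup table; the trained n-gram model would take its place. The input and output IDs are likewise illustrative.

```r
library(shiny)

# Hypothetical placeholder: a trained n-gram model would replace this lookup
suggest_next_word <- function(text) {
  lookup <- c(love = "you", thank = "you", good = "morning")
  words <- strsplit(trimws(tolower(text)), "\\s+")[[1]]
  if (length(words) == 0) return("")
  last_word <- words[length(words)]
  # Fall back to a common word when the last word is not in the lookup
  if (last_word %in% names(lookup)) lookup[[last_word]] else "the"
}

# UI: one text box for input, one text output for the suggestion
ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Type a phrase:"),
  strong("Suggested next word:"),
  textOutput("prediction")
)

# Server: recompute the suggestion whenever the input text changes
server <- function(input, output, session) {
  output$prediction <- renderText(suggest_next_word(input$user_text))
}

app <- shinyApp(ui, server)
# Launch interactively with: runApp(app)
```

Keeping the prediction logic in a plain function, separate from the UI and server code, makes it possible to test the model without starting the app.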

Conclusion

Reiterate that:

Data is loaded and summarized.

Patterns and structure of the datasets are understood.

The foundation for building a prediction model is in place.

✅ Evaluation Checklist (Scoring Template)

Requirement	Met? (Y/N)	Notes
Link leads to valid, public RPubs HTML page
Shows data successfully loaded (e.g. uses `readLines`, shows samples)
Summarizes all 3 files (line count, word count, etc.)
Includes basic plots (histograms, bar plots, word cloud optional)
Written concisely and understandably for non-technical audience
Describes future plan for prediction model (e.g. n-gram)
Describes functionality of planned Shiny app