Introduction

This report is part of the Data Science Capstone Project, in which I am building a predictive text model from real-world text data drawn from blogs, news, and Twitter.

Data Loading

I downloaded and read the following datasets:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

They contain a variety of text styles and lengths.

Summary Statistics

Source     Lines        File size
Blogs      899,288      ~206 MB
News       1,010,242    ~200 MB
Twitter    2,360,148    ~163 MB
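The report does not show how these figures were computed. As a rough illustration only, line counts and file sizes like those above could be gathered with a sketch along the following lines. It is written in Python for illustration, not taken from the project; the function name `file_stats` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption.

```python
import os


def file_stats(path):
    """Return (line_count, size_in_mb) for the file at `path`.

    Assumption: 1 MB is taken as 1024 * 1024 bytes.
    """
    size_mb = os.path.getsize(path) / (1024 ** 2)
    # Read in binary mode so that odd characters in the raw text
    # cannot raise decoding errors; lines are split on b"\n".
    with open(path, "rb") as handle:
        line_count = sum(1 for _ in handle)
    return line_count, size_mb
```

Calling `file_stats("en_US.blogs.txt")` would return the pair for the blogs file, provided the file sits in the working directory.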

Sampling and Cleaning

A random sample of 10,000 lines from each dataset was used. Common text pre-processing included:

- Lowercasing
- Removing punctuation and numbers
- Removing stopwords
- Stripping whitespace
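The sampling and cleaning steps listed above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the code used for this report: the names `sample_lines` and `clean_line` are hypothetical, the fixed seed is an assumption, and the stopword set is a tiny placeholder rather than a full stopword list.

```python
import random
import re

# Placeholder stopword set for illustration only; a real run would
# use a much fuller list.
STOPWORDS = {"the", "to", "and", "a", "of", "in", "i", "it", "is", "that"}


def sample_lines(lines, n, seed=42):
    """Return up to `n` lines chosen at random, reproducibly via `seed`."""
    lines = list(lines)
    if len(lines) <= n:
        return lines
    return random.Random(seed).sample(lines, n)


def clean_line(line, stopwords=STOPWORDS):
    """Lowercase, drop punctuation and digits, remove stopwords, tidy spaces."""
    text = line.lower()
    # Delete every character that is not a-z or whitespace. This removes
    # punctuation and numbers (and, as a side effect, accented letters).
    text = re.sub(r"[^a-z\s]", "", text)
    # split() with no arguments also collapses repeated whitespace.
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)
```

Because punctuation is deleted rather than replaced with a space, a contraction such as "don't" becomes the single token "dont" in this sketch.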

Exploratory Analysis

Word frequency analysis showed that a small number of distinct words accounts for most word occurrences.

Top frequent words: the, to, and, a, of, in, i, it, is, that

I also analyzed common bigrams and trigrams:

- Bigrams: “thank you”, “new york”, “last night”
- Trigrams: “i love you”, “i don’t know”, “let me know”
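Bigram and trigram counts of the kind reported above can be produced by sliding a window of n words over each line. The following Python sketch shows one way to do it; the function names `ngrams` and `count_ngrams` are hypothetical, and whitespace tokenization is an assumption, since the report does not describe its tokenizer.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all runs of `n` consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(lines, n):
    """Count n-grams over an iterable of already-cleaned text lines."""
    counts = Counter()
    for line in lines:
        # N-grams are built per line, so they never span two lines.
        counts.update(ngrams(line.split(), n))
    return counts
```

The most frequent entries can then be read off with `count_ngrams(lines, 2).most_common(10)`.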

Plans for Prediction Model

I will use:

- An n-gram model (1-gram to 3-gram)
- Backoff/smoothing to handle unseen combinations
- A Shiny app that predicts the next word based on input

The final app will be hosted using shinyapps.io.
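To make the planned approach concrete, here is a minimal Python sketch of next-word prediction with a simple backoff: try the last two words as context, then the last word, then fall back to the overall most frequent word. It is an assumption-laden illustration, not the planned implementation. It applies no smoothing or discounting, the names `build_model` and `predict_next` are hypothetical, and the Shiny app itself is not covered.

```python
from collections import Counter, defaultdict


def build_model(lines, max_n=3):
    """Map each context (a tuple of 0 to max_n - 1 words) to next-word counts."""
    model = defaultdict(Counter)
    for line in lines:
        tokens = line.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                context = tuple(tokens[i:i + n - 1])
                next_word = tokens[i + n - 1]
                model[context][next_word] += 1
    return model


def predict_next(model, text, max_n=3):
    """Return the most frequent next word, backing off to shorter contexts."""
    tokens = text.lower().split()
    # Try the longest usable context first, then shorten it step by step.
    for k in range(max_n - 1, -1, -1):
        if k > len(tokens):
            continue
        context = tuple(tokens[len(tokens) - k:])
        candidates = model.get(context)
        if candidates:
            return candidates.most_common(1)[0][0]
    return None  # the model is empty
```

With a model built from cleaned lines, `predict_next(model, "i love")` looks up the context `("i", "love")` first and only falls back to `("love",)` or the unigram counts when the longer context was never seen.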

Conclusion

This report shows that the data has been explored and cleaned and is ready for model development. The next step is building and testing the prediction model.