This report is part of the Data Science Capstone Project, in which I build a predictive text model from real-world text data drawn from blogs, news articles, and Twitter.
I downloaded and read the following datasets:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
They contain a variety of text styles and lengths.
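As a minimal sketch of the loading step (assuming the three files sit in the working directory; `skipNul` guards against embedded nulls in the Twitter file):

```r
# Read each file; names follow the Coursera dataset layout
# (assumed to be in the working directory).
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Quick summary: line counts and file sizes in MB
data.frame(
  source  = c("Blogs", "News", "Twitter"),
  lines   = c(length(blogs), length(news), length(twitter)),
  size_mb = round(file.size(c("en_US.blogs.txt", "en_US.news.txt",
                              "en_US.twitter.txt")) / 1024^2, 1)
)
```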
| Source | Lines | File size |
|---|---|---|
| Blogs | 899,288 | ~206 MB |
| News | 1,010,242 | ~200 MB |
| Twitter | 2,360,148 | ~163 MB |
A random sample of 10,000 lines from each dataset was used. Common text pre-processing steps included:

- Lowercasing
- Removing punctuation and numbers
- Removing stopwords
- Stripping whitespace
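One way to implement these steps is with the `tm` package; the snippet below is a sketch (the seed value is illustrative) that samples the raw vectors read above and applies each transformation in order:

```r
library(tm)

set.seed(123)  # reproducible sampling; seed value is illustrative

# Sample 10,000 lines from each source and pool them
sample_text <- c(sample(blogs,   10000),
                 sample(news,    10000),
                 sample(twitter, 10000))

# Build a corpus and apply the pre-processing steps listed above
corpus <- VCorpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)
```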
Word frequency analysis showed that a small number of words account for most usage.
The ten most frequent words, counted before stopword removal, were: the, to, and, a, of, in, i, it, is, that.
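A minimal way to reproduce such a count (since all ten words above are stopwords, the tally here is taken on the lowercased sample before stopword removal; `sample_text` comes from the previous snippet):

```r
# Tokenize the raw lowercased sample; apostrophes are kept so
# contractions like "don't" survive as single tokens
tokens <- unlist(strsplit(tolower(sample_text), "[^a-z']+"))
tokens <- tokens[tokens != ""]
freq   <- sort(table(tokens), decreasing = TRUE)
head(freq, 10)   # the ten most frequent words
```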
I also analyzed common bigrams and trigrams:

- Bigrams: “thank you”, “new york”, “last night”
- Trigrams: “i love you”, “i don’t know”, “let me know”
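A compact base-R sketch for counting n-grams (it reuses the `tokens` vector from the previous snippet and, for simplicity, ignores sentence and line boundaries; packages such as RWeka or tokenizers provide the same functionality):

```r
# Count n-grams by pasting shifted copies of the token stream
ngram_freq <- function(tokens, n) {
  len   <- length(tokens) - n + 1
  grams <- do.call(paste, lapply(seq_len(n),
                                 function(k) tokens[k:(k + len - 1)]))
  sort(table(grams), decreasing = TRUE)
}

bigrams  <- ngram_freq(tokens, 2)
trigrams <- ngram_freq(tokens, 3)
head(bigrams, 5)
head(trigrams, 5)
```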
To build the predictive model I will use:

- An n-gram model (unigrams through trigrams)
- Backoff/smoothing to handle unseen word combinations (sketched below)
- A Shiny app that predicts the next word based on user input
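As a rough illustration of the backoff idea (not the final model), the hypothetical helper below tries the trigram table first, falls back to bigrams, and finally returns the overall most frequent word; it assumes the `trigrams`, `bigrams`, and `freq` tables from the snippets above:

```r
# Hypothetical backoff predictor: highest-count continuation wins.
predict_next <- function(input, trigrams, bigrams, freq) {
  words <- unlist(strsplit(tolower(input), "[^a-z']+"))
  words <- words[words != ""]

  # 1. Trigram match: "w1 w2 <next>"
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits   <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  }
  # 2. Back off to bigram match: "w1 <next>"
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits   <- bigrams[startsWith(names(bigrams), paste0(prefix, " "))]
    if (length(hits) > 0) return(sub(".* ", "", names(hits)[1]))
  }
  # 3. Final fallback: the most frequent unigram overall
  names(freq)[1]
}

predict_next("let me", trigrams, bigrams, freq)  # e.g. "know"
```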
The final app will be hosted on shinyapps.io.
This report shows that the data have been explored and cleaned and are ready for model development. The next step is building and testing the prediction model.