This report is part of the Data Science Capstone Project. The main objective is to demonstrate that the data has been successfully downloaded, loaded, cleaned, and explored. Ultimately, we will build a predictive model for text input using natural language processing (NLP) techniques. The datasets consist of English-language text from blogs, news articles, and Twitter.
We begin by reading the raw data and summarizing basic statistics: number of lines, words, and average words per line.
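As a rough sketch, the counts in the summary table below could be computed with `readLines` and `stringr::str_count`; the file paths are assumptions about where the unzipped en_US files live.

```r
library(dplyr)
library(stringr)
library(tibble)

# Assumed locations of the raw files; adjust to the local unzip directory
files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt",
           Twitter = "final/en_US/en_US.twitter.txt")

summarise_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(str_count(lines, "\\S+"))
  tibble(Lines = length(lines),
         Words = words,
         AvgWordsPerLine = words / length(lines))
}

# One row per source: Blogs, News, Twitter
bind_rows(lapply(files, summarise_file), .id = "Source")
```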
## # A tibble: 3 × 4
## Source Lines Words AvgWordsPerLine
## <chr> <int> <int> <dbl>
## 1 Blogs 899288 37334131 41.5
## 2 News 1010242 34372530 34.0
## 3 Twitter 2360148 30373583 12.9
We randomly sample 10,000 lines from each file to reduce memory usage and processing time.
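A minimal sketch of the sampling step; the seed and the `sample_lines` helper are illustrative, not the exact code used.

```r
set.seed(1234)  # assumed seed, purely for reproducibility

sample_lines <- function(path, n = 10000) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sample(lines, min(n, length(lines)))
}

# 10,000 random lines from each source, combined into one character vector
sampled <- c(sample_lines("final/en_US/en_US.blogs.txt"),
             sample_lines("final/en_US/en_US.news.txt"),
             sample_lines("final/en_US/en_US.twitter.txt"))
```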
We clean the sampled data by:

• Converting all text to lowercase
• Removing punctuation, numbers, and extra whitespace
• Removing profanity (using the “bad-words.txt” list)
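The cleaning could look roughly like this with `stringr`; the `sampled` vector comes from the sampling sketch above, and the path to the profanity list is an assumption.

```r
library(dplyr)
library(stringr)
library(tibble)

# Profanity list: one term per line (path assumed)
profanity <- readLines("bad-words.txt", warn = FALSE)

clean_text <- tibble(text = sampled) %>%
  mutate(text = str_to_lower(text),                         # lowercase
         text = str_replace_all(text, "[[:punct:]]", " "),  # drop punctuation
         text = str_replace_all(text, "[[:digit:]]", " "),  # drop numbers
         text = str_squish(text))                           # collapse whitespace

# Profane terms are filtered out at the token level in the next step
```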
We tokenize the cleaned data into unigrams, bigrams, and trigrams, and analyze word frequency.
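A sketch of the tokenization with `tidytext::unnest_tokens`, building on the hypothetical `clean_text` and `profanity` objects above:

```r
library(tidytext)
library(dplyr)

# Unigram counts, with profane terms removed
unigrams <- clean_text %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% profanity) %>%
  count(word, sort = TRUE)

# Bigram and trigram counts; lines with fewer words than the n-gram size
# yield NA tokens, which is why <NA> shows up in the trigram table below
bigrams <- clean_text %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

trigrams <- clean_text %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)
```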
How many unique words cover 50% and 90% of total word usage? The tables below show the word at which cumulative coverage first reaches each threshold.
## # A tibble: 1 × 3
## word n cumulative
## <chr> <int> <dbl>
## 1 life 730 0.501
## # A tibble: 1 × 3
## word n cumulative
## <chr> <int> <dbl>
## 1 yesterdays 9 0.900
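The coverage figures above could be computed along these lines, assuming the frequency-sorted `unigrams` table from the tokenization sketch:

```r
library(dplyr)

coverage <- unigrams %>%
  mutate(cumulative = cumsum(n) / sum(n))

# Word at which cumulative coverage first reaches each threshold
coverage %>% filter(cumulative >= 0.5) %>% slice(1)
coverage %>% filter(cumulative >= 0.9) %>% slice(1)

# Number of unique words required to reach 50% and 90% coverage
sum(coverage$cumulative < 0.5) + 1
sum(coverage$cumulative < 0.9) + 1
```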
## # A tibble: 10 × 2
## bigram n
## <chr> <int>
## 1 of the 4122
## 2 in the 3938
## 3 to the 1988
## 4 on the 1734
## 5 for the 1704
## 6 to be 1425
## 7 and the 1267
## 8 at the 1179
## 9 in a 1065
## 10 with the 1016
## # A tibble: 10 × 2
## trigram n
## <chr> <int>
## 1 <NA> 1207
## 2 one of the 334
## 3 a lot of 263
## 4 the end of 168
## 5 to be a 151
## 6 out of the 138
## 7 some of the 138
## 8 going to be 137
## 9 as well as 136
## 10 it was a 129
We will build a prediction algorithm using an n-gram backoff model, sketched in the code below:

• Trigrams: if the two previous words are known, suggest the most likely third word.
• Bigrams: if only one previous word is known, suggest the most likely next word.
• Unigrams: if there is no match, fall back to the most frequent word.
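A minimal sketch of this backoff logic, assuming the `unigrams`, `bigrams`, and `trigrams` counts from the exploration step; the `predict_next` function and its column names are illustrative rather than the final implementation.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Split n-gram counts into a context prefix and the word to predict;
# NA n-grams (from lines too short to form one) are dropped
trigram_tbl <- trigrams %>%
  filter(!is.na(trigram)) %>%
  separate(trigram, into = c("w1", "w2", "next_word"), sep = " ")

bigram_tbl <- bigrams %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, into = c("w1", "next_word"), sep = " ")

predict_next <- function(phrase) {
  tokens <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
  k <- length(tokens)

  # 1. Trigram match on the last two words
  if (k >= 2) {
    hit <- trigram_tbl %>%
      filter(w1 == tokens[k - 1], w2 == tokens[k]) %>%
      arrange(desc(n)) %>%
      slice(1)
    if (nrow(hit) > 0) return(hit$next_word)
  }

  # 2. Back off to a bigram match on the last word
  hit <- bigram_tbl %>%
    filter(w1 == tokens[k]) %>%
    arrange(desc(n)) %>%
    slice(1)
  if (nrow(hit) > 0) return(hit$next_word)

  # 3. Fall back to the most frequent unigram
  unigrams$word[1]
}

predict_next("one of")  # given the counts above, this would suggest "the"
```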
This will be deployed using a Shiny app, where users type a phrase and receive word suggestions in real time.
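A bare-bones sketch of that interaction in Shiny, reusing the hypothetical `predict_next` function from the previous sketch:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$phrase)            # wait until the user has typed something
    predict_next(input$phrase)   # backoff lookup from the sketch above
  })
}

shinyApp(ui, server)
```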
We’ll use the tidytext, dplyr, and shiny packages and consider performance improvements such as token pre-filtering and hash tables for fast lookup.
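As one illustration of the hash-table idea, a base R environment created with `new.env(hash = TRUE)` gives fast keyed lookups; the sketch below reuses the hypothetical `trigram_tbl` from the backoff sketch and keeps only the top continuation per two-word prefix.

```r
library(dplyr)

# Keep only the single most likely continuation for each two-word prefix
top_trigrams <- trigram_tbl %>%
  group_by(w1, w2) %>%
  slice(which.max(n)) %>%
  ungroup()

# Environments are hashed: key = two-word prefix, value = predicted next word
trigram_lookup <- new.env(hash = TRUE)
for (i in seq_len(nrow(top_trigrams))) {
  assign(paste(top_trigrams$w1[i], top_trigrams$w2[i]),
         top_trigrams$next_word[i],
         envir = trigram_lookup)
}

lookup_trigram <- function(w1, w2) {
  key <- paste(w1, w2)
  if (exists(key, envir = trigram_lookup, inherits = FALSE)) {
    get(key, envir = trigram_lookup)
  } else {
    NA_character_   # no stored prefix: the caller would back off to bigrams
  }
}

lookup_trigram("one", "of")  # "the", given the trigram counts shown earlier
```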