Introduction

This report is part of the Data Science Capstone Project. The main objective is to demonstrate that the data has been successfully downloaded, loaded, cleaned, and explored. Ultimately, we will build a predictive model for text input, using natural language processing (NLP) techniques. The datasets include text from blogs, news articles, and Twitter in English.

Data Summary

We begin by reading the raw data and summarizing basic statistics: the number of lines, the total word count, and the average number of words per line.
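
A minimal sketch of how these counts can be produced, assuming the standard en_US files from the course corpus (the paths are illustrative and may need adjusting):

```r
library(readr)
library(stringr)
library(dplyr)
library(purrr)

# Illustrative paths; adjust to wherever the corpus was unzipped
files <- c(Blogs   = "final/en_US/en_US.blogs.txt",
           News    = "final/en_US/en_US.news.txt",
           Twitter = "final/en_US/en_US.twitter.txt")

summarize_file <- function(path) {
  lines <- read_lines(path)
  words <- str_count(lines, "\\S+")  # whitespace-delimited tokens per line
  tibble(Lines = length(lines),
         Words = sum(words),
         AvgWordsPerLine = round(sum(words) / length(lines), 1))
}

# One row per source, named after the vector's names
imap_dfr(files, ~ mutate(summarize_file(.x), Source = .y, .before = 1))
```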

## # A tibble: 3 × 4
##   Source    Lines    Words AvgWordsPerLine
##   <chr>     <int>    <int>           <dbl>
## 1 Blogs    899288 37334131            41.5
## 2 News    1010242 34372530            34.0
## 3 Twitter 2360148 30373583            12.9

Sampling and Preprocessing

We randomly sampled 10,000 lines from each file to reduce memory usage and processing time.
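
One way to draw that sample (the seed and paths are assumptions for illustration, not taken from the original analysis):

```r
set.seed(1234)  # assumed seed, for reproducibility

sample_lines <- function(path, n = 10000) {
  lines <- readr::read_lines(path)
  sample(lines, min(n, length(lines)))
}

blogs_sample   <- sample_lines("final/en_US/en_US.blogs.txt")
news_sample    <- sample_lines("final/en_US/en_US.news.txt")
twitter_sample <- sample_lines("final/en_US/en_US.twitter.txt")
```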

We cleaned the sampled data by:
• Converting all text to lowercase
• Removing punctuation, numbers, and extra whitespace
• Filtering profanity against a word list (“bad-words.txt”)
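
A sketch of these cleaning steps, building on the sampled objects above. Dropping any line that contains a profane word is one possible policy (removing only the offending tokens is another), and the word list is assumed to contain no regex metacharacters:

```r
library(dplyr)
library(stringr)

profanity <- readr::read_lines("bad-words.txt")

clean_text <- function(lines) {
  lines %>%
    str_to_lower() %>%
    str_replace_all("[[:punct:]]|[[:digit:]]", " ") %>%  # drop punctuation and numbers
    str_squish()                                         # collapse extra whitespace
}

# Drop any line containing a word from the profanity list
remove_profanity <- function(lines, badwords) {
  pattern <- str_c("\\b(", str_c(badwords, collapse = "|"), ")\\b")
  lines[!str_detect(lines, pattern)]
}

blogs_clean   <- remove_profanity(clean_text(blogs_sample), profanity)
news_clean    <- remove_profanity(clean_text(news_sample), profanity)
twitter_clean <- remove_profanity(clean_text(twitter_sample), profanity)
```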

Exploratory Data Analysis

We tokenize the cleaned data into unigrams, bigrams, and trigrams and analyze word frequencies.
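
With tidytext, the tokenization can be sketched as follows, assuming the cleaned samples from the previous step:

```r
library(tidytext)
library(dplyr)

# Combine the cleaned samples into one tidy data frame
corpus <- tibble(text = c(blogs_clean, news_clean, twitter_clean))

unigrams <- corpus %>% unnest_tokens(word,    text)
bigrams  <- corpus %>% unnest_tokens(bigram,  text, token = "ngrams", n = 2)
trigrams <- corpus %>% unnest_tokens(trigram, text, token = "ngrams", n = 3)

word_freq <- unigrams %>% count(word, sort = TRUE)
```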

Coverage

How many unique words are needed to cover 50% and 90% of all word instances? The two tables below show the word at which cumulative coverage first crosses each threshold.
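
The coverage calculation behind these tables can be sketched as a cumulative sum over the sorted word frequencies:

```r
coverage <- word_freq %>%
  mutate(cumulative = cumsum(n) / sum(n))

# The row index of the first word past each threshold is the number of
# unique words required for that level of coverage
coverage %>% filter(cumulative >= 0.5) %>% slice(1)
coverage %>% filter(cumulative >= 0.9) %>% slice(1)
```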

## # A tibble: 1 × 3
##   word      n cumulative
##   <chr> <int>      <dbl>
## 1 life    730      0.501
## # A tibble: 1 × 3
##   word           n cumulative
##   <chr>      <int>      <dbl>
## 1 yesterdays     9      0.900

Bigrams and Trigrams
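
The tables below list the ten most frequent bigrams and trigrams. Note that unnest_tokens() emits NA for lines with fewer words than the n-gram size, which is why <NA> tops the raw trigram counts; these rows should be filtered out before modeling. A sketch of the counting step:

```r
bigrams %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  slice_head(n = 10)

trigrams %>%
  count(trigram, sort = TRUE) %>%  # NA rows left in to match the raw output below
  slice_head(n = 10)
```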

## # A tibble: 10 × 2
##    bigram       n
##    <chr>    <int>
##  1 of the    4122
##  2 in the    3938
##  3 to the    1988
##  4 on the    1734
##  5 for the   1704
##  6 to be     1425
##  7 and the   1267
##  8 at the    1179
##  9 in a      1065
## 10 with the  1016
## # A tibble: 10 × 2
##    trigram         n
##    <chr>       <int>
##  1 <NA>         1207
##  2 one of the    334
##  3 a lot of      263
##  4 the end of    168
##  5 to be a       151
##  6 out of the    138
##  7 some of the   138
##  8 going to be   137
##  9 as well as    136
## 10 it was a      129

Modeling Plan

We will build a prediction algorithm using an n-gram backoff model:
• Trigrams: if the two previous words are known, suggest the most likely third word.
• Bigrams: if only the previous word matches, suggest the most likely next word.
• Unigrams: if nothing matches, fall back to the most frequent word overall.

A minimal sketch of this backoff logic appears below.
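
The sketch builds lookup tables from the n-grams tokenized earlier; tri_tab, bi_tab, top_word, and predict_next are names introduced here for illustration:

```r
library(dplyr)
library(stringr)
library(tidyr)

# n-gram lookup tables, sorted by count so the first match is the most likely
tri_tab <- trigrams %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ")

bi_tab <- bigrams %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE) %>%
  separate(bigram, into = c("w1", "w2"), sep = " ")

top_word <- word_freq$word[1]

predict_next <- function(phrase) {
  words <- str_split(str_to_lower(str_squish(phrase)), " ")[[1]]
  k <- length(words)
  if (k >= 2) {                       # trigram level: match the last two words
    hit <- filter(tri_tab, w1 == words[k - 1], w2 == words[k])
    if (nrow(hit) > 0) return(hit$w3[1])
  }
  if (k >= 1) {                       # bigram level: match the last word
    hit <- filter(bi_tab, w1 == words[k])
    if (nrow(hit) > 0) return(hit$w2[1])
  }
  top_word                            # unigram fallback
}
```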

The model will be deployed as a Shiny app in which users type a phrase and receive next-word suggestions in real time.
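
A bare-bones version of such an app, assuming the predict_next() function sketched above:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$phrase)          # wait until the user has typed something
    predict_next(input$phrase)
  })
}

shinyApp(ui, server)
```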

We’ll use the tidytext, dplyr, and shiny packages and consider performance improvements such as token pre-filtering and hash tables for fast lookup; one possible hash-table approach is sketched below.
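
As an example of the hash-table idea, R environments provide hashed, average O(1) lookup. The sketch below keys the best trigram continuation by its two-word context, reusing the tri_tab table from the backoff sketch above:

```r
library(dplyr)

# Best continuation per two-word context; tri_tab is sorted by count,
# so the first row per group is the most likely third word
tri_best <- tri_tab %>%
  group_by(w1, w2) %>%
  slice_head(n = 1) %>%
  ungroup()

# R environments with hash = TRUE act as hash maps
tri_env <- new.env(hash = TRUE, size = nrow(tri_best))
for (i in seq_len(nrow(tri_best))) {
  assign(paste(tri_best$w1[i], tri_best$w2[i]), tri_best$w3[i], envir = tri_env)
}

get0("one of", envir = tri_env)  # returns the stored word, or NULL if unseen
```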

Next Steps