Data Science Capstone Presentation

2024-06-18

Overview

For this Capstone project, I created a Shiny app that predicts the next word based on the user's input text. The training data can be downloaded from: https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

The dataset contains texts from blogs, news and Twitter in four languages. For this project, only the English texts were analyzed.

Due to the large size of the files, I sampled 1% of the original dataset to create the training corpus for prediction.
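The sampling step can be illustrated with a minimal Python sketch. It is not the author's R code: the function name `sample_lines`, the fixed seed, and the line-by-line Bernoulli sampling (which keeps roughly, not exactly, 1% of the lines) are all assumptions made for illustration.

```python
import random

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction`.

    Hypothetical helper: the sampling method of the original project
    is not described, so this is only one plausible approach.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return [line for line in lines if rng.random() < fraction]

# Toy stand-in for the lines of one of the English text files.
demo_lines = [f"line {i}" for i in range(10_000)]
sample = sample_lines(demo_lines)
print(len(sample), "of", len(demo_lines), "lines kept")
```

Fixing the seed makes the training corpus reproducible across runs, which is useful when the n-gram tables are rebuilt later.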

Description of Prediction Algorithm – Analysis

Preprocessing

To prepare the data for analysis, I converted all letters to lower case and removed numbers, punctuation, stop words and extra white space.
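The preprocessing steps listed above could look roughly like the following Python sketch. The function `preprocess` and the tiny `STOP_WORDS` set are hypothetical; the stop-word list actually used in the project is not given here, and the original implementation is in R.

```python
import re

# Tiny illustrative stop-word list (an assumption; the real list is not given).
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}

def preprocess(text, stop_words=STOP_WORDS):
    """Return the cleaned tokens of `text` as a list of words."""
    text = text.lower()                     # convert to lower case
    text = re.sub(r"[0-9]+", " ", text)     # remove numbers
    text = re.sub(r"['’]", "", text)        # drop apostrophes: "don't" -> "dont"
    text = re.sub(r"[^a-z\s]", " ", text)   # remove remaining punctuation/non-letters
    tokens = text.split()                   # splitting also collapses extra white space
    return [t for t in tokens if t not in stop_words]

print(preprocess("The 3 quick   Foxes, in 2024!"))
```

Dropping apostrophes rather than replacing them with a space keeps contractions as single tokens; this is a design choice of the sketch, not something stated in the project.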

Tokenization and N-gram Analysis

I then tokenized the text into words and created bigrams, trigrams and quadgrams. The frequencies of these n-grams were computed and stored for use in the prediction model and Shiny app. The code for preprocessing the text and the n-gram analysis can be found here: https://github.com/xhu0925/DSCapstone/blob/main/CreateSample.R
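To make the n-gram step concrete, here is a small Python sketch that builds n-grams from a token list and counts them. The helper names `ngrams` and `ngram_counts` and the toy sentence are assumptions for illustration; the linked R script may structure this differently.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(tokens, n):
    """Count how often each n-gram occurs in the token list."""
    return Counter(ngrams(tokens, n))

# Toy token list standing in for the preprocessed corpus.
tokens = "happy new year happy new day".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
quadgrams = ngram_counts(tokens, 4)
print(bigrams.most_common(1))
```

A token list shorter than `n` simply yields no n-grams, so short lines need no special handling.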

Description of Prediction Algorithm – Model Building

The final prediction model is based on the n-gram frequencies:

If the input contains three or more words, the last three words are used to form a trigram. The model looks up this trigram in the previously created quadgram frequency table and returns the fourth word of the matching quadgram as the prediction. If the input contains two words or one word, the same principle applies: the trigram or bigram frequency table, respectively, is used to generate the prediction. If no prediction is found, the model returns “the”.

The prediction model can be found here: https://github.com/xhu0925/DSCapstone/blob/main/Model.R
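The lookup logic described above can be sketched in Python as follows. This is an illustration, not a port of Model.R. Two points are assumptions of the sketch: when several n-grams match, the most frequent continuation is returned, and when the longer lookup finds nothing, the model falls back to the next shorter table before returning “the”. The names `build_table` and `predict_next` are hypothetical.

```python
from collections import Counter, defaultdict

def build_table(token_lists, n):
    """Map each (n-1)-word prefix to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for tokens in token_lists:
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            table[prefix][tokens[i + n - 1]] += 1
    return table

def predict_next(words, tables, default="the"):
    """Predict the next word; `tables` maps n (4, 3, 2) to a prefix table."""
    for n in (4, 3, 2):                    # quadgram, then trigram, then bigram
        if len(words) >= n - 1:
            prefix = tuple(words[-(n - 1):])
            followers = tables[n].get(prefix)
            if followers:
                # Assumption: pick the most frequent continuation.
                return followers.most_common(1)[0][0]
    return default                         # nothing matched: return "the"

# Toy corpus of already-preprocessed token lists.
corpus = [
    ["happy", "new", "year", "everyone"],
    ["happy", "new", "year", "everyone"],
    ["happy", "new", "year", "friends"],
    ["new", "day"],
]
tables = {n: build_table(corpus, n) for n in (2, 3, 4)}
print(predict_next(["happy", "new", "year"], tables))
```

Indexing each table by prefix keeps the lookup to one dictionary access per n-gram order, which matters for an interactive app.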

Description of the App

The Shiny app can be accessed here: https://xhu0925.shinyapps.io/DScapstonePrediction/

The user enters text in the side panel, and the main panel displays the predicted word. A help button is also available to provide instructions.

Example: If the user enters “hello”, the predicted next word is “world”.