Data Science Capstone project

2022-07-03

Introduction to the project

The purpose of the Data Science Capstone project was to develop a data product that uses natural language processing to predict the next word a user may want to type. The final product is a Shiny application. The project was the final part of the 10-course Data Science track offered by Johns Hopkins University on Coursera and was carried out as an industry partnership with SwiftKey. The task was to clean and analyze a large corpus of unstructured text, build a word prediction model, and use it in a web application.

Steps in the analysis. Download and load the dataset into R. Clean the data

The first step was data cleaning, which means transforming the raw text in the corpus into a format more suitable for automated processing.
The tm package provides numerous functions for such transformations, such as converting to lower case, stripping whitespace, and removing common stopwords.
The next step was the creation of a term-document matrix, which records the words or phrases in a corpus and their frequencies. Word clouds were built for a better understanding of the data sets.
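The cleaning and counting steps can be illustrated with a minimal sketch. It uses base R rather than the tm package so that it runs without extra dependencies; the toy documents, the three-word stopword list, and the helper name `clean_text` are assumptions made for illustration and are not taken from the project.

```r
# Toy documents standing in for the real corpus (hypothetical data).
docs <- c("Die  Bundeskanzlerin Angela Merkel  spricht.",
          "Die Kanzlerin Angela Merkel und die CDU")

# Tiny illustrative stopword list; the list used in the project is not shown.
stop_words <- c("die", "der", "und")

# Lower-case, drop punctuation, split on runs of whitespace, remove stopwords.
clean_text <- function(x, stop_words) {
  x <- tolower(x)
  x <- gsub("[[:punct:]]+", " ", x)
  words <- strsplit(trimws(x), "[[:space:]]+")
  lapply(words, function(w) w[!(w %in% stop_words)])
}

tokens <- clean_text(docs, stop_words)

# Term-document matrix: one row per term, one column per document.
terms <- sort(unique(unlist(tokens)))
tdm <- vapply(
  tokens,
  function(w) as.integer(table(factor(w, levels = terms))),
  integer(length(terms))
)
rownames(tdm) <- terms
colnames(tdm) <- paste0("doc", seq_along(tokens))

# Overall term frequencies, most frequent first.
term_freq <- sort(rowSums(tdm), decreasing = TRUE)
print(tdm)
print(term_freq)
```

With tm, the same transformations would typically go through `tm_map()` and the matrix through `TermDocumentMatrix()`; the base R version only makes the individual steps visible.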

Word cloud for the blogs category

Prediction model. N-Grams

The predictive model uses known content to predict unknown content. The data is freely available to download in several languages. The download included three collections of text documents (blogs, news feeds, and tweets) in four languages (German, English, Finnish, and Russian), of which only the German collections were used. The full files are too large to work with directly, so the project used a smaller corpus taken from this data.
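One common way to obtain a smaller corpus is to keep a random fraction of the lines of each file. The sketch below shows that idea only. The report does not say how the reduced corpus was built, so the function name `sample_lines`, the 5% fraction, and the seed are all assumptions. A temporary file stands in for the real text files.

```r
# Keep a random fraction of the lines of a text file (hypothetical helper).
sample_lines <- function(path, fraction, seed = 1) {
  lines <- readLines(path, encoding = "UTF-8", warn = FALSE)
  set.seed(seed)  # fixed seed so the sample is reproducible
  keep <- sample.int(length(lines), size = ceiling(fraction * length(lines)))
  lines[sort(keep)]  # preserve the original line order
}

# Demonstration on a temporary file with 1000 dummy lines.
tmp <- tempfile(fileext = ".txt")
writeLines(paste("zeile", 1:1000), tmp)

small <- sample_lines(tmp, fraction = 0.05)
length(small)  # 50
```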

N-grams are words, or combinations of words, grouped by the number of words in the combination. In outline: unigrams are single words, bigrams are two-word sequences, trigrams are three-word sequences, and so forth.
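To make the definition concrete, here is a small base R sketch that builds n-grams from a token vector and counts them. The helper `make_ngrams` and the toy token sequence are illustrative assumptions. The project itself may have used a tokenizer from a package instead.

```r
# Build all n-grams (as space-joined strings) from a vector of tokens.
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy token sequence (hypothetical data).
tokens <- c("bundeskanzlerin", "angela", "merkel", "trifft", "angela", "merkel")

unigrams <- make_ngrams(tokens, 1)
bigrams  <- make_ngrams(tokens, 2)
trigrams <- make_ngrams(tokens, 3)

# Frequency table of bigrams, most frequent first.
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

In this toy sequence the bigram "angela merkel" occurs twice and every other bigram once, so it ends up at the top of the table.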

Trigrams

## # A tibble: 6 × 4
##   word1           word2     word3      n
##   <chr>           <chr>     <chr>  <int>
## 1 bundeskanzlerin angela    merkel   220
## 2 präsident       barack    obama    205
## 3 new             york      times    154
## 4 angela          merkel    cdu      152
## 5 us              präsident barack   131
## 6 kanzlerin       angela    merkel   118

Shiny App. How to use

To use the app at the link below, type 1, 2, or 3 words. If you enter 1 word (for example, präsident), the app looks up the most frequent bigrams (2-word sequences) that begin with that word in the text file. In this example, the prediction is barack. If the user types präsident barack, the app looks up the most frequent trigrams and predicts obama. If the user enters 3 words, the app looks up the most frequent quadgrams.

https://diyanananova.shinyapps.io/apps/
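The lookup behind such an app might be sketched as follows. This is not the app's actual code: the function `predict_next`, the table layout, and the counts are invented for illustration, and only the example words come from the text. The sketch uses at most the last three words of the input and returns `NA` when no matching n-gram exists, both of which are assumptions.

```r
# Hypothetical n-gram table: context words, following word, and count.
# The counts are made up for illustration.
ngram_table <- data.frame(
  prefix    = c("präsident", "präsident", "präsident barack", "us präsident barack"),
  next_word = c("barack", "der", "obama", "obama"),
  n         = c(5L, 2L, 4L, 3L),
  stringsAsFactors = FALSE
)

# Return the most frequent word that follows the typed words.
predict_next <- function(input, ngram_table, max_context = 3) {
  words <- strsplit(tolower(trimws(input)), "[[:space:]]+")[[1]]
  words <- tail(words, max_context)        # use at most the last 3 words
  prefix <- paste(words, collapse = " ")
  hits <- ngram_table[ngram_table$prefix == prefix, ]
  if (nrow(hits) == 0) return(NA_character_)  # no matching n-gram
  hits$next_word[which.max(hits$n)]
}

predict_next("präsident", ngram_table)            # "barack" (bigram lookup)
predict_next("präsident barack", ngram_table)     # "obama"  (trigram lookup)
predict_next("us präsident barack", ngram_table)  # "obama"  (quadgram lookup)
```

A one-word input matches bigram prefixes, a two-word input trigram prefixes, and a three-word input quadgram prefixes, which mirrors the behaviour described for the app.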