Language Modelling and Word Prediction

Olga Alieva
March 22, 2022

Project Outline

For more details on the model, see https://rpubs.com/locusclassicus/captone_final.

  • This project presents a word prediction algorithm and an online app for the Coursera Data Science Capstone course.
  • We use an n-gram model to estimate the probability of the last word of an n-gram given the previous words, and to assign probabilities to entire sequences.
  • The project is implemented entirely in R, using its NLP infrastructure.
  • The project uses the SwiftKey corpus, which includes different types of texts (blogs, news, Twitter) in several languages (English, German, Russian, and Finnish). For our prediction app, we used only the English corpus.
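
The n-gram estimate mentioned above can be written out as follows. This is a sketch that assumes plain relative-frequency (maximum-likelihood) estimates and a context of three previous words, matching the largest n-gram order (4) used in this project; the slides do not say whether any smoothing was applied. The symbol C(·), the count of a word sequence in the training set, is introduced here for illustration.

```latex
\[
P(w_n \mid w_{n-3}, w_{n-2}, w_{n-1}) \approx
  \frac{C(w_{n-3}\, w_{n-2}\, w_{n-1}\, w_n)}{C(w_{n-3}\, w_{n-2}\, w_{n-1})}
\]
\[
P(w_1, \dots, w_m) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-3}, w_{i-2}, w_{i-1})
\]
```

The first line scores a candidate next word; the second applies the chain rule with the same truncated context to score a whole sequence.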

Method and Implementation

  • As our training set, we used only one tenth of the dataset, sampled proportionally from each text type (blogs, news, tweets).
  • The training set was tokenized into 2-grams, 3-grams and 4-grams using the tidytext package.
  • Stop words, profanity and non-word characters were filtered out.
  • A frequency table of unique and repeated n-grams (n from 2 to 4) was created. The next slide gives an idea of the clean data we used.
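
The steps listed above (tokenizing, filtering, counting) can be sketched in a few lines. The project itself was written in R with tidytext; the Python below is only an illustration of the idea, not the author's code. The word lists, the function names and the three sample lines are hypothetical, and the sketch assumes that filtering happens before n-grams are formed, which the slides do not specify.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists; a real run would use full lists.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is"}
PROFANITY = {"darn"}


def clean_tokens(text):
    """Lowercase the text, keep alphabetic tokens, drop stop words and profanity."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOP_WORDS and w not in PROFANITY]


def ngrams(tokens, n):
    """Return every contiguous n-gram of a token list as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_counts(lines, n_values=(2, 3, 4)):
    """Build one frequency table per n-gram order; n-grams never span two lines."""
    counts = {n: Counter() for n in n_values}
    for line in lines:
        tokens = clean_tokens(line)
        for n in n_values:
            counts[n].update(ngrams(tokens, n))
    return counts


# Made-up input lines, used only to exercise the functions.
sample = [
    "Happy Cinco de Mayo to the team",
    "Happy Cinco de Mayo!",
    "Extra virgin olive oil is the best",
]
counts = ngram_counts(sample)
print(counts[4].most_common(1))  # [(('happy', 'cinco', 'de', 'mayo'), 2)]
```

Keeping one Counter per order mirrors the separate 2-, 3- and 4-gram tables described above and makes it easy to inspect the most frequent entries of each.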

Sample Data

      word1          word2      word3      word4     n
1      <NA>           <NA>       <NA>       <NA> 21783
2    martin         luther       king         jr    32
3       dow          jones industrial    average    31
4     happy          cinco         de       mayo    26
5      swag           swag       swag       swag    24
6       bmw        service     center california    22
7    senate      president    stephen    sweeney    22
8     extra         virgin      olive        oil    21
9    amazon             eu associates programmes    19
10   design         custom  paintball     jersey    19
11       eu     associates programmes   designed    19
12      add         boston        add     boston    18
13     vice      president        joe      biden    17
14 national transportation     safety      board    16
15    prime       minister      david    cameron    16
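
A frequency table like the one above can drive next-word prediction with a simple lookup. The sketch below is an assumption about how such a table might be queried, since the slides do not describe how the app ranks candidates: it tries the longest available context first and falls back to shorter ones. The function names are hypothetical; the six rows are copied from the sample table, and 2-gram and 3-gram rows would be added to the same dictionary in the same way.

```python
from collections import Counter, defaultdict

# A few 4-gram rows copied from the sample table (word1..word4 -> n).
NGRAM_COUNTS = {
    ("martin", "luther", "king", "jr"): 32,
    ("dow", "jones", "industrial", "average"): 31,
    ("happy", "cinco", "de", "mayo"): 26,
    ("extra", "virgin", "olive", "oil"): 21,
    ("vice", "president", "joe", "biden"): 17,
    ("prime", "minister", "david", "cameron"): 16,
}


def build_index(ngram_counts):
    """Map each context (all words but the last) to a Counter of next words."""
    index = defaultdict(Counter)
    for gram, n in ngram_counts.items():
        index[gram[:-1]][gram[-1]] += n
    return index


def predict(index, words, k=3, max_context=3):
    """Return up to k likely next words, trying the longest context first."""
    tokens = [w.lower() for w in words]
    for size in range(min(max_context, len(tokens)), 0, -1):
        candidates = index.get(tuple(tokens[-size:]))
        if candidates:
            return [w for w, _ in candidates.most_common(k)]
    return []  # no context of any length was seen in training


index = build_index(NGRAM_COUNTS)
print(predict(index, ["Extra", "virgin", "olive"]))  # ['oil']
```

With only 4-gram rows indexed, a shorter input such as a single word returns an empty list; loading the 2-gram and 3-gram tables is what makes the fallback useful.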

Application