16/3/2022

Introduction

  • Natural Language Processing (NLP) is a field that deals with text and speech.
  • It is a branch of Artificial Intelligence (AI) that gives computers the ability to understand and process text and voice data and to imitate human behavior.[1]
  • This project marks the end of the Data Science Specialization program, which is taught by professors from Johns Hopkins University via Coursera.
  • The data for this project was provided by SwiftKey; big thanks to them for the opportunity and cooperation during this project.

Coding process

  • The goal of this project was to write code that:
  1. Takes in the data provided (here, only the data sets for the English language were chosen),
  2. Performs pre-processing, including cleaning of the data,
  3. Does some exploratory analysis to get an idea of the structure of the data (a sketch of steps 2-3 follows this list),
  4. Implements the Katz Back-Off model (here with Good-Turing discounting) [2]; a second sketch below illustrates the back-off idea,
  5. Develops a web application that predicts the next word in a sentence provided by the user. Please find the link to the web application here: WordPred
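
To make steps 2-3 concrete, below is a minimal sketch in R of the kind of cleaning and n-gram counting involved; the function names (clean_text, count_ngrams) and the toy input are illustrative, not the project's actual code.

    library(data.table)

    # Illustrative cleaning: lower-case, keep only letters/apostrophes/spaces,
    # and collapse repeated whitespace.
    clean_text <- function(lines) {
      lines <- tolower(lines)
      lines <- gsub("[^a-z' ]", " ", lines)
      lines <- gsub("\\s+", " ", lines)
      trimws(lines)
    }

    # Illustrative exploration: frequency table of n-grams. For simplicity the
    # tokens of all lines are flattened, so n-grams may cross line boundaries.
    count_ngrams <- function(lines, n = 1L) {
      tokens <- unlist(strsplit(lines, " ", fixed = TRUE))
      parts <- lapply(seq_len(n) - 1L,
                      function(k) tokens[seq_len(length(tokens) - n + 1L) + k])
      dt <- data.table(ngram = do.call(paste, parts))
      dt[, .N, by = ngram][order(-N)]
    }

    lines <- clean_text(c("The quick brown fox!", "the quick red fox."))
    count_ngrams(lines, n = 2L)   # most frequent bigrams first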
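And for step 4, a heavily simplified sketch of the back-off idea: if the history word was observed, use discounted bigram counts; otherwise back off to the unigram distribution. The toy tables and the fixed discount d are placeholders; the full Katz model derives the discount and the back-off weight from Good-Turing count-of-counts [2].

    library(data.table)

    # Toy frequency tables standing in for the ones built from the corpus.
    bigrams  <- data.table(w1 = c("the", "the", "a"),
                           w2 = c("quick", "lazy", "quick"),
                           N  = c(5L, 2L, 1L))
    unigrams <- data.table(w1 = c("the", "quick", "lazy", "a", "dog"),
                           N  = c(7L, 6L, 2L, 1L, 3L))
    setkey(bigrams, w1)

    predict_next <- function(history, d = 0.5, top = 3L) {
      seen <- bigrams[.(history), nomatch = NULL]
      if (nrow(seen) > 0L) {
        # Discounted maximum-likelihood estimate for observed continuations.
        seen[, prob := (N - d) / sum(N)]
        return(head(seen[order(-prob), .(word = w2, prob)], top))
      }
      # Back off: renormalised unigram probabilities.
      head(unigrams[order(-N), .(word = w1, prob = N / sum(N))], top)
    }

    predict_next("the")   # observed history: discounted bigram estimates
    predict_next("dog")   # unseen history: backs off to unigrams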

Challenges

  • One of the biggest challenges faced during this project was the physical limitation of my personal computer, which continuously crashed due to excessive processing and CPU usage.
  • For the exploratory analysis, only a part of the data could be used, chosen so that it still produced a fairly accurate picture of the structure of the data.
  • It was not possible to include the stopwords, because that increased the memory usage beyond what my particular machine could handle.
  • It was important to constantly save the outcomes in order to clean up the environment, which slowed the coding process and the flow of thought (see the sketch after this list).
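
A sketch of that save-and-clear pattern, with bigram_counts as a placeholder for whatever intermediate object was taking up memory:

    # Persist the intermediate result, drop it from the workspace, reclaim
    # memory, and reload it only when the next step needs it.
    saveRDS(bigram_counts, "bigram_counts.rds")
    rm(bigram_counts)
    gc()
    bigram_counts <- readRDS("bigram_counts.rds")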

End product - Shiny web application

  • The web application is the end product of this project.
  • Note that all of the data is used, which causes the application to load more slowly.
  • The application is fairly simple, and the degree of fanciness is relative.
  • The application includes some instructions on how to use it; the main part is the prediction tab, and the last tab offers some visualizations (a minimal layout sketch follows this list).
  • The idea is to develop the application further as time allows and to fine-tune the prediction model when possible.
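
For illustration, a minimal sketch of that three-tab layout in Shiny; predict_next() is the placeholder model from the back-off sketch above, and the plot is a stand-in for the actual visualizations:

    library(shiny)

    ui <- fluidPage(
      titlePanel("WordPred"),
      tabsetPanel(
        tabPanel("Instructions",
                 p("Type a phrase and the app suggests the next word.")),
        tabPanel("Prediction",
                 textInput("phrase", "Your sentence:"),
                 tableOutput("suggestions")),
        tabPanel("Visualization", plotOutput("freqPlot"))
      )
    )

    server <- function(input, output) {
      output$suggestions <- renderTable({
        req(input$phrase)
        # Placeholder: feed the last word of the input to the prediction model.
        words <- strsplit(tolower(trimws(input$phrase)), "\\s+")[[1]]
        predict_next(tail(words, 1))
      })
      output$freqPlot <- renderPlot({
        barplot(c(the = 7, quick = 6, dog = 3),
                main = "Top unigrams (illustrative)")
      })
    }

    shinyApp(ui, server)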

Acknowledgment

  • As this was a completely new area for me, it was challenging to grasp the theory, let alone the coding process.
  • I would really like to give credit to @thachngoctran (GitHub) [3,4] for a nice and structured explanation of how the theory can be implemented in R.
  • I would also like to thank the authors behind ‘data.table’, which really increased the speed of string searching through secondary indices and auto indexing [5,6]; a short sketch follows this list.
  • Finally, I am really grateful for and appreciative of the support and patience of my family.
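
A short sketch of the two data.table features in question (the table and column names are illustrative):

    library(data.table)

    dt <- data.table(w1 = c("the", "a", "the"), w2 = c("cat", "dog", "dog"))

    # Secondary index: leaves the row order alone but makes repeated
    # lookups on w1 use binary search instead of a full vector scan.
    setindex(dt, w1)
    dt["the", on = "w1"]

    # Auto indexing: the first filter with == builds an index automatically,
    # so later == / %in% filters on w1 reuse it.
    dt[w1 == "the"]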

References