16/3/2022
Introduction
- Natural Language Processing (NLP) is a field that deals with text and speech.
- It is a branch of Artificial Intelligence (AI) that gives computers the ability to understand and process text and speech data and to imitate human behavior. [1]
- This project marks the end of the Data Science Specialization program, which is taught by professors from Johns Hopkins University via Coursera.
- The data for this project was provided by SwiftKey; many thanks for the opportunity and cooperation during this project.
Coding process
- The goal of this project was to write code that:
- Takes in the data provided (here only the data sets for the English language were chosen),
- Performs pre-processing, which includes cleaning of the data,
- Performs exploratory analysis to get an idea of the structure of the data,
- Implements the Katz back-off model (here with Good-Turing discounting) [2]; a sketch of the prediction step follows this list,
- Develops a web application that predicts the next word in a sentence provided by the user. Please find the link to the web application here: WordPred
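Below is a minimal sketch of the back-off prediction step, assuming n-gram tables uni, bi, and tri have already been built as data.table objects with Good-Turing-discounted probabilities in a prob column (all names and columns here are illustrative, not the actual project code). For brevity, the alpha back-off weights of full Katz back-off are left out; hits at the highest available order are simply ranked by their discounted probability.

```r
library(data.table)

# Assumed pre-computed tables (names/columns are illustrative):
#   uni: word, prob        -- Good-Turing-discounted unigram probabilities
#   bi:  w1, word, prob    -- discounted bigram probabilities
#   tri: w1, w2, word, prob
setkey(bi,  w1)        # keyed columns enable fast binary-search joins
setkey(tri, w1, w2)

predict_next <- function(w1, w2, n = 3) {
  # 1. Look for trigram continuations of the last two words.
  hits <- tri[.(w1, w2), nomatch = NULL]
  if (nrow(hits) > 0) return(head(hits[order(-prob), word], n))
  # 2. Back off to bigram continuations of the last word
  #    (full Katz back-off would weight these by alpha(w1, w2)).
  hits <- bi[.(w2), nomatch = NULL]
  if (nrow(hits) > 0) return(head(hits[order(-prob), word], n))
  # 3. Fall back to the most probable unigrams.
  head(uni[order(-prob), word], n)
}

predict_next("i", "love")   # e.g. c("you", "it", "this")
```

Keying the tables turns each lookup into a binary search rather than a full vector scan, which is what keeps the prediction step responsive.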
Challenges
- One of the biggest challenges faced during this project was the hardware limitation of my personal computer, which continuously crashed due to excessive processing and CPU usage.
- For the exploratory analysis, only a part of the data could be used, sampled so that it still produced a fairly accurate picture of the structure of the data.
- It was not possible to include the stopwords, because doing so increased the memory usage beyond what my particular machine could handle.
- It was important to constantly save intermediate results and clean up the environment, which slowed down the coding process and the flow of thought (see the sketch after this list).
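In practice this meant a save/clean/reload loop like the following (file and object names are illustrative):

```r
# Persist an intermediate result, free the memory, reload it later.
saveRDS(trigram_counts, "trigram_counts.rds")
rm(trigram_counts)
gc()   # trigger garbage collection so R releases the freed memory
trigram_counts <- readRDS("trigram_counts.rds")
```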
End product - Shiny web application
- The web application is the end product of this project.
- Note that all of the data is used, which causes the application to load more slowly.
- The application is fairly simple; how fancy it should look is a matter of taste.
- The application includes some instructions on how to use it, the main prediction tab, and finally some visualizations in the last tab (a minimal sketch of this layout follows this list).
- The idea is to develop the application further as time allows and to fine-tune the prediction model when possible.
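Here is a minimal sketch of the three-tab layout described above (tab titles and input IDs are made up, and predict_next() is the hypothetical helper from the earlier sketch, not the actual app code):

```r
library(shiny)

ui <- navbarPage(
  "WordPred",
  tabPanel("Instructions",
           p("Type a phrase and the app will suggest the next word.")),
  tabPanel("Prediction",
           textInput("phrase", "Your sentence:"),
           textOutput("nextword")),
  tabPanel("Visualization",
           plotOutput("freqplot"))   # n-gram frequency plots go here
)

server <- function(input, output) {
  output$nextword <- renderText({
    # Split the input into words and keep the last two as context.
    words <- tail(strsplit(tolower(input$phrase), "\\s+")[[1]], 2)
    req(length(words) == 2)
    paste(predict_next(words[1], words[2]), collapse = ", ")
  })
}

shinyApp(ui, server)
```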
Acknowledgments
- As this was a completely new area for me, it was challenging to grasp the theory, let alone the coding process.
- I would really like to give credit to @thachngoctran (GitHub) [3,4] for a nice and structured explanation of how the theory can be implemented in R.
- I would also like to thank the authors behind ‘data.table’, which really increased the speed of string searching by using secondary indices and auto indexing [5,6]; a tiny sketch of these features follows this list.
- Finally, I am really grateful for the support and patience of my family.
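For the curious, here is a tiny illustration of the two data.table features credited above, on a made-up table:

```r
library(data.table)

ngrams <- data.table(prefix = c("i love", "i love", "to be"),
                     word   = c("you", "it", "honest"),
                     count  = c(50L, 30L, 10L))

# Auto indexing: the first '==' query on an unindexed column builds an
# index, so repeated lookups use binary search instead of a vector scan.
ngrams[prefix == "i love"]

# A secondary index can also be set explicitly, without reordering rows:
setindex(ngrams, prefix)
ngrams["i love", on = "prefix"]   # fast lookup via the secondary index
```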
References