2024-12-17

Introduction

This presentation reports the Exploratory Data Analysis (EDA) conducted for building an “N-gram based Next Word Prediction” model, the final project in the 10-course R for Data Science Specialisation.

This report covers the following phases:

  • Basic Characteristics
  • Data Pre-Processing
  • Word and N-gram Distributions
  • How many words we need to cover 60% of all instances in the language
  • Next Plans

Basic Characteristics of the Data

For the current prototype n-gram model, we use a sample of the data (5,000 lines from each source):

##    Source Lines   Size
## 1   Blogs  5000 1.4 Mb
## 2    News  5000 1.3 Mb
## 3 Twitter  5000 0.7 Mb
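
A minimal sketch of how such a sample can be drawn (the file paths and the fixed sample size of 5,000 lines are assumptions for illustration, not the exact code used):

    # Read each source file and keep a random sample of 5,000 lines.
    set.seed(42)  # for a reproducible sample
    sample_lines <- function(path, n = 5000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, min(n, length(lines)))
    }
    blogs   <- sample_lines("data/en_US.blogs.txt")
    news    <- sample_lines("data/en_US.news.txt")
    twitter <- sample_lines("data/en_US.twitter.txt")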

Pre-Processing

Standard Text Cleaning

As a preprocessing step, all text was cleaned using the quanteda library.
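
A minimal sketch of the cleaning step with quanteda (the exact normalization options are assumptions; blogs, news, and twitter come from the sampling sketch above):

    library(quanteda)
    # Build a corpus from the sampled lines and tokenize,
    # dropping punctuation, numbers, symbols, and URLs.
    corp <- corpus(c(blogs, news, twitter))
    toks <- tokens(corp,
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   remove_url     = TRUE)
    toks <- tokens_tolower(toks)  # lowercase for consistent counting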

Profanity Filtering

## [1] "No. of tokens before 'Profanity Filtering' was 34127"
## [1] "while after it was 33646"
## [1] "Thus, current corpus had 481 profane words."

Word and N-gram Distributions

Distribution of unigrams.

Distribution of bigrams.

Distribution of trigrams.
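
A minimal sketch of how these distributions can be tabulated with quanteda (the top-k cutoff of 20 is an assumption):

    # Count n-gram frequencies and return the k most frequent ones.
    top_ngrams <- function(toks, n, k = 20) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      topfeatures(dfm(ng), k)
    }
    top_ngrams(toks, 1)  # unigram distribution
    top_ngrams(toks, 2)  # bigram distribution
    top_ngrams(toks, 3)  # trigram distribution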

How Many Words We Need to Cover 60% of All Instances in the Language

Based on the following graph and the cumulative percentages computed from a frequency-sorted dictionary, we found:

  • For 50% coverage we need 141-154 words
  • For 60% we need 377-415 words
  • For 75% we need 1509 words
  • For 90% we need 7054-8012 words

The Graph

[Figure: cumulative percentage of word instances covered vs. number of words in the frequency-sorted dictionary]
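
A minimal sketch of the coverage computation behind these figures, assuming toks from the preprocessing step:

    # Sort word frequencies in decreasing order and accumulate coverage.
    freqs   <- sort(colSums(dfm(toks)), decreasing = TRUE)
    cum_cov <- cumsum(freqs) / sum(freqs)
    # Smallest number of top-frequency words reaching coverage p.
    words_needed <- function(p) which(cum_cov >= p)[1]
    sapply(c(0.50, 0.60, 0.75, 0.90), words_needed)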

Next Plans

To build an n-gram language model, we will:

  • Build frequency-count matrices for the n-grams
  • Calculate probability matrices from those counts
  • Build a function to predict the next word based on the probability matrices (see the sketch after this list)
  • Deploy the model on a Shiny server
  • Make the model accessible to anyone through a Shiny app
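
As a preview of the prediction step, here is a minimal sketch of a trigram lookup (tri_freq is an assumed named vector of trigram counts with space-separated names such as "one of the"; the real model will use the probability matrices and handle backoff):

    # Predict the next word as the most frequent trigram continuation
    # of the last two words of the input phrase.
    predict_next <- function(phrase, tri_freq) {
      words  <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
      prefix <- paste0("^", paste(words, collapse = " "), " ")
      hits   <- tri_freq[grepl(prefix, names(tri_freq))]
      if (length(hits) == 0) return(NA_character_)  # no match: back off later
      tail(strsplit(names(which.max(hits)), " ")[[1]], 1)
    }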

See you!

Thank you for your time in reviewing this report.