2024-12-17

Introduction

This presentation reports the Exploratory Data Analysis (EDA) conducted for building an “N-gram based Next Word Prediction” model, the final project in the 10-course R for Data Science Specialisation.

This report covers the following phases:

  • Basic Characteristics
  • Data Pre-Processing
  • Word and N-gram Distributions
  • How many words we need to cover 60% of all instances in the language
  • Next Plans

Basic Characteristics of the Data

For the current prototype n-gram model, we use a sample of the data (5,000 lines from each source):

##    Source Lines   Size
## 1   Blogs  5000 1.4 Mb
## 2    News  5000 1.3 Mb
## 3 Twitter  5000 0.7 Mb
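
A minimal sketch of how such a sample can be drawn (the file paths and the fixed sample size of 5,000 lines are assumptions for illustration, not the exact code used):

    # Read each source file and keep a random sample of 5,000 lines.
    set.seed(42)  # for a reproducible sample
    sample_lines <- function(path, n = 5000) {
      lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
      sample(lines, min(n, length(lines)))
    }
    blogs   <- sample_lines("data/en_US.blogs.txt")
    news    <- sample_lines("data/en_US.news.txt")
    twitter <- sample_lines("data/en_US.twitter.txt")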

Pre-Processing

Standard Text Cleaning

As a preprocessing step, all text was cleaned using the quanteda library.
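
A minimal sketch of the cleaning step with quanteda (the exact normalization options are assumptions; blogs, news, and twitter come from the sampling sketch above):

    library(quanteda)
    # Build a corpus from the sampled lines and tokenize,
    # dropping punctuation, numbers, symbols, and URLs.
    corp <- corpus(c(blogs, news, twitter))
    toks <- tokens(corp,
                   remove_punct   = TRUE,
                   remove_numbers = TRUE,
                   remove_symbols = TRUE,
                   remove_url     = TRUE)
    toks <- tokens_tolower(toks)  # lowercase for consistent counting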

Profanity Filtering

## [1] "No. of tokens before 'Profanity Filtering' was 34127"
## [1] "while after it was 33646"
## [1] "Thus, current corpus had 481 profane words."

Word and N-gram Distributions

Distribution of unigrams.

Distribution of bigrams.

Distribution of trigrams.
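
A minimal sketch of how these distributions can be tabulated with quanteda (the top-k cutoff of 20 is an assumption):

    # Count n-gram frequencies and return the k most frequent ones.
    top_ngrams <- function(toks, n, k = 20) {
      ng <- tokens_ngrams(toks, n = n, concatenator = " ")
      topfeatures(dfm(ng), k)
    }
    top_ngrams(toks, 1)  # unigram distribution
    top_ngrams(toks, 2)  # bigram distribution
    top_ngrams(toks, 3)  # trigram distribution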

How Many Words We Need to Cover 60% of All Instances in the Language

Based on the following graph and the cumulative percentages computed from a frequency-sorted dictionary, we found:

  • For 50% coverage we need 141-154 words
  • For 60% we need 377-415 words
  • For 75% we need 1509 words
  • For 90% we need 7054-8012 words

The Graph

[Figure: cumulative percentage of word instances covered vs. number of words in the frequency-sorted dictionary]
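
A minimal sketch of the coverage computation behind these figures, assuming toks from the preprocessing step:

    # Sort word frequencies in decreasing order and accumulate coverage.
    freqs   <- sort(colSums(dfm(toks)), decreasing = TRUE)
    cum_cov <- cumsum(freqs) / sum(freqs)
    # Smallest number of top-frequency words reaching coverage p.
    words_needed <- function(p) which(cum_cov >= p)[1]
    sapply(c(0.50, 0.60, 0.75, 0.90), words_needed)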

Next Plans

To build an n-gram language model, we will:

  • Build frequency-count matrices for the n-grams
  • Calculate probability matrices from those counts
  • Build a function to predict the next word based on the probability matrices (see the sketch after this list)
  • Deploy the model on a Shiny server
  • Make the model accessible to anyone through a Shiny app
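
As a preview of the prediction step, here is a minimal sketch of a trigram lookup (tri_freq is an assumed named vector of trigram counts with space-separated names such as "one of the"; the real model will use the probability matrices and handle backoff):

    # Predict the next word as the most frequent trigram continuation
    # of the last two words of the input phrase.
    predict_next <- function(phrase, tri_freq) {
      words  <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)
      prefix <- paste0("^", paste(words, collapse = " "), " ")
      hits   <- tri_freq[grepl(prefix, names(tri_freq))]
      if (length(hits) == 0) return(NA_character_)  # no match: back off later
      tail(strsplit(names(which.max(hits)), " ")[[1]], 1)
    }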

See you!

Thank you for your time in reviewing this report.