Capstone project week 2

Overview

In this report I briefly summarize the “predicting the next word” project. It covers the data processing, the exploratory data analysis steps with some of their results, and the plans for the prediction algorithm.

Pre-processing

To prepare the text for exploratory data analysis (EDA) and natural language processing (NLP), the following steps were performed:

Loading the three datasets
Merging the three datasets into one sample set
Randomly subsampling the sample dataset (10% of the data)
Conversion to lowercase
Profanity removal
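
The steps above can be sketched roughly as follows. This is a minimal Python illustration, not the code used for the report: the function name `preprocess`, the fixed random seed, the simple regular-expression tokenizer, and the choice to drop individual profane words (rather than whole lines) are all assumptions made for the example.

```python
import random
import re

def preprocess(datasets, profanity, fraction=0.1, seed=42):
    """Merge datasets, subsample lines, lowercase, and drop profane words.

    datasets  : list of lists of text lines (one list per source file)
    profanity : set of lowercase words to remove (hypothetical word list)
    fraction  : share of lines to keep (0.1 = the 10% used in the report)
    """
    # Merge all datasets into one list of lines.
    merged = [line for dataset in datasets for line in dataset]

    # Randomly subsample without replacement; the seed makes it repeatable.
    rng = random.Random(seed)
    sample = rng.sample(merged, int(len(merged) * fraction))

    cleaned = []
    for line in sample:
        # Lowercase, then keep only runs of letters and apostrophes.
        words = re.findall(r"[a-z']+", line.lower())
        # Remove words that appear in the profanity list.
        cleaned.append(" ".join(w for w in words if w not in profanity))
    return cleaned
```

Filtering at the word level keeps the rest of the line usable, at the cost of creating n-grams that did not occur in the original text; dropping the whole line would be the stricter alternative.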

Exploratory Data Analysis (EDA)

The first step was to read about NLP in various internet sources and identify the main steps required for this task.

Next I analysed the sample dataset and gathered some basic information about the text. The most important findings are listed below:

Some EDA with unigrams, after tokenization:

## Total lines: 333667

## Total words: 7003840

## Unique words: 155104

## Average words per line: 20.99051

## Number of rare words (frequency<3): 103184

## The number of words covering 50% of occurrences: 130

## The number of words covering 90% of occurrences: 6859
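
Coverage figures like the ones above can be computed by sorting the items by frequency and accumulating counts until the target share is reached. The following is a minimal Python sketch under that assumption; the helper names `coverage_count` and `rare_count` are hypothetical and the toy sentence is only for illustration.

```python
from collections import Counter

def coverage_count(counts, target):
    """Smallest number of most-frequent items whose counts reach `target`
    (a share between 0 and 1) of all occurrences."""
    total = sum(counts.values())
    running = 0
    for n, (_, count) in enumerate(counts.most_common(), start=1):
        running += count
        if running >= target * total:
            return n
    return len(counts)

def rare_count(counts, threshold=3):
    """Number of distinct items that occur fewer than `threshold` times."""
    return sum(1 for count in counts.values() if count < threshold)

# Toy example: word counts for a single short sentence.
counts = Counter("the cat sat on the mat the end".split())
print("Unique words:", len(counts))
print("Words covering 50%:", coverage_count(counts, 0.5))
print("Rare words (frequency<3):", rare_count(counts))
```

The same two functions work unchanged on bigram or trigram counts, because they only look at the frequency values.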

Some EDA with bigrams, after tokenization:

## Total bigrams: 6670185

## Unique bigrams: 1949605

## Number of rare bigrams (frequency<3): 1673516

## The number of bigrams covering 50% of occurrences: 33277

## The number of bigrams covering 90% of occurrences: 1282586

Some EDA with trigrams, after tokenization:

## Total trigrams: 6337766

## Unique trigrams: 4401304

## Number of rare trigrams (frequency<2): 3905008

## The number of trigrams covering 50% of occurrences: 1232420

## The number of trigrams covering 90% of occurrences: 3767527
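
For reference, bigram and trigram counts of this kind can be produced with a sliding window over the tokens of each line. The sketch below is a Python illustration that assumes whitespace tokenization and assumes that n-grams do not span line boundaries; the function names are hypothetical. Under that assumption a line with k tokens yields k − n + 1 n-grams (for k ≥ n), which is roughly consistent with the totals reported above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all consecutive n-token tuples from a list of tokens."""
    return list(zip(*(tokens[i:] for i in range(n))))

def count_ngrams(lines, n):
    """Count n-grams line by line, so no n-gram crosses a line boundary."""
    counts = Counter()
    for line in lines:
        counts.update(ngrams(line.split(), n))
    return counts

# Toy example with two short lines.
lines = ["i like green tea", "i like coffee"]
bigram_counts = count_ngrams(lines, 2)
print("Total bigrams:", sum(bigram_counts.values()))
print("Unique bigrams:", len(bigram_counts))
```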

Findings and next step for modelling

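Based on the EDA and on analysing the text, the following observations were made:

The data set is quite large, so I used only 10% of it for the analysis and the prediction because of computational and runtime constraints. The sample still contains millions of words.
A relatively small number of unique words is enough to cover 50% or 90% of all word occurrences (130 and 6,859 of 155,104 unique words). For bigrams this holds for 50% coverage (33,277 of 1,949,605 unique bigrams), whereas 90% coverage requires about two thirds of the unique bigrams (1,282,586).
Removing words and bigrams with very low frequency should be considered: most unique words (103,184 of 155,104) and unique bigrams (1,673,516 of 1,949,605) occur fewer than three times.
The profanity filter is necessary, but its word list should be kept up to date with new words.
I did not encounter any memory or computational problems with the smaller data set.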
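
The idea of removing low-frequency items can be sketched as a simple threshold filter that also reports how much of the text the remaining items still cover. This is a Python illustration only; the function name `prune_rare` and the threshold of 3 (matching the "frequency<3" definition of rare words used above) are assumptions.

```python
from collections import Counter

def prune_rare(counts, min_count=3):
    """Keep items seen at least `min_count` times.

    Returns the pruned counts and the share of all occurrences
    that the kept items still account for.
    """
    kept = Counter({item: c for item, c in counts.items() if c >= min_count})
    retained_share = sum(kept.values()) / sum(counts.values())
    return kept, retained_share

# Toy example: two frequent words, two rare ones.
counts = Counter({"the": 5, "a": 3, "cat": 2, "mat": 1})
kept, share = prune_rare(counts)
print(len(kept), "items kept, covering", round(share, 3), "of occurrences")
```

Checking the retained share before pruning makes the trade-off between model size and coverage explicit.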

Plans for modelling:

Several model types are available, but I would first like to use a simple n-gram model with 1-, 2- and 3-grams.
There are several options for handling unseen n-grams. I would like to use Kneser-Ney smoothing, because it is a somewhat more advanced method and should give more accurate results.
Finally, I want to emphasize that building such a model often involves compromises between quality and the time and resources invested in it. At first I want to build a simple and relatively efficient model.
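
To make the smoothing idea concrete, here is a minimal Python sketch of interpolated Kneser-Ney for bigrams only. It is an illustration under stated assumptions, not the planned implementation: the class name `KneserNeyBigram`, the fixed discount of 0.75, and the restriction to bigrams are all choices made for the example. The probability of word w after context v is the discounted bigram count divided by the context count, plus a back-off weight times the continuation probability of w, where the continuation probability is the number of distinct words seen before w divided by the number of distinct bigrams.

```python
from collections import Counter, defaultdict

class KneserNeyBigram:
    """Interpolated Kneser-Ney bigram model (sketch, fixed discount)."""

    def __init__(self, sentences, discount=0.75):
        self.d = discount
        self.bigrams = Counter()
        for tokens in sentences:
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.context_total = Counter()     # occurrences of v as a bigram context
        self.followers = defaultdict(set)  # distinct words seen after v
        self.preceders = defaultdict(set)  # distinct words seen before w
        for (v, w), count in self.bigrams.items():
            self.context_total[v] += count
            self.followers[v].add(w)
            self.preceders[w].add(v)
        self.n_bigram_types = len(self.bigrams)

    def continuation(self, w):
        """Share of distinct bigrams that end in w."""
        return len(self.preceders.get(w, ())) / self.n_bigram_types

    def prob(self, v, w):
        """Smoothed probability of w given the previous word v."""
        total = self.context_total.get(v, 0)
        if total == 0:
            # Unseen context: fall back to the continuation probability.
            return self.continuation(w)
        discounted = max(self.bigrams.get((v, w), 0) - self.d, 0) / total
        backoff_weight = self.d * len(self.followers[v]) / total
        return discounted + backoff_weight * self.continuation(w)

    def predict(self, v, k=3):
        """Return the k most probable next words after v."""
        candidates = list(self.preceders)
        return sorted(candidates, key=lambda w: self.prob(v, w), reverse=True)[:k]

# Toy usage with three tokenized sentences.
model = KneserNeyBigram([["i", "like", "tea"],
                         ["i", "like", "coffee"],
                         ["you", "like", "tea"]])
print(model.predict("like", k=2))
```

The discount removes a little probability mass from every seen bigram and redistributes it according to how many different contexts a word appears in, so an unseen bigram still receives a non-zero probability. Extending the sketch to trigrams would mean applying the same scheme recursively, with the bigram model as the lower-order distribution.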

Bence Szikora

2025-03-17