In this report I briefly summarize the “predicting new word” project. It covers the data processing, the exploratory data analysis steps with some of their results, and the plans for the prediction algorithm.
To prepare the text for exploratory data analysis (EDA) and natural language processing (NLP), the following steps were performed:
The first step was to learn about NLP from various internet sources and to identify the main steps required for this task.
Next, I analysed the sample dataset and gathered some information about the text. The most important findings are listed here:
Some EDA with unigrams, after tokenization:
## Total lines: 333667
## Total words: 7003840
## Unique words: 155104
## Average words per line: 20.99051
## Number of rare words (frequency<3): 103184
## The number of words covering 50% of occurrences: 130
## The number of words covering 90% of occurrences: 6859
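For readers who want to see how statistics of this kind can be derived, here is a minimal Python sketch. It is not the code used to produce the numbers above: the `tokenize` rule (lowercasing and keeping runs of letters and apostrophes), the function names and the three-line sample are all assumptions made for illustration. "Coverage" is read here as the smallest number of most-frequent words whose counts add up to the given share of all word occurrences.

```python
from collections import Counter
import re


def tokenize(line):
    # Assumption: lowercase the line and keep runs of letters/apostrophes as words.
    return re.findall(r"[a-z']+", line.lower())


def items_for_coverage(counts, share):
    """Smallest number of most-frequent items covering `share` of all occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= share * total:
            return rank
    return len(counts)


def unigram_summary(lines, rare_below=3):
    """Collect the same kind of figures as the unigram summary in the report."""
    counts = Counter()
    n_lines = 0
    for line in lines:
        n_lines += 1
        counts.update(tokenize(line))
    total = sum(counts.values())
    return {
        "total_lines": n_lines,
        "total_words": total,
        "unique_words": len(counts),
        "avg_words_per_line": total / n_lines if n_lines else 0.0,
        # Words seen fewer than `rare_below` times (the report uses frequency<3).
        "rare_words": sum(1 for f in counts.values() if f < rare_below),
        "cover_50": items_for_coverage(counts, 0.5),
        "cover_90": items_for_coverage(counts, 0.9),
    }


# Hypothetical toy corpus, only to show the call.
sample = ["the cat sat on the mat", "the dog sat", "a cat"]
summary = unigram_summary(sample)
```

On the real corpus, the same function would be called with an iterator over the lines of the sample file. Sorting by frequency before accumulating is what makes the coverage figures so small relative to the vocabulary size.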
Some EDA with bigrams, after tokenization:
## Total bigrams: 6670185
## Unique bigrams: 1949605
## Number of rare bigrams (frequency<3): 1673516
## The number of bigrams covering 50% of occurrences: 33277
## The number of bigrams covering 90% of occurrences: 1282586
Some EDA with trigrams, after tokenization:
## Total trigrams: 6337766
## Unique trigrams: 4401304
## Number of rare trigrams (frequency<2): 3905008
## The number of trigrams covering 50% of occurrences: 1232420
## The number of trigrams covering 90% of occurrences: 3767527
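The bigram and trigram figures can be obtained in the same way once n-grams are extracted. The sketch below is an illustration in Python, not the original implementation. It assumes that n-grams are formed within each line and never across line boundaries, which would be consistent with the totals above dropping by roughly one per line at each order. The names `ngrams`, `ngram_summary` and `rare_below` are hypothetical. The `rare_below` parameter allows both thresholds that appear in the report (frequency<3 for bigrams, frequency<2 for trigrams).

```python
from collections import Counter


def ngrams(tokens, n):
    # All runs of n consecutive tokens; empty if the line has fewer than n tokens.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def items_for_coverage(counts, share):
    """Smallest number of most-frequent items covering `share` of all occurrences."""
    total = sum(counts.values())
    running = 0
    for rank, (_, freq) in enumerate(counts.most_common(), start=1):
        running += freq
        if running >= share * total:
            return rank
    return len(counts)


def ngram_summary(tokenized_lines, n, rare_below=3):
    """Count n-grams line by line (assumption: no n-gram spans two lines)."""
    counts = Counter()
    for tokens in tokenized_lines:
        counts.update(ngrams(tokens, n))
    return {
        "total": sum(counts.values()),
        "unique": len(counts),
        "rare": sum(1 for f in counts.values() if f < rare_below),
        "cover_50": items_for_coverage(counts, 0.5),
        "cover_90": items_for_coverage(counts, 0.9),
    }


# Hypothetical, already tokenized toy corpus.
toy = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a"]]
bigram_stats = ngram_summary(toy, 2)
trigram_stats = ngram_summary(toy, 3, rare_below=2)
```

The pattern visible in the reported numbers also shows up in the sketch: as the n-gram order grows, the share of unique and rare items rises sharply, so far more distinct n-grams are needed to cover the same fraction of occurrences.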
Based on the EDA and an inspection of the text, the following observations were made:
Plans for modelling: