- Download and extract the text files
- Select the English-language text files for basic analysis
- Clean the corpus and perform basic exploratory analysis
- Sample data from each text file to perform analysis
- Build an n-gram model from the sampled data using the RWeka package (NGramTokenizer)
- Save the unique unigrams, bigrams and trigrams along with their frequencies
- Construct a predictive model that suggests the most likely next words (I used Kneser-Ney smoothing for the next-word prediction)

The data file built from the unigram, bigram and trigram frequencies is 64KB, compared with 563MB for the original data file.
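
The sampling, n-gram counting and smoothing steps above can be sketched roughly as follows. This is a minimal Python illustration, not the R/RWeka code used in this project: the function names (`sample_lines`, `ngram_counts`, `kneser_ney_bigram`, `predict_next`), the simple regex tokenizer, the fixed discount of 0.75 and the restriction to bigrams are all assumptions made to keep the example short.

```python
import random
import re
from collections import Counter, defaultdict


def sample_lines(lines, fraction, seed=0):
    """Keep each line independently with probability `fraction`."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < fraction]


def tokenize(line):
    """Lowercase the line and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())


def ngram_counts(lines, n):
    """Count every n-gram (as a tuple of tokens) within each line."""
    counts = Counter()
    for line in lines:
        toks = tokenize(line)
        for i in range(len(toks) - n + 1):
            counts[tuple(toks[i:i + n])] += 1
    return counts


def kneser_ney_bigram(bigrams, discount=0.75):
    """Build P(word | prev) with interpolated Kneser-Ney over bigram counts.

    `discount` is an assumed fixed value, not one tuned on real data.
    Returns the probability function and the vocabulary it covers.
    """
    context_total = Counter()      # total count of bigrams starting with prev
    followers = defaultdict(set)   # distinct words seen after prev
    preceders = defaultdict(set)   # distinct words seen before word
    for (prev, word), c in bigrams.items():
        context_total[prev] += c
        followers[prev].add(word)
        preceders[word].add(prev)
    bigram_types = len(bigrams)
    vocab = sorted(preceders)

    def p_continuation(word):
        # Share of distinct bigram types that end in `word`.
        return len(preceders.get(word, ())) / bigram_types

    def prob(word, prev):
        total = context_total.get(prev, 0)
        if total == 0:
            # Unseen context: fall back to the continuation probability.
            return p_continuation(word)
        discounted = max(bigrams.get((prev, word), 0) - discount, 0) / total
        backoff_weight = discount * len(followers[prev]) / total
        return discounted + backoff_weight * p_continuation(word)

    return prob, vocab


def predict_next(prob, vocab, prev, k=3):
    """Return the k highest-probability words following `prev`."""
    return sorted(vocab, key=lambda w: prob(w, prev), reverse=True)[:k]


# Tiny made-up corpus, used only to exercise the functions.
demo_lines = [
    "The cat sat on the mat",
    "The cat ate the fish",
    "A dog sat on the rug",
]
demo_bigrams = ngram_counts(sample_lines(demo_lines, 1.0), 2)
demo_prob, demo_vocab = kneser_ney_bigram(demo_bigrams)
print(predict_next(demo_prob, demo_vocab, "the"))
```

Storing only the unique n-grams and their counts, as in `ngram_counts`, is what keeps the saved model far smaller than the raw text. A fuller version would extend the same recursion to trigrams and back off to the bigram estimate.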