(Getting and cleaning)
The first step of the exploratory analysis is to draw a sample from each of the data sets and clean it. The following iterative process was used:
- Reading a batch of lines from the original file (a random draw decides whether to use it, which produces the sample)
- Tokenizing each line, getting each separate "word"
- Cleaning each "word" using regular expressions (removing numbers, special characters...)
- Unifying lower and upper case
- Reading the next batch of lines (back to the first step)
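
The loop above can be sketched in Python as follows. This is only an illustration, not the code used for the analysis: the function names, the batch size, the keep probability, the per-line random draw and the exact cleaning pattern (keep letters and inner apostrophes only) are all assumptions made for the example.

```python
import random
import re
from itertools import islice

# Assumed cleaning rule: drop everything except letters and apostrophes.
NOT_LETTERS = re.compile(r"[^A-Za-z']+")


def clean_token(token):
    """Strip digits and special characters, then unify case."""
    cleaned = NOT_LETTERS.sub("", token)
    return cleaned.lower().strip("'")


def sample_and_clean(lines, chunk_size=1000, keep_prob=0.1, seed=42):
    """Read lines in batches, keep a random subset, return cleaned tokens."""
    rng = random.Random(seed)       # fixed seed so the sample is reproducible
    source = iter(lines)
    sample = []
    while True:
        chunk = list(islice(source, chunk_size))   # read one batch of lines
        if not chunk:
            break                                  # end of input reached
        for line in chunk:
            if rng.random() >= keep_prob:          # random draw: skip this line
                continue
            tokens = [clean_token(t) for t in line.split()]  # tokenize + clean
            tokens = [t for t in tokens if t]      # drop tokens emptied by cleaning
            if tokens:
                sample.append(tokens)
    return sample
```

Because an open file object is an iterator over its lines, it can be passed directly as `lines`, so the whole file never has to be held in memory at once.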