Input Data

The 12 pincode files contain a total of 1,840,598 non-empty rows, distributed across the files as follows.

Non-empty rows

File                  N
PincodeData_1.csv     178,288
PincodeData_2.csv     81,160
PincodeData_3.csv     79,759
PincodeData_4.csv     118,105
PincodeData_5.csv     158,205
PincodeData_6.csv     196,472
PincodeData_7.csv     206,133
PincodeData_8.csv     159,409
PincodeData_9.csv     132,989
PincodeData_10.csv    132,513
PincodeData_11.csv    209,981
PincodeData_12.csv    187,584
TOTAL                 1,840,598
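
The per-file counts above could be reproduced along the following lines. This is only a sketch: it assumes a row counts as non-empty when at least one field contains something other than whitespace, and that the files are UTF-8 encoded. Neither assumption is stated in this report, and the helper names are hypothetical.

```python
import csv
import io


def count_non_empty_rows(lines):
    """Count CSV rows that have at least one non-blank field (assumed definition)."""
    reader = csv.reader(lines)
    return sum(1 for row in reader if any(field.strip() for field in row))


def count_file(path):
    """Count non-empty rows in one file; UTF-8 encoding is an assumption."""
    with open(path, newline="", encoding="utf-8") as handle:
        return count_non_empty_rows(handle)


# Hypothetical in-memory sample standing in for one pincode file:
# two populated rows, one row of empty fields, one blank line.
sample = io.StringIO("a@x.com,560001\n,\n\nb@y.com,560002\n")
print(count_non_empty_rows(sample))
```

Whether a header row should be included in the count is not stated here, so the sketch counts every populated row.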

Preliminary Cleaning

Basic cleaning found 557 badly formed addresses, of which 254 have been cleaned up and saved in a separate column.
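
The cleaning rules themselves are not described in this report. As an illustration only, and assuming the addresses in question are email addresses, a check-and-repair step might look like the sketch below. The pattern, the repair rules, and the function names are all assumptions, not the procedure actually used.

```python
import re

# Deliberately simple pattern (an assumption, not a full RFC 5322 validator).
EMAIL_RE = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")


def is_well_formed(address):
    """Return True if the address matches the simple pattern above."""
    return bool(EMAIL_RE.match(address))


def try_clean(address):
    """Return a repaired address, or None if these simple repairs do not help."""
    candidate = re.sub(r"\s+", "", address)       # drop stray whitespace
    candidate = candidate.strip(".,;:<>\"'")      # drop surrounding punctuation
    candidate = candidate.replace(",", ".")       # comma typed in place of a dot
    return candidate if is_well_formed(candidate) else None


print(try_clean(" raj@example,com "))
print(try_clean("not-an-email"))
```

Returning the repaired value separately, rather than overwriting the input, mirrors the choice to keep cleaned addresses in their own column so the original data stays intact.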

In addition, there are 407 very short email addresses that occur a total of 2,059 times. These are likely placeholders rather than real addresses and are best excluded. The 10 most frequent very short email addresses in the whole database are as follows.

Very short email addresses (top 10)

Email           No. of times repeated
na@gmail.com    704
no@gmail.com    143
Hr@medapp.in    139
its@ycook.in    133
Info@avfs.in    47
gsdo@nne.com    37
raj.m@vmc.in    35
Care@neuu.in    30
Hr@xng.co.in    30
zu@gmail.com    27
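
A tally like the one above could be produced roughly as follows. The length cutoff is not stated in this report; each of the ten addresses listed is 12 characters long, so the sketch assumes a cutoff of 12 characters. The constant and function names are hypothetical.

```python
from collections import Counter

MAX_LEN = 12  # assumed cutoff, inferred from the listed addresses


def short_email_counts(emails, max_len=MAX_LEN):
    """Count occurrences of each address no longer than max_len characters."""
    counts = Counter(e.strip() for e in emails)
    return Counter({e: n for e, n in counts.items() if e and len(e) <= max_len})


# Hypothetical sample data.
sample = ["na@gmail.com", "na@gmail.com", "no@gmail.com",
          "someone.longer@example.com"]
short = short_email_counts(sample)

print(len(short))             # distinct very short addresses
print(sum(short.values()))    # total occurrences
print(short.most_common(10))  # most frequent first
```

Case is left untouched here, since the table above lists addresses such as Hr@medapp.in with their original capitalisation.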

Summary

Overall, only 1,068,619 unique email addresses are available in the database from the 12 files, a clean percentage of 58%.
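
The figures in this report can be cross-checked with a few lines of arithmetic. The sketch assumes the clean percentage is the number of unique email addresses divided by the total number of non-empty rows; that definition is not stated here, but it is consistent with the 58% quoted.

```python
# Non-empty row counts for PincodeData_1.csv to PincodeData_12.csv.
file_rows = [178_288, 81_160, 79_759, 118_105, 158_205, 196_472,
             206_133, 159_409, 132_989, 132_513, 209_981, 187_584]

total_rows = 1_840_598
unique_emails = 1_068_619

# The per-file counts should add up to the reported total.
assert sum(file_rows) == total_rows

# Assumed definition: unique addresses as a share of non-empty rows.
clean_percent = 100 * unique_emails / total_rows
print(f"{clean_percent:.1f}%")
```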