The 12 pincode files contain a total of 1,840,598 non-empty rows, distributed across the files as follows.
File | Non-empty rows |
---|---|
PincodeData_1.csv | 178,288 |
PincodeData_2.csv | 81,160 |
PincodeData_3.csv | 79,759 |
PincodeData_4.csv | 118,105 |
PincodeData_5.csv | 158,205 |
PincodeData_6.csv | 196,472 |
PincodeData_7.csv | 206,133 |
PincodeData_8.csv | 159,409 |
PincodeData_9.csv | 132,989 |
PincodeData_10.csv | 132,513 |
PincodeData_11.csv | 209,981 |
PincodeData_12.csv | 187,584 |
TOTAL | 1,840,598 |
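
A per-file count like the one above could be reproduced with a short script. The following is a minimal sketch, not necessarily the method used for this report. It assumes each file is a CSV with a header row, and it treats a row as non-empty when at least one field contains non-whitespace text. The function name `count_non_empty_rows` is hypothetical.

```python
import csv
import io

def count_non_empty_rows(lines, has_header=True):
    """Count CSV rows in which at least one field has non-whitespace text."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    return sum(1 for row in reader if any(field.strip() for field in row))

# Small in-memory sample: two data rows, one row of empty fields, one blank line.
sample = io.StringIO("email,pincode\na@b.com,560001\n,\n\nc@d.org,\n")
print(count_non_empty_rows(sample))
```

For a real file, the same function can be given an open file handle, for example `with open("PincodeData_1.csv", newline="") as f: count_non_empty_rows(f)`.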
Basic cleaning found 557 badly formed addresses, of which 254 have been cleaned up and saved in a separate column.
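
The report does not say which cleaning rules were applied. As an illustration only, and assuming the badly formed addresses are email addresses, a basic repair step might look like the sketch below. The pattern `EMAIL_RE` and the helper `clean_email` are hypothetical, and the pattern is deliberately simple rather than a full RFC 5322 validator.

```python
import re

# Assumed minimal shape of a valid address: exactly one "@", no whitespace,
# and at least one dot in the domain part.
EMAIL_RE = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")

def clean_email(raw):
    """Return a repaired address, or None if it cannot be repaired."""
    candidate = re.sub(r"\s+", "", raw)        # drop embedded whitespace
    candidate = candidate.strip(".,;:<>\"'")   # drop stray leading/trailing punctuation
    return candidate if EMAIL_RE.fullmatch(candidate) else None

print(clean_email(" user@example.com, "))
print(clean_email("not-an-email"))
```

Addresses for which the helper returns `None` correspond to the ones that could not be repaired. The repaired values would go into a separate column so that the original data stays untouched.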
In addition, 407 very short email addresses occur a total of 2,059 times. These are likely placeholder values and are better excluded. The 10 most frequent of them across the whole database are as follows.
Very short email address (top 10) | No. of times repeated |
---|---|
na@gmail.com | 704 |
no@gmail.com | 143 |
Hr@medapp.in | 139 |
its@ycook.in | 133 |
Info@avfs.in | 47 |
gsdo@nne.com | 37 |
raj.m@vmc.in | 35 |
Care@neuu.in | 30 |
Hr@xng.co.in | 30 |
zu@gmail.com | 27 |
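
A frequency table like this can be produced with a simple counter. The sketch below is a hedged illustration. The cut-off `MAX_SHORT_LEN = 12` is an assumption inferred from the fact that every address listed above is 12 characters long, not a rule stated in the report. Counting is case-sensitive, matching the mixed-case entries in the table. The function name `short_email_counts` is hypothetical.

```python
from collections import Counter

MAX_SHORT_LEN = 12  # assumed cut-off; all addresses in the table have 12 characters

def short_email_counts(emails, max_len=MAX_SHORT_LEN):
    """Count occurrences of each non-blank address of at most max_len characters."""
    stripped = (e.strip() for e in emails)
    return Counter(s for s in stripped if s and len(s) <= max_len)

emails = ["na@gmail.com", "na@gmail.com", "no@gmail.com",
          "someone.longer@example.com", "   "]
counts = short_email_counts(emails)
print(counts.most_common(10))  # most frequent short addresses first
```

Here `len(counts)` gives the number of distinct short addresses and `sum(counts.values())` gives their total number of occurrences, the two figures quoted in the text.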
Overall, only 1,068,619 unique email addresses are available in the database across the 12 files, a clean percentage of 58%.
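
The figures can be cross-checked with a few lines of arithmetic. This sketch assumes the clean percentage is the number of unique email addresses divided by the total number of non-empty rows. The report does not state the definition, but it is consistent with the 58% quoted.

```python
# Non-empty row counts per file, copied from the first table.
rows_per_file = [
    178288, 81160, 79759, 118105, 158205, 196472,
    206133, 159409, 132989, 132513, 209981, 187584,
]

total_rows = sum(rows_per_file)
unique_emails = 1068619

# Assumed definition: unique addresses as a share of all non-empty rows.
clean_percent = 100 * unique_emails / total_rows

print(len(rows_per_file), total_rows)  # 12 1840598
print(round(clean_percent))            # 58
```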