Data Loading

Loading data and converting strings to Factors

library(data.table)  # provides fread()
system.time(soil.data <- fread("haryana_soil_nutrient.csv", header = TRUE, stringsAsFactors = TRUE))
##    user  system elapsed 
##   1.946   0.093   2.043

Basic Information

Structure of data

str(soil.data)
## Classes 'data.table' and 'data.frame':   685480 obs. of  17 variables:
##  $ DistrictId            : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ DistrictName          : Factor w/ 21 levels "Ambala","Bhiwani",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ BlockId               : int  447 447 447 447 447 447 447 447 447 447 ...
##  $ BlockName             : Factor w/ 120 levels "Adampur","Agroha",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ SampleNo              : Factor w/ 661872 levels "HR56946/2016-17/10849052",..: 13866 15536 15537 15538 15539 15540 15541 15542 15543 15544 ...
##  $ SoilPh                : num  8 8.7 9.1 8.9 8.5 8.8 8.6 8.5 8.7 8.4 ...
##  $ ElectricalConductivity: num  0.27 0.15 0.93 0.14 0.18 0.16 0.17 0.2 0.18 0.17 ...
##  $ OrganicCarbon         : num  0.51 0.24 0.31 0.21 0.22 0.3 0.27 0.25 0.34 0.24 ...
##  $ Nitrogen              : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ Phosphorous           : num  3.24 21.01 28.87 27.55 23.68 ...
##  $ Potassium             : num  48.6 733.4 666 658.3 623.4 ...
##  $ Sulphur               : num  291 410 364 407 389 ...
##  $ Zinc                  : num  0.47 2.93 2.79 3.19 2.79 2.63 2.56 2.83 2.51 3.3 ...
##  $ Iron                  : num  20.2 44.8 44.5 52.4 40.8 ...
##  $ Copper                : num  2.95 3.93 4.13 4.22 3.91 3.73 3.52 3.91 3.54 4.54 ...
##  $ Magnesium             : num  1.43 2.28 2.37 2.13 1.98 1.65 1.58 1.83 2.3 2.46 ...
##  $ Boron                 : num  NA NA NA NA NA NA NA NA NA NA ...
##  - attr(*, ".internal.selfref")=<externalptr>

Summary of data

summary(soil.data)
##    DistrictId         DistrictName       BlockId      
##  Min.   :69.00   Fatehabad  : 75797   Min.   : 447.0  
##  1st Qu.:74.00   Karnal     : 75215   1st Qu.: 483.0  
##  Median :78.00   Sonipat    : 53964   Median : 515.0  
##  Mean   :77.97   Kurukshetra: 51495   Mean   : 615.4  
##  3rd Qu.:81.00   Sirsa      : 50544   3rd Qu.: 539.0  
##  Max.   :89.00   Hisar      : 46473   Max.   :6778.0  
##                  (Other)    :331992                   
##         BlockName                          SampleNo          SoilPh      
##  Bawani Khera: 28483   HR58985/2016-17/25359166:     2   Min.   : 0.010  
##  Nilokheri   : 25072   HR58985/2016-17/25359252:     2   1st Qu.: 7.580  
##  Bhattu Kalan: 22144   HR58985/2016-17/25359388:     2   Median : 7.810  
##  Fatehabad   : 20148   HR58985/2016-17/25359514:     2   Mean   : 7.992  
##  Baragudha   : 19004   HR58985/2016-17/25359600:     2   3rd Qu.: 8.100  
##  Dabwali     : 18666   HR58985/2016-17/25359691:     2   Max.   :96.000  
##  (Other)     :551963   (Other)                 :685468   NA's   :6319    
##  ElectricalConductivity OrganicCarbon        Nitrogen     
##  Min.   :0.000          Min.   : 0.0000   Min.   :   0.1  
##  1st Qu.:0.340          1st Qu.: 0.2500   1st Qu.:  61.4  
##  Median :0.520          Median : 0.3200   Median :  77.9  
##  Mean   :0.678          Mean   : 0.4949   Mean   :  82.6  
##  3rd Qu.:0.800          3rd Qu.: 0.3900   3rd Qu.:  97.1  
##  Max.   :5.000          Max.   :97.1400   Max.   :9714.0  
##  NA's   :7917           NA's   :1093      NA's   :659898  
##   Phosphorous        Potassium          Sulphur            Zinc        
##  Min.   :   0.00   Min.   :   0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:   4.90   1st Qu.:  93.39   1st Qu.: 19.80   1st Qu.:  0.750  
##  Median :   8.70   Median : 172.00   Median : 44.48   Median :  1.370  
##  Mean   :  16.70   Mean   : 238.89   Mean   : 71.58   Mean   :  3.423  
##  3rd Qu.:  14.49   3rd Qu.: 326.17   3rd Qu.: 92.98   3rd Qu.:  2.520  
##  Max.   :9890.00   Max.   :9980.00   Max.   :999.00   Max.   :996.000  
##  NA's   :2639      NA's   :2643      NA's   :30037    NA's   :5764     
##       Iron            Copper         Magnesium           Boron       
##  Min.   :  0.00   Min.   :  0.0    Min.   :  0.000   Min.   :  0.0   
##  1st Qu.:  4.29   1st Qu.:  0.6    1st Qu.:  2.440   1st Qu.:  0.3   
##  Median :  7.77   Median :  1.1    Median :  4.740   Median :  1.4   
##  Mean   : 13.42   Mean   :  2.9    Mean   :  9.157   Mean   : 18.0   
##  3rd Qu.: 14.06   3rd Qu.:  2.1    3rd Qu.:  9.090   3rd Qu.: 10.1   
##  Max.   :999.00   Max.   :984.7    Max.   :999.000   Max.   :937.6   
##  NA's   :5647     NA's   :352638   NA's   :9963      NA's   :680507

Missing Data

Find missing values in each column

sapply(soil.data, function(x){sum(is.na(x))})
##             DistrictId           DistrictName                BlockId 
##                      0                      0                      0 
##              BlockName               SampleNo                 SoilPh 
##                      0                      0                   6319 
## ElectricalConductivity          OrganicCarbon               Nitrogen 
##                   7917                   1093                 659898 
##            Phosphorous              Potassium                Sulphur 
##                   2639                   2643                  30037 
##                   Zinc                   Iron                 Copper 
##                   5764                   5647                 352638 
##              Magnesium                  Boron 
##                   9963                 680507

Find the percentage of missing values in each column

rows_tot <- nrow(soil.data)
sapply(soil.data, function(x){round(100*sum(is.na(x))/rows_tot, digits = 2)})
##             DistrictId           DistrictName                BlockId 
##                   0.00                   0.00                   0.00 
##              BlockName               SampleNo                 SoilPh 
##                   0.00                   0.00                   0.92 
## ElectricalConductivity          OrganicCarbon               Nitrogen 
##                   1.15                   0.16                  96.27 
##            Phosphorous              Potassium                Sulphur 
##                   0.38                   0.39                   4.38 
##                   Zinc                   Iron                 Copper 
##                   0.84                   0.82                  51.44 
##              Magnesium                  Boron 
##                   1.45                  99.27

Boron and Nitrogen are each missing in more than 95% of rows (99.27% and 96.27%), and Copper is missing in 51.44%. The gap in Nitrogen is the real problem.

NOTE: Nitrogen is one of the most important components of soil, so its missing data will be a problem for the rest of this analysis.
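The screening step above (compute the missing share per column, then pick out the columns above a cutoff) can be sketched in base R as follows. This is a minimal illustration on a made-up data frame: the name `toy`, its columns and the 50% cutoff are assumptions for the example, not part of the soil dataset.

```r
# Hypothetical stand-in for soil.data; the values are invented for illustration
toy <- data.frame(
  a = c(1, NA, 3, 4),    # 1 of 4 missing
  b = c(NA, NA, NA, 4),  # 3 of 4 missing
  c = c(1, 2, 3, 4)      # nothing missing
)

# is.na() on a data frame gives a logical matrix, so colMeans() returns
# the fraction of missing values in each column
miss.pct <- round(100 * colMeans(is.na(toy)), 2)

# Columns whose missing share exceeds the (assumed) 50% cutoff
mostly.missing <- names(miss.pct)[miss.pct > 50]
print(miss.pct)
print(mostly.missing)
```

Applied to `soil.data` with the same cutoff, this should single out Nitrogen, Copper and Boron, matching the percentages printed above.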

More details on Nitrogen

Block Wise

nitrogen.missing <- is.na(soil.data$Nitrogen)
table(paste0(soil.data$DistrictName, " - ",soil.data$BlockName), nitrogen.missing)
##                                 nitrogen.missing
##                                  FALSE  TRUE
##   Ambala - Ambala-I                  0  4832
##   Ambala - Ambala-Ii                 0  2689
##   Ambala - Barara                    0  2686
##   Ambala - Naraingarh                0   309
##   Ambala - Saha                      0  3695
##   Ambala - Shahzadpur                0   307
##   Bhiwani - Badhra                   2  3347
##   Bhiwani - Bawani Khera            22 28461
##   Bhiwani - Bhiwani                  0    31
##   Bhiwani - Dadri-I                  0     2
##   Bhiwani - Kairu                    0    66
##   Bhiwani - Loharu                   0     3
##   Bhiwani - Siwani                   0   558
##   Bhiwani - Tosham                   0   442
##   Faridabad - Ballabgarh          3482  3000
##   Faridabad - Faridabad           2805   438
##   Fatehabad - Bhattu Kalan           0 22144
##   Fatehabad - Fatehabad             12 20136
##   Fatehabad - Jakhal                 2  4003
##   Fatehabad - Ratia                  1 14275
##   Fatehabad - Tohana                12 15212
##   Gurgaon - Farrukh Nagar            4   242
##   Gurgaon - Gurgaon                  0  2363
##   Gurgaon - Pataudi                  0   583
##   Gurgaon - Sohna                    0     1
##   Hisar - Adampur                   15 10426
##   Hisar - Agroha                     4  2772
##   Hisar - Barwala                   10 12341
##   Hisar - Hansi-I                    4 12898
##   Hisar - Hansi-Ii                   0     2
##   Hisar - Hisar-I                    0   928
##   Hisar - Hisar-Ii                   0  4838
##   Hisar - Narnaund                   0  1447
##   Hisar - Uklana                     1   787
##   Jhajjar - Bahadurgarh              7 11152
##   Jhajjar - Beri                     1   750
##   Jhajjar - Jhajjar                  2  8762
##   Jhajjar - Matannail                0  1686
##   Jhajjar - Salhawas                 0   743
##   Jind - Alewa                       0  1946
##   Jind - Jind                       24 15085
##   Jind - Julana                      3  6553
##   Jind - Narwana                     0  1477
##   Jind - Pillukhera                  0     1
##   Jind - Safidon                     5  2915
##   Jind - Uchana                      0  1205
##   Kaithal - Guhla                  177   674
##   Kaithal - Kaithal                 65 13116
##   Kaithal - Kalayat                  1    55
##   Kaithal - Pundri                   5   970
##   Kaithal - Rajound                  0    35
##   Kaithal - Siwan                    0     1
##   Karnal - Gharaunda (Part)          3 13280
##   Karnal - Indri                     1 10965
##   Karnal - Karnal                    4 10761
##   Karnal - Nilokheri                14 25058
##   Karnal - Nissing At Chirao         3 15126
##   Kurukshetra - Babain               0  4638
##   Kurukshetra - Ismailabad           0  5275
##   Kurukshetra - Ladwa               17  6590
##   Kurukshetra - Pehowa               3 13493
##   Kurukshetra - Shahbad              3  7877
##   Kurukshetra - Thanesar             1 13598
##   Mahendragarh - Ateli Nangal        3  5379
##   Mahendragarh - Kanina              4 10892
##   Mahendragarh - Mahendragarh        0  5235
##   Mahendragarh - Nangal Chaudhry     0  4389
##   Mahendragarh - Narnaul             2  6145
##   Mahendragarh - Nizampur            0   591
##   Mahendragarh - Satnali             0   383
##   Mahendragarh - Sihma               6  1714
##   Mewat - Ferozepur Jhirka        1227     1
##   Mewat - Nagina                   547    91
##   Mewat - Nuh                      652  2087
##   Mewat - Punahana                 655   390
##   Mewat - Taoru                   3523  1929
##   Palwal - Hassanpur               135  4020
##   Palwal - Hathin                   16 10499
##   Palwal - Hodal                    12  6676
##   Palwal - Palwal                   75  4809
##   Palwal - Prithla                   1  3283
##   Panchkula - Barwala                5  5712
##   Panchkula - Morni                  0   432
##   Panchkula - Pinjore                3  2425
##   Panchkula - Raipur Rani            4  6689
##   Panipat - Bapoli                   1  8051
##   Panipat - Israna                  10  9514
##   Panipat - Madlauda                 0 11924
##   Panipat - Panipat                  6  5935
##   Panipat - Samalkha                 4  7356
##   Rewari - Bawal                  1606  3306
##   Rewari - Jatusana               1033  2898
##   Rewari - Khol At Rewari         4157  1538
##   Rewari - Nahar                    71  4031
##   Rewari - Rewari                 4909  2539
##   Rohtak - Kalanaur                  6  2554
##   Rohtak - Lakhan Majra             11  5028
##   Rohtak - Maham                     5  5552
##   Rohtak - Rohtak                    1  4193
##   Rohtak - Sampla                    0   873
##   Sirsa - Baragudha                 27 18977
##   Sirsa - Dabwali                   28 18638
##   Sirsa - Ellenabad                  2  3775
##   Sirsa - Nathusari Chopta           1  6472
##   Sirsa - Odhan                      0   292
##   Sirsa - Rania                      0  1864
##   Sirsa - Sirsa                      0   468
##   Sonipat - Gannaur                 26 12247
##   Sonipat - Gohana                   6  9789
##   Sonipat - Kathura                  2  6901
##   Sonipat - Kharkhoda               12  9135
##   Sonipat - Mundlana                22 12601
##   Sonipat - Murthal                  4   345
##   Sonipat - Rai                      9  2518
##   Sonipat - Sonipat                 32   315
##   Yamunanagar - Bilaspur             5  6620
##   Yamunanagar - Chhachhrauli         0   504
##   Yamunanagar - Jagadhri             0  1006
##   Yamunanagar - Mustafabad           0  8377
##   Yamunanagar - Radaur               1  7505
##   Yamunanagar - Sadaura (Part)       0  2435

District Wise

table(soil.data$DistrictName, nitrogen.missing)
##               nitrogen.missing
##                FALSE  TRUE
##   Ambala           0 14518
##   Bhiwani         24 32910
##   Faridabad     6287  3438
##   Fatehabad       27 75770
##   Gurgaon          4  3189
##   Hisar           34 46439
##   Jhajjar         10 23093
##   Jind            32 29182
##   Kaithal        248 14851
##   Karnal          25 75190
##   Kurukshetra     24 51471
##   Mahendragarh    15 34728
##   Mewat         6604  4498
##   Palwal         239 29287
##   Panchkula       12 15258
##   Panipat         21 42780
##   Rewari       11776 14312
##   Rohtak          23 18200
##   Sirsa           58 50486
##   Sonipat        113 53851
##   Yamunanagar      6 26447

More Details

library(ggplot2)
# Keep only rows with a recorded Nitrogen value, then count them per district
nitrogen.data <- subset(soil.data, !is.na(Nitrogen))
nitrogen.data$Identity <- 1
agg.district <- aggregate(Identity ~ DistrictName, data = nitrogen.data, FUN = sum)

# Point plot of the per-district counts (qplot() is deprecated in recent ggplot2 releases)
ggplot(agg.district, aes(x = DistrictName, y = Identity, col = DistrictName)) + geom_point() + xlab("District") + ylab("Number of entries with Nitrogen data") + theme(axis.text.x = element_text(angle = 60, hjust = 1))

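This shows that Nitrogen values are recorded almost entirely in Rewari, Mewat and Faridabad (11,776, 6,604 and 6,287 entries respectively); every other district has fewer than 250.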
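To put a number on how concentrated the Nitrogen data is, the sketch below recomputes the share held by the three largest districts. The counts are copied by hand from the FALSE column of the district-wise table above; the variable names (`recorded`, `top3`, `top3.share`) are only illustrative.

```r
# Entries with a recorded Nitrogen value per district
# (FALSE column of the district-wise table)
recorded <- c(
  Ambala = 0, Bhiwani = 24, Faridabad = 6287, Fatehabad = 27,
  Gurgaon = 4, Hisar = 34, Jhajjar = 10, Jind = 32, Kaithal = 248,
  Karnal = 25, Kurukshetra = 24, Mahendragarh = 15, Mewat = 6604,
  Palwal = 239, Panchkula = 12, Panipat = 21, Rewari = 11776,
  Rohtak = 23, Sirsa = 58, Sonipat = 113, Yamunanagar = 6
)

# Three districts with the most recorded values
top3 <- sort(recorded, decreasing = TRUE)[1:3]

# Their combined share of all recorded Nitrogen values, in percent
top3.share <- 100 * sum(top3) / sum(recorded)
print(top3)
print(round(top3.share, 1))
```

The three districts together account for roughly 96% of all recorded Nitrogen values.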

Electrical Conductivity

ggplot(soil.data) + aes(x = DistrictName, y = ElectricalConductivity, fill = DistrictName) + geom_bar(stat = "summary", fun = "mean") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + xlab("District") + ylab("Mean Electrical Conductivity")
## Warning: Removed 7917 rows containing non-finite values (stat_summary).

High electrical conductivity can be associated with more moisture in the soil. Palwal, Hisar, Faridabad and Jhajjar show the highest mean electrical conductivity, which may indicate higher soil water content in those districts.
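The ranking read off the bar chart can also be checked numerically by computing the per-district mean directly and sorting it. The sketch below uses a tiny invented data frame (`toy`, with hypothetical districts "A", "B", "C"), not the real soil data; missing readings are dropped with `na.rm = TRUE`, which mirrors the rows the plot removes with a warning.

```r
# Hypothetical readings; district labels and values are made up
toy <- data.frame(
  district = c("A", "A", "B", "B", "C"),
  ec       = c(0.2, 0.4, 1.0, NA, 0.5)
)

# Mean electrical conductivity per district, ignoring missing readings
ec.mean <- sapply(split(toy$ec, toy$district), mean, na.rm = TRUE)

# Districts ordered from highest to lowest mean
ec.ranked <- sort(ec.mean, decreasing = TRUE)
print(ec.ranked)
```

On the real data the same two lines would use `soil.data$ElectricalConductivity` and `soil.data$DistrictName`.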

Faridabad lies close to the Yamuna river, so high water content is plausible there. What about the other districts? Possible explanations:
* Good irrigation
* More rainfall

This needs further investigation.