Loading data and converting strings to Factors
library(data.table)
system.time(soil.data <- fread("haryana_soil_nutrient.csv", header = TRUE, stringsAsFactors = TRUE))
## user system elapsed
## 1.946 0.093 2.043
Structure of data
str(soil.data)
## Classes 'data.table' and 'data.frame': 685480 obs. of 17 variables:
## $ DistrictId : int 70 70 70 70 70 70 70 70 70 70 ...
## $ DistrictName : Factor w/ 21 levels "Ambala","Bhiwani",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ BlockId : int 447 447 447 447 447 447 447 447 447 447 ...
## $ BlockName : Factor w/ 120 levels "Adampur","Agroha",..: 4 4 4 4 4 4 4 4 4 4 ...
## $ SampleNo : Factor w/ 661872 levels "HR56946/2016-17/10849052",..: 13866 15536 15537 15538 15539 15540 15541 15542 15543 15544 ...
## $ SoilPh : num 8 8.7 9.1 8.9 8.5 8.8 8.6 8.5 8.7 8.4 ...
## $ ElectricalConductivity: num 0.27 0.15 0.93 0.14 0.18 0.16 0.17 0.2 0.18 0.17 ...
## $ OrganicCarbon : num 0.51 0.24 0.31 0.21 0.22 0.3 0.27 0.25 0.34 0.24 ...
## $ Nitrogen : num NA NA NA NA NA NA NA NA NA NA ...
## $ Phosphorous : num 3.24 21.01 28.87 27.55 23.68 ...
## $ Potassium : num 48.6 733.4 666 658.3 623.4 ...
## $ Sulphur : num 291 410 364 407 389 ...
## $ Zinc : num 0.47 2.93 2.79 3.19 2.79 2.63 2.56 2.83 2.51 3.3 ...
## $ Iron : num 20.2 44.8 44.5 52.4 40.8 ...
## $ Copper : num 2.95 3.93 4.13 4.22 3.91 3.73 3.52 3.91 3.54 4.54 ...
## $ Magnesium : num 1.43 2.28 2.37 2.13 1.98 1.65 1.58 1.83 2.3 2.46 ...
## $ Boron : num NA NA NA NA NA NA NA NA NA NA ...
## - attr(*, ".internal.selfref")=<externalptr>
Summary of data
summary(soil.data)
## DistrictId DistrictName BlockId
## Min. :69.00 Fatehabad : 75797 Min. : 447.0
## 1st Qu.:74.00 Karnal : 75215 1st Qu.: 483.0
## Median :78.00 Sonipat : 53964 Median : 515.0
## Mean :77.97 Kurukshetra: 51495 Mean : 615.4
## 3rd Qu.:81.00 Sirsa : 50544 3rd Qu.: 539.0
## Max. :89.00 Hisar : 46473 Max. :6778.0
## (Other) :331992
## BlockName SampleNo SoilPh
## Bawani Khera: 28483 HR58985/2016-17/25359166: 2 Min. : 0.010
## Nilokheri : 25072 HR58985/2016-17/25359252: 2 1st Qu.: 7.580
## Bhattu Kalan: 22144 HR58985/2016-17/25359388: 2 Median : 7.810
## Fatehabad : 20148 HR58985/2016-17/25359514: 2 Mean : 7.992
## Baragudha : 19004 HR58985/2016-17/25359600: 2 3rd Qu.: 8.100
## Dabwali : 18666 HR58985/2016-17/25359691: 2 Max. :96.000
## (Other) :551963 (Other) :685468 NA's :6319
## ElectricalConductivity OrganicCarbon Nitrogen
## Min. :0.000 Min. : 0.0000 Min. : 0.1
## 1st Qu.:0.340 1st Qu.: 0.2500 1st Qu.: 61.4
## Median :0.520 Median : 0.3200 Median : 77.9
## Mean :0.678 Mean : 0.4949 Mean : 82.6
## 3rd Qu.:0.800 3rd Qu.: 0.3900 3rd Qu.: 97.1
## Max. :5.000 Max. :97.1400 Max. :9714.0
## NA's :7917 NA's :1093 NA's :659898
## Phosphorous Potassium Sulphur Zinc
## Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
## 1st Qu.: 4.90 1st Qu.: 93.39 1st Qu.: 19.80 1st Qu.: 0.750
## Median : 8.70 Median : 172.00 Median : 44.48 Median : 1.370
## Mean : 16.70 Mean : 238.89 Mean : 71.58 Mean : 3.423
## 3rd Qu.: 14.49 3rd Qu.: 326.17 3rd Qu.: 92.98 3rd Qu.: 2.520
## Max. :9890.00 Max. :9980.00 Max. :999.00 Max. :996.000
## NA's :2639 NA's :2643 NA's :30037 NA's :5764
## Iron Copper Magnesium Boron
## Min. : 0.00 Min. : 0.0 Min. : 0.000 Min. : 0.0
## 1st Qu.: 4.29 1st Qu.: 0.6 1st Qu.: 2.440 1st Qu.: 0.3
## Median : 7.77 Median : 1.1 Median : 4.740 Median : 1.4
## Mean : 13.42 Mean : 2.9 Mean : 9.157 Mean : 18.0
## 3rd Qu.: 14.06 3rd Qu.: 2.1 3rd Qu.: 9.090 3rd Qu.: 10.1
## Max. :999.00 Max. :984.7 Max. :999.000 Max. :937.6
## NA's :5647 NA's :352638 NA's :9963 NA's :680507
Find missing values in each column
sapply(soil.data, function(x){sum(is.na(x))})
## DistrictId DistrictName BlockId
## 0 0 0
## BlockName SampleNo SoilPh
## 0 0 6319
## ElectricalConductivity OrganicCarbon Nitrogen
## 7917 1093 659898
## Phosphorous Potassium Sulphur
## 2639 2643 30037
## Zinc Iron Copper
## 5764 5647 352638
## Magnesium Boron
## 9963 680507
Find the percentage of missing values in each column
rows_tot <- nrow(soil.data)
sapply(soil.data, function(x){round(100*sum(is.na(x))/rows_tot, digits = 2)})
## DistrictId DistrictName BlockId
## 0.00 0.00 0.00
## BlockName SampleNo SoilPh
## 0.00 0.00 0.92
## ElectricalConductivity OrganicCarbon Nitrogen
## 1.15 0.16 96.27
## Phosphorous Potassium Sulphur
## 0.38 0.39 4.38
## Zinc Iron Copper
## 0.84 0.82 51.44
## Magnesium Boron
## 1.45 99.27
Boron and Nitrogen each have more than 95% of their values missing (99.27% and 96.27%), while Copper has about 51% missing. The missing Nitrogen data is the main problem.
NOTE: Nitrogen is one of the most important components of soil, so the missing values will be an issue for this analysis.
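The per-column percentages can also be computed in one step and sorted so that the worst columns come first. A minimal sketch on a toy data frame (`toy` and its values are made up for illustration, not taken from the soil data):

```r
# Toy stand-in for soil.data; the values are illustrative only.
toy <- data.frame(
  SoilPh   = c(8.0, 8.7, NA, 8.9),
  Nitrogen = c(NA, NA, NA, 61.4),
  Zinc     = c(0.47, 2.93, 2.79, 3.19)
)

# is.na() on a data frame returns a logical matrix, so the column means
# are the fraction of missing values in each column.
missing.pct <- round(100 * colMeans(is.na(toy)), digits = 2)

# Sort so that the columns with the most missing data come first.
missing.pct <- sort(missing.pct, decreasing = TRUE)
print(missing.pct)
```

On the real data the same two lines would be applied to `soil.data` instead of `toy`.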
Block Wise
nitrogen.missing <- is.na(soil.data$Nitrogen)
table(paste0(soil.data$DistrictName, " - ",soil.data$BlockName), nitrogen.missing)
## nitrogen.missing
## FALSE TRUE
## Ambala - Ambala-I 0 4832
## Ambala - Ambala-Ii 0 2689
## Ambala - Barara 0 2686
## Ambala - Naraingarh 0 309
## Ambala - Saha 0 3695
## Ambala - Shahzadpur 0 307
## Bhiwani - Badhra 2 3347
## Bhiwani - Bawani Khera 22 28461
## Bhiwani - Bhiwani 0 31
## Bhiwani - Dadri-I 0 2
## Bhiwani - Kairu 0 66
## Bhiwani - Loharu 0 3
## Bhiwani - Siwani 0 558
## Bhiwani - Tosham 0 442
## Faridabad - Ballabgarh 3482 3000
## Faridabad - Faridabad 2805 438
## Fatehabad - Bhattu Kalan 0 22144
## Fatehabad - Fatehabad 12 20136
## Fatehabad - Jakhal 2 4003
## Fatehabad - Ratia 1 14275
## Fatehabad - Tohana 12 15212
## Gurgaon - Farrukh Nagar 4 242
## Gurgaon - Gurgaon 0 2363
## Gurgaon - Pataudi 0 583
## Gurgaon - Sohna 0 1
## Hisar - Adampur 15 10426
## Hisar - Agroha 4 2772
## Hisar - Barwala 10 12341
## Hisar - Hansi-I 4 12898
## Hisar - Hansi-Ii 0 2
## Hisar - Hisar-I 0 928
## Hisar - Hisar-Ii 0 4838
## Hisar - Narnaund 0 1447
## Hisar - Uklana 1 787
## Jhajjar - Bahadurgarh 7 11152
## Jhajjar - Beri 1 750
## Jhajjar - Jhajjar 2 8762
## Jhajjar - Matannail 0 1686
## Jhajjar - Salhawas 0 743
## Jind - Alewa 0 1946
## Jind - Jind 24 15085
## Jind - Julana 3 6553
## Jind - Narwana 0 1477
## Jind - Pillukhera 0 1
## Jind - Safidon 5 2915
## Jind - Uchana 0 1205
## Kaithal - Guhla 177 674
## Kaithal - Kaithal 65 13116
## Kaithal - Kalayat 1 55
## Kaithal - Pundri 5 970
## Kaithal - Rajound 0 35
## Kaithal - Siwan 0 1
## Karnal - Gharaunda (Part) 3 13280
## Karnal - Indri 1 10965
## Karnal - Karnal 4 10761
## Karnal - Nilokheri 14 25058
## Karnal - Nissing At Chirao 3 15126
## Kurukshetra - Babain 0 4638
## Kurukshetra - Ismailabad 0 5275
## Kurukshetra - Ladwa 17 6590
## Kurukshetra - Pehowa 3 13493
## Kurukshetra - Shahbad 3 7877
## Kurukshetra - Thanesar 1 13598
## Mahendragarh - Ateli Nangal 3 5379
## Mahendragarh - Kanina 4 10892
## Mahendragarh - Mahendragarh 0 5235
## Mahendragarh - Nangal Chaudhry 0 4389
## Mahendragarh - Narnaul 2 6145
## Mahendragarh - Nizampur 0 591
## Mahendragarh - Satnali 0 383
## Mahendragarh - Sihma 6 1714
## Mewat - Ferozepur Jhirka 1227 1
## Mewat - Nagina 547 91
## Mewat - Nuh 652 2087
## Mewat - Punahana 655 390
## Mewat - Taoru 3523 1929
## Palwal - Hassanpur 135 4020
## Palwal - Hathin 16 10499
## Palwal - Hodal 12 6676
## Palwal - Palwal 75 4809
## Palwal - Prithla 1 3283
## Panchkula - Barwala 5 5712
## Panchkula - Morni 0 432
## Panchkula - Pinjore 3 2425
## Panchkula - Raipur Rani 4 6689
## Panipat - Bapoli 1 8051
## Panipat - Israna 10 9514
## Panipat - Madlauda 0 11924
## Panipat - Panipat 6 5935
## Panipat - Samalkha 4 7356
## Rewari - Bawal 1606 3306
## Rewari - Jatusana 1033 2898
## Rewari - Khol At Rewari 4157 1538
## Rewari - Nahar 71 4031
## Rewari - Rewari 4909 2539
## Rohtak - Kalanaur 6 2554
## Rohtak - Lakhan Majra 11 5028
## Rohtak - Maham 5 5552
## Rohtak - Rohtak 1 4193
## Rohtak - Sampla 0 873
## Sirsa - Baragudha 27 18977
## Sirsa - Dabwali 28 18638
## Sirsa - Ellenabad 2 3775
## Sirsa - Nathusari Chopta 1 6472
## Sirsa - Odhan 0 292
## Sirsa - Rania 0 1864
## Sirsa - Sirsa 0 468
## Sonipat - Gannaur 26 12247
## Sonipat - Gohana 6 9789
## Sonipat - Kathura 2 6901
## Sonipat - Kharkhoda 12 9135
## Sonipat - Mundlana 22 12601
## Sonipat - Murthal 4 345
## Sonipat - Rai 9 2518
## Sonipat - Sonipat 32 315
## Yamunanagar - Bilaspur 5 6620
## Yamunanagar - Chhachhrauli 0 504
## Yamunanagar - Jagadhri 0 1006
## Yamunanagar - Mustafabad 0 8377
## Yamunanagar - Radaur 1 7505
## Yamunanagar - Sadaura (Part) 0 2435
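Most rows of the block-wise table have no Nitrogen entries, or only a handful. One way to make it easier to read is to keep only the blocks that have at least one recorded value and sort them by count. A rough sketch on toy data (the `toy` rows and the `Block` column are assumptions for illustration):

```r
# Toy stand-in for soil.data; the rows are illustrative only.
toy <- data.frame(
  DistrictName = c("Ambala", "Ambala", "Rewari", "Rewari", "Rewari", "Mewat"),
  BlockName    = c("Saha", "Barara", "Bawal", "Bawal", "Bawal", "Nuh"),
  Nitrogen     = c(NA, NA, 61.4, NA, 97.1, 77.9)
)

# Same "District - Block" label as used in the table above.
toy$Block <- paste0(toy$DistrictName, " - ", toy$BlockName)
block.table <- table(toy$Block, is.na(toy$Nitrogen))

# The "FALSE" column counts rows where Nitrogen is present.
recorded <- block.table[, "FALSE"]

# Drop blocks with no Nitrogen data and put the largest counts first.
recorded <- sort(recorded[recorded > 0], decreasing = TRUE)
print(recorded)
```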
District Wise
table(soil.data$DistrictName, nitrogen.missing)
## nitrogen.missing
## FALSE TRUE
## Ambala 0 14518
## Bhiwani 24 32910
## Faridabad 6287 3438
## Fatehabad 27 75770
## Gurgaon 4 3189
## Hisar 34 46439
## Jhajjar 10 23093
## Jind 32 29182
## Kaithal 248 14851
## Karnal 25 75190
## Kurukshetra 24 51471
## Mahendragarh 15 34728
## Mewat 6604 4498
## Palwal 239 29287
## Panchkula 12 15258
## Panipat 21 42780
## Rewari 11776 14312
## Rohtak 23 18200
## Sirsa 58 50486
## Sonipat 113 53851
## Yamunanagar 6 26447
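Raw counts are hard to compare across districts of very different size. Row-wise percentages from `prop.table()` may be easier to read. A minimal sketch on toy data (the `toy` rows are made up for illustration):

```r
# Toy stand-in for soil.data; the rows are illustrative only.
toy <- data.frame(
  DistrictName = factor(c("Ambala", "Ambala", "Rewari", "Rewari", "Rewari", "Mewat")),
  Nitrogen     = c(NA, NA, 61.4, NA, 97.1, 77.9)
)

nitrogen.missing <- is.na(toy$Nitrogen)
counts <- table(toy$DistrictName, nitrogen.missing)

# margin = 1 normalises each row, so every district sums to 100%.
pct <- round(100 * prop.table(counts, margin = 1), 1)
print(pct)
```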
More Details
library(ggplot2)
# Keep only the rows where Nitrogen is recorded, then count them per district
nitrogen.data <- subset(soil.data, !is.na(Nitrogen))
nitrogen.data$Identity <- 1
agg.district <- aggregate(Identity ~ DistrictName, data = nitrogen.data, FUN = sum)
ggplot(agg.district, aes(x = DistrictName, y = Identity, col = DistrictName)) + geom_point() + xlab("District") + ylab("Num of Entries with Nitrogen Data") + theme(axis.text.x = element_text(angle = 60, hjust = 1))
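This shows that Nitrogen data is recorded in meaningful quantity only in Rewari (11776 entries), Mewat (6604) and Faridabad (6287). Every other district has fewer than 250 entries.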
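To put a number on this, the share of all recorded Nitrogen values that come from these three districts can be worked out from the counts already printed above (the district-wise table and the missing-value totals). A short worked calculation:

```r
# Non-missing Nitrogen counts copied from the district-wise table.
recorded <- c(Rewari = 11776, Mewat = 6604, Faridabad = 6287)

# Total rows minus rows with missing Nitrogen (both from the output above).
total.recorded <- 685480 - 659898

# Percentage of all recorded Nitrogen values held by the three districts.
share <- round(100 * sum(recorded) / total.recorded, 1)
print(share)
```

This gives 96.4, so the three districts account for roughly 96% of all recorded Nitrogen values.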
ggplot(soil.data) + aes(x = DistrictName, y = ElectricalConductivity, fill = DistrictName) + geom_bar(stat = "summary", fun = "mean") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + xlab("District") + ylab("Electrical Conductivity")
## Warning: Removed 7917 rows containing non-finite values (stat_summary).
High electrical conductivity of soil can be associated with more moisture in the soil. Palwal, Hisar, Faridabad and Jhajjar show the highest mean conductivity, which may point to high water content in their soil.
Faridabad is close to the Yamuna river, so high water content is plausible there. What about the other districts? Possible explanations:
* Good Irrigation
* More Rainfall
More details are needed to confirm either explanation.
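The bar chart only shows the district means visually. To read off the ranking directly, the same means can be computed as a table and sorted. A minimal sketch on toy data (the `toy` rows and their values are made up for illustration):

```r
# Toy stand-in for soil.data; the values are illustrative only.
toy <- data.frame(
  DistrictName           = c("Palwal", "Palwal", "Hisar", "Hisar", "Ambala", "Ambala"),
  ElectricalConductivity = c(1.2, 1.0, 0.9, NA, 0.3, 0.5)
)

# The formula interface of aggregate() drops rows with NA by default,
# which matches the rows ggplot removes before computing the mean.
ec.mean <- aggregate(ElectricalConductivity ~ DistrictName, data = toy, FUN = mean)

# Highest mean conductivity first.
ec.mean <- ec.mean[order(-ec.mean$ElectricalConductivity), ]
print(ec.mean)
```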