Exploratory data analysis of crop yield in Haryana for the period 2012-2016.
Loading yield data
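library(dplyr)   # %>%, group_by(), summarize()
library(plotly)  # ggplotly(); also attaches ggplot2
yield_db <- read.csv("Tidy_Data_final.csv", stringsAsFactors = TRUE)  # factor columns, as in the output below
summary(yield_db)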
## Year District Block Crop
## Min. :2012 Bhiwani : 350 Barwala: 70 Barley :605
## 1st Qu.:2013 Hisar : 315 Kaithal: 43 Gram :605
## Median :2014 Sirsa : 253 Sirsa : 43 Wheat :605
## Mean :2014 Jind : 245 Adampur: 35 Bajra :485
## 3rd Qu.:2015 Sonepat : 245 Agroha : 35 Cotton :485
## Max. :2016 Mahendergarh: 223 Alewa : 35 Maize :485
## (Other) :2609 (Other):3979 (Other):970
## Yield
## 0 :1397
## 274 : 31
## 0Â : 24
## - : 21
## 2057 : 15
## 1749 : 13
## (Other):2739
yield_db$Yield <- as.integer(as.character(yield_db$Yield))
## Warning: NAs introduced by coercion
yield_db$Year <- as.factor(yield_db$Year)
# Replace 0 with NA for Yield
yield_db$Yield[yield_db$Yield==0] <- NA
str(yield_db)
## 'data.frame': 4240 obs. of 5 variables:
## $ Year : Factor w/ 5 levels "2012","2013",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ District: Factor w/ 21 levels "Ambala","Bhiwani",..: 6 6 6 6 6 6 6 6 6 4 ...
## $ Block : Factor w/ 123 levels "Adampur","Agroha",..: 39 16 79 43 44 15 123 2 1 100 ...
## $ Crop : Factor w/ 8 levels "Bajra","Barley",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ Yield : int 4425 4072 4245 4210 4277 4425 4271 4509 4057 4951 ...
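The coercion warning above comes from entries such as `-` and a `0` followed by a stray non-breaking space (shown as `0Â` in the summary). A minimal sketch of making that cleaning explicit instead of relying on the warning, assuming every valid yield is a non-negative whole number; `clean_yield` and the toy vector are hypothetical and not part of the original analysis:

```r
# Hypothetical helper: convert raw yield strings to integers explicitly.
clean_yield <- function(x) {
  x <- as.character(x)
  # Trim ordinary whitespace and non-breaking spaces from both ends.
  x <- gsub("^[[:space:]\u00a0]+|[[:space:]\u00a0]+$", "", x)
  # Only strings made up entirely of digits count as valid yields.
  ok <- grepl("^[0-9]+$", x)
  y <- rep(NA_integer_, length(x))
  y[ok] <- as.integer(x[ok])
  # The analysis treats a yield of 0 as missing.
  y[!is.na(y) & y == 0L] <- NA_integer_
  y
}

# Illustrative values only, mimicking the kinds of entries seen in the summary.
raw <- c("4425", "0", "0\u00a0", "-", " 274 ")
cleaned <- clean_yield(raw)
```

On the real data this would be applied as `yield_db$Yield <- clean_yield(yield_db$Yield)`, replacing the `as.integer(as.character(...))` step and the separate zero-to-`NA` assignment.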
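The dataset has yield (output per hectare of land). The values look like kilograms per hectare rather than tonnes (wheat is around 4,400), so the unit should be confirmed. The data is granular down to Block level and can be broken down by Year and Crop.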
# Percentage of missing (TRUE) and present (FALSE) yield values
missingYield <- prop.table(table(is.na(yield_db$Yield))) * 100
34.01% of the yield values are missing.
IMPORTANT: Decide how to tackle missing data. The best approach here is to remove these rows.
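Before dropping rows, it may help to see how the missing values are spread across crops, since an uneven spread would change the balance of the data. A minimal sketch in base R, using a toy stand-in for `yield_db` (the data frame `toy` and its values are made up for illustration):

```r
# Toy stand-in for yield_db; the values are illustrative, not from the dataset.
toy <- data.frame(
  Crop  = c("Wheat", "Wheat", "Gram", "Gram", "Cotton", "Cotton"),
  Year  = c(2012, 2013, 2012, 2013, 2013, 2014),
  Yield = c(4425L, NA, 1250L, NA, NA, 560L)
)

# Share of missing yields per crop, in percent.
missing_by_crop <- tapply(is.na(toy$Yield), toy$Crop, mean) * 100

# Keep only the rows with an observed yield.
toy_complete <- toy[!is.na(toy$Yield), ]
```

The same two statements would apply to `yield_db` directly; the filtered copy keeps the original data frame available for comparison.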
Number of records per crop
table(yield_db$Crop)
##
## Bajra Barley Cotton Gram Maize Paddy Sugarcane
## 485 605 485 605 485 485 485
## Wheat
## 605
prop.table(table(yield_db$Crop)) *100
##
## Bajra Barley Cotton Gram Maize Paddy Sugarcane
## 11.43868 14.26887 11.43868 14.26887 11.43868 11.43868 11.43868
## Wheat
## 14.26887
The data is fairly balanced: each crop accounts for between 11.4% and 14.3% of the records.
Box plot of all crops excluding Sugarcane
ggplotly(ggplot(data=yield_db[yield_db$Crop!='Sugarcane',], aes(x=Crop, y=Yield, fill=Crop)) + geom_boxplot())
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: Removed 1285 rows containing non-finite values (stat_boxplot).
Box plot of Sugarcane
ggplotly(ggplot(data=yield_db[yield_db$Crop=='Sugarcane',], aes(x=Crop, y=Yield, fill=Crop)) + geom_boxplot())
## Warning: Removed 157 rows containing non-finite values (stat_boxplot).
The box plots show that the yield of Sugarcane is on a completely different scale from the other crops.
Talk to the team about this.
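Until that is settled, one possible way to compare crops on a common footing is to standardize yield within each crop. This is only a sketch of one option, not a decision taken in this analysis; the data frame `toy` and the column `YieldZ` are hypothetical:

```r
# Toy data with two crops on very different scales (illustrative values).
toy <- data.frame(
  Crop  = c("Wheat", "Wheat", "Wheat", "Sugarcane", "Sugarcane", "Sugarcane"),
  Yield = c(4000, 4400, 4800, 70000, 74000, 78000)
)

# z-score of each yield relative to its own crop's mean and standard deviation.
# ave() applies the function separately within each Crop group.
toy$YieldZ <- ave(
  toy$Yield, toy$Crop,
  FUN = function(v) (v - mean(v, na.rm = TRUE)) / sd(v, na.rm = TRUE)
)
```

After this step both crops have values centred on 0 with unit spread, so they can share one axis, at the cost of losing the original units.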
# Mean yield per crop and year, ignoring missing values
avg_yield_crop <- yield_db %>% group_by(Crop, Year) %>% dplyr::summarize(Mean = mean(Yield, na.rm = TRUE))
avg_yield_crop <- as.data.frame(avg_yield_crop)
avg_yield_crop
## Crop Year Mean
## 1 Bajra 2013 2012.1979
## 2 Bajra 2014 1812.5745
## 3 Bajra 2015 1793.3187
## 4 Bajra 2016 2026.4536
## 5 Barley 2012 3313.6351
## 6 Barley 2013 3821.1806
## 7 Barley 2014 2958.2714
## 8 Barley 2015 3303.3788
## 9 Barley 2016 3608.1795
## 10 Cotton 2013 564.4933
## 11 Cotton 2014 493.0250
## 12 Cotton 2015 286.6604
## 13 Cotton 2016 548.0345
## 14 Gram 2012 1253.4783
## 15 Gram 2013 990.8182
## 16 Gram 2014 758.1739
## 17 Gram 2015 673.1290
## 18 Gram 2016 1453.9429
## 19 Maize 2013 3123.1667
## 20 Maize 2014 2435.2857
## 21 Maize 2015 2945.3667
## 22 Maize 2016 3219.3158
## 23 Paddy 2013 3102.9143
## 24 Paddy 2014 2981.9906
## 25 Paddy 2015 2945.5351
## 26 Paddy 2016 3020.1204
## 27 Sugarcane 2013 74102.0385
## 28 Sugarcane 2014 71587.6627
## 29 Sugarcane 2015 73049.1071
## 30 Sugarcane 2016 77390.8675
## 31 Wheat 2012 4397.4538
## 32 Wheat 2013 4636.6555
## 33 Wheat 2014 3860.0924
## 34 Wheat 2015 4286.9754
## 35 Wheat 2016 4764.3689
For all crops excluding Sugarcane
ggplotly(ggplot(data = yield_db[yield_db$Crop!='Sugarcane',], aes(x=Year, y=Yield, fill=Year)) + geom_boxplot() + facet_grid(Crop~., scales = "free_y") + ylab("Yield"))
## Warning: Removed 1285 rows containing non-finite values (stat_boxplot).
For Sugarcane
ggplotly(ggplot(data = yield_db[yield_db$Crop=='Sugarcane',], aes(x=Year, y=Yield, fill=Year)) + geom_boxplot() + ylab("Yield"))
## Warning: Removed 157 rows containing non-finite values (stat_boxplot).
IMPORTANT: Outliers need to be tackled by crop and year.
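A possible starting point is to flag values outside the 1.5 * IQR fences within each crop and year, which is close to the convention the box-plot whiskers follow. This is a minimal sketch under that assumption; the helper `is_outlier`, the data frame `toy`, and the column `Outlier` are hypothetical, and the rule itself is a choice rather than something fixed by this analysis:

```r
# Hypothetical helper: TRUE where a value lies outside the 1.5 * IQR fences.
# Missing values are never flagged.
is_outlier <- function(v) {
  q <- quantile(v, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(v) & (v < q[1] - fence | v > q[2] + fence)
}

# Toy data: one suspicious Wheat value and one missing Gram value.
toy <- data.frame(
  Crop  = rep(c("Wheat", "Gram"), each = 6),
  Year  = rep(2012, 12),
  Yield = c(4300, 4400, 4450, 4500, 4600, 9000,
            900, 950, 1000, 1050, 1100, NA)
)

# ave() applies is_outlier within each Crop x Year group; it returns 0/1
# because the input is numeric, so convert back to logical.
toy$Outlier <- as.logical(ave(toy$Yield, toy$Crop, toy$Year, FUN = is_outlier))
```

Flagging rather than deleting keeps the decision reversible: the flagged rows can be inspected with `toy[toy$Outlier, ]` before anything is removed.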