Exploratory data analysis of crop yield in Haryana for the period 2012-2016.

Loading Data

Loading yield data

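# Packages used below: dplyr for summaries, ggplot2 and plotly for plots
library(dplyr)
library(ggplot2)
library(plotly)

# stringsAsFactors = TRUE keeps the factor columns shown below (R >= 4.0 defaults to FALSE)
yield_db <- read.csv("Tidy_Data_final.csv", stringsAsFactors = TRUE)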
summary(yield_db)
##       Year              District        Block           Crop    
##  Min.   :2012   Bhiwani     : 350   Barwala:  70   Barley :605  
##  1st Qu.:2013   Hisar       : 315   Kaithal:  43   Gram   :605  
##  Median :2014   Sirsa       : 253   Sirsa  :  43   Wheat  :605  
##  Mean   :2014   Jind        : 245   Adampur:  35   Bajra  :485  
##  3rd Qu.:2015   Sonepat     : 245   Agroha :  35   Cotton :485  
##  Max.   :2016   Mahendergarh: 223   Alewa  :  35   Maize  :485  
##                 (Other)     :2609   (Other):3979   (Other):970  
##      Yield     
##  0      :1397  
##  274    :  31  
##  0      :  24  
##  -      :  21  
##  2057   :  15  
##  1749   :  13  
##  (Other):2739
yield_db$Yield <- as.integer(as.character(yield_db$Yield))
## Warning: NAs introduced by coercion
yield_db$Year <- as.factor(yield_db$Year)

# Replace 0 with NA for Yield (a yield of 0 is treated as not recorded)
yield_db$Yield[yield_db$Yield==0] <- NA

str(yield_db)
## 'data.frame':    4240 obs. of  5 variables:
##  $ Year    : Factor w/ 5 levels "2012","2013",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ District: Factor w/ 21 levels "Ambala","Bhiwani",..: 6 6 6 6 6 6 6 6 6 4 ...
##  $ Block   : Factor w/ 123 levels "Adampur","Agroha",..: 39 16 79 43 44 15 123 2 1 100 ...
##  $ Crop    : Factor w/ 8 levels "Bajra","Barley",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ Yield   : int  4425 4072 4245 4210 4277 4425 4271 4509 4057 4951 ...

The dataset records crop yield (output in tonnes per hectare). The data is granular down to the Block level and can be broken down by Year and Crop type.
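The summary above lists `0` twice and also contains `-` as a Yield value, which suggests (this is an assumption, not something verified against the file) stray whitespace and a placeholder for missing entries. A minimal sketch of a more explicit clean-up is shown here on a toy data frame; `raw` and its values are hypothetical and are not taken from Tidy_Data_final.csv.

```r
# Toy data for illustration only; these values are not from the real file.
raw <- data.frame(
  Crop  = c("Wheat", "Wheat", "Gram", "Gram", "Bajra"),
  Yield = c("4425", "0", " 0", "-", "2057 "),
  stringsAsFactors = FALSE
)

# Trim stray whitespace so that "0" and " 0" become the same value.
yield_chr <- trimws(raw$Yield)

# Mark "-" and empty strings as missing before converting,
# so the conversion does not raise a coercion warning.
yield_chr[yield_chr %in% c("-", "")] <- NA

raw$Yield <- as.integer(yield_chr)

# Treat a yield of 0 as "not recorded", as done in the report.
raw$Yield[which(raw$Yield == 0)] <- NA

print(raw)
```

Handling the placeholder explicitly makes it clear which values became NA on purpose, instead of relying on the coercion warning.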

Missing Data

# Percentage of missing (TRUE) and recorded (FALSE) yield values
missingYield <- prop.table(table(is.na(yield_db$Yield))) * 100

About 34.01% of the yield values are missing.

IMPORTANT: Decide how to handle the missing data. The best approach here is to remove these rows.
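Before removing rows, it may help to check whether the missing values are concentrated in particular crops. The following is a minimal sketch on a toy data frame (`toy` and its values are hypothetical); it shows the share of missing yields per crop and then drops the incomplete rows.

```r
# Toy data for illustration only; not taken from the real dataset.
toy <- data.frame(
  Crop  = c("Wheat", "Wheat", "Gram", "Gram", "Gram", "Cotton"),
  Year  = factor(c(2012, 2013, 2012, 2013, 2014, 2013)),
  Yield = c(4425L, NA, NA, NA, 758L, 564L)
)

# Share of missing yield values per crop, in percent.
missing_by_crop <- tapply(is.na(toy$Yield), toy$Crop, mean) * 100
print(round(missing_by_crop, 1))

# Keep only the rows with a recorded yield.
toy_complete <- toy[!is.na(toy$Yield), ]
print(nrow(toy_complete))
```

If the missing share differs strongly between crops, dropping rows changes the crop balance, which is worth knowing before any modelling.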

Crop wise Yield

Number of records per crop

table(yield_db$Crop)
## 
##     Bajra    Barley    Cotton      Gram     Maize     Paddy Sugarcane 
##       485       605       485       605       485       485       485 
##     Wheat 
##       605
prop.table(table(yield_db$Crop)) *100
## 
##     Bajra    Barley    Cotton      Gram     Maize     Paddy Sugarcane 
##  11.43868  14.26887  11.43868  14.26887  11.43868  11.43868  11.43868 
##     Wheat 
##  14.26887

The data is quite balanced across crops: each crop accounts for roughly 11% to 14% of the records.

Outlier Analysis

Box plot of all crops excluding Sugarcane

ggplotly(ggplot(data=yield_db[yield_db$Crop!='Sugarcane',], aes(x=Crop, y=Yield, fill=Crop)) + geom_boxplot())
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: Removed 1285 rows containing non-finite values (stat_boxplot).

Box plot of Sugarcane

ggplotly(ggplot(data=yield_db[yield_db$Crop=='Sugarcane',], aes(x=Crop, y=Yield, fill=Crop)) + geom_boxplot())
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: Removed 157 rows containing non-finite values (stat_boxplot).

The box plots show that the yield of Sugarcane is on a totally different scale from the other crops.

Talk to the team about this.
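One possible way to compare crops despite the scale difference, offered only as a sketch and not as the agreed approach, is to standardise yield within each crop. The toy values below are hypothetical, and `Yield_z` is an illustrative column name.

```r
# Toy data for illustration only; not taken from the real dataset.
toy <- data.frame(
  Crop  = c("Wheat", "Wheat", "Wheat", "Sugarcane", "Sugarcane", "Sugarcane"),
  Yield = c(4000, 4400, 4800, 70000, 74000, 78000)
)

# Centre and scale each crop's yields using that crop's own mean and
# standard deviation, so all crops end up on a comparable scale.
toy$Yield_z <- ave(
  toy$Yield, toy$Crop,
  FUN = function(y) (y - mean(y, na.rm = TRUE)) / sd(y, na.rm = TRUE)
)

print(toy)
```

After this step both crops range from -1 to 1 in the toy data, so a single box plot could show them side by side.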

Average Yield by Crop and Year

# group_by_() is deprecated in dplyr; group_by() with bare column names gives the same result
avg_yield_crop <- yield_db %>% group_by(Crop, Year) %>% dplyr::summarize(Mean = mean(Yield, na.rm = TRUE))
avg_yield_crop <- as.data.frame(avg_yield_crop)
avg_yield_crop
##         Crop Year       Mean
## 1      Bajra 2013  2012.1979
## 2      Bajra 2014  1812.5745
## 3      Bajra 2015  1793.3187
## 4      Bajra 2016  2026.4536
## 5     Barley 2012  3313.6351
## 6     Barley 2013  3821.1806
## 7     Barley 2014  2958.2714
## 8     Barley 2015  3303.3788
## 9     Barley 2016  3608.1795
## 10    Cotton 2013   564.4933
## 11    Cotton 2014   493.0250
## 12    Cotton 2015   286.6604
## 13    Cotton 2016   548.0345
## 14      Gram 2012  1253.4783
## 15      Gram 2013   990.8182
## 16      Gram 2014   758.1739
## 17      Gram 2015   673.1290
## 18      Gram 2016  1453.9429
## 19     Maize 2013  3123.1667
## 20     Maize 2014  2435.2857
## 21     Maize 2015  2945.3667
## 22     Maize 2016  3219.3158
## 23     Paddy 2013  3102.9143
## 24     Paddy 2014  2981.9906
## 25     Paddy 2015  2945.5351
## 26     Paddy 2016  3020.1204
## 27 Sugarcane 2013 74102.0385
## 28 Sugarcane 2014 71587.6627
## 29 Sugarcane 2015 73049.1071
## 30 Sugarcane 2016 77390.8675
## 31     Wheat 2012  4397.4538
## 32     Wheat 2013  4636.6555
## 33     Wheat 2014  3860.0924
## 34     Wheat 2015  4286.9754
## 35     Wheat 2016  4764.3689

For all crops excluding Sugarcane

ggplotly(ggplot(data = yield_db[yield_db$Crop!='Sugarcane',], aes(x=Year, y=Yield, fill=Year)) + geom_boxplot() + facet_grid(Crop~., scales = "free_y") + ylab("Average Yield"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: Removed 1285 rows containing non-finite values (stat_boxplot).

For Sugarcane

ggplotly(ggplot(data = yield_db[yield_db$Crop=='Sugarcane',], aes(x=Year, y=Yield, fill=Year)) + geom_boxplot() + ylab("Average Yield"))
## We recommend that you use the dev version of ggplot2 with `ggplotly()`
## Install it with: `devtools::install_github('hadley/ggplot2')`
## Warning: Removed 157 rows containing non-finite values (stat_boxplot).

IMPORTANT: Outliers need to be tackled by crop and year.
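A minimal sketch of flagging outliers within each crop and year is given below. It assumes the same 1.5 * IQR rule that box plots use for their whiskers; that rule is an assumption here, not a decision taken in this report. `is_outlier` is a hypothetical helper and the toy values are not from the real dataset.

```r
# Toy data for illustration only; not taken from the real dataset.
toy <- data.frame(
  Crop  = rep(c("Wheat", "Gram"), each = 6),
  Year  = factor(rep(2013, 12)),
  Yield = c(4300, 4350, 4400, 4450, 4500, 9000,
            900, 950, 1000, 1050, 1100, NA)
)

# TRUE where a value lies more than 1.5 * IQR outside the quartiles
# of its own group; missing values are never flagged.
is_outlier <- function(y) {
  q <- quantile(y, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(y) & (y < q[1] - fence | y > q[2] + fence)
}

# Apply the rule separately to every Crop and Year combination.
toy$Outlier <- as.logical(ave(toy$Yield, toy$Crop, toy$Year, FUN = is_outlier))

print(toy[toy$Outlier, ])
```

Because the quartiles are computed per group, a value that is normal for Sugarcane is not flagged just because it is large next to Gram or Cotton.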