Introduction

Government of India has made a lot of useful data publically available. This includes data for agriculture, infrastructure, technology and so on. And so we can look at some of this data to see if we can glean something meaningful from them.

In this study here, we look at the agricultural food production for various grains, cereals and oilseeds from the year 2001 to 2017. We have also obtained data for annual rainfall in India for these years as a separate dataset and our goal is to see how strongly agricultural food production depends on annual rainfall.

Executive Summary

Load the Data

library(dplyr)
library(ggplot2)
library(xlsx)
library(reshape2)
library(corrplot)

grains <- read.csv("./Table_8.3-All_India_1.csv")
rain <- read.xlsx("./All_India_Area_Weighted_Monthly_Seasonal_And_Annual_Rainfall.xls", sheetIndex = 1)

Exploratory Data Analysis (EDA) - Grain Production

Now that we have loaded the data, we can look at some of the variables to see what it contains

str(grains)
## 'data.frame':    16 obs. of  36 variables:
##  $ Year                                              : Factor w/ 16 levels "2001-02","2002-03",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Food.Grains..Cereals....Rice..000.tonnes.         : int  93340 71820 88526 83132 91793 93355 96693 99172 89083 95970 ...
##  $ Food.Grains..Cereals....Jowar..000.tonnes.        : int  7557 7012 6681 7244 7630 7151 7926 7246 6698 7003 ...
##  $ Food.Grains..Cereals....Bajra..000.tonnes.        : int  8284 4719 12109 7931 7684 8424 9970 8887 6506 10370 ...
##  $ Food.Grains..Cereals....Maize..000.tonnes.        : int  13160 11152 14984 14172 14710 15097 18955 19731 16720 21726 ...
##  $ Food.Grains..Cereals....Ragi..000.tonnes.         : int  2375 1316 1966 2432 2354 1444 2152 2040 1889 2194 ...
##  $ Food.Grains..Cereals....Small.Millets..000.tonnes.: int  577 459 564 478 472 480 551 445 382 442 ...
##  $ Food.Grains..Cereals....Wheat..000.tonnes.        : int  72766 65761 72156 68637 69355 75807 78570 80679 80804 86874 ...
##  $ Food.Grains..Cereals....Barley..000.tonnes.       : int  1425 1407 1298 1207 1221 1328 1196 1689 1355 1663 ...
##  $ Food.Grains..Cereals....Total..000.tonnes.        : int  199483 163646 198284 185233 195217 203085 216014 219889 203435 226241 ...
##  $ Food.Grains..Pulses....Gram..000.tonnes.          : int  5473 4237 5718 5469 5600 6334 5749 7060 7476 8221 ...
##  $ Food.Grains..Pulses....Tur..000.tonnes.           : int  2260 2186 2356 2347 2738 2314 3076 2266 2465 2861 ...
##  $ Food.Grains..Pulses....Other.Pulses..000.tonnes.  : int  5635 4702 6831 5314 5046 5550 5937 5240 4720 7159 ...
##  $ Food.Grains..Pulses....Total..000.tonnes.         : int  13368 11125 14905 13130 13384 14198 14762 14566 14662 18241 ...
##  $ Food.Grains...Total..000.tonnes.                  : int  212851 174771 213189 198363 208602 217282 230775 234456 218098 244482 ...
##  $ Oilseeds...Ground.nuts..000.tonnes.               : int  7028 4121 8127 6774 7993 4864 9183 7168 5428 8266 ...
##  $ Oilseeds...Sesamum..000.tonnes.                   : int  698 441 782 674 641 618 757 640 588 893 ...
##  $ Oilseeds...Rapeseed.and.Mustard..000.tonnes.      : int  5083 3880 6291 7593 8131 7438 5834 7201 6608 8179 ...
##  $ Oilseeds...Linseed..000.tonnes.                   : int  209 177 197 170 173 168 163 169 154 147 ...
##  $ Oilseeds...Castor.seed..000.tonnes.               : int  653 428 797 793 991 762 1054 1171 1009 1350 ...
##  $ Total.Nine.Oilseeds..000.tonnes.                  : int  20662 14838 25186 24354 27978 24289 29755 27719 24882 32477 ...
##  $ Cotton..000.Bales.                                : int  9997 8624 13729 16429 18499 22632 25884 22276 24022 33000 ...
##  $ Jute..000.Bales.                                  : int  10584 10274 10252 9399 9970 10317 10220 9634 11230 10009 ...
##  $ Mesta..000.Bales.                                 : int  1094 1002 921 873 870 956 990 731 587 611 ...
##  $ Tea..Million.kgs.                                 : num  854 846 879 907 949 973 987 973 991 967 ...
##  $ Coffee..000.MT.                                   : int  301 275 271 276 274 288 262 262 290 302 ...
##  $ Natural.Rubber..000.MT.                           : num  631 649 712 750 803 853 825 865 831 862 ...
##  $ Banana..000.tonnes.                               : int  13160 11975 11388 11710 12105 16609 23823 26217 26470 29780 ...
##  $ Sugarcane..000.tonnes.                            : int  297208 287383 233862 237088 281172 355520 348188 285029 292302 342382 ...
##  $ Tobacco..000.tonnes.                              : int  546 500 550 549 552 469 437 569 665 802 ...
##  $ Potatoes..000.tonnes.                             : int  23924 23269 23060 23631 23905 22181 34658 34391 36577 42339 ...
##  $ Black.Pepper..000.tonnes.                         : int  62 72 73 73 93 69 50 50 50 52 ...
##  $ Chillies..000.tonnes.                             : int  1069 895 1236 1186 1015 1242 1371 1382 1470 1223 ...
##  $ Ginger..000.tonnes.                               : int  322 280 302 359 391 393 775 832 708 702 ...
##  $ Coconut..000.tonnes.                              : int  12963 12535 12178 12833 14811 15840 10148 10148 10824 10840 ...
##  $ Turmeric..000.tonnes.                             : int  563 522 565 718 852 787 884 895 928 993 ...

So we can see all the different kinds of food grains that this data contains and also their units and variable types. We now rename them so that they become a little easier to read keeping in mind that almost all of the variables are measured in thousands of tonnes except for cotton, jute and mesta which are measured in thousands of bales.

grains <- grains %>% rename(rice = 2, jowar = 3, bajra = 4, maize = 5, ragi = 6, millets = 7, wheat = 8)
grains <- grains %>% rename(barley = 9, tcereals = 10, gram = 11, tur = 12, otherpulses = 13, totalpulses = 14)
grains <- grains %>% rename(totalgrains = 15, gnuts = 16, sesame = 17, mustard = 18, linseed = 19)
grains <- grains %>% rename(castor = 20, totaloilseeds = 21, cotton = 22, jute = 23, mesta = 24, tea = 25)
grains <- grains %>% rename(coffee = 26, rubber = 27, banana = 28, sugarcane = 29, tobacco = 30, potatoes = 31)
grains <- grains %>% rename(pepper = 32, chilles = 33, ginger = 34, coconut = 35, turmeric = 36)

Let us now look and see if we can find any kind of corelation between these grains. We do this by creating a corelation matrix and use the melt function to get the corelation in terms of each of the variables with the help of a lower triangular matrix.

cormat <- round(cor(as.matrix(grains[,2:36])),2)

##Function 
get_lower_tri <- function(cormat){
        cormat[upper.tri(cormat)] <- NA
        return(cormat)
}

lowertri <- get_lower_tri(cormat)
melted.cormat <- melt(lowertri, na.rm = TRUE)

##Sorting the corelation matrix to pick up variables with maximum positive or negative corelation
sorted.cormat <- melted.cormat[order(melted.cormat$value),]

We can now look at the first few rows and last few rows to see if there are any corelations between the grain production numbers.We arbitrarily assume a 70% corelation to be our threshold for this corelation exercise.

neg.cormat <- head(sorted.cormat, 6)

ggplot(data = neg.cormat, aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white") 


Figure 1: A heatmap plot showing the negative corelation between grains where the corelation is greater than an absolute value of 0.7

So we can see that there is a fairly strong negative corelation among a few variables which would imply that on average if a certain crop does well in terms of production, then the corresponding crop would do poorly.

Some of these might make sense but some others (if not most) could be entirely coincidental. Further investigation would be required to see if this makes sense.

Let us now turn to the variables that may have a positive corelation.

pos.cormat <- sorted.cormat[c(140:170),]

ggplot(data = pos.cormat, aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white") 


Figure 2: A heatmap plot showing the positive corelation between grains where the corelation is greater than an absolute value of 0.7

So again we can see that some grains have a very high corelation and further investigation might be needed to see if this seems reasonable. Another thing to note here is that rice has a fairly high proportion in terms of the total grain production.

EDA - Rainfall

We had earlier loaded the rainfall data set. Let us explore this data set to see what it contains.

head(rain)
##   YEAR  JAN  FEB  MAR  APR  MAY   JUN   JUL   AUG   SEP   OCT  NOV  DEC
## 1 1901 34.7 38.6 17.8 38.9 50.6 113.2 241.4 271.6 124.7  52.4 38.7  8.2
## 2 1902  7.4  4.2 19.0 44.1 48.8 111.7 284.9 201.0 200.2  62.5 29.4 25.2
## 3 1903 16.7  8.0 31.1 17.1 59.5 120.3 293.2 274.0 198.1 119.5 40.3 18.0
## 4 1904 14.9  9.7 31.4 33.7 73.8 165.5 260.3 207.7 130.8  69.8 11.2 16.4
## 5 1905 24.7 20.3 41.8 33.8 55.8  93.7 253.0 201.7 178.1  54.9  9.6 10.1
## 6 1906 21.4 49.9 31.4 15.8 37.2 177.0 286.5 251.4 183.9  50.6 17.7 26.3
##      ANN Jan.Feb Mar.May Jun.Sep Oct.Dec
## 1 1030.8    73.2   107.3   751.0    99.3
## 2 1038.4    11.6   111.9   797.8   117.2
## 3 1195.9    24.7   107.7   885.6   177.8
## 4 1025.1    24.5   138.8   764.3    97.4
## 5  977.5    45.0   131.4   726.4    74.7
## 6 1149.2    71.3    84.4   898.9    94.6

So we see here that this dataset contains month-wise and annual rainfall across India from the year 1901 all the way to 2014.

Combined EDA

The questions that we try to answer here is:
- Does grain production depend on the amount of annual rainfall?
- Is there some sort of consistency between the dataset for rainfall and grain production? In other words, we would like to see for example, that since rice and maize are positively corelated in the grains dataset, can we also expect to see this positive corelation if we were to look at the their production with the rainfall data now included?

Let us start by looking at the first question. In order to do this, we begin by filtering the data for rainfall from the year 2001 onwards since we only have grain production data from 2001 onwards. Also, since we only have the rainfall data up to 2014, we extract only the data from 2001-14 for the grains dataset.

rain.new <- rain %>% filter(YEAR >=2001)
grains.new <- grains[c(1:14),]

## Combining the two datasets
grainsrain <- cbind(grains.new, rain.new)

We can now look at a plot to try and answer the first question.

grains1 <- melt(grainsrain, id = "ANN", measure = c("rice", "maize"))
ggplot(data = grains1, aes(x = ANN, y = value, color = variable)) + geom_point() + xlab("Annual Rainfall in mm") + ylab("Grain Production (1000s tonnes)")


Figure 3: A plot showing the grain production as a function of annual rainfall

We can see two things for the graph above. One, is that the grain production for rice is way higher than that for maize. The second observation that we can make here is that the grain production does seem to trend upwards for both rice and maize as the amount of annual rainfall increases. However, we see that the production also goes down a bit if there is excessive rainfall which seems to make sense.

Another point that this graph confirms is that there is a positive corelation between rice and maize. We had seen eariler from just looking at the grains dataset that rice and maize had a positive corelation in the 0.95 range. Now when we look at the influence of rainfall on grain production, we get a consistent trend with the grain production dataset thus partly answering the second question.

Let us now try to see if a negative corelation between maize and mesta that was suggested by the grains dataset can be confirmed. Since the scales of production for maize and mesta are so different, we plot them out as separate graphs.

ggplot(grainsrain, aes(x = ANN, y = mesta)) + geom_point()+ xlab("Annual Rainfall in mm") + ylab("Crop Production (1000s bales)")


Figure 4: Mesta crop production vs annual rainfall

ggplot(grainsrain, aes(x = ANN, y = maize)) + geom_point()+ xlab("Annual Rainfall in mm") + ylab("Grain Production (1000s tonnes)")


Figure 5: Maize crop production vs annual rainfall

So we can see here that mesta production tends to go higher as we go from a moderate annual rainfall season to a high annual rainfall season whereas for maize, the opposite is true. We tend to get a higher maize production for moderate annual rainfall and a slightly lower production for a higher annual rainfall season thus confirming the negative corelation between the two.

References:

The grain production data was obtained from: https://data.gov.in/resources/all-india-level-production-principal-crops-2001-02-2016-17

The annual rainfall data was obtained here: https://data.gov.in/catalog/all-india-area-weighted-monthly-seasonal-and-annual-rainfall-mm

I’d also like to mention the link below as a reference for creating the heatmaps shown: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization

About the author:
The author works as a HVAC Project Engineer in the Twin Cities and is an enthusiastic data science hobbyist. Any feedback would be greatly appreciated.
Please feel free to reach out with your feedback at premkumarsiddharth@outlook.com