Government of India has made a lot of useful data publically available. This includes data for agriculture, infrastructure, technology and so on. And so we can look at some of this data to see if we can glean something meaningful from them.
In this study here, we look at the agricultural food production for various grains, cereals and oilseeds from the year 2001 to 2017. We have also obtained data for annual rainfall in India for these years as a separate dataset and our goal is to see how strongly agricultural food production depends on annual rainfall.
library(dplyr)
library(ggplot2)
library(xlsx)
library(reshape2)
library(corrplot)
grains <- read.csv("./Table_8.3-All_India_1.csv")
rain <- read.xlsx("./All_India_Area_Weighted_Monthly_Seasonal_And_Annual_Rainfall.xls", sheetIndex = 1)
Now that we have loaded the data, we can look at some of the variables to see what it contains
str(grains)
## 'data.frame': 16 obs. of 36 variables:
## $ Year : Factor w/ 16 levels "2001-02","2002-03",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Food.Grains..Cereals....Rice..000.tonnes. : int 93340 71820 88526 83132 91793 93355 96693 99172 89083 95970 ...
## $ Food.Grains..Cereals....Jowar..000.tonnes. : int 7557 7012 6681 7244 7630 7151 7926 7246 6698 7003 ...
## $ Food.Grains..Cereals....Bajra..000.tonnes. : int 8284 4719 12109 7931 7684 8424 9970 8887 6506 10370 ...
## $ Food.Grains..Cereals....Maize..000.tonnes. : int 13160 11152 14984 14172 14710 15097 18955 19731 16720 21726 ...
## $ Food.Grains..Cereals....Ragi..000.tonnes. : int 2375 1316 1966 2432 2354 1444 2152 2040 1889 2194 ...
## $ Food.Grains..Cereals....Small.Millets..000.tonnes.: int 577 459 564 478 472 480 551 445 382 442 ...
## $ Food.Grains..Cereals....Wheat..000.tonnes. : int 72766 65761 72156 68637 69355 75807 78570 80679 80804 86874 ...
## $ Food.Grains..Cereals....Barley..000.tonnes. : int 1425 1407 1298 1207 1221 1328 1196 1689 1355 1663 ...
## $ Food.Grains..Cereals....Total..000.tonnes. : int 199483 163646 198284 185233 195217 203085 216014 219889 203435 226241 ...
## $ Food.Grains..Pulses....Gram..000.tonnes. : int 5473 4237 5718 5469 5600 6334 5749 7060 7476 8221 ...
## $ Food.Grains..Pulses....Tur..000.tonnes. : int 2260 2186 2356 2347 2738 2314 3076 2266 2465 2861 ...
## $ Food.Grains..Pulses....Other.Pulses..000.tonnes. : int 5635 4702 6831 5314 5046 5550 5937 5240 4720 7159 ...
## $ Food.Grains..Pulses....Total..000.tonnes. : int 13368 11125 14905 13130 13384 14198 14762 14566 14662 18241 ...
## $ Food.Grains...Total..000.tonnes. : int 212851 174771 213189 198363 208602 217282 230775 234456 218098 244482 ...
## $ Oilseeds...Ground.nuts..000.tonnes. : int 7028 4121 8127 6774 7993 4864 9183 7168 5428 8266 ...
## $ Oilseeds...Sesamum..000.tonnes. : int 698 441 782 674 641 618 757 640 588 893 ...
## $ Oilseeds...Rapeseed.and.Mustard..000.tonnes. : int 5083 3880 6291 7593 8131 7438 5834 7201 6608 8179 ...
## $ Oilseeds...Linseed..000.tonnes. : int 209 177 197 170 173 168 163 169 154 147 ...
## $ Oilseeds...Castor.seed..000.tonnes. : int 653 428 797 793 991 762 1054 1171 1009 1350 ...
## $ Total.Nine.Oilseeds..000.tonnes. : int 20662 14838 25186 24354 27978 24289 29755 27719 24882 32477 ...
## $ Cotton..000.Bales. : int 9997 8624 13729 16429 18499 22632 25884 22276 24022 33000 ...
## $ Jute..000.Bales. : int 10584 10274 10252 9399 9970 10317 10220 9634 11230 10009 ...
## $ Mesta..000.Bales. : int 1094 1002 921 873 870 956 990 731 587 611 ...
## $ Tea..Million.kgs. : num 854 846 879 907 949 973 987 973 991 967 ...
## $ Coffee..000.MT. : int 301 275 271 276 274 288 262 262 290 302 ...
## $ Natural.Rubber..000.MT. : num 631 649 712 750 803 853 825 865 831 862 ...
## $ Banana..000.tonnes. : int 13160 11975 11388 11710 12105 16609 23823 26217 26470 29780 ...
## $ Sugarcane..000.tonnes. : int 297208 287383 233862 237088 281172 355520 348188 285029 292302 342382 ...
## $ Tobacco..000.tonnes. : int 546 500 550 549 552 469 437 569 665 802 ...
## $ Potatoes..000.tonnes. : int 23924 23269 23060 23631 23905 22181 34658 34391 36577 42339 ...
## $ Black.Pepper..000.tonnes. : int 62 72 73 73 93 69 50 50 50 52 ...
## $ Chillies..000.tonnes. : int 1069 895 1236 1186 1015 1242 1371 1382 1470 1223 ...
## $ Ginger..000.tonnes. : int 322 280 302 359 391 393 775 832 708 702 ...
## $ Coconut..000.tonnes. : int 12963 12535 12178 12833 14811 15840 10148 10148 10824 10840 ...
## $ Turmeric..000.tonnes. : int 563 522 565 718 852 787 884 895 928 993 ...
So we can see all the different kinds of food grains that this data contains and also their units and variable types. We now rename them so that they become a little easier to read keeping in mind that almost all of the variables are measured in thousands of tonnes except for cotton, jute and mesta which are measured in thousands of bales.
grains <- grains %>% rename(rice = 2, jowar = 3, bajra = 4, maize = 5, ragi = 6, millets = 7, wheat = 8)
grains <- grains %>% rename(barley = 9, tcereals = 10, gram = 11, tur = 12, otherpulses = 13, totalpulses = 14)
grains <- grains %>% rename(totalgrains = 15, gnuts = 16, sesame = 17, mustard = 18, linseed = 19)
grains <- grains %>% rename(castor = 20, totaloilseeds = 21, cotton = 22, jute = 23, mesta = 24, tea = 25)
grains <- grains %>% rename(coffee = 26, rubber = 27, banana = 28, sugarcane = 29, tobacco = 30, potatoes = 31)
grains <- grains %>% rename(pepper = 32, chilles = 33, ginger = 34, coconut = 35, turmeric = 36)
Let us now look and see if we can find any kind of corelation between these grains. We do this by creating a corelation matrix and use the melt function to get the corelation in terms of each of the variables with the help of a lower triangular matrix.
cormat <- round(cor(as.matrix(grains[,2:36])),2)
##Function
get_lower_tri <- function(cormat){
cormat[upper.tri(cormat)] <- NA
return(cormat)
}
lowertri <- get_lower_tri(cormat)
melted.cormat <- melt(lowertri, na.rm = TRUE)
##Sorting the corelation matrix to pick up variables with maximum positive or negative corelation
sorted.cormat <- melted.cormat[order(melted.cormat$value),]
We can now look at the first few rows and last few rows to see if there are any corelations between the grain production numbers.We arbitrarily assume a 70% corelation to be our threshold for this corelation exercise.
neg.cormat <- head(sorted.cormat, 6)
ggplot(data = neg.cormat, aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")
Figure 1: A heatmap plot showing the negative corelation between grains where the corelation is greater than an absolute value of 0.7
So we can see that there is a fairly strong negative corelation among a few variables which would imply that on average if a certain crop does well in terms of production, then the corresponding crop would do poorly.
Some of these might make sense but some others (if not most) could be entirely coincidental. Further investigation would be required to see if this makes sense.
Let us now turn to the variables that may have a positive corelation.
pos.cormat <- sorted.cormat[c(140:170),]
ggplot(data = pos.cormat, aes(x = Var1, y = Var2, fill = value)) + geom_tile(color = "white")
Figure 2: A heatmap plot showing the positive corelation between grains where the corelation is greater than an absolute value of 0.7
So again we can see that some grains have a very high corelation and further investigation might be needed to see if this seems reasonable. Another thing to note here is that rice has a fairly high proportion in terms of the total grain production.
We had earlier loaded the rainfall data set. Let us explore this data set to see what it contains.
head(rain)
## YEAR JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC
## 1 1901 34.7 38.6 17.8 38.9 50.6 113.2 241.4 271.6 124.7 52.4 38.7 8.2
## 2 1902 7.4 4.2 19.0 44.1 48.8 111.7 284.9 201.0 200.2 62.5 29.4 25.2
## 3 1903 16.7 8.0 31.1 17.1 59.5 120.3 293.2 274.0 198.1 119.5 40.3 18.0
## 4 1904 14.9 9.7 31.4 33.7 73.8 165.5 260.3 207.7 130.8 69.8 11.2 16.4
## 5 1905 24.7 20.3 41.8 33.8 55.8 93.7 253.0 201.7 178.1 54.9 9.6 10.1
## 6 1906 21.4 49.9 31.4 15.8 37.2 177.0 286.5 251.4 183.9 50.6 17.7 26.3
## ANN Jan.Feb Mar.May Jun.Sep Oct.Dec
## 1 1030.8 73.2 107.3 751.0 99.3
## 2 1038.4 11.6 111.9 797.8 117.2
## 3 1195.9 24.7 107.7 885.6 177.8
## 4 1025.1 24.5 138.8 764.3 97.4
## 5 977.5 45.0 131.4 726.4 74.7
## 6 1149.2 71.3 84.4 898.9 94.6
So we see here that this dataset contains month-wise and annual rainfall across India from the year 1901 all the way to 2014.
The questions that we try to answer here is: - Does grain production depend on the amount of annual rainfall? - Is there some sort of consistency between the dataset for rainfall and grain production? In other words, we would like to see for example, that since rice and maize are positively corelated in the grains dataset, can we also expect to see this positive corelation if we were to look at the their production with the rainfall data now included?
Let us start by looking at the first question. In order to do this, we begin by filtering the data for rainfall from the year 2001 onwards since we only have grain production data from 2001 onwards. Also, since we only have the rainfall data up to 2014, we extract only the data from 2001-14 for the grains dataset.
rain.new <- rain %>% filter(YEAR >=2001)
grains.new <- grains[c(1:14),]
## Combining the two datasets
grainsrain <- cbind(grains.new, rain.new)
We can now look at a plot to try and answer the first question.
grains1 <- melt(grainsrain, id = "ANN", measure = c("rice", "maize"))
ggplot(data = grains1, aes(x = ANN, y = value, color = variable)) + geom_point() + xlab("Annual Rainfall in mm") + ylab("Grain Production (1000s tonnes)")
Figure 3: A plot showing the grain production as a function of annual rainfall
We can see two things for the graph above. One, is that the grain production for rice is way higher than that for maize. The second observation that we can make here is that the grain production does seem to trend upwards for both rice and maize as the amount of annual rainfall increases. However, we see that the production also goes down a bit if there is excessive rainfall which seems to make sense.
Another point that this graph confirms is that there is a positive corelation between rice and maize. We had seen eariler from just looking at the grains dataset that rice and maize had a positive corelation in the 0.95 range. Now when we look at the influence of rainfall on grain production, we get a consistent trend with the grain production dataset thus partly answering the second question.
Let us now try to see if a negative corelation between maize and mesta that was suggested by the grains dataset can be confirmed. Since the scales of production for maize and mesta are so different, we plot them out as separate graphs.
ggplot(grainsrain, aes(x = ANN, y = mesta)) + geom_point()+ xlab("Annual Rainfall in mm") + ylab("Crop Production (1000s bales)")
Figure 4: Mesta crop production vs annual rainfall
ggplot(grainsrain, aes(x = ANN, y = maize)) + geom_point()+ xlab("Annual Rainfall in mm") + ylab("Grain Production (1000s tonnes)")
Figure 5: Maize crop production vs annual rainfall
So we can see here that mesta production tends to go higher as we go from a moderate annual rainfall season to a high annual rainfall season whereas for maize, the opposite is true. We tend to get a higher maize production for moderate annual rainfall and a slightly lower production for a higher annual rainfall season thus confirming the negative corelation between the two.
References:
The grain production data was obtained from: https://data.gov.in/resources/all-india-level-production-principal-crops-2001-02-2016-17
The annual rainfall data was obtained here: https://data.gov.in/catalog/all-india-area-weighted-monthly-seasonal-and-annual-rainfall-mm
I’d also like to mention the link below as a reference for creating the heatmaps shown: http://www.sthda.com/english/wiki/ggplot2-quick-correlation-matrix-heatmap-r-software-and-data-visualization
About the author: The author works as a HVAC Project Engineer in the Twin Cities and is an enthusiastic data science hobbyist. Any feedback would be greatly appreciated. Please feel free to reach out with your feedback at premkumarsiddharth@outlook.com