MATH2349 Semester 1, 2020

Required packages

library(readr)
library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)

## 
## Attaching package: 'magrittr'

## The following object is masked from 'package:tidyr':
## 
##     extract

library(lubridate)

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

library(ggplot2)
library(outliers)

Executive Summary

The dataset is pre-processed to compare the Agricultural Land (hectares) available in Indian States to the Agricultural Land (hectares) used for cultivating, harvesting, winnowing and threshing the rice crop in given four years. The dataset also shows information on the agricultural land cultivated and harvested by Men and Women. In the analysis, appropriate data type conversions and tidying of the dataset is used. We have used the gather() function to tidy the dataset and later have merged the two datasets using the left_join() function.Also, there were missing values which were replaced by the mean of the particular column. The combined dataset has variables: States, Gender, Year, Agricultural Land in every state, Land used for cultivation of a particular crop. A new column is added to the combined the dataset which gives information on percentage of Land used for cultivation of the particular crop. I have also checked for the outliers in the dataset and the outliers were capped using capping function. The data transformation is done using plotting a histogram.

Data

The Datasets used for analysis are Agricultural Land (hectares) in India’s every state and the land (hectares) in every state used for cultivating, harvesting, winnowing and threshing a rice crop. The dataset was sourced from Kaggle.com whose author was Robin Reni. The first dataset provides the information about the available land in every Indian State for Agriculture and has two variables/columns i.e. States and Agricultural Land.

Acknowledgement - The year columns in second dataset has values 1-2011,2-2012,3-2013,4-2014. I have changed those names to 2011,2012,2013,2014 respectively.

States/UTs:States or Union Territories present in the Republic of India and Agriculturalland:Land in hectares available for cultivating,harvesting,winnowing and threshing rice crop in every State/UTs in India.

Agri_Land_India <- read_csv("D:/MS Analytics/Data Wrangling/Assignment 2/datasets_19581_25424_Indias Agricultural land details.csv")

## Parsed with column specification:
## cols(
##   `States/UTs` = col_character(),
##   Agriculturalland = col_double()
## )

head(Agri_Land_India)

The second dataset provides information about agricultural land used in every state used for cultivating, harvesting,winnowing and threshing rice crop for four years from 2011 to 2014. The second dataset contains seven variables/columns that are Work Done(Unique Entity), States, Gender, Years 2011 to 2014 respectively.

The first column:unique value/entity, States:Name of every state in Republic of India, Gender:Men and Women, Years: 2011 to 2014.

When the data is tidied, the two datasets are merged for further analysis.

Agri_Harvest_Winnow_thresh <- read_csv("D:/MS Analytics/Data Wrangling/Assignment 2/datasets_19581_25424_Indian states agricultural harvesting winnowing threshing  rate.csv")

## Parsed with column specification:
## cols(
##   Particulars = col_character(),
##   States = col_character(),
##   Gender = col_character(),
##   `2011` = col_double(),
##   `2012` = col_double(),
##   `2013` = col_double(),
##   `2014` = col_double()
## )

head(Agri_Harvest_Winnow_thresh)

Understand

The Agricultural Land India dataset has two variables out of which States is a character data type and Agriculturalland is numeric data type.

str(Agri_Land_India)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 28 obs. of  2 variables:
##  $ States/UTs      : chr  "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ Agriculturalland: num  15930 424 3217 6582 5552 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `States/UTs` = col_character(),
##   ..   Agriculturalland = col_double()
##   .. )

The second dataset contains seven variables/columns. The states variable is a character data type which stores names of every state in India, Gender variable is character data type but we convert it into factor variables and label them 1,0 for Men and Women respectively. The last four columns are years which are numeric data type which store numeric values of land used in every state for cultivating, harvesting, winnowing and threshing crop.

str(Agri_Harvest_Winnow_thresh)

## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 32 obs. of  7 variables:
##  $ Particulars: chr  "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" ...
##  $ States     : chr  "Andhra Pradesh" "Assam" "Bihar" "Gujarat" ...
##  $ Gender     : chr  "Men" "Men" "Men" "Men" ...
##  $ 2011       : num  214 181 184 133 339 ...
##  $ 2012       : num  237 179 173 157 322 ...
##  $ 2013       : num  232 185 178 163 316 ...
##  $ 2014       : num  239 191 185 164 310 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Particulars = col_character(),
##   ..   States = col_character(),
##   ..   Gender = col_character(),
##   ..   `2011` = col_double(),
##   ..   `2012` = col_double(),
##   ..   `2013` = col_double(),
##   ..   `2014` = col_double()
##   .. )

Here we convert the Gender into a factor variable which are labelled as 1,0 for Men and Women respectively.

Agri_Harvest_Winnow_thresh$Gender <- factor(Agri_Harvest_Winnow_thresh$Gender, levels = c("Men","Women"), labels = c(1,0))

levels(Agri_Harvest_Winnow_thresh$Gender)

## [1] "1" "0"

is.factor(Agri_Harvest_Winnow_thresh$Gender)

## [1] TRUE

Scan I

In this step I am searching for any missing values in the datasets. The missing values if found will be replaced by the mean of the particular column.

No missing values were found in the first dataset. One missing value was found in the second dataset in 2014 column.

I summarised the second dataset where I found the minimun, maximum, mean, median, 1st and 3rd quantile of each variable.

I replaced the missing value (NA) with mean of column 2014.

colSums(is.na(Agri_Land_India))

##       States/UTs Agriculturalland 
##                0                0

colSums(is.na(Agri_Harvest_Winnow_thresh))

## Particulars      States      Gender        2011        2012        2013 
##           0           0           0           0           0           0 
##        2014 
##           1

Agri_Harvest_Winnow_thresh %>% summary()

##  Particulars           States          Gender      2011            2012      
##  Length:32          Length:32          1:17   Min.   :126.3   Min.   :135.8  
##  Class :character   Class :character   0:15   1st Qu.:161.6   1st Qu.:158.8  
##  Mode  :character   Mode  :character          Median :201.6   Median :194.8  
##                                               Mean   :220.6   Mean   :219.7  
##                                               3rd Qu.:252.0   3rd Qu.:241.0  
##                                               Max.   :501.1   Max.   :501.1  
##                                                                              
##       2013            2014      
##  Min.   :135.6   Min.   :133.9  
##  1st Qu.:163.4   1st Qu.:165.8  
##  Median :197.3   Median :191.9  
##  Mean   :222.6   Mean   :211.9  
##  3rd Qu.:240.8   3rd Qu.:236.7  
##  Max.   :501.1   Max.   :379.2  
##                  NA's   :1

After running this chuck we can see the missing value is replaced by the mean of column 2014 and the colsums() tells us now there no missing values present in the dataset.

Agri_Harvest_Winnow_thresh$`2014`[is.na(Agri_Harvest_Winnow_thresh$`2014`=="NA")] = 211.9

colSums(is.na(Agri_Harvest_Winnow_thresh))

## Particulars      States      Gender        2011        2012        2013 
##           0           0           0           0           0           0 
##        2014 
##           0

Tidy & Manipulate Data I

In the tidy and manipulate data I step, we have used the gather() to gather all the year columns in the second dataset in one single column.

To store the values of every year, we created a new variable ‘Sqft_area’ that denotes the values of every year.

The Agricultural Harvest dataset is now in a tidy format.

Agri_Harvest1=Agri_Harvest_Winnow_thresh

Agri_Harvest1 <- Agri_Harvest_Winnow_thresh %>% gather('2011','2012','2013','2014', key = "Year", value = "Sqft_area")

head(Agri_Harvest1)

In this step, we want to merge the two datasets into one.

We do this by checking which columns in both datasets are common. In the first dataset ‘India Agricultural Land’, the states of India are stored in ‘States/UTs’ column. In the second dataset, the states of India are stored in the ‘states’ column.

Therefore we rename the ‘States/UTs’ column in the first dataset to ‘States’.

names(Agri_Land_India)

## [1] "States/UTs"       "Agriculturalland"

This is new renamed column of the first data set.

State_rename <- rename(Agri_Land_India, States="States/UTs")

head(State_rename)

The two datasets are merged using left join with a common key as States.

I have merged the dataset to further analyze and find out the pecentage of land in every state used for cultivating, harvesting, winnowing and threshing.

‘subset1’: a new object created to merge the two datasets which will be used further for analyzing the data.

subset1 <- left_join(Agri_Harvest1, State_rename, by = "States")

head(subset1)

Tidy & Manipulate Data II

In this step I have analyzed and merged dataset to find out percentage of land culivated, harvested, winnowed and threshed according to gender.

I have created ‘subset2’ from ‘subset1’ to add a new column ‘Land_for_harvest_winnow_threshing’ which gives an information about the percentage of land in every state used for crop cultivation and harvesting purposes classified according to gender.

Land_for_Harvest_Winnow_Threshing:(Sqft_area / Agriculturalland * 100)

subset2=subset1

subset2 <- subset1 %>% mutate(Land_for_Harvest_Winnow_Threshing = Sqft_area/Agriculturalland*100)

head(subset2)

str(subset2)

## Classes 'tbl_df', 'tbl' and 'data.frame':    128 obs. of  7 variables:
##  $ Particulars                      : chr  "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" "Harvesting, Winnowing and Threshing" ...
##  $ States                           : chr  "Andhra Pradesh" "Assam" "Bihar" "Gujarat" ...
##  $ Gender                           : Factor w/ 2 levels "1","0": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Year                             : chr  "2011" "2011" "2011" "2011" ...
##  $ Sqft_area                        : num  214 181 184 133 339 ...
##  $ Agriculturalland                 : num  15930 3217 6582 12661 3664 ...
##  $ Land_for_Harvest_Winnow_Threshing: num  1.35 5.62 2.8 1.05 9.25 ...

Scan II

The ‘AgriculturalLand’ variable will not show an outliers or cannot be used to check the outliers as the values will vary across all the states. These values depend on the size of the state and the land available for agriculture in the particular state.

In addition to this, the ‘Land_for_Harvest_Winnow_Threshing’ variable also cannot be used for checking the outliers as these values will also differ from state to state depending on the amount of the particular crop cultivated.

subset2 %>% boxplot(Sqft_area~Year,data = .)

From the boxplot above we can see that ‘Years’ have outliers where as ‘Sqft_Area’ variable depends usually on the Agricultural Land available in each state and agricultural land available for cultivation of the particular crop.

The cap() function from the module notes helps to caps those observations outside the lower limit with the value of 5th percentile and those that lie above the upper limit, with the value of 95th percentile.(from StackOverflow)

Hence I used the cap() function on the variable year

cap <- function(x){
  quantiles <- quantile(x,c(0.05,0.25,0.75,0.95))
  x[x<quantiles[2]-1.5*IQR(x)] <- quantiles [1]
  x[x>quantiles[3]+1.5*IQR(x)] <- quantiles [4]
  x
}

subset3 <- subset1 %>% group_by(Year) %>% mutate(Sqft_area=cap(Sqft_area))

Transform

In this step we apply transformation of the variable ‘Sqft_Area’.

For the better understanding of the variable and better scaling the transformation of data is carried out. A transition is found when a data distribution is non linear and when we investigate/analyze further and apply correct data transformation techniques a linear distribution can be obtained.

head(subset2)

In the histogram of ‘Sqft_Area’, it is seen data is skewed to the right.

The Median value of the Agricultural land under cultivation is nearly about 200(hectares)

To minimize the skewness, a log transformation of the variable is used.

subset2$Sqft_area %>% hist(main="Histogram - Square foot area", col = "skyblue")

The log transformation of the ‘Sqft_Area’ variable is used for the normalising the dataset.

log10(subset2$Sqft_area) %>% hist(main="Histogram - Logarithm of Square foot area", col = "yellow")

NOTE: Note that sometimes the order of the tasks may be different than the order given here. For example, you may need to tidy the data sets first to be able to create the common key to merge. Therefore, for such cases you may have a different ordering of the sections.

Any further or optional pre-processing tasks can be added to the template using an additional section in the R Markdown file. Make sure your code is visible (within the margin of the page). Do not use View() to show your data, instead give headers (using head() )