MATH2349 Data Wrangling

Required packages

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(knitr)

Executive Summary

In data pre-processing process, large number of data has been pre-processed. I have checked the number of missing values, null values, in appropriate values. The values has been converted to mean of the respective columns. As two data sets has to be merged, i have to change the column name in one dataset as merge function will only run if both the datasets have same column name. There are so many columns which does not fit the name for the dataset. So, I have changed those name also. I have created a variable which computes gdp in billion in INR as the dataset is only related to India. Gathered all the tidy columns which is not required into a single column under the variable name Type.

Data

Two datasets has been downloaded from kaggle.com. The source for those datasets are: 1. https://www.kaggle.com/khalidative/crimeanalysis?select=crime_by_district_rt.csv(District_Crimes) 2. https://www.kaggle.com/siddheshmahajan/indias-gdp-statewise(State_GDP)

In District_Crimes dataset, it has 13 variables with 9841 observations. It is combined with gender, Year of crimes, types of crimes, the state and district in which the crime has committed. POA is the crimes against tribes and scheduled castes. PCR is the protection of civil rights which has been initiated for the enforcement of any disability. Other crimes will tell about the crimes abagint scheduled castes.

In State_GDP dataset, it gives the some information of India state GDPin 2020 for different states. It has two variables with trillion in INR and billion in USD.

District_Crimes=read.csv("/Users/aravindsanjeevi/Desktop/csv/crime_by_district.csv")
GDP=read.csv("/Users/aravindsanjeevi/Desktop/csv/INDIA_GDP.csv")
head(District_Crimes)

head(GDP)

Tidy & Manipulate Data I

The dataset is untidy. Because the different types of crimes has been placed in column rather in rows. So i have used gather from dplyr package to gather the column names from Murder to Hurt to join as a single column with different types of crimes. The column DISTRICT has to be removes as it is not useful for this process. I have to rename the column name to merge the two datasets.

District_Crimes=subset(District_Crimes,select=-c(DISTRICT))
Gathering_Crimes=gather(District_Crimes,Type,Crimes,Murder:Hurt)
GDP=GDP%>%
  rename(STATE.UT=State_UT)
India_GDP=merge(GDP,Gathering_Crimes,by="STATE.UT")
head(India_GDP)

Understand

The dataset that we got from the merging has summarized. From the summary, we know the type of variables. The variable Type is character so it has to be factorised using factor function and it is ordered. The variable nominal gdp trillion and billion is converted into billion is USD and trillion in INR. These variables are numeric and it has to be factorised.

India_GDP=India_GDP%>%
  rename(Trillion_in_INR=Nominal.GDP.Trillion.INR.)%>%
  rename(Billion_in_USD=Nominal.GDP.Billion.USD.)
summary(India_GDP)

##    STATE.UT              Rank       Trillion_in_INR  Billion_in_USD 
##  Length:4949        Min.   : 1.00   Min.   : 0.176   Min.   :  2.5  
##  Class :character   1st Qu.: 5.00   1st Qu.: 4.090   1st Qu.: 57.0  
##  Mode  :character   Median :11.00   Median : 9.620   Median :130.0  
##                     Mean   :11.95   Mean   :11.184   Mean   :155.9  
##                     3rd Qu.:17.00   3rd Qu.:17.910   3rd Qu.:250.0  
##                     Max.   :32.00   Max.   :32.240   Max.   :450.0  
##       Year      Prevention.of.atrocities..POA..Act
##  Min.   :2013   Min.   :   0.00                   
##  1st Qu.:2013   1st Qu.:   0.00                   
##  Median :2013   Median :   2.00                   
##  Mean   :2013   Mean   :  39.45                   
##  3rd Qu.:2013   3rd Qu.:  20.00                   
##  Max.   :2013   Max.   :5584.00                   
##  Protection.of.Civil.Rights..PCR..Act Other.Crimes.Against.SCs
##  Min.   : 0.0000                      Min.   :   0.00         
##  1st Qu.: 0.0000                      1st Qu.:   0.00         
##  Median : 0.0000                      Median :   2.00         
##  Mean   : 0.1669                      Mean   :  43.05         
##  3rd Qu.: 0.0000                      3rd Qu.:  24.00         
##  Max.   :21.0000                      Max.   :5339.00         
##      Type               Crimes       
##  Length:4949        Min.   :  0.000  
##  Class :character   1st Qu.:  0.000  
##  Mode  :character   Median :  0.000  
##                     Mean   :  2.903  
##                     3rd Qu.:  1.000  
##                     Max.   :659.000

Type_factor=factor(India_GDP$Type,ordered = TRUE)
head(Type_factor)

## [1] Assault.on.women Arson            Dacoity          Dacoity         
## [5] Arson            Dacoity         
## 7 Levels: Arson < Assault.on.women < Dacoity < ... < Robbery

Trillion_factor=factor(India_GDP$Trillion_in_INR,ordered=TRUE)
head(Trillion_factor)

## [1] 10.81 10.81 10.81 10.81 10.81 10.81
## 28 Levels: 0.176 < 0.215 < 0.222 < 0.231 < 0.234 < 0.318 < 0.33 < ... < 32.24

Billion_factor=factor(India_GDP$Billion_in_USD,ordered=TRUE)
head(Billion_factor)

## [1] 150 150 150 150 150 150
## 26 Levels: 2.5 < 3 < 3.1 < 3.2 < 3.3 < 4.5 < 4.6 < 5 < 6.5 < 11 < 26 < ... < 450

Tidy & Manipulate Data II

In this step, I have created a variable called Billion_in_INR. Because the data has all the information about only India. So I have computed from Billion_in_USD to Billion_in_INR with the dollar value of USD. Using cbind(), I have attached the variable to the previous datasets.

Billion_in_INR=(India_GDP$Billion_in_USD)*71
India_GDP=cbind(India_GDP,Billion_in_INR)
head(India_GDP)

Scan I

I have scanned the total datasets, and found that most of the values are 0. I have decided to change the value of 0 to mean of that column. For that, I have changed the values to NA and then converted it to mean.

India_GDP[India_GDP==0]=NA  
India_GDP=India_GDP%>%
  mutate_all(~ifelse(is.na(.x),mean(.x,na.rm = TRUE),.x))
head(India_GDP)

Scan II

With the scanning of numeric variables, there are only three numeric variables in the dataset that is Trillion_in_INR, Billion_in_USD and Billion_in_INR. Using the plot, there is no outliers between these columns.

plot(India_GDP$Trillion_in_INR,India_GDP$Billion_in_USD)
abline(lm(Billion_in_USD~Trillion_in_INR,India_GDP),col="blue")

Transform

The variable Billion_in_INR has been trasnformed using log function. From the histigram of the the Transformation variable, we can see that it is left skewed. From the comparison of both histograms, the transformation variable is left skewed with log() function.

hist(India_GDP$Billion_in_USD)

Transformation=log(India_GDP$Billion_in_USD)
hist(Transformation)