library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(knitr)
In data pre-processing process, large number of data has been pre-processed. I have checked the number of missing values, null values, in appropriate values. The values has been converted to mean of the respective columns. As two data sets has to be merged, i have to change the column name in one dataset as merge function will only run if both the datasets have same column name. There are so many columns which does not fit the name for the dataset. So, I have changed those name also. I have created a variable which computes gdp in billion in INR as the dataset is only related to India. Gathered all the tidy columns which is not required into a single column under the variable name Type.
Two datasets has been downloaded from kaggle.com. The source for those datasets are: 1. https://www.kaggle.com/khalidative/crimeanalysis?select=crime_by_district_rt.csv(District_Crimes) 2. https://www.kaggle.com/siddheshmahajan/indias-gdp-statewise(State_GDP)
In District_Crimes dataset, it has 13 variables with 9841 observations. It is combined with gender, Year of crimes, types of crimes, the state and district in which the crime has committed. POA is the crimes against tribes and scheduled castes. PCR is the protection of civil rights which has been initiated for the enforcement of any disability. Other crimes will tell about the crimes abagint scheduled castes.
In State_GDP dataset, it gives the some information of India state GDPin 2020 for different states. It has two variables with trillion in INR and billion in USD.
District_Crimes=read.csv("/Users/aravindsanjeevi/Desktop/csv/crime_by_district.csv")
GDP=read.csv("/Users/aravindsanjeevi/Desktop/csv/INDIA_GDP.csv")
head(District_Crimes)
head(GDP)
The dataset is untidy. Because the different types of crimes has been placed in column rather in rows. So i have used gather from dplyr package to gather the column names from Murder to Hurt to join as a single column with different types of crimes. The column DISTRICT has to be removes as it is not useful for this process. I have to rename the column name to merge the two datasets.
District_Crimes=subset(District_Crimes,select=-c(DISTRICT))
Gathering_Crimes=gather(District_Crimes,Type,Crimes,Murder:Hurt)
GDP=GDP%>%
rename(STATE.UT=State_UT)
India_GDP=merge(GDP,Gathering_Crimes,by="STATE.UT")
head(India_GDP)
The dataset that we got from the merging has summarized. From the summary, we know the type of variables. The variable Type is character so it has to be factorised using factor function and it is ordered. The variable nominal gdp trillion and billion is converted into billion is USD and trillion in INR. These variables are numeric and it has to be factorised.
India_GDP=India_GDP%>%
rename(Trillion_in_INR=Nominal.GDP.Trillion.INR.)%>%
rename(Billion_in_USD=Nominal.GDP.Billion.USD.)
summary(India_GDP)
## STATE.UT Rank Trillion_in_INR Billion_in_USD
## Length:4949 Min. : 1.00 Min. : 0.176 Min. : 2.5
## Class :character 1st Qu.: 5.00 1st Qu.: 4.090 1st Qu.: 57.0
## Mode :character Median :11.00 Median : 9.620 Median :130.0
## Mean :11.95 Mean :11.184 Mean :155.9
## 3rd Qu.:17.00 3rd Qu.:17.910 3rd Qu.:250.0
## Max. :32.00 Max. :32.240 Max. :450.0
## Year Prevention.of.atrocities..POA..Act
## Min. :2013 Min. : 0.00
## 1st Qu.:2013 1st Qu.: 0.00
## Median :2013 Median : 2.00
## Mean :2013 Mean : 39.45
## 3rd Qu.:2013 3rd Qu.: 20.00
## Max. :2013 Max. :5584.00
## Protection.of.Civil.Rights..PCR..Act Other.Crimes.Against.SCs
## Min. : 0.0000 Min. : 0.00
## 1st Qu.: 0.0000 1st Qu.: 0.00
## Median : 0.0000 Median : 2.00
## Mean : 0.1669 Mean : 43.05
## 3rd Qu.: 0.0000 3rd Qu.: 24.00
## Max. :21.0000 Max. :5339.00
## Type Crimes
## Length:4949 Min. : 0.000
## Class :character 1st Qu.: 0.000
## Mode :character Median : 0.000
## Mean : 2.903
## 3rd Qu.: 1.000
## Max. :659.000
Type_factor=factor(India_GDP$Type,ordered = TRUE)
head(Type_factor)
## [1] Assault.on.women Arson Dacoity Dacoity
## [5] Arson Dacoity
## 7 Levels: Arson < Assault.on.women < Dacoity < ... < Robbery
Trillion_factor=factor(India_GDP$Trillion_in_INR,ordered=TRUE)
head(Trillion_factor)
## [1] 10.81 10.81 10.81 10.81 10.81 10.81
## 28 Levels: 0.176 < 0.215 < 0.222 < 0.231 < 0.234 < 0.318 < 0.33 < ... < 32.24
Billion_factor=factor(India_GDP$Billion_in_USD,ordered=TRUE)
head(Billion_factor)
## [1] 150 150 150 150 150 150
## 26 Levels: 2.5 < 3 < 3.1 < 3.2 < 3.3 < 4.5 < 4.6 < 5 < 6.5 < 11 < 26 < ... < 450
In this step, I have created a variable called Billion_in_INR. Because the data has all the information about only India. So I have computed from Billion_in_USD to Billion_in_INR with the dollar value of USD. Using cbind(), I have attached the variable to the previous datasets.
Billion_in_INR=(India_GDP$Billion_in_USD)*71
India_GDP=cbind(India_GDP,Billion_in_INR)
head(India_GDP)
I have scanned the total datasets, and found that most of the values are 0. I have decided to change the value of 0 to mean of that column. For that, I have changed the values to NA and then converted it to mean.
India_GDP[India_GDP==0]=NA
India_GDP=India_GDP%>%
mutate_all(~ifelse(is.na(.x),mean(.x,na.rm = TRUE),.x))
head(India_GDP)
With the scanning of numeric variables, there are only three numeric variables in the dataset that is Trillion_in_INR, Billion_in_USD and Billion_in_INR. Using the plot, there is no outliers between these columns.
plot(India_GDP$Trillion_in_INR,India_GDP$Billion_in_USD)
abline(lm(Billion_in_USD~Trillion_in_INR,India_GDP),col="blue")
The variable Billion_in_INR has been trasnformed using log function. From the histigram of the the Transformation variable, we can see that it is left skewed. From the comparison of both histograms, the transformation variable is left skewed with log() function.
hist(India_GDP$Billion_in_USD)
Transformation=log(India_GDP$Billion_in_USD)
hist(Transformation)