Introduction

India is a country where hundreds and thousands of crime occurs everyday. In India, it is highly debated that majority of the time the person accused for a crime is not convicted and is liberated due to some reason. This might be due to bribery, political influence, etc. The website https://www.kaggle.com/rajanand/crime-in-india(‘Kaggle’) provides us with a number of datasets for crime in India. My goal is to work on the cases of rape and to determine the percentage of cases convicted and the percentage of cases that are not convicted and let free because of political influence, bribery and other reasons. I have chosen the dataset “Victims_of_rape” and “Cases_under_crime_against_women” from the group of datasets. The dataset is available in .csv format. I have also used a dataset from https://data.world/bhavnachawla/crime-rate-against-children-india-2001-2012(‘data.world’) in which data for the crimes against children are available.

Importing packages

The following packages are imported into the r-studio which is helpful is cleaning the dataset or performing pre-processing so that the effective information can be extracted from the data.
readr → Reading the dataset
dplyr → Data manipulation tools
ggplot2 → Data visualization tools
deductive → Data correction and imputation
Validate → Conditions to test
forecast → Functions for Box Cox transformation
tidyverse → Special Data preprocessing package car → Package containing qqplot function

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(deductive)
library(validate)

## 
## Attaching package: 'validate'

## The following object is masked from 'package:ggplot2':
## 
##     expr

## The following object is masked from 'package:dplyr':
## 
##     expr

library(forecast)

## Warning: package 'forecast' was built under R version 4.0.3

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

library(tidyverse)

## -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --

## v tibble  3.0.3     v stringr 1.4.0
## v tidyr   1.1.1     v forcats 0.5.0
## v purrr   0.3.4

## -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
## x validate::expr() masks ggplot2::expr(), dplyr::expr()
## x dplyr::filter()  masks stats::filter()
## x dplyr::lag()     masks stats::lag()

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:purrr':
## 
##     some

## The following object is masked from 'package:dplyr':
## 
##     recode

Executive Summary

First the datasets are read into the r platform using functions from the readr package.
After the datasets are read the variables are explained and the structure of individual datasets are determined.
Since none of our datasets have factor variable, a factor variable is defined for categorizing the data. 4. Understanding the data involves proper data type conversions and renaming of column names for easy understanding.
Tidy and manipulate 1 involves the tiding of utidy dataset using gather and spread functions.
Tidy and manipulate 2 involves joining or merging datasets and using mutate function to define variables that are useful for inference.
Scan 1 of the data preprocessing involves the check for missing and special values. If there are any missing values appropriate missing values handlig techniques are applied. If the dataset has a consistent rule such a variable A is the sum of Variable B and C. Such rules can be defined using imput_lr() function. 8. Scan 2 of the data preprocessing involves the check for outliers. There are several techniques for handling the outliers and the most effective one is chosen to avoid them.
The variables in the dataset are not always normal. Thus, normality check is done on the data variable and if not normal transformations are applied on the variable. The transformed data variable will be normal and is better inferred.
Finally the resulting preprocessed data is inferred followed by conclusion.

Reading the dataset

The working directory is initially set using setwd() and dataset is read using the read_csv() of the readr package.

setwd("C:/Users/Lenovo/Documents/RMIT/Courses/Data Wrangling/Assignments/Assignment 2/Rape in India")
getwd()

## [1] "C:/Users/Lenovo/Documents/RMIT/Courses/Data Wrangling/Assignments/Assignment 2/Rape in India"

df1 <- read_csv("Victims_of_rape.csv")

## Parsed with column specification:
## cols(
##   Area_Name = col_character(),
##   Year = col_double(),
##   Subgroup = col_character(),
##   Rape_Cases_Reported = col_double(),
##   Victims_Above_50_Yrs = col_double(),
##   `Victims_Between_10-14_Yrs` = col_double(),
##   `Victims_Between_14-18_Yrs` = col_double(),
##   `Victims_Between_18-30_Yrs` = col_double(),
##   `Victims_Between_30-50_Yrs` = col_double(),
##   Victims_of_Rape_Total = col_double(),
##   Victims_Upto_10_Yrs = col_double()
## )

df2 <- read_csv("Cases_under_crime_against_women.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Area_Name = col_character(),
##   Group_Name = col_character(),
##   Sub_Group_Name = col_character()
## )

## See spec(...) for full column specifications.

## Warning: 1400 parsing failures.
##  row col   expected     actual                                  file
## 2451  -- 22 columns 23 columns 'Cases_under_crime_against_women.csv'
## 2452  -- 22 columns 23 columns 'Cases_under_crime_against_women.csv'
## 2453  -- 22 columns 23 columns 'Cases_under_crime_against_women.csv'
## 2454  -- 22 columns 23 columns 'Cases_under_crime_against_women.csv'
## 2455  -- 22 columns 23 columns 'Cases_under_crime_against_women.csv'
## .... ... .......... .......... .....................................
## See problems(...) for more details.

df3 <- read_csv("Crime head-wise persons arrested under crime against children during 2001-2012.csv")

## Parsed with column specification:
## cols(
##   `STATE/UT` = col_character(),
##   `CRIME HEAD` = col_character(),
##   `2001` = col_double(),
##   `2002` = col_double(),
##   `2003` = col_double(),
##   `2004` = col_double(),
##   `2005` = col_double(),
##   `2006` = col_double(),
##   `2007` = col_double(),
##   `2008` = col_double(),
##   `2009` = col_double(),
##   `2010` = col_double(),
##   `2011` = col_double(),
##   `2012` = col_double()
## )

Details of the variables

Details of the variables in dataset 1-
1. Area_Name → Name of the state/territory
2. Year → Year in which the crime took place
3. Subgroup → Victims of incest rape, victims of other rape and total rape
4. Rape_Cases_Reported → The total number of cases reported
5. Victims_Above_50_Yrs → Victims above the age of 50
6. Victims_Between_10-14_Yrs → Victims between the age 10-14
7. Victims_Between_14-18_Yrs → Victims between the age 14-18
8. Victims_Between_18-30_Yrs → Victims between the age 18-30
9. Victims_Between_30-50_Yrs → Victims between the age 30-50
10. Victims_of_Rape_Total → Overall Victims of rape
11. Victims_Upto_10_Yrs → Victims upto the age of 10

Details of the variables in dataset 2-
1. Area_Name → Name of the State/Territory
2. Year → Year in which the crime took place
3. Group_Name → Category of crime
4. Sub_Group_Name → Subcategory of crime
5. Cases_Acquitted_or_Discharged
6. Cases_charge_sheets_were_not_laid_but_Final_Report_submitted → Charge sheets were not filed but final reports were submitted
7. Cases_Chargesheeted → Cases filed
8. Cases_Compounded_or_Withdrawn → Cases withdrawn after chargesheeted
9. Cases_Convicted → Cases that are convicted after reporting
10.Cases_Declared_False_on_Account_of_Mistake_of_Fact_or_of_Law → Cases that were mistaken
11.Cases_Investigated_Chargesheets+FR_Submitted → Investigated cases
12.Cases_not_Investigated_or_in_which_investigation_was_refused → Not investigated and refused cases
13.Cases_Pending_Investigation_at_Year_End → Pending cases at year end
14.Cases_Pending_Investigation_from_previous_year → Pending cases from previous year
15.Cases_Pending_Trial_at_Year_End → Year end pending trial
16.Cases_Pending_Trial_from_the_previous_year → Pending trial from previous year
17.Cases_Reported → Reported cases
18.Cases_Sent_for_Trial → Trial cases
19.Cases_Trials_Completed → Trials completed
20.Cases_Withdrawn_by_the_Govt → Government withdrawn cases
21.Cases_withdrawn_by_the_Govt_during_investigation → Case withdrawn by the government during investigating of a particular case.
22.Total_Cases_for_Trial → Total trial cases

Details of the variables in dataset 3-
1. STATE/UT → Name of the State or Union territory
2. CRIME HEAD → Type of crime
3. 2001 → Year of crime
4. 2002 → Year of crime
5. 2003 → Year of crime
6. 2004 → Year of crime
7. 2005 → Year of crime
8. 2006 → Year of crime
9. 2007 → Year of crime
10.2008 → Year of crime
11.2009 → Year of crime
12.2010 → Year of crime
13.2011 → Year of crime
14.2012 → Year of crime

Structure of the datasets

The function str(x) is used to determine the structure of the data.

str(df1)

## tibble [1,050 x 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Area_Name                : chr [1:1050] "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" "Andaman & Nicobar Islands" ...
##  $ Year                     : num [1:1050] 2001 2001 2001 2002 2002 ...
##  $ Subgroup                 : chr [1:1050] "Total Rape Victims" "Victims of Incest Rape" "Victims of Other Rape" "Total Rape Victims" ...
##  $ Rape_Cases_Reported      : num [1:1050] 3 1 2 2 0 2 2 0 2 10 ...
##  $ Victims_Above_50_Yrs     : num [1:1050] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Victims_Between_10-14_Yrs: num [1:1050] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Victims_Between_14-18_Yrs: num [1:1050] 3 1 2 1 0 1 1 0 1 7 ...
##  $ Victims_Between_18-30_Yrs: num [1:1050] 0 0 0 1 0 1 0 0 0 1 ...
##  $ Victims_Between_30-50_Yrs: num [1:1050] 0 0 0 0 0 0 0 0 0 1 ...
##  $ Victims_of_Rape_Total    : num [1:1050] 3 1 2 2 0 2 2 0 2 10 ...
##  $ Victims_Upto_10_Yrs      : num [1:1050] 0 0 0 0 0 0 1 0 1 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Area_Name = col_character(),
##   ..   Year = col_double(),
##   ..   Subgroup = col_character(),
##   ..   Rape_Cases_Reported = col_double(),
##   ..   Victims_Above_50_Yrs = col_double(),
##   ..   `Victims_Between_10-14_Yrs` = col_double(),
##   ..   `Victims_Between_14-18_Yrs` = col_double(),
##   ..   `Victims_Between_18-30_Yrs` = col_double(),
##   ..   `Victims_Between_30-50_Yrs` = col_double(),
##   ..   Victims_of_Rape_Total = col_double(),
##   ..   Victims_Upto_10_Yrs = col_double()
##   .. )

str(df2)

## tibble [4,165 x 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Area_Name                                                   : chr [1:4165] "Andaman & Nicobar Islands" "Andhra Pradesh" "Arunachal Pradesh" "Assam" ...
##  $ Year                                                        : num [1:4165] 2001 2001 2001 2001 2001 ...
##  $ Group_Name                                                  : chr [1:4165] "Rape" "Rape" "Rape" "Rape" ...
##  $ Sub_Group_Name                                              : chr [1:4165] "01. Rape" "01. Rape" "01. Rape" "01. Rape" ...
##  $ Cases_Acquitted_or_Discharged                               : num [1:4165] 5 731 1 334 406 9 520 0 3 80 ...
##  $ Cases_charge_sheets_were_not_laid_but_Final_Report_submitted: num [1:4165] 2 22 2 95 141 0 15 0 0 65 ...
##  $ Cases_Chargesheeted                                         : num [1:4165] 3 769 25 495 685 16 976 8 1 361 ...
##  $ Cases_Compounded_or_Withdrawn                               : num [1:4165] 0 35 0 10 0 0 1 0 0 0 ...
##  $ Cases_Convicted                                             : num [1:4165] 0 197 2 101 155 2 158 1 0 29 ...
##  $ Cases_Declared_False_on_Account_of_Mistake_of_Fact_or_of_Law: num [1:4165] 0 74 0 45 105 2 11 0 0 1 ...
##  $ Cases_Investigated_Chargesheets+FR_Submitted                : num [1:4165] 5 791 27 590 826 16 991 8 1 426 ...
##  $ Cases_not_Investigated_or_in_which_investigation_was_refused: num [1:4165] 0 3 0 0 0 0 1 0 0 0 ...
##  $ Cases_Pending_Investigation_at_Year_End                     : num [1:4165] 1 393 18 1045 488 ...
##  $ Cases_Pending_Investigation_from_previous_year              : num [1:4165] 3 390 12 863 531 1 130 3 1 243 ...
##  $ Cases_Pending_Trial_at_Year_End                             : num [1:4165] 34 1974 282 1964 3185 ...
##  $ Cases_Pending_Trial_from_the_previous_year                  : num [1:4165] 36 2170 260 1914 3061 ...
##  $ Cases_Reported                                              : num [1:4165] 3 871 33 817 888 18 959 6 0 381 ...
##  $ Cases_Sent_for_Trial                                        : num [1:4165] 3 769 25 495 685 16 976 8 1 361 ...
##  $ Cases_Trials_Completed                                      : num [1:4165] 5 928 3 435 561 11 678 1 3 109 ...
##  $ Cases_Withdrawn_by_the_Govt                                 : num [1:4165] 0 2 0 0 0 0 0 0 0 0 ...
##  $ Cases_withdrawn_by_the_Govt_during_investigation            : num [1:4165] 0 0 0 0 0 0 0 0 0 0 ...
##  $ Total_Cases_for_Trial                                       : num [1:4165] 39 2937 285 2409 3746 ...
##  - attr(*, "problems")= tibble [1,400 x 5] (S3: tbl_df/tbl/data.frame)
##   ..$ row     : int [1:1400] 2451 2452 2453 2454 2455 2456 2457 2458 2459 2460 ...
##   ..$ col     : chr [1:1400] NA NA NA NA ...
##   ..$ expected: chr [1:1400] "22 columns" "22 columns" "22 columns" "22 columns" ...
##   ..$ actual  : chr [1:1400] "23 columns" "23 columns" "23 columns" "23 columns" ...
##   ..$ file    : chr [1:1400] "'Cases_under_crime_against_women.csv'" "'Cases_under_crime_against_women.csv'" "'Cases_under_crime_against_women.csv'" "'Cases_under_crime_against_women.csv'" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Area_Name = col_character(),
##   ..   Year = col_double(),
##   ..   Group_Name = col_character(),
##   ..   Sub_Group_Name = col_character(),
##   ..   Cases_Acquitted_or_Discharged = col_double(),
##   ..   Cases_charge_sheets_were_not_laid_but_Final_Report_submitted = col_double(),
##   ..   Cases_Chargesheeted = col_double(),
##   ..   Cases_Compounded_or_Withdrawn = col_double(),
##   ..   Cases_Convicted = col_double(),
##   ..   Cases_Declared_False_on_Account_of_Mistake_of_Fact_or_of_Law = col_double(),
##   ..   `Cases_Investigated_Chargesheets+FR_Submitted` = col_double(),
##   ..   Cases_not_Investigated_or_in_which_investigation_was_refused = col_double(),
##   ..   Cases_Pending_Investigation_at_Year_End = col_double(),
##   ..   Cases_Pending_Investigation_from_previous_year = col_double(),
##   ..   Cases_Pending_Trial_at_Year_End = col_double(),
##   ..   Cases_Pending_Trial_from_the_previous_year = col_double(),
##   ..   Cases_Reported = col_double(),
##   ..   Cases_Sent_for_Trial = col_double(),
##   ..   Cases_Trials_Completed = col_double(),
##   ..   Cases_Withdrawn_by_the_Govt = col_double(),
##   ..   Cases_withdrawn_by_the_Govt_during_investigation = col_double(),
##   ..   Total_Cases_for_Trial = col_double()
##   .. )

str(df3)

## tibble [494 x 14] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ STATE/UT  : chr [1:494] "ANDHRA PRADESH" "ARUNACHAL PRADESH" "ASSAM" "BIHAR" ...
##  $ CRIME HEAD: chr [1:494] "INFANTICIDE" "INFANTICIDE" "INFANTICIDE" "INFANTICIDE" ...
##  $ 2001      : num [1:494] 1 0 0 0 7 0 2 0 0 0 ...
##  $ 2002      : num [1:494] 1 0 5 0 29 0 0 3 0 0 ...
##  $ 2003      : num [1:494] 3 0 0 0 5 0 0 0 3 0 ...
##  $ 2004      : num [1:494] 0 0 0 0 12 0 1 1 0 0 ...
##  $ 2005      : num [1:494] 0 0 1 2 0 0 6 0 0 0 ...
##  $ 2006      : num [1:494] 0 0 0 0 15 1 0 0 0 0 ...
##  $ 2007      : num [1:494] 1 0 0 2 11 0 7 1 0 0 ...
##  $ 2008      : num [1:494] 0 0 0 2 6 0 0 5 0 0 ...
##  $ 2009      : num [1:494] 5 0 0 0 2 0 0 0 0 1 ...
##  $ 2010      : num [1:494] 6 0 0 2 0 1 0 0 0 0 ...
##  $ 2011      : num [1:494] 1 0 0 5 10 0 0 0 0 0 ...
##  $ 2012      : num [1:494] 6 0 0 2 5 0 0 0 0 0 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   `STATE/UT` = col_character(),
##   ..   `CRIME HEAD` = col_character(),
##   ..   `2001` = col_double(),
##   ..   `2002` = col_double(),
##   ..   `2003` = col_double(),
##   ..   `2004` = col_double(),
##   ..   `2005` = col_double(),
##   ..   `2006` = col_double(),
##   ..   `2007` = col_double(),
##   ..   `2008` = col_double(),
##   ..   `2009` = col_double(),
##   ..   `2010` = col_double(),
##   ..   `2011` = col_double(),
##   ..   `2012` = col_double()
##   .. )

The structure of both the datasets are determined and it is clear that the df1 (dataset 1) has character and numeric datatypes and the df2 (dataset 2) also has character and numeric datatypes. The year columns of dataset 3 has double as its datatype and needs to be converted into integer type. A variable with Factor datatype is defined for proper categorization of states with low, medium and high number of cases based on the Total cases.

Defining factor variable:

hist(df1$Rape_Cases_Reported)

df1$`Risk` <- ifelse(df1$Rape_Cases_Reported>10, ifelse(df1$Rape_Cases_Reported>100, "High", "Medium"), "Low")
str(df1$`Risk`)

##  chr [1:1050] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" ...

levels(df1$Risk)

## NULL

df1$Risk <- factor(df1$Risk, levels = c("Low", "Medium", "High"), labels = c("Low Risk", "Medium Risk", "High Risk"), ordered = TRUE)
str(df1$Risk)

##  Ord.factor w/ 3 levels "Low Risk"<"Medium Risk"<..: 1 1 1 1 1 1 1 1 1 1 ...

df1

## # A tibble: 1,050 x 12
##    Area_Name  Year Subgroup Rape_Cases_Repo~ Victims_Above_5~ `Victims_Betwee~
##    <chr>     <dbl> <chr>               <dbl>            <dbl>            <dbl>
##  1 Andaman ~  2001 Total R~                3                0                0
##  2 Andaman ~  2001 Victims~                1                0                0
##  3 Andaman ~  2001 Victims~                2                0                0
##  4 Andaman ~  2002 Total R~                2                0                0
##  5 Andaman ~  2002 Victims~                0                0                0
##  6 Andaman ~  2002 Victims~                2                0                0
##  7 Andaman ~  2003 Total R~                2                0                0
##  8 Andaman ~  2003 Victims~                0                0                0
##  9 Andaman ~  2003 Victims~                2                0                0
## 10 Andaman ~  2004 Total R~               10                0                1
## # ... with 1,040 more rows, and 6 more variables:
## #   `Victims_Between_14-18_Yrs` <dbl>, `Victims_Between_18-30_Yrs` <dbl>,
## #   `Victims_Between_30-50_Yrs` <dbl>, Victims_of_Rape_Total <dbl>,
## #   Victims_Upto_10_Yrs <dbl>, Risk <ord>

Preprocessing dataset

Both the datasets are untidy due to the following factors-
* Improper datatypes
* The dataset 3 is untidy and needs to get tidied
* Missing values
* Needs transformation

Thus, data preprocessing is essential in transforming this untidy data into valuable insights.

Understanding data

By looking at the structure of both the datasets it is seen that the datatypes of the cases or victims and year are in double where it should have been an integer. This is because the number or cases and year cannot take up decimal values and can have only integer data types. In dataset 3 the year variables have are in double where it should have been an integer. The following syntax is used for type conversion.

Syntax : as.integer(x)

df1$Year <- as.integer(df1$Year)
df1$Rape_Cases_Reported <- as.integer(df1$Rape_Cases_Reported)
df1$Victims_Above_50_Yrs <- as.integer(df1$Victims_Above_50_Yrs)
df1$`Victims_Between_10-14_Yrs` <- as.integer(df1$`Victims_Between_10-14_Yrs`)
df1$`Victims_Between_14-18_Yrs` <- as.integer(df1$`Victims_Between_14-18_Yrs`)
df1$`Victims_Between_18-30_Yrs` <- as.integer(df1$`Victims_Between_18-30_Yrs`)
df1$`Victims_Between_30-50_Yrs` <- as.integer(df1$`Victims_Between_30-50_Yrs`)
df1$Victims_of_Rape_Total <- as.integer(df1$Victims_of_Rape_Total)
df1$Victims_Upto_10_Yrs <- as.integer(df1$Victims_Upto_10_Yrs)

df2$Year <- as.integer(df2$Year)
df2$Cases_Acquitted_or_Discharged <- as.integer(df2$Cases_Acquitted_or_Discharged)
df2$Cases_charge_sheets_were_not_laid_but_Final_Report_submitted <- as.integer(df2$Cases_charge_sheets_were_not_laid_but_Final_Report_submitted)
df2$Cases_Chargesheeted <- as.integer(df2$Cases_Chargesheeted)
df2$Cases_Compounded_or_Withdrawn <- as.integer(df2$Cases_Compounded_or_Withdrawn)
df2$Cases_Convicted <- as.integer(df2$Cases_Convicted)
df2$Cases_Declared_False_on_Account_of_Mistake_of_Fact_or_of_Law <- as.integer(df2$Cases_Declared_False_on_Account_of_Mistake_of_Fact_or_of_Law)
df2$`Cases_Investigated_Chargesheets+FR_Submitted` <- as.integer(df2$`Cases_Investigated_Chargesheets+FR_Submitted`)
df2$Cases_not_Investigated_or_in_which_investigation_was_refused <- as.integer(df2$Cases_not_Investigated_or_in_which_investigation_was_refused)
df2$Cases_Pending_Investigation_at_Year_End <- as.integer(df2$Cases_Pending_Investigation_at_Year_End)
df2$Cases_Pending_Investigation_from_previous_year <- as.integer(df2$Cases_Pending_Investigation_from_previous_year)
df2$Cases_Pending_Trial_at_Year_End <- as.integer(df2$Cases_Pending_Trial_at_Year_End)
df2$Cases_Pending_Trial_from_the_previous_year <- as.integer(df2$Cases_Pending_Trial_from_the_previous_year)
df2$Cases_Reported <- as.integer(df2$Cases_Reported)
df2$Cases_Sent_for_Trial <- as.integer(df2$Cases_Sent_for_Trial)
df2$Cases_Trials_Completed <- as.integer(df2$Cases_Trials_Completed)
df2$Cases_Withdrawn_by_the_Govt <- as.integer(df2$Cases_Withdrawn_by_the_Govt)
df2$Cases_withdrawn_by_the_Govt_during_investigation <- as.integer(df2$Cases_withdrawn_by_the_Govt_during_investigation)
df2$Total_Cases_for_Trial <- as.integer(df2$Total_Cases_for_Trial)

df3$`2001` <- as.integer(df3$`2001`)
df3$`2002` <- as.integer(df3$`2002`)
df3$`2003` <- as.integer(df3$`2003`)
df3$`2004` <- as.integer(df3$`2004`)
df3$`2005` <- as.integer(df3$`2005`)
df3$`2006` <- as.integer(df3$`2006`)
df3$`2007` <- as.integer(df3$`2007`)
df3$`2008` <- as.integer(df3$`2008`)
df3$`2009` <- as.integer(df3$`2009`)
df3$`2010` <- as.integer(df3$`2010`)
df3$`2011` <- as.integer(df3$`2011`)
df3$`2012` <- as.integer(df3$`2012`)

Tidy and manipulate 1

According to the tidy data principle:
1. Each variable must have its own column.
2. Each observation must have its own row.
3. Each value must have its own cell.

In dataset 3 individual years are given as the variables and this violates principle 2 of the tidy data principle. Individual years are observations and not variables. Thus, tidyr functions are used to tidy data.

My focus is totally towards the total victims of rape and not victims of incest rape and other. So, I am going to remove the rows which has subgroup of “Victims of Incest Rape” and “Victims of Other Rape”. The index of the subgroup “Victims of Incest Rape” and “Victims of Other Rape” is determined using which() function. Then the dataset df1 is subsetted eliminating the rows containing “Victims of Incest Rape” and “Victims of Other Rape” data.

r1 <- which(df1$Subgroup=="Victims of Incest Rape")
r2 <- which(df1$Subgroup=="Victims of Other Rape")
df1 <- df1[-c(r1,r2),]

The column names of df1 are not pleasing when viewed. So by renaming the columns, a person who is observing the dataset will have a good understanding about the variables.

names(df1)[names(df1) == "Area_Name" ] <- "Area"
names(df1)[names(df1) == "Rape_Cases_Reported" ] <- "Total Reported Cases"
names(df1)[names(df1) == "Victims_Above_50_Yrs" ] <- ">50yrs"
names(df1)[names(df1) == "Victims_Between_10-14_Yrs" ] <- "(10-14)yrs"
names(df1)[names(df1) == "Victims_Between_14-18_Yrs" ] <- "(14-18)yrs"
names(df1)[names(df1) == "Victims_Between_18-30_Yrs" ] <- "(18-30)yrs"
names(df1)[names(df1) == "Victims_Between_30-50_Yrs" ] <- "(30-50)yrs"
names(df1)[names(df1) == "Victims_Upto_10_Yrs" ] <- "<10yrs"

The following variables are selected in the df1 dataset- Area
Year
<10yrs
(10-14)yrs
(14-18)yrs
(18-30)yrs
(30-50)yrs
>50yrs
Total Reported Cases

df1 <- df1%>%
  select("Area", "Year", "<10yrs", "(10-14)yrs", "(14-18)yrs", "(18-30)yrs", "(30-50)yrs", ">50yrs", "Total Reported Cases", "Risk")

The column names of the df2 dataset are renamed for better understanding of the data. Only the following variables are selected from df2-
Area Year Cases Convicted Fraud

names(df2)[names(df2) == "Area_Name"] <- "Area"
names(df2)[names(df2) == "Cases_Convicted"] <- "Cases Convicted"

The dataset 3 is untidy and needs to be tidied using the gather and spread function. Our focus is only on the rape cases, so only RAPE OF CHILDREN variable is selected along with STATE/UT and Year.

df3 <- df3 %>%
  gather(`2001`, `2002`,`2003`, `2004`, `2005`, `2006`, `2007`, `2008`, `2009`, `2010`, `2011`, `2012`, key = "Year", value = "Cases")
df3 <- df3%>%
  spread(key = "CRIME HEAD", value = Cases)
df3 <- df3%>%
  select(`STATE/UT`, Year, `RAPE OF CHILDREN`)

Renaming dataset 3

The name of the areas in dataset 3 is different from the area names in dataset 1 and 2. Thus, to obtain a common variable the area names in dataset 3 is changed.

names(df3)[names(df3) == "STATE/UT"] <- "Area"
names(df3)[names(df3) == "RAPE OF CHILDREN"] <- "Rape of Children"
df3$Area[df3$Area=="A & N ISLANDS"] <- "Andaman & Nicobar Islands" 
df3$Area[df3$Area=="ANDHRA PRADESH"] <- "Andhra Pradesh"
df3$Area[df3$Area=="ARUNACHAL PRADESH"] <- "Arunachal Pradesh"
df3$Area[df3$Area=="ASSAM"] <- "Assam"
df3$Area[df3$Area=="BIHAR"] <- "Bihar"
df3$Area[df3$Area=="CHANDIGARH"] <- "Chandigarh" 
df3$Area[df3$Area=="CHHATTISGARH"] <- "Chhattisgarh"
df3$Area[df3$Area=="D & N HAVELI"] <- "Dadra & Nagar Haveli"
df3$Area[df3$Area=="DAMAN & DIU"] <- "Daman & Diu"
df3$Area[df3$Area=="DELHI"] <- "Delhi"
df3$Area[df3$Area=="GOA"] <- "Goa"
df3$Area[df3$Area=="GUJARAT"] <- "Gujarat" 
df3$Area[df3$Area=="HARYANA"] <- "Haryana"
df3$Area[df3$Area=="HIMACHAL PRADESH"] <- "Himachal Pradesh"
df3$Area[df3$Area=="JAMMU & KASHMIR"] <- "Jammu & Kashmir"
df3$Area[df3$Area=="JHARKHAND"] <- "Jharkhand" 
df3$Area[df3$Area=="KARNATAKA"] <- "Karnataka"
df3$Area[df3$Area=="KERALA"] <- "Kerala"
df3$Area[df3$Area=="LAKSHADWEEP"] <- "Lakshadweep"
df3$Area[df3$Area=="MADHYA PRADESH"] <- "Madhya Pradesh"
df3$Area[df3$Area=="MAHARASHTRA"] <- "Maharashtra"
df3$Area[df3$Area=="MANIPUR"] <- "Manipur"
df3$Area[df3$Area=="MEGHALAYA"] <- "Meghalaya"
df3$Area[df3$Area=="NAGALAND"] <- "Nagaland"
df3$Area[df3$Area=="MIZORAM"] <- "Mizoram"
df3$Area[df3$Area=="ODISHA"] <- "Odisha"
df3$Area[df3$Area=="PUDUCHERRY"] <- "Puducherry"
df3$Area[df3$Area=="PUNJAB"] <- "Punjab"
df3$Area[df3$Area=="RAJASTHAN"] <- "Rajasthan"
df3$Area[df3$Area=="SIKKIM"] <- "Sikkim"
df3$Area[df3$Area=="TAMIL NADU"] <- "Tamil Nadu"
df3$Area[df3$Area=="TRIPURA"] <- "Tripura"
df3$Area[df3$Area=="UTTAR PRADESH"] <- "Uttar Pradesh" 
df3$Area[df3$Area=="UTTARAKHAND"] <- "UTTARAKHAND"
df3$Area[df3$Area=="WEST BENGAL"] <- "WEST BENGAL"

df3$Year <- as.integer(df3$Year)

Tidy and manipulate 2

In the df2 dataset only the cases of Rape are filtered out from the Group_Name variable and new variable “Fraud” is created by using the mutate function and adding Cases_Acquitted_or_Discharged, Cases_charge_sheets_were_not_laid_but_Final_Report_submitted, Cases_Compounded_or_Withdrawn)

filter() function is used to filter out the rape data and mutate() function is used to define variable fraud which is used to determine the number of cases that are were not convicted due to some reason.

df2 <- df2%>%
  filter(`Group_Name`=="Rape")%>%
  mutate(Fraud = Cases_Acquitted_or_Discharged + Cases_charge_sheets_were_not_laid_but_Final_Report_submitted + Cases_Compounded_or_Withdrawn)

All the three datasets df1, df2 and df3 are now ready for joining. A left_join() is used for joining df1, df2 and df3 and they are joined by “Area” and “Year”. The resulting dataset is stored in the form of df.

df <- left_join(left_join(df1, df2, by=c("Area", "Year")), df3, by=c("Area", "Year"))

Another factor variable Fraudulence is introduced for categorizing the fraud data.
Fraudulence → The level of Fraudulence due to the number of fraud cases.

hist(df$Fraud)

df$Fraudulence <- ifelse(df$Fraud>1000, ifelse(df$Fraud>8000, "High", "Medium"), "Low")
str(df$Fraudulence)

##  chr [1:350] "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" "Low" ...

levels(df$Fraudulence)

## NULL

df$Fraudulence <- factor(df$Fraudulence, levels = c("Low", "Medium", "High"), labels = c("Low Fraud", "Medium Fraud", "High Fraud"), ordered = TRUE)
str(df$Fraudulence)

##  Ord.factor w/ 3 levels "Low Fraud"<"Medium Fraud"<..: 1 1 1 1 1 1 1 1 1 1 ...

table(df$Risk, df$Fraudulence)

##              
##               Low Fraud Medium Fraud High Fraud
##   Low Risk           51            0          0
##   Medium Risk        80            0          0
##   High Risk         204           15          0

By cross tabulating Risk and Fraudulence it is observed that the areas that are at high risk are not always fraudulent.

Scan 1

The joined dataset df is checked for missing values. The syntax is.na(df) can be used to determine the observations with missing values. Note that zero is not equal to NA. The number of cases can definetely be zero and there will be a very big impact on the analysis if we remove these zeros or replace them by some other value. Thus, the zeros in the dataset should not be removed or changed.

Number of Missing Values: sum(is.na(x)) gives the total number of missing values in the dataframe x.

sum(is.na(df))

## [1] 20

The Special cases such as Inf, -Inf, nan etc are identified using is.infinite(), is.finite() and is.nan() functions. Note that the dataset cannot be entered as an argument in these functions and has to be applied through Apply function. Here Sapply is used to apply special functions to the df dataset.

Number of Special Values: is.infinite() and is.nan() are used to determine the number of special values.

sum(sapply(df, is.infinite))

## [1] 0

sum(sapply(df, is.nan))

## [1] 0

It is seen that there are no missing and special values in our dataset. However we can define a rule that has to consistent throughout the dataset. This is defined using validator() function of the validate package and impute_lr() function of the deductive package.
We know that the sum of the cases of <10yrs, (10-14)yrs, (14-18)yrs, (18-30)yrs, (30-50)yrs and >50yrs is equal to Total Reported Cases. Thus, by using the validator() function from the validate package and impute_lr() function from the deductive package we can define a rule that has to be consistent within the dataset. After imputing the data types will becomes double and are no longer integers.

Rules <- validator( `<10yrs` + `(10-14)yrs` + `(14-18)yrs` + `(18-30)yrs` + `(30-50)yrs` + `>50yrs`  == `Total Reported Cases`,
                    `<10yrs` >=0,
                    `(10-14)yrs` >= 0,
                    `(14-18)yrs` >= 0,
                    `(18-30)yrs` >=0,
                    `(30-50)yrs` >=0,
                    `>50yrs` >=0
)

df <- impute_lr(df,Rules)

The cases are grouped by Area and summarised with each variable taken as the sum of its observations from 2001 to 2010. After grouping and summarising the dataset new variables % of convicted cases, % of fraud and % of children are defined using the mutate function. The data is then arranged in decending order of the % of convicted cases.
group_by() → used to group the data by Area
summarise() → used to summarise the data
mutate() → used to define variable
arrange() → used to arrange variable in ascending order
arrange(desc()) → used to arrange variable in descending order

df <- df%>%
  group_by(Area)%>%
  summarise(`<10yrs` = sum(`<10yrs`, na.rm = TRUE),
            `(10-14)yrs` = sum(`(10-14)yrs`, na.rm = TRUE),
            `(14-18)yrs` = sum(`(14-18)yrs`, na.rm = TRUE),
            `(18-30)yrs` = sum(`(18-30)yrs`, na.rm = TRUE),
            `(30-50)yrs` = sum(`(30-50)yrs`, na.rm = TRUE),
            `>50yrs`= sum(`>50yrs`, na.rm=TRUE),
            `Total Reported Cases` = sum(`Total Reported Cases`, na.rm = TRUE),
            `Cases Convicted` = sum(`Cases Convicted`, na.rm = TRUE),
            `Fraud` = sum(`Fraud`,na.rm = TRUE),
            `Rape of Children` = sum(`Rape of Children`, na.rm = TRUE))

## `summarise()` ungrouping output (override with `.groups` argument)

df <- df%>%
  mutate(`% of convicted cases` = (`Cases Convicted`/`Total Reported Cases`)*100,
         `% of fraud` = (`Fraud`/`Total Reported Cases`)*100,
         `% of children` = (`Rape of Children`/`Total Reported Cases`)*100)%>%
  arrange(desc(`% of convicted cases`))

Scan 2

The data is then scanned for outliers. There are several ways of handling outliers such as Excluding or deleting outliers, imputing, capping and transforming. Here I am using caping for handling outliers. In capping if there are outliers in a given vector they are replaced by the 1st and the 4th quantile value of the data in the vector. A function cap(x) that accepts a vector x is defined and it determines the 0.05, 0.25, 0.75 and 0.95 quantiles of the vector and then replaces the outliers with first and fourth quantiles of the data in the vector. The resulting variable is of double data type.

cap <- function(x){
  quantiles <- quantile( x, c(.05, 0.25, 0.75, .95 ) )
  x[ x < quantiles[2] - 1.5*IQR(x) ] <- quantiles[1]
  x[ x > quantiles[3] + 1.5*IQR(x) ] <- quantiles[4]
  x
}

df$`(10-14)yrs` <- df$`(10-14)yrs`%>% cap()
df$`(14-18)yrs` <- df$`(10-14)yrs`%>% cap()
df$`(18-30)yrs` <- df$`(18-30)yrs`%>% cap()
df$`(30-50)yrs` <- df$`(30-50)yrs`%>% cap()
df$`<10yrs` <- df$`<10yrs`%>% cap()
df$`>50yrs` <- df$`>50yrs`%>% cap()
df$`Cases Convicted` <- df$`Cases Convicted`%>% cap()

It is observed that by joining the two dataset the sum of the number of fraud and convicted cases may exceed the total cases reported. Thus, another rule Fraud = Total Reported Cases - Cases Convicted is defined for consistent data.

a <- which((df$Fraud+df$`Cases Convicted`)>=df$`Total Reported Cases`)
df$Fraud[a] <- df$`Total Reported Cases`[a] - df$`Cases Convicted`[a]

df

## # A tibble: 35 x 14
##    Area  `<10yrs` `(10-14)yrs` `(14-18)yrs` `(18-30)yrs` `(30-50)yrs` `>50yrs`
##    <chr>    <dbl>        <dbl>        <dbl>        <dbl>        <dbl>    <dbl>
##  1 Naga~      13            3            3           119           21       0 
##  2 Mizo~      55           70           70           269          120      15 
##  3 Utta~     363         1341.        1341.         8076         2086      12 
##  4 Laks~       0            0            0             5            2       0 
##  5 Utta~      19           50           50           779          173       0 
##  6 Pudu~       9            4            4            17            2       0 
##  7 Chan~      31           46           46            75           24       0 
##  8 Chha~     295         1341.        1341.         4339         1804     107.
##  9 Punj~     162          267          267          2658          590      13 
## 10 Delhi     593.        1341.        1341.         2059          267      13 
## # ... with 25 more rows, and 7 more variables: `Total Reported Cases` <dbl>,
## #   `Cases Convicted` <dbl>, Fraud <dbl>, `Rape of Children` <int>, `% of
## #   convicted cases` <dbl>, `% of fraud` <dbl>, `% of children` <dbl>

Transformation

The normality of a variable can be tested using qqpot and shapiro test. The histogram of the variable Total Reported Cases is plotted and is checked for normality using qqplot and shapiro test.

STEPS:

hist(x) is used to determine the histogram of vector x.
qqplot(x) is a function in the car package that is used to determine the qqplot of vector x and if all the values in the qqplot are in the range then the variable is said to be normally distributed.
shapiro.test(x) is used to determine the p-value of the variable x. If the p-value is less than the alpha value(0.05) then the variable is not normally distributed.

hist(df$`Total Reported Cases`)

qqPlot(df$`Total Reported Cases`,main = "Q-Q Plot for Normality"
       ,xlab = "Normal Quantiles"
       ,ylab = "% of Fraud")

## [1] 11 32

p <- shapiro.test(df$`Total Reported Cases`)$p.value
normality_test <- p < 0.05
normality_test

## [1] TRUE

According to the normality tests, there are values outside the range in the qqplot and also the p-value of the variable is less than 0.05. Therefore, the variable Total Reported Cases is not normally distributed. Transformations are applied on this variable to make it normally distributed. There are several different types of transformations and the most suitable one is chosen.
After applying several transformations on the variable, the cuberoot transformation has the most significant affect on the variable making it normal.

Syntax: transform_x <- (x)^(1/3)

The qqplot is taken for the transformed vector and all the values are withing the range and also the p-value of the transformed vector is greater than alpha (0.05) making the variable normally distributed.

transform_df <- (df$`Total Reported Cases`)^(1/3)
hist(transform_df)

qqPlot(transform_df,main = "Q-Q Plot for Normality"
       ,xlab = "Normal Quantiles"
       ,ylab = "% of Fraud")

## [1] 11  4

p <- shapiro.test(transform_df)$p.value
normality_test <- p < 0.05

if (normality_test== FALSE) {
print('Normally Distributed')
} else {
print('Not normally Distributed')
}

## [1] "Normally Distributed"

df$`Total Reported Cases` <- transform_df

Inference

ggplot2 package provides us with functions for column charts which can be used for easy inference.

head(df)

## # A tibble: 6 x 14
##   Area  `<10yrs` `(10-14)yrs` `(14-18)yrs` `(18-30)yrs` `(30-50)yrs` `>50yrs`
##   <chr>    <dbl>        <dbl>        <dbl>        <dbl>        <dbl>    <dbl>
## 1 Naga~       13           3            3           119           21        0
## 2 Mizo~       55          70           70           269          120       15
## 3 Utta~      363        1341.        1341.         8076         2086       12
## 4 Laks~        0           0            0             5            2        0
## 5 Utta~       19          50           50           779          173        0
## 6 Pudu~        9           4            4            17            2        0
## # ... with 7 more variables: `Total Reported Cases` <dbl>, `Cases
## #   Convicted` <dbl>, Fraud <dbl>, `Rape of Children` <int>, `% of convicted
## #   cases` <dbl>, `% of fraud` <dbl>, `% of children` <dbl>

ggplot(df, aes(x=Area, y=`% of convicted cases`))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title='Area vs % of convicted cases', x='Area', y='% of convicted cases')

ggplot(df, aes(x=Area, y=`% of fraud`))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title='Area vs % of convicted cases', x='Area', y='% of fraud')

ggplot(df, aes(x=Area, y=`% of children`))+
  geom_col()+
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1))+
  labs(title='Area vs % of convicted cases', x='Area', y='% of children')

From the graphs the following inferrence are made-
1. The % of convicted cases is maximum in Nagaland and minimum in Manipur.
2. The % of fraud cases is maximum in Daman & Diu and minimum in Mizoram.
3. The % of children getting raped is maximum in Andaman and Nicobar Islands and minimum in Assam.

References

[1] https://www.kaggle.com/rajanand/crime-in-india (Kaggle)
[2] https://data.world/bhavnachawla/crime-rate-against-children-india-2001-2012 (Data.World)
[3] https://en.wikipedia.org/wiki/Normality_test (Test for Normality)

MATH2349 Data Wrangling

ASSIGNMENT 2

Mohammed Abraar Asif (S3846691)