MATH2349 Semester 1, 2020

Required packages

Provide the packages required to reproduce the report. Make sure you fulfilled the minimum requirement #10.

# This is the R chunk for the required packages
#install.packages("tidyr")
#install.packages("dplyr")
#install.packages("editrules")
#install.packages("outliers")
#install.packages("forecast")

library(tidyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(editrules)

## Loading required package: igraph

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:dplyr':
## 
##     as_data_frame, groups, union

## The following object is masked from 'package:tidyr':
## 
##     crossing

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

## 
## Attaching package: 'editrules'

## The following objects are masked from 'package:igraph':
## 
##     blocks, normalize

## The following object is masked from 'package:dplyr':
## 
##     contains

## The following objects are masked from 'package:tidyr':
## 
##     contains, separate

library(outliers)
library(forecast)

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

Executive Summary

Data Preprocessing is one of the most important parts of any Data Science or Analytics process. In order to get a clear understanding of the dataset, it has to be clean and tidy first. As the old adage goes, “garbage in garbage out”.
First of all, I had to import both the datasets, Crimes.csv and Suicide_rate.csv into R Studio. Both the datasets contained information about the same States and hence could be merged very easily, without the loss of essential data.
The datasets were merged using left_join function, since my main focus dataset was the Crimes.csv, abd hence even if the data had dissimilar keys, it would have meant that the crimes dataset would not lose any of it’s essential data.
Next thing was to make sure that the data was imported in it’s right format and if not to make sure that I change it.
Once I had taken care of the data types of the imported variables, it was time to tidy the data set according to the TIDY principles. I saw that it was untidy and had to be made tidy by using the tidyR and dplyR libraries.
Next it was the time to check for any missing data or data discrepancies using sanity checks. Sanity checks essentially meant that we had to check for any impossible values in the datasets containing numerical or integer values.
After the last step, I had to check for outliers, if any in the columns by looking at their distributions.
If there were any outliers that I saw they had to be rectified using suitable transformations or normalizations.

Data

The data sets Crimes.csv and Suicide_rate.csv are both data sets about the statistics per State in India. These data sets can be found from Kaggle and are incredibly suited for data exploratory process. https://www.kaggle.com/beradabhishek/india-statewise-information?select=Crime.csv https://www.kaggle.com/beradabhishek/india-statewise-information?select=Sucide_Rate.csv

The Crimes dataset contains information and statistics about the crime rates, number of crimes per year for the years (2014, 2015, 2016) and the rankings based on these stats for each state in India. The dataset also contained figures about Cognizable crime which are set of serious crimes like terrorism, murder, extortion etc. All of this data is related to the 36 Indian States/Union Territories.

The Suicide dataset contains information about the Suiceide Rate per 100000 people and the rankings of each states/UT according to that.
The focus of this task was mainly on the crimes data and to see if Suicide rates get influenced in any particular State, due to Crimes. So once I imported both the datasets into R studio, I chose to merge the datasets using the left_join method, so that the Crimes dataset does not lose any of it’s essential data. We got the reslutant Dataframe main_df.
Although, both of the data sets had similar values for States/UT column which made it pretty straightforward to merge them. There was no data loss whatsoever.

# This is the R chunk for the Data Section
crimes <- read.csv("Crime.csv", stringsAsFactors = FALSE)
suicide_rate <- read.csv("Suicide_Rate.csv", stringsAsFactors = FALSE)

#Dropping the Sno. columns from both dataframes
crimes$S.No <- NULL
suicide_rate$S.No <- NULL

#Changing the column names
colnames(crimes) <- c("States/UT","2014","2015","2016","Percentage_Share_of_State",
                      "Rank_based_on_incidence","Rate_of_cognizable_crime",
                      "Rank_based_on_crime_rate")
colnames(suicide_rate) <- c("States/UT","Suicide_rate_per_100000","Rank_based_on_suicide_rate")


#Checking the structures of both the frames
str(crimes)

## 'data.frame':    36 obs. of  8 variables:
##  $ States/UT                : chr  "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ 2014                     : int  114604 2843 94337 177595 58200 4466 131385 79947 14160 23848 ...
##  $ 2015                     : int  110693 2968 103616 176973 56692 3074 126935 84466 14007 23583 ...
##  $ 2016                     : int  106774 2534 102250 164163 55029 2692 147122 88527 13386 24501 ...
##  $ Percentage_Share_of_State: num  3.6 0.1 3.4 5.5 1.8 0.1 4.9 3 0.4 0.8 ...
##  $ Rank_based_on_incidence  : int  13 29 14 9 17 28 11 15 21 20 ...
##  $ Rate_of_cognizable_crime : num  206 192 314 157 212 ...
##  $ Rank_based_on_crime_rate : int  15 17 5 22 14 25 11 4 20 16 ...

str(suicide_rate)

## 'data.frame':    36 obs. of  3 variables:
##  $ States/UT                 : chr  "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ Suicide_rate_per_100000   : num  12.1 10.4 10 0.5 27.7 15.4 11.6 13 7.7 3 ...
##  $ Rank_based_on_suicide_rate: int  16 20 21 36 4 12 19 15 24 31 ...

#Merging the dataframes using the left_join funciton, 
#as it will help us keep all the cases for the crimes dataframe. Although both have identical keys
main_df <- crimes%>%left_join(suicide_rate, by = "States/UT")
head(main_df)

Understand

Once we have merged the Data sets together, it was essential to see what kind of data was imported as what data type. This step is critical to check the data consistency in any data set.

Here I used the str() method to check the data tyoes of the merged dataframe (main_df). Here I saw that there were 3 columns which containe unique values of rankings, which should be factor variables but were instead imported as integers.
I could not directly convert them into factors as that would have hindered the proper ordering of the ranks. I had to instead first convert them into characters and then subsequently into factors and ordering them using the levels and ordered = TRUE method.
I also needed to convert the States/UT columns into factor variables as well, as they were uique values.

# This is the R chunk for the Understand Section
# Converting State columns in both dataframes into factors.
# We should also make the 3 Rank columns into factors, since they are unique values.

str(main_df)

## 'data.frame':    36 obs. of  10 variables:
##  $ States/UT                 : chr  "Andhra Pradesh" "Arunachal Pradesh" "Assam" "Bihar" ...
##  $ 2014                      : int  114604 2843 94337 177595 58200 4466 131385 79947 14160 23848 ...
##  $ 2015                      : int  110693 2968 103616 176973 56692 3074 126935 84466 14007 23583 ...
##  $ 2016                      : int  106774 2534 102250 164163 55029 2692 147122 88527 13386 24501 ...
##  $ Percentage_Share_of_State : num  3.6 0.1 3.4 5.5 1.8 0.1 4.9 3 0.4 0.8 ...
##  $ Rank_based_on_incidence   : int  13 29 14 9 17 28 11 15 21 20 ...
##  $ Rate_of_cognizable_crime  : num  206 192 314 157 212 ...
##  $ Rank_based_on_crime_rate  : int  15 17 5 22 14 25 11 4 20 16 ...
##  $ Suicide_rate_per_100000   : num  12.1 10.4 10 0.5 27.7 15.4 11.6 13 7.7 3 ...
##  $ Rank_based_on_suicide_rate: int  16 20 21 36 4 12 19 15 24 31 ...

main_df$`States/UT` <- as.factor(main_df$`States/UT`)
levels(main_df$`States/UT`)

##  [1] "Andaman and Nicobar Islands" "Andhra Pradesh"             
##  [3] "Arunachal Pradesh"           "Assam"                      
##  [5] "Bihar"                       "Chandigarh"                 
##  [7] "Chhattisgarh"                "Dadra and Nagar Haveli"     
##  [9] "Daman and Diu"               "Delhi"                      
## [11] "Goa"                         "Gujarat"                    
## [13] "Haryana"                     "Himachal Pradesh"           
## [15] "Jammu and Kashmir"           "Jharkhand"                  
## [17] "Karnataka"                   "Kerala"                     
## [19] "Lakshwadeep"                 "Madhya Pradesh"             
## [21] "Maharashtra"                 "Manipur"                    
## [23] "Meghalaya"                   "Mizoram"                    
## [25] "Nagaland"                    "Odisha"                     
## [27] "Puducherry"                  "Punjab"                     
## [29] "Rajasthan"                   "Sikkim"                     
## [31] "Tamil Nadu"                  "Telangana"                  
## [33] "Tripura"                     "Uttar Pradesh"              
## [35] "Uttarakhand"                 "West Bengal"

main_df$Rank_based_on_incidence <- as.character(main_df$Rank_based_on_incidence)
main_df$Rank_based_on_incidence <- factor(main_df$Rank_based_on_incidence,levels = c ("1" , "2", "3"  ,"4"  ,"5"  ,"6"  ,"7"  ,"8"  ,"9"  ,"10" ,"11" ,"12", "13" ,"14" ,"15" ,"16" ,"17" ,"18", "19", "20", "21", "22", "23" ,"24" ,"25" ,"26" ,"27" ,"28" ,"29" ,"30" ,"31" ,"32" ,"33" ,"34", "35" ,"36"), ordered = TRUE)
levels(main_df$Rank_based_on_incidence)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36"

main_df$Rank_based_on_crime_rate <- as.character(main_df$Rank_based_on_crime_rate)
main_df$Rank_based_on_crime_rate <- factor(main_df$Rank_based_on_crime_rate,levels = c ("1" , "2", "3"  ,"4"  ,"5"  ,"6"  ,"7"  ,"8"  ,"9"  ,"10" ,"11" ,"12", "13" ,"14" ,"15" ,"16" ,"17" ,"18", "19", "20", "21", "22", "23" ,"24" ,"25" ,"26" ,"27" ,"28" ,"29" ,"30" ,"31" ,"32" ,"33" ,"34", "35" ,"36"), ordered = TRUE)
levels(main_df$Rank_based_on_crime_rate)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36"

main_df$Rank_based_on_suicide_rate <- as.character(main_df$Rank_based_on_suicide_rate)
main_df$Rank_based_on_suicide_rate <- factor(main_df$Rank_based_on_suicide_rate,levels = c ("1" , "2", "3"  ,"4"  ,"5"  ,"6"  ,"7"  ,"8"  ,"9"  ,"10" ,"11" ,"12", "13" ,"14" ,"15" ,"16" ,"17" ,"18", "19", "20", "21", "22", "23" ,"24" ,"25" ,"26" ,"27" ,"28" ,"29" ,"30" ,"31" ,"32" ,"33" ,"34", "35" ,"36"), ordered = TRUE)
levels(main_df$Rank_based_on_suicide_rate)

##  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14" "15"
## [16] "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28" "29" "30"
## [31] "31" "32" "33" "34" "35" "36"

Tidy & Manipulate Data I

Explain why your data (or one of the data sets) doesn’t conform the tidy data principles (minimum requirement #5). Apply the required steps to reshape the data into a tidy format. In addition to the R codes and outputs, explain everything that you do in this step.

One of the data sets namely the Crimes.csv data set was untidy and did not follow the tidy principles. Tidy data principles state that, each variable must have its own column. Each observation must have its own row. Each value must have its own cell.
In the crimes data set each variable did not have their own column as there were 3 Year columns which colud be gathered into one single column. It also meant that each value did not have it’s own cell. Essentially, the year columns in main_df namely 2014, 2015 and 2016 were to be counted as one single variable and hence needed their own single column of Year. The column headers in this untidy dataset had values of variables instead of names of variables.
That is what I did. I used the tidyr library to get the above mentioned columns into one single ‘Year_for_crime_data’ column with their values in ‘Crime_cases’ column using the gather() function. The name of this new dataframe is tidy_df
This helped me to use R’s vectorized capabilities as now each variable had it’s own column and each vaue had it’s own cell.
As I got the new ‘Year_for_crime_data’ column I noticed that this was another column that had to be converted into factor variable. This column had three categories 2014, 2015, 2016.
Next I used the select() method to essentially rearrange the way data was displayed in the newly created tidy_df. I made sure that the Year_for_crime_data and Crime_cases columns were displayed right next to the States/UT column. This was done just to make the dataframe more readable.
Next it was time to use dplyr’s arrange() method to arrange the rows of the dataframe in accordance to the States/UT column. I

# This is the R chunk for the Tidy & Manipulate Data I 
#Tidying the Data Frame
tidy_df <- main_df %>%gather(`2014`, `2015`, `2016`, key = "Year_for_crime_data", value = "Crime_Cases")

#Converting the newly found Year column into factor as we did with the States/UT column
tidy_df$Year_for_crime_data <- as.factor(tidy_df$Year_for_crime_data) 

#To increase readability
tidy_df <- tidy_df %>% select(`States/UT`,Year_for_crime_data, Crime_Cases, everything())


#Arranging the dataframe by the states
tidy_df <- tidy_df%>%arrange(`States/UT`)
head(tidy_df)

str(tidy_df)

## 'data.frame':    108 obs. of  9 variables:
##  $ States/UT                 : Factor w/ 36 levels "Andaman and Nicobar Islands",..: 1 1 1 2 2 2 3 3 3 4 ...
##  $ Year_for_crime_data       : Factor w/ 3 levels "2014","2015",..: 1 2 3 1 2 3 1 2 3 1 ...
##  $ Crime_Cases               : int  746 862 802 114604 110693 106774 2843 2968 2534 94337 ...
##  $ Percentage_Share_of_State : num  0 0 0 3.6 3.6 3.6 0.1 0.1 0.1 3.4 ...
##  $ Rank_based_on_incidence   : Ord.factor w/ 36 levels "1"<"2"<"3"<"4"<..: 33 33 33 13 13 13 29 29 29 14 ...
##  $ Rate_of_cognizable_crime  : num  145 145 145 206 206 ...
##  $ Rank_based_on_crime_rate  : Ord.factor w/ 36 levels "1"<"2"<"3"<"4"<..: 23 23 23 15 15 15 17 17 17 5 ...
##  $ Suicide_rate_per_100000   : num  28.9 28.9 28.9 12.1 12.1 12.1 10.4 10.4 10.4 10 ...
##  $ Rank_based_on_suicide_rate: Ord.factor w/ 36 levels "1"<"2"<"3"<"4"<..: 3 3 3 16 16 16 20 20 20 21 ...

Tidy & Manipulate Data II

For the next part of Tidy and manipulation, I thought of mutating a new column into the existing dataframe. The ‘Rate_of_cognizable_crime’ column contained the number of Cognizable crimes per 100000 people in any particular State. I used this column and the ‘Crime_cases’ column to find out approximately how many Cognizable crimes could be registered in any particular State. There is no exact number of Cognizable crime count provided, so this was just a way to find out how may cases have been recorded approximately according to the rate of Cognizable crime.

For this I used the Rate_of_cognizable_crime column and divided it by 100000, and subsequentally multiplying this result by the number of crime cases.

# This is the R chunk for the Tidy & Manipulate Data II 
tidy_df %>% mutate(Approx_no_of_cognizable_crimes = (Rate_of_cognizable_crime/100000) * Crime_Cases)

Scan I

Scanning the data for missing values, special values and obvious errors (i.e. inconsistencies) included scanning the dataframe by the following ways.

To scan the dataframe for na values, I used the is.na() function on the entire dataframe. Now just doing this was not enough as in a big dataframe there can be cases where we can miss inconsistencies here and there. So to get the total number of missing/na values I used the colSums method along with the is.na method. This provided me with a clear view of which column in particular had if any missing values.
For nan values, I could not use the is.nan method diretly on the datframe as it cannot be applied directly. So I had to use the sapply function together with is.nan. Also just getting a dataframe of TRUE or FALSE values would not be enough to find out where particular missing values are, hence I used the colSums method as mentioend above.
For infinite values, I continued with the same process as I used to find nan values, using colSums and sapply.
There are times when a particular column may have some obvious inconsistencies or some impossible like the percentage share, rate of cognizable crimes, suicide rate and crime cases could not be less than 0 or negative. To automate the process of finding these inconsistencies, I used the editrules package, and specifically editset() and violatededits() functions. I used editset() to make a few custom rules and then used violatededits() to find out if any discrepancies.
As I found out that the data set was complete and not at all riddled with missing data or inconsistencies, but I feel the steps that I took could be applied to any incomplete dataset to scan for discrepencies.

# This is the R chunk for the Scan I
#Scanning the datframe for na values
#Since we can miss a few missing values here and there in a big dataframe, 
#it is better to get a count of missing values
colSums(is.na(tidy_df))

##                  States/UT        Year_for_crime_data 
##                          0                          0 
##                Crime_Cases  Percentage_Share_of_State 
##                          0                          0 
##    Rank_based_on_incidence   Rate_of_cognizable_crime 
##                          0                          0 
##   Rank_based_on_crime_rate    Suicide_rate_per_100000 
##                          0                          0 
## Rank_based_on_suicide_rate 
##                          0

#Once we have checked for missing values, it is time to check for infinite or nan values

#For nan values
colSums(sapply(tidy_df, is.nan))

##                  States/UT        Year_for_crime_data 
##                          0                          0 
##                Crime_Cases  Percentage_Share_of_State 
##                          0                          0 
##    Rank_based_on_incidence   Rate_of_cognizable_crime 
##                          0                          0 
##   Rank_based_on_crime_rate    Suicide_rate_per_100000 
##                          0                          0 
## Rank_based_on_suicide_rate 
##                          0

#For infinite values
colSums(sapply(tidy_df, is.infinite))

##                  States/UT        Year_for_crime_data 
##                          0                          0 
##                Crime_Cases  Percentage_Share_of_State 
##                          0                          0 
##    Rank_based_on_incidence   Rate_of_cognizable_crime 
##                          0                          0 
##   Rank_based_on_crime_rate    Suicide_rate_per_100000 
##                          0                          0 
## Rank_based_on_suicide_rate 
##                          0

#There are times when a particular column may have some obvious inconsistencies
#To check for that I will make use of the editrules library
(Rule1 <- editset(c("Percentage_Share_of_State >= 0", "Rate_of_cognizable_crime >= 0", 
                    "Suicide_rate_per_100000 >= 0", "Crime_Cases >= 0")))

## 
## Edit set:
## num1 : 0 <= Percentage_Share_of_State
## num2 : 0 <= Rate_of_cognizable_crime
## num3 : 0 <= Suicide_rate_per_100000
## num4 : 0 <= Crime_Cases

colSums(violatedEdits(Rule1, tidy_df))

## num1 num2 num3 num4 
##    0    0    0    0

Scan II

Scan the numeric data for outliers.

I had 3 numeric/integer data columns where I could find if they had any outliers.
To find these out I used all three columns to make a boxplot of each. I could not be sure which values were if any were outliers.
For this I used the z scores method from the outliers library. Once I got the minimum and maximum z scores, I used length method to find out which if any of the data from these columns had an absolute z score value of greater than 3.

# This is the R chunk for the Scan II
#Searching for outliers in the Percentage_Share_of_State column
tidy_df$Percentage_Share_of_State%>%boxplot(main = "Box Plot for Percentage share of crime per state",
                                               ylab = "Percentage of crime", col = "grey")

z.scores_per_share <- tidy_df$Percentage_Share_of_State%>%scores(type = "z")
z.scores_per_share%>%summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.7819 -0.7583 -0.4641  0.0000  0.5361  2.9836

length (which( abs(z.scores_per_share) >3 ))

## [1] 0

#Searching for outliers in Rate_of_cognizable_crime column
tidy_df$Rate_of_cognizable_crime %>% boxplot(main = "Box Plot for Rate of cognizable crime",
                                             ylab = "Percentage of cognizable crime", col = "grey")

z.scores_cog_crimes <- tidy_df$Rate_of_cognizable_crime%>%scores(type = "z")
z.scores_cog_crimes%>%summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -0.9955 -0.5356 -0.1564  0.0000  0.1247  4.3606

length (which( abs(z.scores_cog_crimes) >3 ))

## [1] 3

#Searching for outliers in Suicide_rate_per_100000  column
tidy_df$Suicide_rate_per_100000 %>% boxplot(main = "Box Plot for Suicide Rate per 100000",
                                             ylab = "Suicide rate per 100000", col = "grey")

z.scores_suicide_rate <- tidy_df$Suicide_rate_per_100000%>%scores(type = "z")
z.scores_suicide_rate%>%summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## -1.2419 -0.7237 -0.1620  0.0000  0.4482  2.8938

length (which( abs(z.scores_suicide_rate) >3 ))

## [1] 0

Transform

Transforming the column Suicide_rate_per_100000 as it had a few outliers as seen in the last section.

First I had to get a proper look at how the distribution for the column looked like. Once I found out that the distribution was right skewed, it was time to apply some transformation.

# This is the R chunk for the Transform Section

#Transforming the column Suicide_rate_per_100000 as it had a few outliers
hist(tidy_df$Suicide_rate_per_100000)

* I used the mathematical process of taking the log10 of each value of the column. As it turned out this made the values left skewed from being right skewed.

#Using log wwith base 10 transformation, we actually find out that the histogram actually flips to be left skewed
log_suicide_rate <- log10(tidy_df$Suicide_rate_per_100000)
hist(log_suicide_rate)

ON the other hand sqauring the numbers had no impact as I thought, since there were no negative values. Instead it made it even more right skewed.

#Square transform makes little to no impact as data remains right skewed
square_suicide_rate <- tidy_df$Suicide_rate_per_100000^2
hist(square_suicide_rate)

Lastly I used the forecasting package from R and used the BoxCox transform, which as it turned out centralized the distribution the most.

#BoxCox transformation actually makes the distribution most centralized
boxcox_suicide_rate<- BoxCox(tidy_df$Suicide_rate_per_100000,lambda = "auto")
hist(boxcox_suicide_rate)