Objective

To analyze a dataset of US arrests by state.

Starting Point

The dataset is obtained from the following site: http://vincentarelbundock.github.io/
Dataset reference: US Arrests

Obtaining the data

  1. Download the data as a CSV file
theURL <- "http://vincentarelbundock.github.io/Rdatasets/csv/datasets/USArrests.csv";
arrest_data <- read.table(file = theURL, header = TRUE, sep = ",");
head(arrest_data);
#The state column is named "X". Rename it to US_State
colnames(arrest_data)[1] <- "US_State";
head(arrest_data);
#Write the data to a CSV
write.table(arrest_data, file = "us_arrest.csv", sep = ",", row.names = FALSE);
#The saved file is uploaded to GitHub; the URL below is used hereafter.
#https://raw.githubusercontent.com/arunk13/MSDA-Assignments/master/BridgeCourse/Week4/us_arrest.csv
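
If the download URL is unavailable, the same table can probably be rebuilt from R's built-in USArrests data frame, since the CSV path suggests it was exported from the datasets package. This is a minimal sketch under that assumption; the name arrest_data_builtin is introduced here for illustration only.

```r
# Assumption: the built-in USArrests data frame (package 'datasets')
# holds the same values as the downloaded CSV.
# State names are stored as row names there, so move them into a US_State column.
arrest_data_builtin <- data.frame(
  US_State = rownames(USArrests),
  USArrests,
  row.names = NULL,          # replace state row names with 1..n
  stringsAsFactors = FALSE
)
head(arrest_data_builtin)
```

The resulting columns follow the same order as the renamed CSV: US_State, Murder, Assault, UrbanPop, Rape.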

The dataset

##     US_State Murder Assault UrbanPop Rape
## 1    Alabama   13.2     236       58 21.2
## 2     Alaska   10.0     263       48 44.5
## 3    Arizona    8.1     294       80 31.0
## 4   Arkansas    8.8     190       50 19.5
## 5 California    9.0     276       91 40.6
## 6   Colorado    7.9     204       78 38.7

A brief summary of the dataset

summary(arrest_data);
##        US_State      Murder          Assault         UrbanPop    
##  Alabama   : 1   Min.   : 0.800   Min.   : 45.0   Min.   :32.00  
##  Alaska    : 1   1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50  
##  Arizona   : 1   Median : 7.250   Median :159.0   Median :66.00  
##  Arkansas  : 1   Mean   : 7.788   Mean   :170.8   Mean   :65.54  
##  California: 1   3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75  
##  Colorado  : 1   Max.   :17.400   Max.   :337.0   Max.   :91.00  
##  (Other)   :44                                                   
##       Rape      
##  Min.   : 7.30  
##  1st Qu.:15.07  
##  Median :20.10  
##  Mean   :21.23  
##  3rd Qu.:26.18  
##  Max.   :46.00  
## 

Exploratory analysis

# g_murder <- ggplot(data = arrest_data_sub, aes(y = Murder, x = US_State, fill = US_State));
# g_murder + geom_bar(stat = "identity", width = 0.2) + guides(fill = FALSE) + xlab("US States") + ylab("Murder(per 100K)") + ggtitle("Crime rate: Murder rate versus states");
# g_murder_UrbanPop <- ggplot(data = arrest_data, aes(y = Murder, x = UrbanPop));
# g_murder_UrbanPop + geom_point(shape = 1) + geom_smooth(method = lm);
# g_rape_UrbanPop <- ggplot(data = arrest_data, aes(y = Rape, x = UrbanPop));
# g_rape_UrbanPop + geom_point(shape = 1) + geom_smooth(method = lm);
# g_assault_UrbanPop <- ggplot(data = arrest_data, aes(y = Assault, x = UrbanPop));
# g_assault_UrbanPop + geom_point(shape = 1) + geom_smooth(method = lm);

Sort the dataset by urban population

arrest_data_sub <- arrest_data[order(-arrest_data$UrbanPop),];
## Loading required package: ggplot2

summary(arrest_data$Murder);
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.800   4.075   7.250   7.788  11.250  17.400

Fact: On average, there were about 8 murder arrests per 100,000 residents in 1973 (mean 7.788).

  1. California is among the top 5 states by rape arrests.

Extending the above study to the top 7 states:

  1. California is among the top 7 states by both rape and assault arrests.
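
The rankings above can be checked with a short sketch along these lines. It assumes the built-in USArrests data frame matches the CSV used in this report. The helper top_states is a hypothetical name introduced here and is not part of the original analysis.

```r
# Assumption: built-in USArrests stands in for the downloaded arrest_data.
arrest_data <- data.frame(US_State = rownames(USArrests), USArrests,
                          row.names = NULL, stringsAsFactors = FALSE)

# Hypothetical helper: names of the n states with the highest value in `column`.
top_states <- function(data, column, n) {
  ranked <- data[order(-data[[column]]), ]
  head(ranked$US_State, n)
}

top5_rape    <- top_states(arrest_data, "Rape", 5)
top7_rape    <- top_states(arrest_data, "Rape", 7)
top7_assault <- top_states(arrest_data, "Assault", 7)

# Is California in the top 5 by rape, and in the top 7 by both rape and assault?
in_top5_rape <- "California" %in% top5_rape
in_top7_both <- "California" %in% intersect(top7_rape, top7_assault)
print(c(top5_rape = in_top5_rape, top7_both = in_top7_both))
```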

With the available data, it can be concluded that the percentage of urban population does not have a significant impact on the overall crime rate. A more detailed analysis of the arrest data for California is needed to understand whether urban population had an impact on its crime rates.

Observations

  1. Murder and assault arrests do not seem to have any correlation with urban population.
  2. The number of rape arrests seems to have a correlation with urban population.
  3. To strengthen the above findings, a similar study has to be done with more recent data.
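
One way to put numbers behind the first two observations is to compute the Pearson correlation of each arrest rate with UrbanPop. The sketch below assumes the built-in USArrests data frame matches the CSV. The names crimes, urban_cor, and rape_test are introduced here for illustration.

```r
# Assumption: built-in USArrests holds the same data as the downloaded CSV.
crimes <- c("Murder", "Assault", "Rape")

# Pearson correlation of each arrest rate with the urban population percentage.
urban_cor <- sapply(crimes, function(crime) {
  cor(USArrests[[crime]], USArrests$UrbanPop)
})
print(round(urban_cor, 2))

# cor.test() adds a p-value for the null hypothesis of zero correlation.
rape_test <- cor.test(USArrests$Rape, USArrests$UrbanPop)
print(rape_test$p.value)
```

A correlation here describes a linear association across states only; it does not by itself show that urbanization causes higher arrest rates.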