1. Data Exploration: This should include summary statistics, means, medians, quartiles, or any other relevant information about the data set. Please include some conclusions in the R Markdown text.

library (readr)
MR <- read.csv(url("https://vincentarelbundock.github.io/Rdatasets/csv/AER/MurderRates.csv"))
summary(MR)
##        X              rate         convictions       executions     
##  Min.   : 1.00   Min.   : 0.810   Min.   :0.1080   Min.   :0.00000  
##  1st Qu.:11.75   1st Qu.: 1.808   1st Qu.:0.1663   1st Qu.:0.02625  
##  Median :22.50   Median : 3.625   Median :0.2260   Median :0.04500  
##  Mean   :22.50   Mean   : 5.404   Mean   :0.2605   Mean   :0.06034  
##  3rd Qu.:33.25   3rd Qu.: 7.725   3rd Qu.:0.3202   3rd Qu.:0.08225  
##  Max.   :44.00   Max.   :19.250   Max.   :0.7570   Max.   :0.40000  
##       time           income           lfp           noncauc       
##  Min.   : 34.0   Min.   :0.760   Min.   :47.00   Min.   :0.00300  
##  1st Qu.: 94.0   1st Qu.:1.550   1st Qu.:51.50   1st Qu.:0.02175  
##  Median :124.0   Median :1.830   Median :53.40   Median :0.06450  
##  Mean   :136.5   Mean   :1.781   Mean   :53.07   Mean   :0.10559  
##  3rd Qu.:179.0   3rd Qu.:2.070   3rd Qu.:54.52   3rd Qu.:0.14450  
##  Max.   :298.0   Max.   :2.390   Max.   :58.80   Max.   :0.45400  
##    southern        
##  Length:44         
##  Class :character  
##  Mode  :character  
##                    
##                    
## 

Conclusion: The mean murder rate is 5.4 per 100k residents, above the median of 3.6, which suggests a right-skewed distribution. The mean time served by convicted murderers is 136.5 months (median 124).
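The mean-versus-median comparison can be produced for every numeric column at once. The sketch below is illustrative only: `sample_MR` is a hypothetical six-row data frame built from the rows that `head()` prints in this report, so its numbers will not match the 44-row summary above.

```r
# Six rows copied from the head() output in this report (original column
# names from MR); the full data set has 44 rows.
sample_MR <- data.frame(
  rate     = c(19.25, 7.53, 5.66, 3.21, 2.80, 1.41),
  time     = c(47, 58, 82, 100, 222, 164),
  income   = c(1.10, 0.92, 1.72, 2.18, 1.75, 2.26),
  southern = c("yes", "yes", "no", "no", "no", "no")
)

# Keep only the numeric columns, then compute mean and median for each.
num_cols <- sample_MR[sapply(sample_MR, is.numeric)]
center <- sapply(num_cols, function(x) c(mean = mean(x), median = median(x)))
print(round(center, 2))

# A mean above the median hints at a right-skewed column.
skew_flag <- center["mean", ] > center["median", ]
print(skew_flag)
```

To run the same check on the full data, replace `sample_MR` with `MR`.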

2. Data wrangling: Please perform some basic transformations. They will need to make sense but could include column renaming, creating a subset of the data, replacing values, or creating new columns with derived data (for example – if it makes sense you could sum two columns together)

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
new_MR <- select(MR, "rate", "time", "income", "southern")
colnames(new_MR) <- c("MurderRate","Time_Served", "Family_income", "Southern_Region")
S_new_MR <- subset(new_MR, Southern_Region == "yes")
N_new_MR <- subset(new_MR, Southern_Region == "no")
summary(S_new_MR)
##    MurderRate      Time_Served     Family_income   Southern_Region   
##  Min.   : 2.830   Min.   : 34.00   Min.   :0.760   Length:15         
##  1st Qu.: 7.525   1st Qu.: 74.50   1st Qu.:1.195   Class :character  
##  Median :10.440   Median : 95.00   Median :1.350   Mode  :character  
##  Mean   :10.107   Mean   : 95.93   Mean   :1.401                     
##  3rd Qu.:12.065   3rd Qu.:124.00   3rd Qu.:1.570                     
##  Max.   :19.250   Max.   :161.00   Max.   :2.070
summary(N_new_MR)
##    MurderRate     Time_Served    Family_income   Southern_Region   
##  Min.   :0.810   Min.   : 56.0   Min.   :1.550   Length:29         
##  1st Qu.:1.410   1st Qu.:101.0   1st Qu.:1.810   Class :character  
##  Median :2.800   Median :148.0   Median :1.970   Mode  :character  
##  Mean   :2.971   Mean   :157.5   Mean   :1.978                     
##  3rd Qu.:3.710   3rd Qu.:199.0   3rd Qu.:2.120                     
##  Max.   :8.310   Max.   :298.0   Max.   :2.390
head(new_MR)
##   MurderRate Time_Served Family_income Southern_Region
## 1      19.25          47          1.10             yes
## 2       7.53          58          0.92             yes
## 3       5.66          82          1.72              no
## 4       3.21         100          2.18              no
## 5       2.80         222          1.75              no
## 6       1.41         164          2.26              no
head(S_new_MR)
##    MurderRate Time_Served Family_income Southern_Region
## 1       19.25          47          1.10             yes
## 2        7.53          58          0.92             yes
## 7        6.18         161          2.07             yes
## 8       12.15          70          1.43             yes
## 14      10.44         104          1.35             yes
## 15       9.58         126          1.26             yes
head(N_new_MR)
##    MurderRate Time_Served Family_income Southern_Region
## 3        5.66          82          1.72              no
## 4        3.21         100          2.18              no
## 5        2.80         222          1.75              no
## 6        1.41         164          2.26              no
## 9        1.34         219          1.92              no
## 10       3.71          81          1.82              no
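As an alternative to building two subsets and summarising each one, the regional comparison could be produced in a single grouped step. This is a minimal sketch, assuming dplyr's `group_by()` and `summarise()`; `sample_new_MR` and the output column names are hypothetical, and the data frame holds only the six rows printed by `head(new_MR)`.

```r
library(dplyr)

# Six rows copied from head(new_MR) above; the full data set has 44 rows.
sample_new_MR <- data.frame(
  MurderRate      = c(19.25, 7.53, 5.66, 3.21, 2.80, 1.41),
  Time_Served     = c(47, 58, 82, 100, 222, 164),
  Family_income   = c(1.10, 0.92, 1.72, 2.18, 1.75, 2.26),
  Southern_Region = c("yes", "yes", "no", "no", "no", "no")
)

# One summary row per region instead of two separate subsets.
by_region <- sample_new_MR %>%
  group_by(Southern_Region) %>%
  summarise(
    n                  = n(),
    median_rate        = median(MurderRate),
    median_time_served = median(Time_Served),
    median_income      = median(Family_income)
  )
print(by_region)
```

Running the same pipeline on `new_MR` should reproduce the medians reported by `summary(S_new_MR)` and `summary(N_new_MR)`.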

3. Graphics: Please make sure to display at least one scatter plot, box plot and histogram. Don’t be limited to this. Please explore the many other options in R packages such as ggplot2.

library(ggplot2)
#Histogram grouped by Southern_Region
ggplot(data=new_MR) + geom_histogram(aes(x=MurderRate)) + labs(title = "Distribution of Murder Rate by Region", x= "MurderRate", y = "Count") + facet_grid(~Southern_Region)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
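The `stat_bin()` message above appears because no bin size was given. A possible fix is to pass `binwidth` to `geom_histogram()`. The sketch below assumes a width of 2 (murders per 100k per bar), which is an arbitrary choice, and uses a hypothetical six-row `sample_new_MR` taken from the `head(new_MR)` output.

```r
library(ggplot2)

# Six rows copied from head(new_MR); the full data set has 44 rows.
sample_new_MR <- data.frame(
  MurderRate      = c(19.25, 7.53, 5.66, 3.21, 2.80, 1.41),
  Southern_Region = c("yes", "yes", "no", "no", "no", "no")
)

# binwidth = 2 is an assumed value; tune it to the spread of the data.
p <- ggplot(sample_new_MR, aes(x = MurderRate)) +
  geom_histogram(binwidth = 2) +
  labs(title = "Murder Rate by Region", x = "Murder Rate", y = "Count") +
  facet_grid(~Southern_Region)
print(p)
```

With an explicit `binwidth`, ggplot2 no longer falls back to the default of 30 bins, so the message is not printed.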

#Boxplot grouped by Southern_Region
ggplot(new_MR,aes(y=MurderRate, x=Southern_Region)) + geom_boxplot() + labs(title = "Murder Rate by Region", x= "Southern", y = "Murder Rate")

#Scatterplot showing relationship between Murder Rate and Family Income, grouped by Southern Region
g <-ggplot(new_MR,aes(x=MurderRate, y=Family_income)) 
g + geom_point() + labs(title = "Murder Rate vs Family Income by Region", x= "MurderRate", y = "Family Income") + facet_grid(~Southern_Region)

#Scatterplot showing relationship between Murder Rate and Time Served, grouped by Southern Region
g2 <-ggplot(new_MR,aes(x=MurderRate, y=Time_Served)) 
g2 + geom_point() + labs(title = "Murder Rate vs Time Served by Region", x= "MurderRate", y = "Time Served") + facet_grid(~Southern_Region)

4. Meaningful question for analysis: Please state at the beginning a meaningful question for analysis. Use the first three steps and anything else that would be helpful to answer the question you are posing from the data set you chose. Please write a brief conclusion paragraph in R markdown at the end.

Question: How does the Southern region differ from the non-Southern region in terms of murder rate, time served, family income, and non-Caucasian population?

Answer:

In terms of Murder Rate: The Southern region has a higher murder rate. Its median murder rate is 10.44 per 100k residents (mean 10.11), while the non-Southern median is only 2.80 per 100k residents (mean 2.97). The boxplot shows this difference: the Southern box sits higher on the graph.

In terms of Time Served: Although the Southern region has a higher murder rate, its typical time served is lower than in the non-Southern region, and the scatterplot shows the same pattern. Comparing medians, time served in the non-Southern region (148 months) is 53 months longer than in the Southern region (95 months).

In terms of Family income: In the Southern region, lower family income tends to go with a higher murder rate; most observations cluster in the lower-right area of the plot (high murder rate, low income). In the non-Southern region, all murder rates are below 10, and most observations cluster in the top-left area (low murder rate, high income). In both regions, higher family income is therefore associated with a lower murder rate.
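One way to check the direction of the income relationship numerically, and not only by eye, is the correlation coefficient from base R's `cor()`. The sketch below is a rough illustration on a hypothetical six-row `sample_new_MR` (the rows printed by `head(new_MR)`), so the value it prints is not the result for the full data set.

```r
# Six rows copied from head(new_MR); the full data set has 44 rows.
sample_new_MR <- data.frame(
  MurderRate      = c(19.25, 7.53, 5.66, 3.21, 2.80, 1.41),
  Family_income   = c(1.10, 0.92, 1.72, 2.18, 1.75, 2.26),
  Southern_Region = c("yes", "yes", "no", "no", "no", "no")
)

# Pearson correlation between murder rate and family income.
# A negative value means higher income goes with a lower murder rate.
r_all <- cor(sample_new_MR$MurderRate, sample_new_MR$Family_income)
print(r_all)

# On the full data, the same check could be run per region, for example:
# sapply(split(new_MR, new_MR$Southern_Region),
#        function(d) cor(d$MurderRate, d$Family_income))
```

The per-region call is left as a comment because the sample contains only two Southern rows, and a correlation from two points is not informative.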

5. BONUS – place the original .csv in a GitHub file and have R read from the link. This will be a very useful skill as you progress in your data science education and career.

library (readr)
MR2 <- read.csv(url("https://raw.githubusercontent.com/tonyCUNY/test/main/MurderRates.csv"))
summary(MR2)
##        X              rate         convictions       executions     
##  Min.   : 1.00   Min.   : 0.810   Min.   :0.1080   Min.   :0.00000  
##  1st Qu.:11.75   1st Qu.: 1.808   1st Qu.:0.1663   1st Qu.:0.02625  
##  Median :22.50   Median : 3.625   Median :0.2260   Median :0.04500  
##  Mean   :22.50   Mean   : 5.404   Mean   :0.2605   Mean   :0.06034  
##  3rd Qu.:33.25   3rd Qu.: 7.725   3rd Qu.:0.3202   3rd Qu.:0.08225  
##  Max.   :44.00   Max.   :19.250   Max.   :0.7570   Max.   :0.40000  
##       time           income           lfp           noncauc       
##  Min.   : 34.0   Min.   :0.760   Min.   :47.00   Min.   :0.00300  
##  1st Qu.: 94.0   1st Qu.:1.550   1st Qu.:51.50   1st Qu.:0.02175  
##  Median :124.0   Median :1.830   Median :53.40   Median :0.06450  
##  Mean   :136.5   Mean   :1.781   Mean   :53.07   Mean   :0.10559  
##  3rd Qu.:179.0   3rd Qu.:2.070   3rd Qu.:54.52   3rd Qu.:0.14450  
##  Max.   :298.0   Max.   :2.390   Max.   :58.80   Max.   :0.45400  
##    southern        
##  Length:44         
##  Class :character  
##  Mode  :character  
##                    
##                    
##