We will examine a set of US arrest data available at http://vincentarelbundock.github.io/Rdatasets/
week3 <- read.csv("c:/Users/Nate/Documents/Dataset/USArrests.csv")
View(week3)
attach(week3)
Next we will do some basic data exploration:
summary(week3)
##           X          Murder          Assault         UrbanPop
##  Alabama   : 1   Min.   : 0.800   Min.   : 45.0   Min.   :32.00
##  Alaska    : 1   1st Qu.: 4.075   1st Qu.:109.0   1st Qu.:54.50
##  Arizona   : 1   Median : 7.250   Median :159.0   Median :66.00
##  Arkansas  : 1   Mean   : 7.788   Mean   :170.8   Mean   :65.54
##  California: 1   3rd Qu.:11.250   3rd Qu.:249.0   3rd Qu.:77.75
##  Colorado  : 1   Max.   :17.400   Max.   :337.0   Max.   :91.00
##  (Other)   :44
##       Rape
##  Min.   : 7.30
##  1st Qu.:15.07
##  Median :20.10
##  Mean   :21.23
##  3rd Qu.:26.18
##  Max.   :46.00
##
We can see that murder has the lowest rates (per 100,000), with a mean of 7.788 and a max of 17.4, and assault the highest, with a mean of 170.8 and a max of 337.0. Most of the variables seem roughly symmetrically distributed, as the medians are close to the means, but Assault seems to be right skewed, with a median of 159 and a mean of 170.8. Perhaps the min of 45 is an outlier? The asymmetry of the Assault rate is interesting, and I wonder whether there is a relationship to urban population that affects its distribution.
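The mean-versus-median comparison can also be done directly rather than read off the summary table. The following is a minimal sketch, assuming that R's built-in `USArrests` data set holds the same values as the CSV loaded above; the names `centre` and `skew_hint` are introduced here for illustration only.

```r
# Assumption: the built-in USArrests matches the CSV used in this post.
data(USArrests)

# Mean and median of every column, as a 2 x 4 matrix.
centre <- sapply(USArrests, function(x) c(mean = mean(x), median = median(x)))

# Mean minus median: a positive value hints at a longer right tail,
# a negative value at a longer left tail.
skew_hint <- centre["mean", ] - centre["median", ]
print(round(skew_hint, 2))
```

This is only a rough indicator of direction; a histogram or boxplot is still needed to see where the asymmetry comes from.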
total_crime <- Assault + Rape + Murder
View(total_crime)
percent_assualt <- Assault/total_crime
We can look at the assault data:
boxplot(Assault)
hist(percent_assualt)
Here it looks like one state has a very low percentage of assault as a share of total crime.
hist(Assault)
The assault distribution looks slightly bimodal, with a second peak between 250 and 300.
plot(UrbanPop,Assault)
plot(UrbanPop, percent_assualt)
It does appear that one state has a high urban population but a low percentage of assault.
fit <- lm(Assault~UrbanPop)
summary(fit)
##
## Call:
## lm(formula = Assault ~ UrbanPop)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -150.78  -61.85  -18.68   58.05  196.85
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  73.0766    53.8508   1.357   0.1811
## UrbanPop      1.4904     0.8027   1.857   0.0695 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 81.33 on 48 degrees of freedom
## Multiple R-squared: 0.06701, Adjusted R-squared: 0.04758
## F-statistic: 3.448 on 1 and 48 DF, p-value: 0.06948
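As a complementary check on the slope reported above, a confidence interval can be easier to read than a p-value alone. This is a minimal sketch, assuming the built-in `USArrests` data set matches the CSV; `fit_builtin` and `ci` are names introduced here for illustration.

```r
# Assumption: the built-in USArrests matches the CSV used in this post.
data(USArrests)

# Same model as above, but fitted with an explicit data argument
# instead of relying on attach().
fit_builtin <- lm(Assault ~ UrbanPop, data = USArrests)

# 95% confidence interval for the UrbanPop slope.
ci <- confint(fit_builtin, "UrbanPop", level = 0.95)
print(ci)
```

If the interval contains zero, the data are compatible with no linear effect of urban population at the 5% level.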
There does not appear to be a linear relationship between urban population and assault rates: the slope is not statistically significant at the 0.05 level (p = 0.0695), and UrbanPop explains under 7% of the variance (R-squared 0.067). The state that stands out in the plots has a lower percent assault than the other locations. Using the View() function in RStudio, this looks to be Hawaii, with 83% urban population and only 46 assaults per 100,000. Since the mean is sensitive to extreme values, the mean (170.8) and the median (159) differ. Because the mean is the larger of the two, the pull comes from the high-assault states rather than from a low value such as Hawaii's, so this one state does not by itself explain the asymmetry in the Assault data.
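The state can also be looked up in code rather than by scanning the View() window. This is a minimal sketch, assuming the built-in `USArrests` data set matches the CSV (there the state names are row names rather than an `X` column); `total`, `share`, and `lowest` are names introduced here for illustration.

```r
# Assumption: the built-in USArrests matches the CSV used in this post.
data(USArrests)

# Assault as a share of the three crime rates, labelled by state.
total <- with(USArrests, Assault + Rape + Murder)
share <- USArrests$Assault / total
names(share) <- rownames(USArrests)

# State with the smallest assault share, and its full row.
lowest <- names(which.min(share))
print(lowest)
print(USArrests[lowest, ])
```

Sorting `share` with `sort(share)` shows how far the lowest state sits from the rest.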