Crime rate, US cities, Linear Model.
In the following analysis we fit a linear model to a multivariate data set.
We refer to the all.us.city.crime.1970 data set containing data about crime rates along with population statistics in the 24 largest cities of the USA in the year 1970 (see R Documentation for details).
library(cluster.datasets)
library(ggplot2)  # provides qplot(), used for the plots below
library(plotly)
## reading the data
data(all.us.city.crime.1970)
crimedf <- all.us.city.crime.1970
names(crimedf)
## [1] "city" "population" "white.change"
## [4] "black.population" "murder" "rape"
## [7] "robbery" "assault" "burglary"
## [10] "car.theft"
The variable population is the city population in thousands.
The variable white.change is the percent change in inner city white population from 1960 to 1970.
The variable black.population is the black population in thousands.
The six crime rate variables (murder, rape, robbery, assault, burglary, car.theft) are per 100,000 population.
We introduce two new variables:
1 - crime.total, the sum of the first four crime rate variables (murder, rape, robbery, assault), i.e. crimes against the person
2 - black.perc, the black population as a percentage of the total city population.
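The code that builds mydata is not shown in the text. The following is a minimal sketch of how the two derived variables could be computed, assuming the column names printed by names(crimedf) above. The helper add_derived_vars and the toy data frame are hypothetical and serve only as an illustration; the toy numbers are made up and are not taken from the data set.

```r
# Hypothetical helper: adds the two derived variables described above.
# Assumes the column names printed by names(crimedf).
add_derived_vars <- function(df) {
  # Sum of the first four crime rates (crimes against the person)
  df$crime.total <- df$murder + df$rape + df$robbery + df$assault
  # Black population as a percentage of the total population
  # (both columns are in thousands, so the units cancel)
  df$black.perc <- 100 * df$black.population / df$population
  df
}

# Toy rows with made-up numbers, only to show that the helper runs
toy <- data.frame(
  city = c("A", "B"),
  population = c(1000, 500),
  black.population = c(200, 50),
  murder = c(10, 5), rape = c(20, 10),
  robbery = c(300, 100), assault = c(200, 80)
)
toy_derived <- add_derived_vars(toy)
print(toy_derived[, c("city", "crime.total", "black.perc")])
```

Under these assumptions, the data frame used in the rest of the analysis would be obtained with something like mydata <- add_derived_vars(crimedf).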
The variable b assigns each city to one of four groups (levels) defined by the quartiles of black.perc.
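# Label each city with its black.perc quartile group
b <- rep("", nrow(mydata))
qbp <- quantile(mydata$black.perc)
# Go from the top quartile down, so each city ends with the lowest bound it satisfies
for (i in 4:1) {
  for (j in seq_len(nrow(mydata))) {
    if (mydata$black.perc[j] <= qbp[i + 1])
      b[j] <- paste("level ", i, " (<= ", qbp[i + 1], ")", sep = "")
  }
}
mydata <- data.frame(mydata, b)
unique(sort(b))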
## [1] "level 1 (<= 7.675)" "level 2 (<= 15.95)" "level 3 (<= 18.225)"
## [4] "level 4 (<= 25.8)"
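The same quartile grouping can also be expressed without explicit loops. The sketch below uses cut() from base R as an alternative to the loop above, not as the method used in this analysis. The vector of percentages is made up for illustration.

```r
# Made-up percentages, for illustration only (not the city data)
black.perc <- c(3.2, 7.5, 12.0, 16.4, 18.0, 21.3, 25.8, 9.9)
qbp <- quantile(black.perc)

# cut() uses right-closed intervals (a, b] by default, matching the "<=" test;
# include.lowest = TRUE keeps the minimum value in the first group
b <- cut(black.perc, breaks = qbp, include.lowest = TRUE,
         labels = paste0("level ", 1:4, " (<= ", qbp[-1], ")"))
table(b)
```

The result is a factor with four levels already ordered by quartile, so it can be passed to the plotting code directly.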
The following plots display the positive relationship between total crime rate (referring to personal crimes) and black population percentage in the US cities.
qplot(factor(b), crime.total, data = mydata,
      geom = "boxplot",
      colour = I("darkred"), fill = I("tan"),
      xlab = "Black population percentage",
      ylab = "Crime Rate", main = "US cities - year 1970")
qplot(black.perc, crime.total, data = mydata,
      geom = "point", size = population,
      colour = factor(b),
      xlab = "Black population percentage",
      ylab = "Crime Rate", main = "US cities - year 1970")
In the final step, we fit a linear model and perform an analysis of variance.
mdl <- lm(crime.total ~ black.perc, mydata)
mdl
##
## Call:
## lm(formula = crime.total ~ black.perc, data = mydata)
##
## Coefficients:
## (Intercept)   black.perc
##       55.88        32.50
anova(mdl)
## Analysis of Variance Table
##
## Response: crime.total
##            Df  Sum Sq Mean Sq F value    Pr(>F)
## black.perc  1 1076492 1076492  35.898 4.971e-06 ***
## Residuals  22  659727   29988
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The F-test in the table above is highly significant (p ≈ 5e-06), so black.perc is a useful linear predictor of crime.total for these 24 cities.
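The quality of the fit can also be quantified directly from the ANOVA table. The sketch below recomputes the coefficient of determination (R squared) and the F statistic from the sums of squares printed by anova(mdl). The variable names are chosen here for illustration.

```r
# Sums of squares and residual degrees of freedom copied from anova(mdl)
ss_model <- 1076492   # row "black.perc"
ss_resid <- 659727    # row "Residuals"
df_resid <- 22

# Share of the total variation in crime.total explained by black.perc
r_squared <- ss_model / (ss_model + ss_resid)

# F statistic: model mean square (1 df) over residual mean square
f_value <- ss_model / (ss_resid / df_resid)

round(r_squared, 3)
round(f_value, 2)
```

R squared comes out at about 0.62, so roughly 62% of the variance in crime.total is accounted for by black.perc. The same value is reported by summary(mdl) as "Multiple R-squared".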