Keywords

Crime rate, US cities, linear model.

Intro

In the following analysis we fit a simple linear model to a multivariate data set.

Step 1 - reading the data

We use the all.us.city.crime.1970 data set from the cluster.datasets package. It contains crime rates and population statistics for the 24 largest cities of the USA in the year 1970 (see the R documentation for details).

library(cluster.datasets)
library(plotly)
library(ggplot2)  # provides qplot(), used in Step 3
## reading the data
data(all.us.city.crime.1970)
crimedf <- all.us.city.crime.1970
names(crimedf)
##  [1] "city"             "population"       "white.change"    
##  [4] "black.population" "murder"           "rape"            
##  [7] "robbery"          "assault"          "burglary"        
## [10] "car.theft"

The variable population is the city population in thousands.
The variable white.change is the percent change in inner city white population from 1960 to 1970.
The variable black.population is the black population in thousands.
The six crime rate variables (murder, rape, robbery, assault, burglary, car.theft) are per 100,000 population.
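
Because the rates are per 100,000 inhabitants while population is in thousands, turning a rate back into an approximate count needs a unit conversion. A minimal sketch follows; rate_to_count is a hypothetical helper, and the numbers are illustrative and not taken from the data set.

```r
# Hypothetical helper: convert a rate per 100,000 inhabitants into an
# approximate count, given the population in thousands.
rate_to_count <- function(rate_per_100k, population_thousands) {
  rate_per_100k * (population_thousands * 1000) / 100000
}

# Illustrative values only: a rate of 10 per 100,000
# in a city of 1,000 thousand inhabitants
rate_to_count(10, 1000)
```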

Step 2 - preprocessing the data

Introduce 2 new variables

We introduce two new variables:
1 - crime.total, the sum of the first 4 crime rate variables (murder, rape, robbery, assault), i.e. personal crimes or crimes against the person
2 - black.perc, the black population as a percentage of the total city population.
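
The code that builds these two variables is not shown here. The sketch below is one plausible reconstruction from the definitions above, not necessarily the original code: add_derived_vars is a hypothetical helper, and it assumes black.perc is computed as 100 * black.population / population. The result is stored as mydata, the name used in the following steps.

```r
# Assumed reconstruction; add_derived_vars is a hypothetical helper name.
add_derived_vars <- function(df) {
  # personal crimes: sum of the first four crime rate variables
  df$crime.total <- df$murder + df$rape + df$robbery + df$assault
  # black population as a percentage of the total population
  # (both columns are in thousands, so the units cancel)
  df$black.perc <- 100 * df$black.population / df$population
  df
}

# crimedf is the data frame loaded in Step 1
if (exists("crimedf")) mydata <- add_derived_vars(crimedf)
```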

Define a new variable (b) for grouping purposes

The variable b assigns each city to one of 4 levels of black.perc. The levels are bounded by the quartiles of black.perc, and each label reports the upper bound of its level.

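## label each city with its quartile level of black.perc
b <- rep("", nrow(mydata))
qbp <- quantile(mydata$black.perc)  # min, 25%, 50%, 75%, max
## loop from the highest level down, so that each city keeps the
## lowest level whose upper bound it does not exceed
for (i in 4:1) {
  for (j in seq_len(nrow(mydata))) {
    if (mydata$black.perc[j] <= qbp[i + 1])
      b[j] <- paste0("level ", i, " (<= ", qbp[i + 1], ")")
  }
}
mydata <- data.frame(mydata, b)
unique(sort(b))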
## [1] "level 1 (<= 7.675)"  "level 2 (<= 15.95)"  "level 3 (<= 18.225)"
## [4] "level 4 (<= 25.8)"
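
The same grouping can be obtained without explicit loops using cut(). This is an alternative sketch, not the code used above. The vector below is a toy stand-in for mydata$black.perc with illustrative values, and the approach assumes the quartile breaks are distinct.

```r
# Toy percentages standing in for mydata$black.perc (illustrative values only)
black.perc <- c(5, 7.5, 12, 16, 18, 20, 25.8, 3)
qbp <- quantile(black.perc)  # min, 25%, 50%, 75%, max

# Right-closed intervals reproduce the "<= upper bound" rule;
# include.lowest = TRUE keeps the minimum inside level 1.
b_cut <- cut(black.perc, breaks = qbp, include.lowest = TRUE,
             labels = paste0("level ", 1:4, " (<= ", qbp[-1], ")"))
table(b_cut)
```

Unlike the character vector b, b_cut is an ordered-by-construction factor, so its levels keep their order in plots without sorting the labels.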

Step 3 - exploratory data analysis

Create a boxplot and a scatterplot

The following plots show a positive relationship between the total crime rate (personal crimes only) and the percentage of black population across the US cities in the data set.

qplot(factor(b), crime.total, data = mydata, 
      geom = "boxplot", 
      colour = I("darkred"), fill = I("tan"),
      xlab = "Black population percentage", 
      ylab = "Crime Rate", main = "US cities - year 1970")

qplot(black.perc, crime.total, data = mydata, 
      geom = "point", size = population, 
      colour = factor(b),
      xlab = "Black population percentage", 
      ylab = "Crime Rate", main = "US cities - year 1970")

Step 4 - fit a linear model

In the final step, we fit a simple linear regression of crime.total on black.perc and then perform an analysis of variance.

mdl <- lm(crime.total ~ black.perc, mydata)
mdl
## 
## Call:
## lm(formula = crime.total ~ black.perc, data = mydata)
## 
## Coefficients:
## (Intercept)   black.perc  
##       55.88        32.50
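
Read literally, the coefficients say that the fitted personal crime rate is 55.88 per 100,000 at black.perc = 0 and rises by about 32.50 per 100,000 for each additional percentage point. For a single predictor, the least-squares slope equals cov(x, y) / var(x) and the intercept equals mean(y) - slope * mean(x). The sketch below checks this against lm() on toy data; the values are illustrative and not taken from the crime data set.

```r
# Toy data (illustrative values only)
x <- c(2, 5, 9, 14, 18, 25)
y <- c(150, 210, 380, 500, 640, 870)

# Closed-form least-squares estimates for one predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# Compare with lm(); the two should agree up to floating-point tolerance
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```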

Analysis of variance

anova(mdl)
## Analysis of Variance Table
## 
## Response: crime.total
##            Df  Sum Sq Mean Sq F value    Pr(>F)    
## black.perc  1 1076492 1076492  35.898 4.971e-06 ***
## Residuals  22  659727   29988                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
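
The entries of the table are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, the F value is the ratio of the two mean squares, and for a one-predictor model the share of the total sum of squares attributed to the predictor is the R-squared. A short sketch that recomputes these quantities from the printed values (the variable names are introduced here for illustration):

```r
# Values copied from the ANOVA table above
ss_model <- 1076492   # Sum Sq for black.perc
ss_resid <- 659727    # Sum Sq for Residuals
df_model <- 1
df_resid <- 22

# F statistic: ratio of the model and residual mean squares
f_value <- (ss_model / df_model) / (ss_resid / df_resid)
# Upper-tail probability of the F distribution
p_value <- pf(f_value, df_model, df_resid, lower.tail = FALSE)
# Share of the total variation explained by the predictor
r_squared <- ss_model / (ss_model + ss_resid)

round(c(F = f_value, R2 = r_squared), 3)
p_value
```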

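The F test rejects the hypothesis of a zero slope (F = 35.898 on 1 and 22 degrees of freedom, p = 4.971e-06). The sums of squares give R-squared = 1076492 / (1076492 + 659727), or about 0.62, so black.perc accounts for roughly 62% of the variance of crime.total across the 24 cities. This is a reasonably good fit for a model with a single predictor.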