Answer the following questions using the Arrests dataset and the Arrests Codebook.
For each question, provide your code and the answer (include all relevant plots).

Q1:

Create a variable called “checksbinary” that equals 1 if an arrestee’s name appears in a police database for a previous arrest, conviction, or parole and 0 if their name does not appear.

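# Read the data (the path is specific to the author's machine)
Arrest <- read.csv("/Users/YanfeiQin/Desktop/Fall 2021/897-002 Applied Linear Modeling/Lab 10/Arrests.csv", header=TRUE, sep=",")
# Keep only the variables used in this lab and drop rows with missing values
Arrest <- na.omit(Arrest[,c("released","race","age","gender","employed","citizen","checks")])
# checksbinary is 1 when checks is nonzero (name appears in a database), else 0
Arrest$checksbinary <- ifelse(Arrest$checks != 0, 1, 0)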
head(Arrest)
##   released race age gender employed citizen checks checksbinary
## 1        1    0  21      1        1       1      3            1
## 2        0    0  20      1        1       1      5            1
## 3        1    1  21      1        1       1      1            1
## 4        1    0  17      1        1       1      1            1
## 5        0    0  17      1        0       1      0            0
## 6        1    0  19      1        1       1      1            1
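
A quick way to confirm that the recode behaves as intended is to cross-tabulate the original count against the new indicator. The sketch below uses a small made-up data frame (`toy` is hypothetical and stands in for the real Arrests file, which is not loaded here), so the counts it prints say nothing about the actual data.

```r
# Hypothetical toy data standing in for the real Arrests file
toy <- data.frame(checks = c(0, 1, 3, 5, 0, 2))

# Same recode as above: any nonzero count becomes 1
toy$checksbinary <- ifelse(toy$checks != 0, 1, 0)

# Cross-tabulate the original counts against the recoded indicator
recode_check <- table(checks = toy$checks, checksbinary = toy$checksbinary)
print(recode_check)

# Every row with checks == 0 should map to 0, and every other row to 1
stopifnot(all(toy$checksbinary[toy$checks == 0] == 0),
          all(toy$checksbinary[toy$checks != 0] == 1))
```

On the real data, the same `table()` call with `Arrest$checks` and `Arrest$checksbinary` should show nonzero cells only where the two variables agree.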


Q2:

Create a subset of the Arrests data frame called Arrests2 that includes the following variables:
1. checksbinary
2. race
3. age

myvars <- c("checksbinary", "race", "age")
Arrest2 <- Arrest[myvars]
head(Arrest2)
##   checksbinary race age
## 1            1    0  21
## 2            1    0  20
## 3            1    1  21
## 4            1    0  17
## 5            0    0  17
## 6            1    0  19


Q3:

Does the age variable adhere to the assumption of linearity?

library(ggplot2)
range(Arrest2$age)
## [1] 14 59
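# Fit the logistic model with every other column (race and age) as predictors
linearity <- glm(checksbinary ~ . , family=binomial(link='logit'),
                 data=Arrest2)
# predict() on a glm returns the linear predictor (log odds) by default
logodds <- predict(linearity)
range(logodds)
## [1] -0.1591827  3.2620820
# Pair each arrestee's age with the fitted log odds for plotting
plotlin <- with(Arrest2, data.frame(age = age,
                                   logit = logodds))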
ggplot(plotlin, aes(x = age, y = logit))+
     geom_point()+
     labs(x = "age", y = "log odds") +
     geom_smooth(method = "loess", col = "#3e3e3e")+
     geom_smooth(method = "lm", col = "blue") +
     scale_x_continuous(limits=c(14,59))+
     scale_y_continuous(limits= c(-0.5,3.5))
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'


Answer Q3:

The graph above shows that the age variable is linearly associated with the log odds of the outcome. Thus, “age” adheres to the assumption of linearity.
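
Because the log odds plotted above are fitted values from a model that is already linear in age, a complementary check is to look at observed (empirical) log odds within age bins. The following is only a sketch on simulated data: the data frame `sim`, its coefficients, and the choice of ten quantile bins are all assumptions made for illustration, and base graphics are used instead of ggplot2 to keep the example dependency-free.

```r
# Simulated stand-in data; the real Arrest2 data frame is not used here
set.seed(1)
n <- 2000
sim <- data.frame(age = sample(14:59, n, replace = TRUE))
# Outcome generated from a model that is linear in age on the log-odds scale
# (the coefficients -0.9 and 0.05 are assumed values, not estimates)
sim$y <- rbinom(n, size = 1, prob = plogis(-0.9 + 0.05 * sim$age))

# Group age into bins with roughly equal numbers of cases
breaks <- unique(quantile(sim$age, probs = seq(0, 1, by = 0.1)))
sim$age_bin <- cut(sim$age, breaks = breaks, include.lowest = TRUE)

# Number of events, number of cases, and mean age in each bin
events  <- tapply(sim$y, sim$age_bin, sum)
n_bin   <- tapply(sim$y, sim$age_bin, length)
age_mid <- tapply(sim$age, sim$age_bin, mean)

# Empirical logit with a 0.5 adjustment so bins at 0% or 100% stay finite
emp_logit <- log((events + 0.5) / (n_bin - events + 0.5))

# A roughly straight pattern supports linearity in the logit
plot(age_mid, emp_logit, xlab = "mean age in bin", ylab = "empirical log odds")
abline(lm(emp_logit ~ age_mid), col = "blue")
```

To apply the same idea to the lab data, `sim$age` and `sim$y` would be replaced by `Arrest2$age` and `Arrest2$checksbinary`; with a small sample, fewer bins may be needed so that each bin holds enough cases.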

Q4:

Estimate a logistic regression model where the checksbinary variable is regressed on race and age.
1. Interpret the coefficient for the race variable.

lm1 <- glm(checksbinary ~ race + age, family=binomial(link='logit'), data=Arrest2)
exp(cbind(OR = coef(lm1), confint(lm1)))
## Waiting for profiling to be done...
##                    OR     2.5 %    97.5 %
## (Intercept) 0.4011598 0.1600941 0.9480293
## race        2.7101653 1.5107243 5.0532378
## age         1.0553499 1.0183255 1.0980785

Answer Q4:

race: The odds ratio for race is 2.71 (95% CI 1.51 to 5.05). Holding age constant, the predicted odds of an arrestee’s name appearing in a police database for a previous arrest, conviction, or parole are 2.71 times as large for Black arrestees as for White arrestees. Because the confidence interval excludes 1, this difference is statistically distinguishable from no association at the 5% level.
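
The step from the fitted model to the odds ratio can be written out as follows. The notation is an assumption introduced here for illustration: p is the probability that checksbinary equals 1, race = 1 denotes the group the answer above compares against race = 0, and a is any fixed age.

```latex
\[
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\text{race} + \beta_2\,\text{age}
\]
\[
\mathrm{OR}_{\text{race}}
  = \frac{\text{odds}(\text{race}=1,\ \text{age}=a)}{\text{odds}(\text{race}=0,\ \text{age}=a)}
  = \frac{e^{\beta_0+\beta_1+\beta_2 a}}{e^{\beta_0+\beta_2 a}}
  = e^{\beta_1} \approx 2.71
\]
```

The age term cancels in the ratio, which is why the comparison holds at any fixed age and why exponentiating the race coefficient gives the odds ratio reported in the output.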