Load and mutate data

To begin our exploratory analysis, we load our data set and mutate it so that we can investigate all predictors collectively and compare them.

data <- d2 %>%
  select(-c("force2", "pct")) %>%
  mutate_at(vars(all_of(yes_no_columns)), ~ ifelse(. == "N", 0, 1)) %>%
  mutate(race2 = case_when(
    race2 == "white"    ~ 0,
    race2 == "black"    ~ 1,
    race2 == "hisp"     ~ 2,
    race2 == "asian"    ~ 3,
    race2 == "other"    ~ 4
  )) %>%
  mutate(typeofid2  = case_when(
    typeofid2 == "O"    ~ 0,
    typeofid2 == "P"    ~ 1,
    typeofid2 == "R"    ~ 2,
    typeofid2 == "V"    ~ 3,
  )) %>%
  mutate(across(all_of(factor_columns), as.factor)) %>%
  mutate(year = as.numeric(year))

summary(data)
##      force          race2          gender             age2       daytime       
##  0      :3910239   0   : 492688   0   :4495436   Min.   :10      0   :2472074  
##  1      : 736051   1   :2886843   1   : 348159   1st Qu.:19      1   :1871287  
##  3      : 156732   2   :1215401   NA's: 140796   Median :24      NA's: 641030  
##  2      : 122898   3   : 152053                  Mean   :28                    
##  5      :  31590   4   : 235105                  3rd Qu.:34                    
##  6      :  17769   NA's:   2301                  Max.   :90                    
##  (Other):   9112                                 NA's   :41534                 
##   inout2        ac_incid    ac_time     offunif2       typeofid2     
##  0   :3809789   0:2204888   0:3158281   0   :1396968   0   :  84217  
##  1   :1147150   1:2779503   1:1826110   1   :3586909   1   :2640092  
##  NA's:  27452                           NA's:    514   2   : 107497  
##                                                        3   :2133368  
##                                                        NA's:  19217  
##                                                                      
##                                                                      
##  othpers2       cs_objcs    cs_descr    cs_casng    cs_lkout    cs_cloth   
##  0   :3806808   0:4850690   0:4116587   0:3562617   0:4151151   0:4773826  
##  1   :1163298   1: 133701   1: 867804   1:1421774   1: 833240   1: 210565  
##  NA's:  14285                                                              
##                                                                            
##                                                                            
##                                                                            
##                                                                            
##  cs_drgtr    cs_furtv    cs_vcrim    cs_bulge    cs_other         year     
##  0:4522192   0:2801105   0:4588325   0:4543022   0:3944522   Min.   :2003  
##  1: 462199   1:2183286   1: 396066   1: 441369   1:1039869   1st Qu.:2006  
##                                                              Median :2009  
##                                                              Mean   :2008  
##                                                              3rd Qu.:2011  
##                                                              Max.   :2013  
## 

Full data set exploratory analysis

Next we generate bar plots for the predictors characterized as factors and histograms for the continuous predictors to check for the number of missing values, skewness and extreme values.
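
The plots themselves are not reproduced here; the following is a minimal sketch of how they could be generated with ggplot2, assuming the data set data and the column vector factor_columns from above:

# one bar-plot panel per factor predictor
data %>%
  select(all_of(factor_columns)) %>%
  mutate(across(everything(), as.character)) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_bar(fill = "darkseagreen") +
  facet_wrap(~ variable, scales = "free_x")

# histogram of the continuous predictor age2
ggplot(data, aes(x = age2)) +
  geom_histogram(binwidth = 1, fill = "cadetblue")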

First of all, force is exceptionally skewed; however, one could combine all types of force greater than zero, generating a binary variable that could be used in a generalized linear model with a binomial family (spoiler alert).

Most of the other variables are binary due to the mutation of our data set. For these we cannot discuss skewness; however, some of the variables take significantly more values of 0 than 1, for example cs_objcs and cs_cloth. This may not directly cause any issues, but it is worth noting.

The number of missing values varies from none to a significant amount, as seen in the variable daytime. We'll account for these later on.

As age2 is a continuous variable, it is plotted as a histogram. There is nothing eye-catching about the distribution in terms of skewness or extreme values, though we do note that the police tend to have stopped more younger than older people. This becomes very clear in the violin plot in which the subjects' age is stratified on the categorical types of force. There is nothing concerning about this relation either.

The last variable, year, is for now also continuous and thus plotted as a histogram. Again there is nothing eye-catching about the distribution in terms of skewness or extreme values, only an indication of the half-year periods in 2003 and 2013 due to a change in government. However, when looking at the corresponding violin plot, we observe an extreme amount of type-5 force being used in 2006. This could be a recording mistake, but we cannot draw any conclusions, only note the possibility.
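
A corresponding sketch of the violin plots described above:

# subjects' age stratified on the categorical types of force
ggplot(na.omit(data[, c("force", "age2")]), aes(x = force, y = age2)) +
  geom_violin(fill = "darkseagreen2")

# year stratified on the categorical types of force
ggplot(na.omit(data[, c("force", "year")]), aes(x = force, y = year)) +
  geom_violin(fill = "darkseagreen2")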

Prior to initiating our data analysis, it is convenient to search for variables that would be strong predictors. By that we mean a numerically large correlation coefficient with the response variable, which in this case will be force. Furthermore, we would like to avoid using predictors that have a high correlation with other predictors, as these will most likely have a similar effect on the response variable. To get an idea of how the variables correlate, we plot them.

cp <- cor(data.matrix(na.omit(data)), method = "spearman")

ord <- rev(hclust(as.dist(1-abs(cp)))$order)

colPal <- colorRampPalette(c("white", "darkseagreen4"), space = "rgb")(100)

spearman <- levelplot(cp[ord, ord], xlab = "", ylab = "",
col.regions = colPal, at = seq(-1, 1, length.out = 100),
colorkey = list(space = "top", labels = list(cex = 0.5)),
scales = list(x = list(rot = 45),
y = list(rot = 45),
cex = 0.5))

spearman

Unfortunately, the Spearman plot at first glance does not show any strong correlations at all. However, the plot by construction clusters variables with numerically higher mutual correlations next to each other, so we can read off that the variables placed next to force are the ones most associated with it, even though even these correlations are very low. It also seems noteworthy that a small cluster of variables is quite strongly correlated internally, compared to the correlations between the other variables.

We now define a new data set where force is 0 if there is no force, and 1 if there is any kind of force. This is because we later want to fit a binomial model.
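
The binarization itself is straightforward; the same construction is used again in the missing-values analysis below:

bin_data <- data %>%
  mutate(force = as.numeric(as.character(force)),
         bin_force = as.factor(ifelse(force > 0, 1, 0))) %>%
  select(-force)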

table(data$force, data$year)
##    
##       2003   2004   2005   2006   2007   2008   2009   2010   2011   2012
##   0 119564 246398 312773 406541 365744 414700 439271 463337 540799 443857
##   1  21225  41392  56400  69177  75406  91045 103625  97624 102471  57803
##   2   8134  10412  12542    653  13097  14096  16467  16444  16978  10099
##   3   7938  10637  11812  11366  14079  16265  17690  19265  21076  17697
##   4    912   1010    937    631    612    496    481    498    641    712
##   5   1166   1579   1684  16069   1369   1792   1831   2128   1845   1418
##   6   1787   1954   1808   1868   1592   1747   1609   1807   1730   1215
##   7    125    141    235    184    197    161    194    182    184    110
##    
##       2013
##   0 157255
##   1  19883
##   2   3976
##   3   8907
##   4    435
##   5    709
##   6    652
##   7     34

There are not many force-4 observations. Perhaps it is not a big step for the police to go from "4: at least drawing a weapon on a civilian" to "5: at least pushing a civilian to the ground", which supports the choice of joining some force levels and fitting a binomial model.

Data analysis

Our scientific question

…..

Model data

….

Analysis of missing values

How can we replace the missing values? Maybe we should do this only for the columns we are interested in and not all of them.

Diagnostic NA plot on the whole data set??

library(mice)   # md.pattern(), mice(), complete(), pool()
library(VIM)    # aggr()
# NAdata: the subset of columns whose missingness we investigate (defined outside the shown code)
NAplot <- md.pattern(NAdata,
                     rotate.names = TRUE)

pMiss <- function(x){sum(is.na(x))/length(x)*100}
apply(NAdata,2,pMiss)
##       race2      gender        age2     daytime      inout2    offunif2 
##  0.04616412  2.82473827  0.83328134 12.86074869  0.55075936  0.01031219 
##   typeofid2    othpers2 
##  0.38554359  0.28659469
#apply(modeldata,1,pMiss)

aggr_plot <- aggr(NAdata, col = c('cadetblue', 'darkseagreen2'),
                  numbers = TRUE,
                  sortVars = TRUE,
                  labels = names(NAdata),
                  cex.axis = .7, gap = 3,
                  ylab = c("Histogram of missing data", "Pattern"))

## 
##  Variables sorted by number of missings: 
##  Variable        Count
##    inout2 0.1286074869
##      age2 0.0282473827
##   daytime 0.0083328134
##  ac_incid 0.0055075936
##  cs_furtv 0.0038554359
##  cs_lkout 0.0028659469
##     force 0.0004616412
##  offunif2 0.0001031219

From the histogram of the missing values we notice that one predictor is missing an especially large share of values: daytime is missing \(12.86\%\). This large amount of missing values implies that we cannot directly assume they are missing at random. The other predictors are missing up to \(3\%\) of their values, which means that we can in general assume them to be at least missing at random. Thus we focus on analyzing how the values of daytime are missing.

The missing values could be problematic if there is a correlation with the response variable. Ignoring the missing values, we calculate the correlation between force and daytime to be:

non_na_data <- data[complete.cases(data$daytime), ]
non_na_data$force <- as.numeric(as.character(non_na_data$force))
non_na_data$daytime <- as.numeric(as.character(non_na_data$daytime))
cor(non_na_data$force, non_na_data$daytime)
## [1] -0.01773452

We note no correlation between force and daytime, and we investigate further: given the nature of the data set, we suspect that daytime should be correlated with ac_time when stratifying on the binary force:

bin_data <- data %>%
  mutate(force = as.numeric(as.character(force)),
         bin_force = ifelse(force > 0, 1, 0)) %>%
  select(-force) %>%
  mutate(bin_force = as.factor(bin_force))

correlations <- numeric()

non_na_data <- bin_data[complete.cases(bin_data$daytime, bin_data$ac_time), ]
force_levels <- unique(non_na_data$bin_force)


non_na_data$daytime <- as.numeric(as.character(non_na_data$daytime))
non_na_data$ac_time <- as.numeric(as.character(non_na_data$ac_time))

for (level in force_levels) {
   subset_data <- non_na_data[non_na_data$bin_force == level, ]
   correlation <- cor(subset_data$ac_time, subset_data$daytime, method = "pearson")
   correlations <- c(correlations, correlation)
 }

correlations
## [1] -0.08107143 -0.08215116

Firstly, we note a weak correlation between daytime and ac_time. Secondly, we do not observe a systematic difference in this correlation across the two levels of bin_force. Together with the previous calculation, this adds to our suspicion that the values are missing at random.

Lastly we consider the cross tabulation between year and daytime:

cross_tab <- table(data$year, data$daytime, useNA = "ifany")
row_sums <- rowSums(cross_tab, na.rm = TRUE)
cross_tab_with_sums <- cbind(cross_tab, row_sums)
 
cross_tab_with_sums
##           0      1   <NA> row_sums
## 2003 100148  60686     17   160851
## 2004 201078 112376     69   313523
## 2005 250292 147348    551   398191
## 2006 243565 185416  77508   506489
## 2007 232487 162968  76641   472096
## 2008 261735 191513  87054   540302
## 2009 266595 228727  85846   581168
## 2010 278109 225644  97532   601285
## 2011 307707 275828 102189   685724
## 2012 242830 208138  81943   532911
## 2013  87528  72643  31680   191851

From this we note that in 2003, 2004 and 2005 not many daytime values were missing. We therefore look at the ratio of the predictor's values without considering the missing values:

non_na_data <- data[complete.cases(data$daytime, data$year), ]
non_na_data$daytime <- as.numeric(as.character(non_na_data$daytime))
non_na_data$year <- as.numeric(as.character(non_na_data$year))
cross_tab_year <- table(non_na_data$year, non_na_data$daytime)
 
normalized_cross_tab_year <- cbind(cross_tab_year[,1]/rowSums(cross_tab_year), cross_tab_year[,2]/rowSums(cross_tab_year))
 
normalized_cross_tab_year
##           [,1]      [,2]
## 2003 0.6226793 0.3773207
## 2004 0.6414913 0.3585087
## 2005 0.6294437 0.3705563
## 2006 0.5677757 0.4322243
## 2007 0.5878975 0.4121025
## 2008 0.5774653 0.4225347
## 2009 0.5382256 0.4617744
## 2010 0.5520741 0.4479259
## 2011 0.5273154 0.4726846
## 2012 0.5384639 0.4615361
## 2013 0.5464660 0.4535340

We note that the ratios for the years with few missing values are quite different from those for the years with many missing values. This could indicate that the values are missing not at random. In general we note that from 2006 onward we have fewer observations of daytime being 0. The police officers might be more likely to record "daytime" values accurately while underreporting or omitting "nighttime" values because:

  1. The police officer could have a hard time evaluating whether the stop happened at night or during the day. Researching the "Stop and Frisk" program, we find that nighttime is defined as the time slot from 6 pm until 6 am, and likewise daytime runs from 6 am until 6 pm. A stop, particularly one where force was used, might last a couple of hours, in which case it can be difficult to differentiate between night and day. When the stop spans the switch from night to day or vice versa, the officer might leave the field empty. Presumably most such stops start from 6 pm onward rather than from 6 am onward.

  2. The police could be understaffed at night and thus might not prioritize recording certain variables while patrolling as much as officers do during the day. Furthermore, tiredness at night could make officers forget to note the variables when done patrolling.

  3. The types and rates of criminal activity can vary by time of day. Some crimes, and thus some uses of force, may be more common at certain times, so officers may conduct more stop-and-frisk procedures during these times. From a cross tabulation of force against daytime (sketched below) we see that the force used at night tends to be stronger than during the day.

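A minimal sketch of the tabulation referred to in point 3, assuming (as the surrounding argument implies) that 0 codes nighttime and 1 daytime:

# share of each force level within night (0) and day (1) stops
round(prop.table(table(data$force, data$daytime), margin = 2), 4)
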
In conclusion, the values in daytime must be missing not at random, and to avoid bias in the data set we must impute them. By the arguments above it seems most reasonable to impute the NA's with 0's. Doing so, we get the new normalized cross tabulation with the following ratios:

data_daytime <- data %>% 
  mutate(daytime = as.numeric(ifelse(is.na(daytime), 0, as.character(daytime))))

 
data_daytime$year <- as.numeric(as.character(data_daytime$year))
 
 
cross_tab_year <- table(data_daytime$year, data_daytime$daytime)
 
normalized_cross_tab_year <- cbind(cross_tab_year[,1]/rowSums(cross_tab_year), cross_tab_year[,2]/rowSums(cross_tab_year))

normalized_cross_tab_year
##           [,1]      [,2]
## 2003 0.6227192 0.3772808
## 2004 0.6415702 0.3584298
## 2005 0.6299565 0.3700435
## 2006 0.6339190 0.3660810
## 2007 0.6547990 0.3452010
## 2008 0.6455445 0.3544555
## 2009 0.6064357 0.3935643
## 2010 0.6247304 0.3752696
## 2011 0.5977565 0.4022435
## 2012 0.6094320 0.3905680
## 2013 0.6213572 0.3786428

This gives an indication that the imputation may not be unreasonable, and we move forward with this choice.

We know in general that when variables are missing less than \(5\%\) of their values we can assume them to be missing at random. Thus we are safe to impute the missing values by imputing the most probable value for discrete variables and the conditional mean \(\mathbb{E}(X_{z^c} \mid X_z)\) for continuous variables (p. 99). For this reason we impute the missing values using MICE imputation with the pmm method.

Imputed data model fitting

We impute our missing data using the mice package. With \(m = 5\) imputations and 50 iterations of the chained-equations algorithm, we generate five completed data sets using the PMM (predictive mean matching) method, which is appropriate as the data is assumed missing at random.

modeldata_imp <- mice(modeldata,m=5,maxit=50,meth='pmm',seed=500)
summary(modeldata_imp)

# inspect the imputed daytime values
modeldata_imp$imp$daytime

# complete() writes the imputed values back into the data,
# one completed data set per imputation
mice1 <- complete(modeldata_imp, 1)
mice2 <- complete(modeldata_imp, 2)
mice3 <- complete(modeldata_imp, 3)
mice4 <- complete(modeldata_imp, 4)
mice5 <- complete(modeldata_imp, 5)

Histogram of imputed data

data_daytime <- data %>% 
  mutate(daytime = as.numeric(ifelse(is.na(daytime), 0, as.character(daytime))))


# replace the mice-imputed daytime with the 0-imputation chosen above
mice1$daytime <- data_daytime$daytime
mice2$daytime <- data_daytime$daytime
mice3$daytime <- data_daytime$daytime
mice4$daytime <- data_daytime$daytime
mice5$daytime <- data_daytime$daytime
df_orig <- data.frame(mice1$daytime, modeldata_orig$daytime)
df_orig_long <- df_orig %>%
  pivot_longer(everything(), names_to = "Dataset", values_to = "BinaryValue")



p_orig <- ggplot(df_orig_long, aes(x = BinaryValue, fill = Dataset)) +
  geom_bar(position = "dodge") +
  scale_fill_manual(values = c("mice1.daytime" = "darkseagreen", "modeldata_orig.daytime" = "darkseagreen2")) +
  labs(fill = "Dataset") +
  theme(legend.position = "top", text = element_text(size = 7))

p_orig
## Warning: Removed 641030 rows containing non-finite values (`stat_count()`).

Now we are ready to begin the model selection and model fitting. To answer our scientific question, we'll need the following predictors in our model:

The setting predictors: daytime, inout2, ac_incid and offunif2.

The appearance predictors: cs_furtv and cs_lkout.

We add a filter on age2 to only include children and young people under the age of 21, in order to investigate our group of interest.

To further investigate the differences between the imputed data sets, we use glm() to fit five additive GLMs, one for each imputed data set, and use ggplot2 to plot the estimates for all five models.

mice1$dataset <- "Dataset 1"
mice2$dataset <- "Dataset 2"
mice3$dataset <- "Dataset 3"
mice4$dataset <- "Dataset 4"
mice5$dataset <- "Dataset 5"

combined_data <- rbind(mice1, mice2, mice3, mice4, mice5)

models <- split(combined_data, combined_data$dataset) %>% 
  map(~ glm(bin_force ~ age2 + daytime + inout2 + ac_incid + offunif2 + cs_furtv + cs_lkout, data = ., family = binomial(link=logit)))

plot_data <- map_dfr(models, broom::tidy, .id = "dataset")

ggplot(plot_data, aes(x = term, y = estimate, color = dataset)) +
  geom_point() +
  geom_errorbar(aes(ymin = estimate - std.error, ymax = estimate + std.error), width = 0.2) +
  theme_minimal() + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust=1))

Overall, the estimates generated from each imputed data set look quite similar, implying that the data sets are similar.

Analysis on imputed datasets

The pool() function averages the estimates of the complete-data models and computes the total variance over the repeated analyses by Rubin's rules, thus generating a collective model based on the five imputed data sets.
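
Concretely, with \(m = 5\) imputations yielding estimates \(\hat{Q}_1, \dots, \hat{Q}_m\) with estimated variances \(\hat{U}_1, \dots, \hat{U}_m\), Rubin's rules give the pooled estimate and total variance

\[
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
T = \bar{U} + \Bigl(1 + \frac{1}{m}\Bigr)B,
\]

where \(\bar{U}\) is the average within-imputation variance and \(B = \frac{1}{m-1}\sum_{i=1}^{m}(\hat{Q}_i - \bar{Q})^2\) is the between-imputation variance.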

fits <- lapply(list(mice1, mice2, mice3, mice4, mice5), function(data) {
  glm(bin_force ~ age2 + daytime + inout2 + ac_incid + offunif2 + cs_furtv + cs_lkout, 
      family = binomial(link = logit), data = data) })

pooled_results <- pool(fits)

est1 <- summary(fits[[1]])$coefficients[,1]
est2 <- summary(fits[[2]])$coefficients[,1]
est3 <- summary(fits[[3]])$coefficients[,1]
est4 <- summary(fits[[4]])$coefficients[,1]
est5 <- summary(fits[[5]])$coefficients[,1]
pool <- summary(pooled_results)$estimate

cbind(est1,est2,est3,est4,est5,pool)
##                     est1         est2         est3        est4       est5
## (Intercept) -0.807746814 -0.823576721 -0.813026249 -0.80583133 -0.8159651
## age2        -0.003226702 -0.002346712 -0.002929168 -0.00315706 -0.0027254
## daytime1    -0.128283817 -0.127403779 -0.127286830 -0.12924247 -0.1263838
## inout21     -0.303220637 -0.302769550 -0.302702782 -0.30531306 -0.3033441
## ac_incid1   -0.185536198 -0.185526750 -0.185865209 -0.18578735 -0.1859498
## offunif21   -0.343892268 -0.343060674 -0.342257874 -0.34369958 -0.3433301
## cs_furtv1    0.387949172  0.387629758  0.387066478  0.38560910  0.3866447
## cs_lkout1   -0.252385506 -0.252313212 -0.252172326 -0.25266172 -0.2528592
##                     pool
## (Intercept) -0.813229243
## age2        -0.002877008
## daytime1    -0.127720141
## inout21     -0.303470028
## ac_incid1   -0.185733056
## offunif21   -0.343248097
## cs_furtv1    0.386979839
## cs_lkout1   -0.252478398

When looking at the estimates for each imputed data set next to the pooled estimates, we again see that they are quite similar.

coefficients_matrix <- cbind(est1, est2, est3, est4, est5)

# element-wise median of each coefficient across the five data sets
median_coefficients <- apply(coefficients_matrix, 1, median)

# the data set whose estimates are closest (in L1 distance) to those medians
median_index <- which.min(colSums(abs(coefficients_matrix - median_coefficients)))

median_data_set <- paste0("est", median_index)

cat("The data set with median coefficients is:", median_data_set, "\n")
## The data set with median coefficients is: est1

Because the imputed data sets are quite similar, we could in theory just choose one. But just to be sure, let's choose the imputed data set whose estimates are closest to the median. This is est1, i.e. mice1.

Drop1 for additive model

To investigate whether some of our chosen predictors are redundant, we apply drop1() to the additive model fitted in the previous section. The drop1() function measures the significance of each predictor by comparing the full model with the nested model in which that predictor is dropped, assessing significance with a likelihood-ratio test.
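
For each dropped predictor, the reported LRT statistic is the increase in deviance relative to the full model,

\[
\mathrm{LRT} = D_{\text{reduced}} - D_{\text{full}},
\]

which is asymptotically \(\chi^2\)-distributed with degrees of freedom equal to the number of parameters removed.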

add_model <- glm(bin_force ~ age2 + daytime + inout2 + ac_incid + offunif2 + cs_furtv + cs_lkout, 
      family = binomial(link = logit), data = mice1)

drop1(add_model, test = "Chisq")
## Single term deletions
## 
## Model:
## bin_force ~ age2 + daytime + inout2 + ac_incid + offunif2 + cs_furtv + 
##     cs_lkout
##          Df Deviance     AIC     LRT  Pr(>Chi)    
## <none>       1764231 1764247                      
## age2      1  1764241 1764255    10.1  0.001449 ** 
## daytime   1  1765323 1765337  1092.4 < 2.2e-16 ***
## inout2    1  1768378 1768392  4147.5 < 2.2e-16 ***
## ac_incid  1  1766654 1766668  2423.6 < 2.2e-16 ***
## offunif2  1  1771520 1771534  7289.3 < 2.2e-16 ***
## cs_furtv  1  1774981 1774995 10749.9 < 2.2e-16 ***
## cs_lkout  1  1766724 1766738  2493.0 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We can conclude that all the predictors are significant, though age2 is slightly less significant than the others. This is not alarming, as the main purpose of the model is to investigate how the setting and appearance predictors interact with each other across the subject's age, increasing or decreasing the risk of the police using force.

Training error and cross validation on different forms

With our chosen imputed data set, we start the actual model fitting. As we aim to predict the risk of use of force, we use a logistic regression model. The response variable is the probability of any use of force against the civilian, based only on our parameters of interest. We initially fit the logistic regression as an additive model, as mentioned above. In order to answer our question, however, we find the interactions between combinations of a setting and an appearance predictor relevant for interpretation. In addition we test nonlinear behaviour of age2 as a predictor.
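
That is, with \(p = P(\texttt{bin\_force} = 1 \mid x)\), the additive model takes the form

\[
\log\frac{p}{1-p} = \beta_0 + \beta_1\,\texttt{age2} + \beta_2\,\texttt{daytime} + \beta_3\,\texttt{inout2} + \beta_4\,\texttt{ac\_incid} + \beta_5\,\texttt{offunif2} + \beta_6\,\texttt{cs\_furtv} + \beta_7\,\texttt{cs\_lkout},
\]

and the interaction models add products of a setting and an appearance predictor to this linear predictor.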

To evaluate both the interaction models and the nonlinear modelling, an ANOVA would be useful, as performed above for the additive model. However, as the interaction models with different combinations are not nested, an ANOVA is not applicable. In addition, we test almost every combination, which would yield a new prediction for every specific combination and saturate the model, putting it at risk of overfitting. To guard against that, we calculate the training error, perform cross-validation, and evaluate the models on a combination of the two.

We calculate the training error based on the squared deviance residuals, to obtain comparable values that express the models' goodness-of-fit.
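
For a binary response the squared deviance residual of observation \(i\) reduces to

\[
r_i^2 = -2\bigl(y_i \log \hat{\mu}_i + (1 - y_i)\log(1 - \hat{\mu}_i)\bigr),
\]

which is exactly the quantity averaged by trainErr below and inside the cross-validation function further down.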

trainErr <- function(mod) {
  mean(residuals(mod, type = "deviance") ^ 2)
}

form0 <- bin_force ~ age2 + daytime + inout2 + ac_incid + offunif2 + cs_lkout + cs_furtv


form1 <- bin_force ~ age2*daytime + inout2 + ac_incid + offunif2 + cs_lkout + cs_furtv

form2 <- bin_force ~ age2 + daytime*cs_lkout + inout2 + ac_incid + offunif2 + cs_furtv

form3 <- bin_force ~ age2 + daytime*cs_furtv + inout2 + ac_incid + offunif2 + cs_lkout


form4 <- bin_force ~ inout2*age2 + daytime + ac_incid + offunif2 + cs_lkout + cs_furtv

form5 <- bin_force ~ age2 + daytime + ac_incid + offunif2 + inout2*cs_lkout + cs_furtv

form6 <- bin_force ~ age2 + daytime + ac_incid + offunif2 + cs_lkout + inout2*cs_furtv


form7 <- bin_force ~ age2*ac_incid + daytime + inout2 + offunif2 + cs_lkout + cs_furtv

form8 <- bin_force ~ age2 + daytime + inout2 + offunif2 + cs_lkout*ac_incid + cs_furtv

form9 <- bin_force ~ age2 + daytime + inout2 + offunif2 + cs_lkout + cs_furtv*ac_incid


form10 <- bin_force ~ age2*offunif2 + daytime + inout2 + ac_incid + cs_lkout + cs_furtv

form11 <- bin_force ~ age2 + daytime + inout2 + ac_incid + cs_lkout*offunif2 + cs_furtv

form12 <- bin_force ~ age2 + daytime + inout2 + ac_incid + cs_lkout + cs_furtv*offunif2 


form13 <- bin_force ~ age2 + daytime + inout2 + ac_incid + cs_lkout*cs_furtv*offunif2 

form14 <- bin_force ~ age2 + daytime + inout2 + cs_furtv*offunif2*ac_incid + cs_lkout 

form15 <- bin_force ~ age2 + daytime + cs_furtv*offunif2*inout2 + ac_incid + cs_lkout 

form16 <- bin_force ~ age2 + cs_furtv*offunif2*daytime + inout2 + ac_incid + cs_lkout 

form17 <- bin_force ~ cs_furtv*offunif2*age2 + daytime + inout2 + ac_incid + cs_lkout 

form18 <- bin_force ~ age2*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2)

form19 <- bin_force ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2)

form20 <- bin_force ~ ns(age2, df = 4)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2)


sapply(list(form0, form1, form2, form3, form4, form5, form6, form7, form8, form9, form10, form11, form12, form13, form14, form15, form16, form17, form18, form19, form20), function(form){trainErr(glm(form, data = mice1, family = binomial(link = "logit")))})
##  [1] 1.077201 1.077201 1.077201 1.077170 1.077191 1.077001 1.077096 1.077189
##  [9] 1.076870 1.076295 1.077201 1.077124 1.077169 1.076193 1.076080 1.076179
## [17] 1.077121 1.077128 1.075487 1.075440 1.075403

We perform 5-fold cross-validation, to test the ability of \(80\%\) of the data to predict the remaining responses. This step is especially important for our interaction models, in order to detect overfitting.

cv_data <- mice1 %>%
  mutate(bin_force = as.numeric(as.character(bin_force)))


testCV <- function(form, data, B = 1, k = 5) {
  n <- nrow(data)
  y <- data$bin_force
  PEcv <- vector("list", B)
  tmp <- numeric(n)
  for (b in 1:B) {
    # random partition into k folds
    group <- sample(rep(1:k, length.out = n))
    for (i in 1:k) {
      # fit on the k - 1 training folds
      modelcv <- glm(form, family = binomial("logit"), data = data[group != i, ])
      # "response" applies the inverse link, giving fitted probabilities
      muhat <- predict(modelcv, newdata = data[group == i, ], type = "response")
      y_cv <- y[group == i]
      # pointwise binomial deviance on the held-out fold
      tmp[group == i] <- -2 * (y_cv * log(muhat) + (1 - y_cv) * log(1 - muhat))
    }
    PEcv[[b]] <- tmp
  }
  mean(unlist(PEcv))
}



sapply(list(form0,form1, form2, form3, form4, form5, form6, form7, form8, form9, form10, form11, form12, form13, form14, form15, form16, form17, form18, form19, form20), function(form){testCV(form, cv_data)})
##  [1] 1.077212 1.077218 1.077215 1.077183 1.077201 1.077015 1.077111 1.077205
##  [9] 1.076883 1.076309 1.077218 1.077144 1.077182 1.076210 1.076094 1.076201
## [17] 1.077136 1.077148 1.075531 1.075492 1.075507

From the respective outputs for the training error and the cross-validation, we find that the models with both the best goodness-of-fit and the best predictive strength are those in which each setting predictor interacts with each appearance predictor (form18 to form20). In addition, the natural cubic spline for age seems to fit best with \(df = 2\).

Interaction model

Given the above model selection, we choose the interaction model below, in which the natural cubic spline of age interacts with the appearance and setting predictors, to predict the risk of force being used against adolescents under the age of 21.

int_model <- glm(bin_force ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), 
      family = binomial(link = logit), data = mice1)

int_model_no_spl <- glm(bin_force ~ age2*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), 
      family = binomial(link = logit), data = mice1)

anova(int_model_no_spl, int_model, test = "Chisq")

summary(int_model)
## 
## Call:
## glm(formula = bin_force ~ ns(age2, df = 2) * (cs_furtv + cs_lkout) * 
##     (daytime + inout2 + ac_incid + offunif2), family = binomial(link = logit), 
##     data = mice1)
## 
## Coefficients:
##                                         Estimate Std. Error z value Pr(>|z|)
## (Intercept)                           -0.7970016  0.0676139 -11.788  < 2e-16
## ns(age2, df = 2)1                     -0.0053154  0.1322661  -0.040 0.967944
## ns(age2, df = 2)2                     -0.0389032  0.0219268  -1.774 0.076024
## cs_furtv1                              0.2167204  0.0954121   2.271 0.023122
## cs_lkout1                             -0.2887708  0.1432189  -2.016 0.043770
## daytime1                              -0.1071533  0.0560950  -1.910 0.056106
## inout21                               -0.3699098  0.0661220  -5.594 2.21e-08
## ac_incid1                             -0.5360686  0.0571895  -9.374  < 2e-16
## offunif21                             -0.2926986  0.0635063  -4.609 4.05e-06
## ns(age2, df = 2)1:cs_furtv1            0.0925676  0.1859860   0.498 0.618687
## ns(age2, df = 2)2:cs_furtv1            0.0408119  0.0300307   1.359 0.174144
## ns(age2, df = 2)1:cs_lkout1           -0.2010623  0.2789310  -0.721 0.471013
## ns(age2, df = 2)2:cs_lkout1            0.0133210  0.0444565   0.300 0.764451
## ns(age2, df = 2)1:daytime1             0.0077852  0.1102690   0.071 0.943715
## ns(age2, df = 2)2:daytime1            -0.0140456  0.0191767  -0.732 0.463905
## ns(age2, df = 2)1:inout21             -0.0320653  0.1301148  -0.246 0.805343
## ns(age2, df = 2)2:inout21             -0.0704241  0.0221712  -3.176 0.001491
## ns(age2, df = 2)1:ac_incid1            0.3390028  0.1121540   3.023 0.002506
## ns(age2, df = 2)2:ac_incid1            0.0478760  0.0187770   2.550 0.010781
## ns(age2, df = 2)1:offunif21            0.0025059  0.1244390   0.020 0.983934
## ns(age2, df = 2)2:offunif21           -0.0260953  0.0207084  -1.260 0.207622
## cs_furtv1:daytime1                    -0.0936443  0.0802496  -1.167 0.243245
## cs_furtv1:inout21                     -0.0099852  0.0993728  -0.100 0.919962
## cs_furtv1:ac_incid1                    0.2683137  0.0801009   3.350 0.000809
## cs_furtv1:offunif21                    0.0043342  0.0880025   0.049 0.960720
## cs_lkout1:daytime1                    -0.0060065  0.1143414  -0.053 0.958106
## cs_lkout1:inout21                      0.1851329  0.1402928   1.320 0.186963
## cs_lkout1:ac_incid1                    0.2112341  0.1213236   1.741 0.081670
## cs_lkout1:offunif21                   -0.2305807  0.1199082  -1.923 0.054483
## ns(age2, df = 2)1:cs_furtv1:daytime1   0.0917419  0.1573106   0.583 0.559766
## ns(age2, df = 2)2:cs_furtv1:daytime1   0.0039064  0.0264032   0.148 0.882380
## ns(age2, df = 2)1:cs_furtv1:inout21    0.2640421  0.1952074   1.353 0.176176
## ns(age2, df = 2)2:cs_furtv1:inout21    0.0344001  0.0324731   1.059 0.289445
## ns(age2, df = 2)1:cs_furtv1:ac_incid1  0.0304217  0.1567710   0.194 0.846135
## ns(age2, df = 2)2:cs_furtv1:ac_incid1 -0.0147499  0.0256836  -0.574 0.565770
## ns(age2, df = 2)1:cs_furtv1:offunif21 -0.1268912  0.1718594  -0.738 0.460306
## ns(age2, df = 2)2:cs_furtv1:offunif21  0.0519559  0.0278374   1.866 0.061985
## ns(age2, df = 2)1:cs_lkout1:daytime1   0.0038805  0.2235289   0.017 0.986149
## ns(age2, df = 2)2:cs_lkout1:daytime1  -0.0231603  0.0364504  -0.635 0.525174
## ns(age2, df = 2)1:cs_lkout1:inout21    0.1495648  0.2755562   0.543 0.587285
## ns(age2, df = 2)2:cs_lkout1:inout21   -0.0115620  0.0454262  -0.255 0.799092
## ns(age2, df = 2)1:cs_lkout1:ac_incid1  0.0778896  0.2371830   0.328 0.742613
## ns(age2, df = 2)2:cs_lkout1:ac_incid1  0.0167965  0.0382374   0.439 0.660466
## ns(age2, df = 2)1:cs_lkout1:offunif21  0.2214417  0.2339159   0.947 0.343806
## ns(age2, df = 2)2:cs_lkout1:offunif21 -0.0006706  0.0372765  -0.018 0.985646
##                                          
## (Intercept)                           ***
## ns(age2, df = 2)1                        
## ns(age2, df = 2)2                     .  
## cs_furtv1                             *  
## cs_lkout1                             *  
## daytime1                              .  
## inout21                               ***
## ac_incid1                             ***
## offunif21                             ***
## ns(age2, df = 2)1:cs_furtv1              
## ns(age2, df = 2)2:cs_furtv1              
## ns(age2, df = 2)1:cs_lkout1              
## ns(age2, df = 2)2:cs_lkout1              
## ns(age2, df = 2)1:daytime1               
## ns(age2, df = 2)2:daytime1               
## ns(age2, df = 2)1:inout21                
## ns(age2, df = 2)2:inout21             ** 
## ns(age2, df = 2)1:ac_incid1           ** 
## ns(age2, df = 2)2:ac_incid1           *  
## ns(age2, df = 2)1:offunif21              
## ns(age2, df = 2)2:offunif21              
## cs_furtv1:daytime1                       
## cs_furtv1:inout21                        
## cs_furtv1:ac_incid1                   ***
## cs_furtv1:offunif21                      
## cs_lkout1:daytime1                       
## cs_lkout1:inout21                        
## cs_lkout1:ac_incid1                   .  
## cs_lkout1:offunif21                   .  
## ns(age2, df = 2)1:cs_furtv1:daytime1     
## ns(age2, df = 2)2:cs_furtv1:daytime1     
## ns(age2, df = 2)1:cs_furtv1:inout21      
## ns(age2, df = 2)2:cs_furtv1:inout21      
## ns(age2, df = 2)1:cs_furtv1:ac_incid1    
## ns(age2, df = 2)2:cs_furtv1:ac_incid1    
## ns(age2, df = 2)1:cs_furtv1:offunif21    
## ns(age2, df = 2)2:cs_furtv1:offunif21 .  
## ns(age2, df = 2)1:cs_lkout1:daytime1     
## ns(age2, df = 2)2:cs_lkout1:daytime1     
## ns(age2, df = 2)1:cs_lkout1:inout21      
## ns(age2, df = 2)2:cs_lkout1:inout21      
## ns(age2, df = 2)1:cs_lkout1:ac_incid1    
## ns(age2, df = 2)2:cs_lkout1:ac_incid1    
## ns(age2, df = 2)1:cs_lkout1:offunif21    
## ns(age2, df = 2)2:cs_lkout1:offunif21    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1795133  on 1637790  degrees of freedom
## Residual deviance: 1761345  on 1637746  degrees of freedom
## AIC: 1761435
## 
## Number of Fisher Scoring iterations: 4

From our model, it is now possible to plot the predicted probability of force being used against adolescents.

predFrame0 <- expand.grid(cs_furtv = factor(c(0)),
                         daytime = factor(c(0)),
                         ac_incid = factor(1),
                         inout2 = factor(c(0)),
                         offunif2 = factor(c(0)),
                         cs_lkout = factor(c(1)),
                         age2 = 10:21)
# predict.glm returns point predictions only; the 'interval' argument is not supported
predFrame0$predicted_prob <- predict(int_model, type = "response",
                                     newdata = predFrame0)
p_pred0 <- ggplot(predFrame0, aes(x = age2, y = predicted_prob)) +
  facet_grid(cs_lkout + cs_furtv ~ ac_incid + daytime + inout2 + offunif2, label = label_both) +
  #geom_ribbon(aes(ymin = lwr, ymax = upr), fill = gray(0.85)) +
  geom_line(color = "cadetblue") + coord_cartesian(ylim = c(0.1,0.4)) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "firebrick3")

predFramecom <- expand.grid(cs_furtv = factor(c(1)),
                         daytime = factor(c(0)),
                         ac_incid = factor(1),
                         inout2 = factor(c(0)),
                         offunif2 = factor(c(0)),
                         cs_lkout = factor(c(0)),
                         age2 = 10:21)

predFramecom$predicted_prob <- predict(int_model, type = "response",
                                       newdata = predFramecom)

p_predcom <- ggplot(predFramecom, aes(x = age2, y = predicted_prob)) +
  facet_grid(cs_lkout + cs_furtv ~ ac_incid + daytime + inout2 + offunif2, label = label_both) +
  #geom_ribbon(aes(ymin = lwr, ymax = upr), fill = gray(0.85)) +
  geom_line(color = "cadetblue") + coord_cartesian(ylim = c(0.1,0.4)) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "firebrick3")

grid.arrange(p_pred0, p_predcom, ncol = 2)

Above is shown part of the prediction given by our model. The left graph shows, on the vertical axis, the probability of the police using force on a stopped civilian, with a red dashed line marking \(25\%\). It represents a stop at nighttime, outside, in a high-crime area, by an officer not in uniform, where the civilian appears to be on a lookout but shows no suspicious movements. We read off that the risk of force being used is approximately \(21\%\) and roughly uniform across ages 10 to 21. The right plot shows that in the same setting, where the civilian instead appears not to be on a lookout but acts with suspicious (furtive) movements, the risk of force increases drastically to above \(30\%\). Another interesting consequence of this change of appearance is that the risk now depends on age: it increases with age until around 18, after which it descends slightly. This second-order polynomial behaviour is due to our use of splines, which is justified exactly by their predictive strength.
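
As a quick check, the two probabilities read off the plots can also be computed directly from the fitted model; a minimal sketch at age 15 (any age in the range gives similar values for the left plot):

# left plot: lookout, no furtive movements
predict(int_model, type = "response",
        newdata = data.frame(age2 = 15,
                             cs_furtv = factor(0), cs_lkout = factor(1),
                             daytime = factor(0), inout2 = factor(0),
                             ac_incid = factor(1), offunif2 = factor(0)))

# right plot: furtive movements, no lookout
predict(int_model, type = "response",
        newdata = data.frame(age2 = 15,
                             cs_furtv = factor(1), cs_lkout = factor(0),
                             daytime = factor(0), inout2 = factor(0),
                             ac_incid = factor(1), offunif2 = factor(0)))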

predFrame1 <- expand.grid(cs_furtv = factor(c(0,1)),
                         ac_incid = factor(0),
                         daytime = factor(c(0,1)),
                         inout2 = factor(c(0,1)),
                         offunif2 = factor(c(0,1)),
                         cs_lkout = factor(c(0,1)),
                         age2 = 10:21)

predFrame1$predicted_prob <- predict(int_model, type = "response",
                                     newdata = predFrame1)

p_pred1 <- ggplot(predFrame1, aes(x = age2, y = predicted_prob)) +
  facet_grid(cs_lkout + cs_furtv ~ ac_incid + daytime + inout2 + offunif2, label = label_both) +
  #geom_ribbon(aes(ymin = lwr, ymax = upr), fill = gray(0.85)) +
  geom_line(color = "cadetblue") + coord_cartesian(ylim = c(0.1,0.4)) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "firebrick3")


predFrame2 <- expand.grid(cs_furtv = factor(c(0,1)),
                         ac_incid = factor(1),
                         daytime = factor(c(0,1)),
                         inout2 = factor(c(0,1)),
                         offunif2 = factor(c(0,1)),
                         cs_lkout = factor(c(0,1)),
                         age2 = 10:21)

predFrame2$predicted_prob <- predict(int_model, type = "response",
                                     newdata = predFrame2)

p_pred2 <- ggplot(predFrame2, aes(x = age2, y = predicted_prob)) +
  facet_grid(cs_lkout + cs_furtv ~ ac_incid + daytime + inout2 + offunif2, label = label_both) +
  #geom_ribbon(aes(ymin = lwr, ymax = upr), fill = gray(0.85)) +
  geom_line(color = "cadetblue") + coord_cartesian(ylim = c(0.1,0.4)) +
  geom_hline(yintercept = 0.25, linetype = "dashed", color = "firebrick3")

grid.arrange(p_pred1, p_pred2, ncol=1)

For all combinations of setting and appearance, we get the predictive plots, which are to be interpreted as above.

Residual plot of the final model

cv_data <- mice1 %>%
  mutate(bin_force = as.numeric(as.character(bin_force)))

final_model <- glm(bin_force ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), data = cv_data, family = binomial)

simDiag <- fortify(final_model)

ggplot(simDiag, aes(x = .fitted, y = .resid)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'

set.seed(2023)

# simulate four new binary response vectors from the fitted model
yNew1 <- simulate(final_model)[, 1]
yNew2 <- simulate(final_model)[, 1]
yNew3 <- simulate(final_model)[, 1]
yNew4 <- simulate(final_model)[, 1]


simGlmNew1 <- glm(yNew1 ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), data = cv_data, family = binomial(link = logit), control = glm.control(maxit = 1000))
simGlmNew2 <- glm(yNew2 ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), data = cv_data, family = binomial(link = logit), control = glm.control(maxit = 1000))
simGlmNew3 <- glm(yNew3 ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), data = cv_data, family = binomial(link = logit), control = glm.control(maxit = 1000))
simGlmNew4 <- glm(yNew4 ~ ns(age2, df = 2)*(cs_furtv + cs_lkout)*(daytime + inout2 + ac_incid + offunif2), data = cv_data, family = binomial(link = logit), control = glm.control(maxit = 1000))

simDiagNew1 <- fortify(simGlmNew1)
simDiagNew2 <- fortify(simGlmNew2)
simDiagNew3 <- fortify(simGlmNew3)
simDiagNew4 <- fortify(simGlmNew4)


p1 <- ggplot(simDiagNew1, aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()
p2 <- ggplot(simDiagNew2, aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()
p3 <- ggplot(simDiagNew3, aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()
p4 <- ggplot(simDiagNew4, aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth()


grid.arrange(p1,p2,p3,p4, ncol=2)
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
## `geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'