Author: Philip Abraham
Date: March 19, 2018
The prime objective of the analysis was to explore whether there was a statistically significant spike in the number of prostitution related arrests in Miami-Dade County (MDC), during the four days (Friday-Monday) centered around the eight Super Bowls held in MDC since 1971.
A linear regression, poisson regression model, regression tree and boosted regression trees were fitted to the available data. The number of prostitution arrests per each day of the weekend is the outcome variable. Predictor variables include three continuous variables - the seasonally adjusted monthly national unemployment rate, and the annual population of MDC, and the two binary indicator variables - Cuban Marielitos gang effect, and the Super Bowl event weekends.
Miami-Dade County (MDC) area Crime/Arrest records(1), population data(2), and the seasonally adjusted monthly national unemployment rates(3)were obtained for the period from 1971 to 2014. During that time period, the MDC area has hosted eight Super Bowls. The dates(4) are as follows:
17-Jan-71
18-Jan-76
21-Jan-79
22-Jan-89
29-Jan-95
31-Jan-99
4-Feb-07
7-Feb-10
The MDC data sets for crime, population and unemployment were merged together to form a dataframe consisting of 9032 separate observations
Prostitution arrests counts were aggregated by weekends from January 1971 to April 2014. The days - Friday, Saturday, Sunday and Monday were defined as the weekend for this analysis. Monday was included to capture the prostitution arrests made after midnights on Sundays. Table 1 shows a representative sample of the data set.
## Missing 'what' parameter. Defaulting to 'full'
| date | dayL | superbwl | unemprate_SA | popul_mil | mari | arrests | arrestPerMillion |
|---|---|---|---|---|---|---|---|
| 1971-01-01 | Fri | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-02 | Sat | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-03 | Sun | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-04 | Mon | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-08 | Fri | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-09 | Sat | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-10 | Sun | No | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-11 | Mon | No | 6 | 1.33 | No | 1 | 0.75 |
| 1971-01-15 | Fri | Yes | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-16 | Sat | Yes | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-17 | Sun | Yes | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-18 | Mon | Yes | 6 | 1.33 | No | 0 | 0.00 |
| 1971-01-22 | Fri | No | 6 | 1.33 | No | 2 | 1.51 |
| 1971-01-23 | Sat | No | 6 | 1.33 | No | 0 | 0.00 |
Figure 1 shows that the average number of weekend arrests per million people almost doubled during those eight weekends that MDC has hosted the Super Bowls compared with the non-Super Bowl weekends at MDC. But is that increase in the weekend arrest rate statistically significant?
Unemployment rate is colored in to see if there were any obvious patterns. It seems from the plot that arrests generally were lower during higher unemployment periods. Figure 2 shows the relationship between the national unemployment rate and Miami weekend prostitution arrests.
The MDC weekend prostitution arrest count drops as the unemployment rate increases from about four to six percent. However, the arrest rate stays roughly flat when unemployment rate goes over about six percent.
Looking at Figure 3, during the years 1999 and 2000, prostitution rates rises abruptly in MDC. The MDC area population reaches about 2.2 million. After that, the arrest rate takes a dive as the population increased in the MDC area with another smaller bump in arrests about 2010.
Figure 5 shown below splits the arrest count density between superbowl and non-superbowl weekends. For the weekend periods, it seems that generally Fridays tend to have the most prostitution arrest counts (lowest zero counts), followed by Saturdays and Sundays. Mondays, generally have the lowest rates (highest zero counts) of arrests in MDC.
The data was fitted with three different regression models to study the impact of Super Bowls on prostitution crime. Two linear and a tree-based model were examined:
1). Linear Multi-Regression
2). Poisson Regression
3). Boosted Tree
Since the objective of this study is to explore the effect of Super Bowl Event for the host city on the rate of prostitution activities in the area, the model selected has to be parsimonious, interpretable representations of the data that enhance that understanding. A multivariable regression model to fit the data set achieves that objective.
Weekend prostitution arrests(arrests) for the time period 1971 to 2014 is regressed against the seasonally adjusted monthly unemployment rate (unemprate_SA), population(popul_mil) and the categorical variables - Super Bowl weekend (superbwl) and the marielito effect(mari) with both categorical variables associated with two factor levels - Yes or No.
To get a visual sense of any collinearity between the predictor variables, and to see the overall correlations of the prostitution arrests with unemployment and the mdc population, a correlation matrix plot (Figure 4) was generated.
The MDC arrest variable seems to be slightly correlated with the umemployment and population. Arrest counts and unemployment rates are negatively correlated, whereas the arrest rate is positively correlated with the MDC population numbers. There doesn’t seem to be any connection between population and unemployment.
The associated linear regression model equation is:
\(arrest_{i} = \beta_0 + \beta_1 unemprateSA_{i} + \beta_2 popul_mil_{i} + \beta_3 superbwl_{i}+\beta_4 superbwl_{i}*unemprateSA_{i} + \beta_5 mari_{i}\)
An interaction term between superbwl and unemprate_SA was also included in the linear model.
mod_lr <- lm(arrests ~ superbwl*unemprate_SA + popul_mil + mari, data = mdc_sb)
summary(mod_lr)
##
## Call:
## lm(formula = arrests ~ superbwl * unemprate_SA + popul_mil +
## mari, data = mdc_sb)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.718 -3.106 -1.565 0.898 121.075
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.11318 4.39270 -1.392 0.16406
## superbwlNo 8.54620 4.37060 1.955 0.05057 .
## unemprate_SA 1.01190 0.67391 1.502 0.13325
## popul_mil 3.12583 0.18464 16.929 < 2e-16 ***
## mariNo -0.03216 0.45073 -0.071 0.94311
## superbwlNo:unemprate_SA -1.84406 0.67534 -2.731 0.00634 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.561 on 9026 degrees of freedom
## Multiple R-squared: 0.07157, Adjusted R-squared: 0.07106
## F-statistic: 139.2 on 5 and 9026 DF, p-value: < 2.2e-16
The inference for the linear regression model as a whole is as follows:
H0: \(\beta_1\) = \(\beta_2\) = \(\beta_3\) = \(\beta_4\) = \(\beta_5\) =0, and
HA: At least one \(\beta_i\) is different than 0.
The p-value associated with the F-statistic of 140 obtained from this model is close to zero. This means at least one of the \(\beta's\) is non-zero.
The adjusted r-squared value indicates that only about 7 percent of variability in the arrest variable is explained by the multivariable regression model.
The results above show that when unemployment,population and the marielito variables are accounted for in the model and held constant, then a non-Super Bowl event weekend on the average generates an additional nine arrests per weekend over a Super Bowl weekend in the MDC area. However, with a p-value of slightly over 0.05, non-Super Bowl weekends are not a statistically significant predictor of prostitution activity in MDC. The interaction term between non-Super Bowl and unemployment with a p-value <<0.05, is statistically significant in the regression model. Unemployment rates and the marielito effect are not significant factors for the arrest outcome. Though, as expected, the arrest counts for prostitution is strongly associated with the MDC population numbers.
The least-squares regression model assumes:
1. That the residuals (errors) are centered about the regression line.
2. Residuals are normally distributed.
3. The residual variance doesnât change as a function of the fitted values.
Model diagnostic plots are shown above. The upper left plot shows the residuals versus the fitted values. The red line is a smoothed curve that passes through the actual residuals, and it is relatively straight here. However, the red line deviates from the gray dashed line as the fitted values increase; indicating that the residuals are not centered on the regression line at the higher fitted values.
The upper right plot is a qqnorm() plot of the residuals. As mentioned above, one of the assumptions of a least-squares regression is that the errors are normally distributed. This plot evaluates that assumption. If the residuals were normally distributed, they would lie exactly on this line. There is a lot of deviation, particularly near the right ends on this plot, therefore indicating the non-normality of this dataset.
The vertical on the lower left plot display the residuals rescaled so that they have a mean of zero and a variance of one. The linear regression model assumes homoscedasticity, and that the variance in the residuals doesnât change as a function of the predictor variables. That assumption is violated here, because the red line shows the trend is going up.
Lastly, the lower right plot shows the standardized residuals against leverage. Note that the standardized residuals are not centered around zero and reach ten to twenty standard deviations positively away from zero. This would not be expected for a normal distribution. The contours plots shown of Cookâs distance, measures how much the regression would change if a point was deleted. On this plot, the red smoothed line stays close to the horizontal gray dashed line and most of the data points here have a large Cookâs distance < 0.5.
The diagnostic plots indicate that error distribution normality and constant residual variance assumptions of the least-squares regression model cannot be upheld here.
In Figure 5 below, the MDC weekend arrest count density reflects a positively skewed distribution.
There are two problems with applying an ordinary linear regression model to this positively skewed data. First, many observations in the data set has a value of zero. The high number of zeros in the data set prevents the transformation of a skewed distribution into a normal one. Second, it is quite possible that the linear regression model will produce negative predicted values, which do not exist in count data
Poisson regression models are generally used for count data. Since the outcome variable in this analysis is the number of arrests, a poisson regression model would be appropriate to fit this data.
The Poisson model assumes that the mean and variance of the errors are equal, but here that assumption is violated.
mean(mdc_sb$arrests)
## [1] 3.229296
var(mdc_sb$arrests)
## [1] 46.34416
The mean of arrests is substantially lower than the variance of arrest counts, indicating overdispersion. Therefore, a quasi-poisson approach, which is robust to overdispersion, is used to model this data. This approach leads to an estimated dispersion = 10.3, which is clearly larger than one, confirming that over-dispersion is present in the data. Utilizing the quasi-poisson approach leads to the same coefficient estimates as the standard Poisson model, but inference is adjusted for over-dispersion. An interaction term between superbwl and unemprate_SA was also included in the quasi-poisson regression model.
mod_qmle_i <- glm(arrests ~ superbwl*unemprate_SA +
popul_mil + mari, family="quasipoisson", data = mdc_sb)
summary(mod_qmle_i)
##
## Call:
## glm(formula = arrests ~ superbwl * unemprate_SA + popul_mil +
## mari, family = "quasipoisson", data = mdc_sb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.297 -2.145 -1.355 0.452 28.414
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.94981 0.76356 -1.244 0.213556
## superbwlNo 1.63217 0.74117 2.202 0.027679 *
## unemprate_SA 0.09063 0.10412 0.870 0.384068
## popul_mil 1.05013 0.05757 18.239 < 2e-16 ***
## mariNo 0.05792 0.18963 0.305 0.760053
## superbwlNo:unemprate_SA -0.37380 0.10502 -3.559 0.000374 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 10.26959)
##
## Null deviance: 69208 on 9031 degrees of freedom
## Residual deviance: 59486 on 9026 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Here, unlike the linear regression model, a non-Super Bowl event weekend, is statistically significant to the arrest count outcome. The population variable and the interaction term of superbwlNo:unemprate_SA are also statistically significant in this model. Since the parameter estimates are on the log scale, the parameters are exponentiated to obtain the values for the coefficients
exp(cbind(co=coef(mod_qmle_i),ci=confint(mod_qmle_i)))
## co 2.5 % 97.5 %
## (Intercept) 0.3868137 0.0818397 1.6887024
## superbwlNo 5.1149370 1.2260697 23.2306548
## unemprate_SA 1.0948648 0.8827269 1.3360544
## popul_mil 2.8580085 2.5542789 3.2010455
## mariNo 1.0596270 0.7452216 1.5714485
## superbwlNo:unemprate_SA 0.6881123 0.5628538 0.8548019
Per the quasi-poisson model, on the average there are about five times more prostitution related arrests during non-Super Bowl weekends in the MDC area, compared with Super Bowl weekends.
The residuals obtained from the Quasi-Poisson regression model show a time dependency as shown below in Figure 6
The autocorrelation function (ACF) and partial autocorrelation function (PACF) plots (Figures 7 & 8) below displays residual correlations with time lags.
Since there are correlations among the residuals, the estimated standard errors of the linear regression and the quasi-poisson models will tend to underestimate the true standard errors in a linear model. Therefore, the confidence interval will be narrower than it should be.
The gbm() function within the gbm package in R is used to fit thousand boosted regression trees to the data with a poisson distribution. Usually smaller trees can aid in interpretability, moreover using stumps(single split) leads to an additive model. Therefore, here a single split is specified for each tree. A shrinkage parameter of 0.2 used to control the rate at which boosting algorithm learns.
library(gbm)
set.seed(12345)
boost <- gbm(arrests ~ superbwl+unemprate_SA+popul_mil+mari, mdc_sb,
distribution = "poisson",n.trees = 1000,interaction.depth = 1,shrinkage = 0.2)
A relative influence plot with relative influence statistics is generated below. The unemployment variable (unemprate_SA) is by far the most important variable in terms of affecting the number of prostitution arrests.
## var rel.inf
## unemprate_SA unemprate_SA 90.352338
## popul_mil popul_mil 7.480401
## superbwl superbwl 2.167261
## mari mari 0.000000
A “Partial Dependence” plot below shows the marginal effect of the unemployment variable on the prostitution arrest counts after âaveraging outâ the influence of all other variables. In this case, the arrests decrease as the unemployment rate increases from four to six percent. The dependence of unemployment on arrest count varies above six percent.
The Super Bowl weekends in itself did not seem have a high impact on prostitution arrests compared to the unemployment rate. However, relative to the non-Super Bowl weekends, there seem to be an increase in arrest counts.
The aim of this analysis was to explore whether there was a statistically significant increase in the number of prostitution related arrests in Miami-Dade County, during the weekends (Friday-Monday) centered around the eight Super Bowls held in that area since 1971.
It’s not clear from the linear multi-regression and quasi-poisson models that the Super Bowl event by itself led to any statistically significant increase in the number of prostitution related arrests in Miami-Dade County. However, the results from the linear models points to the interaction of the superbowl event and the unemployment variables as the statistically significant contributor to the prostitution arrest count response. Moreover, as expected, the Miami-Dade County population also plays a big factor in the number of arrest.
The boosted tree model though shows the national unemployment rate as the most significant variable in relation to prostitution arrest counts. The Miami population and the Super Bowl event only had a relatively very small effect on arrest counts compared with the national monthly unemployment rate.
Ultimately, what is derived from the linear and non-linear regression models is that a single predictor, such as the Super Bowl event in Miami, may not be the only contributor to the spike in prostitution activities in Miami. It may be a combination of other factors, such as unemployment and/or population. There may be also other influences, not studied here in this analysis, such as tourism, climate change, increases in police sting operations or sex-trafficking, etc. that could also play a part in affecting the rate of prostitution arrests in Miami.
setwd('..')
source('load_db.R')
source('calc_db.R')
setwd('..')
mdc <- load_mdc()
mdc$calc.broad <- cc_classify(mdc)
mdc$calc.detailed <- cc_classify(mdc,'detailed')
library(dplyr)
mdc_pros <- filter(mdc,calc.broad %in% 'Prostitution')
mdc_pros$date <- 0
for(i in 1:nrow(mdc_pros)){
ifelse(is.na(mdc_pros$arrest.date[i]),
mdc_pros$date[i]<- mdc_pros$case.date[i],
mdc_pros$date[i] <- mdc_pros$arrest.date[i])
}
mdc_pros$date <- as.Date(mdc_pros$date,origin = "1970-01-01")
mdc_pros <- filter(mdc_pros, date >= "1971-01-01")
mdc_pros <- as.data.frame(mdc_pros[,c('date')])
names(mdc_pros) <- c('date')
mdc_pros$year <- format(mdc_pros$date,'%Y')
mdc_pros$month <- format(mdc_pros$date,'%b')
mdc_pros$dayN <- format(mdc_pros$date,'%d') # day number
mdc_pros$dayL <- format(mdc_pros$date,'%a') # tue
mdc_pros$week <- format(mdc_pros$date,'%W') # week number
mdc_sb <- mdc_pros %>% count(date,year,month,dayN,dayL,week)
# Generate Regular Sequences of Dates by days from 1971-01-01 to 2014-04-10
d <- as.data.frame(seq(as.Date("1971/1/1"), as.Date("2014/4/10"), "days"))
names(d) <- 'date'
d$year <- format(d$date,'%Y')
d$month <- format(d$date,'%b')
d$dayN <- format(d$date,'%d') # day number
d$dayL <- format(d$date,'%a') # tue
d$week <- format(d$date,'%W') # week number
d_sb <- d %>% count(date,year,month,dayN,dayL,week)
d_sb$n <- NULL
library(dplyr)
mdc_sb <- right_join(mdc_sb, d_sb)
library(dplyr)
mdc_sb <- mdc_sb %>%
mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
# Census Data
source('code/analyze_census.R')
mdc_census <- load_census()
mdc_census <- subset(mdc_census, mdc_census$County=="Miami-Dade County")
mdc_census <- mdc_census[,c("Year","Population")]
mdc_census <- filter(mdc_census, Year >= "1971")
names(mdc_census) <- c("year","population")
mdc_census$year <- as.character(mdc_census$year)
library(dplyr)
mdc_sb <- left_join(mdc_sb, mdc_census, by = "year")
mdc_sb <- mdc_sb %>% mutate(arrestPerMillion = round(n/population*1000000,2))
mdc_sb$superbwl <- NA
# Identifying Frid-Mondays of the superbowls
mdc_sb$superbwl[mdc_sb$date=='1971-01-15'|mdc_sb$date=='1971-01-16'|
mdc_sb$date=='1971-01-17'|mdc_sb$date=='1971-01-18'|
mdc_sb$date=='1976-01-16'|mdc_sb$date=='1976-01-17'|
mdc_sb$date=='1976-01-18'|mdc_sb$date=='1976-01-19'|
mdc_sb$date=='1979-01-19'|mdc_sb$date=='1979-01-20'|
mdc_sb$date=='1979-01-21'|mdc_sb$date=='1979-01-22'|
mdc_sb$date=='1989-01-20'|mdc_sb$date=='1989-01-21'|
mdc_sb$date=='1989-01-22'|mdc_sb$date=='1989-01-23'|
mdc_sb$date=='1995-01-27'|mdc_sb$date=='1995-01-28'|
mdc_sb$date=='1995-01-29'|mdc_sb$date=='1995-01-30'|
mdc_sb$date=='1999-01-29'|mdc_sb$date=='1999-01-30'|
mdc_sb$date=='1999-01-31'|mdc_sb$date=='1999-02-01'|
mdc_sb$date=='2007-02-02'|mdc_sb$date=='2007-02-03'|
mdc_sb$date=='2007-02-04'|mdc_sb$date=='2007-02-05'|
mdc_sb$date=='2010-02-05'|mdc_sb$date=='2010-02-06'|
mdc_sb$date=='2010-02-07'|mdc_sb$date=='2010-02-08'] <- 1
library(dplyr)
mdc_sb <- mdc_sb %>%
mutate_if(is.numeric, funs(ifelse(is.na(.), 0, .)))
mdc_sb$superbwl <- factor(mdc_sb$superbwl,labels = c("Yes","No"), levels=c(1,0))
# National Unemployment Rate - Seasonally Addjusted
ue <- read.csv('code/SupBwl_Prost/UnemploymentRate_SA.csv')
ue$date <- as.character(ue$date)
ue$date <- as.Date(ue$date, "%m/%d/%y")
ue$year <- format(ue$date,'%Y')
ue$month <- format(ue$date,'%b')
ue$date <- NULL
mdc_sb <- left_join(mdc_sb, ue, by = c("year","month"))
mdc_sb$popul_mil <- round((mdc_sb$population)/1000000,2)
myCols <- c("date","year","month","dayN","dayL","week","superbwl",
"unemprate_SA","population","popul_mil","n","arrestPerMillion")
mdc_sb <- mdc_sb[,myCols]
names(mdc_sb) <- c("date","year","month","dayN","dayL","week","superbwl","unemprate_SA","population",
"popul_mil","arrests","arrestPerMillion")
# Mariel Boatlift from Cuba ~ 2700 hardened criminals - April 1980 to April 1981
mdc_sb$mari <- 0
mdc_sb$mari[mdc_sb$date >= '1980-04-01' & mdc_sb$date < '1981-05-01'] <- 1
mdc_sb$mari <- factor(mdc_sb$mari, levels = c(1,0),labels = c('Yes','No'))
## Weekend arrests - Friday-Monday
mdc_sb <- subset(mdc_sb, mdc_sb$dayL=='Fri'|mdc_sb$dayL=='Sat'|mdc_sb$dayL=='Sun'|mdc_sb$dayL=='Mon')
require(knitr)
kable(head(mdc_sb[,c("date","dayL","superbwl","unemprate_SA",
"popul_mil","mari","arrests" ,"arrestPerMillion")],14),
digits = 2, align = c(rep("l", 4), rep("c", 4), rep("r", 4)),
caption = "Weekend Prostitution Arrests Data Set*")
# mean arrests split by superbowl and non-superbowl weekends
means <- aggregate(arrestPerMillion ~ superbwl, mdc_sb, mean)
########## PLOTS
library(ggplot2)
# Violin - arrests- superbowl weekend vs non-superbowl weekend - Frid-Mond
ggplot(mdc_sb, aes(x=superbwl, y=arrestPerMillion)) +
geom_violin(alpha=0.5, color="black")+
geom_jitter(alpha=0.5, aes(color=unemprate_SA),
position = position_jitter(width = 0.05)) +
stat_summary(fun.y=mean, colour="red", geom="line", aes(group = 1))+
geom_text(data = means, aes(label = round(arrestPerMillion,2),
y = arrestPerMillion),size=8) +
labs(title="Figure 1 - Weekend Prostitution Arrests - \nSuper Bowl Weekends versus Non-Super Bowl Weekends",
x="Super Bowl Weekend",
y = "Weekend Prostitution Arrests (per million people)")
library(ggplot2)
ggplot(mdc_sb, aes(x=unemprate_SA, y=arrestPerMillion))+geom_point()+
geom_smooth()+
labs(title="Figure 2 - Weekend Prostitution Arrests v/s Unemployment Rate",
x="Unemployment % Rate (Seasonally Adjusted) ",
y = "Weekend Prostitution Arrests (per million people)")
ggplot(mdc_sb, aes(x=as.integer(year), y=arrestPerMillion, color=population))+geom_point()+
geom_smooth() +
labs(title="Figure 3 - Weekend Prostitution Arrests v/s Population",
x="Years", y = "Weekend Prostitution Arrests (per million people)")
## histogram counts broken by days of week -Frid-Mon
library(ggplot2)
ggplot(mdc_sb, aes(arrests, fill = dayL)) + coord_cartesian(xlim = c(0,7))+
geom_histogram(aes(y = (..count..)/sum(..count..)),bins=250) +
facet_grid(superbwl ~ dayL, margins=FALSE, scales="free_y")+
labs(title=" Figure 5 - Arrests per Day - Superbowl (Yes) vs. Non-Super Bowl (No) Weekend",
x="Arrests per Day", y = "Density")
myCols = c("arrests","unemprate_SA","population")
dat <- mdc_sb[,myCols]
dat$population <- as.numeric(dat$population)
library(Hmisc)
# ++++++++++++++++++++++++++++
# flattenCorrMatrix
# ++++++++++++++++++++++++++++
# cormat : matrix of the correlation coefficients
# pmat : matrix of the correlation p-values
flattenCorrMatrix <- function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor =(cormat)[ut],
p = pmat[ut]
)
}
res2<-rcorr(as.matrix(dat))
# Correlation data frames
corrdf <- flattenCorrMatrix(res2$r, res2$P)
dar <- round(cor(dat),1)
library(corrplot)
# Create upper triangle correlation matrices with numbers
corrplot(dar, method="number", type = "upper", bg="darkgrey",
title="Figure 4 - Prostitution Arrest Correlations",
number.cex=2, mar=c(5, 4, 1, 2) + 0.1)
mod_lr <- lm(arrests ~ superbwl*unemprate_SA + popul_mil + mari, data = mdc_sb)
summary(mod_lr)
# Model Diagnostics
library(car)
par(mfrow=c(2,2))
plot(mod_lr, main='Model Diagnostic Plots')
#shows a Poisson distribution counts probability
plot(table(mdc_sb$arrests)/nrow(mdc_sb),
main='Figure 5 - Arrest Count Density',
xlab = 'Arrest Counts', ylab='Density')
mean(mdc_sb$arrests)
var(mdc_sb$arrests)
mod_qmle_i <- glm(arrests ~ superbwl*unemprate_SA +
popul_mil + mari, family="quasipoisson", data = mdc_sb)
summary(mod_qmle_i)
exp(cbind(co=coef(mod_qmle_i),ci=confint(mod_qmle_i)))
mdc_sb$resid <- mod_qmle_i$residuals
library(ggplot2)
ggplot(data=mdc_sb, aes(x=date, y=resid)) + geom_point() +geom_smooth()+
coord_cartesian(ylim = c(-1,10))+
labs(title="Figure 6 - Quasi-Poisson Regression model Residuals v/s Time",
x="Years", y = "Residuals")
## Time series analysis of qmle model residuals
library(xts)
x <- mdc_sb[,"resid"]
idx <- mdc_sb$date
mdc_sb_xts <- xts(x, order.by = idx)
# Autocorrelations
acf(mdc_sb_xts, main='Figure 7 - Autocorrelation Function')
pacf(mdc_sb_xts, main='Figure 8 - Partial Autocorrelation Function')
library(gbm)
set.seed(12345)
boost <- gbm(arrests ~ superbwl+unemprate_SA+popul_mil+mari, mdc_sb,
distribution = "poisson",n.trees = 1000,interaction.depth = 1,shrinkage = 0.2)
summary(boost)
plot(boost,i='unemprate_SA', type="response", main='Figure 9 - Unemployment vs Arrests')
plot(boost,i='superbwl', type="response",
main='Figure 10 - Super Bowl/Non-Super Bowl Weekends vs Arrests')
Crime and Arrest Data:
Center for Science and Law, Houston, Texas
Miami-Dade County Census:
Center for Science and Law, Houston, Texas
Unemployment Rate: Aged 15-64: All Persons for the United States:
https://fred.stlouisfed.org/series/LRUN64TTUSM156S/download
List of winners, losers, scores and sites of every Super Bowl:
http://www.sportingnews.com/nfl/news/nfl-super-bowl-winners-history-past-last- list-losers-scores-host-sites-championship/sjohe60weh8l13kspgeskl8xs