Regression: linear modelling background

The basic model for linear regression is

                                                              \(Y = X\beta + \epsilon\)

where \(Y\) is the vector of \(n\) observed responses, \(X\) is the \(n \times p\) data (design) matrix of regressors, \(\beta\) is the vector of \(p\) parameters and \(\epsilon\) is the vector of errors.

The errors represent ‘noise’ while the regression means, \(X\beta\), are the ‘signal’ or systematic part of the model. The errors may originate in mistakes in recording values or problems in experimental set-up, but they also cover ‘unknown unknowns’ and variables which influence the response but which aren’t included in the model.

The errors are not directly observable and have to be estimated from the ‘residuals’: the differences between the responses fitted by the model, \(\hat Y\), and the observations, \(Y\). As such they are an indicator of the fitness of the model and are assumed to be random.

Since the errors have zero mean, the predicted means, \(\hat Y\), satisfy the expectation equation \(E(\hat Y) = X\beta\). Because the errors¹ are \(N(0,\sigma^{2})\), the model’s predicted values for the response are conditionally normal at each vector sample point \(X_{0}\):

               \(\hat Y \mid X_{0} \sim N\left(X_{0}\beta,\; X_{0}(X^{T}X)^{-1}X_{0}^{T}\sigma^{2}\right)\)


It is theoretically possible to fit a model perfectly to the observed responses, with zero error, provided that the number of independent model parameters, p (a function of the number of regressors), equals or exceeds the number of observations, n. When p = n and the data matrix is nonsingular, the parameters satisfy the equation \(\beta = X^{-1}Y\) exactly. Such a model is said to be saturated but will in practice be of little value in predicting a response from future X-values, since it is overfitted, or too tightly coupled to the original data used to obtain the model parameters².

The parameter estimates \(\hat\beta\) are calculated by least squares: the values which minimise the total of the squared vertical distances between the regression line and the observed responses. In the linear model, where the errors are normal, the least squares equations and parameter estimates are the same as those derived from maximum likelihood estimation. The normal equations, obtained by differentiating the sum of squares and setting the derivatives to zero, are satisfied at the least squares estimates; one consequence is that, when an intercept is included, the model’s residuals sum to zero.

The vector of parameter estimates is \(\hat\beta = (X^{T}X)^{-1}X^{T}Y\), with mean \(\beta\) and variance \((X^{T}X)^{-1}var(Y) = (X^{T}X)^{-1}\sigma^{2}\), where \(\sigma^{2}\) must be estimated by \(\hat\sigma^{2}\) from the residuals: \(Y^{T}(I-H)Y/(n-p)\). H is called the hat matrix, \(X(X^{T}X)^{-1}X^{T}\).
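
This algebra can be checked numerically. Below is a minimal R sketch, using simulated data (all names illustrative), that computes \(\hat\beta\), the hat matrix and \(\hat\sigma^{2}\) directly and confirms they agree with R’s lm():

    # Simulated data; X is the design matrix with an intercept column
    set.seed(1)
    n <- 50
    x <- runif(n)
    X <- cbind(1, x)
    y <- 2 + 3 * x + rnorm(n)

    beta_hat   <- solve(t(X) %*% X, t(X) %*% y)            # (X'X)^{-1} X'Y
    H          <- X %*% solve(t(X) %*% X) %*% t(X)         # hat matrix
    p          <- ncol(X)
    sigma2_hat <- as.numeric(t(y) %*% (diag(n) - H) %*% y) / (n - p)  # Y'(I-H)Y/(n-p)

    fit <- lm(y ~ x)
    all.equal(as.numeric(beta_hat), as.numeric(coef(fit)))  # TRUE
    all.equal(sigma2_hat, summary(fit)$sigma^2)             # TRUE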

The model’s parameter estimates have a t-distribution, which arises as the quotient of two independent random variables:

                                \(t = \dfrac{\hat\beta_{j}\,/\,\sigma\sqrt{(X^{T}X)^{-1}_{jj}}}{\sqrt{\hat\sigma^{2}/\sigma^{2}}}\)

 
The numerator is a standard normal random variable (under the null hypothesis that \(\beta_{j} = 0\)) and the denominator is the square root of a \(\chi^{2}\) random variable divided by its degrees of freedom, with those degrees of freedom carrying over to the t-distribution. The t-statistic tests whether the observed value of \(\hat\beta\) (expressed in standard deviations) could plausibly arise by chance under the null hypothesis that the parameter is zero (and doesn’t belong in the model).

In the linear regression model the means of the predicted response are a plain linear function of the regressor variables multiplied by the least squares estimates of their coefficients. In a generalized linear model, however, the response follows some other member of the exponential family of distributions, rather than the normal, and the relation between the predicted mean and the regressors is via a link function, which takes a canonical form for each family.
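
As a sketch of the generalized linear model idea, the fragment below fits a Poisson GLM with its canonical log link to simulated count data (all names illustrative):

    set.seed(2)
    x  <- runif(100)
    mu <- exp(0.5 + 1.2 * x)      # mean tied to the linear predictor via the log link
    y  <- rpois(100, mu)
    glm_fit <- glm(y ~ x, family = poisson(link = "log"))
    summary(glm_fit)              # coefficients are on the log-mean scale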

If the residuals deviate significantly from normality, or display a non-random relation with the fitted values or the regressors, then they are not i.i.d. normal and the model is poorly indicated. Often a better model can be found based upon a transformation of the response, the regressors or both, although transformations of the regressors do not “affect the distribution of errors in the response”³. For example: the log of the response is used where a multiplicative relation exists between the response and the regressors; an arcsine transformation can be used to linearise a binomial response; and a square root transformation of the response is suitable where it is a positive value based on counts of events (the square root of the Poisson parameter \(\lambda\) is the standard deviation of such a random variable).

Plots of studentized residuals against the fitted values, the regressors, and the (influence-measuring) hat matrix diagonals are basic diagnostic tools, alongside QQ plots of the residuals’ order statistics against standard normal percentiles to assess normality. The residuals and the response itself are necessarily correlated rather than randomly related, so plotting one against the other is uninformative⁴.
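
In R these diagnostics take a few lines; a minimal sketch, assuming a fitted lm object named fit:

    sres <- rstudent(fit)           # studentized (jackknifed) residuals
    plot(fitted(fit), sres)         # residuals v fitted values
    plot(hatvalues(fit), sres)      # residuals v hat matrix diagonals (influence)
    qqnorm(sres); qqline(sres)      # order statistics v standard normal percentiles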

The model’s adequacy can be gauged, and models compared, using two statistics: the F-statistic and multiple R-squared. In essence the regression model decomposes the sum of squared distances between the responses and their mean. This total sum of squares is split between the part explained by the model and the unexplained residual sum of squares. The former is the sum of squared distances between the model-fitted values and the response mean; the latter is the sum of squared distances between the observed responses and the fitted values. Geometrically, at each point on the regression line the normal distribution’s characteristic bell shape stands orthogonal to the x-axis and represents the distribution of observations at that set of X values.

The F-statistic is formed as the quotient of the two decomposed sums of squares, each divided by its degrees of freedom. A large F-statistic means the systematic variation in the numerator exceeds the random variation in the denominator, and carries a low probability of the quotient arising by chance. This is considered a necessary though not sufficient endorsement of the model.

The multiple R-squared⁵ statistic is the proportion of the total sum of squares between the observations and their mean accounted for by the regression model⁶. A high figure indicates a model has explanatory power; a low figure suggests that better regressors might be missing from the model. A variation is the adjusted multiple R-squared, which is lower than the unadjusted figure as it penalises each model for its number of parameters. Generally small models with just a few regressor variables are preferable to larger ones: the former are more easily interpretable, extend better to future predictions and minimise the risk of over-fitting the model to training data.
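
The decomposition and both statistics can be computed directly; a sketch, again assuming a fitted lm object fit:

    y        <- model.response(model.frame(fit))
    ss_total <- sum((y - mean(y))^2)               # total sum of squares
    ss_model <- sum((fitted(fit) - mean(y))^2)     # explained by the model
    ss_resid <- sum(resid(fit)^2)                  # unexplained residuals
    r2 <- ss_model / ss_total                      # matches summary(fit)$r.squared
    f  <- (ss_model / (length(coef(fit)) - 1)) /
          (ss_resid / df.residual(fit))            # matches summary(fit)$fstatistic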

When investigating which candidate variables to include as a model’s regressors, correlations between the candidates are a good indicator. Highly correlated pairs should not both go into the model, as one can serve as a proxy for the other without significantly altering the R-squared or F-statistic. Models which include correlated regressors typically exhibit inflated parameter variances because of the collinearity (evidenced by large diagonal values in the \((X^{T}X)^{-1}\) matrix). On the other hand, high correlations between the response and a regressor candidate are welcome, as the association is likely to produce a good fit between the model and the observations.

Sometimes certain variables, or particular properties of them, will be mandated by domain experts or project sponsors for inclusion in the final model. For instance, experimentalists may require that the fitted line passes through the origin to accord with physical laws. In contrast, the purely statistical approach is to look coldly at the data within the range collected, so a locally-determined intercept is acceptable if it fits and elucidates the data swarm as it is.

As far as predictions at a new variate value \(X_{0}\) are concerned, it is important to recognise that the prediction has two sources of variation. The first comes from the randomness of the fitted mean response \(\hat{\bar Y}_{0}\) on the regression line, which forms the centre of the distribution of responses at \(X_{0}\); in simple regression this is a function of the error variance and the size, n, of the training set used to derive the model estimates: \(\sigma^{2}/n\). The second source comes from the conditional distribution itself, \(\sigma^{2}\). The sources are additive, so in simple regression the prediction interval for the new variate \(\hat Y_{0}\) is \(\hat{\bar Y}_{0} \pm (\hat\sigma^{2}(n+1)/n)^{1/2} \times t_{(\alpha/2,\,n-p-1)}\), where \(\hat\sigma^{2}\) is the mean squared error and \(t_{(\alpha/2,\,n-p-1)}\) is the \(\alpha/2\) critical value of the t-distribution on n-p-1 degrees of freedom. In multiple regression the variance of the average case is \(var(\hat{\bar Y}_{0}) = \sigma^{2}x^{T}_{0}(X^{T}X)^{-1}x_{0}\), and for the individual case \(var(\hat Y_{0}) = \sigma^{2}(1 + x^{T}_{0}(X^{T}X)^{-1}x_{0})\), with \(\sigma^{2}\) estimated by \(\hat\sigma^{2} = MSE\)⁷.
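
In R the two cases correspond to the two interval types of predict.lm; a sketch, assuming a fitted model fit and a data frame new_x holding the new regressor values:

    predict(fit, newdata = new_x, interval = "confidence")  # mean response ('average case')
    predict(fit, newdata = new_x, interval = "prediction")  # individual response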

Generally new model predictions are better where \(X_{0}\) is an interpolation within the swarm of training-set data values. Extrapolation is hazardous because the model is fitted to X values lying wholly above or below the new value, which is in unknown territory. Confidence intervals fan out from the centre of the X data in both directions, often becoming very wide at and beyond the X endpoints.

London Crime: an exploratory analysis case study

This project is an exploratory investigation of London crime, the response variable, to see which wards or boroughs are unusual and to find some publicly available variables that usefully explain it.

Sources

The crime data were sourced from incidents recorded by the Metropolitan Police Service within the 32 London boroughs⁸. The City of London is under a different police jurisdiction and is not included in this study. The MPS data show a caesura after May 2018, when a different categorisation of crime was introduced. After the first Covid-19 lockdown in March 2020 the profile of crime committed changed drastically, while its prevalence oscillated with public health measures. Both events mean that, to get a picture of normal ongoing crime, the study has been limited to reported incidents occurring within the 32 boroughs between June 2018 and February 2020 (21 months).

The four potential regressor/explanatory variables considered here are

  • Income deprivation, represented as the percentage of individuals in an LSOA⁹ reliant on jobseekers allowance, Universal, Child, Working or Pension Credit, and refugees
  • Employment deprivation, represented as the percentage of working-age individuals in an LSOA directly claiming jobseekers allowance, incapacity or carers allowance, etc.
  • Average household income after housing costs
  • Multiple Deprivation Score

The final variable, Multiple Deprivation Score (sometimes called IMD or wardIDMxCrime in the sequel), is a modification of the ONS’ own Index of Multiple Deprivation. The latter is a function of seven domains of deprivation, the first two being income and employment deprivation, which are separated out as variables in their own right above. In the ONS index they were the major factors, each carrying the largest weight, 22.5%. They were also the easiest to interpret, as their scores are actual proportions of households in an area.

The other areas of deprivation and their weightings are:

  • Education, Skills and Training (13.5%)
  • Health and disability (13.5%)
  • Barriers to housing and services (9.3%)
  • Living Environment (9.3%)
  • Crime (9.3%)

In this analysis the scores found for each domain are used with the same weights. In contrast, the ONS recommends using the ranks from each domain, after an exponential transformation, instead of the scores. However, the ranks run over all 32844 English LSOAs, and it is not useful to make comparisons with other regions when the models are comparing crime within the metropolitan area. The scores are therefore considered adequate as a heuristic in this exploratory phase of generating hypotheses for future analysis. The other modification here is that the crime component of the ONS index has been factored out, as it is the response in the models to be considered and would introduce unwanted correlation via this circularity.

Since two of the candidate modelling variables are also components of the Multiple Deprivation Score, neither is used together with it in any of the models tried, in order to prevent collinearity between regressors.

Populations referred to below are total ward resident populations, ignoring age distributions. The crime data are converted to totals per 1000 residents for modelling purposes, to allow comparisons between wards. There are suggestions in the sequel that London boroughs which host nationally important amenities might have confounded crime rates where the numbers of visitors to them are significant.

Methodology

Because these potential regression variables were collected over different sampling units, an ONS mapping file was used to bring all variables up to ward-level totals, using sub-populations as aggregation weights. All data were stored in a custom SQL Server database, and a suite of views was created to integrate and aggregate the data in virtual tables. The analysis was carried out using R and the RStudio IDE, with the SQL views queried using the RODBC library.
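
A minimal sketch of the data pull; the DSN and view name below are hypothetical, as the real names are not given in the text:

    library(RODBC)
    ch    <- odbcConnect("LondonCrimeDSN")                            # hypothetical ODBC DSN
    wards <- sqlQuery(ch, "SELECT * FROM vw_WardCrimeDeprivation")    # hypothetical SQL view
    odbcClose(ch)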

Health warning

As the study focuses solely on London boroughs it could be that they are insufficiently different to yield any interesting contrasts about the types of crime and its total magnitude between areas. A comparison with other cities, either in the UK or international, might be more enlightening.

Generally the exploratory stage of an investigation will look to see which variables are related to the response. This is evident from, for instance, the diagonal banding of scatter plots, or from statistics such as the Pearson or Spearman correlation coefficients, which measure linear and monotonic association respectively. In multiple regression models it is desirable that the regressors are not strongly associated with each other, to avoid collinearity and inflated mean-squared errors. As a rule two highly correlated regressor variables are proxies for each other and only one should go in the model. Often the choice is mandated by subject-matter considerations or can be made by comparing the sums of squares each contributes.

Crime is the result of complex social factors, mediations and interactions: it is unlikely that a simple model of the type sought here will have a great deal of explanatory power. In the regression context this means that regression variables might have high t-values, and models large F-statistics, yet the coefficient of determination (R-squared value) for the models will be stubbornly low. This does not invalidate a model as a shorthand interpretative tool, but it does caution against its over-interpretation and underlines the dangers of extrapolation beyond the range of observed predictor variables. Other factors not in the model are always in play, although larger models are less interpretable than simpler ones and carry the risk of overfitting.

Although the data here might be considered a crime census, it is known from self-reporting studies that recorded, investigated and prosecuted crime are decreasing subsets of experienced crime. Attitudes towards reporting types of crime in social groups, as well as practices and policies of the recorders, can lead to under-reporting of certain crimes. Domestic and sexual abuse, condoned or protected by religious codes and marriage vows, and so-called ‘victimless’ corporate white-collar financial crime are just two instances. More recently, cyber crime is no longer recorded by the police, and complainants are referred to self-regulatory bodies run by financial institutions. In this spirit the recorded crime data are treated as a sample from the reservoir of actual crime, reported and unreported, and so amenable to statistical inference and simplified representation by stochastic modelling¹⁰.

Exploratory Data Analysis: Total Crime by Ward v Income Deprivation Proportion

The ward-level data points are graphed by crime category. There are 632 London wards and 11 categories of crime, giving 6952 scatter points over the array. The ward points are labelled with their more familiar borough names, and these are readable for the outliers which stand out from the mass of points in the data swarm.

Income deprivation % has been chosen as the x-axis because it has a better absolute correlation with total ward crimes per 1000 population, ignoring crime type (Spearman rank correlation = 0.471), than employment deprivation % (0.441) or income after housing costs (-0.370), although the differences are small. The Index of Multiple Deprivation score was added to the panel of potential regressors later, after examining, and finding disappointing, models featuring mixtures of the first three; it has a Spearman correlation of 0.522.
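
These correlations come from cor() with the Spearman method; a sketch, assuming a data frame wards with the column names used in the model table below:

    cor(wards$totwardcrimeperKpop, wards$incomedeprivpc, method = "spearman")
    cor(wards$totwardcrimeperKpop, wards$wardempdeprvpc, method = "spearman")
    cor(wards$totwardcrimeperKpop, wards$wardIDMxCrime,  method = "spearman")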

The main points arising from these and other plots not shown are that

  • Westminster has outlying wards for 10 of the 11 categories
  • Hillingdon has a high outlier ward for miscellaneous crime
  • The dark ellipsoidal data swarm in each crime category generally displays only a modest upward slope over the range of income deprivation indicating a low correlation.

As will become clear, it is unlikely that the variables will make a good linear model without transformation.

Models tested using total crime at Ward level

Geometrically, the linear regression model reduces the n-dimensional response space by projecting the response onto the k-dimensional space spanned by the regressors, where k < n is decided by the modelling process¹¹. Generally the response will not be linearly related to the regressors, individually or in combination, so the response may need transforming to achieve linearity. A transformation might also correct problems with the variance, for instance where it varies with the X values or does not exhibit a constant coefficient of variation¹². Hence a number of models should be fitted and compared, to see which have explanatory power while conforming to the regression assumptions, such as independent and normally distributed errors at each X data value.

The models that were tested are shown using their Wilkinson formulae. A tilde separates the response on the left from the regressors on the right. Plus signs show regressors with additive effects on the fitted values. Asterisks indicate regressors with both additive effects and multiplicative effects arising from their interactions. All models have a default intercept term.

The models are listed in the order they were tested. Only log, square root and inverse transformations of the response were tried; these are examples of a general system of power transformations, the eponymous Box-Cox transformations¹³. The table columns show the model, the F-statistic value (omitting the parameter-dependent degrees of freedom), the multiple R-squared value, and finally whether all the parameters were found significant at the 5% critical value of their t-distributions.
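
The Box-Cox likelihood can also be inspected directly; a sketch using MASS::boxcox, assuming the wards data frame used elsewhere:

    library(MASS)
    bc <- boxcox(totalcrimeK ~ incomedeprivpc, data = wards)  # plots log-likelihood v lambda
    bc$x[which.max(bc$y)]                                     # the power maximising the likelihood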

Models tried using transformations of total ward crime per 1000 residents as response
Wilkinson_model f_stat Mult_Rsquared params_t5pc
totalcrimeK ~ incomedeprivpc 3.5117 0.005543 N
log(totalcrimeK) ~ incomedeprivpc 79.3010 0.111800 Y
totalcrimeK^0.5 ~ incomedeprivpc 29.0490 0.044080 Y
totalcrimeK ~ wardempdeprvpc 5.6031 0.008815 Y
log(totalcrimeK) ~ wardempdeprvpc 69.7490 0.099680 Y
totalcrimeK^0.5 ~ wardempdeprvpc 28.9850 0.043980 Y
totalcrimeK ~ Net_annual_income_after_housing_costs 1.0904 0.001728 N
log(totalcrimeK) ~ Net_annual_income_after_housing_costs 44.4200 0.065860 Y
totalcrimeK^0.5 ~ Net_annual_income_after_housing_costs 15.0300 0.023300 Y
totalcrimeK ~incomedeprivpc + Net_annual_income_after_housing_costs 2.4600 0.007761 N
totalcrimeK ~incomedeprivpc * Net_annual_income_after_housing_costs 2.1610 0.010220 N
log(totalcrimeK) ~incomedeprivpc + Net_annual_income_after_housing_costs 41.5300 0.116600 N
log(totalcrimeK) ~incomedeprivpc * Net_annual_income_after_housing_costs 28.9400 0.121500 N
totalcrimeK^0.5 ~incomedeprivpc + Net_annual_income_after_housing_costs 15.7800 0.047790 N
totalcrimeK^0.5 ~incomedeprivpc * Net_annual_income_after_housing_costs 11.3100 0.051270 N
totalcrimeK ~wardempdeprvpc + Net_annual_income_after_housing_costs 4.0030 0.012570 N
totalcrimeK ~wardempdeprvpc * Net_annual_income_after_housing_costs 3.8110 0.017880 N
totalcrimeK^0.5 ~wardempdeprvpc + Net_annual_income_after_housing_costs 14.8400 0.045060 N
totalcrimeK^0.5 ~wardempdeprvpc * Net_annual_income_after_housing_costs 10.4400 0.047510 N
log(totalcrimeK) ~wardempdeprvpc + Net_annual_income_after_housing_costs 34.8200 0.099690 N
log(totalcrimeK) ~wardempdeprvpc * Net_annual_income_after_housing_costs 23.4400 0.100700 N
totalcrimeK ~ wardempdeprvpc + incomedeprivpc + Net_annual_income_after_housing_costs 2.9630 0.013960 N
totalcrimeK ~ wardempdeprvpc * incomedeprivpc * Net_annual_income_after_housing_costs 2.7120 0.029520 N
log(totalcrimeK) ~ wardempdeprvpc + incomedeprivpc + Net_annual_income_after_housing_costs 28.1700 0.118600 N
log(totalcrimeK) ~ wardempdeprvpc * incomedeprivpc * Net_annual_income_after_housing_costs 14.3300 0.138500 N
totalcrimeK^0.5 ~ wardempdeprvpc + incomedeprivpc + Net_annual_income_after_housing_costs 10.5600 0.048010 N
totalcrimeK^0.5 ~ wardempdeprvpc * incomedeprivpc * Net_annual_income_after_housing_costs 5.3010 0.056130 N
log(totalcrimeK) ~ log(incomedeprivpc) 69.1000 0.098840 Y
log(totalcrimeK) ~ I(incomedeprivpc^0.5) 79.3000 0.111800 Y
log(totalcrimeK) ~incomedeprivpc + I(incomedeprivpc^2) 40.4600 0.114000 N
1/totwardcrimeperKpop ~ log(wardIDMxCrime) 257.9000 0.290500 Y
1/totwardcrimeperKpop ~ log(wardIDMxCrime)+ Central 162.7000 0.341000 Y
1/totwardcrimeperKpop ~ log(wardIDMxCrime) * Central 121.9000 0.368000 Y
totwardcrimeperKpop ~ log(wardIDMxCrime) * Central 17.4200 0.076830 N
log(totwardcrimeperKpop) ~ log(wardIDMxCrime) * Central 78.0700 0.271600 Y
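
Each row of the table can be reproduced with lm(); a minimal sketch for the second row, assuming a data frame wards with the columns named in the formulae:

    m <- lm(log(totalcrimeK) ~ incomedeprivpc, data = wards)
    s <- summary(m)
    s$fstatistic[1]                      # f_stat column
    s$r.squared                          # Mult_Rsquared column
    all(coef(s)[, "Pr(>|t|)"] < 0.05)    # params_t5pc column: Y if TRUE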

The earlier models listed above showed low R-squared values and F-tests accompanied by non-normal residuals. It appeared that wards in Westminster and its neighbours had wildly higher total crime rates than suburban wards. Westminster’s West End ward had 3620 total crimes per 1000 residents between June 2018 and Feb 2020, and its St James’s ward had 3154. By comparison the next highest, Bloomsbury in Camden, recorded only 1004.

To handle this a dummy variable, imaginatively called ‘Central’, was introduced to see if breaking the wards into two groups improved model fitting. The dummy variable identifies wards in Westminster and five boroughs adjoining it on either side of the Thames (Brent, Camden, Kensington and Chelsea, Lambeth and Southwark) as a subgroup, as sketched below. Generally, dummy variables such as this should have a rationale beyond model-fitting expediency for their inclusion. Here it is that these boroughs host leisure, work and travel amenities of national importance. All six experience a considerable influx of outsiders, which at particular times raises their population density far above their resident populations (for instance Brent hosts Wembley Stadium, which holds 90,000 on capacity match days, or over 27% of Brent’s 329,771 population¹⁴). It might be objected that assigning all wards in each of these boroughs special status ignores local differences within them, but to start cherry-picking at ward level is question-begging when analysing ward-level crime.
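
Constructing such a dummy variable is a couple of lines in R; a sketch, assuming the ward-level data frame carries a borough column:

    central_boroughs <- c("Westminster", "Brent", "Camden",
                          "Kensington and Chelsea", "Lambeth", "Southwark")
    wards$Central <- as.integer(wards$borough %in% central_boroughs)   # 1 = central, 0 = suburban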

A narrower dummy variable including just Westminster, Southwark, Lambeth and Camden was also considered because these boroughs host national rail termini. However it performed poorly and its model coefficients showed high p-values under t-tests.

Preferred ward-level model

The best R model using a function of ward-level total crime per 1000 residents is 1/totwardcrimeperKpop ~ log(wardIDMxCrime) * Central. Expanding the Wilkinson notation gives

    \(1/totwardcrimeperKpop = \beta_{0} + \beta_{1}\log(wardIDMxCrime) + \beta_{2}Central + \beta_{3}\log(wardIDMxCrime) \times Central\)

(where the \(\beta_{n}\) are model parameters fitted to the data by least squares)

The model has an F-statistic of 121.9 on 3 and 628 degrees of freedom, giving a p-value of near zero. Clearly the reciprocal transformation of the response has helped: the F-statistic for the model with the raw response and income deprivation as percentage of households as explanatory variable was 3.51 on 1 and 630 df (p-value = 0.0614). The chosen model includes an interaction between the Central grouping and log(wardIDMxCrime).

In fact this model did not have the highest F-statistic, but it did have the best R-squared value (0.368). The adjusted R-squared value was 0.365 (compared to 0.3389 for the same model without the interaction between log(IMD score) and the dummy variable).

The coefficients of the preferred model are

  • \(\beta_{0}\) intercept = 0.0177931 (t-value = 28.998, standard error = 0.0006136)
  • \(\beta_{1}\) log(wardIDMxCrime) = -0.0035029 (t-value = -16.864,standard error = 0.0002077)
  • \(\beta_{2}\)Central = -0.0105966 (t-value = -6.085, standard error = 0.0017416)
  • \(\beta_{3}\)log(wardIDMxCrime):Central interaction = 0.00295122 (t-value=5.184, standard error = 0.0005693)
  • \(\hat\sigma\) (root mean square error) = 0.002289

All four coefficients are significant to 0.1% under t-tests.

Viewed graphically, the intercept and Central coefficients add together where the line cuts the Y-axis, and the log(wardIDMxCrime) and log(wardIDMxCrime):Central interaction coefficients likewise add together to determine the slope. These two sums differ according to the value of the Central variable (0 for suburban boroughs and 1 for the six central boroughs), producing the two different affine graphs shown below.
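
Combining the reported coefficients gives the two lines explicitly (values as listed above):

    b0 <- 0.0177931; b1 <- -0.0035029; b2 <- -0.0105966; b3 <- 0.00295122
    # Suburban wards (Central = 0): intercept b0, slope b1
    # Central wards  (Central = 1): intercept b0 + b2, slope b1 + b3
    b0 + b2   # central intercept: 0.0071965
    b1 + b3   # central slope: about -0.00055, roughly six times flatter than b1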

Illustrative graphs and examples

Below are two plots: one for wards in suburban boroughs and one for wards in the six central boroughs in or neighbouring Westminster. Each shows the three versions of the inverse of the total ward crime rate per 1000 population:

  • Observed ward inverse crime rates
  • Fitted values as determined by the model
  • 95% confidence interval for the fitted values

The slope of the suburban regression line is about six times¹⁵ steeper than its central counterpart, since the signs of the parameters on the log(IMD) variable and the interaction term differ. This indicates a greater degree of association with the regressor variable, the log of the Index of Multiple Deprivation. Both plots show a slight but characteristic fanning out of the confidence interval either side of its centre: the model’s precision decreases away from the mean of the regressor. The confidence intervals are for what the foregoing described as the ‘average’ case, the means of model responses on the regression line.

By way of illustration, one of the wards with zero for the Central dummy variable is Hampton North in the London Borough of Richmond. It had total crime per 1000 residents of 121.22, with inverse 0.008, and an Index of Multiple Deprivation score of 17.07, with log = 2.837.

Its fitted value is:

                          0.0177931 + 2.837*(-0.0035029) + 0 + 0 = 0.007855

An addition of one to the IMD score gives

                          0.0177931 + 2.894*(-0.0035029) + 0 + 0 = 0.007656

and a reduction of one to the score gives

                          0.0177931 + 2.777*(-0.0035029) + 0 + 0 = 0.008067

These model predictions are for the reciprocal of the crime rate per 1000 residents. Back-transforming to get total crime per 1000 residents yields values of 127.3 (cf. observed response = 121.22); 130.62 for an additional unit of IMD; and 124 for a reduction of a unit. The effect of a change of one in the IMD score (+1 more deprived, -1 less deprived) is about ±3 crimes in the ward per 1000 population over 21 months (the time frame of the data set).

However, when a transformation of the response is used in linear regression to make the dependent variable, and its variance, normally distributed, the regression line’s back-transformation to the original variable is centred at the median rather than the mean. For instance, suppose a function f(Y) is applied to a non-normal response Y. Then

\(P(f(Y) \le med_{f(Y)}) = 0.50\)
\(P(f(Y) \le \mu_{f(Y)}) = 0.50\), since f(Y) is normal and its median, mean and mode coincide
\(P(Y \le f^{-1}(\mu_{f(Y)})) = 0.50\), so the original variable’s median is the inverse transformation of the transformed variable’s mean¹⁶.

In this case study the reciprocal of the crime rate has been used as the response. Thus the median of the crime rate equals the reciprocal of the fitted value (mean) at a particular log(IMD) score and dummy variable value.

Hampton North had a small studentized residual, 0.1722. In contrast, Dulwich Hill ward in Southwark has the 10th largest absolute studentized residual, 2.46. This ward had total crime per 1000 residents of 89.41 (inverse = 0.0112) and an IMD score of 17.64 (log(IMD) = 2.87). Southwark’s wards are coded 1 for the Central dummy variable.

Its fitted value is:

                           0.0177931 + 2.87*(-0.0035029) - 0.0105966 + 2.87*0.002951 = 0.005613

An addition of one to the IMD score gives

                           0.0177931 + 2.926*(-0.0035029) - 0.0105966 + 2.926*0.002951 = 0.005583

and a reduction of one to the score gives

                           0.0177931 + 2.812*(-0.0035029) - 0.0105966 + 2.812*0.002951 = 0.005645

The back-transformed predictions are 1/0.005613 = 178.17 for the present IMD score (cf. observed total crimes per 1000 residents = 89.41); 179.16 for an addition of one to the Index of Multiple Deprivation score; and 177.16 when one is taken from the score. For this ward a unit of IMD either way changes total ward crime by only about one crime per 1000.

The chosen model fits Hampton North’s observations better than Dulwich Hill’s, the latter’s prediction being inflated by the extra burden imposed by the dummy variable, since Dulwich Hill is evidently not a high-crime area, unlike most of the other central wards.

In general, changes at the margin vary in magnitude as we move along the regression line, since the coefficients act multiplicatively on the values of log(IMD).

Residual diagnostics and model checking

Do the residuals (the differences between the responses and the model’s predicted values for each observation) meet the assumption that they are i.i.d. normal and random?

For model diagnostics it is best to use studentized residuals rather than raw residuals. The former are standardized by a jackknife procedure: for each observation the model, including its residual variance, is recalculated with that observation left out. This measures an observation’s influence independently of its contribution to the overall model parameters.

The first plot below shows studentized residuals v fitted values. There is a small increase in residual variance as the fitted values increase beyond 0.008, but this is probably acceptable. Note that the response, even after a transformation, is correlated with the residuals and so is not plotted.

The middle plot shows studentized residuals v log(IMD) scores. The residuals appear to have decreasing variance as the log scores move above about 3. As the fitted values have a negative slope when plotted against the log score, this decreasing variance is the counterpart of the increasing variance in the previous plot. So if the first is acceptable then this plot is acceptable too!

Finally, a QQ plot checks the normality of the studentized residuals. Here the quantiles of the sorted residuals are paired with the equivalent quantiles of a standard normal distribution. If the studentized residuals are normal the pairs should lie on a 45\(^{\circ}\) line, which they more or less do here, apart from evidence of a few outliers at each tail. In fact the 632 studentized residuals (one for each London ward) have 21 values above 1.96 (the two-sided critical point for 5% significance under the t-distribution with 628 degrees of freedom) and 16 below -1.96. We would expect about 2.5%, or 16, in each tail, so the residuals are a little top-heavy but broadly show a normal distribution¹⁷.
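
The tail counts quoted above can be obtained directly from the studentized residuals; a sketch, assuming the fitted model fit:

    sres <- rstudent(fit)
    sum(sres >  1.96)            # 21 reported above
    sum(sres < -1.96)            # 16 reported above
    qqnorm(sres); qqline(sres)   # the QQ plot itself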

Overall the best model doesn’t blatantly contradict the linear model’s assumptions about the residuals and their distribution. We have used a dummy variable to handle some evidence of a sub-population within the data but we have used all the data and avoided cherry-picking the most mathematically tractable wards.

Conclusion

We have only considered models for functions of a dependent variable based on total crime at ward level. It would clearly be of interest to conduct further analyses and model the crimes (a) by category¹⁸ and (b) at borough level. As the foregoing ignores the type of crime, the time series could conceivably be extended back before June 2018, since the changes in classification introduced then would presumably not affect total crime.

The ward totals were plotted by crime type in the exploratory work to get a first idea of the crime data. This showed some wards had crime rates far outside the data swarm, with Westminster showing high-end outliers in 10 of the 11 types.

None of the four independent variables, alone or in combination, produced good models using the raw totals of ward crime as the response. Each showed low R-squared values, and the residuals violated the requirement that they be i.i.d. \(N(0,\sigma^{2})\).

To get over these difficulties, log, polynomial and inverse transformations were tried to stabilise the variances, which were increasing with the values of the independent variable. For the response an inverse transformation performed best, although the transformed response now had a contorted meaning, as a kind of score for crime immunity! The best performing independent variable was an index of multiple deprivation score crafted from 6 of the 7 domains included in the ONS’ full version. Here too transformation was found necessary to improve performance, and taking the natural logarithm proved best. These devices improved the proportion of the sum of squares explained by the model to 37%, which is still low but typical where the data are observational rather than experimental.

Although the R-squared and F- and t-test results improved after the two transformations, the residuals were still problematic. Going back to the exploratory analysis suggested that a dummy variable representing Westminster and its neighbours might isolate the effects of a distinct sub-population carrying its own crime effect. It was suggested that the dummy variable registers the effects of large numbers of outsiders visiting the wards in question, a supplement and corrective to using resident ward populations to standardise the crime rates between wards.

The case study only looked at four variables which might link crime and local deprivation. However, currents of opinion, from public concerns to newspaper campaigns, Home Office priorities to police ‘canteen culture’, can determine how and whether events come to be recorded and coded as types of crime¹⁹. It would be useful to see if variables associated with policing could be included in future regression models. Factors such as ward policing patrol numbers, numbers of CCTV cameras and the presence of a ward police station could weigh the effects of crime-management measures in addition to the index of multiple deprivation considered already.

The linear regression model is sometimes deprecated as too simple to model complex situations. However, curvature can be modelled by polynomials of the regressors, and transformations of the response, the regressors, or both can bring variables with multiplicative and proportional effects into a de facto linear relationship. Additionally, and unlike some competing techniques such as nearest-neighbour means, the linear multiple regression model’s parametric restrictions on the residuals protect it from the ‘curse of dimensionality’²⁰. These strengths, alongside its mathematical elegance and interpretational clarity, should ensure its longevity at the heart of statistical modelling.


  1. Normality underlies the F-, t- and \(\chi^{2}\)-distributions of the parameters and errors used in tests of model fidelity.↩︎

  2. Statisticians generally prefer small models to large ones (‘the principle of parsimony’) as they perform better in future situations and are easier to interpret.↩︎

  3. Draper and Smith, “Applied Regression Analysis”, Wiley, 2nd ed., 1981, p240↩︎

  4. Draper and Smith, “Applied Regression Analysis”, Wiley, 2nd ed., 1981, p147↩︎

  5. The name comes from the fact that it is the square of the Pearson correlation between the observed responses and the model’s fitted values.↩︎

  6. Variance(Fitted Values) / Variance(Observed response)↩︎

  7. Krzanowski, “An Introduction to Statistical Modelling”, Arnold, 1st ed., 2002, p80↩︎

  8. see https://data.london.gov.uk/dataset/recorded_crime_summary for a queryable view of the MPS dataset↩︎

  9. In the ONS methodology, Lower Layer Super Output Areas are census units averaging about 125 households↩︎

  10. This reservoir is called a ‘super-population’. Generally, when the analysis concerns a complete population, as here, the observed values are treated as a pseudo-sample from an ideal population in order to retain the power of the statistical inferential apparatus. Doing so allows the census-type dataset to be smoothed and simplified into a model. Aitkin, “Statistical Modelling with R”, OUP, 2009, p29↩︎

  11. The Hat matrix mentioned below is also called the projection matrix↩︎

  12. Alternatively we could use a different link function between response mean and the regressors to achieve a similar effect or even abandon the linearity assumption and fit a spline or some non-linear function of the data. However this is not pursued in the sequel.↩︎

  13. The Box-Cox transformation: \(Y(\lambda) = \frac{Y^{\lambda}-1}{\lambda}\), or \(Y(\lambda) = \log Y\) if \(\lambda = 0\). The power \(\lambda\) which maximizes the likelihood function is typically used.↩︎

  14. ONS Midyear Estimate↩︎

  15. \(-0.0035029 / (-0.0035029 + 0.00295122)\)↩︎

  16. Generally, the mean of the original response is a non-closed-form function of the transformed variable’s median, or may not exist at all, as where a reciprocal transformation is used. However, where f(Y) = log Y is normal, so that Y is a lognormal random variable, the mean of Y is \(e^{\mu + \frac{\sigma^{2}}{2}}\). This resembles the moment generating function of a normal random variable, \(E(e^{tX}) = e^{\mu t + \frac{\sigma^{2} t^{2}}{2}}\), where t is a dummy variable. Aitkin, “Statistical Modelling with R”, Oxford University Press, 2009, p125↩︎

  17. QQ plots are difficult to assess, but a better model fit might be achieved by using different regressions at the bottom and top of the range of IMD scores, so that the response is a spline-like mixture distribution.↩︎

  18. In another study the author conducted a principal components analysis of total London crime by category. This found less than 62% of the total variance in the first two principal components (76% in the first three) with the scores found on each dimension failing to cluster the 11 crime types into distinctive groups↩︎

  19. The crime statistics are an example of knowledge production. All knowledge is the result of a theoretical practice involving the creation and application of conceptual instruments to rank and codify sensual data. The currents of opinion involve different levels of formalisation of these concepts from statute to desk sergeant training and guidance notes. Each theory, along with knowledge, also produces characteristic problems and inconsistencies: which officer is vicariously liable for corporate offences?; where does an online offence take place?; how is an event involving multiple offences and multiple actors coded? See Hindess, “The Use of Official Statistics in Sociology”, Macmillan, 1973↩︎

  20. p multiple independent variables live in a p-dimensional subspace which means data becomes sparse within the hypervolume as p increases and neighbours can only be found at increasing Euclidean distances↩︎