This report summarizes the key insights extracted from the variables available in the CoSA Municipal Court InCode Court case management system database extracts.

The analysis below is based on the Citations, Violations, Violations sent to OmniBase, violations with warrants, and violation administrative history data extracts provided by the City of San Antonio Municipal Court. Please refer to the appendix for a description of the data delivery, import, and sub-setting process.

Descriptive analysis

The sections below present the characteristics of the key variables available in the citations and violations data tables, and of the constructed variables. Descriptive statistics are provided for the continuous variables. Where appropriate, continuous and categorical variables, and bivariate group comparisons, are visualized. The table below summarizes the variables available in the citations file. (See Appendix for a discussion of levels of analysis, including citation vs. violation.)

As alluded to above, the unit of analysis for the descriptive statistics above is the citation. Although demographic data is available, the unit of analysis is not the individual, but each specific event in which a written notice (i.e. citation) is issued to an entity (individual) with specific characteristics. With absolute certainty there are duplicated individuals in the citations set, although it is not feasible (in part due to privacy issues) to assess to what extent. This caveat does not invalidate any analysis including demographics; it merely needs to be kept in mind.

Variables in Table 1 are arranged in no particular order (the customary representation is to place continuous variables on top, followed by categorical variables). The demographics of court clients for traffic citations skew young (see visualization below), with a median age of 31 (younger than the median age for San Antonio, which according to census data is 33.8 years, even though the underlying distribution in the court data does not include young children).

Distribution of Citations by age

The histogram shows that the rate of citations skews notably younger (the distribution is also fully congruent with insurance companies’ policy to charge elevated premiums for drivers under the age of 25). The mean age is 34.5 and the median is 31, further illustrating the skew (i.e., half of the citations are associated with individuals between 16 and 31 years of age). The variable “age” has been cleaned by removing observations with typos resulting in impossible values for age (e.g., 0 or negative), and also removing observations with perfectly possible but less plausible values (e.g. ages under 16, or ages over 90 [including values up to 120 currently recorded]). Recommendation: while the number of records affected is not very large, data quality can be improved with implementation of an auto check during data entry: 1) Rejecting impossible values, and 2) providing warnings for possible, but less plausible values.
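
The recommended entry-time check could look roughly like the sketch below. It is a minimal illustration only: the function name, the return format, and the use of 16 and 90 as warning thresholds (borrowed from the cleaning rules described above) are assumptions, not part of the court system.

```python
from typing import Optional, Tuple

# Assumed thresholds, taken from the cleaning rules described in this report.
MIN_PLAUSIBLE_AGE = 16
MAX_PLAUSIBLE_AGE = 90


def check_age(age: Optional[int]) -> Tuple[str, str]:
    """Classify an entered age as 'reject', 'warn', or 'ok' (hypothetical helper)."""
    # 1) Impossible values (missing, zero, or negative) are rejected outright.
    if age is None or age <= 0:
        return "reject", "age must be a positive number"
    # 2) Possible but less plausible values only trigger a warning.
    if age < MIN_PLAUSIBLE_AGE or age > MAX_PLAUSIBLE_AGE:
        return "warn", f"age {age} is possible but unusual; please confirm"
    return "ok", ""


for value in (-3, 0, 12, 31, 120):
    print(value, check_age(value))
```

A warning rather than a rejection for ages such as 12 or 120 keeps legitimate edge cases enterable while still prompting the clerk to confirm them.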

The data allows calculating the age of the vehicle in which the violation was committed (later this is used as a crude proxy for personal income). The median vehicle age in the court data is 8 years (mean 8.6). This is marginally interesting, as it appears to be notably lower than the national average vehicle age of 12.2 years. The importance of this, if any, is unclear, as is whether it reflects behavioral patterns correlated with types of vehicles owned, enforcement patterns, or traffic patterns.

The vast majority of court clients (72%) are cited for one violation, but citations with two violations (20%) are common, as are tickets with three violations (7.8%). The meaning of the number of days between violation date and the citation last change date is unclear (see Appendix); it contains a large number of missing values, and likely only applies to citations with convictions. The violation-level date ranges are much clearer.

Distribution of Citations by race and gender

The chart above summarizes the race information as recorded in the court database for each citation. Records with unknown or missing race were removed, and the categories “Asian”, “Middle-eastern”, and “Native American” were recoded into “Other”. This distribution needs to be interpreted carefully due to the unknown degree of overlap between the categories “White” and “Hispanic” (as most Hispanic individuals are also, and self-identify as, white). For example, the population of San Antonio is 71.4% white and 64.5% Hispanic, i.e. with a fairly small minority of non-Hispanic whites. The proportion of citations in the database issued to Black individuals (10.5%) is higher than the city-wide share of Black residents (6.8%). Acknowledging the implementation difficulties, and the dependence on state and other systems, we recommend exploring a transition of the race data to the two-question format used by the Census.

The gender distribution of court clients in this database is 61.2% male and 38.4% female. The reason for this imbalance is unclear. While recent research has contested the idea that there are significant gender differences in risk perceptions and driving skills, studies on driving behavior have found that the level of concern about risk may differ significantly across genders.

17% of the citations in this extract have been reported to OmniBase. For variation in the likelihood of this outcome, see the next sections. The vast majority of citations (>97%) are issued to drivers residing in Texas. The importance of the “Defendant driving license on file” and “Aggressive driver flag” variables is unclear to the UTSA team, but the distributions are presented for general information.

Violations descriptive statistics, OmniBase holds by race and gender

The first table in this section presents the distribution of some of the variables described above within the violations file. To reinforce the point that it is not people, but citations and violations that are the unit of analysis, Table 2 shows slight discrepancies in the distributions when considered at the violation level. Broadly speaking, the distribution of the categorical variables shows a marginally greater propensity for certain groups (e.g., Black and Hispanic, male) to receive a greater number of violations. The demographic variables were joined from the citations file based on citation number, and the OmniBase flag was joined from the omni table, based on the combination of citation and violation number.

Before presenting several bivariate analyses in more detail, the next table also compares the distribution of these variables conditional on whether or not the violation was reported to OmniBase.

This convenient (though not comprehensive) way to compare characteristics by group allows several quick observations. Black and Hispanic court clients are marginally over-represented in the group of violations reported to OmniBase. OmniBase clients are also younger and poorer (as crudely approximated by vehicle age). There is surprisingly little difference between violations reported to OmniBase and those not reported in terms of whether an attorney was involved in the violation.

Comparison between these two tables shows that the distribution of time to disposition (number of days between violation date and status date) is extremely skewed. While the median time to disposition is 212 days, the mean is 845. Unlike Table 2, Table 3 shows mean times to disposition, and this time is considerably longer for violations reported to OmniBase. This should not be surprising: by definition, violations that did not resolve on the normal process schedule were reported and required client action. Thus mainly the magnitude of the difference is of interest: this iteration of results (i.e. without removing any outliers) suggests that violations reported to OmniBase take on average 44% longer to resolve.

Using the violations file, the table below shows the distribution of OmniBase holds by race. There is some minor, but notable variation: African American court clients are the most over-represented among violations reported to OmniBase, followed by Hispanic clients. While the rate of OmniBase holds for the entire data set is 17.4%, within the sub-group of Black court clients this share is 21%, followed by Hispanic (18%), white (16%), and other (8%).
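
The within-group shares quoted above are simply the percent of violations in each group that carry an OmniBase flag. A minimal sketch of that calculation follows; the field names `race` and `omni` and the toy records are hypothetical and do not reflect the actual extract.

```python
from collections import defaultdict


def hold_rate_by_group(violations, group_key="race", flag_key="omni"):
    """Return the percent of violations with an OmniBase hold in each group."""
    totals = defaultdict(int)
    holds = defaultdict(int)
    for row in violations:
        group = row[group_key]
        totals[group] += 1
        if row[flag_key]:
            holds[group] += 1
    return {group: 100.0 * holds[group] / totals[group] for group in totals}


# Toy records, invented for illustration only.
sample = [
    {"race": "Black", "omni": True},
    {"race": "Black", "omni": False},
    {"race": "White", "omni": False},
    {"race": "White", "omni": False},
    {"race": "White", "omni": True},
    {"race": "White", "omni": False},
]
rates = hold_rate_by_group(sample)
print(rates)
```

The same helper can be reused for gender by passing a different `group_key`.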

When assessing the distribution by gender, however, although male drivers are much more likely to be court clients (68%), there is no observable difference in the rate of OmniBase holds between the sub-groups of male and female court clients. See table below.

Timelines for cases (cases reported to OmniBase only)

Following is an estimate of the timelines between events for violations reported to OmniBase (i.e. the subset of violations sent to OmniBase). Four time variables (length in days) are calculated based on violation date, date of report to OmniBase, date of OmniBase clearance, and date of violation status. The table reports the median, the mean, and the standard deviation of the computed measures. For all variables the distributions are skewed (they contain outliers, which explains the substantial difference between medians and means). 37,476 records do not have a clearance date yet, most likely because these cases are still pending (“Sent to OmniBase” is one of the disposition codes in the data extract).
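
The interval calculation can be sketched as follows. This is an illustration under stated assumptions: the report does not list the exact four date pairs, so the pairs in `INTERVALS`, the field names, and the toy records are all hypothetical. Records with a missing date (e.g. no clearance yet) are skipped for the affected interval only.

```python
from datetime import date
from statistics import mean, median, stdev

# Assumed interval definitions; the report does not spell out the exact pairs.
INTERVALS = {
    "violation_to_sent": ("violation", "sent"),
    "sent_to_cleared": ("sent", "cleared"),
    "cleared_to_status": ("cleared", "status"),
    "violation_to_status": ("violation", "status"),
}


def interval_days(record, start_key, end_key):
    """Days between two dates, or None if either date is missing."""
    start, end = record.get(start_key), record.get(end_key)
    if start is None or end is None:
        return None
    return (end - start).days


def summarize_intervals(records):
    """Median, mean, and standard deviation (in days) for each interval."""
    summary = {}
    for name, (start_key, end_key) in INTERVALS.items():
        days = [interval_days(r, start_key, end_key) for r in records]
        days = [d for d in days if d is not None]
        summary[name] = {
            "n": len(days),
            "median": median(days),
            "mean": mean(days),
            "sd": stdev(days) if len(days) > 1 else 0.0,
        }
    return summary


# Toy records, invented for illustration; the second has no clearance date.
records = [
    {"violation": date(2017, 1, 1), "sent": date(2017, 5, 1),
     "cleared": date(2019, 5, 1), "status": date(2019, 5, 1)},
    {"violation": date(2018, 3, 1), "sent": date(2018, 7, 1),
     "cleared": None, "status": date(2018, 7, 1)},
    {"violation": date(2016, 6, 1), "sent": date(2016, 9, 1),
     "cleared": date(2017, 9, 1), "status": date(2017, 9, 2)},
]
summary = summarize_intervals(records)
for name, stats in summary.items():
    print(name, stats)
```

Reporting the median alongside the mean matters here because a handful of very old cases can pull the mean far above the typical value.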

The median number of days elapsing between violation date and an OmniBase report is 128 days, i.e. approximately 4 months, which appears consistent with the routine court processes including automatic resets, follow ups, etc. However, once reported to OmniBase, cases tend to remain unresolved for very long periods of time: the median time between report to OmniBase and clearance is over two years. This casts some doubt on the effectiveness of the program as a negative incentive, at least as a timely one. OmniBase puts a hold on license renewal (rather than suspending the license). Considering that most driver’s licenses are valid for 10 years, unless one’s driver’s license is about to expire, there is no apparent incentive to rush to resolution, hence the rather lengthy period to do so.

There is virtually no distinction between OmniBase clearance dates and status dates, and there should not be, as the clearance is typically the result of completing court requirements first. Consequently, the overall time to disposition is fairly close to the time between reporting and clearance.

Distribution of OmniBase holds by geography (zip code)

The first map below shows the rate of OmniBase holds per 10k population in each ZIP code, against the backdrop of each ZIP code’s median income. This approach is similar to the analysis implemented in the “Driven by Debt: the Failure of the OmniBase program” report prepared by Texas Appleseed and the Texas Fair Defense Project.

Similar to their findings, we observe a notable, virtually linear pattern in which the rate of OmniBase holds is inversely related to area median income, seemingly supporting the argument that the program represents an especially problematic burden for the poorest segments of the population. This seems to be further supported by the almost linear relationship between ZIP code median income and rate of OmniBase holds on the scatter plot following the map.

We used the word “seemingly” because we find the measure methodologically objectionable. Simply calculating the rate of OmniBase holds per population ignores the obvious detail that the rate of OmniBase holds will be related to the overall rate of violations among residents of a given area. The rate of violations is not uniform across city areas. Indeed, it will be related to both income and rate of OmniBase holds, producing a spurious association between the two: more intense traffic and a greater number of traffic violations are much more likely in inner city areas (which also tend to be poorer) than on serene sub-division roads (which also tend to be more affluent). Perhaps one reason for the use of such a suboptimal measure of OmniBase burden has been the lack of violation-level court data like that used in the present analysis.

Accordingly, we propose a more valid measure that avoids the spurious effect of variation in the rate of violations: the percent of violations in each ZIP code that were reported to OmniBase.
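
The difference between the two measures can be made concrete with a small sketch. The ZIP labels and counts below are invented for illustration: both areas have the same population and the same share of violations reported to OmniBase, but one has far more violations overall.

```python
def omni_measures(zip_rows):
    """Compute both measures of OmniBase burden for each ZIP code."""
    measures = {}
    for zip_code, row in zip_rows.items():
        measures[zip_code] = {
            # Measure used in prior work: holds relative to population.
            "holds_per_10k": 10_000 * row["holds"] / row["population"],
            # Measure proposed here: holds relative to violations.
            "pct_violations_held": 100 * row["holds"] / row["violations"],
        }
    return measures


# Hypothetical ZIP codes with made-up counts.
zip_rows = {
    "zip_a": {"population": 20_000, "violations": 5_000, "holds": 900},
    "zip_b": {"population": 20_000, "violations": 1_500, "holds": 270},
}
measures = omni_measures(zip_rows)
for zip_code, values in measures.items():
    print(zip_code, values)
```

In this toy case the per-population rate is more than three times higher in `zip_a`, yet the percent of violations reported is identical, which is exactly the distortion the proposed measure is meant to remove.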

Rate of OmniBase holds and median household income by ZIP code (all San Antonio ZIP codes)

Plot of ZIP code median income vs. rate of OmniBase holds

The scatter plot above reinforces the pattern suggested by the ZIP code map: there is a notable, virtually linear inverse relationship between income and rate of OmniBase holds.

However, as noted above, this approach is methodologically questionable because it does not control for the rate of violations. The rate of violations is necessarily related to the rate of OmniBase holds, even if all else is equal, and it is also likely related to area income: poorer inner city areas are more traffic heavy and will necessarily experience a higher rate of violations (and attendant OmniBase holds) than less traffic dense, more affluent areas in the outskirts.

To remedy this problem, below we repeat the analysis by replacing “rate of OmniBase holds per population” with “Percent of violations subjected to OmniBase holds”, which eliminates the spurious effect of rate of violations across different areas.

Percent of cases with OmniBase holds and median household income by ZIP code (all San Antonio ZIP codes)

Unlike the previous map showing the rate of OmniBase holds per population, the map below plots the percent of violations with OmniBase holds. Careful examination still demonstrates some degree of connection: at least in the most affluent areas, the percentage of OmniBase cases appears smaller than in all other areas. However, the relationship is not nearly as clear as when rate of OmniBase holds per population is used (as above). Rather than linear, the pattern appears bifurcated: no apparent trend in the lower range of incomes, but a drop (in OmniBase cases) in the most affluent areas. To further examine the nascent pattern, we also provide a plot of the two variables (income and percent of cases with OmniBase holds) below.

Plot of ZIP code median income vs. percent of cases with OmniBase holds

The plot reinforces and clarifies the nature of the somewhat muted relationship implicit in the ZIP code map. It shows that there is no discernible relationship between income and percent of OmniBase holds until an area reaches a median income of about $60,000. Only above that level of income does an inverse relationship between the two variables become notable.

This necessitates introducing some further nuance when contemplating the consequences of the OmniBase program. It is generally assumed that its burdens fall most heavily on the poorest and most vulnerable segments of the population. The results presented here suggest this concern might be somewhat exaggerated, insofar as areas with incomes around the area median income (i.e. not impoverished by definition) show a percent of violations referred to OmniBase very similar to that seen in poorer areas.

One interpretation is that a traffic violation and the possible attendant fine represent a non-trivial challenge (financial or otherwise, as a disruption) for most individuals or families, including what is approximately considered middle class. Whether for financial reasons or because of competing priorities and distractions, in all areas with median incomes from $20,000 to $60,000, a stable 15%-20% of violations remain unattended and accordingly get reported to OmniBase. Only after the ZIP code median income surpasses ~$60,000 does the percent of violations reported to OmniBase begin to drop. Even so, it should be noted that most of even the most affluent areas show a non-trivial percent (10%+) of violations sent to OmniBase. This suggests that the implications of the program are not only financial: a certain proportion of the population shows a propensity to not prioritize case resolution regardless of the degree of financial constraints, although certainly more so in lower income areas.

Vehicle age by OmniBase status

The above comparisons were made using aggregated Census data to estimate incomes in different ZIP codes in the city. While this is a sound and widely used approach, its main limitation is that it relies on aggregate data to produce estimates at the individual level. This is necessitated in part because there is no way to obtain income data at the individual violation level.

However, the data set contains information on the vehicle subject to citation, including year of manufacture. We propose using this variable as a reliable (aside from data entry errors) and somewhat valid crude proxy for income in the aggregate. While there is wide variation, as a general trend more affluent individuals and households will drive newer (and likely more expensive, even when of the same vintage) vehicles. Existing studies have established a linear relationship between income and average vehicle age. For example, data gathered by the Federal Highway Administration in 2017 (below) show that a 1 year difference in average vehicle age might reflect up to a ~$25,000 difference in household income.

This is confirmed in the CoSA Municipal Court data set. The mean difference in vehicle age between citations reported to OmniBase and those not reported is approximately 1 year, i.e. court clients operating older vehicles are more likely to neglect their cases and experience an OmniBase report. While it cannot be exactly determined what difference in income this difference in vehicle age signifies, and 1 year may not sound like much, per the study above it may approximate an income difference of up to $25,000. (NOTE: vehicles with model years prior to 1973 were removed from this analysis; this cutoff was somewhat arbitrary, chosen on the assumption that it mostly retains vehicles that are realistically still used for commuting or working, as opposed to older and more exotic vehicles that might show an inverse correlation with income.)
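
A minimal sketch of this comparison is shown below, assuming vehicle age is computed as violation year minus model year (the report does not state the exact formula). The field names and toy rows are hypothetical; only the 1973 cutoff comes from the note above.

```python
from statistics import mean

MIN_MODEL_YEAR = 1973  # cutoff stated in the note above


def mean_vehicle_age_gap(rows):
    """Mean vehicle age of OmniBase-reported rows minus that of the rest."""
    ages = {True: [], False: []}
    for row in rows:
        # Drop vehicles with model years before the cutoff.
        if row["model_year"] < MIN_MODEL_YEAR:
            continue
        # Assumed definition of vehicle age at the time of the violation.
        ages[row["omni"]].append(row["violation_year"] - row["model_year"])
    return mean(ages[True]) - mean(ages[False])


# Toy rows, invented for illustration only.
rows = [
    {"omni": True, "violation_year": 2018, "model_year": 2008},
    {"omni": True, "violation_year": 2019, "model_year": 2011},
    {"omni": True, "violation_year": 2018, "model_year": 1965},  # dropped
    {"omni": False, "violation_year": 2018, "model_year": 2010},
    {"omni": False, "violation_year": 2019, "model_year": 2011},
]
gap = mean_vehicle_age_gap(rows)
print(gap)
```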

Further, as often acknowledged but rarely explored, means and medians say little about the underlying distribution of the variable of interest. The plot below presents boxplots (medians and interquartile ranges) combined with density distributions of vehicle age. The distributions are interesting. In both cases they are bimodal; however, for the violations not reported to OmniBase they skew toward newer vehicles, while for the reported violations the second “hump” of the distribution is larger, i.e. older vehicles are more prevalent in spite of the similarity in medians.

Distributions by disposition and OmniBase reports

As is typical in analysis of administrative databases, the municipal court client management system contains a very large number of categories, considerably more than would be practical to analyze: there are 76 distinct disposition codes and 359 distinct offense codes. Accordingly, some cutoff points were necessary. For dispositions, we selected all dispositions accounting for more than 0.1% of all disposition codes; this reduced the total to 12 categories accounting for 98.5% of all violations (plus a 13th category, “Other”). For offenses, we selected all offenses representing more than 1% of all offenses, which resulted in 13 categories accounting for 87% of all offenses.
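
The cutoff rule can be sketched as a simple share-based recode. The function name, the toy codes, and their counts are invented for illustration; the 1% threshold matches the one described for offenses.

```python
from collections import Counter


def lump_rare(codes, min_share):
    """Recode every category whose share is not above min_share into 'Other'."""
    counts = Counter(codes)
    total = len(codes)
    keep = {code for code, n in counts.items() if n / total > min_share}
    return [code if code in keep else "Other" for code in codes]


# Toy codes: two rare categories fall below the 1% threshold.
codes = ["CL"] * 600 + ["DA"] * 300 + ["D2"] * 95 + ["X1"] * 3 + ["X2"] * 2
lumped = lump_rare(codes, min_share=0.01)
print(Counter(lumped))
```

Computing the share from the full list before recoding ensures the threshold is applied to the original distribution rather than to a partially lumped one.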

Below we present the distribution of the top dispositions as well as the distribution of dispositions by OmniBase report status. The same analysis is repeated for offense types.

Detailed interpretation of this breakdown should be left to subject matter experts. It appears that violations reported to OmniBase show a higher proportion of closed cases and a lower proportion of cases disposed of via alternative means. For example, cases reported to OmniBase are almost half as likely to be dismissed after a probationary period, and about half as likely to be dismissed for completing requirements. The option to dismiss the case after completing a driver safety course seems virtually unavailable by definition to such cases. However, they are more than twice as likely to involve a court appearance for a plea. This finding appears highly congruent with the Court’s aspiration to use OmniBase not punitively, but as a means to induce case resolution, i.e. for respondents to take action, including appearing back in court for a plea.

Distributions by offense type and OmniBase reports

The overall distribution of the (top) offenses and the distribution by OmniBase report status are presented below.

The distribution is also (numerically) presented below, contingent on OmniBase report. A close look at the table reveals a potentially very interesting, important, and actionable insight.

For the very top offenses (speeding and driving without a license), the proportion of these offenses within the OmniBase-reported group is notably lower. These are different offenses, but part of our interpretation is that such offenses (in particular speeding) are what the public generally associates with a traffic ticket: a known, common, agreed-upon violation. Accordingly, it is not surprising that they are resolved at a higher rate. Perhaps to a somewhat lesser degree, the same reasoning applies to driving without a license, or speeding in a school zone.

However, we also see a distinct set of offenses demonstrating a diverging pattern, i.e. their share within the group reported to OmniBase is notably higher than within the group not reported. These include driving without proof of insurance and failure to display registration. We suggest that the notably higher share of such violations being reported to OmniBase may be at least partially explained by the fact that, in individual and public perception, such infractions are not regarded as “real offenses” (as opposed to speeding etc.), resulting in a lower inclination to resolve the case.

A third category, interesting in its own right, is “driving while license invalid”: the share of such violations is 3 times as high in the group reported to OmniBase, for what should be obvious reasons: if a citizen is already indifferent to operating a vehicle with an invalid license, preventing renewal of an already expired license is by itself not likely to prompt action to resolve the case.

Appendix

Data provided by CoSA Municipal court

The UTSA team received five tables from the Municipal Court team, with the following total number of records (prior to any subsetting):

  • Citations (n=690,103)
  • Violations (n=904,035)
  • Warrants (n=641,892)
  • OmniBase reports (n=141,104)
  • (Case) History (n=27,547,045)

Traffic violations represent 89.6% of all the violations in the provided extract. As agreed with the court, this iteration of the analysis was to focus on traffic violations only; accordingly, non-traffic violations were removed, together with all associated records from the other tables. After removing the non-traffic violations from the violations file and all records linked to them by citation number in the omni, warrants, history, and citations data, the working set of observations is as listed below:

  • Citations (n=596,299)
  • Violations (n=810,166)
  • Warrants (n=572,510)
  • OmniBase reports (n=140,667)
  • (Case) History (n=24,425,084)

Further subsetting from the citations file includes: A) Dropped 1,934 records with missing data for race; B) Dropped 638 records with missing age (missing values for age partly result from recoding into missing any values under 16 or over 90; although there are plausible scenarios where lower and higher ages are correct, the rarity of such scenarios does not justify the possible biases such outliers may introduce); C) Dropped 831 observations with missing data for gender; D) Dropped 641 observations with missing values for state; E) Dropped the 4 remaining observations dated prior to 2001. All of the above reduces the final number of citations to 592,750. The final record counts after removing entries linked by citation number from the rest of the files are as follows:

  • Citations (n=592,750)
  • Violations (n=805,100)
  • Warrants (n=569,556)
  • OmniBase reports (n=140,176)
  • (Case) History (n=24,284,126)
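
The sequential subsetting described above, with a running log of how many records each step drops, could be sketched as follows. The helper names, field names, and toy rows are hypothetical; the filters mirror a subset of the steps listed above.

```python
def apply_filters(rows, steps):
    """Apply filters in order, logging (label, dropped, remaining) per step."""
    log = []
    for label, keep in steps:
        before = len(rows)
        rows = [row for row in rows if keep(row)]
        log.append((label, before - len(rows), len(rows)))
    return rows, log


def keep_linked(rows, citation_numbers, key="citation"):
    """Keep only rows in a linked table whose citation number survived."""
    return [row for row in rows if row[key] in citation_numbers]


# Toy citations and violations, invented for illustration only.
citations = [
    {"citation": "C1", "race": "White", "age": 31, "year": 2017},
    {"citation": "C2", "race": None, "age": 45, "year": 2018},
    {"citation": "C3", "race": "Black", "age": 12, "year": 2019},
    {"citation": "C4", "race": "Hispanic", "age": 27, "year": 1982},
]
violations = [{"citation": c} for c in ["C1", "C1", "C2", "C3", "C4"]]

steps = [
    ("missing race", lambda r: r["race"] is not None),
    ("age outside 16-90", lambda r: r["age"] is not None and 16 <= r["age"] <= 90),
    ("before 2001", lambda r: r["year"] >= 2001),
]
kept, log = apply_filters(citations, steps)
linked = keep_linked(violations, {row["citation"] for row in kept})
for entry in log:
    print(entry)
print(len(kept), len(linked))
```

Because the steps are applied in order, each logged count refers to records remaining after the earlier steps, so the per-step counts depend on the order chosen.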

This process is also summarized on the flow chart below.

The data extraction process has not been explicitly documented, but per UTSA team’s understanding based on discussion during meetings the extract represents cases that have received some type of resolution/disposition during the past 5 years, regardless of the original violation date. As a result, the data set features citations going as far back as 2001, with a couple of outliers going back to 1972 and 1982 respectively (both cases are non-traffic violations - violation of water conservation rules and petty theft respectively), see chart below.

NOTE: This distribution of citations by year in the table above and the chart below includes all original records, prior to any subsetting. All other analyses presented in this report are based on the subset of data as described in the beginning of this section.

We also provide a bar chart visualizing the distribution of citations by year. Per the chart below, the “laggard” cases, i.e. cases with unusually long time to resolution, are likely best defined as those with violation dates in 2015 and before.

In this extract, the year with peak citations is 2017 (n=151,349). The data also shows a major drop in citations in 2020 (COVID and associated restrictions leading to a major drop in mobility): n=54,470, down from 115,633 in 2019. The number in 2021 appears exceptionally small (n=18,064). This is most likely due to recency, as recent cases by definition have had less time to reach resolution. Secondarily, while some COVID restrictions and reduced activity continued in 2021, the extent to which this accounts for the lower case number versus recency is unclear.

NOTE: Following this exploratory distribution, all records with violation date prior to 2001 were removed. However, further exclusion may be appropriate, as discussed below.

The court provided a table listing all violations sent to OmniBase. When reviewed by year, the data suggest sporadic maintenance and/or implementation in the early years.

Choice of geography

Geospatial analysis at the ZIP code level is by far the most common. Many agencies collect data down to the ZIP code level, the ZIP code geographies are stable, and data is easily retrievable at all levels (e.g., from local, to county, to state, to national). These practical advantages are partially diminished by shortcomings. Most notable among them are that 1) ZIP codes were primarily designed as mail route areas (rather than natural geographical groupings), and, relatedly, 2) there is significant variation in socioeconomic conditions within ZIP codes, bringing the possibility of aggregation bias.

In general, smaller aggregation areas such as census tracts are preferable for geospatial analysis, but there are some difficulties that currently preclude wider application. The main one is that few agencies explicitly collect census tract level flags. This requires an extra step, which is to cross-walk coordinates or addresses and link them to a census tract. The UTSA team initially considered conducting the geospatial analysis at the census tract level, which proved infeasible within the scope of the project as 1) the street address data contains a lot of “noise” (a lot of observations would need to be discarded due to incomplete, contaminated, or missing addresses) and 2) the existing capabilities for batch processing addresses and cross-walking them with census tracts require either substantial computing time or paid services.

A final reason to use ZIP codes was that the legal boundaries of cities are generally far too complicated to allow clear delineation of city versus non-city observations (e.g. locations that are legally outside of the boundaries of the city might nonetheless be part of an organic neighborhood or community best understood as part of the city). In the San Antonio case, this includes the multitude of unincorporated cities within the metropolitan boundaries. For reference, the legal outline of the City of San Antonio is presented below.

For all these reasons the UTSA team settled on using the ZIP code level for court data aggregation and for merging with census data, consistent with prevailing practice. This involved selecting all ZIP codes beginning with “782”, which resulted in a study area that approximates the boundaries of the City of San Antonio reasonably well, while excluding the most rural areas found in Bexar County.
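
The prefix-based selection can be sketched as below. The function name and the handling of ZIP+4 values and blanks are assumptions added for illustration; the report only specifies the “782” prefix.

```python
def in_study_area(zip_code, prefix="782"):
    """True if the first five characters form a 5-digit ZIP with the prefix."""
    # Keep only the 5-digit part so that ZIP+4 values are handled as well.
    digits = str(zip_code).strip()[:5]
    return len(digits) == 5 and digits.isdigit() and digits.startswith(prefix)


for value in ("78205", "78249-1234", "78130", "", 78216):
    print(repr(value), in_study_area(value))
```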

The interactive map below presents the final area, including selected (computed) court-level and census variables. NOTE: additional variables can be added upon request.

Interactive map of San Antonio Zip codes (click on a ZIP code to see list of variable values)

Date ranges, case studies and (non)utilization of the History file

The main dates used in the analysis presented above are the original violation date, the violation status date, and the violation last changed date; further, sent date and cleared date from the OmniBase table were used. No further dates (e.g. from the history file) have been used in this iteration of the analysis, as explained below.

For this iteration of the analysis the History file was extremely helpful, though mainly as a source of anecdotes and case studies useful for understanding some of the court and client outcomes. First, the data contains multiple entries with administrative significance, but of limited substantive/analytical utility. Second, the main outcomes of interest (e.g., OmniBase reports, report and clearance dates, violation dates and last violation status) were readily available in the other data tables provided. Third, it is unclear to what extent the history file consistently tracks case status, especially for old cases, where we encountered significant periods with no activity (see below). Nonetheless, this is a rich data source and the UTSA team would appreciate further planning discussion to potentially take advantage of this data.

The UTSA team reviewed several cases, in particular old ones, to gain a better understanding of the nature of the data. For example, citation number L675430 records two violations in 2001 (driving without license and insurance). The violation status is DP, updated in April of 2020. The case history for this file contains 30 entries. The first two entries are from the violation year (2001) and are “warrant issued” (there is no record of actions that may have preceded the warrant). The rest of the entries are from 2020, i.e. there is no case history activity for this citation between 2001 and 2020. We have not made a systematic attempt to evaluate the prevalence of such cases with extended periods of no recorded activity (an interesting question, but outside of the scope of the current analysis). The most recent StatusChangeDate is identical to the ViolationStatusDate in the violations file.
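
Should a systematic evaluation of such inactivity be undertaken later, one possible starting point is the longest gap between consecutive history entries for each citation. The sketch below assumes the history dates for a citation are available as a list; the function name and the dates are invented and do not reproduce any actual case.

```python
from datetime import date


def longest_gap_days(entry_dates):
    """Longest number of days between consecutive history entries."""
    ordered = sorted(entry_dates)
    gaps = ((later - earlier).days for earlier, later in zip(ordered, ordered[1:]))
    # With fewer than two entries there is no gap to measure.
    return max(gaps, default=0)


# Hypothetical history dates: two early entries, then activity years later.
history = [date(2001, 3, 1), date(2001, 3, 15), date(2020, 4, 1), date(2020, 4, 20)]
gap = longest_gap_days(history)
print(gap, "days, roughly", round(gap / 365.25, 1), "years")
```

Citations could then be flagged when the longest gap exceeds a chosen threshold, such as several years.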

Citation L675453 (driving without license plate and insurance) is similar in that it shows no activity between the violation date (in 2001) and 2020 in the history file. However, this is a citation that was reported to OmniBase. The most recent date in the history file is identical to the cleared date in the omni table; however, the history entries for this citation do not list the date when the information was sent to OmniBase. This date is available in the omni file, in the SentDate field.

Finally, citation S854366 (from 2017, speeding) also shows 30 history entries. The case is resolved (dismissed after deferred disposition) within ~11 months, including OmniBase report and clearance. There is a small discrepancy, likely normal and due to administrative lag, between the dates of sending the OmniBase report (12 April 2017 in the history file, 21 April 2017 in the omni file); the OmniBase clearance date is the same in both files (14 June 2017). The congruence between the dates, in addition to the greater ease of use, justified using the dates present in the omni file. Another consideration was that the OmniBase reports and clearances are not uniformly recorded in the history file (variation in the strings and descriptions used).

Dates and levels of analysis: citation vs violation level of analysis

The analyses presented were performed at either the citation or the violation level, as appropriate. For example, basic demographics were analyzed at the citation level (and even then it is important to emphasize that the unit of analysis is citations, not people), while other variables were analyzed at the violation level. In particular, it is impractical to merge status and date from the violations file into the citations file. While in most cases citations with multiple violations get resolved at once (i.e. the violation status and date for the different violations will be the same), there is a sizable group for which this is not the case. As we established above, the working set of citations is 592,750. Ideally, retrieving a list of distinct citation number and violation status combinations would result in the same number; however, this is not the case, i.e. there is a lack of uniformity in dispositions within some cases/citations. There are 641,243 distinct citation number-violation status combinations in the violations file, i.e. a sizable number of citations with more than one violation have different dispositions at a given point in time. The majority of the distinct status combinations are CL and DA (case closed and case dismissed after probationary period), but not exclusively so (there are other combinations, including D2 [dismissed after completing driver safety course] and JC [judgement/conviction] and many others). For example, citation S984735 has two violations (driving with invalid license and driving while using handheld communication device); the former violation is recorded as case closed in June 2018, while the latter is recorded as “dismissed after deferred disposition” in July 2018. Thus we conduct some of the analysis at the violation level, rather than the citation level (as above).

The same problem applies, to a slightly greater extent, to distinct citation number-violation status date combinations (n=648,983), reflecting that a number of violations on the same citation resolve at different times.
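
The distinct-combination check described above can be sketched as follows. The citation numbers in the toy data are invented; the status codes CL, DA, and JC are those mentioned in the text.

```python
from collections import defaultdict


def count_status_combinations(violations):
    """Compare distinct citations with distinct (citation, status) pairs."""
    statuses = defaultdict(set)
    for citation, status in violations:
        statuses[citation].add(status)
    n_citations = len(statuses)
    n_pairs = sum(len(values) for values in statuses.values())
    mixed = sorted(c for c, values in statuses.items() if len(values) > 1)
    return n_citations, n_pairs, mixed


# Toy (citation, status) rows, invented for illustration only.
violations = [
    ("A1", "CL"), ("A1", "CL"),   # two violations resolved the same way
    ("A2", "CL"), ("A2", "DA"),   # two violations with different dispositions
    ("A3", "JC"),
]
n_citations, n_pairs, mixed = count_status_combinations(violations)
print(n_citations, n_pairs, mixed)
```

Whenever the number of distinct pairs exceeds the number of distinct citations, at least one citation has violations with differing statuses, which is the situation reported for the court data.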

The variable whose distinct values in combination with citation numbers produce the closest resemblance to the original number of citation numbers is whether an attorney is involved in the case (n=596,696). It is understandable that in the vast majority of cases an attorney will represent a client on all violations in a ticket, and that only rarely will an attorney be involved with one of the violations but not another; at least this is what the data suggests.

A final reason to proceed at the violation level is that while in the citations file all citations have entries for the original violation date, there are 391,813 missing values in the field LastChangeDate in the citations file. The source of this problem is unclear, as all violations and citations in the data extract have both some type of status/disposition and a disposition date. Rather than missing, most of these 391,813 entries are recorded as ‘0’. In the aforementioned case of citation S984735, for example, the value of CitationLastChangedDate is 0, while both violations have properly entered ViolationStatusDate dates in the violations file. We suspect the reason for the discrepancy may be that the citations LastChangedDate field may apply only to violations with conviction dates, and not other types of disposition.
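
When importing such a field, values of ‘0’ need to be treated as missing rather than parsed as dates. A minimal sketch follows; the function name and the `YYYY-MM-DD` date format are assumptions, since the storage format of the field is not documented here.

```python
from datetime import date, datetime
from typing import Optional


def parse_last_changed(raw) -> Optional[date]:
    """Parse a last-changed value, treating blanks and '0' as missing."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in ("", "0"):
        return None
    # Assumed date format; adjust to the format actually used in the extract.
    return datetime.strptime(text, "%Y-%m-%d").date()


values = ["0", 0, "", None, "2018-07-16"]
parsed = [parse_last_changed(v) for v in values]
n_missing = sum(p is None for p in parsed)
print(parsed, n_missing)
```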