OAKCC: A system for classifying Oakland crime data

Classifying crime types with OAKCC:
Summer 2019 Update

Events during the summer of 2019, described in more detail here, required a good deal of new monitoring and analysis code, especially for the Description and CrimeType fields associated with daily incident records and for how OakCrime.org classifies these into the OAKCC classification hierarchy. The origin and motivation for OAKCC are described here, way back in 2013. This note first summarizes changes in OPD’s reporting of crime types and descriptions across Summer 2019, then describes the classification process used to build the OAKCC system, and concludes with a comparison of OAKCC with OPD’s set of crime types.

Changes in OPD reporting of incidents

Using the same May 24, 2019 date to mark the change from previous reporting to the new system, we can look at the frequencies of crime types and descriptions as used before and after this date. “Chg” reflects the change in the daily average use of these descriptors; values near zero reflect little change. The summary tables show statistics for only those changing dramatically, either increasing or decreasing:
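The “Chg” statistic can be sketched as follows. This is a minimal illustration, not the code used by OakCrime.org: the function name daily_avg_change, the (date, label) input shape, and the choice to count incidents dated May 24 itself as "after" are all assumptions.

```python
from collections import Counter
from datetime import date

# Date marking the switch to the new reporting system (from the text above).
CHANGE_DATE = date(2019, 5, 24)

def daily_avg_change(incidents, start, end, change_date=CHANGE_DATE):
    """Change in daily average use of each label, after minus before.

    incidents: iterable of (date, label) pairs with start <= date <= end.
    Assumption: incidents dated on change_date count as "after".
    """
    days_before = (change_date - start).days      # start .. day before change
    days_after = (end - change_date).days + 1     # change date .. end, inclusive
    if days_before <= 0 or days_after <= 0:
        raise ValueError("change_date must fall inside (start, end]")
    before, after = Counter(), Counter()
    for day, label in incidents:
        (before if day < change_date else after)[label] += 1
    labels = set(before) | set(after)
    # Counter returns 0 for labels missing on one side of the change.
    return {lab: after[lab] / days_after - before[lab] / days_before
            for lab in labels}
```

Dividing by the number of days on each side keeps the two periods comparable even though they differ in length; a label that disappears entirely after the change gets a negative value equal to its earlier daily average.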

There are only 16 crime types in the system used by OPD. Some categories (e.g., SEX_CRIMES) are no longer being used. Use of the OTHER category, and of no crime type at all (none), has increased in the new system.

There were 558 different crime Descriptions used across 2019.  Those showing largest changes before/after May 24 are listed below:

The number of missing descriptions has decreased in the new system.  Much of the volume in new incidents now being reported involves increases in violations of court orders and VANDALISM.  More surprising is the increase in UNEXPLAINED_DEATH.#  There also seems to be a class of domestic violence crimes (BATTERY, INJURY_ON_SPOUSE) now being reported.

It is worth repeating for emphasis: The system in place since May now shows data that  OPD’s dependence on Omega’s software had previously been hiding in these categories.

Building OAKCC

OAKCC is a hierarchy of 71 categories listed here: misc/crimeCatOAK.html. Underbars (e.g., LARCENY_BURGLARY_AUTO) are used to capture broader/narrower categories. The classification process attempts to assign incidents to the most specific category.
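As a small illustration of how the underbar convention encodes the hierarchy, the sketch below recovers the broader categories from a category name. The helper name ancestors is hypothetical, and the sketch assumes every underbar marks one level of the hierarchy, which may not hold for every entry in the actual category list.

```python
def ancestors(category):
    """Return the category followed by its broader categories, most specific first.

    Assumption: each underbar separates one level of the hierarchy.
    """
    parts = category.split("_")
    return ["_".join(parts[:n]) for n in range(len(parts), 0, -1)]

def is_broader(broad, narrow):
    """True if `broad` is a strict ancestor of `narrow` under the same assumption."""
    return broad != narrow and broad in ancestors(narrow)
```

Under this reading, LARCENY_BURGLARY_AUTO sits below LARCENY_BURGLARY, which in turn sits below LARCENY.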

The process of assigning incidents to categories depends on three attributes:

  1. Penal codes (a list of California Penal Code references)

  2. Crime type (a text field)

  3. Description (text field)

Code implementing this simple classification process can be found in util.classify().

Interpreting Penal Codes

The most specific and unambiguous attribute related to a crime incident is its California Penal Code. At present the only incidents for which OPD provides this are those mentioned in the Patrol Logs, posted at Box.com.

Analyzing these has resulted in a mapping from the most frequent penal codes to OAKCC categories. This mapping is implemented by the data table found in misc/penalCode2CrimeCat.csv. The table includes a small number of (what appear to be) “qualifier” penal codes, e.g., PC664 seems to imply an attempted crime. With these exceptions, the code uses the first PC code in the list and classifies it using the data table. If a PC code is not found in that table, a null (empty string) value is returned.
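The lookup described above might be sketched like this. It is not the code in util.classify(): the function name, the in-memory dict standing in for misc/penalCode2CrimeCat.csv, and the example code-to-category pairs are illustrative assumptions. In particular, the text does not say how qualifier codes are handled; this sketch simply skips them and classifies the next code.

```python
# Hypothetical set of qualifier codes; the text mentions only PC664 ("attempted").
QUALIFIER_CODES = {"PC664"}

def classify_penal_codes(pc_list, pc_to_cat, qualifiers=QUALIFIER_CODES):
    """Return the OAKCC category for the first non-qualifier code, or ''.

    pc_list:   penal code strings in the order reported, e.g. ["PC664", "PC487"]
    pc_to_cat: dict standing in for the penal-code-to-category data table
    Assumption: qualifier codes are skipped rather than classified.
    """
    for code in pc_list:
        if code in qualifiers:
            continue
        # Only the first non-qualifier code is consulted; unmapped codes give ''.
        return pc_to_cat.get(code, "")
    return ""

# Illustrative entries only, not copied from the actual data table.
EXAMPLE_TABLE = {
    "PC487": "LARCENY_THEFT_GRAND",
    "PC488": "LARCENY_THEFT_PETTY",
}
```

Returning an empty string for an unmapped code lets a caller fall back to the text-based rules without special-casing a missing value.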

When available, the classification based on penal code is preferred. In the majority of cases, however, assignment is based on two variable text fields.

Using CrimeType and Description attributes

Analysis of years of incidents published by OPD shows great variability in the choice of text in the CrimeType and Description fields. Statistical and lexical analysis of these fields will continue to be an important issue in future work, but preliminary analysis has resulted in a simple heuristic method with a reasonable balance between performance and transparency.

This method uses a data table (found in misc/crimeCatCDMatch.csv) of almost 200 match rules defined across CrimeType, Description, or both. Rules specifying both CrimeType and Description are checked first, then those depending only on Description, and finally those depending only on CrimeType; Description takes precedence over CrimeType because Descriptions are more specific than the generic CrimeType.
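The rule precedence can be sketched as three lookups tried in order. This is a simplified illustration under stated assumptions: rules are taken to match on exact string equality (the real rules may match differently), the rule tuples below are made up rather than read from misc/crimeCatCDMatch.csv, and the function names are hypothetical.

```python
def build_rule_index(rules):
    """Split (crime_type, description, category) rules into three lookup tables.

    An empty crime_type or description means the rule does not constrain that field.
    """
    both, desc_only, type_only = {}, {}, {}
    for crime_type, description, category in rules:
        if crime_type and description:
            both[(crime_type, description)] = category
        elif description:
            desc_only[description] = category
        elif crime_type:
            type_only[crime_type] = category
    return both, desc_only, type_only

def classify_text(crime_type, description, index):
    """Apply rules in precedence order: both fields, Description only, CrimeType only."""
    both, desc_only, type_only = index
    key = (crime_type, description)
    if key in both:
        return both[key]
    if description in desc_only:
        return desc_only[description]
    return type_only.get(crime_type, "")  # '' when no rule matches

# Made-up rules for illustration; not taken from the actual rule table.
EXAMPLE_RULES = [
    ("THEFT", "GRAND THEFT", "LARCENY_THEFT_GRAND"),
    ("", "PETTY THEFT", "LARCENY_THEFT_PETTY"),
    ("THEFT", "", "LARCENY_THEFT"),
]
```

With these example rules, an incident with CrimeType THEFT and an unrecognized Description falls through to the broad LARCENY_THEFT category, while a recognized Description wins regardless of the CrimeType.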

Comparison of OAKCC vs. CrimeType

These methods are able to assign an OAKCC category in about 97% of cases. So what? What’s the hierarchy good for? One way to motivate the system is to contrast it with the CrimeType text field provided directly by OPD. The plot below shows a frequency distribution of the CrimeTypes assigned by OPD using 2014-2019 data. Two categories each have more than 25000 instances. Approximately 15000 incidents are not assigned a CrimeType. But other than these and around 10 other categories, all other categories are used much less frequently.

Contrast this with the distribution generated using OAKCC.

The same/corresponding two categories have the same very large instance sets. Only about 3500 are left unclassified. One important difference is that OAKCC has “spread out” the incidents across more categories with significant counts. Further, the plot below re-orders the categories to reflect their hierarchic organization:

This hierarchic organization will (someday!) support “semantic zooming” into the categories in the OakCrime.org interface.

Changes in OAKCC distributions due to system changes in Summer’19

It’s worth considering how the changed distribution of CrimeType and Description text fields impacts the distribution of incidents according to OAKCC. As above, we consider the daily average frequency of incidents before and after May 24, 2019, and the plot below shows changes by OAKCC category:

Again, most reporting categories have remained stable, with changes near zero. Blue highlights categories that show increases of 2.5 incidents/day or more, and red highlights a few that showed decreases of 0.8 incidents/day or more. As above, domestic violence, court orders and vandalism categories reflect large increases in reporting. The hierarchic relationships help to show trade-offs among related categories, e.g., decreases in the broader category LARCENY_THEFT are balanced by more specific classifications into LARCENY_THEFT_GRAND, LARCENY_THEFT_PETTY, and LARCENY_THEFT_VEHICLE.
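One way to check such a trade-off is to sum the per-category changes across a broader category and everything nested under it; a net change near zero suggests incidents moved between levels rather than appearing or disappearing. The sketch below assumes a dict of daily-average changes keyed by OAKCC category; the function name is hypothetical and the numbers in the usage line are invented purely to show the arithmetic.

```python
def family_net_change(changes, root):
    """Sum daily-average changes over `root` and all categories nested under it.

    changes: dict mapping OAKCC category -> change in incidents/day
    Assumption: nesting is indicated by the `root + "_"` name prefix.
    """
    prefix = root + "_"
    return sum(value for category, value in changes.items()
               if category == root or category.startswith(prefix))

# Invented values, for illustration only.
example_changes = {
    "LARCENY_THEFT": -3.0,
    "LARCENY_THEFT_GRAND": 1.0,
    "LARCENY_THEFT_PETTY": 1.5,
    "LARCENY_THEFT_VEHICLE": 0.5,
    "VANDALISM": 2.5,
}
net = family_net_change(example_changes, "LARCENY_THEFT")
```

In this invented example the decrease in the broad category is exactly offset by the increases in its three narrower categories, so the family's net change is zero.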


Appreciating the semantic distinctions among various crime classifications is critical to interpretation of all crime statistics.   OAKCC demonstrates that it can provide a classification system that is richer and more refined than OPD’s simple CrimeTypes, and the new classification algorithms generated in reaction to the Summer’19 data changes have proven themselves robust.  The next step forward will require specification of Penal Code and UCR codes for all incidents.

#Karen Ivy, working with the 12Y+13X NCPC, also noticed the increase in the UNEXPLAINED_DEATH category, and asked OPD’s Captain Chris Bolton about it:  It is most often used when deceased persons are found in circumstances where cause of death is not immediately known but the elements of murder are not present.  It is also used to document suicides.  By their nature, these classifications of reports are not criminal offenses at all and should be fully excluded.
