CrimeCatOAK: Summer 2019 Update
We have developed a local, Oakland PD-specific crime classification system called simply CrimeCatOAK. The origin of and motivation for this system are described here, way back in 2013.
Events during the summer of 2019 (described in more detail here) wound up requiring a bunch of new monitoring and analysis code, especially of the Description and CrimeType fields associated with daily incident records and how these were being classified by OakCrime.org into the CrimeCatOAK classification hierarchy. This note summarizes features of the new classification system developed from this work and now in use, and compares it to an alternative.
CrimeCatOAK is a hierarchy of 71 categories listed here. The process of assigning incidents to categories depends on three attributes:
1. Penal codes (a list of California Penal Code references)
2. Crime type (a text field)
3. Description (a text field)
Code implementing this simple classification process is available here.
Interpreting Penal Codes
The most specific and unambiguous attribute related to a crime incident is its California Penal Code. At present the only incidents for which OPD provides this are those mentioned in the Patrol Logs, posted at Box.com.
Analyzing these has resulted in a mapping from the most frequent penal codes to CrimeCatOAK categories. This mapping is implemented by the data table found in misc/penalCode2CrimeCat.csv. This includes a small number of (what appear to be) “qualifier” penal codes, e.g., PC664 seems to imply an attempted crime. With these exceptions, the code uses the first PC code in the list and classifies it using the data table. If a PC code is not found in that table, a null (empty string) value is returned.
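The lookup described above can be sketched roughly as follows. This is a minimal illustration, not the OakCrime.org code: the dictionary stands in for misc/penalCode2CrimeCat.csv (whose column layout is not shown here), the category labels are placeholders, the function name classify_penal_codes is hypothetical, and skipping qualifier codes to reach the first substantive code is an assumption about how the "exceptions" are handled.

```python
# Hypothetical excerpt standing in for misc/penalCode2CrimeCat.csv.
# The category labels are placeholders, not actual CrimeCatOAK names.
PC_TO_CATEGORY = {
    "PC211": "ROBBERY",
    "PC459": "BURGLARY",
}

# Codes treated as qualifiers rather than offenses
# (PC664 appears to mean "attempted").
QUALIFIER_CODES = {"PC664"}


def classify_penal_codes(pc_list):
    """Classify an incident from its list of penal codes.

    Assumption: qualifier codes are skipped, and the first remaining
    code is looked up in the table. Returns "" when that code is not
    in the table or when no substantive code is present.
    """
    for code in pc_list:
        if code in QUALIFIER_CODES:
            continue
        # Only the first non-qualifier code is consulted.
        return PC_TO_CATEGORY.get(code, "")
    return ""
```

For example, under these assumptions classify_penal_codes(["PC664", "PC211"]) skips the qualifier and classifies on PC211, while a list whose first substantive code is missing from the table yields the empty string.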
When available, the classification based on penal code is preferred. In the majority of cases, however, assignment is based on two variable text fields.
Using CrimeType and Description attributes
Analysis of years of incidents published by OPD shows great variability in the text used in the CrimeType and Description fields. Statistical and lexical analysis of these fields will remain an important part of future work, but preliminary analysis has resulted in a simple heuristic method with a reasonable balance between performance and transparency.
This method uses a data table (found in misc/crimeCatCDMatch.csv) of almost 200 match rules defined across CrimeType, Description, or both. Rules specifying both CrimeType and Description are checked first, then those depending only on Description, and finally those depending only on CrimeType. Description takes precedence over CrimeType because the Descriptions are in general more specific than the generic CrimeType.
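The three-tier precedence can be sketched as below. Again this is only an illustration under stated assumptions: the MatchRule structure, the sample rules, the category labels, and the function name classify_text are all hypothetical, and exact matching on upper-cased, stripped text is assumed because the actual matching logic and the layout of misc/crimeCatCDMatch.csv are not described here.

```python
from typing import NamedTuple, Optional


class MatchRule(NamedTuple):
    crime_type: Optional[str]   # None means the rule ignores CrimeType
    description: Optional[str]  # None means the rule ignores Description
    category: str


# Hypothetical sample rules; the real table has almost 200.
RULES = [
    MatchRule("OTHER", "VANDALISM", "VANDALISM"),
    MatchRule(None, "STOLEN VEHICLE", "LARCENY_VEHICLE"),
    MatchRule("ROBBERY", None, "ROBBERY"),
]


def _norm(text):
    """Normalize a field for comparison (assumed: trim and upper-case)."""
    return (text or "").strip().upper()


def classify_text(crime_type, description, rules=RULES):
    """Return the category of the highest-precedence matching rule, or ""."""
    ct, desc = _norm(crime_type), _norm(description)

    # Tier 1: rules that specify both CrimeType and Description.
    for r in rules:
        if r.crime_type is not None and r.description is not None:
            if r.crime_type == ct and r.description == desc:
                return r.category

    # Tier 2: rules that depend only on Description.
    for r in rules:
        if r.crime_type is None and r.description == desc:
            return r.category

    # Tier 3: rules that depend only on CrimeType.
    for r in rules:
        if r.description is None and r.crime_type == ct:
            return r.category

    return ""
```

With these sample rules, an incident with CrimeType "ROBBERY" and Description "STOLEN VEHICLE" is assigned by the Description-only rule, because that tier is consulted before the CrimeType-only tier.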
Comparison of CrimeCatOAK vs. CrimeType
These methods are able to assign a CrimeCatOAK category in about 97% of cases. So what? What’s the hierarchy good for? One way to motivate the system is to contrast it with the CrimeType text field provided directly by OPD. The plot below shows a frequency distribution of the CrimeTypes assigned by OPD using 2014-2019 data. Two categories each have more than 25000 instances. Approximately 15000 incidents are not assigned a CrimeType. But other than these and around 10 other categories, all remaining categories are used much less frequently.
Contrast this with the distribution generated using CrimeCatOAK.
The two corresponding categories have the same very large instance sets. Only about 3500 incidents are left unclassified. One important difference is that, by relying on penal codes when available along with both the CrimeType and Description text fields, CrimeCatOAK has “spread out” the incidents across more categories with significant frequencies. Further, the plot below re-orders the categories to reflect their hierarchic organization:
This hierarchic organization will support “semantic zooming” into the categories by OakCrime.org.