I keep thinking the situation surrounding Oakland publishing its crime data will improve and my efforts to sweep up the data from various sources will become obsolete, but that does not seem likely in the foreseeable future. And between the neighborhoods that are contracting for their own private patrols, other neighborhoods installing their own sets of cameras, and the Domain Awareness Center (DAC) proposing to watch over us all, there has certainly been increased interest from Oakland citizens.
So please find below a new, improved version of Oakland crime data, 2007-present. If you poke around the rest of the “OPD crime statistics” pages (i.e., the parent of this page and its siblings) you’ll find lots of gory details regarding the construction of the data set, the hierarchical crime classification system that organizes it, etc. But assuming most people just want the bare minimum to get and understand the data, I’ll summarize here:
This data set is approximately the union of data from three sources:
- OPD1: A retrospective data set covering 2007-2012, provided by Oakland’s Department of IT (DIT) in March 2013.
- USC: Under a contract with the City of Oakland, the Urban Strategies Council has a data set covering the years 2003-2011 that is now available via OpenOakland.
- PRR: Some clever citizen filed a Public Records Request, and (thanks to OpenOakland’s fantastic RecordTrac facility!) we can all benefit from the data the city provided in response.
In a perfect world, there would be “unique IDs” and parallel sets of attributes associated with all these data that would allow a direct merge; we do not live in a perfect world. All data sources do have one critically important field, CASENUMBER (aka RD), that is used as the primary basis for merging data from two sources. But even within a single source, the same CASENUMBER is often associated with different dates/times, different addresses, different beats, etc.
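To make the idea concrete, here is a minimal sketch of merging on CASENUMBER. It is not the actual merging code: the function name, the "first non-empty value wins" rule, the field names other than CASENUMBER, and the sample records are all assumptions made up for illustration.

```python
from collections import defaultdict


def merge_by_casenumber(sources):
    """Merge records from several sources, keyed on CASENUMBER.

    `sources` is a list of (source_name, records) pairs in priority order.
    Each record is a dict with a "CASENUMBER" key. For every other field,
    the first non-empty value seen wins (an assumed rule, chosen for
    simplicity); later values that disagree are recorded as conflicts.
    """
    merged = {}
    conflicts = defaultdict(list)
    for name, records in sources:
        for rec in records:
            case = rec["CASENUMBER"]
            target = merged.setdefault(case, {"CASENUMBER": case})
            for field, value in rec.items():
                if field == "CASENUMBER" or value in (None, ""):
                    continue
                if field not in target:
                    target[field] = value
                elif target[field] != value:
                    conflicts[case].append((field, name, value))
    return merged, dict(conflicts)


# Made-up records, only to show the shape of the inputs.
opd1 = [{"CASENUMBER": "07-000001", "Date": "2007-01-01", "Beat": "12X"}]
usc = [
    {"CASENUMBER": "07-000001", "Date": "2007-01-02", "Addr": "100 MAIN ST"},
    {"CASENUMBER": "07-000002", "Date": "2007-01-03", "Beat": "04X"},
]
merged, conflicts = merge_by_casenumber([("OPD1", opd1), ("USC", usc)])
```

Keeping the conflicts around, rather than silently overwriting, makes it possible to inspect cases where two sources disagree about the date, address or beat of the same CASENUMBER.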
My merging code (which I will update on my github site as soon as I clean up this new version) mostly tries to be smart about exploiting the most solid sources of data from each of the three streams. The figure below gives an overview of the result. As described elsewhere, the bulk of the data is from DIT, augmented with additional details made available by the USC data set. The PRR is currently the only source of data for 2013; that’s the good news. The bad news is that, based on a comparison over periods prior to 2013, this source provides only about half the volume of crime incidents reported by the other sources! Finally, I have written a regularized facility for periodically grabbing the 90-day window of data provided via OPD’s current FTP crime reporting site and adding any new data to this full data set. I began doing that in February 2014, and data from this source is labeled as “2014.”
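The periodic update step could look roughly like the sketch below. It assumes that both the full data set and each 90-day window are dictionaries keyed by CASENUMBER and that "new" simply means "a CASENUMBER not seen before"; the function name, the `source` field and the sample records are hypothetical, not taken from the real code.

```python
def add_new_cases(full, window, label="2014"):
    """Add cases from a 90-day window that are not already in the full set.

    Existing cases are left untouched (an assumption; a real update might
    also want to refresh them). Returns the list of case numbers added.
    """
    added = []
    for case, record in window.items():
        if case not in full:
            # Copy the record and tag it with the label of its source.
            full[case] = dict(record, source=label)
            added.append(case)
    return added


# Made-up data for illustration.
full = {"13-000010": {"Date": "2013-12-30", "source": "PRR"}}
window = {
    "13-000010": {"Date": "2013-12-30"},
    "14-000001": {"Date": "2014-02-01"},
}
added = add_new_cases(full, window)
```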
Combining all of these and merging as much as possible, the current data set contains data on 436,620 “cases” involving 590,048 “charges” (i.e., multiple charges may be associated with the same case/CASENUMBER).
The other major improvement I’ve made recently is to publish this data in three alternative formats. Without further ado, please help yourself to any/all of:
- OPD_140315.csv.zip (16.3 MB, zipped): comma-separated, with a header line showing these fields:
  - Idx, OPD_RD, OIdx, Date, Time, CType, Desc, Beat, Addr, Lat, Long, UCR, Statute, CrimeCat
- OPD_140315.json.zip (15 MB, zipped): JSON for a dictionary
  - cid -> [date,time,beat,addr,lat,long, [ctype,desc,ucr,statute,cc]+ ]
- OPD_140315.db.zip: a SQLite database created via
  - CREATE TABLE INCIDENT (incididx int, rd text, date text, beat text, addr text, lat real, long real)
  - CREATE TABLE CHARGE (chgidx int, rd text, rdchgidx int, ctype text, desc text, ucr text, statute text, crimeCat text)
  - (Note that this URL points off to data.OpenOakland.org, a CKAN site with issues, because the file is too large to post on my WordPress site. Let me know if you have trouble getting it and we’ll work something else out.)
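As a rough sketch of how the SQLite version might be queried, the example below builds an in-memory database with the two CREATE TABLE statements above, inserts a few made-up rows, and counts charges per incident. It assumes that `rd` (the CASENUMBER) is the key linking CHARGE to INCIDENT and that INCIDENT holds one row per `rd`; the function name and all sample values are hypothetical.

```python
import sqlite3


def charges_per_incident(conn):
    """Return (rd, date, beat, number_of_charges) for every incident."""
    query = """
        SELECT i.rd, i.date, i.beat, COUNT(c.chgidx) AS ncharges
        FROM INCIDENT AS i
        LEFT JOIN CHARGE AS c ON c.rd = i.rd
        GROUP BY i.rd, i.date, i.beat
        ORDER BY i.rd
    """
    return conn.execute(query).fetchall()


# In-memory stand-in for the real file. To use the published data you would
# instead connect to the unzipped database file (its exact name inside the
# zip is not stated here, so check after unzipping).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE INCIDENT (incididx int, rd text, date text, beat text, "
    "addr text, lat real, long real)"
)
conn.execute(
    "CREATE TABLE CHARGE (chgidx int, rd text, rdchgidx int, ctype text, "
    "desc text, ucr text, statute text, crimeCat text)"
)

# Made-up rows, only to exercise the query.
conn.executemany(
    "INSERT INTO INCIDENT VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "07-000001", "2007-01-01", "12X", "100 MAIN ST", 37.8, -122.27),
        (1, "07-000002", "2007-01-03", "04X", "200 OAK ST", 37.81, -122.26),
    ],
)
conn.executemany(
    "INSERT INTO CHARGE VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    [
        (0, "07-000001", 0, "TYPE A", "example charge", "00", "XX000", "CAT A"),
        (1, "07-000001", 1, "TYPE B", "example charge", "00", "XX001", "CAT B"),
    ],
)

rows = charges_per_incident(conn)
```

The LEFT JOIN keeps incidents that have no matching row in CHARGE, so they show up with a count of zero instead of disappearing from the result.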
All three versions should contain exactly the same data, but I have not yet done all the round-tripping tests to ensure this is true. I consider the JSON version primary, and have done most of my testing on it.
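A round-trip check between the JSON and CSV layouts might look like the sketch below. It rests on several assumptions about details not spelled out above: that each JSON value is a list of six incident fields followed by one or more five-element charge lists, that OPD_RD in the CSV is the JSON key `cid`, that OIdx is the position of a charge within its case, and that Idx is a running row number. The function names and the sample case are made up.

```python
import csv
import io

FIELDS = ["Idx", "OPD_RD", "OIdx", "Date", "Time", "CType", "Desc", "Beat",
          "Addr", "Lat", "Long", "UCR", "Statute", "CrimeCat"]


def json_to_csv_text(cases):
    """Flatten the JSON-style dictionary into CSV text, one row per charge."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf)
    writer.writerow(FIELDS)
    idx = 0
    for cid, (date, time, beat, addr, lat, lng, *charges) in cases.items():
        for oidx, (ctype, desc, ucr, statute, cc) in enumerate(charges):
            writer.writerow([idx, cid, oidx, date, time, ctype, desc, beat,
                             addr, lat, lng, ucr, statute, cc])
            idx += 1
    return buf.getvalue()


def csv_text_to_json(text):
    """Rebuild the JSON-style dictionary from CSV text."""
    cases = {}
    for row in csv.DictReader(io.StringIO(text, newline="")):
        case = cases.setdefault(
            row["OPD_RD"],
            [row["Date"], row["Time"], row["Beat"], row["Addr"],
             float(row["Lat"]), float(row["Long"])],
        )
        case.append([row["CType"], row["Desc"], row["UCR"],
                     row["Statute"], row["CrimeCat"]])
    return cases


# A made-up case with two charges; the comma in the description checks quoting.
sample = {
    "07-000001": [
        "2007-01-01", "13:45", "12X", "100 MAIN ST", 37.8, -122.27,
        ["TYPE A", "example, with a comma", "00", "XX000", "CAT A"],
        ["TYPE B", "another example", "00", "XX001", "CAT B"],
    ]
}
round_tripped = csv_text_to_json(json_to_csv_text(sample))
```

If `round_tripped` equals `sample`, the two layouts carry the same information for that case; a mismatch points at a field that is lost or altered in one of the formats.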
I’ll be posting my revised code, documentation, and some examples showing how to use it ASAP. But give it a try and let me know your reactions.