Jane is pleased to announce our latest data release. The University of Wisconsin-Madison (UW-Madison) crime log is freely available on the campus police department’s (UWPD) website, though the site displays only the past 60 days’ worth of data at any given time. Requests for older logs, which are provided as PDFs, are promptly filled by the department. To make this data more accessible to researchers and others interested in using it, we’ve made the crime logs available as .xlsx, .ods, .csv, and .html files. The following summarizes how the campus crime log data for September 1, 2017 to August 31, 2018 (available on the resources page) has been sorted and cleaned.
“Daily campus crime log” files:
- Some words are cut off in the original PDFs and are thus cut off in the new file formats
- Typos and other inconsistencies have not been corrected
- Duplicate entries exist, as each PDF represents a 60-day snapshot of the log
- An additional column, date_extracted, has been added to the new formats; date_extracted denotes the date of the original PDF on which the entry appears
Additionally, our data import via Tabula resulted in a few erroneous entries for Report #. These errors were corrected manually and include:
Report # Appeared As | Time Reported | Edited To |
108P0M1273 | 8/30/18 1:57 PM | 1801273 |
81:8001 A0M53 | 7/18/18 11:35 AM | 1801053 |
81801054 | 7/18/18 12:02 PM | 1801054 |
61:85051 P0M50 | 7/18/18 6:56 AM | 1801050 |
M1801007 | 7/11/18 2:37 PM | 1801007 |
M1801002 | 7/10/18 4:27 PM | 1801002 |
1158 0A1M000 | 7/10/18 7:22 AM | 1801000 |
1158 0A1M000 | 7/10/18 7:22 AM | 1801000 |
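For readers who want to apply the same corrections to their own Tabula export, the table above can be expressed as a simple lookup. This is only a minimal sketch, not the process we used (our corrections were made by hand). The function names are hypothetical, and the seven-digit check is an assumption inferred from the corrected values in the table, not a documented UWPD format.

```python
import re

# Corrections copied from the table above: garbled value -> corrected value.
REPORT_NUMBER_FIXES = {
    "108P0M1273": "1801273",
    "81:8001 A0M53": "1801053",
    "81801054": "1801054",
    "61:85051 P0M50": "1801050",
    "M1801007": "1801007",
    "M1801002": "1801002",
    "1158 0A1M000": "1801000",
}

# Assumption: a well-formed report number is exactly seven digits,
# as every corrected value in the table is. Not a documented rule.
SEVEN_DIGITS = re.compile(r"\d{7}")


def fix_report_number(value):
    """Return the corrected report number if the value is a known garbled one."""
    cleaned = value.strip()
    return REPORT_NUMBER_FIXES.get(cleaned, cleaned)


def looks_valid(value):
    """Flag values that still need a manual look after the lookup."""
    return SEVEN_DIGITS.fullmatch(value) is not None


print(fix_report_number("M1801007"))   # 1801007
print(looks_valid("81801054"))         # False
```

A check like `looks_valid` is only useful for surfacing candidates to review; deciding the correct value still requires comparing against the original PDF.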
Next, we deduplicated the new files on event_number, date_reported, date_occurred, report_number, location, offense, and disposition. Deduplication resulted in the following:
- Sept 1, 2017-Aug 31, 2018: decreased from 4377 to 2365 entries
Notes for “Dedupe of daily campus crime logs” files:
- date_extracted has been excluded from the dedupe process
- There may be more than one entry for a single incident if disposition changed over time
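The dedupe rule described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the exact procedure we ran: the column names come from this post, but the sample rows are invented, and keeping the first occurrence of each duplicate is our choice for the example.

```python
import csv
import io

# Columns used for deduplication, as listed in this post.
# date_extracted is deliberately left out.
DEDUPE_KEYS = [
    "event_number", "date_reported", "date_occurred",
    "report_number", "location", "offense", "disposition",
]


def dedupe_rows(rows, keys=DEDUPE_KEYS):
    """Keep the first row seen for each distinct combination of key columns."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique


# Invented sample: one incident captured in three PDF snapshots,
# with the disposition changing in the last one.
SAMPLE_CSV = "\n".join([
    "event_number,date_reported,date_occurred,report_number,"
    "location,offense,disposition,date_extracted",
    "18-0001,1/2/18 9:00 AM,1/2/18 8:30 AM,1800001,Example Hall,Theft,Open,2018-01-05",
    "18-0001,1/2/18 9:00 AM,1/2/18 8:30 AM,1800001,Example Hall,Theft,Open,2018-02-05",
    "18-0001,1/2/18 9:00 AM,1/2/18 8:30 AM,1800001,Example Hall,Theft,Closed,2018-03-05",
])

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
deduped = dedupe_rows(rows)
print(len(rows), "->", len(deduped))  # 3 -> 2
```

The two surviving rows differ only in disposition, which is why a single incident can still appear more than once in the deduped files.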
The last dataset available (as of the date of this blog post) is labeled “Clery data.” This set includes any entry that references sexual assault, domestic or dating violence, or stalking as the offense. The time_occurred field was manually edited (where the entry contained something other than a date and time) so that the field could be split into time_occurred 1 (i.e., date) and time_occurred 2 (i.e., time). Other minor typos were also resolved.
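The date/time split can be sketched as follows. This is a hedged example, not the method we used: it assumes time_occurred values look like the Time Reported values shown earlier in this post (for example, "7/18/18 11:35 AM"), and the function name and output formats are hypothetical. Anything that does not parse is returned as None so it can be edited by hand.

```python
from datetime import datetime

# Assumed input format, modeled on values such as "7/18/18 11:35 AM".
ASSUMED_FORMAT = "%m/%d/%y %I:%M %p"


def split_time_occurred(value):
    """Split a combined date/time string into (date, time) strings.

    Returns None when the value does not match the assumed format,
    which marks the entry as needing a manual edit.
    """
    try:
        parsed = datetime.strptime(value.strip(), ASSUMED_FORMAT)
    except ValueError:
        return None
    return parsed.date().isoformat(), parsed.strftime("%H:%M")


print(split_time_occurred("7/18/18 11:35 AM"))  # ('2018-07-18', '11:35')
print(split_time_occurred("unknown"))           # None
```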