Data Update – January

In place of my usual data update, I’d like to start 2018 off with our first data release. This release, which is available on Jane’s new Resources page, highlights some of the legwork we’ve done so far. We hope that others will use our work for their own research, study, and discussion.

The University of Wisconsin’s (UW-Madison) crime log data is freely available on the campus police department’s (UWPD) website, though the site only displays the past 60 days’ worth of data at any given time. Requests for older logs, which are provided as PDFs, are promptly filled by the department. To make this data more accessible to researchers and anyone else interested in using it, we’ve made the crime logs available as .xlsx, .ods, .csv and .html files. Each file includes all available data points for three consecutive academic years: FA16-SU17; FA15-SU16; and FA14-SU15.

Notes for “Daily campus crime log” files:

  • Some words are cut off in the original PDFs and are thus cut off in the new file formats
  • Typos and other inconsistencies have not been corrected
  • Duplicate entries exist, as each PDF represents a 60-day snapshot of the log
  • An additional column, date_extracted, has been added to the new formats; date_extracted denotes the date of the original PDF on which the entry appears

Next, we deduped the new formats by event_number, date_reported, date_occurred, report_number, location, offense, and disposition. Deduping resulted in the following:

  • Sept 1, 2016-Aug 31, 2017: decreased from 4085 to 2149 entries
  • Sept 1, 2015-Aug 31, 2016: decreased from 2785 to 1401 entries
  • Sept 1, 2014-Aug 31, 2015: decreased from 2996 to 1603 entries

Notes for “Dedupe of daily campus crime logs” files:

  • date_extracted has been excluded from the dedupe process
  • There may be more than one entry for a single incident if disposition changed over time
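For readers who want to reproduce the dedupe on the .csv files, here is a minimal sketch using only Python’s standard library. It is an illustration of the idea, not the exact process we ran. The column names follow the dedupe list above, but the exact header spellings in the released files may differ, and the sample rows are made up for the example.

```python
import csv
import io

# Columns compared when deduping; date_extracted is deliberately left out.
# Header spellings are assumed and may differ in the released files.
DEDUPE_KEYS = [
    "event_number", "date_reported", "date_occurred",
    "report_number", "location", "offense", "disposition",
]

def dedupe(rows, keys=DEDUPE_KEYS):
    """Keep the first row seen for each distinct combination of key columns."""
    seen = set()
    unique = []
    for row in rows:
        signature = tuple(row[k] for k in keys)
        if signature not in seen:
            seen.add(signature)
            unique.append(row)
    return unique

# Made-up sample rows, not real log entries.
SAMPLE = """event_number,date_reported,date_occurred,report_number,location,offense,disposition,date_extracted
E1,09/01/16,09/01/16,R1,HALL A,THEFT,OPEN,2016-09-15
E1,09/01/16,09/01/16,R1,HALL A,THEFT,OPEN,2016-10-15
E1,09/01/16,09/01/16,R1,HALL A,THEFT,CLOSED,2016-11-15
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
deduped = dedupe(rows)
print(len(rows), "->", len(deduped))
```

The first two sample rows differ only in date_extracted, so they collapse into one. The third row survives because its disposition changed, which is why a single incident can still appear more than once.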

The last dataset available (as of the date of this blog post) is noted as “Clery data.” This set includes any entry that references sexual assault, domestic or dating violence, or stalking as the offense.
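A filter along these lines can be sketched in a few lines of Python. The search terms and the offense column name below are assumptions made for illustration; the wording that actually appears in the logs varies, so treat this as a starting point rather than the selection rule we used.

```python
# Assumed search terms; the wording actually used in the logs may differ.
CLERY_TERMS = ("SEXUAL ASSAULT", "DOMESTIC", "DATING VIOLENCE", "STALKING")

def is_clery_entry(row, terms=CLERY_TERMS):
    """Return True if the row's offense text mentions any of the terms."""
    offense = row.get("offense", "").upper()
    return any(term in offense for term in terms)

# Made-up sample rows, not real log entries.
sample_rows = [
    {"offense": "STALKING"},
    {"offense": "THEFT"},
    {"offense": "Dating Violence"},
    {"offense": "SEXUAL ASSAULT - 2ND DEGREE"},
]

clery_rows = [row for row in sample_rows if is_clery_entry(row)]
print(len(clery_rows), "of", len(sample_rows), "rows kept")
```

A plain substring match like this misses misspelled or cut-off offenses, which is one reason the typo corrections noted below matter.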

Notes for “Clery data” files:

  • Leading and trailing white spaces have been trimmed
  • Typos and inconsistencies have been corrected using OpenRefine
    • FA16-SU17 cluster by offense; key collision, cologne-phonetic (1 error resolved)
    • FA16-SU17 cluster by offense; nearest neighbor, levenshtein (1 error resolved)
    • FA16-SU17 cluster by location; key collision, ngram-fingerprint (1 match)
    • FA15-SU16 cluster by offense; key collision, ngram-fingerprint (1 error resolved)
    • FA15-SU16 cluster by location; key collision, ngram-fingerprint (1 error resolved)
    • FA14-SU15 cluster by offense; nearest neighbor, levenshtein (3 errors resolved)
    • FA14-SU15 cluster by location; key collision, metaphone3 (1 error resolved)
  • Some typos have been corrected (e.g. VIOLENC is edited to VIOLENCE)
  • time_reported has been split into date and time
  • time_occurred has been split into date and time; spaces were removed to simplify this process
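To give a feel for how the trimming and nearest-neighbor steps surface typos such as VIOLENC, here is a small brute-force sketch in Python. It is a stand-in for the idea behind OpenRefine’s levenshtein clustering, not OpenRefine’s own implementation, which works differently under the hood. The function names and the sample values are made up for the example.

```python
def levenshtein(a, b):
    """Edit distance: fewest insertions, deletions, or substitutions to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

def near_matches(values, max_distance=1):
    """Trim whitespace, then list pairs of distinct values within max_distance edits."""
    cleaned = sorted({value.strip() for value in values})
    pairs = []
    for i, first in enumerate(cleaned):
        for second in cleaned[i + 1:]:
            if levenshtein(first, second) <= max_distance:
                pairs.append((first, second))
    return pairs

# Made-up offense values with stray whitespace and one cut-off word.
offenses = ["  DATING VIOLENC", "DATING VIOLENCE ", "STALKING"]
print(near_matches(offenses))
```

Each pair it reports is only a candidate for review. Deciding which spelling is correct is still a manual call.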

For more on the Clery Crime Data Visualization Project data, check out last year’s November and December data updates or contact us at jane@janespeaks.org for more information. Check back again next month for Jane’s newest update.
