Data is Messy

A good long hike begins with preparation: a plan and a map. I knew this even before I was soaked by a cold downpour on Mount Washington. Part of preparation is knowing that things will go wrong. Despite having a map, you may get lost, or the trail may be impassable. A weather forecast is fine until a thunderstorm blows in. You might find someone who is injured or needs water. Later, you’ll want to tell your friends what to see and what to avoid if they follow in your footsteps.

Hiking outside is wonderful, but it can be… treacherous.

(Text from a story about the 1998 Ice Storm in the Shawville Equity, 7 January 1998.)
Wrangling data is rewarding, but it is messy. It can be treacherous as well. To wrangle data, the digital historian needs to prepare, plan, and map out where they are going in case they need to backtrack. That is something I learned this week.

Less mess with better OCR?

One example of messy data is the OCR’d text of the Shawville Equity. As HIST3814o classmate Sarah Cole posted in her blog this week, the OCR of the Equity newspaper reads text horizontally across column boundaries, making the task of isolating the text of a particular news story laborious. It is too bad that the “C” in OCR does not stand for context.

OCRFeeder looks like a promising tool for doing OCR with context when needed. It has a graphical interface for laying out individual stories with rectangles so that they can be processed in context, and it works with PDFs directly. I found it a challenge to install, though; notes about that are in my fail log for this week. Speaking of failures, I only found OCRFeeder because I wanted a better OCR tool after my own results with command-line Tesseract proved unusable. OCRFeeder uses Tesseract for OCR too, so it must be driving it much better than I was: a productive fail.
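
OCRFeeder draws those story rectangles for you, but the same idea can be tried from a script. Below is a minimal sketch, assuming the pytesseract and Pillow packages are installed and the Tesseract binary is on the PATH; the file name and the rectangle coordinates are hypothetical placeholders, not values from a real Equity scan.

```python
# A minimal sketch of column-aware OCR, assuming pytesseract and Pillow
# are installed and Tesseract itself is on the PATH.
from PIL import Image
import pytesseract

# Hypothetical scan of a newspaper page.
page = Image.open("equity-page.png")

# Crop one story's rectangle (left, top, right, bottom) before OCR,
# so the text is read in context rather than straight across columns.
story = page.crop((120, 340, 560, 980))

text = pytesseract.image_to_string(story)
print(text)
```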

The user interface of OCRFeeder.

Gathering data with a crowd is messy.

Dr. Shawn Graham, Guy Massie, and Nadine Feuerherm’s experience with the HeritageCrowd Project showed both the great potential rewards and the complications of crowdsourcing history. The HeritageCrowd Project wanted to tap the large reservoir of local historical knowledge held by the residents of the Pontiac region of the Upper Ottawa Valley. However, the response from the crowd that had this valuable information was complicated, even messy. Some people misunderstood what the website was for or how to contribute to it. The response rate was fairly low: approximately one contributor out of every 4,000 residents. Some potential contributors were reluctant because they felt their knowledge was not professional enough. Advance planning for research using crowdsourcing is likely even more important than for individual projects, given the complexity of working with different people and the likelihood of losing the crowd if plans change or don’t work out.

Losing the crowd is very messy.

Gathering data with a crowd can be messy; losing the crowd, messier still. When I started this course, I read Dr. Graham’s How I Lost the Crowd, a transparent account of what happened when the HeritageCrowd Project’s website was hacked. This brought back my own experience as a volunteer running a website that contained the personal data of several hundred people when it, too, was hacked. It was compromised three separate times, in attacks of escalating damage. It is beside the point to write about that still-raw experience here. However, it is very important for digital historians to heed Dr. Graham’s examples: back up work in a disciplined manner, take notes in case you need to rebuild a website, and pay real attention to security through secure design, up-to-date software, monitoring, and review of security warnings.

By the way, the story of the hacking of my volunteer website worked out. We were transparent too, and told everyone involved what had happened. We moved to a more secure internet provider. We were able to restore the site from a backup. We patched the security holes and implemented monitoring, because we knew that we, an all-volunteer gardening association, were now a target. This took several months of work, and much of it could have been avoided had I been more proactive about the Heartbleed bug in April 2014.

The data might be clean, but it’s not neat.

One of the ideas I had for the final project of this course was inspired by a recent course I took, Dr. Joanna Dean’s HIST 3310 Animals in History. Each year the Shawville Equity publishes coverage of the Shawville Fair, an agricultural exhibition featuring farm animals. According to the fair’s website it has run since 1856, and I thought it would be interesting to trace whether the breeds of animals shown at the fair had changed over the years, indicating either evolution in agricultural practices or broader societal change. However, despite the long history of both the fair and the Equity’s coverage, which begins on page 3 of the September 27, 1883 edition, there are years, such as 1937, where the edition covering the fair is missing. Details of the fair are covered regularly, but the coverage varies from lists of prize winners (the data I would like) to looser descriptions of what took place. Also, in my sampling I could not see changes over the years in the patterns of animals shown at the fair to write about. Maybe the longevity and consistency is what is historically significant?
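
To make that idea concrete, here is a rough sketch of how the breed question might be sampled computationally. It assumes a hypothetical folder fair-coverage/ holding the OCR’d fair coverage as one plain-text file per year; the breed keywords are placeholders, not the fair’s actual prize categories.

```python
from pathlib import Path

# Hypothetical breed keywords; a real list would come from the fair's
# prize categories as printed in the Equity.
breeds = ["holstein", "ayrshire", "jersey", "clydesdale", "shorthorn"]

# Assumes OCR'd fair coverage saved as one plain-text file per year,
# e.g. fair-coverage/1883.txt.
for path in sorted(Path("fair-coverage").glob("*.txt")):
    text = path.read_text(encoding="utf-8", errors="ignore").lower()
    counts = {breed: text.count(breed) for breed in breeds}
    print(path.stem, counts)
```

Even crude counts like these would at least show whether the breed names mentioned shift over the decades, or whether the pattern really is as steady as my sampling suggested.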

When data is not neat, it needs to be cleaned up. Where it is missing, the historian faces the question of whether to interpolate or to interpret based on the data that is available. Our course workbook has this observation: “cleaning data is 80% of the work in digital history”. In relation to this, a member of the class asked, “if this is the case, then why don’t historians keep a better record of the data cleaning process?” Excellent question. If cleaning the data is 80% of DH, it is also a dirty secret. Cleaning data changes data and makes it something new, as highlighted by Ryan Cordell in qijtb The Raven. While we may be changing data for the better, “better” really means better suited to our own purposes, and so we may also change how our data is interpreted by other historians. To mitigate the risk of misinterpretation or error, it is important to document how the work of DH takes place so that it holds up to scrutiny and can be reproduced. DH work is complicated, and sometimes the historian may have to reproduce their own work in order to correct an error. Documenting also helps explain how we spend that 80% of our time.
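
As one possible answer to that classmate’s question, here is a minimal sketch of keeping a record of the cleaning process. It assumes a hypothetical CSV of prize winners with a breed column; every transformation is written to a log file so the cleaning can be scrutinized and reproduced later.

```python
import csv
import logging

# Write an audit trail of every change so the cleaning can be reviewed
# and reproduced. File names and the "breed" column are hypothetical.
logging.basicConfig(filename="cleaning.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

with open("prize-winners.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
logging.info("loaded %d rows from prize-winners.csv", len(rows))

for row in rows:
    original = row["breed"]
    row["breed"] = original.strip().lower()  # normalize spacing and case
    if row["breed"] != original:
        logging.info("normalized breed %r -> %r", original, row["breed"])

with open("prize-winners-clean.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
logging.info("wrote %d cleaned rows", len(rows))
```

The log file then becomes part of the documentation: it records exactly what was changed and when, which is also a handy way of showing where that 80% of the time went.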