Grooming data: Module 3 of HIST 3814o

This week in our course we considered what it means for historians to use data and adapt it to their purposes.  We also learned techniques to do that, which I will touch on below.

Yet one of the biggest learning opportunities came from a productive failure: our course’s technical environment, DHBox, became unavailable. DHBox, which we access through our university over a virtual private network, has been a marvel to use. However, even this virtual environment ran into trouble at the start of this course week, when it became encumbered with too many files from its users.

The lack of availability of DHBox gave students a great opportunity to adapt. Mac users were able to use the terminal on their own computers to replicate DHBox.  For learning regular expressions (“RegEx”), sophisticated text editors such as Notepad++ could also be used instead of DHBox.

Unix running on Windows. Very cool.

Sarah Cole used the lack of availability of DHBox as an opportunity to set up her Windows machine to run Cygwin, which provides a Unix-like command line on Windows.  This is fantastic; I never thought I would see Unix running on Windows.  I had a similar experience: Windows 10 can run Ubuntu. With this installed, I was able to use my Windows computer in a whole new way.  I would not have thought to look for this if DHBox had been available.

Without going on at length, the lack of availability of DHBox this week highlighted not only the dependence digital historians have on technology, but also how we must plan to build technical resiliency into our projects.  We should plan for technical failure. Despite the lack of availability of DHBox this week, our class was able to move ahead because alternatives had been considered in advance and we were able to use them.

RegEx.

I learned a lot from the exercise on Regular Expressions.  I had used RegEx a few times before, but I never really understood them. Using them was more like an incantation, a spell to make data look different. Most of the time, I need to understand what the statements in a program mean before I use them; it is unwise not to.  This week with RegEx, I am finally starting to get a grasp of them. One thing that helped me was to make a copy of each file as I worked through each step of the exercise.  This simple habit allowed me to easily restart a step if I made a mistake.
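The copy-before-each-step habit can be sketched in a few lines of Python. This is only an illustration, not the course exercise itself: the function names `apply_step` and `restore`, the `.stepN` backup naming, and the sample line of text are all made up for the example.

```python
import re
import shutil
import tempfile
from pathlib import Path


def apply_step(path: Path, step: int, pattern: str, replacement: str) -> Path:
    """Save a numbered copy of `path`, then apply one regex substitution in place.

    The backup naming scheme (name.stepN.ext) is a hypothetical convention.
    """
    backup = path.with_name(f"{path.stem}.step{step}{path.suffix}")
    shutil.copy2(path, backup)  # checkpoint: the file as it was before this step
    text = path.read_text(encoding="utf-8")
    path.write_text(re.sub(pattern, replacement, text), encoding="utf-8")
    return backup


def restore(path: Path, backup: Path) -> None:
    """Undo a step by copying the saved checkpoint back over the working file."""
    shutil.copy2(backup, path)


if __name__ == "__main__":
    # Made-up sample data in a temporary folder, so nothing real is touched.
    workdir = Path(tempfile.mkdtemp())
    letters = workdir / "letters.txt"
    letters.write_text("Jane Doe to John Roe, January 1, 1840\n", encoding="utf-8")

    checkpoint = apply_step(letters, 1, r" to ", ",")
    print(letters.read_text(encoding="utf-8"))

    # If the pattern turned out to be wrong, go back and try the step again.
    restore(letters, checkpoint)
    print(letters.read_text(encoding="utf-8"))
```

Because every step leaves its own checkpoint behind, a bad pattern costs only one step rather than the whole exercise.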

OpenRefine.

Having spent a lot of time in previous years correcting data in spreadsheets and databases, I was impressed with OpenRefine’s capabilities and ease of use. This week I only scratched the surface of what OpenRefine can do, but already it seems like an indispensable tool for working with structured data in columns and rows.

Wikipedia API.

I was thinking about our final project for the course and found out, thanks to the example here, that Wikipedia has an API. This is something I would like to follow up on and use for DH work in the future.
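As a note to my future self, here is a rough sketch of what a request to the Wikipedia API might look like in Python. It is not the example linked above. It assumes the public MediaWiki endpoint at `https://en.wikipedia.org/w/api.php` and its `extracts` property; the function names, the article title, and the User-Agent string are my own placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_extract_url(title: str) -> str:
    """Build a query URL asking for the plain-text introduction of one article."""
    params = {
        "action": "query",
        "prop": "extracts",
        "exintro": 1,       # only the section before the first heading
        "explaintext": 1,   # plain text rather than HTML
        "titles": title,
        "format": "json",
    }
    return f"{API_URL}?{urlencode(params)}"


def fetch_intro(title: str) -> str:
    """Fetch the introduction text for `title` (needs a network connection)."""
    request = Request(
        build_extract_url(title),
        # Placeholder User-Agent; replace it with real contact details.
        headers={"User-Agent": "hist3814o-example/0.1 (hypothetical)"},
    )
    with urlopen(request) as response:
        data = json.load(response)
    # Results are keyed by page id, so take the first (and only) page.
    pages = data["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")


if __name__ == "__main__":
    print(build_extract_url("Digital humanities"))
```

Calling `fetch_intro("Digital humanities")` should return the opening paragraphs of that article as a string, which could then be cleaned up with the same tools from this week.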

Last word.

This week I again learned a tremendous amount, not only from the exercises but also from the chance to follow some ideas that our course challenged me to think about. DHBox was down, and so I was able to use Ubuntu on my Windows PC instead. My OCR results for an exercise from the previous week were poor.  Looking for alternatives led me to try OCRFeeder. These challenges made me curious to see how others in the class met them, and I learned a lot from that as well.
