More data, better searching, incremental progress.

This week’s progress consists of incremental improvements. They are not earth-shattering, but they are necessary to get the full value out of a resource like this.

Full text index

I added full text indexes called ftname to the entities_people, entities_organizations and entities_locations tables.
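For reference, here is roughly what that looks like for the people table, along with the kind of lookup it enables, sketched with the RMySQL calls the processing program already uses (the connection object mydb is the one used elsewhere in these posts):

# Add a full-text index on the name column of the people table; the
# organization and location tables are handled the same way.
dbSendQuery(mydb, "ALTER TABLE entities_people ADD FULLTEXT INDEX ftname (name)")

# Possible matches for Carmen Burke, best matches first.
matches <- dbGetQuery(mydb,
  "SELECT name FROM entities_people
   WHERE MATCH(name) AGAINST('Carmen Burke')")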

This allowed the website to present a list of possible related entities. For example, using a full-text search, the database can return possible matches for Carmen Burke. The people table has an entry for Burke Secrétoire-Trésorière.

Under organizations there is listed:

Municipal OfficeCarmen Burke Secrétaire-trésorière0S29
Municipal OfficeCarmen Burke Secrétaire-trésorièrePRIVATE INSTRUCTION
Municipal OUceCarmen Burke Secrétalre-trésoriére

Each of these entries may lead to an additional reference about Carmen Burke, although I expect a lot of overlap because the same entity often appears multiple times in an article yet is stored in the database under different spellings due to OCR errors. Regardless, being able to look up possible related entities will help a researcher make sure more needles are found in the haystack of content.

Better search

There is now a small form to search for entities by choosing the first two letters of a name in a select field.
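The form itself lives in the PHP pages, but the underlying query is simple. Here is a rough sketch of it using the same RMySQL calls as the processing program; the two-letter prefix is just an example value:

# List entities whose names start with the two letters chosen in the form.
prefix <- "Ca"
entities <- dbGetQuery(mydb,
  sprintf("SELECT name FROM entities_people WHERE name LIKE '%s%%' ORDER BY name",
          prefix))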

Characters in names being misinterpreted as HTML

A simple improvement was made to the listing of entity names from the database. Due to OCR errors, some characters were represented by less-than brackets (<). An entity named OOW<i resulted in <i being interpreted as the start of an <i> italic HTML tag, which meant that all the content that followed on the web page was in italics. I didn’t want to tamper with the data itself, in order to preserve its integrity, so I looked at options in PHP for presenting the content instead. The PHP function htmlspecialchars resulted in a lot of data simply not being returned, so empty rows were listed rather than content. Using the following statement

 str_replace("<","&amp;",$row['ent_name'])

was the least harmful way to present data containing a <, by replacing the character with an HTML entity.

Accents in content were mishandled

As noted in last week’s blog, the web pages were presenting Carmen Burke’s French-language title of Présidente incorrectly, as shown below:

Carmen Burke Pr�sidente

Luckily, the database had stored the data correctly:

Carmen Burke Présidente

I say luckily because I did not check that the processing program was storing accented characters correctly, and I should have, given that I know the paper has French-language content too. Lesson learned.

Setting the character set in the database connection fixed the presentation, as shown below.

mysql_set_charset('utf8',$link);

‘utf8’ is the character set supporting accented characters I want to use. $link represents the database connection.
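On the R side, a similar one-liner could guard against this in the processing program; this is a sketch assuming the RMySQL connection mydb used elsewhere:

# Make sure the processing program's connection also talks to MySQL in UTF-8.
dbSendQuery(mydb, "SET NAMES utf8")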

Completed processing of Equity editions 2000-2010

Yesterday I downloaded the editions of the Equity I was missing from 2000-2010 using wget.

wget http://collections.banq.qc.ca:8008/jrn03/equity/src/2000/ -A .txt -r --no-parent -nd -w 2 --limit-rate=20k

I also completed processing the 2000-2010 editions in the R program. This ran to completion while I was out at a wedding, so processing is much faster than it used to be.

I backed up the database on the web and then imported all of the data again from my computer so that the website now has a complete database of entities from 1883-2010.  According to the results, there are 629,036 person, 500,055 organization and 114,520 location entities in the database.


A web site to browse results and better processing performance.

Database indexes make a difference.

Since my last blog post, the R program that is processing The Equity files has been running 24 hours a day, and as of Saturday it reached 1983. However, I noticed that the time it took to process each issue was getting longer, and that overall this was taking far too long.

I went back to an earlier idea I had to add more indexes to the tables in the MySQL database of results. I had been reluctant to do this, since adding an additional index to a table can make updates to that table slower due to the extra time needed to maintain the index. At first I added an index on the name column of the entities_people table [KEY `names` (`name`)]. Adding this index made no visible difference to the processing time, likely because this table already had an index keeping the name column unique [UNIQUE KEY `name_UNIQUE` (`name`)].

Then I added indexes to the cross-reference tables that relate each type of entity (people, locations, organizations) to the source documents [KEY `ids` (`id_entities_person`,`id_source_document`)].
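For the people table, the change was roughly the following, shown here with the RMySQL calls used elsewhere in the project; the other cross-reference tables got equivalent indexes:

# Index the people cross-reference table on both foreign keys, matching the
# key definition above; the organization and location tables get the same treatment.
dbSendQuery(mydb,
  "ALTER TABLE people_x_sourcedocuments
   ADD INDEX ids (id_entities_person, id_source_document)")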

After adding these indexes, processing time really sped up. In the short time I have spent writing these paragraphs, three months of editions have been processed. It’s no surprise that adding indexes also improved the response time of web pages returning results from the database.

Browse The Equity by topic on this basic web site.

A simple web site is now available to browse by topic and issue. As you can see, thousands of person, location and organization entities have been extracted. Currently only the first 10,000 of them are listed in search results, since I don’t want to cause the people hosting this web site any aggravation with long-running queries on their server. I plan to improve the searching so that it’s possible to see all of the results, but in smaller chunks. I would like to add a full text search, but I am somewhat concerned about that being exploited to harm the web site.
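One way I might do the chunking is with LIMIT and OFFSET; a minimal sketch, where the page size and the connection mydb are assumptions:

# Fetch one "page" of 100 names at a time instead of everything at once.
page  <- 3
limit <- 100
rows  <- dbGetQuery(mydb,
  sprintf("SELECT name FROM entities_people ORDER BY name LIMIT %d OFFSET %d",
          limit, (page - 1) * limit))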

As of today there is a list of issues and for each issue there is a list of the people, organizations and locations that appear in it. All of the issues that an entity’s name appears in can also be listed, such as the Quyon Fair. Do you see any problems in the data as presented? Are there other ways you would like to interact with it? I plan to make the database available for download in the future, once I get it to a more finalized form.

There is a lot of garbage or “diamonds in the rough” here. I think it’s useful to see that in order to show the level of imperfection of what has been culled from scanned text of The Equity, but also to find related information. Take, for example, Carmen Burke:

Carmen Burke President
Carmen Burke Pr�sidente
Carmen Burke Sec TreesLiving
Carmen Burke Secr�ta
Carmen Burke Secr�tair
Carmen Burke Secr�taire
Carmen Burke Secr�taire-tr�sori�re
Carmen Burke Secr�taire-tr�sori�re Campbell
Carmen Burke Secr�taire-tr�sorl�re Campbell
Carmen Burke Secr�taire-Tr�sort�re
Carmen Burke Secr�talre-tr�sori�reWill
Carmen Burke Secr�talre-tr�sori�reX206GENDRON
Carmen Burke Secr�talre-tr�sorl�reOR
Carmen Burke Secretary
Carmen Burke Secretary Treasurer
Carmen Burke Secretary-treasurerLb Premier Jour
Carmen Burke Seter

Cleaning these results is a challenge I continue to think about. A paper I will be looking at in more depth is OCR Post-Processing Error Correction Algorithm Using Google’s Online Spelling Suggestion by Youssef Bassil and Mohammad Alwani.

Happy searching! I hope you find some interesting nuggets of information in these preliminary results. Today the web database has the editions from 1883-1983, and I will be adding more in the coming weeks.

This week’s progress on an improved finding aid for the Shawville Equity.

This week’s progress is represented by the simple list below that shows which issues of the Equity contain information about Alonzo Wright.

Alonzo Wright

83471_1883-10-25
83471_1884-11-27
83471_1885-01-22
83471_1885-07-16
83471_1886-03-18
83471_1894-01-11
83471_1898-09-01
83471_1900-03-29

The list is generated from a database containing all of the entities found in the Equity that I have processed so far. Here is the select statement:

SELECT entities_people.name,
       source_documents.source_document_name,
       source_documents.source_document_base_url,
       source_documents.source_document_file_extension_1
FROM entities_people
LEFT JOIN people_x_sourcedocuments
       ON entities_people.id_entities_person = people_x_sourcedocuments.id_entities_person
LEFT JOIN source_documents
       ON people_x_sourcedocuments.id_source_document = source_documents.id_source_document
WHERE entities_people.name = "Alonzo Wright"
GROUP BY source_documents.source_document_name
ORDER BY entities_people.name

Currently I am processing issue 981 of the Equity, published on 21 March 1901, and each issue takes several minutes of processing time. Although I have more than 5,000 issues to go before I meet my processing goal, this is solid progress compared to earlier this week.

Overcoming memory issues.

I had a sticky problem where my R Equity processing program would stop due to memory errors after processing only a few editions. Given that I want to process about 6,000 editions, it would be beyond tedious to restart the program after each error.

I modified the database and program to store a processing status so that the program could pick up after the last edition it finished rather than starting at the first edition each time. This worked, but it wasn’t a fix for the underlying problem.
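As a rough sketch of the resume logic, assuming a hypothetical processed column on the source_documents table (the column name is my own label, not necessarily what is in the database):

# Hypothetical sketch: fetch editions not yet marked as processed, and mark
# each one once it finishes. The `processed` flag column is an assumption.
todo <- dbGetQuery(mydb,
  "SELECT id_source_document, source_document_name
   FROM source_documents WHERE processed = 0 ORDER BY source_document_name")

# ... process one edition here ...

dbSendQuery(mydb,
  sprintf("UPDATE source_documents SET processed = 1 WHERE id_source_document = %s",
          todo$id_source_document[1]))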

Since I was dealing with this error:

 Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to reduce the amount of garbage collection (GC) that Java was doing. I removed some of the dbClearResult(rs) statements on the theory that they were causing the underlying Java to do garbage collection, and this seemed to work better.

Later I got another error message:

     java.lang.OutOfMemoryError: Java heap space

So I upped my memory usage here:

     options(java.parameters = "-Xmx4096m")

Per this article, I tried this:

     options(java.parameters = "-Xms4096m -Xmx4096m")

I still got “java.lang.OutOfMemoryError: GC overhead limit exceeded”.

I commented out all of the output to HTML files, which was part of the original program. This seemed to improve processing.

#outputFilePeopleHtml <- "equityeditions_people.html"
#outputFilePeopleHtmlCon <- file(outputFilePeopleHtml, open = "w")

These files were large, maybe too large. Also, with the restarting of the program, they were incomplete because they only held results from the most recent run. To generate output files, I’ll write a new program to extract the data from the database after all the editions are processed.

After all that, though, I still had memory errors, and my trial-and-error fixes were providing only marginal improvement, if any.

Forcing garbage collection

Per this article, I added this command to force garbage collection in Java and free up memory.

gc()

But I still got out of memory errors. I then changed the program to remove all but essential objects from memory and force garbage collection after each edition is processed:

# Keep only the objects needed for the next edition, then force garbage collection.
objectsToKeep <- c("localuserpassword", "inputCon", "mydb", "urlLine", "entities", "printAndStoreEntities")
rm(list = setdiff(ls(), objectsToKeep))
gc()

After doing this, the processing program has now been running for several hours without stopping. So far, this has been the best improvement. The program has just passed March 1902 and has stored its 61,906th person entity in the database.


Continued work on a finding aid for The Equity based on NLP

This week I added a small relational database using MySQL to the project. It stores a record for each issue, a record for each entity found in an issue, and a cross-reference table for each type of entity.
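To picture the structure, here is a rough sketch of the people-related tables, with names inferred from the queries elsewhere in these posts; the column types and sizes are assumptions:

# A sketch of the schema for the people side; organizations and locations
# follow the same pattern. Types and sizes are assumptions.
dbSendQuery(mydb,
  "CREATE TABLE source_documents (
     id_source_document INT AUTO_INCREMENT PRIMARY KEY,
     source_document_name VARCHAR(100),
     source_document_base_url VARCHAR(255),
     source_document_file_extension_1 VARCHAR(10))")
dbSendQuery(mydb,
  "CREATE TABLE entities_people (
     id_entities_person INT AUTO_INCREMENT PRIMARY KEY,
     name VARCHAR(64),
     UNIQUE KEY name_UNIQUE (name))")
dbSendQuery(mydb,
  "CREATE TABLE people_x_sourcedocuments (
     id_entities_person INT,
     id_source_document INT)")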

Problems

I have run into memory issues. I am able to process several dozen issues and then get:

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

In my R program, I have tried this:

options(java.parameters = "-Xmx2048m")

However, the memory issue resurfaces on subsequent runs. I probably need to clean up the variables I am using more carefully. I would appreciate a diagnosis of the cause of this if you can help.

I have also had SQL errors when trying to insert some data from the OCR’d text containing characters such as apostrophes (‘) and backslashes (\), so I replace those characters with either an HTML entity like &apos; or nothing:

entitySql = gsub("'", "&apos;", entity)            # replace apostrophes with an HTML entity
entitySql = gsub("\n", "", entitySql)              # remove newlines
entitySql = gsub("\'", "", entitySql)              # "\'" is the same as "'" in R, so this removes any remaining apostrophes
entitySql = gsub("\\", "", entitySql, fixed=TRUE)  # remove backslashes

I also need to check the data to make sure I do not try to insert an entity that is already stored in what is supposed to be a list of unique entities, to avoid the error below:

Error in .local(conn, statement, ...) : could not run statement: Duplicate entry 'Household Department ; Health Department ; Young Folks’ Depart' for key 'name_UNIQUE'
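One way to avoid this error, sketched below with the same RMySQL calls, is to let MySQL skip rows that would violate the unique key; checking for an existing row before inserting would also work:

# INSERT IGNORE skips the row instead of failing when the name already exists;
# entitySql is the cleaned-up entity name built above.
dbSendQuery(mydb,
  sprintf("INSERT IGNORE INTO entities_people (name) VALUES ('%s')", entitySql))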

Future plans for this project

This week I plan to continue making adjustments to the program to make it run better and consume less memory. I also want to start generating output from the database so that I can display all of the issues a particular entity (person, location or organization) appears in.

Make the process repeatable

Later, I want to take this program and database and reset it to work on another English language newspaper that has been digitized. I plan to document the set up so that it can be used more easily by other digital historians.

Clean up of text errors from OCR

In the OCR’d text of The Equity there are many misspelled words due to errors from the OCR process. I would like to correct these errors, but there are two challenges to making corrections by redoing the OCR. The first challenge is volume: there are over 6,000 issues of The Equity to deal with. The second is that I was not able to achieve a better quality OCR result than what is available on the Bibliothèque et Archives nationales du Québec (BAnQ) web site. In a previous experiment I carefully re-did the OCR of a PDF of an issue of The Equity. While the text of each column was no longer mixed with other columns, the quality of the resulting words was no better. The cause may be that the resolution of the PDF on the website is not high enough to give a better result, and that to truly improve the OCR I would need access to a paper copy to make a higher resolution image.

While it seems that improving the quality of the OCR is not practical for me, I would still like to clear up misspellings. One idea is to apply machine learning to see if it is possible to correct the text generated by OCR. The article OCR Error Correction Using Character Correction and Feature-Based Word Classification by Ido Kissos and Nachum Dershowitz looks promising, so I plan to work on this a little later in the project. Perhaps machine learning can pick up the word pattern and correct “Massachusetts Supremo Court” found in the text of one of the issues.

Making an improved finding aid for The Equity.

It is an honour for me to be named the George Garth Graham Undergraduate Digital History Research Fellow for the fall 2017 semester. I wish to thank Dr. Shawn Graham for running this fellowship. I was challenged by and greatly enjoyed learning about DH in the summer of 2017, and I’m excited to continue working in this area.


I am keeping an open notebook and GitHub repository for this project to improve the finding aid I previously worked on for the Shawville Equity.

I wanted to experiment with natural language processing of text in R and worked through Lincoln Mullen’s lesson on the RPubs website. After preparing my copy of Mullen’s program, I was able to extract the names of people from the 8 September 1960 issue, such as the following (a rough sketch of this kind of extraction appears after the list):

 [24] "Walter Yach’s" 
[25] "Mrs. Irwin Hayes" 
[26] "D. McMillan"
[27] "Paul Martineau"
[28] "Keith Walsh"
[29] "Ray Baker"
[30] "Joe Russell" 
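For context, here is a minimal sketch of the kind of extraction Mullen’s lesson walks through, using the NLP and openNLP packages; the file name is a hypothetical example, and the English entity models from openNLPmodels.en need to be installed:

library(NLP)
library(openNLP)

# Read one issue's OCR text into a single String object (file name is hypothetical).
text <- as.String(paste(readLines("83471_1960-09-08.txt"), collapse = " "))

# Sentence and word annotations are required before the entity annotator can run.
annotations <- NLP::annotate(text, list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person")))

# Keep the spans the person annotator tagged and list the unique names.
people <- annotations[annotations$type == "entity"]
unique(text[people])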

I was also able to list organizations and places. With places, though, there is obvious duplication:

 [90] "Canada"
[91] "Spring"
[92] "Canada"
[93] "Ottawa"
[94] "Ottawa"
[95] "Ottawa"
[96] "Ottawa"  

I want to remove the duplication from these lists and also clean them up. I’d also like to catalogue when entities from these lists appear in other issues. For example, if I wanted to research a person named Olive Hobbs, I would like to see all of the issues she appears in. Given that there are over 6,000 editions of the Equity between 1883 and 2010, there are far too many entities and editions to record using static web pages, so I’ll use a database to store the entities as well as the issues they appear in. I intend to put the database on the web site so that it can be searched. The database may also be a source for further statistical analysis or text encoding.

I will use MySQL since I can run it locally on my machine for faster performance and also publish it to my web site. Also, the community edition of MySQL is free, which is always appreciated. One thing I am giving attention to is how to make sure the database remains secure when I deploy it to the web.

My progress this week has been to run a program that generates three files, one each for people, locations and organizations. I ran the program on the 1970 editions of the Equity, but I got the error below just as the program started to process the first edition from August 1970. I will have to tune up the program.

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  : 
  java.lang.OutOfMemoryError: GC overhead limit exceeded

I was also able to store some data into MySQL and will be able to leverage the database more in coming weeks.

Warping Maps – Where is Watership Down?

An exercise I had wanted to accomplish in Dr. Graham’s HIST 3814 was Module 4’s Simple Mapping and Georectifying. This involves taking a map and using the Harvard World MapWarp website to display it as a layer above a present-day map of the Earth, such as Google Maps. The website uses specific points that the subject map shares with today’s map and “warps” the subject map to match the scale, projection and location of the modern map. There are all kinds of uses for this.

One of them is plotting the locations from Richard Adams’ Watership Down. Mythgard.org has an excellent page of locations from the book based on the map below. As a young reader, I remember plotting the location of Watership Down on my National Geographic map of Great Britain in pencil. However, I missed a key fact: in this map, north is to the left, not up.

Map from Watership Down, with credit to cartographer Marilyn Hemmett, author Richard Adams and Ed Powell of the Mythgard.org website for posting it.

Despite the orientation of the map, it is easy to select and plot locations from this circa-1972 map onto a map of this part of England today. While there are new roads, the courses of rivers, pylon lines and railroads remain. Even the old Roman road, Caesar’s Belt, is still visible on today’s map. Here is the resulting map.

Plotting the railway map of the Kingston and Pembroke Railway.

During HIST 2809, I completed an assignment comparing two railway maps from the 1890s. One of them showed the Kingston and Pembroke Railway (K&P), which ran between Kingston and Renfrew, Ontario. Much of the area the K&P traversed is now wilderness and the former railway is now a trail, so plotting the many stops the K&P had in 1899 on a map from today makes an interesting comparison. Here is the resulting map.


Final Project – Start of Journey.

The white-space below is meant to be a picture: a network diagram of topics in the newspaper created using the Fruchterman-Reingold algorithm. However, there is not much of a visible network.

 

Even as the image above represents failure, it also represents the thrill of doing digital history this week. The final week of the course has truly been thrilling because now I can start to really use what we have been learning and also see even more of its potential. The fail log of this week’s work is here.

The final week is also daunting. It’s been a week of long hours and decisions about when to move on to another part of the project rather than keep refining. I had a strong urge to continue to program and experiment, but realized I needed to document my findings in order for this to be a work of history. For presenting my project I decided to use Mahara because it offers more flexibility than a blog, but also looks more polished than a scratch-built web site. However, I forgot that Mahara has a high PITA factor, with multiple mouse-clicks needed to do things, and I reflected that using Markdown would have been more efficient and future-proof.

As alluded to above, I ran into errors. I worked on a topic model of a collection of editions of the Shawville Equity that contained the results of a provincial election. However, as I ran the program, I ran into an error, shown in the diagram below.

Running the code past the error generated the set of almost totally disconnected dots visualized above.  This was a productive fail because it caused me to consider two things.

The simplest of these was to do with stopwords. I had a stopwords file, but it was not working. I examined the documentation for mallet.import and it indicated that stopwords must be listed one per line, while my stopwords file was separated by commas, with many words on a single line. I got a new stopwords file and that fixed the issue.
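For reference, this is roughly where the stoplist comes in; a sketch, where documents is assumed to be a data frame of issue ids and texts built earlier and the stoplist file lists one word per line:

library(mallet)

# The stoplist file must contain one stopword per line, not comma-separated words.
mallet.instances <- mallet.import(documents$id, documents$text,
                                  stoplist.file = "en.txt",
                                  token.regexp = "\\p{L}[\\p{L}\\p{P}]+\\p{L}")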

The other item I considered was my collection. I had thought there would be themes common across provincial elections, but an examination of the topics from the model did not support that. In fact, given that the issues and people running in provincial elections change over the years, there would likely be few common topics spanning more than a decade of provincial election coverage.

The error and Dr. Graham’s help prompted me to look at my dataset and expand it. Using all of the text files for the Shawville Equity from 1970-1980 caused the program to run without an error. It also provided a more complex visualization of the topics covered during these eleven years. I want to understand what the ball of topics at the bottom left of the visualization below represents.

I also completed work on a finding aid for The Equity by writing an R program. This program lists all of the editions of the Equity and notes which editions were published after specified dates. I started with election dates and want to include other events such as fairs. This program can be adapted for other digitized newspapers archived in a similar manner as the Equity has been.
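The core of that check is simple date arithmetic; here is a minimal sketch, where the file names and event date are just examples:

# Edition file names follow the pattern 83471_YYYY-MM-DD, so the publication
# date can be parsed straight out of the name (example values below).
editions <- c("83471_1886-03-18", "83471_1886-10-21", "83471_1886-10-28")
edition_dates <- as.Date(sub("^83471_", "", editions))

# First edition published on or after a given event date, e.g. an election day.
event <- as.Date("1886-10-14")
editions[which(edition_dates >= event)[1]]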

The repository for this week’s work is here. Learning continues!

Consequences of Digital History

In Mapping Texts: Combining Text-Mining and Geo-Visualization to Unlock the Research Potential of Historical Newspapers, the authors set the context for their ambitious project to develop analytical models to gain historical insight into the content of 232,500 newspaper pages. One of the items they discuss is the abundance of electronic sources for research as “historical records of all kinds are becoming increasingly available in electronic forms.” Historians living in this age of abundance face new challenges, as highlighted by HIST3814o classmate Printhom, who asks “will there be many who cling to the older methods or will historians and academics see the potentially paradigm shifting benefits of technology.”

As pointed out in Mapping Texts, the abundance of sources can overwhelm researchers. In fact, a research project encompassing hundreds of thousands of pages of content also presents significant financial and technical barriers. If history is shaped by the people who write it, who will be the people who overcome these barriers to produce digital history, and which historians and communities will be disenfranchised by a lack of digital access?

Digital historian Michelle Moravec has been described as someone who “[writes] about marginalized individuals using a marginalized methodology and [publishes] in marginalized places.” Clearly her use of digital history, rather than being a barrier, has been a means to gain a deeper understanding of history, such as her work on the relationship between language use and the history of women’s suffrage in the United States.

At the same time, even with an abundance of available resources, there are also choices made that determine what is available, whether an investment will be made to create a digital source, and whether there will be a price to access it.

Another consideration is whether the providers of digital history have an agenda beyond history. In her podcast Quantifying Kissinger, Micki Kaufman quotes the former U.S. Secretary of State: “everything on paper will be used against me.” Kaufman outlines some of the controversies caused when formerly secret information was made public or, in a humorous case, when publicly available information was republished on WikiLeaks. These issues point to the fact that it may be more than just digital historians benefiting from the release of digital sources. Those releasing data may have an agenda to advance, a reputation to burnish or someone to discredit.

Analysing Data: Module 4 of HIST3814o

A finding aid for the Shawville Equity.

This wasn’t an exercise for this week, but I started the week wanting to create a list of the Equity files to make it easier to follow up on annual events. Last week I was looking at the annual reporting of the Shawville Fair and it took a bit of guesswork to pick the right edition of the Equity. Over time, this led to sore eyes from looking at PDFs I didn’t need to and discomfort in my mouse hand.

To get a list of URLs, I used wget to record all of the directories that hold the Equity files on the BANQ website to my machine, without downloading the files, using the spider option.

wget --spider --force-html -r -w1 -np http://collections.banq.qc.ca:8008/jrn03/equity/

ls -R listed all of the directories and I sent that output to a text file. Using grep to extract only lines containing “collections.banq.qc.ca:8008/jrn03/equity/src/” gave me a list of directories, to which I prepended “http://” to make URLs like “http://collections.banq.qc.ca:8008/jrn03/equity/src/1990/05/02/”. I started editing this list in Excel, but realized it would be more educational and, just as importantly, reproducible if I wrote an R script (a rough sketch of that step appears after the list below). I am thinking this R script would note conditions such as:

  • The first edition after the first Sunday of November every 4 years would have municipal election coverage.
  • The annual dates of the Shawville and Quyon fairs.
  • Other significant annual events as found in tools such as Antconc.

That’s not done yet.
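As a rough sketch of the reproducible version of the URL-list step mentioned above (the input file name is hypothetical; it holds the ls -R output):

# Keep only the lines naming Equity source directories and turn them into URLs.
dirs <- readLines("equity_directories.txt")
dirs <- grep("collections.banq.qc.ca:8008/jrn03/equity/src/", dirs,
             value = TRUE, fixed = TRUE)
urls <- paste0("http://", dirs)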

Using R.

I am excited by the power of this language, which I have not used before. I also liked seeing how well integrated it is with Git. I was able to add my .Rproj file and my script in a .R file and push these to GitHub.

The illustrations below come from completing Dr. Shawn Graham’s tutorial on Topic Modelling in R.

Here is a result of using the wordcloud package in R:

Below, R produced a diagram representing words that are shared by topics:

Here is a beautiful representation of links between topics based on their distribution in documents, using the cluster and igraph packages in R:

I can see that the R language and packages offer tremendous flexibility and potential for DH projects.

Exercise 5, Corpus Linguistics with AntConc.

I loaded AntConc locally. For the corpus, at first, I used all of the text files for the Shawville Equity between 1900 and 1999 and then searched for occurrences of flq in the corpus. My draft results are here.

When I did a word list, AntConc failed with an error, so I decided to use a smaller corpus of 11 complete years, 1960-1970, roughly from the start of the Quiet Revolution until the October Crisis.

Initially, I did not include a stop word list; I wanted to compare the most frequently found word, “the”, with “le”, which appears to be the most frequently found French word. “The” occurred 322,563 times. “Le” was the 648th most common word in the corpus and was present 1,656 times.

The word English occurred 1,226 times, while Français occurred 20 times during 1960-1970.

Keyword list in AntConc.

To work with the keyword list tool of AntConc, I used text files from the Equity for the complete years 1960-1969 as the corpus. For comparison, I used files from 1990-1999. I also used Matthew L. Jockers’ stopwords list. In spite of having a stopwords list, I still had useless keywords such as S and J.

Despite this, it is interesting to me that two words in the top 40 keywords concern religion, rev and church, and two other keywords may concern church: service and Sunday. Of the days of the week, Sunday is the one listed first as a keyword. Is this an indicator that Shawville remained a comparatively more religious community than others in Quebec, where people left their churches in the wake of the Quiet Revolution?

Details of these results are in my fail log.

Exercise 6, Text Analysis with Voyant.

After failing to load a significant number of text files from the Shawville Equity into Voyant, I instead put them into a zip file and hosted it on my website. This allowed me to create a corpus in Voyant for the Shawville Equity. I was able to load all of the text files from 1960-1999, but this caused Voyant to produce errors, so I moved to a smaller corpus of 1970-1980.

Local map in the Shawville Equity.

I tried to find a map of Shawville in the Shawville Equity by doing some text searches with terms like “map of Shawville”, but I have not found anything yet. If you find a map of Shawville in the Equity, please let me know.

Conclusion.

This was an excellent week of learning for me, particularly seeing what R can do.  I would like more time to work with the mapping tools in the future.


Grooming data: Module 3 of HIST 3814o

This week in our course we considered what it means for historians to use data and adapt it to their purposes.  We also learned techniques to do that, which I will touch on below.

Yet one of the biggest learning opportunities came from a productive failure: the availability of our course’s technical environment, DHBox. DHBox, accessed from our university over a virtual private network, has been a marvel to use. However, even this virtual environment ran into an issue at the start of this course week, when it became encumbered with too many files from its users.

The lack of availability of DHBox presented a great learning opportunity for students to adapt to. Mac users were able to use the terminal on their computers to replicate DHBox. For learning regular expressions (“RegEx”), sophisticated text editors such as Notepad++ could also be used instead of DHBox.

Unix running on Windows. Very cool.

Sarah Cole used the lack of availability of DHBox as an opportunity to set up her Windows machine to run Cygwin, a Unix-like command line environment. This is fantastic; I never thought I would see Unix running on Windows. I had a similar experience: Windows 10 can run Ubuntu. With this installed, I was able to use my Windows computer in a whole new way. I would not have thought to look for this if DHBox had been available.

Without going on at length, the lack of availability of DHBox this week highlighted not only the dependence digital historians have on technology, but also how we must plan to build technical resiliency into our projects.  We should plan for technical failure. Despite the lack of availability of DHBox this week, our class was able to move ahead because alternatives had been considered in advance and we were able to use them.

RegEx.

I learned a lot from the exercise on regular expressions. I had used RegEx a few times before, but I never really understood it. Using it was more like incantation, a spell to make data look different. Most of the time I need to understand what the statements in a program mean before I use them; it is unwise not to. This week with RegEx, I am finally starting to get a grasp of it. One thing that helped me was to make a copy of each file as I worked through each step of the exercise. This simple practice allowed me to easily restart a step if I made a mistake.

OpenRefine.

Having spent a lot of time in previous years correcting data in spreadsheets and databases, I was impressed with OpenRefine’s capabilities and ease of use. This week I only scratched the surface of what OpenRefine can do, but it already seems like an indispensable tool for working with structured data in columns and rows.

Wikipedia API.

I was thinking about our final project for the course and found out that Wikipedia has an API, thanks to the example here. This is something I would like to follow up on and use for DH work in the future.

Last word.

This week I again learned a tremendous amount, not only from the exercises but from the chance to follow some ideas that our course challenged me to think about. DHBox was down, and so I was able to use Ubuntu on my Windows PC instead. My OCR results for an exercise from the previous week were poor; looking for alternatives led me to try OCRFeeder. Challenges made me curious to see how others in the class met them, and I learned a lot from that as well.