This week’s progress on an improved finding aid for the Shawville Equity.

This week’s progress is represented by the simple list below, which shows the issues of the Equity that contain information about Alonzo Wright.

Alonzo Wright

83471_1883-10-25
83471_1884-11-27
83471_1885-01-22
83471_1885-07-16
83471_1886-03-18
83471_1894-01-11
83471_1898-09-01
83471_1900-03-29

The list is generated from a database containing all of the entities found in the issues of the Equity that I have processed so far. Here is the select statement:

Select entities_people.name, source_documents.source_document_name,
       source_documents.source_document_base_url, source_documents.source_document_file_extension_1
from entities_people
left join people_x_sourcedocuments
  on entities_people.id_entities_person = people_x_sourcedocuments.id_entities_person
left join source_documents
  on people_x_sourcedocuments.id_source_document = source_documents.id_source_document
where entities_people.name = "Alonzo Wright"
group by source_documents.source_document_name
order by entities_people.name

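A list like this can be generated from R by running the query and printing one line per issue. Below is a minimal sketch of that step; the database name, host and password are placeholders rather than my real settings.

library(DBI)
library(RMySQL)

# Placeholder connection details -- not the real database name or credentials.
mydb <- dbConnect(RMySQL::MySQL(), dbname = "equity", host = "localhost",
                  user = "localuser", password = "********")

# The same query as above, trimmed to the one column the list needs.
issues <- dbGetQuery(mydb, "
  select source_documents.source_document_name
  from entities_people
  left join people_x_sourcedocuments
    on entities_people.id_entities_person = people_x_sourcedocuments.id_entities_person
  left join source_documents
    on people_x_sourcedocuments.id_source_document = source_documents.id_source_document
  where entities_people.name = 'Alonzo Wright'
  group by source_documents.source_document_name")

cat(issues$source_document_name, sep = "\n")
dbDisconnect(mydb)
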
Currently I am processing issue 981 of the Equity, published on 21 March 1901, and each issue takes several minutes of processing time. Although I have more than 5,000 issues still to go before I reach my processing goal, this is solid progress compared to earlier this week.

Overcoming memory issues.

I had a sticky problem where my R Equity processing program would stop with memory errors after processing only a few editions. Given that I want to process about 6,000 editions, it would be beyond tedious to restart the program after each error.

I modified the database and program to store a processing status so that the program could pick up after the last edition it finished, rather than starting at the first edition each time. This was successful, but it was a workaround rather than a fix.
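
In rough terms the checkpointing looks like the sketch below. The processing_status column and its values are hypothetical names used for illustration; mydb is the open database connection and idSourceDocument stands for the id of the edition just processed.

# Sketch: 'processing_status' is a hypothetical column; mydb and idSourceDocument are assumed to exist.
# After an edition is finished, record that fact...
dbExecute(mydb, paste0("UPDATE source_documents SET processing_status = 'done' ",
                       "WHERE id_source_document = ", idSourceDocument))

# ...and on start-up, resume with the editions that are not marked as done.
toProcess <- dbGetQuery(mydb, "SELECT id_source_document, source_document_name
                               FROM source_documents
                               WHERE processing_status IS NULL
                                  OR processing_status <> 'done'
                               ORDER BY source_document_name")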

Since I was dealing with this error:

 Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to reduce the amount of garbage collection (GC) that Java was doing. I removed some of the dbClearResult(rs) statements, on the theory that they were causing the underlying Java to do garbage collection, and this seemed to work better.

Later I got another error message:

     java.lang.OutOfMemoryError: Java heap space

So I increased the memory available to Java here:

     options(java.parameters = "-Xmx4096m")

Per this article, I tried this:

     options(java.parameters = "-Xms4096m -Xmx4096m")
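
One detail worth noting (this is general rJava behaviour, not something from the article): these options only take effect if they are set before the Java virtual machine starts, so they need to come before the Java-based packages are loaded in the R session.

     options(java.parameters = "-Xms4096m -Xmx4096m")   # set this first
     library(rJava)                                      # load Java-based packages afterwards;
     library(openNLP)                                    # the JVM picks up the options when it starts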

I still got “java.lang.OutOfMemoryError: GC overhead limit exceeded”.

I commented out all of the output to HTML files, which was part of the original program. This seemed to improve processing.

#outputFilePeopleHtml <- "equityeditions_people.html"
#outputFilePeopleHtmlCon <- file(outputFilePeopleHtml, open = "w")

These files were large, maybe too large. Also, with the re-starting of the program, they were incomplete because they contained results only from the most recent run. To generate output files, I’ll write a new program that extracts the data from the database after all the editions are processed.

After all that, though, I still had memory errors, and my trial-and-error fixes were providing only marginal improvement, if any real improvement at all.

Forcing garbage collection

Per this article, I added this command to force garbage collection in Java and free up memory:

gc()

But I still got out-of-memory errors. I then changed the program to remove all but essential objects from memory and to force garbage collection after each edition is processed:

# Keep only the objects needed to process the next edition...
objectsToKeep <- c("localuserpassword", "inputCon", "mydb", "urlLine", "entities", "printAndStoreEntities")
# ...remove everything else from the workspace and force garbage collection.
rm(list = setdiff(ls(), objectsToKeep))
gc()

After doing this, the processing program has been running for several hours and has not stopped. So far, this has been the best improvement. The program has just passed March 1902 and has stored its 61,906th person entity in the database.

Continued work on a finding aid for The Equity based on NLP

This week I added a small MySQL relational database to the project. It stores a record for each issue, a record for each of the entities found in the issues, and a cross-reference table for each type of entity.
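
To give a sense of the shape of the database, here is a sketch of the tables. The table and column names match the ones that appear in my queries and error messages, but the data types and keys are my assumptions rather than the exact schema; mydb is an open DBI/RMySQL connection.

# Sketch only: column types and key definitions are assumptions.
dbExecute(mydb, "CREATE TABLE IF NOT EXISTS source_documents (
    id_source_document INT AUTO_INCREMENT PRIMARY KEY,
    source_document_name VARCHAR(255),
    source_document_base_url VARCHAR(255),
    source_document_file_extension_1 VARCHAR(16))")

dbExecute(mydb, "CREATE TABLE IF NOT EXISTS entities_people (
    id_entities_person INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(255),
    UNIQUE KEY name_UNIQUE (name))")

# Cross-reference table linking people to the issues they appear in.
dbExecute(mydb, "CREATE TABLE IF NOT EXISTS people_x_sourcedocuments (
    id_entities_person INT,
    id_source_document INT)")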

Problems

I have run into memory issues. I am able to process several dozen issues and then get:

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

In my R program, I have tried this:

options(java.parameters = "-Xmx2048m")

However, the memory issue resurfaces on subsequent runs. I probably need to clean up the variables I am using better. I would appreciate a diagnosis of the cause of this if you can help.

I have also had SQL errors when trying to insert some data from the OCR’d text, such as apostrophes (‘) and backslashes (\), so I replace those characters with either an HTML entity like &apos; or an empty string:

entitySql = gsub("'", "&apos;", entity)               # replace apostrophes with an HTML entity
entitySql = gsub("\n", "", entitySql)                 # remove newlines
entitySql = gsub("\'", "", entitySql)                 # "\'" is the same as "'" in R; removes any remaining single quotes
entitySql = gsub("\\", "", entitySql, fixed=TRUE)     # remove backslashes
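
An alternative I may switch to is letting the database interface do the escaping, for example with DBI’s dbQuoteString(), rather than hand-rolling gsub() calls. A sketch, assuming the open connection mydb:

nameSql <- DBI::dbQuoteString(mydb, entity)   # quotes the value as a SQL string literal and escapes embedded quotes
sql <- paste0("INSERT INTO entities_people (name) VALUES (", nameSql, ")")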

I also need to check the data to make sure I do not try to insert an entity that is already in what is supposed to be a list of unique entities, to avoid the error below:

Error in .local(conn, statement, ...) : could not run statement: Duplicate entry 'Household Department ; Health Department ; Young Folks’ Depart' for key 'name_UNIQUE'
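
The likely fix is to look an entity up before trying to insert it (or to use MySQL’s INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE). A sketch of the look-up approach, reusing the cleaned entitySql value and the open connection mydb from above:

# Only insert the entity if it is not already in the table of unique names.
existing <- dbGetQuery(mydb, paste0(
  "SELECT id_entities_person FROM entities_people WHERE name = '", entitySql, "'"))
if (nrow(existing) == 0) {
  dbExecute(mydb, paste0(
    "INSERT INTO entities_people (name) VALUES ('", entitySql, "')"))
}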

Future plans for this project

This week I plan to continue making adjustments to the program to make it run better and consume less memory. I also want to start generating output from the database so that I can display all of the issues a particular entity (person, location or organization) appears in.

Make the process repeatable

Later, I want to take this program and database and reset them to work on another English-language newspaper that has been digitized. I plan to document the setup so that it can be used more easily by other digital historians.

Clean up of text errors from OCR

In the OCR’d text of The Equity there are many misspelled words due to errors in the OCR process. I would like to correct these errors, but there are two challenges to making corrections by redoing the OCR. The first challenge is volume: there are over 6,000 issues of The Equity to deal with. The second is that I was not able to achieve a better quality OCR result than what is already available on the Bibliothèque et Archives nationales du Québec web site. In a previous experiment I carefully re-did the OCR of a PDF of an issue of The Equity. While the text of each column was no longer mixed with other columns, the quality of the resulting words was no better. The cause may be that the resolution of the PDF on the website is not high enough to give a better result, and that to truly improve the OCR I would need access to a paper copy in order to make a higher resolution image of it.

While it seems that improving the quality of the OCR is not practical for me, I would still like to clear up misspellings. One idea is to apply machine learning to see if it is possible to correct the text generated by OCR. The article “OCR Error Correction Using Character Correction and Feature-Based Word Classification” by Ido Kissos and Nachum Dershowitz looks promising, so I plan to work on this a little later in the project. Perhaps machine learning can pick up the word pattern and correct “Massachusetts Supremo Court”, which appears in the text of one of the issues.

Making an improved finding aid for The Equity.

It is an honour for me to be named the George Garth Graham Undergraduate Digital History Research Fellow for the fall 2017 semester. I wish to thank Dr. Shawn Graham for running this fellowship. I was challenged by, and greatly enjoyed, learning about DH in the summer of 2017, and I’m excited to continue working in this area.


I am keeping an open notebook and a GitHub repository for this project to improve the finding aid I previously worked on for the Shawville Equity.

I wanted to experiment with natural language processing of text in R, so I worked through Lincoln Mullen’s lesson on the RPubs website. After preparing my copy of Mullen’s program, I was able to extract people from the 8 September 1960 issue, such as:

 [24] "Walter Yach’s" 
[25] "Mrs. Irwin Hayes" 
[26] "D. McMillan"
[27] "Paul Martineau"
[28] "Keith Walsh"
[29] "Ray Baker"
[30] "Joe Russell" 
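
The extraction itself follows the openNLP pipeline from Mullen’s lesson. A minimal sketch of the idea (the input file name below is made up, and the openNLPmodels.en package supplies the pre-trained “person” model):

library(NLP)
library(openNLP)   # openNLPmodels.en must also be installed for the entity models

# Read one issue's OCR'd text into a single String object.
text <- as.String(readLines("equity_1960-09-08.txt"))   # hypothetical file name

# Annotate sentences and words first, then tag person entities.
annotations <- annotate(text, list(Maxent_Sent_Token_Annotator(),
                                   Maxent_Word_Token_Annotator(),
                                   Maxent_Entity_Annotator(kind = "person")))

# Pull out the spans of text that were tagged as entities.
people <- text[annotations[annotations$type == "entity"]]
unique(people)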

I was also able to list organizations and places. With places, though, there is obvious duplication:

 [90] "Canada"
[91] "Spring"
[92] "Canada"
[93] "Ottawa"
[94] "Ottawa"
[95] "Ottawa"
[96] "Ottawa"  

I want to remove the duplication from these lists and also clean them up. I’d also like to catalog when entities from these lists appear in other issues. For example, if I wanted to research a person named Olive Hobbs, I would like to see all of the issues she appears in. Given that there are over 6,000 editions of the Equity between 1883 and 2010, there are far too many entities and editions to record using static web pages, so I’ll want to use a database to store the entities as well as the issues they appear in. I intend to put the database on the web site so that it can be searched. The database may also be a source for further statistical analysis or text encoding.

I will use MySQL since I can run it locally on my machine for faster performance and still publish it to my web site. Also, the Community Edition of MySQL is free, which is always appreciated. One thing I am giving attention to is how to keep the database secure when I deploy it to the web.

My progress this week has been to run a program that generates three files, one each for people, locations and organizations. I ran the program on the 1970 editions of the Equity, but I got the error below just as the program started to process the first edition in August 1970. I will have to tune up the program.

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  : 
  java.lang.OutOfMemoryError: GC overhead limit exceeded

I was also able to store some data in MySQL and will be able to make more use of the database in the coming weeks.

Warping Maps – Where is Watership Down?

An exercise I had wanted to complete in Dr. Graham’s HIST 3814 was Module 4’s Simple Mapping and Georectifying. This involves taking a map and using the Harvard WorldMap Warp website to display the map as a layer above a present-day map of the Earth, such as Google Maps. The website uses specific points that the subject map shares with today’s map and “warps” the subject map to match the scale, projection and location of the modern map. There are all kinds of uses for this.

One of them is plotting the locations from Richard Adams’ Watership Down. Mythgard.org has an excellent page of locations from the book based on the map below. As a young reader, I remember plotting the location of Watership Down in pencil on my National Geographic map of Great Britain. However, I missed a key fact: in this map, north is to the left, not up.

Watership Down Book Map
Map from Watership Down. With credit to cartographer Marilyn Hemmett, author Richard Adams and Ed Powell of the Mythgard.org website for posting this.

Despite the orientation of the map, it is easy to select and plot locations from this circa-1972 map onto a map of this part of England today. While there are new roads, the courses of rivers, pylon lines and railroads remain. Even the old Roman road, Caesar’s Belt, is still visible on today’s map. Here is the resulting map.

Plotting the railway map of the Kingston and Pembroke Railway.

During HIST 2809, I completed an assignment comparing two railway maps from the 1890s. One of them showed the Kingston and Pembroke Railway (K&P), which ran between Kingston and Renfrew, Ontario. Much of the area that the K&P traversed is now wilderness, and the former railway is now a trail, so plotting the many stops the K&P had in 1899 on a map from today makes an interesting comparison. Here is the resulting map.