Making an improved finding aid for The Equity.

It is an honour for me to be named the George Garth Graham Undergraduate Digital History Research Fellow for the fall 2017 semester. I wish to thank Dr. Shawn Graham for running this fellowship. I was challenged by and greatly enjoyed learning about DH in the summer of 2017, and I'm excited to continue working in this area.

I am keeping an open notebook and GitHub repository for this project to improve the finding aid I previously worked on for the Shawville Equity.

I wanted to experiment with natural language processing of text in R and worked through Lincoln Mullen's lesson on the RPubs website. After preparing my copy of Mullen's program (a rough sketch of the pipeline appears after the output below), I was able to extract the names of people from the 8 September 1960 issue, such as:

 [24] "Walter Yach’s" 
[25] "Mrs. Irwin Hayes" 
[26] "D. McMillan"
[27] "Paul Martineau"
[28] "Keith Walsh"
[29] "Ray Baker"
[30] "Joe Russell" 
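
For reference, the core of my adaptation of Mullen's pipeline looks roughly like the sketch below. The file name is illustrative, and openNLPmodels.en supplies the pretrained English models the annotators need.

library(NLP)
library(openNLP)
library(openNLPmodels.en)   # pretrained English sentence, word, and entity models

# Read one issue's OCR text into a single String (file name is illustrative)
text <- as.String(paste(readLines("equity-1960-09-08.txt"), collapse = " "))

# Sentence and word annotations must be computed before entity annotation
pipeline <- list(
  Maxent_Sent_Token_Annotator(),
  Maxent_Word_Token_Annotator(),
  Maxent_Entity_Annotator(kind = "person"),
  Maxent_Entity_Annotator(kind = "location"),
  Maxent_Entity_Annotator(kind = "organization")
)
anns <- annotate(text, pipeline)

# Keep only the entity annotations, then pull out one kind, e.g. people
ents  <- anns[anns$type == "entity"]
kinds <- sapply(ents$features, `[[`, "kind")
text[ents[kinds == "person"]]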

I was also able to list organizations and places. With places, though, there is obvious duplication:

 [90] "Canada"
[91] "Spring"
[92] "Canada"
[93] "Ottawa"
[94] "Ottawa"
[95] "Ottawa"
[96] "Ottawa"  

I want to remove the duplication from these lists and clean them up. I would also like to catalogue when entities from these lists appear in other issues. For example, if I wanted to research a person named Olive Hobbs, I would like to see all of the issues she appears in. Given that there are over 6,000 editions of the Equity between 1883 and 2010, there are far too many entities and editions to record using static web pages, so I'll want a database to store the entities along with the issues they appear in. I intend to put the database on the web site so that it can be searched; it may also be a source for further statistical analysis or text encoding.
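
As a first cleaning step, duplicates can be dropped and stray OCR whitespace trimmed in R; a minimal sketch using the places output above:

# The places vector as extracted above, duplicates included
places <- c("Canada", "Spring", "Canada", "Ottawa", "Ottawa", "Ottawa", "Ottawa")

# Trim whitespace, drop exact duplicates, and sort for review
places_clean <- sort(unique(trimws(places)))
places_clean
# [1] "Canada" "Ottawa" "Spring"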

I will use MySQL since I can run it locally on my machine for faster performance and still publish it to my web site. The Community Edition of MySQL is also free, which is always appreciated. One thing I am paying attention to is how to keep the database secure when I deploy it to the web.
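
A natural shape for the data is one table of distinct entities and a linking table recording the issues each entity appears in. Below is a minimal sketch using the DBI and RMySQL packages; the database, table, and column names are my own assumptions, not a final design, and the credentials are placeholders.

library(DBI)
library(RMySQL)

# Connect to the local MySQL server (credentials are placeholders)
con <- dbConnect(RMySQL::MySQL(), dbname = "equity",
                 host = "localhost", user = "equity_user",
                 password = "********")

# One row per distinct entity (person, location, or organization)
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS entity (
    entity_id   INT AUTO_INCREMENT PRIMARY KEY,
    entity_name VARCHAR(255) NOT NULL,
    entity_type ENUM('person','location','organization') NOT NULL,
    UNIQUE (entity_name, entity_type)
  )")

# One row per appearance of an entity in a given issue
dbExecute(con, "
  CREATE TABLE IF NOT EXISTS entity_issue (
    entity_id  INT  NOT NULL,
    issue_date DATE NOT NULL,
    PRIMARY KEY (entity_id, issue_date),
    FOREIGN KEY (entity_id) REFERENCES entity (entity_id)
  )")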

My progress this week has been to run a program that generates three files, one each for people, locations, and organizations. I ran the program on the 1970 editions of the Equity, but I got the error below just as it started to process the first edition from August 1970. I will have to tune the program.

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  : 
  java.lang.OutOfMemoryError: GC overhead limit exceeded
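
The error comes from the Java virtual machine that runs the openNLP models under rJava. One likely remedy, which I still need to verify, is to raise the JVM heap limit before rJava (or any package that loads it) is attached:

# Must run before rJava is loaded; 4 GB is a guess at a heap
# large enough to process a full year of issues
options(java.parameters = "-Xmx4g")

library(NLP)
library(openNLP)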

I was also able to store some data in MySQL and will be able to leverage the database more in the coming weeks.
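
As an illustration of that step, reusing the hypothetical connection and the cleaned vector from the sketches above, appending one issue's cleaned places to a staging table might look like this:

# Pair each cleaned place name with its type and source issue
staging <- data.frame(entity_name = places_clean,
                      entity_type = "location",
                      issue_date  = as.Date("1960-09-08"))

# Append to a staging table for later merging into entity/entity_issue
dbWriteTable(con, "entity_staging", staging,
             append = TRUE, row.names = FALSE)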
