I heard last week from Dr. Graham that one of the grade 11 Law and Society classes at Pontiac High School is using the finding aid for The Equity. I’m glad to hear it’s being accessed for research, and I thank the class and its teacher for making use of it.
As noted previously, I plan to work on refining the finding aid, including correcting OCR errors. I had been thinking of using Google, but it’s against their terms of service to submit huge numbers of requests. Fair enough. Google’s director of research Peter Norvig published an article about spelling correction using an offline method. Here is an article describing how to do this in R. To use this I will need to add words, such as local place names, to Norvig’s spell check corpus big.txt. As promising as this seems to be, I will leave this work for another time.
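To make the idea concrete, here is a minimal sketch of a Norvig-style corrector in base R. It is not the code from either article. The one-sentence corpus stands in for big.txt, the place names are hypothetical examples of local additions, and the function names are mine. It only considers candidates one edit away from the input, whereas Norvig's version also looks two edits away.

```r
# Toy stand-in for big.txt; a real run would read the full corpus instead.
corpus <- "the farm on the road to shawville was sold and the farm at bristol was let"
# Hypothetical local place names, added because a general corpus lacks them.
local_names <- c("shawville", "bristol", "pontiac")

# Word counts: a word seen more often is treated as the more likely correction.
words <- c(strsplit(tolower(corpus), "[^a-z]+")[[1]], local_names)
counts <- table(words[nzchar(words)])

# Every string one edit away from `word`:
# deletions, transpositions, replacements and insertions.
edits1 <- function(word) {
  n <- nchar(word)
  i <- 0:n
  head <- substring(word, 1, i)      # "", first letter, first two letters, ...
  tail <- substring(word, i + 1, n)  # the remainder after each head
  has1 <- nchar(tail) >= 1
  has2 <- nchar(tail) >= 2
  deletes <- paste0(head[has1], substring(tail[has1], 2))
  transposes <- paste0(head[has2], substring(tail[has2], 2, 2),
                       substring(tail[has2], 1, 1), substring(tail[has2], 3))
  replaces <- unlist(lapply(letters, function(ch)
    paste0(head[has1], ch, substring(tail[has1], 2))))
  inserts <- unlist(lapply(letters, function(ch) paste0(head, ch, tail)))
  unique(c(deletes, transposes, replaces, inserts))
}

# Return the most frequent known word within one edit, or the input unchanged.
correct <- function(word) {
  word <- tolower(word)
  if (word %in% names(counts)) return(word)
  candidates <- edits1(word)
  candidates <- candidates[candidates %in% names(counts)]
  if (length(candidates) == 0) return(word)
  candidates[which.max(as.vector(counts[candidates]))]
}

correct("teh")       # "the"
correct("shawvile")  # "shawville"
```

Without the added place names, "shawvile" would only be fixed if the name happened to appear in the corpus, which is why a general corpus needs the local words.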
This week’s research faces a different challenge, warts and all. I have been researching the British Gazette for Dr. Y. Aleksandra Bennett’s HIST 4500 seminar on British Society and the Experience of the First World War. The Gazette is a trove of official announcements. One of my areas of inquiry concerns allotment gardens for food production in World War I. The Gazette contains information about the regulations that governed these food gardens during the Great War. It also contains announcements about the discovery of potato wart disease in individual allotment gardens, and each notice names the infected location. With 194 of these notices, I believe this is a potentially useful body of data to derive some patterns from. At a minimum I would like to list all of the locations of the allotments in Britain and plot them on a map. Was potato wart a regional or national problem? What was the extent and timeline of the issue? This of course assumes the Gazette is a reliable source for this information.
Getting 194 pages from the Gazette is doable manually, but we can write a program to do that, and then re-use the program for other things.
Using what I learned in HIST 3814, I checked whether the Gazette has an API, and it does. In fact there are lots of options to download data in JSON, XML and some other formats.
I tried to use the JSON API for the British Gazette, but it gave me an error:
    > json_file <- "https://www.thegazette.co.uk/all-notices/notice/data.json?end-publish-date=1918-11-11&text=potatoes+wart+schedule&start-publish-date=1914-08-03&location-distance-1=1&service=all-notices&categorycode-all=all&numberOfLocationSearches=1"
    > json_data <- fromJSON(file=json_file)
    Error in fromJSON(file = json_file) : unexpected character: "
I decided to switch to XML, which has worked fine. The program below requests the search feed from the Gazette and puts the XML into a data frame:
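    library(XML)
    # Search feed URL for the query
    xml_file <- "https://www.thegazette.co.uk/all-notices/notice/data.feed?end-publish-date=1918-11-11&text=potatoes+wart+schedule&start-publish-date=1914-08-03&location-distance-1=1&service=all-notices&categorycode-all=all&numberOfLocationSearches=1"
    # Download the feed and parse it into an XML tree
    xmlfile <- xmlTreeParse(readLines(xml_file))
    topxml <- xmlRoot(xmlfile)
    # Collect the text values found under each top-level element of the feed
    topxml <- xmlSApply(topxml, function(x) xmlSApply(x, xmlValue))
    # Transpose so that each top-level element becomes a column
    xml_df <- data.frame(t(topxml), row.names=NULL)
    # The feed's "total" field, as an integer
    totalPagesReturned <- as.integer(xml_df$total)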
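As you can see, the URL carries the parameters that select the material we want: the search text (potatoes wart schedule) and the range of publication dates, from 1914-08-03 to 1918-11-11.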
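To show how the pieces of that URL fit together, here is a small sketch that rebuilds it from a named list of parameters. The parameter names and values are copied from the URL above. The helper build_gazette_url is a hypothetical name of my own, not part of any Gazette library, and it only handles spaces in values; other special characters would need proper URL encoding.

```r
# Search parameters copied from the URL above, in the same order.
params <- c(
  "end-publish-date"         = "1918-11-11",
  "text"                     = "potatoes wart schedule",
  "start-publish-date"       = "1914-08-03",
  "location-distance-1"      = "1",
  "service"                  = "all-notices",
  "categorycode-all"         = "all",
  "numberOfLocationSearches" = "1"
)

# Hypothetical helper: join name=value pairs into a query string.
build_gazette_url <- function(params,
    base = "https://www.thegazette.co.uk/all-notices/notice/data.feed") {
  # Spaces in a value become "+", as in the original URL.
  values <- gsub(" ", "+", params, fixed = TRUE)
  query <- paste0(names(params), "=", values, collapse = "&")
  paste0(base, "?", query)
}

search_url <- build_gazette_url(params)
print(search_url)
```

Keeping the parameters in a named vector makes it easy to re-run the same search with a different date range or search text by changing one value.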
Without going on at too much length, the program works through the list of search results ten entries at a time, 194 of them in this case. For each entry, the program downloads the PDF of the Gazette page on which the search result appears. The PDF is converted into text and then parsed. Most of the time, the location of each allotment garden follows the word “SCHEDULE”, and using this we can get a list of all the allotments mentioned and when the notices were published. Here is the list in .csv and on Google docs.
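Two of the steps above can be sketched in a few lines of base R: working out how many pages of results there are, and pulling out the text that follows "SCHEDULE". This is an illustration under stated assumptions, not the program itself. The sample text is invented, not a real Gazette notice, and extract_after_schedule is a hypothetical helper name.

```r
# Paging: 194 results at ten per page.
total_results <- 194
page_size <- 10
n_pages <- ceiling(total_results / page_size)

# Invented sample standing in for the text of one converted PDF page.
page_text <- paste(
  "FOOD PRODUCTION. NOTICE is hereby given that the disease has been found.",
  "SCHEDULE. Allotment gardens at Example Lane, Exampleton.",
  sep = "\n")

# Return the rest of the line after the first "SCHEDULE", or NA if it is absent.
extract_after_schedule <- function(text) {
  # Match the heading plus any punctuation or whitespace right after it.
  m <- regexpr("SCHEDULE[[:punct:][:space:]]*", text)
  if (m == -1) return(NA_character_)
  rest <- substring(text, m + attr(m, "match.length"))
  # Keep only the first line of what follows.
  trimws(strsplit(rest, "\n", fixed = TRUE)[[1]][1])
}

n_pages                            # 20
extract_after_schedule(page_text)  # "Allotment gardens at Example Lane, Exampleton."
```

The match is case-sensitive, so it only picks up the capitalised heading and not the word "schedule" in running text. Notices where the location does not directly follow the heading would still need checking by hand.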
I intend to use natural language processing to extract the location of the allotment as well as the name of the organization that ran it. I think I can use the program for some other extraction as well as composing citations.