Correcting the spelling of OCR errors

I have been reviewing the list of entities extracted from the editions of The Equity and have seen errors I could have corrected in the finding aid.  One of them is that some entities appear in the same list twice. An example comes from this page listing locations below.

Fort Coulongc
Fort Coulonge Williams
Fort Coulonge
Fort CoulongeSt. John’s
Fort Coulongo

<–snip–>

Fort Coulonge St.
Fort Coulonge Tel.
Fort Coulonge
Fort Coulonge – River
Fort Coulonge – St. Andrew

 

Why is this? The first entity for Fort Coulonge has a tab character (ASCII code 9) separating the two words while the second one listed has a space (code 32), as expected. In this finding aid, each entity is meant to be unique to make it easier to reference and so this is a problem. I could correct this in the database with SQL UPDATE statements to merge the information for entities containing tab characters with the entity containing spaces, but it’s also an opportunity to reprocess the data from scratch and make some more corrections.
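As a rough sketch of how the tab and space variants could be collapsed before entities are stored, the snippet below normalises all whitespace to single spaces. The sample vector is made up for illustration and this is not the code used in the finding aid.

```r
# Made-up sample: the first entity contains a tab (code 9), the second a space (code 32).
entities <- c("Fort\tCoulonge", "Fort Coulonge", "Fort Coulonge St.")

utf8ToInt("\t")   # 9
utf8ToInt(" ")    # 32

# The two spellings look the same when printed in a list but are different strings.
entities[1] == entities[2]   # FALSE

# Replace any run of whitespace (tabs, spaces, line breaks) with one space, then trim.
normalised <- trimws(gsub("[[:space:]]+", " ", entities))
uniqueEntities <- unique(normalised)
print(uniqueEntities)
```

Normalising before the uniqueness check means both variants map to the same entity.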

The last time I processed The Equity for entities it took about 2 weeks of around-the-clock processing, counting time when the program repeatedly stopped due to out-of-memory errors. However, with performance improvements, I expect reprocessing will be faster.

I would also like to add some spelling correction for OCR errors. The first spelling correction I tried was the method that comes from Rasmus Bååth’s Research Blog and Peter Norvig’s technique. A large corpus of words from English texts is used as the basis for correcting the spelling of a word. The word to be checked is compared to the words in the corpus and a match based on probability is proposed. My results from this technique did not offer much correction and in fact produced some erroneous corrections. I think this is because my person, location and organization entities often contain names.

I tried the R function which_misspelled(), which did not produce an improvement in spelling correction either. I’ve spent a fair amount of time on this. Is this a failure?

Peter Norvig’s technique is trainable. Adding additional words to Norvig’s corpus used for spell checking seems to give better results. I even got a few useful corrections such as changing Khawville to shawville. To start to train the Norvig spell checker I entered all of the communities in Quebec listed on this Wikipedia category. Then I viewed the output as each term was checked to see if the Norvig spell checker was failing to recognize a correctly spelled word.  An example of this is when the name Beatty gets corrected to beauty. I added the correctly spelled names that Norvig’s method was not picking up to his corpus.

Below is a sample of terms I have added to Norvig’s corpus to improve the spell check results for entities found in The Equity.

Ottawa, Porteous, Hiver, Du, Lafleur, Varian, Ont, Mercier, Duvane, Hanlan, Farrell, Robertson, Toronto, Jones, Alexandria, Chicago, England, London, Manchester, Renfrew, Pontiac, Campbell, Forresters Falls, UK, Cuthbertson, Steele, Gagnon, Fort Coulonge, Beresford, Carswell, Doran, Dodd, Allumette, Nepean, Rochester, Latour, Lacrosse, Mousseau, Tupper, Devine, Carleton, Laval, McGill, Coyne, Hodgins, Purcival, Brockville, Eganville, Rideau, McLean, Hector, Langevin, Cowan, Tilley, Jones, Leduc, McGuire

Below is an example of the output from the spelling correction using Norvig’s method.  As you can see, I need to add terms from lines 4,6,8 and 10 because the Norvig method is returning an incorrect result. Even then, this method may still return an error for a correctly spelled name.  Despite adding “McLean” to the corpus, this method still corrects “McLean” as “clean”.

Data frame showing terms that may have been misspelled in the left column next to the suggested corrections from Norvig's method.
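One possible reason “McLean” still comes back as “clean” is case. The corpus is lowercased when it is read in, but adist() compares case-sensitively by default, so the capitals M and L each count as an edit. The sketch below uses a tiny stand-in corpus (an assumption, not big.txt) and a hypothetical variant of the function that lowercases the word first.

```r
# Tiny stand-in corpus, most common word first (hypothetical ordering).
sorted_words <- c("clean", "beauty", "mclean", "beatty")

# Case-sensitive distances: "McLean" is 2 edits from both "clean" and "mclean",
# so the more common word "clean" is proposed first.
adist("McLean", c("clean", "mclean"))

# Hypothetical variant of correctNorvig() that lowercases the word before comparing.
correctNorvigLower <- function(word, words = sorted_words) {
  edit_dist <- adist(tolower(word), words)
  proposals_by_prob <- c(words[edit_dist <= min(edit_dist, 2)], word)
  proposals_by_prob[1]
}

correctNorvigLower("McLean")   # exact match "mclean" now wins
correctNorvigLower("Beatty")   # exact match "beatty" now wins
```

The result comes back in lowercase, so the original capitalisation would need to be restored separately if it matters.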

 

The full R program is here. Below is the detail of the portion of the program used for spelling correction:

# Read in big.txt, a 6.5 mb collection of different English texts.
raw_text <- paste(readLines("C:/a_orgs/carleton/hist3814/R/graham_fellowship/norvigs-big-plus-placesx2.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))

setwd("C:/a_orgs/carleton/hist3814/R/graham_fellowship")

#Rasmus Bååth's Research Blog
#http://www.sumsar.net/blog/2014/12/peter-norvigs-spell-checker-in-two-lines-of-r/
correctNorvig <- function(word) {
 # Calculate the edit distance between the word and all other words in sorted_words.
 edit_dist <- adist(word, sorted_words)
 # Calculate the minimum edit distance to find a word that exists in big.txt 
 # with a limit of two edits.
 min_edit_dist <- min(edit_dist, 2)
 # Generate a vector with all words with this minimum edit distance.
 # Since sorted_words is ordered from most common to least common, the resulting
 # vector will have the most common / probable match first.
 proposals_by_prob <- c(sorted_words[ edit_dist <= min(edit_dist, 2)])
 # In case proposals_by_prob would be empty we append the word to be corrected...
 proposals_by_prob <- c(proposals_by_prob, word)
 # ... and return the first / most probable word in the vector.
 proposals_by_prob[1]
}

<!--- snip ---- much of the program is removed --->


# correctedEntity is what will be checked for spelling
# nameSpellChecked is the resulting value of the spelling correction, if a correction is found.  

nameSpellChecked=""

correctedEntityWords = strsplit(correctedEntity, " ")
correctedEntityWordsNorvig = strsplit(correctedEntity, " ")

#sometimes which_misspelled() fails and so it is in a tryCatch()
misSpelledWords <- tryCatch(
  {
    which_misspelled(correctedEntity, suggest=TRUE)
  },
  error=function(cond) {
    NULL
  },
  warning=function(cond) {
    NULL
  },
  finally={
    NULL
  })

if(is.null(misSpelledWords)){
  #The R spell checker has not picked up a problem, so no need to do further checking.
  misSpelled=FALSE
} else {
  for(counter in 1:length(misSpelledWords[[1]])){
    misSpelled=TRUE
    wordNum = as.integer(misSpelledWords[[1]][counter])
    correctedEntityWords[[1]][wordNum] = misSpelledWords[counter,3]
    correctedEntityWordsNorvig[[1]][wordNum] = correctNorvig(correctedEntityWordsNorvig[[1]][wordNum])
  }
  correctedEntitySpellChecked = paste(correctedEntityWords[[1]],collapse=" ")
  correctedEntityNorvig = paste(correctedEntityWordsNorvig[[1]],collapse=" ")
  nameSpellChecked=""
  if(!str_to_upper(correctedEntity)==str_to_upper(correctedEntityNorvig)){
    #We have found a suggested correction
    nameSpellChecked=correctedEntityNorvig

    print(paste(correctedEntity,misSpelled,correctedEntitySpellChecked,correctedEntityNorvig,sep=" --- "))

    #keep a vector of the words to make into a dataframe so that we can check the results of the spell check. Remove this after training of the spell checker is done.
    spellCheckOrig<-c(spellCheckOrig,correctedEntity)
    spellCheckMisSpelled<-c(spellCheckMisSpelled,misSpelled)
    spellCheckCorrect<-c(spellCheckCorrect,correctedEntitySpellChecked)
    spellCheckNorvig<-c(spellCheckNorvig,correctedEntityNorvig)

  }
}

#Clean up any symbols that will cause an SQL error when inserted into the database
#Convert typographic apostrophes to plain ones first, then double each apostrophe only once.
#(Running a second gsub on "'" would double the already doubled apostrophes.)
 nameSpellCheckedSql = gsub("’", "'", nameSpellChecked)
 nameSpellCheckedSql = gsub("'", "''", nameSpellCheckedSql)
 nameSpellCheckedSql = gsub("\\", "", nameSpellCheckedSql, fixed=TRUE)

The next step is to finish reprocessing the Equity editions and use the corrected spelling field to improve the results in the “possible related entities” section of each entity listed on the web site for the finding aid.

 

Plotting data using R.

This week I have continued work with the National Library of Wales’ Welsh Newspapers Online. Working with this collection I wanted to see if there was a significant pattern with the number of newspaper stories found in search results for my research on allotment gardening in Wales during World War I.  I used this R program to search Welsh Newspapers Online and store the results in a MySQL database. My previous post here explains how the web page parsing in the program works.

Below is a graph of the number of newspaper stories containing the words “allotment” and “garden” published each month during World War I:

Graph of the number of newspaper stories containing allotment and garden published each month.

The number of newspaper stories in Welsh papers containing allotment and garden rises significantly in 1917 after a poor harvest in 1916 and the establishment of the British Ministry of Food on 22 December, 1916 [1].

Below is the R program used to make the graph.  Initially I had problems graphing the data for each month. If I just used numbers for the months where August 1914 was month 1 and November 1918 was month 52 the graph was harder to interpret.  Using a time series helped, see this line in the program below: qts1 = ts(dbRows$count, frequency = 12, start = c(1914, 8)).

library(RMySQL)
rmysql.settingsfile<-"C:\\ProgramData\\MySQL\\MySQL Server 5.7\\newspaper_search_results.cnf"

rmysql.db<-"newspaper_search_results"
storiesDb<-dbConnect(RMySQL::MySQL(),default.file=rmysql.settingsfile,group=rmysql.db)

searchTermUsed="AllotmentAndGarden"
query<-paste("SELECT (concat('1 ',month(story_date_published),' ',year(story_date_published))) as 'month',count(concat(month(story_date_published),' ',year(story_date_published))) as 'count' from tbl_newspaper_search_results WHERE search_term_used='",searchTermUsed,"' GROUP BY year(story_date_published),month(story_date_published) ORDER BY year(story_date_published),month(story_date_published);",sep="")
print(query)
rs = dbSendQuery(storiesDb,query)
dbRows<-dbFetch(rs)
dbRows$month = as.Date(dbRows$month,"%d %m %Y")
qts1 = ts(dbRows$count, frequency = 12, start = c(1914, 8)) 
plot(qts1, lwd=3,col = "darkgreen", xlab="Month of the war",ylab="Number of newspaper stories", main=paste("Number of stories in Welsh Newspapers matching the search Allotment and Garden",sep=""),sub="For each month of World War I.")

dbDisconnect(storiesDb)
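One assumption worth checking with ts() is that every month is present in the query results. A GROUP BY query returns no row for a month with zero stories, and ts() would then shift every later count one month earlier. The sketch below uses made-up counts (not the real search results) to show one way to fill missing months with zeros before building the series.

```r
# Made-up query result in which October 1914 has no stories and so is missing.
dbRows <- data.frame(
  month = as.Date(c("1914-08-01", "1914-09-01", "1914-11-01")),
  count = c(5, 3, 8)
)

# Build the complete run of months and left-join the counts onto it.
allMonths <- data.frame(month = seq(min(dbRows$month), max(dbRows$month), by = "month"))
filled <- merge(allMonths, dbRows, all.x = TRUE)
filled$count[is.na(filled$count)] <- 0

# Now each value lines up with its calendar month.
qts1 <- ts(filled$count, frequency = 12, start = c(1914, 8))
print(qts1)
```

With the gap filled, the series ends in November 1914 as expected rather than in October.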

It appears that a lot of stories were published about allotment gardening in the last two years of World War I in Wales. Were these stories published in newspapers throughout Wales or only in some areas? To answer this question we need to know the location of each newspaper that published a story and relate that to the stories published in the database.

I referenced a list of all the Welsh newspapers available on-line. Each newspaper also has a page of metadata about it. To gather data, I used an R program to parse the list of newspapers and lookup each newspaper’s metadata.  This program extracted the name of the place the newspaper was published and stored that into a database.

Below is the detail of the geocoding and inserting of values into the database. I removed a tryCatch() handler for the geocode statement for readability.

NewspaperDataPlaceGeoCode= geocode(paste(NewspaperDataPlace,",",NewspaperDataCountry,sep=""))

 NewspaperDataPlaceLat = NewspaperDataPlaceGeoCode[[2]]
 NewspaperDataPlaceLong = NewspaperDataPlaceGeoCode[[1]]
 

query<-paste("INSERT INTO `newspaper_search_results`.`tbl_newspapers`(`newspaper_id`,`newspaper_title`,`newspaper_subtitle`,`newspaper_place`,`newspaper_country`,`newspaper_place_lat`,`newspaper_place_long`) VALUES ('",newspaperNumber,"','",sqlInsertValueClean(NewspaperDataTitle),"',LEFT(RTRIM('",sqlInsertValueClean(NewspaperDataSubTitle),"'),255),'",NewspaperDataPlace,"','",NewspaperDataCountry,"',",NewspaperDataPlaceLat,",",NewspaperDataPlaceLong,");",sep="")
The tbl_newspapers table with the geocoded location of publication.

I used R’s ggmap to plot the locations of the newspapers on a map of Wales.[2] Below, the title, latitude and longitude are selected from tbl_newspapers and then put into the dataframe named df.

query="SELECT `newspaper_title`,`newspaper_place_lat`,`newspaper_place_long` FROM `tbl_newspapers`;"
rs = dbSendQuery(newspapersDb,query)
dbRows<-dbFetch(rs)

df <- data.frame(x=dbRows$newspaper_place_long, y=dbRows$newspaper_place_lat,newspaperTitle=dbRows$newspaper_title)

#opens a map of Wales
mapWales <- get_map(location = c(lon = -4.08292, lat = 52.4153),color = "color",source = "google",maptype = "roadmap",zoom = 8)

ggmap(mapWales, base_layer = ggplot(aes(x = x, y = y, size = 3), data = df)) + geom_point(color="blue", alpha=0.3)
dbDisconnect(newspapersDb)

Ggmap plots the locations of the newspapers in the df dataframe on to the mapWales map:

Sites where Welsh newspapers were published.

The map above shows the locations where each newspaper in the Welsh National Library collection was published. To make this usable with the collection of stories about allotment gardens that were printed during World War I, I will change the program to join the table of stories to the table of newspaper locations and plot only the locations of the newspapers that printed the stories in the collection of search results above.
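A minimal sketch of that join, done in R on in-memory data frames rather than in SQL. The sample rows are made up, and it assumes each stored story carries a newspaper_id that matches tbl_newspapers, which may not be how the search results table is actually laid out.

```r
# Made-up stand-ins for tbl_newspapers and the table of search results.
newspapers <- data.frame(
  newspaper_id = c(1, 2, 3),
  newspaper_title = c("Paper A", "Paper B", "Paper C"),
  newspaper_place_lat = c(52.4, 51.5, 53.2),
  newspaper_place_long = c(-4.1, -3.2, -4.1)
)
stories <- data.frame(story_id = 1:4, newspaper_id = c(1, 1, 3, 3))

# Count the stories per newspaper, then keep only newspapers that printed at least one.
storyCounts <- aggregate(story_id ~ newspaper_id, data = stories, FUN = length)
names(storyCounts)[2] <- "stories"
df <- merge(newspapers, storyCounts, by = "newspaper_id")
print(df)
```

The stories column could then drive the point size in the plot, so that places with more coverage stand out.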

To improve on this, instead of just plotting the publication location, I would like to plot the area the newspaper circulated in.  I plan to see if I can reliably get this information from the newspaper metadata.

Thanks to Jiayi (Jason) Liu for the article Ggmap. See: https://rpubs.com/jiayiliu/ggmap_examples

 

 

 


[1] Records of the Ministry of Food. British National Archives.

[2] D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

 

 

Searching an on-line newspaper without an API.

Last week’s research focused on getting notices from the British Gazette via their API. The notices are searchable and returned as XML, which could be handled easily as a data frame in R.

This week’s focus is a web site that I can’t find an API for, the National Library of Wales’ Welsh Newspapers Online. This site is a tremendous source of news stories and it is easy for a person to search and browse. I have asked the Welsh Library if the web site has an API. In the meantime, this week’s program uses search result webpages to get data. It’s a much uglier and more error-prone process to reliably get data out of a web page, but I expect the benefits of doing that will save me some time in pulling together newspaper articles for research so that I can focus on reading them rather than downloading them. I hope this approach is useful for others. Here is the program.

To search Welsh Newspapers Online using R, a URL can be composed with these statements:

searchDateRangeMin = "1914-08-03"
searchDateRangeMax = "1918-11-20"
searchDateRange = paste("&range%5Bmin%5D=",searchDateRangeMin,"T00%3A00%3A00Z&range%5Bmax%5D=",searchDateRangeMax,"T00%3A00%3A00Z",sep="")
searchBaseURL = "http://newspapers.library.wales/"
searchTerms = paste("search?alt=full_text%3A%22","allotment","%22+","AND","+full_text%3A%22","society","%22+","OR","+full_text%3A%22","societies","%22",sep="")
searchURL = paste(searchBaseURL,searchTerms,searchDateRange,sep="")

These assemble the Search URL:

http://newspapers.library.wales/search?alt=full_text%3A%22allotment%22+AND+full_text%3A%22society%22+OR+full_text%3A%22societies%22&range%5Bmin%5D=1914-08-03T00%3A00%3A00Z&range%5Bmax%5D=1918-11-20T00%3A00%3A00Z

This search generates 1,009 results.  To loop through them in pages of 12 results at a time this loop is used:

for(gatherPagesCounter in 1:(floor(numberResults/12)+1)){
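As a quick check of the page arithmetic, the sketch below compares floor()+1 with ceiling(). The variable resultsPerPage is a name introduced here for illustration. The two agree for 1,009 results, but floor()+1 asks for one page too many whenever the number of results is an exact multiple of 12.

```r
resultsPerPage <- 12

# 1,009 results: both approaches give 85 pages.
pagesFloor <- floor(1009 / resultsPerPage) + 1
pagesCeiling <- ceiling(1009 / resultsPerPage)

# 1,008 results is exactly 84 pages, but floor()+1 gives 85.
pagesFloorExact <- floor(1008 / resultsPerPage) + 1
pagesCeilingExact <- ceiling(1008 / resultsPerPage)

print(c(pagesFloor, pagesCeiling, pagesFloorExact, pagesCeilingExact))
```

Requesting a page past the end is probably harmless here, but ceiling() avoids the extra request.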

How does the program know the number of results returned by the search? It looks through the search results page line by line until it finds a line like:

<input id="fl-decade-0" type="checkbox" class="facet-checkbox" name="decade[]" value="1910" facet />

If you like, take a look at the source of this page and find the line above by searching for 1910.

The line above is unique on the page and does not change regardless of how many search results there are. Following that line we have the one below containing the number of results:

<label for="fl-decade-0"> 1910 - 1919 <span class="facet-count" data-facet="1910">(1,009)</span></label>

Here is the part of the program that searches for the line above and parses the second line to get the numeric value of 1009 we want:

# find number of results
 for (entriesCounter in 1:550){
      if(thepage[entriesCounter] == '<input id=\"fl-decade-0\" type=\"checkbox\" class=\"facet-checkbox\" name=\"decade[]\" value=\"1910\" facet />') {
           print(thepage[entriesCounter+1])
           tmpline = thepage[entriesCounter+1]
           tmpleft = gregexpr(pattern ='"1910',tmpline)
           tmpright = gregexpr(pattern ='</span>',tmpline)
           numberResults = substr(tmpline, tmpleft[[1]]+8, tmpright[[1]]-2)
           numberResults = trimws(gsub(",","",numberResults))
           numberResults = as.numeric(numberResults)
      }
 }

Getting this information returned from an API would be easier to work with, but we can handle this.
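The character offsets used above (+8 and -2) depend on the exact layout of the label line. As a sketch of an alternative, assuming the count always appears in parentheses just before </span>, a regular expression can pull it out. The helper name parseResultCount is made up for illustration.

```r
# Hypothetical helper: extract the number in parentheses that precedes </span>.
parseResultCount <- function(line) {
  count <- sub(".*\\(([0-9,]+)\\)</span>.*", "\\1", line)
  # Remove the thousands separator before converting to a number.
  as.numeric(gsub(",", "", count))
}

labelLine <- '<label for="fl-decade-0"> 1910 - 1919 <span class="facet-count" data-facet="1910">(1,009)</span></label>'
numberResults <- parseResultCount(labelLine)
print(numberResults)
```

If the line does not match the pattern, sub() returns it unchanged and as.numeric() gives NA with a warning, which is at least a visible sign that the page layout has changed.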

For testing purposes, I’m using 3 pages of search results for now. Here is a sample of the logic used in most of the program:

for(gatherPagesCounter in 1:3){
      thepage = readLines(paste(searchURL,"&page=",gatherPagesCounter,sep=""))
      # get rid of the tabs
      thepage = trimws(gsub("\t"," ",thepage))
      for (entriesCounter in 900:length(thepage)){
           if(thepage[entriesCounter] == '<h2 class=\"result-title\">'){
<...snip...>
               entryTitle = trimws(gsub("</a>","",thepage[entriesCounter+2]))

A page number is appended to the search URL noted above…

thepage = readLines(paste(searchURL,"&page=",gatherPagesCounter,sep=""))

…so that we have a URL like below and the program can step through pages 1,2,3…85:

http://newspapers.library.wales/search?alt=full_text%3A%22allotment%22+AND+full_text%3A%22society%22+OR+full_text%3A%22societies%22&range%5Bmin%5D=1914-08-03T00%3A00%3A00Z&range%5Bmax%5D=1918-11-20T00%3A00%3A00Z&page=2

This statement removes tab characters to make each line cleaner for the purposes of looking for the lines we want:

thepage = trimws(gsub("\t"," ",thepage))

The program loops through the lines on the page looking for each line that signifies the start of an article returned from the search: <h2 class="result-title">

for (entriesCounter in 900:length(thepage)){
 if(thepage[entriesCounter] == '<h2 class=\"result-title\">'){

Once we know which line number <h2 class="result-title"> is on, we can get items like the article title that happen to be 2 lines below the line we found:

entryTitle = trimws(gsub("</a>","",thepage[entriesCounter+2]))

This all breaks to smithereens if the web site is redesigned; however, it’s working ok for my purposes here, which are temporary. I hope it works for you. This technique can be adapted to other similar web sites.
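The same line-offset idea can be tried without touching the web site. The sketch below runs on a made-up vector of lines standing in for thepage and uses which() to find every result at once instead of looping. The sample lines are assumptions about the page layout, not copied from the site.

```r
# Made-up stand-in for the lines of one search results page.
thepage <- c(
  '<h2 class="result-title">',
  '<a href="/view/1">',
  'Allotment Society Meeting</a>',
  '<p>other markup</p>',
  '<h2 class="result-title">',
  '<a href="/view/2">',
  'Garden Notes</a>'
)

# Line numbers where each search result starts.
resultStarts <- which(thepage == '<h2 class="result-title">')

# The title sits two lines below each start line, as in the program above.
entryTitles <- trimws(gsub("</a>", "", thepage[resultStarts + 2]))
print(entryTitles)
```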

Download each article:

For each search result returned the program also downloads the text of the linked article. As it does that, the program takes a pause for 5 seconds so as not to put a strain on the web servers at the other end of this.

# wait 5 seconds - don't stress the server
p1 <- proc.time()
Sys.sleep(5)
proc.time() - p1

Results of the program:

  • A comma separated value (.csv) file of each article with the newspaper name, article title, date published, URL and a rough citation.
  • A separate html file with the title and text of each article so that I can read it off-line.
  • An html index to these files for easy reference.

Reading through the articles I can make notes in a spreadsheet listing all of the articles, removing the non-relevant ones and classifying the others. I have applied natural language processing to extract people, location and organization entities from the articles, but I am still evaluating if that provides useful data due to the frequency of errors I’m seeing.

It’s time to let this loose!

The British Gazette, R and potato wart disease.

I heard last week from Dr. Graham that one of the grade 11 Law and Society classes at Pontiac High School is using the finding aid for The Equity. I’m glad to hear it’s being accessed for research and thank the class and its teacher for making use of this.

As noted previously, I plan to work on refining the finding aid, including correcting OCR errors. I had been thinking of using Google, but it’s against their terms of service to submit huge numbers of requests. Fair enough. Google’s director of research Peter Norvig published an article about spelling correction using an off-line method. Here is an article describing how to do this in R. To use this I will need to add in additional words, such as local place names, to Norvig’s spell check corpus big.txt. As promising as this seems to be, I will leave this work for another time.

This week’s research faces a different challenge, warts and all. I have been researching the British Gazette for Dr. Y. Aleksandra Bennett’s HIST 4500 seminar on British Society and the Experience of the First World War. The Gazette is a trove of official announcements. One of my areas of inquiry concerns allotment gardens for food production in World War I. The Gazette contains information about the regulations that governed these food gardens during the Great War. The Gazette also contains announcements about the discovery of potato wart disease (which is caused by a fungus rather than a virus) in separate allotment gardens, and each notice gives the infected location. With 194 of these notices, I believe this is a potentially useful body of data to derive some patterns from. At a minimum I would like to list all of the locations of the allotments in Britain and plot them on a map. Was potato wart a regional or national problem? What was the extent and time-line of the issue? This of course assumes the Gazette is a reliable source for this information.

Getting 194 pages from the Gazette is doable manually, but we can write a program to do that, and then re-use the program for other things.

Using what I learned in HIST 3814, I checked if the Gazette has an API, which it does. In fact there are lots of options to download data in json, XML and some other formats.

I started work on an R program to use the Gazette’s API to search for notices, download them and then parse them for the content I’m looking for. My fail log is here.

I tried to use the json api for the British Gazette but it gave me errors:

> json_file <- "https://www.thegazette.co.uk/all-notices/notice/data.json?end-publish-date=1918-11-11&text=potatoes+wart+schedule&start-publish-date=1914-08-03&location-distance-1=1&service=all-notices&categorycode-all=all&numberOfLocationSearches=1"

> json_data <- fromJSON(file=json_file)

Error in fromJSON(file = json_file) : unexpected character: "

I decided to switch to xml, which has worked fine. Below the program accesses the Gazette and puts the xml into a data frame:

library(XML)
xml_file <- "https://www.thegazette.co.uk/all-notices/notice/data.feed?end-publish-date=1918-11-11&text=potatoes+wart+schedule&start-publish-date=1914-08-03&location-distance-1=1&service=all-notices&categorycode-all=all&numberOfLocationSearches=1"

xmlfile <- xmlTreeParse(readLines(xml_file)[1])

topxml <- xmlRoot(xmlfile)

topxml <- xmlSApply(topxml,function(x) xmlSApply(x, xmlValue))

xml_df <- data.frame(t(topxml), row.names=NULL)

totalPagesReturned<-as.integer(xml_df$total)

As you can see, the URL provides the information to pull the material we want:

https://www.thegazette.co.uk/all-notices/notice/data.feed?end-publish-date=1918-11-11&text=potatoes+wart+schedule&start-publish-date=1914-08-03&location-distance-1=1&service=all-notices&categorycode-all=all&numberOfLocationSearches=1

Without going on at too much length, the program works through a list of search results ten entries at a time. In this case 194 of them. For each entry, the program then downloads the pdf of the Gazette page the search results appear on. The pdf is converted into text and then parsed. Most of the time, the location of each allotment garden follows the word “SCHEDULE” and using this, we can get a list of all the allotments mentioned and when the notices were published. Here is the list in .csv and on Google docs.
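The step of taking whatever follows the word “SCHEDULE” can be sketched as below. The notice text is invented for illustration and is not a real Gazette notice, and the helper name extractSchedule is hypothetical.

```r
# Hypothetical helper: return the text that follows the first "SCHEDULE" in a notice.
extractSchedule <- function(noticeText) {
  pos <- regexpr("SCHEDULE", noticeText, fixed = TRUE)
  if (pos == -1) {
    # Some notices will not contain the word at all.
    return(NA_character_)
  }
  afterSchedule <- substring(noticeText, pos + attr(pos, "match.length"))
  # Drop any punctuation and spaces left over between the heading and the location.
  trimws(sub("^[[:punct:][:space:]]+", "", afterSchedule))
}

# Invented sample text, standing in for the text converted from a pdf page.
sampleNotice <- "The Board hereby declare the premises described below to be infected. SCHEDULE. Allotment gardens at Example Lane, Exampleton."
scheduleText <- extractSchedule(sampleNotice)
print(scheduleText)
```

Notices that do not follow this layout return NA and would need to be checked by hand.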

I intend to use natural language processing to extract the location of the allotment as well as the name of the organization that ran it. I think I can use the program for some other extraction as well as composing citations.

More data, better searching, incremental progress.

This week’s progress is represented by incremental improvement. These improvements are not earth shattering, but are necessary to get the full value out of a resource like this.

Full text index

I added full text indexes called ftname to the entities_people, entities_organizations and entities_locations tables.

This allowed the website to present a list of possible related entities.  For example, using a full text search the database can return possible matches for Carmen Burke.  The people table has an entry for Burke Secrétoire-Trésorière.

Under organizations there is listed:

Municipal OfficeCarmen Burke Secrétaire-trésorière0S29
Municipal OfficeCarmen Burke Secrétaire-trésorièrePRIVATE INSTRUCTION
Municipal OUceCarmen Burke Secrétalre-trésoriére

Each of these entries may return an additional reference about Carmen Burke, although I expect a lot of overlap due to the names of different entities appearing multiple times in an article, yet being stored in the database with different spellings due to OCR errors. Regardless, the feature to look up possible related entities will allow a researcher to make sure more needles are found in the haystack of content.
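For readers who have not used MySQL full text indexes, the statements involved look roughly like the ones assembled below. This is a sketch rather than the statements actually run: the index name ftname and the name column of entities_people come from this blog, but the search mode is an assumption, and the strings are only built and printed here, not sent to a database.

```r
tableName <- "entities_people"

# Statement to add a full text index called ftname on the name column.
createIndexSql <- paste("ALTER TABLE ", tableName,
                        " ADD FULLTEXT INDEX ftname (name);", sep = "")

# Statement to look for possible related entities; apostrophes in the term are doubled.
searchTerm <- "Carmen Burke"
searchSql <- paste("SELECT name FROM ", tableName,
                   " WHERE MATCH(name) AGAINST('", gsub("'", "''", searchTerm),
                   "' IN NATURAL LANGUAGE MODE);", sep = "")

print(createIndexSql)
print(searchSql)
```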

Better search

There is now a small form to search for entities starting with the first 2 letters in the select field.

Characters in names being misinterpreted as HTML

A simple improvement was made to the listing of entity names from the database. Due to OCR errors some characters were represented by less-than brackets (<) and an entity named OOW<i resulted in <i being interpreted as the start of an <i> italic HTML tag, which meant that all the content that followed on the web page was in italics. I didn’t want to tamper with the data itself in order to preserve its integrity so I looked at some options in php to deal with presenting content. The php function htmlspecialchars resulted in a lot of data just not being returned by the function and so empty rows were listed rather than content. Using the following statement

 str_replace("<","&amp;",$row['ent_name'])

was the least harmful way to present data that had a < in it by replacing it with the HTML glyph &amp;.

Accents in content were mishandled

As noted in last week’s blog, the web pages were presenting Carmen Burke’s French language title of Présidente incorrectly, per below:

Carmen Burke Pr�sidente

Luckily, the database had stored the data correctly:

Carmen Burke Présidente

I say luckily because I did not check that the processing program was storing accented characters correctly, and I should have, given that I know the paper has French language content too. Lesson learned.

Setting the character set in the database connection fixed the presentation, per below.

mysql_set_charset('utf8',$link);

‘utf8’ is the character set supporting accented characters I want to use. $link represents the database connection.

Completed processing of Equity editions 2000-2010

Yesterday I downloaded the editions of the Equity I was missing from 2000-2010 using wget.

wget http://collections.banq.qc.ca:8008/jrn03/equity/src/2000/ -A .txt -r --no-parent -nd -w 2 --limit-rate=20k

I also completed processing the 2000-2010 editions in the R program. This ran to completion while I was out at a wedding so this is much faster than it used to be.

I backed up the database on the web and then imported all of the data again from my computer so that the website now has a complete database of entities from 1883-2010.  According to the results, there are 629,036 person, 500,055 organization and 114,520 location entities in the database.

 

 

A web site to browse results and better processing performance.

Database indexes make a difference.

Since my last blog post the R program that is processing The Equity files has been running 24 hours a day and as of Saturday it reached 1983. However, I noticed that the time it took to process each issue was getting longer and it seemed that this is taking far too long in general.

I went back to an earlier idea I had to add more indexes to the tables in the MySQL database of results. I had been reluctant to do this since adding an additional index to a database table can make updating that table take longer due to the increased time to update the additional index. At first I added an index just to the `entities_people` table to have an index on the name column. [KEY `names` (`name`)] Adding this index made no visible difference to the processing time, likely because this table already had an index to keep the name column unique. [UNIQUE KEY `name_UNIQUE` (`name`)]

Then I added indexes to the cross reference tables that relate each of the entities (people, locations, organizations) to the source documents. [KEY `ids` (`id_entities_person`,`id_source_document`)]

After adding these indexes, processing time really sped up. During the short time I have spent writing these paragraphs three months of editions have been processed. It’s no surprise that adding indexes also improved the response time of web pages returning results from the database.

Browse The Equity by topic on this basic web site.

A simple web site is now available to browse by topic and issue. As you can see, thousands of person, location and organization entities have been extracted. Currently only the first 10,000 of them are listed in search results given I don’t want to cause the people hosting this web site any aggravation with long running queries on their server. I plan to improve the searching so that it’s possible to see all of the results but in smaller chunks. I would like to add a full text search, but I am somewhat concerned about that being exploited to harm the web site.

As of today there is a list of issues and for each issue there is a list of the people, organizations and locations that appear in it. All of the issues that an entity’s name appears in can also be listed, such as the Quyon Fair. Do you see any problems in the data as presented? Are there other ways you would like to interact with it? I plan to make the database available for download in the future, once I get it to a more finalized form.

There is a lot of garbage or “diamonds in the rough” here. I think it’s useful to see that in order to show the level of imperfection of what has been culled from scanned text of The Equity, but also to find related information. Take, for example, Carmen Burke:

Carmen Burke President
Carmen Burke Pr�sidente
Carmen Burke Sec TreesLiving
Carmen Burke Secr�ta
Carmen Burke Secr�tair
Carmen Burke Secr�taire
Carmen Burke Secr�taire-tr�sori�re
Carmen Burke Secr�taire-tr�sori�re Campbell
Carmen Burke Secr�taire-tr�sorl�re Campbell
Carmen Burke Secr�taire-Tr�sort�re
Carmen Burke Secr�talre-tr�sori�reWill
Carmen Burke Secr�talre-tr�sori�reX206GENDRON
Carmen Burke Secr�talre-tr�sorl�reOR
Carmen Burke Secretary
Carmen Burke Secretary Treasurer
Carmen Burke Secretary-treasurerLb Premier Jour
Carmen Burke Seter

Cleaning these results is a challenge I continue to think about. A paper I will be looking at in more depth is OCR Post-Processing Error Correction Algorithm Using Google’s Online Spelling Suggestion by Youssef Bassil and Mohammad Alwani.

Happy searching! I hope you find some interesting nuggets of information in these preliminary results. Today the web database has the editions from 1883-1983 and I will be adding more in the coming weeks.

This week’s progress on an improved finding aid for the Shawville Equity.

This week’s progress is represented by the simple list below that shows which issues of the Equity contain information about Alonzo Wright.

Alonzo Wright

83471_1883-10-25
83471_1884-11-27
83471_1885-01-22
83471_1885-07-16
83471_1886-03-18
83471_1894-01-11
83471_1898-09-01
83471_1900-03-29

The list is generated from a database containing all of the entities found in the Equity that I have processed so far. Here is the select statement:

SELECT entities_people.name,
       source_documents.source_document_name,
       source_documents.source_document_base_url,
       source_documents.source_document_file_extension_1
FROM entities_people
LEFT JOIN people_x_sourcedocuments
       ON entities_people.id_entities_person = people_x_sourcedocuments.id_entities_person
LEFT JOIN source_documents
       ON people_x_sourcedocuments.id_source_document = source_documents.id_source_document
WHERE entities_people.name = "Alonzo Wright"
GROUP BY source_documents.source_document_name
ORDER BY entities_people.name
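To see what the joins in this statement do, the three tables can be mimicked with small data frames in R. This is only a sketch: the ids, the second person and the 1960 issue name are made up for illustration, and merge() stands in for the joins.

```r
# Toy versions of the three tables; ids and most rows are invented.
entities_people <- data.frame(
  id_entities_person = c(1, 2),
  name = c("Alonzo Wright", "Olive Hobbs"),
  stringsAsFactors = FALSE
)
source_documents <- data.frame(
  id_source_document = c(10, 11, 12),
  source_document_name = c("83471_1883-10-25", "83471_1884-11-27", "83471_1960-09-08"),
  stringsAsFactors = FALSE
)
people_x_sourcedocuments <- data.frame(
  id_entities_person = c(1, 1, 2),
  id_source_document = c(10, 11, 12)
)

# Join person -> cross reference -> issue, then keep one person.
# merge() does an inner join by default; all.x = TRUE would mimic LEFT JOIN.
joined <- merge(entities_people, people_x_sourcedocuments, by = "id_entities_person")
joined <- merge(joined, source_documents, by = "id_source_document")
issues <- sort(unique(joined$source_document_name[joined$name == "Alonzo Wright"]))
issues
```

The cross reference table is what lets one person appear in many issues and one issue contain many people.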

Currently I am processing issue 981 of the Equity, published on 21 March, 1901, and each issue takes several minutes of processing time. Although I have more than 5000 issues still to go before I meet my processing goal, this is solid progress compared to earlier this week.

Overcoming memory issues.

I had a sticky problem where my R Equity processing program would stop due to memory errors after processing only a few editions. Given I want to process about 6000 editions, it would be beyond tedious to restart the program after each error.

I modified the database and program to store a processing status, so that the program could pick up after the last edition it finished rather than starting at the first edition each time. This was successful, but it wasn’t a fix for the memory errors themselves.
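The idea of a processing status can be sketched with a small data frame standing in for the database table. The column names and the process_edition() placeholder are assumptions for illustration, not the project's actual schema.

```r
# Stand-in for the table of editions; 'processed' records what is finished.
editions <- data.frame(
  name = c("83471_1883-10-25", "83471_1884-11-27", "83471_1885-01-22"),
  processed = c(TRUE, FALSE, FALSE),
  stringsAsFactors = FALSE
)

# Placeholder for the real entity extraction.
process_edition <- function(edition_name) {
  invisible(edition_name)
}

# Only editions not yet marked as processed are visited, so a restart
# continues where the previous run stopped.
visited <- character(0)
for (i in which(!editions$processed)) {
  process_edition(editions$name[i])
  visited <- c(visited, editions$name[i])
  editions$processed[i] <- TRUE  # in the real program this would be an UPDATE
}
visited
```

Marking an edition as processed only after its work is done means an edition interrupted by an error is picked up again on the next run.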

Since I was dealing with this error:

 Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

I tried to reduce the amount of garbage collection (GC) that Java was doing. I removed some of the dbClearResult(rs) statements, on the theory that they were causing the underlying Java to do garbage collection, and this seemed to work better.

Later I got another error message:

     java.lang.OutOfMemoryError: Java heap space

So I upped my memory usage here:

     options(java.parameters = "-Xmx4096m")

Per this article, I tried this:

     options(java.parameters = "-Xms4096m -Xmx4096m")

I still got “java.lang.OutOfMemoryError: GC overhead limit exceeded”.
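One thing worth checking with these settings is their position in the program. As far as I know, rJava reads java.parameters only when the Java virtual machine starts, so the option has to be set before rJava, or a package that loads it such as openNLP, is attached. Here is a sketch of that ordering, with the library calls commented out so that it runs without Java installed:

```r
# Set the heap size first, in a fresh R session...
options(java.parameters = "-Xmx4096m")

# ...and only then load the packages that start the JVM.
# library(rJava)
# library(openNLP)

getOption("java.parameters")
```

If the option is set after the JVM is already running, the larger heap would not take effect until R is restarted.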

I commented out all of the output to HTML files, which was part of the original program. This seemed to improve processing.

#outputFilePeopleHtml <- "equityeditions_people.html"
#outputFilePeopleHtmlCon<-file(outputFilePeopleHtml, open = "w")

These files were large, maybe too large. Also, with the re-starting of the program, they were incomplete because they only had results from the most recent run of the program. To generate output files, I’ll write a new program to extract the data from the database after all the editions are processed.

After all that though, I still had memory errors and my trial and error fixes were providing only marginal improvement if any real improvement at all.

Forcing garbage collection

Per this article, I added this command to force garbage collection in R, which should also release Java objects that R no longer references, and free up memory.

gc()

But I still got out of memory errors. I then changed the program to remove all but essential objects from memory and force garbage collection after each edition is processed:

objectsToKeep<-c("localuserpassword","inputCon", "mydb", "urlLine", "entities", "printAndStoreEntities" ) 
rm(list=setdiff(ls(),objectsToKeep )) 
gc()
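An alternative to listing the objects to keep, sketched here under the assumption that the per-edition work can be wrapped in one function: objects created inside a function are dropped when it returns, so less is left behind to remove by hand. The function body is a placeholder, not the real entity extraction.

```r
# Placeholder for the real work: everything created in here is local.
process_edition <- function(edition_text) {
  words <- strsplit(edition_text, "\\s+")[[1]]  # large temporary object
  length(words)                                  # only this small value is returned
}

edition_texts <- c("first edition text", "second edition with more text")
word_counts <- integer(0)
for (edition_text in edition_texts) {
  word_counts <- c(word_counts, process_edition(edition_text))
  invisible(gc())  # ask R to collect what the function left behind
}
word_counts
```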

After doing this the processing program has been running for several hours now and has not stopped. So far, this has been the best improvement. The processing program has just passed March 1902 and it has stored its 61,906th person entity into the database.

 

 

Continued work on a finding aid for The Equity based on NLP

This week I added a small relational database using MySQL to the project. It stores a record for each issue and a record for each entity found, plus a cross reference table for each type of entity that links the entities to the issues they appear in.

Problems

I have run into memory issues. I am able to process several dozen issues and then get:

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  :   java.lang.OutOfMemoryError: GC overhead limit exceeded

In my R program, I have tried this:

options(java.parameters = "-Xmx2048m")

However, the memory issue resurfaces on subsequent runs. I probably need to clean up the variables I am using better. I would appreciate a diagnosis of the cause of this if you can help.

I have also had SQL errors when trying to insert some data from the OCR’d text containing characters such as apostrophes (‘) and backslashes (\), so I replace those characters with either an HTML entity like &apos; or an empty string:

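# Replace apostrophes with an HTML entity so they cannot end the SQL string early
entitySql = gsub("'", "&apos;", entity)
# Remove newlines
entitySql = gsub("\n", "", entitySql)
# "\'" is the same string as "'", so after the first line this finds nothing
entitySql = gsub("\'", "", entitySql)
# Remove backslashes (with fixed=TRUE, "\\" is one literal backslash)
entitySql = gsub("\\", "", entitySql, fixed=TRUE)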
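Rather than turning apostrophes into &apos;, standard SQL lets a single quote inside a string be written as two single quotes, which keeps the stored name readable. This is a minimal sketch with a hypothetical helper, sql_quote(); it strips backslashes as above and, as an extra assumption, turns tabs and line breaks into single spaces. Where the database driver supports parameterized queries, those would be a sturdier choice.

```r
# Hypothetical helper: return x as a quoted SQL string literal.
sql_quote <- function(x) {
  x <- gsub("\\", "", x, fixed = TRUE)   # drop backslashes
  x <- gsub("[\t\n\r]+", " ", x)          # tabs and line breaks become one space
  x <- gsub("'", "''", x, fixed = TRUE)  # ' becomes '' inside an SQL string
  paste0("'", trimws(x), "'")
}

sql_quote("Walter Yach's")
```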

I also need to check that an entity is not already in what is supposed to be a list of unique entities before I try to insert it, to avoid the error below:

Error in .local(conn, statement, ...) : could not run statement: Duplicate entry 'Household Department ; Health Department ; Young Folks’ Depart' for key 'name_UNIQUE'
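One way to avoid the duplicate entry error is to look the name up before inserting and to reuse the existing id when it is already there. This is a sketch with an in-memory lookup standing in for the entity table; get_or_add_entity() and the id scheme are hypothetical.

```r
entity_ids <- new.env()   # name -> id, standing in for the entity table

get_or_add_entity <- function(name) {
  key <- trimws(gsub("\\s+", " ", name))  # tabs and repeated spaces count as one space
  if (!exists(key, envir = entity_ids, inherits = FALSE)) {
    assign(key, length(ls(entity_ids)) + 1L, envir = entity_ids)  # the "INSERT"
  }
  get(key, envir = entity_ids, inherits = FALSE)
}

get_or_add_entity("Fort Coulonge")
get_or_add_entity("Fort\tCoulonge")   # same entity, no second insert
get_or_add_entity("Quyon Fair")
```

MySQL can also do this on the database side with INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE, which saves a lookup per entity.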

Future plans for this project

This week I plan to continue making adjustments to the program to make it run better and consume less memory. I also want to start generating output from the database so that I can display all of the issues a particular entity (person, location or organization) appears in.

Make the process repeatable

Later, I want to take this program and database and reset it to work on another English language newspaper that has been digitized. I plan to document the set up so that it can be used more easily by other digital historians.

Clean up of text errors from OCR

In the OCR’d text of The Equity there are many misspelled words due to errors from the OCR process. I would like to correct these errors, but there are two challenges to making corrections by redoing the OCR. The first challenge is volume: there are over 6000 issues of The Equity to deal with. The second is that I was not able to achieve a better quality OCR result than what is available on the Province of Quebec’s Bibliotheque and Archives web site. In a previous experiment I carefully re-did the OCR of a PDF of an issue of The Equity. While the text of each column was no longer mixed with other columns, the quality of the resulting words was no better. The cause of this may be that the resolution of the PDF on the website is not high enough to give a better result, and that to truly improve the OCR I would need access to a paper copy to make a higher resolution image of it.

While it seems that improving the quality of the OCR is not practical for me, I would still like to clear up misspellings. One idea is to apply machine learning to see if it is possible to correct the text generated by OCR. The article OCR Error Correction Using Character Correction and Feature-Based Word Classification by Ido Kissos and Nachum Dershowitz looks promising, so I plan to work on this a little later in the project. Perhaps machine learning can pick up the word pattern and correct “Massachusetts Supremo Court” found in the text of one of the issues.
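As a much simpler stand-in for the machine learning approach, the “Supremo” example can be corrected by edit distance against a list of known words. This is not the method from the article; it is a sketch using adist() from base R, with a hypothetical vocabulary and a small distance limit so that names are less likely to be “corrected” into something else.

```r
# Hypothetical list of known-good words; a real one might come from a gazetteer
# or from entity names that occur many times in the database.
vocabulary <- c("Massachusetts", "Supreme", "Superior", "Court")

# Replace a word with its closest vocabulary entry if it is close enough.
correct_word <- function(word, vocabulary, max_distance = 1) {
  distances <- utils::adist(word, vocabulary)[1, ]
  best <- which.min(distances)
  if (distances[best] <= max_distance) vocabulary[best] else word
}

# Apply the correction to each space-separated word of a phrase.
correct_phrase <- function(phrase, vocabulary) {
  words <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  paste(vapply(words, correct_word, character(1), vocabulary = vocabulary),
        collapse = " ")
}

correct_phrase("Massachusetts Supremo Court", vocabulary)
```

Words that are not within one edit of anything in the vocabulary are left alone, which matters for surnames that no word list will contain.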

Making an improved finding aid for The Equity.

It is an honour for me to be named the George Garth Graham Undergraduate Digital History Research Fellow for the fall 2017 semester. I wish to thank Dr. Shawn Graham for running this fellowship. I was challenged and greatly enjoyed learning about DH in the summer of 2017 and I’m excited to continue to work in this area.

I am keeping an open notebook and GitHub repository for this project to improve the finding aid I previously worked on for the Shawville Equity.

I wanted to experiment with natural language processing of text in R and worked with Lincoln Mullen’s lesson on the RPubs website. After preparing my copy of Mullen’s program, I was able to extract people from the 8 September 1960 issue such as:

 [24] "Walter Yach’s" 
[25] "Mrs. Irwin Hayes" 
[26] "D. McMillan"
[27] "Paul Martineau"
[28] "Keith Walsh"
[29] "Ray Baker"
[30] "Joe Russell" 

I was also able to list organizations and places. With places though, there is obvious duplication:

 [90] "Canada"
[91] "Spring"
[92] "Canada"
[93] "Ottawa"
[94] "Ottawa"
[95] "Ottawa"
[96] "Ottawa"  

I want to remove the duplication from these lists and also clean them up. Also, I’d like to catalog when entities from these lists appear in other issues. For example, if I wanted to research a person named Olive Hobbs, I would like to see all of the issues she appears in. Given there are over 6000 editions of the Equity between 1883 and 2010, there are far too many entities and editions to record using static web pages, so I’ll want to use a database to store entities in as well as what issues they appear in. I intend to put the database on the web site so that it can be searched. Also, the database may be a source for further statistical analysis or text encoding.
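Removing the duplication within one issue can be sketched with base R. The vector below copies the places listed above; unique() keeps one of each and table() shows how often each was found.

```r
places <- c("Canada", "Spring", "Canada", "Ottawa", "Ottawa", "Ottawa", "Ottawa")

# One entry per place, with stray leading or trailing spaces removed first.
unique_places <- sort(unique(trimws(places)))
unique_places

# How many times each place was extracted from the issue.
place_counts <- table(places)
place_counts[["Ottawa"]]
```

The counts may be worth keeping as well, since a place mentioned many times in an issue is probably more central to it.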

I will use MySQL since I can run it on my machine locally in order to get faster performance yet also publish it to my web site. Also, the community edition of MySQL is free, which is always appreciated. One thing I am giving attention to is how to make sure the database remains secure when I deploy this to the web.

My progress this week has been to run a program that generates 3 files, one each for people, locations and organizations.  I ran the program on the editions of the Equity in 1970, but I got the error below just as the program started to process the first edition in August 1970.  I will have to tune up the program.

Error in .jnew("opennlp.tools.namefind.TokenNameFinderModel", .jcast(.jnew("java.io.FileInputStream",  : 
  java.lang.OutOfMemoryError: GC overhead limit exceeded

I was also able to store some data into MySQL and will be able to leverage the database more in coming weeks.

Warping Maps – Where is Watership Down?

An exercise I had wanted to accomplish in Dr. Graham’s HIST 3814 was Module 4’s Simple Mapping and Georectifying.  This involves taking a map and using the Harvard World MapWarp website to display the map as a layer above a map of the Earth from today, such as Google Maps.  The website uses specific points that the subject map shares with today’s map and “warps” the subject map to match the scale, projection and location of the map from today. There are all kinds of uses for this.
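The warping step can be sketched as fitting a transform from the shared points. The example below fits an affine transform (shift, scale and rotation) by least squares, which is only one simple kind of warp and may not be what the website uses; the pixel positions and coordinates are invented for illustration.

```r
# Control points: pixel positions on the scanned map (x right, y down) ...
pixels <- rbind(c(100, 100), c(400, 100), c(100, 500), c(400, 500))
# ... and the invented longitude/latitude of the same places on today's map.
# In these made-up points latitude falls as x grows, so north is to the left.
world <- rbind(c(-1.21, 51.39), c(-1.21, 51.36), c(-1.25, 51.39), c(-1.25, 51.36))

# Least-squares fit of world = [x, y, 1] %*% coefficients.
fit_affine <- function(pixels, world) qr.solve(cbind(pixels, 1), world)
# Apply a fitted transform to new pixel positions.
warp <- function(coefficients, pixels) cbind(pixels, 1) %*% coefficients

coefficients <- fit_affine(pixels, world)
warp(coefficients, rbind(c(250, 300)))   # a point between the control points
```

At least three control points that are not in a line are needed for this fit; more points let errors in individual points average out.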

One of them is plotting the locations from Richard Adams’ Watership Down.  Mythgard.org has an excellent page of locations from the book based on the map below. As a young reader, I remember plotting the location of Watership Down on my National Geographic map of Great Britain in pencil. However, I missed a key fact.  In this map, the direction north is left, not up.

Watership Down Book Map
Map from Watership Down. With credit to cartographer Marilyn Hemmett, author Richard Adams and Ed Powell of the Mythgard.org website for posting this.

Despite the orientation of the map, it is easy to select and plot locations from this map from circa 1972 to a map of this part of England today.  While there are new roads, the courses of rivers, pylon lines and railroads remain. Even the old Roman road, Caesar’s Belt, is still visible on today’s map. Here is the resulting map.

Plotting the railway map of the Kingston and Pembroke Railway.

During HIST 2809, I completed an assignment comparing two railway maps from the 1890s. One of them was of the Kingston and Pembroke Railway (K&P), which travelled between Kingston and Renfrew, Ontario.  Much of the area that the K&P traversed is now wilderness and the former railway is now a trail, so plotting the many stops the K&P had in 1899 on a map from today makes for an interesting comparison. Here is the resulting map.