This week’s progress consists of incremental improvements. They are not earth-shattering, but they are necessary to get the full value out of a resource like this.
Full text index
I added full text indexes called ftname to the entities_people, entities_organizations and entities_locations tables.
This allows the website to present a list of possibly related entities. For example, using a full text search, the database can return possible matches for Carmen Burke. The people table has an entry for Burke Secrétoire-Trésorière.
Under organizations there is listed:
Municipal OfficeCarmen Burke Secrétaire-trésorière0S29
Municipal OfficeCarmen Burke Secrétaire-trésorièrePRIVATE INSTRUCTION
Municipal OUceCarmen Burke Secrétalre-trésoriére
Each of these entries may return an additional reference to Carmen Burke, although I expect a lot of overlap: the same entity often appears several times in an article but is stored in the database under different spellings because of OCR errors. Regardless, the ability to look up possibly related entities will help a researcher find more needles in the haystack of content.
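To illustrate why a word-level search finds these variants, here is a minimal sketch in Python. It is not MySQL's full-text algorithm, only a simplified stand-in that matches any shared word; the function names and the unrelated "Shawville Fair" entry are hypothetical, while the other names are copied from the entries above.

```python
import re

# Sample names copied from the entries above, OCR errors included.
names = [
    "Burke Secrétoire-Trésorière",
    "Municipal OfficeCarmen Burke Secrétaire-trésorière0S29",
    "Municipal OfficeCarmen Burke Secrétaire-trésorièrePRIVATE INSTRUCTION",
    "Municipal OUceCarmen Burke Secrétalre-trésoriére",
    "Shawville Fair",  # hypothetical unrelated entry, added for contrast
]


def words(text):
    """Split text into a set of lowercase words."""
    return set(re.findall(r"\w+", text.lower()))


def possible_matches(query, candidates):
    """Return the candidates that share at least one word with the query."""
    wanted = words(query)
    return [name for name in candidates if words(name) & wanted]


matches = possible_matches("Carmen Burke", names)
for name in matches:
    print(name)
```

All four Burke entries match on the word "Burke" even though "Carmen" is fused into "OfficeCarmen" or "OUceCarmen" by the OCR, which is why the result list overlaps so heavily.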
There is now also a small form to search for entities whose names start with the 2 letters chosen in the select field.
Characters in names being misinterpreted as HTML
A simple improvement was made to the listing of entity names from the database. Due to OCR errors, some characters were represented by less-than signs (<). An entity named OOW<i resulted in <i being interpreted as the start of an <i> italic HTML tag, which meant that all the content that followed on the web page was in italics. To preserve the integrity of the data I didn’t want to tamper with the data itself, so I looked at some options in PHP for handling it at presentation time. The PHP function htmlspecialchars returned nothing for a lot of the data, so empty rows were listed rather than content. Replacing each < with the HTML entity &lt; when the name is displayed was the least harmful way to present data that had a < in it.
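The idea of escaping only at display time can be sketched as follows. This is a Python illustration rather than the site's PHP, and the helper name display_name is hypothetical.

```python
def display_name(raw_name):
    """Escape '<' for display only; the stored value is left untouched."""
    return raw_name.replace("<", "&lt;")


stored = "OOW<i"              # entity name as stored in the database
shown = display_name(stored)  # what gets written into the web page
print(shown)
```

Because the replacement happens on the way out to the page, the database keeps the original OCR text and the browser no longer sees the start of a tag.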
Accents in content were mishandled
As noted in last week’s blog, the web pages were presenting Carmen Burke’s French-language title of Présidente incorrectly, as shown below:
Carmen Burke Pr�sidente
Luckily, the database had stored the data correctly:
Carmen Burke Présidente
I say luckily because I did not check that the processing program was storing accented characters correctly, and I should have, given that I know the paper has French-language content too. Lesson learned.
Setting the character set on the database connection fixed the presentation.
‘utf8’ is the character set I want to use because it supports accented characters. $link represents the database connection.
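A likely cause of the garbled title (an assumption, not something verified here) is that the connection delivered text in a single-byte encoding such as Latin-1 while the web page expected UTF-8. The Python sketch below reproduces that mismatch with the same title; it illustrates the symptom only and is not the site's PHP code.

```python
stored = "Carmen Burke Présidente"

# Assumed failure mode: bytes sent as Latin-1 but read as UTF-8.
latin1_bytes = stored.encode("latin-1")
misread = latin1_bytes.decode("utf-8", errors="replace")

# With the same character set on both sides the text survives intact.
utf8_bytes = stored.encode("utf-8")
correct = utf8_bytes.decode("utf-8")

print(misread)
print(correct)
```

The lone Latin-1 byte for é is not valid UTF-8, so it is shown as the replacement character, while the rest of the name is plain ASCII and passes through unchanged.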
Completed processing of Equity editions 2000-2010
Yesterday I used wget to download the editions of the Equity from 2000-2010 that I was missing.
wget http://collections.banq.qc.ca:8008/jrn03/equity/src/2000/ -A .txt -r --no-parent -nd -w 2 --limit-rate=20k
I also completed processing the 2000-2010 editions in the R program. It ran to completion while I was out at a wedding, so processing is much faster than it used to be.
I backed up the database on the web and then imported all of the data again from my computer so that the website now has a complete database of entities from 1883-2010. According to the results, there are 629,036 person, 500,055 organization and 114,520 location entities in the database.