A web site to browse results and better processing performance.

Database indexes make a difference.

Since my last blog post, the R program that processes The Equity files has been running 24 hours a day, and as of Saturday it had reached the editions from 1983. However, I noticed that each issue was taking longer to process than the ones before it, and the whole job seemed to be taking far too long.

I went back to an earlier idea I had: adding more indexes to the tables in the MySQL database of results. I had been reluctant to do this because every additional index on a table must also be updated on each write, which can make updating that table take longer. At first I added an index only on the name column of the `entities_people` table. [KEY `names` (`name`)] Adding this index made no visible difference to the processing time, likely because this table already had a unique index on the name column, which MySQL can also use for lookups. [UNIQUE KEY `name_UNIQUE` (`name`)]

Then I added indexes to the cross reference tables that relate each of the entities (people, locations, organizations) to the source documents. [KEY `ids` (`id_entities_person`,`id_source_document`)]
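To illustrate why a composite index like this can help, here is a minimal sketch in Python that uses an in-memory SQLite database as a stand-in for MySQL. The table name `people_x_sourcedocuments`, the sample rows and the lookup query are assumptions made for illustration; only the index name and the two column names come from the key definition above. It assumes the processing program checks whether a person/document pair already exists, and it compares the query plan before and after the index is created.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for MySQL.
conn = sqlite3.connect(":memory:")

# Hypothetical cross reference table; the column names follow the key above.
conn.execute(
    "CREATE TABLE people_x_sourcedocuments ("
    " id_entities_person INTEGER NOT NULL,"
    " id_source_document INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO people_x_sourcedocuments VALUES (?, ?)",
    [(person, doc) for person in range(200) for doc in range(50)],
)

# Assumed lookup: does this person/document pair already exist?
query = (
    "SELECT 1 FROM people_x_sourcedocuments"
    " WHERE id_entities_person = ? AND id_source_document = ?"
)


def plan(sql):
    """Return the planner's description of how it would run the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, (7, 3)).fetchall()
    return " ".join(row[3] for row in rows)


plan_before = plan(query)  # no index yet: the whole table is scanned
conn.execute(
    "CREATE INDEX ids ON people_x_sourcedocuments"
    " (id_entities_person, id_source_document)"
)
plan_after = plan(query)   # with the index: a direct search on both columns

print(plan_before)
print(plan_after)
```

The exact wording of the plan varies between SQLite versions, but the first plan should describe a scan of the table and the second a search using the `ids` index. On MySQL itself, the comparable check would be to run `EXPLAIN` on the real query before and after adding the key.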

After adding these indexes, processing time really sped up. During the short time I have spent writing these paragraphs, three months of editions have been processed. It's no surprise that adding indexes also improved the response time of web pages returning results from the database.

Browse The Equity by topic on this basic web site.

A simple web site is now available to browse by topic and issue. As you can see, thousands of person, location and organization entities have been extracted. Currently only the first 10,000 of them are listed in search results, because I don’t want to cause the people hosting this web site any aggravation with long-running queries on their server. I plan to improve the searching so that it’s possible to see all of the results but in smaller chunks. I would like to add a full text search, but I am somewhat concerned about that being exploited to harm the web site.
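One possible way to serve all of the results in smaller chunks is keyset pagination, where each request asks for the next batch of names after the last one already shown. The following is only a sketch under assumptions: it uses Python with an in-memory SQLite database rather than the site's actual MySQL setup, and the `id` column, the `fetch_page` helper, the page size and the sample names are all hypothetical. Only the `entities_people` table and its unique `name` column come from the post.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the MySQL results database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entities_people (id INTEGER PRIMARY KEY, name TEXT UNIQUE)"
)
# Made-up sample data, for illustration only.
conn.executemany(
    "INSERT INTO entities_people (name) VALUES (?)",
    [(f"Person {i:04d}",) for i in range(250)],
)


def fetch_page(after_name="", page_size=100):
    """Return up to page_size names that sort after after_name."""
    rows = conn.execute(
        "SELECT name FROM entities_people"
        " WHERE name > ? ORDER BY name LIMIT ?",
        (after_name, page_size),  # bound parameters, never string-pasted input
    ).fetchall()
    return [row[0] for row in rows]


# Walk through every page, remembering the last name seen on each one.
pages = []
last_seen = ""
while True:
    page = fetch_page(last_seen)
    if not page:
        break
    pages.append(page)
    last_seen = page[-1]

print([len(page) for page in pages])
```

Because each query is bounded by `LIMIT` and can use the index on `name`, no single request has to read the whole table. Passing user input as bound parameters, as done here, also keeps it out of the SQL text, which addresses one common way a search box can be abused.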

As of today there is a list of issues, and for each issue there is a list of the people, organizations and locations that appear in it. All of the issues in which an entity’s name appears can also be listed, for example those that mention the Quyon Fair. Do you see any problems in the data as presented? Are there other ways you would like to interact with it? I plan to make the database available for download in the future, once I get it to a more finalized form.

There is a lot of garbage or “diamonds in the rough” here. I think it’s useful to see that in order to show the level of imperfection of what has been culled from scanned text of The Equity, but also to find related information. Take, for example, Carmen Burke:

Carmen Burke President
Carmen Burke Présidente
Carmen Burke Sec TreesLiving
Carmen Burke Secréta
Carmen Burke Secrétair
Carmen Burke Secrétaire
Carmen Burke Secrétaire-trésorière
Carmen Burke Secrétaire-trésorière Campbell
Carmen Burke Secrétaire-trésorlère Campbell
Carmen Burke Secrétaire-Trésortère
Carmen Burke Secrétalre-trésorièreWill
Carmen Burke Secrétalre-trésorièreX206GENDRON
Carmen Burke Secrétalre-trésorlèreOR
Carmen Burke Secretary
Carmen Burke Secretary Treasurer
Carmen Burke Secretary-treasurerLb Premier Jour
Carmen Burke Seter

Cleaning these results is a challenge I continue to think about. A paper I will be looking at in more depth is “OCR Post-Processing Error Correction Algorithm Using Google’s Online Spelling Suggestion” by Youssef Bassil and Mohammad Alwani.
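As a much simpler starting point than the approach in that paper, noisy variants could be matched against a short list of known-good forms using fuzzy string matching. The sketch below is not the paper's method; it uses `difflib` from the Python standard library, and the `canonical` list, the `closest` helper and the 0.8 cutoff are assumptions chosen for illustration.

```python
import difflib

# Hypothetical hand-picked list of clean forms to match against.
canonical = [
    "Carmen Burke President",
    "Carmen Burke Présidente",
    "Carmen Burke Secretary Treasurer",
    "Carmen Burke Secrétaire-trésorière",
]


def closest(name, cutoff=0.8):
    """Return the most similar canonical form, or None if nothing is close."""
    matches = difflib.get_close_matches(name, canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# An OCR variant with two misread letters still maps to the clean form.
print(closest("Carmen Burke Secrétalre-trésorlère"))
# An unrelated entity is left unmatched rather than forced into the list.
print(closest("Quyon Fair"))
```

A cutoff that is too low would merge distinct people or titles, so any threshold would need checking against real entries before being used to rewrite the database.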

Happy searching! I hope you find some interesting nuggets of information in these preliminary results. Today the web database has the editions from 1883-1983, and I will be adding more in the coming weeks.
