Database indexes make a difference.
Since my last blog post, the R program that processes The Equity files has been running 24 hours a day, and as of Saturday it had reached 1983. However, I noticed that the time it took to process each issue was getting longer, and the whole run seemed to be taking far too long.
I went back to an earlier idea: adding more indexes to the tables in the MySQL database of results. I had been reluctant to do this because an additional index on a table can make updating that table slower, since the index has to be updated as well. At first I added an index only to the `entities_people` table, on the name column. [KEY `names` (`name`)] Adding this index made no visible difference to the processing time, most likely because the table already had a unique index on the name column, which could already serve lookups by name. [UNIQUE KEY `name_UNIQUE` (`name`)]
Then I added indexes to the cross-reference tables that relate each type of entity (people, locations, organizations) to the source documents. For the people cross-reference table, for example: [KEY `ids` (`id_entities_person`,`id_source_document`)]
After adding these indexes, processing sped up dramatically. In the short time I have spent writing these paragraphs, three months of editions have been processed. It's no surprise that adding indexes also improved the response time of web pages returning results from the database.
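To see why a composite index on a cross-reference table can matter so much, here is a minimal sketch rather than the actual setup. It assumes SQLite (from Python's standard library) as a stand-in for MySQL, and the table name `people_x_documents` is hypothetical; only the column names and the index name `ids` come from the key definition above. It asks the query planner how it would look up one person and document pair before and after the index exists.

```python
import sqlite3

# Assumption: SQLite (in memory) stands in for MySQL, and the table name
# people_x_documents is hypothetical. Column and index names follow the post.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people_x_documents ("
    " id_entities_person INTEGER NOT NULL,"
    " id_source_document INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO people_x_documents VALUES (?, ?)",
    [(person, doc) for person in range(200) for doc in range(5)],
)


def lookup_plan(connection):
    """Return the planner's description of a lookup by (person, document)."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT 1 FROM people_x_documents "
        "WHERE id_entities_person = ? AND id_source_document = ?",
        (42, 3),
    ).fetchall()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(str(row[-1]) for row in rows)


plan_before = lookup_plan(conn)  # no index yet: the whole table is scanned

conn.execute(
    "CREATE INDEX ids ON people_x_documents "
    "(id_entities_person, id_source_document)"
)
plan_after = lookup_plan(conn)  # the planner can now search the index

print(plan_before)
print(plan_after)
```

Without the index, every lookup reads the whole table, so the cost of each lookup grows as the table grows. That would be consistent with each issue taking longer to process than the one before, although the exact queries the R program runs are not shown here.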
Browse The Equity by topic on this basic web site.
A simple web site is now available to browse by topic and issue. As you can see, thousands of person, location and organization entities have been extracted. Currently only the first 10,000 of them are listed in search results, because I don’t want to cause the people hosting this web site any aggravation with long-running queries on their server. I plan to improve the searching so that it’s possible to see all of the results, but in smaller chunks. I would like to add a full text search, but I am somewhat concerned about that being exploited to harm the web site.
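One way to serve results in smaller chunks is to page through them by name, fetching only the rows that sort after the last name already shown. The sketch below is an illustration under assumptions, not the site's code: it uses SQLite in place of MySQL, a made-up set of 250 names, and a hypothetical function `fetch_page`. Only the `entities_people` table and its `name` column come from the database described above.

```python
import sqlite3


def fetch_page(connection, after_name="", page_size=100):
    """Return up to page_size names that sort after after_name."""
    rows = connection.execute(
        # Bound parameters keep user-supplied text out of the SQL string.
        "SELECT name FROM entities_people "
        "WHERE name > ? ORDER BY name LIMIT ?",
        (after_name, page_size),
    ).fetchall()
    return [row[0] for row in rows]


# Assumption: a tiny in-memory table with invented names, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE entities_people (name TEXT NOT NULL UNIQUE)")
conn.executemany(
    "INSERT INTO entities_people (name) VALUES (?)",
    [(f"Person {i:04d}",) for i in range(250)],
)

first_page = fetch_page(conn)
second_page = fetch_page(conn, after_name=first_page[-1])
third_page = fetch_page(conn, after_name=second_page[-1])

print(len(first_page), len(second_page), len(third_page))
```

Because each page starts from the last name seen and the name column is indexed, every request touches only a small slice of the table, however far into the list the reader has browsed.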
As of today there is a list of issues and for each issue there is a list of the people, organizations and locations that appear in it. All of the issues that an entity’s name appears in can also be listed, such as the Quyon Fair. Do you see any problems in the data as presented? Are there other ways you would like to interact with it? I plan to make the database available for download in the future, once I get it to a more finalized form.
There is a lot of garbage or “diamonds in the rough” here. I think it’s useful to see that in order to show the level of imperfection of what has been culled from scanned text of The Equity, but also to find related information. Take, for example, Carmen Burke:
Carmen Burke President
Carmen Burke Présidente
Carmen Burke Sec TreesLiving
Carmen Burke Secréta
Carmen Burke Secrétair
Carmen Burke Secrétaire
Carmen Burke Secrétaire-trésorière
Carmen Burke Secrétaire-trésorière Campbell
Carmen Burke Secrétaire-trésorlère Campbell
Carmen Burke Secrétaire-Trésortère
Carmen Burke Secrétalre-trésorièreWill
Carmen Burke Secrétalre-trésorièreX206GENDRON
Carmen Burke Secrétalre-trésorlèreOR
Carmen Burke Secretary
Carmen Burke Secretary Treasurer
Carmen Burke Secretary-treasurerLb Premier Jour
Carmen Burke Seter
Cleaning these results is a challenge I continue to think about. A paper I will be looking at in more depth is “OCR Post-Processing Error Correction Algorithm Using Google’s Online Spelling Suggestion” by Youssef Bassil and Mohammad Alwani.
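As a first experiment before anything more sophisticated, fuzzy string matching can fold many of these variants together. This sketch is not the approach in the Bassil and Alwani paper; it swaps in `difflib` from Python's standard library. The list `KNOWN_TITLES`, the function `closest_title` and the 0.6 cutoff are all assumptions made for illustration.

```python
import difflib

# Assumption: a short, hand-written list of titles treated as correct.
KNOWN_TITLES = [
    "President",
    "Présidente",
    "Secretary",
    "Secretary Treasurer",
    "Secrétaire",
    "Secrétaire-trésorière",
]


def closest_title(raw_entity, person_name, cutoff=0.6):
    """Map the text after a person's name to the closest known title, if any."""
    if not raw_entity.startswith(person_name):
        return None
    noisy_title = raw_entity[len(person_name):].strip()
    matches = difflib.get_close_matches(
        noisy_title, KNOWN_TITLES, n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


for entity in [
    "Carmen Burke Secretary",
    "Carmen Burke Secrétalre-trésorlère",
    "Carmen Burke Secrétalre-trésorièreX206GENDRON",
]:
    print(entity, "->", closest_title(entity, "Carmen Burke"))
```

A higher cutoff merges fewer variants but makes fewer wrong merges, so the value would need tuning against real entries, and short fragments such as “Seter” may not match anything.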
Happy searching! I hope you find some interesting nuggets of information in these preliminary results. Today the web database has the editions from 1883 to 1983, and I will be adding more in the coming weeks.