Finding aid for the Shawville Equity.

The product of this exercise is a finding aid for the Shawville Equity, pointing to the editions that cover recurring events such as elections. A second finding aid, for the editions from 1883 to 1999, lists the most frequently appearing uncommon words in each edition in order to allow a distant reading.

Objectives:

  • Make it easier to find the editions of the Shawville Equity that cover significant events such as elections.
  • Allow for distant reading of each edition.

Method: 

1. First finding aid program in R:

Wrote a program in R to generate the finding aid.

This program performs the following steps:

  • Stores key dates in lists. (Lines 39-52.)
  • Reads a file of URLs, one for each edition of the Equity.
  • Calculates the date of each edition by parsing its URL. Example: http://collections.banq.qc.ca:8008/jrn03/equity/src/1883/06/07/ (Line 83.)
  • The function editionWithinWeekOfDate (lines 7-33) checks whether the date of an edition falls within the 7 days after a significant date. (Line 18.)
  • The function also copies the text file containing the edition to a separate directory, building a corpus for analysis.
  • Writes the results to an .html file and a .csv file. (A sketch of this logic appears after this list.)
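The program itself is not reproduced here, but a minimal sketch of the logic above might look like the following. Only the function name editionWithinWeekOfDate and the URL format come from the program; the file names, directory layout, and the sample election dates are illustrative assumptions.

  # Significant dates to match against; these two dates are
  # placeholders for the lists at lines 39-52.
  significantDates <- as.Date(c("1886-10-14", "1890-06-17"))

  # TRUE if an edition's date falls within the 7 days after any
  # significant date.
  editionWithinWeekOfDate <- function(editionDate, keyDates) {
    any(editionDate >= keyDates & editionDate <= keyDates + 7, na.rm = TRUE)
  }

  # Read one URL per line, e.g.
  # http://collections.banq.qc.ca:8008/jrn03/equity/src/1883/06/07/
  urls <- readLines("equity-urls.txt")

  # Parse the yyyy/mm/dd portion of each URL into a Date.
  editionDates <- as.Date(sub(".*/src/([0-9]{4}/[0-9]{2}/[0-9]{2})/?$", "\\1", urls),
                          format = "%Y/%m/%d")

  # Keep the editions that follow a significant date.
  matches <- sapply(editionDates, editionWithinWeekOfDate,
                    keyDates = significantDates)
  results <- data.frame(url = urls[matches], date = editionDates[matches])

  # Copy the matching editions' text files (assumed to be saved as
  # editions/yyyy-mm-dd.txt) into a corpus directory for analysis.
  dir.create("corpus", showWarnings = FALSE)
  file.copy(paste0("editions/", results$date, ".txt"), "corpus")

  # Write the finding aid as .csv; the .html table could be produced
  # in the same way, e.g. with knitr::kable().
  write.csv(results, "finding-aid.csv", row.names = FALSE)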

2. Second finding aid program in R:

Added a second program based on the work of Taylor Arnold and Lauren Tilton. [1]

In addition to the steps above, this program:

  1. Loads a list of the 1000 most common words in English, as per Peter Norvig's website. [2] (Lines 44-58.)
  2. For each edition of The Equity, reads the edition's text file and creates a list of all the words in it. (Lines 150-155.)
  3. Makes a data_frame of all the words and their frequency of appearance in the edition. (Lines 158-160.)
  4. Removes the 1000 most common English words and their variants, such as plural forms. (Line 163.)
  5. Removes words of four characters or less, since most of these short but uncommon words are errors from bad OCR. (Line 165.)
  6. Keeps only the words that appear 10 or more times, on the assumption that these words are significant in describing the content of that edition of The Equity. (Line 167.) (A sketch of these steps appears after this list.)
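As above, a minimal sketch of these steps in the style of the Arnold and Tilton lesson might look like the following. The file names and the format of the common-word list (one word per line) are assumptions, and the plural handling shown here is a crude stand-in for the program's variant handling; the lesson's data_frame() is written as its modern equivalent, tibble().

  library(tokenizers)
  library(dplyr)
  library(tibble)

  # Load the 1000 most common English words (step 1), assumed here to
  # be saved one word per line, and add simple plural variants.
  commonWords <- readLines("common-words-1000.txt")
  commonWords <- unique(c(commonWords, paste0(commonWords, "s")))

  # Read the text of one edition and split it into words (step 2).
  editionText <- paste(readLines("corpus/1886-10-21.txt"), collapse = " ")
  words <- tokenize_words(editionText)[[1]]

  # Tabulate word frequencies (step 3).
  wordCounts <- tibble(word = words) %>%
    count(word, sort = TRUE)

  # Drop common words and their variants (step 4), drop words of four
  # characters or less as likely OCR errors (step 5), and keep words
  # appearing 10 or more times (step 6).
  keyWords <- wordCounts %>%
    filter(!word %in% commonWords,
           nchar(word) > 4,
           n >= 10)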

Results

1. A list, in .html and .csv formats, of the editions that coincide with election dates.

2. A second list of editions, in .html and .csv formats, that also gives the frequently appearing but relatively uncommon words in each edition, in order to allow a distant reading of that edition.

Future Steps

  • Add more significant dates.
  • Refine text analysis to generate an improved topic list from each edition. 
  • Add sentiment analysis to connect key events to sentiment expressed in The Equity.

Citations

[1] Arnold, Taylor, and Lauren Tilton. "Basic Text Processing in R." [online] The Programming Historian. Available at: https://programminghistorian.org/lessons/basic-text-processing-in-r [Accessed 20 August 2017].

[2] Norvig, Peter. "Natural Language Corpus Data: Beautiful Data." [online] Peter@Norvig.com. Available at: http://norvig.com/ngrams/ [Accessed 20 August 2017].