Finding Data, Experience with tools in module 2 of HIST 3814o.

Exercise 1, Using databases.

It was excellent to see how well organized, detailed, and searchable a history database could be with the Epigraphic Database Heidelberg. Data can also be downloaded as JSON; a sketch of what that download looks like is below. Among many other things, the database would be a great reference for locating Roman inscriptions when traveling.
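This is only a sketch of such a download: the endpoint and parameter names here are my assumptions about the EDH open-data API, not something confirmed in the exercise.

# hypothetical sketch: ask the EDH API for inscriptions from one province
# and save the JSON response (endpoint and parameters are assumptions)
$ curl 'https://edh-www.adw.uni-heidelberg.de/data/api/inschrift/suche?provinz=Bel' > edh-sample.json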

The Commonwealth War Graves Commission (CWGC) website locates the graves of soldiers who served in the Commonwealth forces during the First and Second World Wars. It can also locate cemeteries, and I looked up Camp Hill Cemetery in Halifax, NS.

In World War I my great-grandfather Samuel Hood served with the No. 2 Construction Battalion, also known as the Black Battalion. Below is an excerpt from the Canadian Encyclopedia about the burial of veterans of No. 2 Construction at Camp Hill Cemetery:

Many veterans of the Black Battalion were buried in Camp Hill Cemetery in Halifax. Each grave was marked by a flat, white stone, forcing visitors to crouch down and grope the grass to find loved ones. In 1997–98, Senator Ruck successfully lobbied the Department of Veterans Affairs, and each soldier received a proper headstone and inscription in 1999.

The spreadsheet for Camp Hill Cemetery that I downloaded from the CWGC did not have names from No. 2 Construction. That may be because members of No. 2 Construction who died during World War I were not buried at Camp Hill, while veterans who died later were.

I searched for the unit in the CWGC search engine and could not find any reference to it. Is it possible that no soldiers from No. 2 Construction died in World War I? No; for example, the battalion’s war diary mentions that on September 23, 1918, 931410 Pte. Some, C. was found dead and was likely murdered.

Searching the CWGC’s database using Private Some’s service number brings up his record, but his regiment is listed as Canadian Railway Troops. According to oral history, No. 2 Construction sometimes built railways, but its members were not Railway Troops: the unit was attached to the Canadian Forestry Corps, and some of its other wartime work was logging and producing lumber.

I began to conclude that the war dead of No. 2 Construction were misclassified in the database. By trial and error I learned I had to type the unit name into the search box as “2nd Construction Bn.” or “2nd Construction Coy.” Those searches returned the war dead; however, the regiments listed for them differed. All this to conclude that even in an impressively organized database, problems with how the data was originally recorded may persist.

Exercise 2, Wget.

Using wget will be essential for our final project; I wish I had known about it before for other work.
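As a sketch of the kind of command involved (the flags here are my illustration, not the exercise’s exact command), a polite recursive download looks something like this:

# recursively fetch pages without climbing above the starting directory,
# waiting 2 seconds between requests and limiting the download rate
$ wget -r --no-parent -w 2 --limit-rate=20k https://example.org/collection/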

Exercise 3, Close Reading with TEI

Doing this exercise made me wonder about the benefits websites could gain from allowing keen volunteers who care about quality to transcribe their text (with the aid of OCR). This exercise was laborious, and to save time I made a .jpeg of each column of text, saved that file, and uploaded it to freeocr.com.

I ran one image through freeocr.com and got some text and some unreadable characters. Even so, it was faster to reuse the good text than to type it. I converted the second image to plain black and white, which reduced some of the background smudging that was preventing good character recognition. I OCR’d that, and the results, albeit for a different column of text, seemed better.
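The same conversion can be done from the command line with ImageMagick; this is a sketch (the file names and threshold value are my assumptions):

# grayscale the image, then apply a hard threshold to strip background smudging
$ convert column2.jpeg -colorspace Gray -threshold 60% column2-bw.jpeg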

After correcting some syntax issues in the XML, I like how precise it is; it was satisfying to see the text render with different colors for the additional information. I did not finish marking up the people in this exercise, given I had to move on.

Exercise 4, APIs.

I edited Dr. Ian Milligan’s excellent program to retrieve data from Canadiana.ca.

I ran this and, seeing that I was getting a lot of files, canceled the run. I did not want to pollute DHBox (my output.txt was 340 MB), so I refined my search using Shawville as the city. I kept getting huge downloads, regardless of the time period. I also tried a couple of other towns, for various time periods. “Why am I getting all of this content for New Germany, Nova Scotia during 1919–1939?” is close to what I thought.

I executed:

curl 'http://search.canadiana.ca/search/'${i}'?q=montenegr*&field=&so=score&df=1914&dt=1918&fmt=json' | jq '.docs[] | {key}' >> results.txt
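For context, the ${i} comes from a loop in the surrounding script; this sketch of the structure is my assumption (the page range and variable name are illustrative):

# loop over the first ten result pages; each pass appends the matching
# document keys to results.txt
for i in {1..10}
do
    curl 'http://search.canadiana.ca/search/'${i}'?q=montenegr*&field=&so=score&df=1914&dt=1918&fmt=json' | jq '.docs[] | {key}' >> results.txt
done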

When I looked at results.txt, it had 60 lines. I ran the curl command again; results.txt now had 90 lines. Again: 120 lines. The >> in the command was appending my various searches and mixing results.

I got better results when I deleted the working files before running $ ./canadiana.sh. I added these commands to the top of the retrieval program:

# to clean up from the previous session (-f keeps rm quiet if a file is missing)
rm -f results.txt
rm -f cleanlist.txt
rm -f urlstograb.txt

This is uploaded to:

https://github.com/jeffblackadar/module2e4

Exercise 5, Twitter Archiving

It is very exciting to retrieve data from Twitter using a program. Thanks to my HIST3814o classmate @sarahmcole, who posted helpful content about how to install twarc on DHBox.
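For reference, the installation itself boils down to pip plus twarc’s one-time configuration step, where you enter your Twitter API keys:

$ pip install twarc
$ twarc configure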

I thought I would start with something small by searching for tweets related to our class:

$ twarc search hist3814 > search.json

This was a fail: search.json had 0 results.

Manually searching Twitter for HIST3814 showed one result from less than two weeks ago, on July 11; I was expecting to see at least that one.

I got many more results with this:

$ twarc search canada150 > search.json

I installed json2csv in order to convert search.json into a .csv I could view as a spreadsheet. Thanks to a post by @sarahmcole in our course Slack channel, I expected to get an error:

module.js:433
    throw err;
    ^
SyntaxError: /home/jeffblackadar/search.json: Unexpected token {

I used @sarahmcole’s excellent program and fail log to adapt the json that json2csv was having errors with into a form json2csv could use. Here is a modified version of the program:

# code inspired by https://stackoverflow.com/questions/4746190/find-and-replace-within-a-text-file-using-python
# Thanks to @sarahmcole who wrote this
# Wraps twarc's line-delimited JSON (one tweet per line) into a single
# JSON object {"tweet1": {...}, "tweet2": {...}} so json2csv can read it.
n = 1
f1 = open("search.json", "r")
f2 = open("output.json", "w")
f2.write("{")
for line in f1:
    addComma = ","
    if n == 1:
        addComma = ""  # no comma before the first entry
    line = addComma + '"tweet' + str(n) + '":' + line
    n = n + 1
    f2.write(line)
f2.write("}")
f1.close()
f2.close()

Given that the .json files related to Canada150 were really big and thus hard to troubleshoot, I used a smaller search:

$ twarc search lunenburg > search.json

I ran the program and then checked the json formatting at https://jsonformatter.curiousconcept.com/; it was “correct”.
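The same check can be done locally; python -m json.tool exits with an error if the file is not valid JSON:

# reports "valid" only when output.json parses cleanly
$ python -m json.tool output.json > /dev/null && echo "valid"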

I then ran:

 $ json2csv -i output.json -f name,location,text

It ran, but I didn’t get the multiple rows of values I expected. I got:

 "name","location","text"  ,,

To spare you, the reader, the full blow-by-blow, here is my conclusion for this exercise:

Using a program, I can format a json file so that json2csv parses it without errors, but I ended up with a file that had only 2 rows. The first row was tweet1, tweet2, tweet3, etc. The second row consisted of each tweet in full, unparsed json format.

Using a program, I can produce a json file that looks correctly formatted, but json2csv has errors when parsing it.

Dr. Graham has posted that json2csv may have a new implementation and is not performing as expected.

A bit more with Twitter json

The json for each tweet has a complicated hierarchy. I took one line, put it into a json viewer, and there I could see how it was organized. There are a number of json-to-csv conversion websites and tools on the internet, but several of them look pretty shady.
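The same inspection works in the terminal with jq, which pretty-prints one tweet so the hierarchy is visible (a sketch of the equivalent of what I did in the web viewer):

# pretty-print the first tweet in the archive
$ head -n 1 search.json | jq '.'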

Reading through for solutions to my problems with json2csv, I ended up trying jq to make a .csv. I chose to use the tilde (~) as my delimiter, given that the text of tweets often contains commas.

In jq, each json entity starts with a . and some entities (like name) are sub-entities: .user.name.
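For example, this pulls just the author names out of the whole archive (an illustration, not part of my original session):

# one name per line, one tweet per input line
$ jq '.user.name' search.json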

Running this:

jq '.user.name+"~"+.user.location+"~"+.lang+"~"+.text' search.json > outjq.csv

I was able to produce a hacked-together file delimited with ~ characters:

https://github.com/jeffblackadar/m2e5/blob/master/outjq.csv
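A more robust variant of the command (my assumption, not what I actually ran) guards against null fields with // "" and uses -r so each output line is raw text rather than a JSON string:

# raw output (-r), with empty strings substituted for any missing fields
$ jq -r '(.user.name // "") + "~" + (.user.location // "") + "~" + (.lang // "") + "~" + (.text // "")' search.json > outjq.csv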

Thanks to @dr.graham for the jq tutorial here.

Exercise 6, Using Tesseract OCR

All of the software installed and ran as expected, but the OCR results I saw after I ran $ tesseract file.tiff output.txt were not usable (note that Tesseract treats its second argument as an output base name and appends .txt itself). I am not sure why the results were poor, but I suspect the darkness of the newspaper images as the cause. I tried to edit and upload some clearer images to test Tesseract with; however, DHBox kept giving “0 null” errors which would stop the upload. I was able to upload another image file of text from exercise 3 that was 230 KB in size, and I was able to use Tesseract to recognize text from it.
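If darkness really was the problem, the same pre-processing that helped in exercise 3 might help here; this is a sketch (file names and the threshold value are assumptions):

# binarize the scan first, then OCR it; tesseract writes output.txt
$ convert file.tiff -colorspace Gray -threshold 50% file-bw.tiff
$ tesseract file-bw.tiff output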
