Image Classification – First Results

An objective of this research is to demonstrate a means to automatically classify an image of an artifact using computer vision. I am using a method and code from Dattaraj Rao’s book Keras to Kubernetes: The Journey of a Machine Learning Model to Production. In Chapter 5 he demonstrates the use of machine learning to classify logo images of Pepsi and Coca Cola. I have used his code in an attempt to classify coin images of George VI and Elizabeth II.

Code for this is here:

The images I am using are here.

Below are my initial results; the prediction is shown below the image.

[[1.]] Prediction for /content/drive/My Drive/coin-image-processor/photos/george_vi/gvi3330.png: george_vi
[[0.]] Prediction for /content/drive/My Drive/coin-image-processor/photos/elizabeth_young/eII2903.png: elizabeth_young

…So far so good…

[[0.]] Prediction for /content/george_test_1.jpg: elizabeth_young

[1 – footnote]

As noted above, this prediction failed.

I am not sure why yet, but here is my experience so far. On my first run through, the prediction failed for the first image of George VI too. I got the correct result when I used a larger image size for training and validation.

train_generator = train_datagen.flow_from_directory(
        training_dir,
        target_size=(300, 300),
        batch_size=gen_batch_size,
        class_mode='binary')

(above) The original code uses an image size of 150 x 150 so I doubled it in each line of the program where that size is used. I may need to use a larger size than 300 x 300.

The colours of my coin images are somewhat uniform, while Rao’s example uses Coke’s red and white logo versus Pepsi’s logo with blue in it. Does color play a more significant role in image classification using Keras than I thought? I will look at what is happening during model training to see if I can address this issue.
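One way to test whether colour is doing the work would be to train on grayscale copies of the same images; Keras's flow_from_directory can also load images with color_mode="grayscale", or the conversion can be done directly. A minimal sketch of the direct conversion (the array here is a hypothetical stand-in for a 300 x 300 coin photograph):

```python
import numpy as np

# Collapse an RGB image to a single luminance channel using the standard
# Rec. 601 weights; an all-ones "image" stays at value 1.0.
def to_grayscale(img):
    return np.dot(img[..., :3], [0.299, 0.587, 0.114])

rgb = np.ones((300, 300, 3))   # hypothetical stand-in for a coin photograph
gray = to_grayscale(rgb)       # shape (300, 300): the channel axis is gone
```

If the model's accuracy drops sharply on grayscale inputs, that would suggest it was leaning on colour rather than on the shape of the portrait.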

Data Augmentation

I have a small number of coin images, yet effective training of an image recognition model requires many different images. Rao uses data augmentation to expand a small set of images into a larger training set by distorting them. This is particularly useful when training a model to recognize images taken by cameras from different angles, as happens in outdoor photography. A portion of Rao’s code is below. Given that the coin images I am using are photographed from directly above, I have reduced the levels of distortion (shear, zoom, width shift and height shift).

# Keras to Kubernetes: The Journey of a Machine Learning Model to Production
# Dattaraj Jagdish Rao
# Pages 152-153

from keras.preprocessing.image import ImageDataGenerator
import matplotlib.pyplot as plt
%matplotlib inline
training_dir = "/content/drive/My Drive/coin-image-processor/portraits/train"
validation_dir = "/content/drive/My Drive/coin-image-processor/portraits/validation"
gen_batch_size = 1
# Augmentation trains the model for images taken at different angles. Pictures
# of coins are taken from directly above, so there is little variation and the
# distortion values (illustrative here) are kept low.
train_datagen = ImageDataGenerator(
    rescale=1./255, shear_range=0.05, zoom_range=0.05,
    width_shift_range=0.05, height_shift_range=0.05,
    fill_mode="nearest")
train_generator = train_datagen.flow_from_directory(
    training_dir, target_size=(300, 300),
    batch_size=gen_batch_size, class_mode='binary')
class_names = ['elizabeth_young','george_vi']
print("generating images")
ROW = 10
for i in range(ROW*ROW):
    next_set = train_generator.next()   # one batch of augmented images
    plt.subplot(ROW, ROW, i+1)
    plt.imshow(next_set[0][0])
    plt.axis('off')
Sample of images produced from data augmentation.

My next steps to improve these results are to examine what is happening as the models are trained, to train them for longer, and to use larger image sizes.


Rao, Dattaraj. Keras to Kubernetes: The Journey of a Machine Learning Model to Production. 2019.

[1 – footnote] This test image is from a Google search. The original image is from:

Image inpainting – first results.

First results of image inpainting using Mathias Gruber’s PConv-Keras. (I took a shortcut on training the model for this.)

I have a few dozen pictures of Elizabeth II from Canadian 1 cent pieces. I want to see if I can train a model that can in-paint a partial image. Haiyan Wang, Zhongshi He, Dingding Chen, Yongwen Huang, and Yiman He have written an excellent study of this technique in their article “Virtual Inpainting for Dazu Rock Carvings Based on a Sample Dataset” in the Journal on Computing and Cultural Heritage. [1]

I am using Mathias Gruber’s PConv-Keras repository on GitHub to do image inpainting. He has impressive results; as a caveat for mine, I am not yet training the inpainting model nearly as long as Gruber does. I am using Google Colab, which is not meant for long-running processes, so I am using a small number of steps and epochs to train the model. Even with this constraint I am seeing promising results.

The steps used to set up Mathias Gruber’s PConv-Keras in Google Colab are here. Thanks to Eduardo Rosas for the instructions that let me get this set up.

Using Gruber’s PConv-Keras I have been able to train a model to perform image inpainting. My next steps are to refine the model, train it more deeply and look for improved results. The code and results I am working on are on my Google Drive at this time. The images I am using are here.

This week I also improved the program that processes coin images. I see improved results by allowing a higher tolerance for impurities in the background of the picture when finding whitespace (I use white_mean = 250, not 254 or 255). This version is in GitHub.

1 Wang, Haiyan, Zhongshi He, Dingding Chen, Yongwen Huang, and Yiman He. 2019. “Virtual Inpainting for Dazu Rock Carvings Based on a Sample Dataset.” Journal on Computing and Cultural Heritage 12 (3): 1-17.

Automatically cropping images

As mentioned in previous posts, I need numerous images to train an image recognition model. My goal is to have many examples of the image of Elizabeth II like the one below. To be efficient, I want to process many photographs of 1 cent coins with a program, so that program must be able to reliably find the centre of the portrait.

To crop the image I used two methods: 1. remove whitespace from the outside inward and 2. find the edge of the coin using OpenCV’s cv2.HoughCircles function.

Removing whitespace from the outside inward is the simpler of the two methods. To do this I assume the edges of the image are white, color 255 in a grayscale image. If the mean value of pixel colors is 255 for a whole column of pixels, that whole column can be considered whitespace. If the mean value of pixel colors is lower than 255 I assume the column contains part of the darker coin. Cropping the image from the x value of this column will crop the whitespace from the left edge of the image.

for img_col_left in range(1,round(gray_img.shape[1]/2)):
    if np.mean(gray_img,axis = 0)[img_col_left] < 254:
        break

The for loop of this code starts from the first pixel and moves toward the center. If the mean of the column is less than 254 the loop stops since the edge of the coin is found. I am using 254 instead of 255 to allow for some specks of dust or other imperfections in the white background. Using a for loop is not efficient and this code should be improved, but I want to get this working first.
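As a sketch of one possible improvement (with an illustrative synthetic image standing in for gray_img), the column means can be computed in a single vectorized step and np.argmax used to find the first non-white column:

```python
import numpy as np

# A synthetic white image with a dark band standing in for the coin
gray_img = np.full((100, 100), 255, dtype=np.uint8)
gray_img[:, 40:60] = 0

# Compute every column mean at once; np.argmax returns the index of the
# first True value, i.e. the first column whose mean falls below 254.
col_means = np.mean(gray_img, axis=0)
img_col_left = int(np.argmax(col_means < 254))
```

Here img_col_left comes out as 40, the first column touched by the dark band. (np.argmax returns 0 when no column qualifies, so an all-white image would need a separate check.)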

Before the background is cropped, the image is converted to black and white and then grayscale in order to simplify the edges. Here is the procedure at this point.

import numpy as np
import cv2
import time
from google.colab.patches import cv2_imshow

def img_remove_whitespace(imgo):
    print("start " + str(time.time()))
    # convert to black and white to simplify the edges
    # define a threshold; 128 is the middle of black and white in grayscale
    thresh = 128
    img_binary = cv2.threshold(imgo, thresh, 255, cv2.THRESH_BINARY)[1]
    gray_img = cv2.cvtColor(img_binary, cv2.COLOR_BGR2GRAY)

    # Thanks
    # croppedImage = img[startRow:endRow, startCol:endCol]
    # allow for a mean of 254 (slightly less than totally white) to permit some specks
    # count in from each edge until the mean of a row or column is less than 254
    for img_row_top in range(0,round(gray_img.shape[0]/2)):
        if np.mean(gray_img,axis = 1)[img_row_top] < 254:
            break
    for img_row_bottom in range(gray_img.shape[0]-1,round(gray_img.shape[0]/2),-1):
        if np.mean(gray_img,axis = 1)[img_row_bottom] < 254:
            break
    for img_col_left in range(1,round(gray_img.shape[1]/2)):
        if np.mean(gray_img,axis = 0)[img_col_left] < 254:
            break
    for img_col_right in range(gray_img.shape[1]-1,round(gray_img.shape[1]/2),-1):
        if np.mean(gray_img,axis = 0)[img_col_right] < 254:
            break
    imgo_cropped = imgo[img_row_top:img_row_bottom,img_col_left:img_col_right,0:3]
    print("Whitespace removal")
    # cv2_imshow(imgo_cropped)
    print("end " + str(time.time()))
    return imgo_cropped

A problem with this method is that some images have shadows which prevent the procedure from seeing the true edge of the coin. (See below.)

Image with shadow. Whitespace detection at the edges won’t work.

For cases like the image above, I tried to use OpenCV’s Hough Circles to find the boundary of the coin. Thanks to Adrian Rosebrock’s tutorial “Detecting Circles in Images using OpenCV and Hough Circles” I was able to apply this here.

In my case, cv2.HoughCircles found too many circles; for example, it found 95 in one image, almost all of which were not the edge of the coin. I tried several methods to pick out the circle that represented the edge. I sorted the circles by radius, reasoning that the largest circle was the edge; it was not always. I looked for large circles lying completely inside the image, but also got erroneous results. (See below.) Perhaps I am using this incorrectly, but I have decided this method is not reliable enough to be worthwhile, so I am going to stop using it. The code using Hough circles is below; be warned that there is likely still a problem with it.

print("Since finding whitespace did not work, we will find circles. This will take more time")
circles = cv2.HoughCircles(gray_img, cv2.HOUGH_GRADIENT, 1.2, 100)

# ensure at least some circles were found
if circles is not None:
    # convert the (x, y) coordinates and radius of the circles to integers
    circles = np.round(circles[0, :]).astype("int")
    print("There are " + str(len(circles)) + " circles found in this image")
    for cir in range(0, len(circles)):
        x = circles[cir][0]
        y = circles[cir][1]
        r = circles[cir][2]
        # keep circles within 10% of the expected coin radius...
        if r < good_coin_radius*1.1 and r > good_coin_radius*0.9:
            # ...whose centres are far enough from the edges to lie inside the image
            if (x > (good_coin_radius*0.9) and x < (output.shape[0]-(good_coin_radius*0.9))):
                if (y > (good_coin_radius*0.9) and y < (output.shape[1]-(good_coin_radius*0.9))):
                    print("I believe this is the right circle.")
                    cv2.circle(output, (x, y), r, (0, 255, 0), 4)
                    cv2.rectangle(output, (x - 5, y - 5), (x + 5, y + 5), (0, 128, 255), -1)
                    # note numpy indexes rows (y) first
                    output = output[y-r:y+r, x-r:x+r]
                    width_half = round(output.shape[0]/2)
                    height_half = round(output.shape[1]/2)
                    # draw a thick white ring to blank out everything beyond the coin
                    cv2.circle(output, (width_half, height_half), round(r*1.414), (255,255,255), round(r*1.4))
                    output = img_remove_whitespace(output)
False positive Hough circle representing the edge of the coin.

My conclusion is that I am only going to use coins from the left half of each picture I take, since the photo flash works better there and there are fewer shadows. I will take care to remove debris around the coins that interferes with finding whitespace. Failing that, the routine removes photos that it can’t crop to the expected size of the coin. This results in the loss of some photos, which is acceptable here since I don’t need every photo to train the image recognition model. Below is a rejected image. The cleaned up code I am using right now is here.

Creating a set of images for image recognition.

iPhone camera gantry to take photos in focus at 3x magnification.

I would like to train an image recognition model with my own images to see how well it works. Here I want to use the obverse of coins to make a model to recognize the portraits of Elizabeth II (younger), Elizabeth II (more mature), George VI and Abraham Lincoln.

Initially I used 5 cent coins but I found they reflected too much light to take a good photograph so I switched to 1 cent coins. I also started with a camera on a Microsoft Surface Pro computer, taking pictures of 9 coins at a time in order to try to be efficient, but I did not get the higher image quality I believe I need.

Microsoft Surface Pro camera taking pictures using a Python program in Google Colab.
Photograph taken using the Surface Pro camera.
Photo taken with iPhone: 3x magnification, square layout, flash on white paper background.

The next step is to remove the background using OpenCV in Python and crop the image so it contains just the coin. I don’t want the image recognition model to recognize the portrait because the monarch’s name is printed beside it, so I will crop the image again to include only the portrait.

I believe this type of image processing can be applied to historical artifacts photographed using a neutral background. I am concerned the coins are too worn and have too little variation in colour to make a good model but that in itself will be useful to learn if it’s the case.

My thanks to the Saskatoon Coin Club for their excellent page describing the obverse designs of Canadian one cent coins.

Computational Creativity and Archaeological Data project

I am doing research for the Computational Creativity and Archaeological Data project. My current challenge has been to use techniques from Computer Vision to analyse images relevant to Computational Research on the Ancient Near East (CRANE) with the goal to provide additional understanding of these images. Computer Vision and machine learning could identify and classify elements in images or provide a possible reconstruction of a partial artifact using an image of it combined with a model of images of related artifacts. Here, I would like to reference Ivan Tyukin, Konstantin Sofeikov, Jeremy Levesley, Alexander N. Gorban, Penelope Allison and Nicholas J. Cooper’s work “Exploring Automated Pottery Identification [Arch-I-Scan]” in Internet Archaeology 50.

In order to classify images I plan to use machine learning. To do this I need a workable method, a machine learning platform and then to create a model that can be used for image recognition.

The method I plan to use is described in the books Deep Learning with Python and Deep Learning with R. Using the techniques from the book, I can train a model to recognize different types of images, and so far I have been able to make the examples in the book work.

Google Colab is the machine learning platform I am using. I am comfortable using this given I am not using any sensitive data as per guidelines here. Colab offers a free GPU which is a requirement for the efficient processing of these models. (Training one model was going to take 2 hours without a GPU. With a GPU it took a few seconds.)

Python vs. R: Colab has some support for R, but it did not work well enough to install what was required, so I have switched to Python. This is the notebook of my setup of Colab.

I have made examples from Deep Learning with Python work. Now I want to create my own sets of training data, train models with them and see the results. This will provide a better understanding of the limitations and pitfalls of using machine learning for image recognition.

Fractals: using R, R6 classes and recursion.

Hexagonal Gosper curve.

I am a fan of fractals and recently I wanted to learn more about object-oriented programming in R using classes. Adam Spannbauer has an excellent tutorial using R6 classes and ggplot2 to create fractal trees and I adapted it for L-system line fractals found in Przemyslaw Prusinkiewicz and Aristid Lindenmayer’s book The Algorithmic Beauty of Plants.

An example of an L-system line fractal follows.

Imagine you are a turtle drawing a line according to instructions: + means turn to the right by 90 degrees. – means turn left 90 degrees. F means move forward a set distance. (Let’s assume it’s 100 pixels.)

So F-F-F-F would draw a square.

F also has a special role in that it replicates itself as a “generator”. If the generator for F is F-F+F+FF-F-F+F then the F-F-F-F would become F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F-F-F+F+FF-F-F+F and be drawn like the shape below. Recursion is used to generate subsequent generations of the same repeating pattern which can be seen here.

Koch island, generation 1.
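The rewriting step itself is compact; here is a minimal sketch (in Python, though the drawing code discussed above is R):

```python
# Each generation, every F in the string is replaced by the generator,
# while the + and - turn instructions pass through unchanged.
def expand(axiom, generations, generator="F-F+F+FF-F-F+F"):
    for _ in range(generations):
        axiom = "".join(generator if c == "F" else c for c in axiom)
    return axiom

# expand("F-F-F-F", 1) produces the long Koch island string shown above
```

Recursion gives the same result; a loop over generations is simply easier to test.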

Images of more fractals are on these pages. This project also allowed me to try Github pages and Binder. The repository for the code used is here:

A better search field for the finding aid for The Equity.

As of yesterday my computer completed reprocessing all of the issues of the Shawville Equity to extract named entities from the text. I have been able to upload all of the new data to the website except the table that relates location entities to the issues of The Equity they were published in. I’m not sure why I consistently get an error 500 when I upload this data; it’s just a set of small numbers. Unfortunately, the locations portion of the finding aid is broken. I’m still hacking at this.

On the plus side, I have set up a better search tool in the finding aid. It accepts only up to 15 letters for a full-text search, but it seems easier and more flexible to use than what I had before. Also, there is now a bit less duplicated data in the lists of entities.

Correcting the spelling of OCR errors

I have been reviewing the list of entities extracted from the editions of The Equity and have seen errors I could have corrected in the finding aid. One of them is that some entities appear in the same list twice. An example, from this page listing locations, is below.

Fort Coulongc
Fort Coulonge Williams
Fort Coulonge
Fort CoulongeSt. John’s
Fort Coulongo


Fort Coulonge St.
Fort Coulonge Tel.
Fort Coulonge
Fort Coulonge – River
Fort Coulonge – St. Andrew


Why is this? The first entity for Fort Coulonge has a tab character (ASCII code 9) separating the two words while the second one listed has a space (code 32), as expected. In this finding aid, each entity is meant to be unique to make it easier to reference and so this is a problem. I could correct this in the database with SQL UPDATE statements to merge the information for entities containing tab characters with the entity containing spaces, but it’s also an opportunity to reprocess the data from scratch and make some more corrections.
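A sketch of the whitespace normalization that would merge such duplicates before uniqueness is enforced (Python here for illustration; the actual processing is in R):

```python
# str.split() with no argument splits on any run of whitespace, tabs included,
# so the tab-separated and space-separated variants collapse to one entity.
entities = ["Fort\tCoulonge", "Fort Coulonge"]
normalized = {" ".join(e.split()) for e in entities}
# normalized contains the single entry "Fort Coulonge"
```

In R, gsub("\\s+", " ", entity) would do the equivalent before the uniqueness check.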

The last time I processed The Equity for entities it took about 2 weeks of around the clock processing, counting time when the program repeatedly stopped due to out of memory errors. However, with performance improvements, I expect reprocessing will be faster.

I would also like to add some spelling correction for OCR errors. The first spelling correction I tried was the method that comes from Rasmus Bååth’s Research Blog and Peter Norvig’s technique. A large corpus of words from English texts is used as the basis for correcting the spelling of a word. The word to be checked is compared to the words in the corpus and a match based on probability is proposed. My results from this technique did not offer much correction and in fact produced some erroneous corrections. I think this is because my person, location and organization entities often contain names.
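For reference, the core of the technique can be sketched in a few lines (a toy corpus here; Norvig’s real corpus is a large collection of English text, and his full method also considers two-edit candidates):

```python
import re
from collections import Counter

# Word frequencies from a toy corpus; candidates are ranked by frequency.
corpus = Counter(re.findall(r"[a-z]+", "the quick brown fox jumped over the lazy dog the end"))

def edits1(word):
    # Every string one edit away: deletions, transpositions, replacements, insertions.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Prefer candidates that exist in the corpus; fall back to the word itself.
    candidates = [w for w in edits1(word) | {word} if w in corpus] or [word]
    return max(candidates, key=lambda w: corpus[w])
```

With this toy corpus, correct("teh") returns "the". Adding names to the corpus simply raises their frequency so they can win this ranking.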

I tried the R function which_misspelled(), which did not produce an improvement in spelling correction either. I’ve spent a fair amount of time on this; is this a failure?

Peter Norvig’s technique is trainable. Adding additional words to Norvig’s corpus used for spell checking seems to give better results; I even got a few useful corrections, such as changing Khawville to shawville. To start to train the Norvig spell checker I entered all of the communities in Quebec listed on this Wikipedia category. Then I viewed the output as each term was checked to see if the Norvig spell checker was failing to recognize a correctly spelled word. An example of this is when the name Beatty gets corrected to beauty. I added the correctly spelled names that Norvig’s method was not picking up to his corpus.

Below is a sample of terms I have added to Norvig’s corpus to improve the spell check results for entities found in The Equity.

Ottawa, Porteous, Hiver, Du, Lafleur, Varian, Ont, Mercier, Duvane, Hanlan, Farrell, Robertson, Toronto, Jones, Alexandria, Chicago, England, London, Manchester, Renfrew, Pontiac, Campbell, Forresters Falls, UK, Cuthbertson, Steele, Gagnon, Fort Coulonge, Beresford, Carswell, Doran, Dodd, Allumette, Nepean, Rochester, Latour, Lacrosse, Mousseau, Tupper, Devine, Carleton, Laval, McGill, Coyne, Hodgins, Purcival, Brockville, Eganville, Rideau, McLean, Hector, Langevin, Cowan, Tilley, Jones, Leduc, McGuire

Below is an example of the output from the spelling correction using Norvig’s method. As you can see, I need to add terms from lines 4, 6, 8 and 10 because the Norvig method is returning an incorrect result for them. Even then, this method may still return an error for a correctly spelled name: despite adding “McLean” to the corpus, it still corrects “McLean” to “clean”.

Data frame showing terms that may have been misspelled in the left column next to the suggested corrections from Norvig’s method.


The full R program is here. Below is the detail of the portion of the function used for spelling correction.

# Read in big.txt, a 6.5 MB collection of different English texts.
raw_text <- paste(readLines("C:/a_orgs/carleton/hist3814/R/graham_fellowship/norvigs-big-plus-placesx2.txt"), collapse = " ")
# Make the text lowercase and split it up creating a huge vector of word tokens.
split_text <- strsplit(tolower(raw_text), "[^a-z]+")
# Count the number of different type of words.
word_count <- table(split_text)
# Sort the words and create an ordered vector with the most common type of words first.
sorted_words <- names(sort(word_count, decreasing = TRUE))


#Rasmus Bååth's Research Blog
correctNorvig <- function(word) {
 # Calculate the edit distance between the word and all other words in sorted_words.
 edit_dist <- adist(word, sorted_words)
 # Calculate the minimum edit distance to find a word that exists in big.txt 
 # with a limit of two edits.
 min_edit_dist <- min(edit_dist, 2)
 # Generate a vector with all words with this minimum edit distance.
 # Since sorted_words is ordered from most common to least common, the resulting
 # vector will have the most common / probable match first.
 proposals_by_prob <- c(sorted_words[ edit_dist <= min(edit_dist, 2)])
 # In case proposals_by_prob would be empty we append the word to be corrected...
 proposals_by_prob <- c(proposals_by_prob, word)
 # ... and return the first / most probable word in the vector.
 proposals_by_prob[1]
}

<!--- snip ---- much of the program is removed --->

# correctedEntity is what will be checked for spelling
# nameSpellChecked is the resulting value of the spelling correction, if a correction is found.  

 correctedEntityWords = strsplit(correctedEntity, " ")
 correctedEntityWordsNorvig = strsplit(correctedEntity, " ")
 #sometimes which_misspelled() fails and so it is in a tryCatch()
 misSpelledWords <- tryCatch(
     which_misspelled(correctedEntity, suggest=TRUE),
     error=function(cond) { return(NULL) },
     warning=function(cond) { return(NULL) })
 if(is.null(misSpelledWords)){
     #The R spell checker has not picked up a problem, so no need to do further checking.
 } else {
     for(counter in 1:length(misSpelledWords[[1]])){
         wordNum = as.integer(misSpelledWords[[1]][counter])
         correctedEntityWords[[1]][wordNum] = misSpelledWords[counter,3]
         correctedEntityWordsNorvig[[1]][wordNum] = correctNorvig(correctedEntityWordsNorvig[[1]][wordNum])
     }
     correctedEntitySpellChecked = paste(correctedEntityWords[[1]],collapse=" ")
     correctedEntityNorvig = paste(correctedEntityWordsNorvig[[1]],collapse=" ")
     #We have found a suggested correction
     print(paste(correctedEntity,correctedEntitySpellChecked,correctedEntityNorvig,sep=" --- "))
     #keep a vector of the words to make into a dataframe so that we can check the results of the spell check. Remove this after training of the spell checker is done.
 }

#Clean up any symbols that will cause an SQL error when inserted into the database
 nameSpellCheckedSql = gsub("'", "''", nameSpellChecked)
 nameSpellCheckedSql = gsub("’", "''", nameSpellCheckedSql)
 # (in R, "\'" is the same string as "'", so a third gsub for it would just repeat the first)
 nameSpellCheckedSql = gsub("\\", "", nameSpellCheckedSql, fixed=TRUE)

The next step is to finish reprocessing the Equity editions and use the corrected spelling field to improve the results in the “possible related entities” section of each entity listed on the web site for the finding aid.


Plotting data using R.

This week I have continued working with the National Library of Wales’ Welsh Newspapers Online. Working with this collection, I wanted to see if there was a significant pattern in the number of newspaper stories found in search results for my research on allotment gardening in Wales during World War I. I used this R program to search Welsh Newspapers Online and store the results in a MySQL database. My previous post here explains how the web page parsing in the program works.

Below is a graph of the number of newspaper stories containing the words “allotment” and “garden” published each month during World War I:

Graph of the number of newspaper stories containing allotment and garden published each month.

The number of newspaper stories in Welsh papers containing allotment and garden rises significantly in 1917, after a poor harvest in 1916 and the establishment of the British Ministry of Food on 22 December 1916 [1].

Below is the R program used to make the graph. Initially I had problems graphing the data by month: if I just used numbers for the months, where August 1914 was month 1 and November 1918 was month 52, the graph was harder to interpret. Using a time series helped; see this line in the program below: qts1 = ts(dbRows$count, frequency = 12, start = c(1914, 8)).

rmysql.settingsfile<-"C:\\ProgramData\\MySQL\\MySQL Server 5.7\\newspaper_search_results.cnf"
# (database connection and searchTermUsed assignment omitted)

query<-paste("SELECT (concat('1 ',month(story_date_published),' ',year(story_date_published))) as 'month',count(concat(month(story_date_published),' ',year(story_date_published))) as 'count' from tbl_newspaper_search_results WHERE search_term_used='",searchTermUsed,"' GROUP BY year(story_date_published),month(story_date_published) ORDER BY year(story_date_published),month(story_date_published);",sep="")
rs = dbSendQuery(storiesDb,query)
dbRows<-fetch(rs,-1)
dbRows$month = as.Date(dbRows$month,"%d %m %Y")
qts1 = ts(dbRows$count, frequency = 12, start = c(1914, 8))
plot(qts1, lwd=3,col = "darkgreen", xlab="Month of the war",ylab="Number of newspaper stories", main=paste("Number of stories in Welsh Newspapers matching the search Allotment and Garden",sep=""),sub="For each month of World War I.")


It appears that a lot of stories were published about allotment gardening in the last two years of World War I in Wales. Were these stories published in newspapers throughout Wales or only in some areas? To answer this question we need to know the location of each newspaper that published a story and relate that to the stories published in the database.

I referenced a list of all the Welsh newspapers available on-line. Each newspaper also has a page of metadata about it. To gather data, I used an R program to parse the list of newspapers and lookup each newspaper’s metadata.  This program extracted the name of the place the newspaper was published and stored that into a database.

Below is the detail of the geocoding and inserting of values into the database. I removed a tryCatch() handler for the geocode statement for readability.

NewspaperDataPlaceGeoCode= geocode(paste(NewspaperDataPlace,",",NewspaperDataCountry,sep=""))

 NewspaperDataPlaceLat = NewspaperDataPlaceGeoCode[[2]]
 NewspaperDataPlaceLong = NewspaperDataPlaceGeoCode[[1]]

query<-paste("INSERT INTO `newspaper_search_results`.`tbl_newspapers`(`newspaper_id`,`newspaper_title`,`newspaper_subtitle`,`newspaper_place`,`newspaper_country`,`newspaper_place_lat`,`newspaper_place_long`) VALUES ('",newspaperNumber,"','",sqlInsertValueClean(NewspaperDataTitle),"',LEFT(RTRIM('",sqlInsertValueClean(NewspaperDataSubTitle),"'),255),'",NewspaperDataPlace,"','",NewspaperDataCountry,"',",NewspaperDataPlaceLat,",",NewspaperDataPlaceLong,");",sep="")

The tbl_newspapers table with the geocoded location of publication.

I used R’s ggmap to plot the locations of the newspapers on a map of Wales.[2] Below, the title, latitude and longitude are selected from tbl_newspapers and then put into the data frame named df.

query="SELECT `newspaper_title`,`newspaper_place_lat`,`newspaper_place_long` FROM `tbl_newspapers`;"
rs = dbSendQuery(newspapersDb,query)
dbRows<-fetch(rs,-1)

df <- data.frame(x=dbRows$newspaper_place_long, y=dbRows$newspaper_place_lat,newspaperTitle=dbRows$newspaper_title)

#opens a map of Wales
mapWales <- get_map(location = c(lon = -4.08292, lat = 52.4153),color = "color",source = "google",maptype = "roadmap",zoom = 8)

ggmap(mapWales, base_layer = ggplot(aes(x = x, y = y, size = 3), data = df)) + geom_point(color="blue", alpha=0.3)

The ggmap call plots the locations of the newspapers in the df data frame on to the mapWales map:

Sites where Welsh newspapers were published.

The map above shows the locations where each newspaper in the Welsh National Library collection was published. To make this usable with the collection of stories about allotment gardens that were printed during World War I, I will change the program to join the table of stories to the table of newspaper locations and plot only the locations of the newspapers that printed the stories in the collection of search results above.

To improve on this, instead of just plotting the publication location, I would like to plot the area the newspaper circulated in.  I plan to see if I can reliably get this information from the newspaper metadata.

Thanks to Jiayi (Jason) Liu for the article Ggmap. See:




[1] Records of the Ministry of Food. British National Archives.

[2] D. Kahle and H. Wickham. ggmap: Spatial Visualization with ggplot2. The R Journal, 5(1), 144-161. URL



Searching an on-line newspaper without an API.

Last week’s research focused on getting notices from the British Gazette via its API. The notices are searchable and returned as XML, which could be handled easily as a data frame in R.

This week’s focus is a web site I can’t find an API for: the National Library of Wales’ Welsh Newspapers Online. This site is a tremendous source of news stories and it is easy for a person to search and browse. I have asked the library whether the web site has an API. In the meantime, this week’s program uses the search result web pages to get data. It’s a much uglier and more error-prone process to reliably get data out of a web page, but I expect the benefit will be saving time in pulling together newspaper articles for research, so that I can focus on reading them rather than downloading them. I hope this approach is useful for others. Here is the program.

To search Welsh Newspapers Online using R, a URL can be composed with these statements:

searchDateRangeMin = "1914-08-03"
searchDateRangeMax = "1918-11-20"
searchDateRange = paste("&range%5Bmin%5D=",searchDateRangeMin,"T00%3A00%3A00Z&range%5Bmax%5D=",searchDateRangeMax,"T00%3A00%3A00Z",sep="")
searchBaseURL = ""
searchTerms = paste("search?alt=full_text%3A%22","allotment","%22+","AND","+full_text%3A%22","society","%22+","OR","+full_text%3A%22","societies","%22",sep="")
searchURL = paste(searchBaseURL,searchTerms,searchDateRange,sep="")
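As a variation on the hand-built percent-encoding above, base R’s URLencode() can produce the same escapes. This is only a sketch: the helper function encodePhrase is my own invention, and the "+" separators used by the site’s query syntax are still added by hand.

```r
# Sketch: encode one quoted search phrase with URLencode() rather than
# hard-coding %22 and %3A. With reserved = TRUE, quotes and colons are
# percent-encoded; the surrounding query structure matches the original.
encodePhrase <- function(term) {
  URLencode(sprintf('full_text:"%s"', term), reserved = TRUE)
}
searchTerms = paste("search?alt=", encodePhrase("allotment"), "+AND+",
                    encodePhrase("society"), "+OR+",
                    encodePhrase("societies"), sep = "")
```

This produces the same searchTerms string as the hand-encoded version, with less chance of a typo in the escape sequences.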

Together, these statements assemble the full search URL.

This search generates 1,009 results. To loop through them in pages of 12 results at a time, this loop is used:

for(gatherPagesCounter in 1:(floor(numberResults/12)+1)){
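The paging arithmetic works out like this for the 1,009 results above. One caveat worth noting: floor(n/12) + 1 overshoots by one empty page whenever the result count is an exact multiple of 12; ceiling(n/12) avoids that edge case.

```r
# 1,009 results at 12 per page: 84 full pages plus one partial page.
numberResults <- 1009
lastPage <- floor(numberResults / 12) + 1   # 85

# ceiling() handles an exact multiple of 12 without an extra empty page:
ceiling(1008 / 12)   # 84, whereas floor(1008/12) + 1 would give 85
```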

How does the program know the number of results returned by the search? It looks through the search results page line by line until it finds a line like:

<input id="fl-decade-0" type="checkbox" class="facet-checkbox" name="decade[]" value="1910" facet />

If you like, view the source of a search results page and find the line above by searching for 1910.

The line above is unique on the page and does not change regardless of how many search results there are. The line that follows it contains the number of results:

<label for="fl-decade-0"> 1910 - 1919 <span class="facet-count" data-facet="1910">(1,009)</span></label>

Here is the part of the program that searches for the marker line above and parses the line after it to get the numeric value of 1009 we want:

# find the number of results by locating the decade facet line
for (entriesCounter in 1:550){
     if(thepage[entriesCounter] == '<input id="fl-decade-0" type="checkbox" class="facet-checkbox" name="decade[]" value="1910" facet />') {
          tmpline = thepage[entriesCounter+1]
          tmpleft = gregexpr(pattern = '"1910', tmpline)
          tmpright = gregexpr(pattern = '</span>', tmpline)
          numberResults = substr(tmpline, tmpleft[[1]]+8, tmpright[[1]]-2)
          numberResults = trimws(gsub(",", "", numberResults))
          numberResults = as.numeric(numberResults)
          break
     }
}
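As an alternative sketch, the same number can be pulled out of the facet-count line with a single regular expression rather than a pair of gregexpr() offsets. This assumes the markup shown above.

```r
# Capture the digits (and commas) between "(" and ")</span>", then
# strip the comma and convert to numeric.
tmpline <- '<label for="fl-decade-0"> 1910 - 1919 <span class="facet-count" data-facet="1910">(1,009)</span></label>'
m <- regexec('\\(([0-9,]+)\\)</span>', tmpline)
numberResults <- as.numeric(gsub(",", "", regmatches(tmpline, m)[[1]][2]))
```

A single anchored pattern is a little less fragile than counting character offsets, though both break if the page markup changes.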

Getting this information returned from an API would be easier to work with, but we can handle this.

For testing purposes, I’m using 3 pages of search results for now. Here is a sample of the logic used in most of the program:

for(gatherPagesCounter in 1:3){
     thepage = readLines(paste(searchURL, "&page=", gatherPagesCounter, sep = ""))
     # get rid of the tabs
     thepage = trimws(gsub("\t", " ", thepage))
     for (entriesCounter in 900:length(thepage)){
          if(thepage[entriesCounter] == '<h2 class="result-title">'){
               entryTitle = trimws(gsub("</a>", "", thepage[entriesCounter+2]))
               # ... the remaining fields are parsed the same way ...
          }
     }
}

A page number is appended to the search URL noted above…

thepage = readLines(paste(searchURL,"&page=",gatherPagesCounter,sep=""))

…so that the resulting URL ends with the page number and the program can step through pages 1, 2, 3 … 85.

This statement replaces tab characters with spaces and trims surrounding whitespace, making each line cleaner for the purpose of matching the lines we want:

thepage = trimws(gsub("\t"," ",thepage))

The program loops through the lines on the page looking for each line that signifies the start of an article returned from the search: <h2 class="result-title">

for (entriesCounter in 900:length(thepage)){
     if(thepage[entriesCounter] == '<h2 class="result-title">'){

Once we know which line number <h2 class="result-title"> is on, we can get items like the article title that happen to be 2 lines below the line we found:

entryTitle = trimws(gsub("</a>","",thepage[entriesCounter+2]))
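The offset-based lookup can be seen on a minimal made-up fragment. The three lines below are an assumption about the page structure, mirroring the description above, not actual output from the site.

```r
# A fake three-line fragment shaped like the search results page:
# the marker line, the opening anchor tag, then the title text.
thepage <- c('<h2 class="result-title">',
             '<a href="/view/1234">',
             'ALLOTMENTS FOR ALL.</a>')
i <- which(thepage == '<h2 class="result-title">')
entryTitle <- trimws(gsub("</a>", "", thepage[i + 2]))
# entryTitle is now "ALLOTMENTS FOR ALL."
```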

This all breaks to smithereens if the web site is redesigned; however, it’s working well enough for my temporary purposes here. I hope it works for you. This technique can be adapted to other similar web sites.

Download each article:

For each search result returned, the program also downloads the text of the linked article. As it does that, the program pauses for 5 seconds so as not to put a strain on the web servers at the other end of this.

# wait 5 seconds - don't stress the server
Sys.sleep(5)

Results of the program:

  • A comma-separated value (.csv) file listing each article with the newspaper name, article title, date published, URL, and a rough citation.
  • A separate html file with the title and text of each article so that I can read it off-line.
  • An html index to these files for easy reference.
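A minimal sketch of writing such a .csv file with write.csv() is below; the column names and sample values are my assumptions for illustration, not the program’s actual output.

```r
# Hypothetical one-row results table with the fields listed above.
results <- data.frame(newspaper = "Example Welsh Weekly",
                      title     = "ALLOTMENTS FOR ALL.",
                      date      = "1917-02-03",
                      url       = "https://example.org/view/1234",
                      citation  = "Example Welsh Weekly, 3 February 1917",
                      stringsAsFactors = FALSE)
outfile <- file.path(tempdir(), "articles.csv")
write.csv(results, outfile, row.names = FALSE)
```

In the real program each search result would append one row to this table before it is written out.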

As I read through the articles, I can make notes in a spreadsheet listing all of them, removing the non-relevant ones and classifying the others. I have applied natural language processing to extract person, location, and organization entities from the articles, but given the frequency of errors I’m seeing, I am still evaluating whether that provides useful data.

It’s time to let this loose!