I am learning about Graph Databases. These are notes that may be useful to you, but caution is advised since I’m new to this.
What are Graph databases? They are a means to relate data as connected points using vertices (the points) and edges (the connections). Each vertex and edge has properties that describe it.
What flavours of Graph databases exist? I have been looking at neo4j.io and Azure Cosmos DB and I imagine there are more. This post works with Cosmos DB, Gremlin and Python.
Azure Cosmos DB
My notes are based on Quickstart: Create a graph database in Azure Cosmos DB using Python and the Azure portal. (Thank you to Manish Sharma, Sneha Gunda and Microsoft for this.) I’m also reading Graph Databases in Action: Examples in Gremlin by Dave Bechberger and Josh Perryman.
Python & Gremlin
Python and the Gremlin API for the Cosmos DB graph database are used to manipulate data. Gremlin is the graph traversal language of the Apache TinkerPop framework for graph databases. I tried to use a Jupyter notebook on Google Colab, but kept getting “RuntimeError: Cannot run the event loop while another loop is running”, so I used Visual Studio Code to edit and run the .py programs instead. I installed gremlinpython:
pip install --trusted-host=pypi.org --trusted-host=files.pythonhosted.org --user gremlinpython
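Once gremlinpython is installed, connecting looks roughly like the sketch below. It follows the pattern in the Microsoft quickstart (a wss:// endpoint, a username built from the database and graph names, and the account's primary key as the password). The helper names cosmos_gremlin_settings and make_client, and every account, database and graph value, are placeholders made up for illustration.

```python
# Sketch only: helper names and all account/database/graph values are placeholders.

def cosmos_gremlin_settings(account, database, graph, primary_key):
    """Build the connection settings for a Cosmos DB Gremlin endpoint."""
    return {
        "url": f"wss://{account}.gremlin.cosmos.azure.com:443/",
        "traversal_source": "g",
        "username": f"/dbs/{database}/colls/{graph}",
        "password": primary_key,
    }


def make_client(settings):
    """Open a gremlinpython client using the settings above."""
    # Imported here so the settings helper can be used without gremlinpython.
    from gremlin_python.driver import client, serializer

    return client.Client(
        settings["url"],
        settings["traversal_source"],
        username=settings["username"],
        password=settings["password"],
        message_serializer=serializer.GraphSONSerializersV2d0(),
    )


# Example use (needs a real account and key, so it is left commented out):
# settings = cosmos_gremlin_settings("my-account", "my-database", "my-graph", "<primary key>")
# gremlin_client = make_client(settings)
# print(gremlin_client.submit("g.V().count()").all().result())
# gremlin_client.close()
```

Keeping the primary key out of the script, for example in an environment variable, is probably wise if the programs are shared on GitHub.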
Knowledge Graph of subjects in the diary of Capt. William White.
A study of the 1917 World War I diary of Canadian soldier Capt. William White yielded lists of the people, organizations and locations appearing in it. An analysis of these subjects may open more avenues for research into Capt. White’s experience in wartime France as well as his unit, No. 2 Construction. With this exercise I am seeking to build a knowledge graph of subjects in the diary.
The programs I am using are on GitHub and are based on the Microsoft example. Errors in the programs are mine.
I set up a Graph database in Microsoft Azure for this exercise using the Quickstart noted above. I’m using the free tier of Azure pricing.
- Run drop.py to remove all of the data from the database and make it ready to load. The Gremlin statement is g.V().drop()
- The subjects in the diary are stored in a relational database. Each table was exported as a .csv. These files include wwper.csv, wwloc.csv, wworg.csv, wwpages.csv (which store the person, location, organization and page entities). The files wwper_x_page.csv, wwloc_x_page.csv and wworg_x_page.csv store the references between subjects, such as a person, and the diary page they appear on.
- A set of programs, such as prep_data_1_people.py, parses and reformats the .csv data into Gremlin statements that allow the data to be inserted into the graph database.
- Run all of the “prep_data_1” programs to prepare the data to be loaded.
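The preparation step can be sketched as follows. This is not the actual prep_data_1_people.py, only an illustration of turning .csv rows into addV statements. The column names (id, name, name_last, about) and the helper names are assumptions, as is the backslash escaping of quotes inside values.

```python
import csv
import io


def gremlin_quote(value):
    """Wrap a value in straight single quotes, escaping backslashes and quotes."""
    text = str(value).replace("\\", "\\\\").replace("'", "\\'")
    return f"'{text}'"


def person_to_add_vertex(row, id_prefix="per"):
    """Turn one CSV row (a dict) into a g.addV(...) statement for a person."""
    parts = [f"g.addV('person').property('id', {gremlin_quote(id_prefix + row['id'])})"]
    # Assumed column names; empty values are skipped.
    for column in ("name", "name_last", "about"):
        if row.get(column):
            parts.append(f".property({gremlin_quote(column)}, {gremlin_quote(row[column])})")
    parts.append(".property('partitionKey', 'partitionKey')")
    return "".join(parts)


def people_statements(csv_file):
    """Read an open CSV file and return one statement per row."""
    return [person_to_add_vertex(row) for row in csv.DictReader(csv_file)]


# A one-row stand-in for wwper.csv.
sample = io.StringIO(
    "id,name,name_last,about\n"
    "0,Capt. William White,White,No. 2 Construction chaplin\n"
)
for statement in people_statements(sample):
    print(statement)
```

Using straight single quotes matters here: statements pasted from a word processor or web page with curly quotes will not parse as Gremlin.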
Loading the database.
- Run data_load_people.py. This runs a set of statements to create Vertices such as: g.addV('person').property('id', 'per0').property('name', 'Capt. William White').property('name_last', 'White').property('about', 'No. 2 Construction chaplin').property('partitionKey', 'partitionKey')
- Run data_load_locations.py, data_load_organizations.py and data_load_pages.py. This completes the load of vertices.
- Check the results in Azure’s Data Explorer. Click “Execute Gremlin Query” and click on results to view each Vertex. At this point the Vertices are not yet related to each other.
- Change the graph style to display the name of the Vertex and differentiate node types by color, using the label field. (See picture below)
- Run data_load_per_x_page.py. This executes a set of statements to create Edges such as: g.V('per1').addE('appears').to(g.V('page3')). Person ID=1 (Mlles Thomas) appears on page ID=3, seen here.
- Run data_load_loc_x_page.py and data_load_org_x_page.py. This completes the load of edges.
- Check the results in Data Explorer. Click “Execute Gremlin Query” and click on results to view each Vertex. You should see relationships now. (see below)
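The edge load in the steps above can be sketched like this. It is an illustration rather than the actual data_load programs: the column names per_id and page_id are assumptions about wwper_x_page.csv, and run_statements accepts any client object whose submit(...) returns something with .all().result(), which is how the gremlinpython client behaves.

```python
import csv
import io


def appears_edge(source_id, page_id):
    """Build the statement linking a subject vertex to the page it appears on."""
    return f"g.V('{source_id}').addE('appears').to(g.V('{page_id}'))"


def edge_statements(csv_file, source_prefix="per",
                    source_column="per_id", page_column="page_id"):
    """Read a *_x_page CSV file (assumed column names) into addE statements."""
    return [
        appears_edge(f"{source_prefix}{row[source_column]}", f"page{row[page_column]}")
        for row in csv.DictReader(csv_file)
    ]


def run_statements(gremlin_client, statements):
    """Submit statements one at a time, collecting failures instead of stopping."""
    failures = []
    for statement in statements:
        try:
            gremlin_client.submit(statement).all().result()
        except Exception as error:  # keep going so one bad row does not stop the load
            failures.append((statement, error))
    return failures


# A one-row stand-in for wwper_x_page.csv.
sample = io.StringIO("per_id,page_id\n1,3\n")
print(edge_statements(sample))
```

Collecting failures rather than stopping at the first one makes it easier to see which rows refer to a vertex id that was never loaded.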
The next steps are to model more meaningful relationships between entities. For example: soldiers in the same unit, members of a family, friends and related organizations.
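As a possible starting point for finding such relationships, the existing “appears” edges can already show who shares a diary page with a given person. The sketch below only builds the query text. The function name is made up for illustration, and the traversal assumes edges run from the subject to the page, as in the addE statements used in the load.

```python
def co_occurrence_query(person_id, edge_label="appears"):
    """Build a Gremlin query for the names of people sharing a page with person_id."""
    return (
        f"g.V('{person_id}')"
        f".out('{edge_label}')"   # from the person to the pages they appear on
        f".in('{edge_label}')"    # back from those pages to every subject on them
        ".hasLabel('person')"     # keep people, drop locations and organizations
        ".dedup()"                # each person once, however many pages are shared
        ".values('name')"
    )


# The starting person appears in their own results, since they share every
# page with themselves.
print(co_occurrence_query("per0"))
```

The resulting text can be pasted into Data Explorer’s “Execute Gremlin Query” box or submitted from Python.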
Visual Studio Code, side note.
I am using Visual Studio Code for this. I had this error connecting to GitHub: “SSL certificate problem: self signed certificate in certificate chain”. Matt Ferderer’s post here allowed me to fix this.
I also get the error message below. I am running VS Code on Windows. According to this answer on Stack Overflow, it seems to be an error with aiohttp. I can’t fix it, but it does not prevent the scripts from completing.
Exception ignored in: <function _ProactorBasePipeTransport.__del__ at 0x000001B296E5FAF0>
Traceback (most recent call last):
  File "C:\Users\jblackad\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 116, in __del__
    self.close()
  File "C:\Users\jblackad\AppData\Local\Programs\Python\Python39\lib\asyncio\proactor_events.py", line 108, in close
    self._loop.call_soon(self._call_connection_lost, None)
  File "C:\Users\jblackad\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 746, in call_soon
    self._check_closed()
  File "C:\Users\jblackad\AppData\Local\Programs\Python\Python39\lib\asyncio\base_events.py", line 510, in _check_closed
    raise RuntimeError('Event loop is closed')
RuntimeError: Event loop is closed
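A workaround often suggested for this message is to switch asyncio from the Proactor event loop, the Windows default since Python 3.8, to the selector event loop before any connection is opened. The sketch below shows that idea and has not been tested against these scripts. The function name is made up for illustration, and newer Python versions deprecate event loop policies, so the sketch checks that the policy class exists first.

```python
import asyncio
import sys


def use_selector_event_loop_on_windows():
    """Switch asyncio to the selector event loop on Windows; do nothing elsewhere."""
    if sys.platform.startswith("win") and hasattr(asyncio, "WindowsSelectorEventLoopPolicy"):
        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
        return True
    return False


# Call once at the top of a script, before creating the Gremlin client.
changed = use_selector_event_loop_on_windows()
print("selector event loop policy set:", changed)
```

Closing the client explicitly at the end of each script may also help, since the message comes from transports being cleaned up after the event loop has already closed.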