Storing data as trees in RDB - Part 1: Adjacency List

Models

The problem of solving hierarchical data in RDMSs is quite common and old. Data that adheres to some hierarchy is usually represented in a tree. This allows us to traverse the tree through parent-child relations. Elements in the same level of the tree are then understood as elements that are incomparable (an antichain if you want) or belonging to the same level of hierarchy.

The problem occured to me while developing wikisophy, a web app that discovers and analyzes paths of the first links from wikipedia articles. It is conjectured that almost every wikipedia articles leads to the "Philosophy" articles. In order to gather and store such data it is quite helpful to imagine them as a tree, where parent is the article that is the href from the first link in the child article.

There are quite a few ways of storing such data in a relational database. In this article I will explore some of them, using sqlite3 and Common Lisp.

The basic differences of the models are:

The way the relation is expressed.
The number of columns used in order to capture the relationships.
The cost of storing/retrieving parts of the tree for our needs.

So let's hop into some of them and see.

Adjacency list model implementation

So the plain simple approach is to just add a column that holds the parent id of each data row. I'm using the clsql system here to speak to plain old sqlite database.

First we need to connect to our database. This concept doesn't really mean anything for an sqlite database, but the clsql system relies on the \*default-database\* global variable that is set with the connect procedure.

(defun database-exists-p (db)
  "Check if database DB exists. DB is an sqlite3 database's filename."
  (clsql:probe-database db :database-type :sqlite3))

(defun connect ()
  "Initiate connection to *database*."
  (when (database-exists-p *database*)
    (clsql:connect *database* :database-type :sqlite3)))

(connect)

Then we create our table:

(defparameter *adjacency-list-model*
  '(([id] integer :primary-key :unique)
    ([title] varchar :unique :collate nocase)
    ([parent_id] integer)))

(defun create-adjacency-list-table (table-name)
  "Create the table NAME in the DB."
  (if (not (clsql:table-exists-p table-name))
      (clsql:create-table table-name *adjacency-list-model*)
      "The table already exists!"))

(create-adjacency-list-table "test_table")

Now we have the model of our data in place. The good thing about the adjacency list model is that it requires very few metadata. The single "parent_id" column is sufficient to express the tree data structure.

Given that we have the table in place, the next step is to define procedures for storing and retrieving the data we need. For my aforementioned web application, I developed a crawler that returns a list of the wikipedia articles that it encountered till reaching the "Philosophy" article or an endpoint. Thus I need functions that insert the list into the database, adding the correct "parent_id" for each element.

(defun get-id-by-title (title)
  "Return the id of TITLE." 
  (caar (clsql:select [id] :from [adj_list] :where [= [title] title] :collate "nocase")))

(defun get-parent-by-title (title)
  "Return the parent_id of TITLE." 
  (caar (clsql:select [parent_id] :from [adj_list] :where [= [title] title])))

(defun get-title-by-id (id)
  "Return the title of ID."
  (caar (clsql:select [title] :from [adj_list] :where [= [id] id])))

(defun clear-underscores (string)
  "Given a string replace any underscores with spaces."
  (substitute #\SPACE #\_ string))

(defun insert-path (path &optional parent-id)
  "Given a path of strings, insert them properly in database, by
  filling the appropriate parent_ids."
  (when path
    (let ((current (clear-underscores (car path))))
      (when (not (get-id-by-title current))
          (clsql:insert-records :into [adj_list]
                                :attributes '(title parent_id)
                                :values `(,current ,parent-id)))
      (insert-path (cdr path) (get-id-by-title current)))))

The recursive procedure "insert-path" is probably not the best way to achieve this, but since my Lisp is way better than my ability to write complex SQL queries I chose to sequence the inserts like this. The fact that the procedure is meant to be used with a sqlite database also makes it less problematic since sqlite is actually quite fast in chaining commands, given that it doesn't have a server-client model. Using a different RDB like postgresql would be a different story though.

Also notice that the first recursion of the procedure runs always without the parent-id argument, hence resulting in the first item of the list getting inserted with a null value in the column. This way it is declared as the root of our tree.

Lastly we have to to retrieve a path to "Philosophy". By this I mean that given an article title we want to get a list of articles that are linked by the initial through the first article principle. This is easily defined like-so:

(defun get-path (title)
  "Given the title of an entry return it's path up to the root of the
  tree."
  (let ((parent (get-parent-by-title title)))
    (if parent
        (cons (caar (clsql:select [title]
                                  :from [adj_list]
                                  :where [= [title] title]))
         (get-path (get-title-by-id parent)))
        `(,(caar (clsql:select [title]
                               :from [adj_list]
                               :where [= [title] title]))))))

Model assessment

Having seen a plain implementation of the adjacency list model for storing tree like data in a database, we can now discuss some advantages and disadvantages of the model.

For the good stuff we have:

It needs very few metadata columns.
Inserting an element is cheap, since it needs few metadata and they are element only, so that there is no need for updates in other elements.
Finding adjacent nodes is very cheap and the implementation can be expanded to store children information as easily.

No all is great though. The main problem with adjacency lists is that retrieving subtrees is quite expensive. To retrieve a trivial subtree we have to make a sequential search along all its nodes. This can't be optimized and can be hairy for long subtrees.

Doing some benchmarking proved that for my intended usage it may be a good enough solution, since the paths I want to retrieve tend to be quite short. I can't be sure yet though, so I will explore other solutions as well.

DATABASE> (time (get-path "grendizer"))
Evaluation took:
  0.003 seconds of real time
  0.002415 seconds of total run time (0.002415 user, 0.000000 system)
  66.67% CPU
  6,023,658 processor cycles
  294,368 bytes consed

("Grendizer" "Super Robot" "Anime" "Animation" "Image" "Depiction"
 "Picture plane" "Painting" "Paint" "Liquid" "Compressibility" "Thermodynamics"
 "Physics" "Natural science" "Branch of science" "Science" "Testability"
 "Empirical" "Information" "Uncertainty" "Epistemic" "Philosophy")

Anyway, that's all for this part. In the next one I'll explore and implement the nested set model and also present a hybrid model of it and the adjacency list model. Till then take care of yourselves. This pandemic is quite hard on everyone, so I hope you find ways to pass this pause of everyday life creatively! Don't hesitate to contact me for anything I may be able to help.