A database index is an auxiliary data structure in a storage engine whose goal is faster retrieval. It’s not the primary store for the data, and it has to be maintained on every write.

The hash index is the simplest type of index. The keys are whatever column the index is built on, and the value is the location of the data on disk, which prevents us from having to scan for the values.

Hash indexes work well for simple equality (`select * from table_name where id = 4;`). They don’t work for range queries (`select * from table_name where id > 4 and id < 16;`), due to the nature of a hash map (hash functions are by no means sequential).

Hash indexes are stored in memory (in RAM), and so if the database crashes, they’re wiped out and have to be rebuilt.

Note that putting and getting from a hash table is *O(1)* in most cases (i.e., when a hash collision doesn’t occur), so this type of index is extremely fast, and works well for its limited set of use cases (equality!). Note that hash collisions are minimized when sufficient memory is allocated to the hash map.
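To make that concrete, here is a minimal sketch of a hash index in Python (illustrative, not any real engine’s code; the `HashIndex` name and the offsets-on-disk layout are assumptions for the example):

```python
# A hash index is essentially a hash map from key to the row's location
# on disk, so a lookup is one hash probe plus one disk seek.
class HashIndex:
    def __init__(self):
        self.offsets = {}  # key -> byte offset of the row on disk

    def put(self, key, offset):
        self.offsets[key] = offset  # maintained on every write

    def get(self, key):
        return self.offsets.get(key)  # O(1) expected

index = HashIndex()
index.put(4, 1024)   # row with id = 4 lives at byte 1024
index.get(4)         # -> 1024, no scan needed
# A range query (id > 4 and id < 16) gets no help here: the hash map
# imposes no ordering on the keys, so we'd be back to scanning.
```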

The B-Tree is the most common type of index. In fact, when you use `create index` in Postgres, it creates a B-Tree index by default.

B-Trees keep the key-value pairs sorted in order by key, which allows for great performance on range queries.

A B-Tree is made up of pages, with page size *P*, each of which has *b* possible references to child pages. We call *b* the *branching factor*. If a page has *d* children, then it is storing *d-1* keys, and *d <= b*.

When dealing with the data structure (as opposed to the database index), you will often hear “pages” referred to as nodes. They’re the same thing, just different contexts.

The key properties of a B-Tree are the following:

- Each node can store b-1 keys, where b is the branching factor. This is in contrast to something like a splay tree, where each node stores only a single value.
- It is always balanced. Insert and delete operations, if they put the tree out of balance, are followed by some balancing operation. (A splay tree acts the same in this regard).

For the following example, we’ll be dealing with 2-4 trees, which are B-Trees with a branching factor of 4. Meaning, a node can have a maximum of 4 children and a maximum of 3 keys.

When a node overflows (the # of keys allowed has been exceeded upon an insertion), we split the node.

e.g. If a node has keys [1,2,3,4], it has overflowed. Therefore, we must split it.

We do this by promoting one of the middle keys to the parent, and making children out of the keys to the left and right of the promoted key.

One challenge for implementers is that overflow can propagate—if we overflow a leaf node, then we split it, and promote a key, that can cause overflow on the parent node, which we will then further have to split. This can (and should) be solved recursively, but it is nonetheless a challenge.
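The insert-then-split logic above can be sketched in Python. This is a toy 2-4 tree for illustration, not production index code; the names (`Node`, `MAX_KEYS`, `insert`) are mine, not from any library:

```python
class Node:
    def __init__(self, keys=None, children=None):
        self.keys = keys if keys is not None else []
        self.children = children if children is not None else []  # [] for a leaf

MAX_KEYS = 3  # branching factor b = 4, so at most b - 1 = 3 keys per node

def insert(root, key):
    """Insert `key` into the 2-4 tree at `root`; return the (possibly new) root."""
    path, node = [], root
    while node.children:                      # descend to a leaf
        path.append(node)
        node = node.children[sum(k < key for k in node.keys)]
    node.keys.append(key)
    node.keys.sort()
    while len(node.keys) > MAX_KEYS:          # split overflowed nodes, upward
        mid = len(node.keys) // 2
        promoted = node.keys[mid]
        left = Node(node.keys[:mid], node.children[:mid + 1])
        right = Node(node.keys[mid + 1:], node.children[mid + 1:])
        if not path:                          # the split propagated to the root
            return Node([promoted], [left, right])
        parent = path.pop()
        i = parent.children.index(node)
        parent.children[i:i + 1] = [left, right]
        parent.keys.insert(i, promoted)
        node = parent                         # parent may now overflow too
    return root

root = Node()
for k in [1, 2, 3, 4]:           # inserting 4 overflows the leaf [1, 2, 3]
    root = insert(root, k)
root.keys                        # [3] was promoted into a new root
[c.keys for c in root.children]  # [[1, 2], [4]]
```

Note how the `while` loop plays the role of the recursion: each split may overflow the parent, which is then split on the next iteration.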

Upon deleting a value, we can end up with a node that has no keys left (underflow). We have two cases: transfer and fusion. When the empty node has a sibling with at least 2 keys, we can perform a transfer: a key from the parent moves down into the empty node, and a key from the sibling moves up into the parent.

In the case where there isn’t such a sibling, and transferring would just propagate the underflow to the sibling, we use fusion: the combining of two nodes. We combine the empty node with its sibling, and also move a key from the parent into the merged node.

This website has a fantastic visualization of how it works. The main thing to take away about the structure is that it is always balanced, so a search for a key-value pair is worst case *O(log n)*.

B-Tree indexes, unlike hash indexes, are usually maintained on disk; the large page size maps well onto disk blocks. When a B-Tree has a large page size and branching factor, it has a very high storage capacity.

`storage capacity = (b^n) * P`

where *n* is the depth of the tree (the number of levels of pages), *b* is the branching factor, and *P* is the page size. e.g.

`500^4 * 4KB ≈ 256TB`
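As a quick sanity check of that arithmetic in Python:

```python
b = 500           # branching factor
n = 4             # depth of the tree (levels of pages)
P = 4 * 1024      # page size in bytes (4 KB)

capacity = (b ** n) * P
print(capacity)   # 256000000000000 bytes, i.e. 256 TB
```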

(example from [1]).◾

Git is my personal favorite application of a directed acyclic graph. Git, while incredibly powerful, can be confusing because it is abstract. You do need, to a certain extent, to understand its internals.

Objects are the unit of storage in Git. They are uniquely identified by the SHA-1 hash of their contents, and they are immutable (i.e. they cannot be changed) [1].

Commit objects contain information about a revision, **pointers to its parents**, and additional metadata [1].
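You can reproduce an object id yourself in a few lines of Python. Git prepends a `"<type> <size>\0"` header to the content and SHA-1 hashes the result:

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    # Git hashes "<type> <size>\0<content>" with SHA-1 to get the object id.
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()

git_object_id("blob", b"")
# -> 'e69de29bb2d1d6434b8b29ae775ad8c2e48c5391', the well-known id of the
#    empty blob, identical in every repository
```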

Things to notice about the above example:

- We have 2 branches, `master` and `add-feature`.
- There are 6 commits.
- Notice how a commit points at its parent. The initial commit, A, has no parent.

`master`, `add-feature`, and `HEAD` are examples of refs. A ref is a pointer to a commit (or to another ref). `HEAD` is a special ref; we’ll come back to this.

Note that when you are “checked out on branch add-feature”, what we really mean is “`HEAD` is pointing to the ref add-feature”.

`git checkout -b add-feature`

This does the following:

1. Create a new ref, add-feature, pointing to the same commit that the currently checked out branch was pointing at.
2. Check out your new branch (point HEAD at the newly created ref).
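Those two steps can be modelled with refs as a plain dictionary. This is a hypothetical sketch (real Git stores refs as files under `.git/refs`, and `HEAD` as a symbolic ref; the commit id here is fake):

```python
refs = {
    "master": "a1b2c3",        # branch ref -> commit id (fake id)
    "HEAD": "ref: master",     # HEAD -> symbolic pointer to a branch
}

def checkout_new_branch(name):
    current_commit = refs[refs["HEAD"].removeprefix("ref: ")]
    refs[name] = current_commit          # step 1: new ref at the same commit
    refs["HEAD"] = f"ref: {name}"        # step 2: point HEAD at the new ref

checkout_new_branch("add-feature")
# refs: {'master': 'a1b2c3', 'add-feature': 'a1b2c3', 'HEAD': 'ref: add-feature'}
```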

`git rebase master`

Replay the commits of the current branch on top of `master`, and move the branch ref to the rebased commits.

`git rev-list HEAD`

Print the ancestry graph from the stated location (in reverse chronological order).

I think I’ll leave it at that for now; I’m sure there will be more deep dives into Git in the future.◾

Recently I wrote about directed acyclic graphs, and how you can derive topological orderings from them. Now let’s look at a more practical example.

Gradle is a build tool. It is built on tasks.

There are dependencies among the tasks (i.e. we can’t `build` if we haven’t run `compileJava`). Instead of just storing a linked-list-type structure in which we do the tasks, Gradle internally stores the dependency graph as a DAG.

To illustrate this example, I have a demo Spring Boot application, with dependencies on Spring MVC & Tomcat (web). I also made it a Kotlin project. Other than that, it’s the standard application the Spring Initializr will generate.

Let’s look at a section of the graph (technically, everything needed to run `assemble`, which is one of the two tasks needed for `gradle build` to run, the other being `test`).

TL;DR: a Gradle task is a single atomic piece of work for a build, such as compiling classes or generating Javadoc.

Now, if we want to get a valid order in which to run the tasks (Gradle does this internally), we need to topologically sort the DAG.

```
tsort <<EOF
compileKotlin compileJava
compileKotlin jar
compileKotlin bootJar
compileJava classes
processResources classes
classes bootJar
classes jar
classes inspectClassesForKotlinIC
inspectClassesForKotlinIC jar
jar assemble
bootJar assemble
EOF
```

And the result is a valid order in which to run the tasks:

```
processResources
compileKotlin
compileJava
classes
bootJar
inspectClassesForKotlinIC
jar
assemble
```
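If you’d rather stay in code, Python’s standard-library `graphlib` computes the same kind of ordering. The mapping below goes from each task to the tasks it depends on, using the same edges we fed to `tsort` (the exact order produced may differ, since topological orderings are not unique):

```python
from graphlib import TopologicalSorter

deps = {
    "compileJava": ["compileKotlin"],
    "classes": ["compileJava", "processResources"],
    "inspectClassesForKotlinIC": ["classes"],
    "jar": ["compileKotlin", "classes", "inspectClassesForKotlinIC"],
    "bootJar": ["compileKotlin", "classes"],
    "assemble": ["jar", "bootJar"],
}

order = list(TopologicalSorter(deps).static_order())
# every task appears after all of its dependencies; `assemble` comes last
```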

As we said, Gradle has to topologically sort the DAG internally. In order to see its linear ordering, run `gradle assemble -m`.

```
:compileKotlin SKIPPED
:compileJava SKIPPED
:processResources SKIPPED
:classes SKIPPED
:bootJar SKIPPED
:inspectClassesForKotlinIC SKIPPED
:jar SKIPPED
:assemble SKIPPED
BUILD SUCCESSFUL in 0s
```

It is different, but only slightly. Recall that topological orderings are not unique (there are multiple ways to put your clothes on in the morning).◾

Now that we know the basics of directed acyclic graphs, we’re going to move to a more specialized data structure, one that is wholly derivable from a DAG.

A **topological ordering** of a directed graph is an ordering of its vertices such that for every edge *x -> y*, *x* comes before *y* in the ordering. Such an ordering exists if and only if the graph is acyclic.

For those of you who need an algorithm to figure out how to put your clothes on in the morning, you’re saved! Let’s define the task dependency graph for getting dressed. An edge *x -> y* in this graph means “you must put on *x* before *y*“, i.e. *socks -> shoes* is an edge, because you must put on your socks before your shoes.

Without going into the details of how the topological sorting algorithm works, let’s run it on the graph.

We’ll use the `tsort` command, which is already implemented on most Unix-based platforms. (As you can see below, we just provide space-separated pairs representing all edges in the graph.)

```
tsort <<EOF
underwear shoes
socks shoes
underwear pants
pants belt
shirt belt
EOF
# output - a valid way to get dressed
shirt
socks
underwear
pants
shoes
belt
```
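For the curious, `tsort` is doing something like Kahn’s algorithm: repeatedly output a vertex with no remaining incoming edges. A minimal sketch (the `kahn` function name is mine):

```python
from collections import defaultdict, deque

def kahn(edges):
    """Kahn's algorithm: repeatedly emit a vertex with no incoming edges left."""
    graph, indegree = defaultdict(list), {}
    for x, y in edges:
        graph[x].append(y)
        indegree[y] = indegree.get(y, 0) + 1
        indegree.setdefault(x, 0)
    ready = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while ready:
        v = ready.popleft()
        order.append(v)
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    if len(order) != len(indegree):
        raise ValueError("graph has a cycle; no topological ordering exists")
    return order

order = kahn([("underwear", "shoes"), ("socks", "shoes"), ("underwear", "pants"),
              ("pants", "belt"), ("shirt", "belt")])
# one valid order: ['underwear', 'socks', 'shirt', 'pants', 'shoes', 'belt']
```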

Note that topological orderings are not guaranteed to be unique. (You could put your shirt or socks on first, right? It doesn’t matter).

Also notice that the notion of a **cycle** in this graph makes no sense. (e.g. “You must put on your socks before your shoes, and your shoes before your underwear, and your underwear before your socks.” Good luck with that one).◾

A directed acyclic graph is one of the most important data structures we deal with in software engineering. You can't understand Git without grasping the main concepts of this data structure. In the next four articles, we're going to go in depth into this powerful data structure, and see why it sits at the heart of so many things.

A graph is a data structure, defined by its vertices and edges. Oftentimes you’ll see it defined as *G = (V, E)*. That just means, “G is the graph defined by vertices *V* and edges *E*”.

We usually draw them like this:

We call a graph **connected** if there is a path from any vertex to any other vertex.

A **cycle** in a graph is a path from a vertex back to itself, where each edge is only travelled once (the above graph does not contain a cycle).

In a directed graph, each edge has a direction, meaning it goes *from* one vertex *to* another vertex. We draw directed graphs like this.

Directed graphs can have cycles too, like the above one in red. When a directed graph doesn’t have a cycle, we call it a **directed acyclic graph** (a DAG). (The changed edge to remove the cycle is shown in gold).

In short, that is a directed acyclic graph. A graph that is directed and has no cycles.
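Checking whether a directed graph is a DAG is a short depth-first search: a cycle exists exactly when DFS meets a vertex that is still on the current recursion stack. A minimal sketch (the `is_dag` name is mine):

```python
def is_dag(graph):
    """graph maps each vertex to the list of vertices it points to."""
    ON_STACK, DONE = 1, 2
    state = {}

    def visit(v):
        state[v] = ON_STACK
        for w in graph.get(v, []):
            if state.get(w) == ON_STACK:   # back edge: we found a cycle
                return False
            if w not in state and not visit(w):
                return False
        state[v] = DONE
        return True

    return all(visit(v) for v in graph if v not in state)

is_dag({"a": ["b"], "b": ["c"], "c": []})     # True: no cycle
is_dag({"a": ["b"], "b": ["c"], "c": ["a"]})  # False: a -> b -> c -> a
```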

A related but different concept is a tree. A tree (in graph theory) is a graph where there is exactly one path from any vertex to any other. Or, more formally, a connected, undirected, acyclic graph.

Think of a family tree.

We’ll dive deeper into the DAG in articles to come.◾
