Disjoint Set Data Structure

The last two weeks didn’t really offer a lot of challenges considering the semester is over and I didn’t grow a big enough pair to actually leave the practice rooms in TopCoder and go participate. So, I decided to revise some of the most interesting stuff I learned in my CS 381 class. And since I really liked the stuff we did with Disjoint Sets (not a lot of stuff, just kruskal’s algorithm but it still is pretty neat).We typically define 3 operations on disjoint sets:

  • MakeSet(x) : Create a set with the element ‘x’ in it.
  • FindSet(x) : Return the set that the element belongs to
  • Union(x, y) : Make a new set defined as :  where and   and destroy X and Y.

Implementation Using A Linked List:

Here each set is represented by placing its elements in the nodes of a linked list. In each node, we also store a pointer to the parent which contains the name of each set and some other information like the number of elements in the set, the name and so on.

Running Time For Each Operation:

MakeSet(x) : Constant Time

FindSet(x) : Return the parent that the parent pointer points to

Union(x, y)  :  Assume X is the set x resides in and Y is the set y resides in. Now, if we append the elements of X to Y, we would need to walk through X and set the parent pointers to Y. So, 1 union operation takes O(n) time.

Special Heuristics:

The Union(x, y) implementation gets a slight change here. When deciding which set to append to the other in the union operation, we append that set which has fewer elements to the other set. The advantage is that the walk through the list to change the parent takes lesser time.

Implementation Using A Tree:

We represent a set by a rooted tree where each node points to its parent (not parent -> children but the other way round). We define a special term here called rank. The rank of a set is an upper bound on the height of the tree. Why it is an upper bound will be evident soon.

Running Time Per Operation:

MakeSet(x) : Constant Time. Let x be a node whose parent is itself

FindSet(x) : O(log(n)) time. Basically need to go from x to the root.

Union(x, y): Constant time. We just make the root of either X or Y (the sets x and y belong to respectively) the parent of the other root.

Special Heuristics:

We use a heuristic here called path compression. When we run FindSet(x), we typically traverse the path from x to the root. With path compression, we basically make all nodes on this path the immediate children of the root. The advantage? Subsequent FindSet() operations on these nodes take constant time. This might modify the actual height (as you can see). BUT YOU DON’T CHANGE THE RANK HERE. Hence the rank is an “upper-bound” on the height. If we didn’t have the path-compression heuristic, the rank would have been the height.

So, the code follows. First the linked list implementation:

Here are the class definitions for the Nodes, the List and the Universe

valueNodeDict = {}

class Node():
        ‘Represents a node of the linked list’

        def __init__(self, value, parent, next=None):
                self.value = value
                self.next = next
                self.parent = parent
                valueNodeDict[value] = self

        def __str__(self):
                return str(self.value)

class List():
        ‘Represents the linked list itself’

        def __init__(self, name, value=None):
                ‘Creates a set and sets head and tail to the appropriate nodes’
                self.name = name
                self.count = 0
                if value is None:
                        self.head = None
                        self.tail = None

                else:
                        self.head = Node(value, self)
                        self.tail = self.head
                        self.count += 1

        def __str__(self):
                ‘Lists out all the elements’
                s = self.name + ": "
                node = self.head
                while node != None:
                        s += str(node) + ", "
                        node = node.next
                return s + "\n"

        def addElement(self, node):
                ‘Add another node’
                self.tail.next = node
                self.tail = self.tail.next
                self.count += 1

        def changeParent(self, newParent):
                ‘Changes the parent of every node in the set’
                node = self.head
                while node is not None:
                        node.parent = newParent
                        node = node.next

class Universe():
        ‘Operations you can perform in the universe’
        def __init__(self, name="U"):
                ‘Creates the universe where all your subsequent sets belong’
                self.name = name
                self.setCount = 0
                self.sets = []

        def __str__(self):
                s = self.name + ":\n"
                for set in self.__sets:
                        s += "\t" + str(set) + "\n"
                return s

        def addSet(self, name=None, value=None):
                ‘Adds a set to the universe’
                if name is None:
                        self.__sets.append(List("S"+str(self.setCount), value))
                        self.setCount += 1
                else:
                        self.__sets.add(List(name, value))

U = Universe()
 

As you can see, I have a valueNodeDict dictionary in the implementation. The idea is that I should be able to use the values themselves in the operations and not the nodes.

Next, the operations themselves:

def MakeSet(x, name=None):
        ‘Makes a new Linked List’
        U.addSet(name, x)

def FindSet(x):
        ‘Find The Set This Node Belongs To’
        return valueNodeDict[x].parent

def Union(x, y):
        ‘Destructively Perform The Union’
        if valueNodeDict[x].parent is not valueNodeDict[y].parent:
                x_set = valueNodeDict[x].parent
                y_set = valueNodeDict[y].parent

                if x_set.count < y_set.count:
                        #x gets appended to y
                        y_set.count += x_set.count
                        x_set.changeParent(y_set)
                        y_set.tail.next = x_set.head
                        y_set.tail = x_set.tail
                        U.sets.remove(x_set)
                else:
                        #y gets appended to x
                        x_set.count += y_set.count
                        y_set.changeParent(x_set)
                        x_set.tail.next = y_set.head
                        x_set.tail = y_set.tail
                        U.sets.remove(y_set)
 

Now, the tree implementation:

valueNodeDict = {}

class Node():
        ‘Represents a node in a tree’
        def __init__(self, value, parent, rank):
                self.value = value
                self.parent = parent
                self.rank = rank
                valueNodeDict[value] = self

        def __str__(self):
                return str(value)

class Universe():
        ‘The Universe where all sets sit’
        def __init__(self):
                self.sets = [] #list of all root nodes

        def addSet(self, root):
                self.sets.append(root)

U = Universe()
 

And the operations:

def internalFindSet(x):
        ‘Find The Root Of The Tree That Contains This Node. Uses path compression’
        if x.parent is not x:
                x.parent = internalFindSet(x.parent)
        return x.parent

def MakeSet(x):
        ‘Make a new node whose parent is itself’
        a = Node(x, None, 0)
        a.parent = a
        U.addSet(a)

def FindSet(x):
        ‘Returns the root of the tree which contains this value’
        x_node = valueNodeDict[x]
        return internalFindSet(x_node)

def Union(x, y):
        ‘Destructively Unite X and Y where x belongs to X and y to Y’
        x_set = FindSet(x)
        y_set = FindSet(y)

        if x_set.rank > y_set.rank:
                y_set.parent = x_set
        else:
                x_set.parent = y_set
                if x_set.rank == y_set.rank:
                        y_set.rank += 1
 

As an added exercise, I also threw in Kruskal’s algorithm for finding the Minimum Spanning Tree of a Graph. Here is my graph implementation which just consists of an edge-list and a vertex-list:

import heapq
from disjoint_set2 import *

class Vertex:
        def __init__(self, value):
                self.value = value

        def __str__(self):
                return self.value

class Edge:
        def __init__(self, vertex1, vertex2, weight):
                self.vertex1 = vertex1
                self.vertex2 = vertex2
                self.weight = weight

        def __str__(self):
                s = "Connects: " + str(self.vertex1) + " and " + str(self.vertex2) + "\tWeight: " + str(self.weight)
                return s

class Graph:
        def __init__(self, V, E):
                self.V = V
                self.E = E
 

I decided to create a small graph with nodes: ‘a’, ‘b’, ‘c’, ‘d’ and ‘e’. The code to do that:

characters = [‘a’, ‘b’, ‘c’, ‘d’, ‘e’]

vertices = []
edges = []

for char in characters:
        vertices.append(Vertex(char))
 

Next, I decided to connect every vertex to every other vertex using edges whose weights are in increasing order as follows:

Edge from ‘a’ to ‘a’ has weight 0

From ‘a’ to ‘b’ we get 1

From ‘a’ to ‘c’ we get 2

and so on, you get the idea. The plan was to get a verifiable result straight away.

for vertex1 in vertices:
        for vertex2 in vertices:
                edges.append(Edge(vertex1, vertex2, i))
                i += 1

G = Graph(vertices, edges)
 

Now, we prepare the required data structures for this algorithm. We need a priority queue. Python 2.6.5 ships with a heapq module. Using this module, we can maintain a heap in an array and the desired “bubble up” and “bubble down” operations are performed by the heapq module’s routines. As it is known to mankind, the root of any heap contains the element with the smallest key (assuming the heap is storing (key, value) pairs and is a min-heap). So, we finally have:

def kruskal(G):
        heap = []

        for edge in G.E:
                heapq.heappush(heap, (edge.weight, edge))

        T = []  #this contains all the edges in the tree

        # run makeset on all the vertices

        for vertex in G.V:
                MakeSet(vertex)

        while heap:
                min_edge = heapq.heappop(heap)[1]
                if FindSet(min_edge.vertex1) is not FindSet(min_edge.vertex2):
                        # perform a union and add this edge to the Tree
                        T.append(min_edge)
                        Union(min_edge.vertex1, min_edge.vertex2)
        return T
 

So, I return a list of edges in the MST. When I run this on the graph created above, I get:

(shriphani@Shriphani-Palakodetys-MacBook-Pro) (556/ttys000) (09:03P:05/19/10) -
(%:~/scripts) > python kruskal.py
Connects: a and b	Weight: 1
Connects: a and c	Weight: 2
Connects: a and d	Weight: 3
Connects: a and e	Weight: 4

Which when checked is the actual MST.

Analysis

Assume that we have n makeset operations out of m overall operations, with the special heuristic we use, anytime a list is chosen for appending to another list, we observe that this list has to have fewer elements than the other list.

So, this is the pattern we have:

  • If there is one element in the list and this list is chosen for appending to the other list, then the resulting size would be at least 2
  • If the current size is 2 and we append this list to another list, the resulting size will be at least 4
  • Once we append a list of size 4 to another list, the resulting list would have a size of 8.

So, we observe that in 3 append operations, we approached a size of 8. So, in lg(8) operations, we approached size 8. So, we would approach size ‘n’ (the kruskal() routine obtains the MST on a connected graph at this stage) in lg(n) operations.

Also, there need to be n-1 union operations since the Universe finally contains just 1 set with all vertices in it (assuming you have a connected graph). There need to be n-1 union operations for the universe to approach this state. So, a final running time is O(n * lg(n)) for this.

Now, if the graph is pretty sparse, as in 5 vertices and a single edge in the entire graph, the other operations would dominate. So, the actual running time is

O(m + n*lg(n))

With the tree based structure, our course didn’t cover the analysis but we were told that the running time was a cool O(m * alpha(n)) where alpha(n) <= 4 for  most circumstances.

The code can be obtained at:

In this post I used material from Cormen-Leiserson-Rivest-Stein’s amazing book (It is the best book I’ve ever read. Please go get a copy. You won’t regret it). And of course, stuff from Professor GNF’s CS 381 class. It is the best CS course I’ve taken thus far.


About this entry