Disjoint Set Data Structure
The last two weeks didn’t really offer a lot of challenges considering the semester is over and I didn’t grow a big enough pair to actually leave the practice rooms in TopCoder and go participate. So, I decided to revise some of the most interesting stuff I learned in my CS 381 class. And since I really liked the stuff we did with Disjoint Sets (not a lot of stuff, just kruskal’s algorithm but it still is pretty neat).We typically define 3 operations on disjoint sets:
- MakeSet(x) : Create a set with the element ‘x’ in it.
- FindSet(x) : Return the set that the element belongs to
- Union(x, y) : Make a new set defined as :
where
and
and destroy X and Y.
Implementation Using A Linked List:
Here each set is represented by placing its elements in the nodes of a linked list. In each node, we also store a pointer to the parent which contains the name of each set and some other information like the number of elements in the set, the name and so on.
Running Time For Each Operation:
MakeSet(x) : Constant Time
FindSet(x) : Return the parent that the parent pointer points to
Union(x, y) : Assume X is the set x resides in and Y is the set y resides in. Now, if we append the elements of X to Y, we would need to walk through X and set the parent pointers to Y. So, 1 union operation takes O(n) time.
Special Heuristics:
The Union(x, y) implementation gets a slight change here. When deciding which set to append to the other in the union operation, we append that set which has fewer elements to the other set. The advantage is that the walk through the list to change the parent takes lesser time.
Implementation Using A Tree:
We represent a set by a rooted tree where each node points to its parent (not parent -> children but the other way round). We define a special term here called rank. The rank of a set is an upper bound on the height of the tree. Why it is an upper bound will be evident soon.
Running Time Per Operation:
MakeSet(x) : Constant Time. Let x be a node whose parent is itself
FindSet(x) : O(log(n)) time. Basically need to go from x to the root.
Union(x, y): Constant time. We just make the root of either X or Y (the sets x and y belong to respectively) the parent of the other root.
Special Heuristics:
We use a heuristic here called path compression. When we run FindSet(x), we typically traverse the path from x to the root. With path compression, we basically make all nodes on this path the immediate children of the root. The advantage? Subsequent FindSet() operations on these nodes take constant time. This might modify the actual height (as you can see). BUT YOU DON’T CHANGE THE RANK HERE. Hence the rank is an “upper-bound” on the height. If we didn’t have the path-compression heuristic, the rank would have been the height.
So, the code follows. First the linked list implementation:
Here are the class definitions for the Nodes, the List and the Universe
class Node():
”‘Represents a node of the linked list’”
def __init__(self, value, parent, next=None):
self.value = value
self.next = next
self.parent = parent
valueNodeDict[value] = self
def __str__(self):
return str(self.value)
class List():
”‘Represents the linked list itself’”
def __init__(self, name, value=None):
”‘Creates a set and sets head and tail to the appropriate nodes’”
self.name = name
self.count = 0
if value is None:
self.head = None
self.tail = None
else:
self.head = Node(value, self)
self.tail = self.head
self.count += 1
def __str__(self):
”‘Lists out all the elements’”
s = self.name + ": "
node = self.head
while node != None:
s += str(node) + ", "
node = node.next
return s + "\n"
def addElement(self, node):
”‘Add another node’”
self.tail.next = node
self.tail = self.tail.next
self.count += 1
def changeParent(self, newParent):
”‘Changes the parent of every node in the set’”
node = self.head
while node is not None:
node.parent = newParent
node = node.next
class Universe():
”‘Operations you can perform in the universe’”
def __init__(self, name="U"):
”‘Creates the universe where all your subsequent sets belong’”
self.name = name
self.setCount = 0
self.sets = []
def __str__(self):
s = self.name + ":\n"
for set in self.__sets:
s += "\t" + str(set) + "\n"
return s
def addSet(self, name=None, value=None):
”‘Adds a set to the universe’”
if name is None:
self.__sets.append(List("S"+str(self.setCount), value))
self.setCount += 1
else:
self.__sets.add(List(name, value))
U = Universe()
As you can see, I have a valueNodeDict dictionary in the implementation. The idea is that I should be able to use the values themselves in the operations and not the nodes.
Next, the operations themselves:
”‘Makes a new Linked List’”
U.addSet(name, x)
def FindSet(x):
”‘Find The Set This Node Belongs To’”
return valueNodeDict[x].parent
def Union(x, y):
”‘Destructively Perform The Union’”
if valueNodeDict[x].parent is not valueNodeDict[y].parent:
x_set = valueNodeDict[x].parent
y_set = valueNodeDict[y].parent
if x_set.count < y_set.count:
#x gets appended to y
y_set.count += x_set.count
x_set.changeParent(y_set)
y_set.tail.next = x_set.head
y_set.tail = x_set.tail
U.sets.remove(x_set)
else:
#y gets appended to x
x_set.count += y_set.count
y_set.changeParent(x_set)
x_set.tail.next = y_set.head
x_set.tail = y_set.tail
U.sets.remove(y_set)
Now, the tree implementation:
class Node():
”‘Represents a node in a tree’”
def __init__(self, value, parent, rank):
self.value = value
self.parent = parent
self.rank = rank
valueNodeDict[value] = self
def __str__(self):
return str(value)
class Universe():
”‘The Universe where all sets sit’”
def __init__(self):
self.sets = [] #list of all root nodes
def addSet(self, root):
self.sets.append(root)
U = Universe()
And the operations:
”‘Find The Root Of The Tree That Contains This Node. Uses path compression’”
if x.parent is not x:
x.parent = internalFindSet(x.parent)
return x.parent
def MakeSet(x):
”‘Make a new node whose parent is itself’”
a = Node(x, None, 0)
a.parent = a
U.addSet(a)
def FindSet(x):
”‘Returns the root of the tree which contains this value’”
x_node = valueNodeDict[x]
return internalFindSet(x_node)
def Union(x, y):
”‘Destructively Unite X and Y where x belongs to X and y to Y’”
x_set = FindSet(x)
y_set = FindSet(y)
if x_set.rank > y_set.rank:
y_set.parent = x_set
else:
x_set.parent = y_set
if x_set.rank == y_set.rank:
y_set.rank += 1
As an added exercise, I also threw in Kruskal’s algorithm for finding the Minimum Spanning Tree of a Graph. Here is my graph implementation which just consists of an edge-list and a vertex-list:
from disjoint_set2 import *
class Vertex:
def __init__(self, value):
self.value = value
def __str__(self):
return self.value
class Edge:
def __init__(self, vertex1, vertex2, weight):
self.vertex1 = vertex1
self.vertex2 = vertex2
self.weight = weight
def __str__(self):
s = "Connects: " + str(self.vertex1) + " and " + str(self.vertex2) + "\tWeight: " + str(self.weight)
return s
class Graph:
def __init__(self, V, E):
self.V = V
self.E = E
I decided to create a small graph with nodes: ‘a’, ‘b’, ‘c’, ‘d’ and ‘e’. The code to do that:
vertices = []
edges = []
for char in characters:
vertices.append(Vertex(char))
Next, I decided to connect every vertex to every other vertex using edges whose weights are in increasing order as follows:
Edge from ‘a’ to ‘a’ has weight 0
From ‘a’ to ‘b’ we get 1
From ‘a’ to ‘c’ we get 2
and so on, you get the idea. The plan was to get a verifiable result straight away.
for vertex2 in vertices:
edges.append(Edge(vertex1, vertex2, i))
i += 1
G = Graph(vertices, edges)
Now, we prepare the required data structures for this algorithm. We need a priority queue. Python 2.6.5 ships with a heapq module. Using this module, we can maintain a heap in an array and the desired “bubble up” and “bubble down” operations are performed by the heapq module’s routines. As it is known to mankind, the root of any heap contains the element with the smallest key (assuming the heap is storing (key, value) pairs and is a min-heap). So, we finally have:
heap = []
for edge in G.E:
heapq.heappush(heap, (edge.weight, edge))
T = [] #this contains all the edges in the tree
# run makeset on all the vertices
for vertex in G.V:
MakeSet(vertex)
while heap:
min_edge = heapq.heappop(heap)[1]
if FindSet(min_edge.vertex1) is not FindSet(min_edge.vertex2):
# perform a union and add this edge to the Tree
T.append(min_edge)
Union(min_edge.vertex1, min_edge.vertex2)
return T
So, I return a list of edges in the MST. When I run this on the graph created above, I get:
(shriphani@Shriphani-Palakodetys-MacBook-Pro) (556/ttys000) (09:03P:05/19/10) - (%:~/scripts) > python kruskal.py Connects: a and b Weight: 1 Connects: a and c Weight: 2 Connects: a and d Weight: 3 Connects: a and e Weight: 4
Which when checked is the actual MST.
Analysis
Assume that we have n makeset operations out of m overall operations, with the special heuristic we use, anytime a list is chosen for appending to another list, we observe that this list has to have fewer elements than the other list.
So, this is the pattern we have:
- If there is one element in the list and this list is chosen for appending to the other list, then the resulting size would be at least 2
- If the current size is 2 and we append this list to another list, the resulting size will be at least 4
- Once we append a list of size 4 to another list, the resulting list would have a size of 8.
So, we observe that in 3 append operations, we approached a size of 8. So, in lg(8) operations, we approached size 8. So, we would approach size ‘n’ (the kruskal() routine obtains the MST on a connected graph at this stage) in lg(n) operations.
Also, there need to be n-1 union operations since the Universe finally contains just 1 set with all vertices in it (assuming you have a connected graph). There need to be n-1 union operations for the universe to approach this state. So, a final running time is O(n * lg(n)) for this.
Now, if the graph is pretty sparse, as in 5 vertices and a single edge in the entire graph, the other operations would dominate. So, the actual running time is
O(m + n*lg(n))
With the tree based structure, our course didn’t cover the analysis but we were told that the running time was a cool O(m * alpha(n)) where alpha(n) <= 4 for most circumstances.
The code can be obtained at:
In this post I used material from Cormen-Leiserson-Rivest-Stein’s amazing book (It is the best book I’ve ever read. Please go get a copy. You won’t regret it). And of course, stuff from Professor GNF’s CS 381 class. It is the best CS course I’ve taken thus far.
About this entry
You’re currently reading “Disjoint Set Data Structure,” an entry on Shriphani Palakodety
- Published:
- 05.20.10 / 1am
- Category:
- Computer Science, python
No comments
Jump to comment form | comments rss [?] | trackback uri [?]