Yesterday, I tackled the last problem in the “Aha! Algorithms” chapter of Jon Bentley’s “Programming Pearls”, an anagram detector. It takes a file consisting of words separated by a line and return lists containing anagrams separated by a line. It does not construct anagrams though and just hunts down a file for them.
The first step in building this includes an element of identity common to anagrams. This is called a signature. All anagrams belonging to the same class have the same signature. I chose my signature to be a string that contains letters followed by the number of times they appear in the word (the letters are in alphabetical order). Example:
"alphabet" - "a2b1e1h1l1p1t1"
The next step involves using a script that will bring together words with the same signature and these obviously are anagrams. Finally, a function prints these anagrams.
This portion of Python code generates the signature:
char_list = list(word)
char_list.sort()
word_list = []
for char in char_list:
if not char in word_list:
word_list.extend([char, str(word.count(char))])
else:
continue
signature = "".join(word_list)
return signature
Next comes the function that is supposed to bring the anagrams together (so to speak):
"""Obtain Sinatures and Create a Dict.
key = the signature
value = a list of words with same signature"""
f = open(filename, "r")
ana_dict = {}
for word in f:
signature = sign(word[:-1])
if ana_dict.has_key(signature):
ana_dict[signature].append(word[:-1])
else:
ana_dict[signature] = [word[:-1]]
f.close()
return ana_dict
In the above function, I used word[:-1] to avoid appending the newline character that seems to be unavoidable while using “for word in f:”.
Finally, a function to check the anagram dictionary for anagrams (should include those values whose length is more than 1) and to print them out.
ana_list = ana_dict.values()
for value in ana_list:
if len(value) > 1:
print value
else:
continue
I decided to test this script on a file containing words separated by lines. I used the following words:
pans pots opt snap stop tops
When I ran the script, I got:
python anagram.py ['pans', 'snap'] ['pots', 'stop', 'tops']
So, it works! The script finally looks like this.
Well, Happy Coding ![]()



7 comments ↓
To get around the repeated slicing of the last character in your word, you could always do:
for word in f:
word = word.strip()
…
Here’s my signature maker:
def sign(word):
x=list(word)
x.sort()
return ”.join(x)
It’s a different format of signature but I’m sure it’s faster.
@DiThi: good call.
If you’re using a recent Python, consider just doing: “”.join(sorted(WORD)).
To be even lazy, we can do:
from itertools import groupby
def sign(xs):
return “”.join(x + str(len(list(g)))
for x, g in groupby(sorted(xs)))
def sign(word):
#can’t hash a list
return tuple(sorted(word))
def anagramCollect(filename):
f = open(filename, “r”)
ana_dict = {}
for word in f:
word = word.rstrip()
signature = sign(word)
ana_dict.setdefault(signature,[]).append(word)
f.close()
return ana_dict
I found your blog when searching for the source of the anagram program I vaguely remembered. So it was Bentley. Thanks. I also found http://pramode.net/2005/11/18/teaching-tips-the-anagram-problem-solved-in-unix-style/
Programming it is fun, but it is also fun to do it without programming, which is what Bentley showed. Bentley tells how Tom Cargill (another person worth knowing) describes the method: “first sort this way (moves hand from left to right), then sort that way (moves hand up and down).
[...] got on this blog) over to my account. Then, I ran a very huge file full of words through my anagram detector and the results were astonishing, a record 1.93 seconds to parse over 10000 words and pick [...]
Leave a Comment