An Anagram Detector

Yesterday, I tackled the last problem in the “Aha! Algorithms” chapter of Jon Bentley’s “Programming Pearls”, an anagram detector. It takes a file consisting of words separated by a line and return lists containing anagrams separated by a line. It does not construct anagrams though and just hunts down a file for them.

The first step in building this includes an element of identity common to anagrams. This is called a signature. All anagrams belonging to the same class have the same signature. I chose my signature to be a string that contains letters followed by the number of times they appear in the word (the letters are in alphabetical order). Example:

"alphabet" - "a2b1e1h1l1p1t1"

The next step involves using a script that will bring together words with the same signature and these obviously are anagrams. Finally, a function prints these anagrams.

This portion of Python code generates the signature:

def sign(word):
        char_list = list(word)
        char_list.sort()
        word_list = []
        for char in char_list:
                if not char in word_list:
                        word_list.extend([char, str(word.count(char))])
                else:
                        continue
        signature = "".join(word_list)
        return signature

Next comes the function that is supposed to bring the anagrams together (so to speak):

def anagramCollect(filename):
        """Obtain Sinatures and Create a Dict.
        key = the signature
        value = a list of words with same signature"
""
        f = open(filename, "r")
        ana_dict = {}
        for word in f:
                signature = sign(word[:-1])
                if ana_dict.has_key(signature):
                        ana_dict[signature].append(word[:-1])
                else:
                        ana_dict[signature] = [word[:-1]]
        f.close()
        return ana_dict

In the above function, I used word[:-1] to avoid appending the newline character that seems to be unavoidable while using “for word in f:”.

Finally, a function to check the anagram dictionary for anagrams (should include those values whose length is more than 1) and to print them out.

def printAnagram(ana_dict):
        ana_list = ana_dict.values()
        for value in ana_list:
                if len(value) > 1:
                        print value
                else:
                        continue

I decided to test this script on a file containing words separated by lines. I used the following words:

pans
pots
opt
snap
stop
tops

When I ran the script, I got:

 python anagram.py
['pans', 'snap']
['pots', 'stop', 'tops']

So, it works! The script finally looks like this.

Well, Happy Coding :D

Comments 6

  1. Mike Pirnat wrote:

    To get around the repeated slicing of the last character in your word, you could always do:

    for word in f:
    word = word.strip()

    Posted 14 Jun 2008 at 12:46 pm
  2. DiThi wrote:

    Here’s my signature maker:

    def sign(word):
    x=list(word)
    x.sort()
    return ”.join(x)

    It’s a different format of signature but I’m sure it’s faster.

    Posted 16 Jun 2008 at 2:58 am
  3. Ben wrote:

    @DiThi: good call.

    If you’re using a recent Python, consider just doing: “”.join(sorted(WORD)).

    Posted 16 Jun 2008 at 11:47 am
  4. J wrote:

    To be even lazy, we can do:

    from itertools import groupby

    def sign(xs):
    return “”.join(x + str(len(list(g)))
    for x, g in groupby(sorted(xs)))

    Posted 22 Jun 2008 at 2:38 am
  5. Justin wrote:

    def sign(word):
    #can’t hash a list
    return tuple(sorted(word))

    def anagramCollect(filename):
    f = open(filename, “r”)
    ana_dict = {}
    for word in f:
    word = word.rstrip()
    signature = sign(word)
    ana_dict.setdefault(signature,[]).append(word)
    f.close()
    return ana_dict

    Posted 24 Jun 2008 at 7:15 pm
  6. Maarten van Emden wrote:

    I found your blog when searching for the source of the anagram program I vaguely remembered. So it was Bentley. Thanks. I also found http://pramode.net/2005/11/18/teaching-tips-the-anagram-problem-solved-in-unix-style/
    Programming it is fun, but it is also fun to do it without programming, which is what Bentley showed. Bentley tells how Tom Cargill (another person worth knowing) describes the method: “first sort this way (moves hand from left to right), then sort that way (moves hand up and down).

    Posted 12 Jul 2008 at 7:00 pm

Trackbacks & Pingbacks 1

  1. From Shriphani’s Weblog - Huge Anagram List on 21 Aug 2008 at 10:51 pm

    [...] got on this blog) over to my account. Then, I ran a very huge file full of words through my anagram detector and the results were astonishing, a record 1.93 seconds to parse over 10000 words and pick [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *