I remembered reading about the soundex code used to encode similarly sounding names or names with similar spelling so that they could be retrieved easily.
This is what the soundex code for a word consists of:
- The first letter of the name.
- The letters A, E, I, O, U, Y, W and H are never used in encoding (except when they are the first character in a word)
- The following table lists the character codes:
| Code | Key Letters |
|---|---|
| 1 | B, P, F, V |
| 2 | C, S, K, G, J, Q, X, Z |
| 3 | D, T |
| 4 | L |
| 5 | M, N |
| 6 | R |
- Letters are not encoded if the preceeding letters have the same code
- Codes are truncated after the third digit.
- Trailing zeros are appended as needed.
So, here is the snippet that should generate a soundex code for a given word:
def codeGen(word):
word = word.upper()
code = [word[0]]
for i in range(1, len(word)):
if code_dict[word[i]] != "0":
if word[i] != word[i-1] and code_dict[word[i]] != code_dict[word[i-1]]:
code.append(str(code_dict[word[i]]))
else:
continue
else:
continue
if len(code) <= 4:
code += ["0"] * (4 - len(code))
else:
code = code[0:4]
return "".join(code)
#Tests
print codeGen("SCHAAK")
print codeGen("Lee")
print codeGen("Kuhne")
print codeGen("EBELL")
print codeGen("ebelson")
print codeGen("SCHAEFER")
The output:
$ python soundex.py S200 L000 K500 E140 E142 S160
This problem was part of the Berkeley Programming Contest held in 2007 at the University of California Berkeley.
Post a Comment