A Whitespace + Punctuation Tokenizer

Written by Ben Wendt

In my previous post, I discussed some tokenization techniques and mentioned that a whitespace-only tokenizer produces tokens that are sub-optimal for indexing. I also mentioned that a simple solution to this is to create a whitespace + punctuation tokenizer.
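To recap the problem, here is a quick sketch (my own example, not from the earlier post) of what a plain whitespace split leaves you with:

# A whitespace-only split keeps punctuation glued to the tokens, so "one."
# and "year." would be indexed as different terms than "one" and "year".
text = "'abc-123' is a cool one. It's far and away the best toy this year."
print(text.split())
# ["'abc-123'", 'is', 'a', 'cool', 'one.', "It's", 'far', 'and', 'away',
#  'the', 'best', 'toy', 'this', 'year.']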

So let’s take a look at how that might work.

import re

def whitespace_punctuation_tokenize(text, punctuation=r"[\.,\"']"):
    # Split on runs of whitespace, also swallowing any punctuation that
    # sits immediately before or after the whitespace.
    tokens = re.split(punctuation + r"*\s+" + punctuation + r"*", text)
    # The split only removes punctuation next to whitespace, so strip any
    # leading punctuation from the first token and any trailing
    # punctuation from the last one.
    if tokens[0]:
        tokens[0] = re.sub("^" + punctuation + "+", "", tokens[0])
    if tokens[-1]:
        tokens[-1] = re.sub(punctuation + "+$", "", tokens[-1])
    return tokens

You would run that code with something like this:

str = """'abc-123' is a cool one. It's far and away the
    ring-tossingest toy this year."""

print whitespace_punctuation_tokenize(str)
print whitespace_punctuation_tokenize(str, "[\.,\"]")

From this, you would see output like the following:

['abc-123', 'is', 'a', 'cool', 'one', "It's", 'far', 'and', 'away', 'the', 'ring-tossingest', 'toy', 'this', 'year']

["'abc-123'", 'is', 'a', 'cool', 'one', "It's", 'far', 'and', 'away', 'the', 'ring-tossingest', 'toy', 'this', 'year']

Here we see a function that does a regular expression split on an input string, with a configurable parameter for which characters to consider punctuation. In this way, we can specify how we want tokens to be delimited, with a bit more granular control.
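For instance, you could widen the punctuation class to also treat question marks and exclamation points as delimiters. The class and sentence below are just an illustration of that flexibility:

# A broader punctuation class, so '?' and '!' are stripped as well.
print(whitespace_punctuation_tokenize("Is it good? Yes, very good!", r"[\.,\"'!?]"))
# ['Is', 'it', 'good', 'Yes', 'very', 'good']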

The function makes this split, then strips any leading punctuation from the first element and any trailing punctuation from the last element. At this point you have higher-quality tokens to pass into your analysis chain.
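As a rough sketch of what that next step might look like, here is a toy analysis chain; the lowercasing and stop-word list are my own assumptions, not part of the tokenizer:

# A toy analysis step: lowercase each token and drop a few stop words
# before handing the terms to an index.
STOP_WORDS = {"a", "is", "the", "and", "this", "it's"}

def analyze(text):
    tokens = whitespace_punctuation_tokenize(text)
    return [t.lower() for t in tokens if t and t.lower() not in STOP_WORDS]

print(analyze(text))
# ['abc-123', 'cool', 'one', 'far', 'away', 'ring-tossingest', 'toy', 'year']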