14. February 2015
A Whitespace + Punctuation Tokenizer
In my previous post, I discussed some tokenization techniques and mentioned that a whitespace-only tokenizer produces tokens that are sub-optimal for indexing. I also mentioned that a simple solution to this is to create a whitespace + punctuation tokenizer.
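To make the difference concrete, here is a minimal Python sketch of both approaches. It is only an illustration, not the implementation from this post: the function names are hypothetical, "punctuation" is assumed to mean the ASCII characters in Python's `string.punctuation`, and punctuation is assumed to be discarded rather than emitted as tokens of its own.

```python
import string


def whitespace_tokenize(text):
    """Split on runs of whitespace only; punctuation stays attached to words."""
    return text.split()


def whitespace_punctuation_tokenize(text):
    """Split on whitespace and ASCII punctuation, discarding both.

    Assumption: punctuation is treated purely as a delimiter here.
    """
    tokens = []
    current = []
    for ch in text:
        if ch.isspace() or ch in string.punctuation:
            # A delimiter ends the token being built, if there is one.
            if current:
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the final token when the text does not end with a delimiter.
    if current:
        tokens.append("".join(current))
    return tokens


sample = "Hello, world! How are you?"
print(whitespace_tokenize(sample))              # ['Hello,', 'world!', 'How', 'are', 'you?']
print(whitespace_punctuation_tokenize(sample))  # ['Hello', 'world', 'How', 'are', 'you']
```

With the whitespace-only version, a search for "world" would not match the indexed token "world!". The second version avoids that, at the cost of also splitting words that contain punctuation, such as "don't" into "don" and "t".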