`flambe.tokenizer.word`

Module Contents

class flambe.tokenizer.word.WordTokenizer

Implement a word-level tokenizer.

tokenize(self, example: str)

Tokenize an input example.

Parameters:	example (str) – The input example, as a string
Returns:	The output word tokens, as a list of strings
Return type:	List[str]
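To illustrate the kind of output a word-level tokenizer produces, here is a minimal sketch that splits on whitespace. `simple_word_tokenize` is a hypothetical helper written for this example, not part of flambe, and whitespace splitting is an assumption: the library's `WordTokenizer` may handle punctuation differently.

```python
from typing import List


def simple_word_tokenize(example: str) -> List[str]:
    """Hypothetical stand-in for a word-level tokenizer (not flambe's code)."""
    # Assumption: words are separated by runs of whitespace, and
    # punctuation stays attached to the neighbouring word.
    return example.split()


tokens = simple_word_tokenize("hi how are you?")
print(tokens)  # ['hi', 'how', 'are', 'you?']
```

The return value has the same shape as the documented one: a plain list of strings, one per word.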

class flambe.tokenizer.word.NGramsTokenizer(ngrams: Union[int, List[int]] = 1)

Implement an n-gram tokenizer.

Examples

>>> NGramsTokenizer(ngrams=2).tokenize("hi, how are you?")
['hi, how', 'how are', 'are you?']

>>> NGramsTokenizer(ngrams=[1,2]).tokenize("hi, how are you?")
['hi,', 'how', 'are', 'you?', 'hi, how', 'how are', 'are you?']

Parameters:	ngrams (Union[int, List[int]]) – The n-gram size, as an int or a list of ints. If a list is given, the tokenizer produces the n-grams for every size in the list. Defaults to 1.

static _tokenize(example: str, n: int)

Tokenize an input example using n-grams.
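The idea behind single-size n-gram tokenization can be sketched with a sliding window over the words. This is an illustrative sketch, not flambe's implementation: `ngram_tokens` is a hypothetical name, and splitting on whitespace is an assumption.

```python
from typing import List


def ngram_tokens(example: str, n: int) -> List[str]:
    """Hypothetical sketch of n-gram tokenization (not flambe's code)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    # Assumption: words are obtained by splitting on whitespace.
    words = example.split()
    # Slide a window of n words over the sequence and join each window
    # with a single space. Inputs shorter than n yield an empty list.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


print(ngram_tokens("hi how are you?", 2))  # ['hi how', 'how are', 'are you?']
```

A sequence of k words yields k - n + 1 n-grams, so the four-word input above gives three bigrams.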

tokenize(self, example: str)

Tokenize an input example.

Parameters:	example (str) – The input example, as a string.
Returns:	The output n-gram tokens, as a list of strings
Return type:	List[str]
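When `ngrams` is a list, the results for each size can be combined into one flat list. The sketch below shows one way to do this under stated assumptions: `multi_ngram_tokens` is a hypothetical function, words are split on whitespace, and sizes are processed in the order given (as the example output for `ngrams=[1,2]` suggests). It is not the library's code.

```python
from typing import List, Union


def multi_ngram_tokens(example: str, ngrams: Union[int, List[int]] = 1) -> List[str]:
    """Hypothetical sketch: n-grams for one size or several (not flambe's code)."""
    # Normalise the argument so a single int and a list are handled alike.
    sizes = [ngrams] if isinstance(ngrams, int) else list(ngrams)
    # Assumption: words are obtained by splitting on whitespace.
    words = example.split()
    tokens: List[str] = []
    for n in sizes:
        # Append all n-grams of this size before moving to the next size.
        tokens.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return tokens


print(multi_ngram_tokens("hi how are you?", [1, 2]))
# ['hi', 'how', 'are', 'you?', 'hi how', 'how are', 'are you?']
```

Normalising `ngrams` to a list up front keeps the loop identical for both accepted argument types.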