flambe.tokenizer

Package Contents
class flambe.tokenizer.Tokenizer [source]
Bases: flambe.Component

Base interface to a Tokenizer object.
Tokenizers implement the tokenize method, which takes a string as input and produces a list of strings as output.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output tokens, as a list of strings
Return type: List[str]
__call__(self, example: str)
Make a tokenizer callable.
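The interface above can be sketched with a minimal stand-in (illustrative only, not the flambe source): subclasses override tokenize, and __call__ simply delegates to it, which is what makes a tokenizer usable as a plain callable.

```python
from typing import List


class Tokenizer:
    """Minimal stand-in for flambe.tokenizer.Tokenizer (illustrative)."""

    def tokenize(self, example: str) -> List[str]:
        # Subclasses implement the actual tokenization logic.
        raise NotImplementedError

    def __call__(self, example: str) -> List[str]:
        # Calling the tokenizer delegates to tokenize.
        return self.tokenize(example)


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical subclass: splits on whitespace."""

    def tokenize(self, example: str) -> List[str]:
        return example.split()


t = WhitespaceTokenizer()
print(t("hi how are you?"))  # → ['hi', 'how', 'are', 'you?']
```

Because __call__ forwards to tokenize, an instance can be passed anywhere a plain `str -> List[str]` function is expected.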
class flambe.tokenizer.CharTokenizer [source]
Bases: flambe.tokenizer.Tokenizer

Implement a character level tokenizer.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output character tokens, as a list of strings
Return type: List[str]
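Character-level tokenization splits the string into its individual characters. An illustrative stand-alone equivalent (not the flambe implementation itself):

```python
from typing import List


def char_tokenize(example: str) -> List[str]:
    # Equivalent in behavior to CharTokenizer.tokenize:
    # every character of the input becomes one token.
    return list(example)


print(char_tokenize("hi!"))  # → ['h', 'i', '!']
```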
class flambe.tokenizer.WordTokenizer [source]
Bases: flambe.tokenizer.Tokenizer

Implement a word level tokenizer using nltk.tokenize.word_tokenize.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
class flambe.tokenizer.NLTKWordTokenizer(**kwargs) [source]
Bases: flambe.tokenizer.Tokenizer

Implement a word level tokenizer using nltk.tokenize.word_tokenize.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
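The word-level tokenizers above delegate to nltk.tokenize.word_tokenize, which splits punctuation off from words. A rough, dependency-free approximation of that behavior (a sketch only — the real word_tokenize handles many more cases, such as contractions and quote normalization):

```python
import re
from typing import List


def word_tokenize_sketch(example: str) -> List[str]:
    # Approximation of word-level tokenization: runs of word
    # characters are tokens, and each punctuation mark becomes
    # its own token. nltk.tokenize.word_tokenize is far more
    # sophisticated; this only illustrates the general idea.
    return re.findall(r"\w+|[^\w\s]", example)


print(word_tokenize_sketch("Hello, world!"))  # → ['Hello', ',', 'world', '!']
```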
class flambe.tokenizer.NGramsTokenizer(ngrams: Union[int, List[int]] = 1, exclude_stopwords: bool = False, stop_words: Optional[List] = None) [source]
Bases: flambe.tokenizer.Tokenizer

Implement an n-gram tokenizer.
Examples

>>> NGramsTokenizer(ngrams=2).tokenize("hi how are you?")
['hi how', 'how are', 'are you?']
>>> NGramsTokenizer(ngrams=[1, 2]).tokenize("hi how are you?")
['hi', 'how', 'are', 'you?', 'hi how', 'how are', 'are you?']
Parameters:
- ngrams (Union[int, List[int]]) – An int or a list of ints. If a list of ints, n-grams of every listed size are produced by the tokenizer.
- exclude_stopwords (bool) – Whether to exclude stopwords or not. See the related parameter stop_words.
- stop_words (Optional[List]) – List of stop words to exclude when exclude_stopwords is True. If None, defaults to nltk.corpus.stopwords.
static _tokenize(example: str, n: int)
Tokenize an input example using ngrams.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
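The n-gram logic behind the examples above can be sketched as follows (an illustrative re-implementation under the assumption that words are split on whitespace and n-grams are joined with spaces — not the flambe source, and stopword filtering is omitted):

```python
from typing import List, Union


def ngram_tokenize(example: str, ngrams: Union[int, List[int]] = 1) -> List[str]:
    # Normalize the ngrams argument to a list of sizes.
    sizes = [ngrams] if isinstance(ngrams, int) else ngrams
    words = example.split()
    out: List[str] = []
    for n in sizes:
        # Slide a window of length n over the word sequence and
        # join each window back into a single space-separated token.
        out.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return out


print(ngram_tokenize("hi how are you?", ngrams=2))
# → ['hi how', 'how are', 'are you?']
```

With ngrams=[1, 2], the unigrams for size 1 are emitted first, followed by the bigrams, matching the ordering in the Examples section.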
class flambe.tokenizer.BPETokenizer(codes_path: str) [source]
Bases: flambe.tokenizer.Tokenizer

Implement a subword level tokenizer using byte pair encoding. Tokenization is done using fastBPE (https://github.com/glample/fastBPE) and requires a fastBPE codes file.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output subword tokens, as a list of strings
Return type: List[str]
class flambe.tokenizer.LabelTokenizer(multilabel_sep: Optional[str] = None) [source]
Bases: flambe.tokenizer.Tokenizer

Base label tokenizer.
This object tokenizes string labels into a list of one or more elements, depending on the provided separator.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output tokens, as a list of strings
Return type: List[str]
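The separator-dependent behavior can be sketched as follows (an illustrative equivalent, not the flambe source): with no separator the whole string is one label; with a separator the string splits into multiple labels for multilabel targets.

```python
from typing import List, Optional


def label_tokenize(example: str, multilabel_sep: Optional[str] = None) -> List[str]:
    # Sketch of LabelTokenizer.tokenize: when no separator is given,
    # the entire string is returned as a single-element list;
    # otherwise the string is split on the separator.
    if multilabel_sep is None:
        return [example]
    return example.split(multilabel_sep)


print(label_tokenize("news"))                            # → ['news']
print(label_tokenize("news,sports", multilabel_sep=","))  # → ['news', 'sports']
```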