flambe.tokenizer

Package Contents
class flambe.tokenizer.Tokenizer [source]
Bases: flambe.Component

Base interface to a Tokenizer object.
Tokenizers implement the tokenize method, which takes a string as input and produces a list of strings as output.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output tokens, as a list of strings
Return type: List[str]
__call__(self, example: str)
Make a tokenizer callable.
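The interface above can be sketched with a minimal stand-in (illustrative only, not the flambe source): subclasses override tokenize, and __call__ simply delegates to it, which is what makes a tokenizer usable as a plain callable.

```python
from typing import List


class Tokenizer:
    """Minimal stand-in for flambe.tokenizer.Tokenizer (illustrative)."""

    def tokenize(self, example: str) -> List[str]:
        # Subclasses implement the actual tokenization logic.
        raise NotImplementedError

    def __call__(self, example: str) -> List[str]:
        # Calling the tokenizer delegates to tokenize.
        return self.tokenize(example)


class WhitespaceTokenizer(Tokenizer):
    """Hypothetical subclass: splits on whitespace."""

    def tokenize(self, example: str) -> List[str]:
        return example.split()


t = WhitespaceTokenizer()
print(t("hi how are you?"))  # → ['hi', 'how', 'are', 'you?']
```

Because __call__ forwards to tokenize, an instance can be passed anywhere a plain `str -> List[str]` function is expected.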
class flambe.tokenizer.CharTokenizer [source]
Bases: flambe.tokenizer.Tokenizer

Implement a character level tokenizer.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output character tokens, as a list of strings
Return type: List[str]
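Character-level tokenization splits the string into its individual characters. An illustrative stand-alone equivalent (not the flambe implementation itself):

```python
from typing import List


def char_tokenize(example: str) -> List[str]:
    # Equivalent in behavior to CharTokenizer.tokenize:
    # every character of the input becomes one token.
    return list(example)


print(char_tokenize("hi!"))  # → ['h', 'i', '!']
```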
class flambe.tokenizer.WordTokenizer [source]
Bases: flambe.tokenizer.Tokenizer

Implement a word level tokenizer using nltk.tokenize.word_tokenize.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
class flambe.tokenizer.NLTKWordTokenizer(**kwargs) [source]
Bases: flambe.tokenizer.Tokenizer

Implement a word level tokenizer using nltk.tokenize.word_tokenize.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
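The word-level tokenizers above delegate to nltk.tokenize.word_tokenize, which splits punctuation off from words. A rough, dependency-free approximation of that behavior (a sketch only — the real word_tokenize handles many more cases, such as contractions and quote normalization):

```python
import re
from typing import List


def word_tokenize_sketch(example: str) -> List[str]:
    # Approximation of word-level tokenization: runs of word
    # characters are tokens, and each punctuation mark becomes
    # its own token. nltk.tokenize.word_tokenize is far more
    # sophisticated; this only illustrates the general idea.
    return re.findall(r"\w+|[^\w\s]", example)


print(word_tokenize_sketch("Hello, world!"))  # → ['Hello', ',', 'world', '!']
```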
class flambe.tokenizer.NGramsTokenizer(ngrams: Union[int, List[int]] = 1, exclude_stopwords: bool = False, stop_words: Optional[List] = None) [source]
Bases: flambe.tokenizer.Tokenizer

Implement an n-gram tokenizer.
Examples

>>> NGramsTokenizer(ngrams=2).tokenize("hi how are you?")
['hi how', 'how are', 'are you?']
>>> NGramsTokenizer(ngrams=[1, 2]).tokenize("hi how are you?")
['hi', 'how', 'are', 'you?', 'hi how', 'how are', 'are you?']
Parameters:
- ngrams (Union[int, List[int]]) – An int or a list of ints. If a list of ints, n-grams of every listed size are produced by the tokenizer.
- exclude_stopwords (bool) – Whether to exclude stopwords or not. See the related parameter stop_words.
- stop_words (Optional[List]) – List of stop words to exclude when exclude_stopwords is True. If None, defaults to nltk.corpus.stopwords.
static _tokenize(example: str, n: int)
Tokenize an input example using ngrams.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output word tokens, as a list of strings
Return type: List[str]
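The n-gram logic behind the examples above can be sketched as follows (an illustrative re-implementation under the assumption that words are split on whitespace and n-grams are joined with spaces — not the flambe source, and stopword filtering is omitted):

```python
from typing import List, Union


def ngram_tokenize(example: str, ngrams: Union[int, List[int]] = 1) -> List[str]:
    # Normalize the ngrams argument to a list of sizes.
    sizes = [ngrams] if isinstance(ngrams, int) else ngrams
    words = example.split()
    out: List[str] = []
    for n in sizes:
        # Slide a window of length n over the word sequence and
        # join each window back into a single space-separated token.
        out.extend(" ".join(words[i:i + n]) for i in range(len(words) - n + 1))
    return out


print(ngram_tokenize("hi how are you?", ngrams=2))
# → ['hi how', 'how are', 'are you?']
```

With ngrams=[1, 2], the unigrams for size 1 are emitted first, followed by the bigrams, matching the ordering in the Examples section.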
class flambe.tokenizer.BPETokenizer(codes_path: str) [source]
Bases: flambe.tokenizer.Tokenizer

Implement a subword level tokenizer using byte pair encoding. Tokenization is done using fastBPE (https://github.com/glample/fastBPE) and requires a fastBPE codes file.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output subword tokens, as a list of strings
Return type: List[str]
class flambe.tokenizer.LabelTokenizer(multilabel_sep: Optional[str] = None) [source]
Bases: flambe.tokenizer.Tokenizer

Base label tokenizer.
This object tokenizes string labels into a list of one or more elements, depending on the provided separator.
tokenize(self, example: str)
Tokenize an input example.
Parameters: example (str) – The input example, as a string
Returns: The output tokens, as a list of strings
Return type: List[str]
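The separator-dependent behavior can be sketched as follows (an illustrative equivalent, not the flambe source): with no separator the whole string is one label; with a separator the string splits into multiple labels for multilabel targets.

```python
from typing import List, Optional


def label_tokenize(example: str, multilabel_sep: Optional[str] = None) -> List[str]:
    # Sketch of LabelTokenizer.tokenize: when no separator is given,
    # the entire string is returned as a single-element list;
    # otherwise the string is split on the separator.
    if multilabel_sep is None:
        return [example]
    return example.split(multilabel_sep)


print(label_tokenize("news"))                            # → ['news']
print(label_tokenize("news,sports", multilabel_sep=","))  # → ['news', 'sports']
```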