flambe.field
Package Contents
-
class
flambe.field.
Field
Bases:
flambe.Component
Base Field interface.
A field processes raw examples and produces Tensors.
-
setup
(self, *data: np.ndarray)
Set up the field.
This method will be called with all the data in the dataset and it can be used to compute aggregated information (for example, vocabulary in Fields that process text).
ATTENTION: this method may be called multiple times when the same field is used in different datasets. Take this into account and build a stateful implementation.
Parameters: *data (np.ndarray) – Multiple 2D arrays (e.g. train_data, dev_data, test_data). The first dimension indexes the examples; the second indexes the columns specified for this field.
-
process
(self, *example: Any)
Process an example into a Tensor or a tuple of Tensors.
This method allows N to M mappings from example columns (N) to tensors (M).
Parameters: *example (Any) – Column values of the example
Returns: The processed example, as a tensor or tuple of tensors
Return type: Union[torch.Tensor, Tuple[torch.Tensor, ...]]
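The contract described above (setup may run more than once, process maps N columns to M outputs) can be sketched without the library. The class below is a hypothetical stand-in named ToyField, not flambe's implementation; it uses plain Python lists and integers instead of np.ndarray and torch.Tensor so that it runs on its own.

```python
from typing import Any, Dict, Tuple


class ToyField:
    """Hypothetical stand-in for a Field subclass (does not import flambe)."""

    def __init__(self) -> None:
        # State lives on the instance so that repeated setup() calls extend it.
        self.vocab: Dict[str, int] = {}

    def setup(self, *data) -> None:
        # Each argument is a 2D collection: rows are examples, columns are
        # the columns assigned to this field.
        for dataset in data:
            for row in dataset:
                for value in row:
                    # setdefault keeps existing ids stable across calls.
                    self.vocab.setdefault(str(value), len(self.vocab))

    def process(self, *example: Any) -> Tuple[int, ...]:
        # N input columns map to M outputs; here N == M for simplicity.
        return tuple(self.vocab[str(value)] for value in example)


field = ToyField()
field.setup([["a", "b"], ["c", "a"]])  # e.g. train_data
field.setup([["b", "d"]])              # same field reused by another dataset
ids = field.process("a", "d")
```

Because the second setup call only appends unseen values, ids assigned during the first call remain valid.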
-
-
class
flambe.field.
TextField
(tokenizer: Optional[Tokenizer] = None, lower: bool = False, pad_token: Optional[str] = '<pad>', unk_token: str = '<unk>', sos_token: Optional[str] = None, eos_token: Optional[str] = None, embeddings: Optional[str] = None, embeddings_format: str = 'glove', embeddings_binary: bool = False, model: Optional[KeyedVectors] = None, unk_init_all: bool = False, drop_unknown: bool = False, max_seq_len: Optional[int] = None, truncate_end: bool = False, setup_all_embeddings: bool = False)
Bases:
flambe.field.Field
Featurize raw text inputs
This class performs tokenization and numericalization, as well as decorating the input sequences with optional start and end tokens.
When a vocabulary is passed during initialization, it is used to map the words to indices. However, the vocabulary can also be generated from input data, through the setup method. Once a vocabulary has been built, this object can also be used to load external pretrained embeddings.
The pad, unk, sos and eos tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always uses index 0.
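The index ordering of the special tokens can be illustrated with a short sketch. The function build_vocab is a hypothetical helper written for this example, not part of flambe; it only assumes that special tokens are inserted before data tokens and that unset ones are skipped.

```python
from collections import OrderedDict
from typing import Iterable, Optional


def build_vocab(
    tokens: Iterable[str],
    pad_token: Optional[str] = "<pad>",
    unk_token: Optional[str] = "<unk>",
    sos_token: Optional[str] = None,
    eos_token: Optional[str] = None,
) -> "OrderedDict[str, int]":
    vocab: "OrderedDict[str, int]" = OrderedDict()
    # Special tokens come first, in the order pad, unk, sos, eos.
    for special in (pad_token, unk_token, sos_token, eos_token):
        if special is not None:
            vocab[special] = len(vocab)
    # Data tokens follow, each one getting the next free index.
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


vocab = build_vocab("the cat sat on the mat".split(), eos_token="</s>")
```

Here the pad token takes index 0, the unk token index 1, and the eos token index 2, since no sos token was given.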
-
vocab_size
: int
Get the vocabulary length.
Returns: The length of the vocabulary
Return type: int
-
_build_vocab
(self, *data: np.ndarray)
Build the vocabulary for this object based on the special tokens and the data provided.
This method is safe to be called multiple times.
Parameters: *data (np.ndarray) – The data
-
_build_embeddings
(self, model: KeyedVectors)
Create the embedding matrix and the new vocabulary in case this object needs to use an embedding model.
A new vocabulary needs to be built because some parameters change which tokens are kept, for example by collapsing OOVs.
Parameters: model (KeyedVectors) – The embeddings
Returns: A tuple with the new vocabulary and the embedding matrix
Return type: Tuple[OrderedDict, torch.Tensor]
-
setup
(self, *data: np.ndarray)
Build the vocabulary and set the embeddings.
Parameters: data (Iterable[str]) – List of input strings.
-
process
(self, example: str)
Process an example, and create a Tensor.
Parameters: example (str) – The example to process, as a single string
Returns: The processed example, tokenized and numericalized
Return type: torch.Tensor
-
classmethod
from_embeddings
(cls, embeddings: str, embeddings_format: str = 'glove', embeddings_binary: bool = False, setup_all_embeddings: bool = False, unk_init_all: bool = False, drop_unknown: bool = False, **kwargs)
Optional constructor to create a TextField from embedding parameters.
Parameters:
- embeddings (Optional[str], optional) – Path to pretrained embeddings, by default None
- embeddings_format (str, optional) – The format of the input embeddings, should be one of: ‘glove’, ‘word2vec’, ‘fasttext’ or ‘gensim’. The latter can be used to download embeddings hosted on gensim on the fly. See https://github.com/RaRe-Technologies/gensim-data for the list of available embedding aliases.
- embeddings_binary (bool, optional) – Whether the input embeddings are provided in binary format, by default False
- setup_all_embeddings (bool) – Controls if all words from the optional provided embeddings will be added to the vocabulary and to the embedding matrix. Defaults to False.
- unk_init_all (bool, optional) – If True, every token not provided in the input embeddings is given a random embedding from a normal distribution. Otherwise, all of them map to the ‘<unk>’ token.
- drop_unknown (bool) – Whether to drop tokens that don’t have embeddings associated. Defaults to False. Important: this flag only takes effect when using embeddings.
Returns: The constructed text field with the requested model.
Return type:
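The three ways of treating tokens without a pretrained vector (collapse onto the unk token, random initialization, or dropping) can be sketched as follows. This is an illustrative approximation only: the build_embedding_rows helper, the precedence of drop_unknown over unk_init_all, and the use of nested lists in place of a torch.Tensor are assumptions, not the library's behavior.

```python
import random
from typing import Dict, List, Sequence, Tuple


def build_embedding_rows(
    tokens: Sequence[str],
    pretrained: Dict[str, List[float]],
    dim: int,
    unk_init_all: bool = False,
    drop_unknown: bool = False,
    unk_token: str = "<unk>",
    seed: int = 0,
) -> Tuple[Dict[str, int], List[List[float]]]:
    rng = random.Random(seed)

    def random_row() -> List[float]:
        # Random vector drawn from a normal distribution.
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    vocab: Dict[str, int] = {unk_token: 0}
    rows: List[List[float]] = [random_row()]
    for token in tokens:
        if token in vocab:
            continue
        if token in pretrained:
            vocab[token] = len(rows)
            rows.append(list(pretrained[token]))
        elif drop_unknown:
            continue  # assumed precedence: the token is left out entirely
        elif unk_init_all:
            vocab[token] = len(rows)
            rows.append(random_row())  # the token gets its own random row
        else:
            vocab[token] = vocab[unk_token]  # collapse onto the unk row
    return vocab, rows


pretrained = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
tokens = ["cat", "zebra", "dog"]
collapsed_vocab, collapsed_rows = build_embedding_rows(tokens, pretrained, dim=2)
random_vocab, random_rows = build_embedding_rows(
    tokens, pretrained, dim=2, unk_init_all=True
)
dropped_vocab, dropped_rows = build_embedding_rows(
    tokens, pretrained, dim=2, drop_unknown=True
)
```

In each case the resulting vocabulary differs from the input token list, which is one reason a new vocabulary has to be returned alongside the matrix.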
-
-
class
flambe.field.
BoWField
(tokenizer: Optional[Tokenizer] = None, lower: bool = False, unk_token: str = '<unk>', min_freq: int = 5, normalize: bool = False, scale_factor: float = None)
Bases:
flambe.field.Field
Featurize raw text inputs using bag of words (BoW)
This class performs tokenization and numericalization.
The pad and unk tokens, when given, are assigned the first indices in the vocabulary, in that order. This means that whenever a pad token is specified, it always uses index 0.
Examples
>>> f = BoWField(min_freq=2, normalize=True)
>>> f.setup(['thank you', 'thank you very much', 'thanks a lot'])
>>> f._vocab.keys()
['thank', 'you']
Note that ‘thank’ and ‘you’ are the only tokens that appear twice.
>>> f.process("thank you really. Your help was awesome")
tensor([1, 2])
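The effect of min_freq and normalize can be approximated with a few lines of plain Python. This sketch is not the BoWField implementation and is not meant to reproduce the outputs above: the helpers build_bow_vocab and bow_vector, the whitespace tokenizer, and the choice to reserve index 0 for the unk token are all assumptions.

```python
from collections import Counter
from typing import Dict, List, Sequence


def build_bow_vocab(
    texts: Sequence[str], min_freq: int, unk_token: str = "<unk>"
) -> Dict[str, int]:
    counts = Counter(token for text in texts for token in text.split())
    vocab = {unk_token: 0}  # assumption: unk occupies the first index
    for token, count in counts.items():
        if count >= min_freq:  # rarer tokens are left out of the vocabulary
            vocab[token] = len(vocab)
    return vocab


def bow_vector(
    text: str,
    vocab: Dict[str, int],
    normalize: bool = False,
    unk_token: str = "<unk>",
) -> List[float]:
    vector = [0.0] * len(vocab)
    for token in text.split():
        vector[vocab.get(token, vocab[unk_token])] += 1.0
    total = sum(vector)
    if normalize and total > 0:
        vector = [value / total for value in vector]  # counts become fractions
    return vector


vocab = build_bow_vocab(
    ["thank you", "thank you very much", "thanks a lot"], min_freq=2
)
counts = bow_vector("thank you so much thank", vocab)
fractions = bow_vector("thank you so much thank", vocab, normalize=True)
```

With min_freq=2 only "thank" and "you" survive next to the unk entry, and every other token is counted under unk.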
-
vocab_size
: int
Get the vocabulary length.
Returns: The length of the vocabulary
Return type: int
-
process
(self, example)
-
setup
(self, *data)
-
-
class
flambe.field.
LabelField
(one_hot: bool = False, multilabel_sep: Optional[str] = None, labels: Optional[Sequence[str]] = None)
Bases:
flambe.field.field.Field
Featurizes input labels.
The class also handles multilabel inputs and one hot encoding.
-
vocab_size
: int
Get the vocabulary length.
Returns: The length of the vocabulary
Return type: int
-
label_count
: torch.Tensor
Get the label count.
Returns: Tensor containing the count for each label, indexed by the id of the label in the vocabulary.
Return type: torch.Tensor
-
label_freq
: torch.Tensor
Get the frequency of each label.
Returns: Tensor containing the frequency of each label, indexed by the id of the label in the vocabulary.
Return type: torch.Tensor
-
label_inv_freq
: torch.Tensor
Get the inverse frequency for each label.
Returns: Tensor containing the inverse frequency of each label, indexed by the id of the label in the vocabulary.
Return type: torch.Tensor
-
setup
(self, *data: np.ndarray)
Build the vocabulary.
Parameters: data (Iterable[str]) – List of input strings.
-
process
(self, example)
Featurize a single example.
Parameters: example (str) – The input label
Returns: A list of integer tokens
Return type: torch.Tensor
-