flambe.dataset.tabular

Module Contents
- class flambe.dataset.tabular.DataView(data: np.ndarray, transform_hooks: List[Tuple[Field, Union[int, List[int]]]], cache: bool)[source]

  A TabularDataset view for the train, val, or test split. This class should only be used internally by the TabularDataset class.

  A DataView is a lazy Iterable that receives its operations from the TabularDataset object. When __getitem__ is called, all of the fields defined in the transform are applied.

  This object can cache already-transformed examples. To enable this, make sure the view is used under a singleton pattern (there must be only one DataView per split in the TabularDataset).
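  The lazy-transform-with-cache behavior can be sketched as follows. This is a minimal illustration of the semantics described above, not flambe's actual implementation; the class name and the use of a plain callable in place of a Field are assumptions:

  ```python
  from typing import Callable, List, Sequence, Tuple, Union

  class LazyView:
      """Minimal sketch of a lazy, caching view over raw rows."""

      def __init__(self, data: Sequence[tuple],
                   transform_hooks: List[Tuple[Callable, Union[int, List[int]]]],
                   cache: bool = True):
          self.data = data                        # raw rows (sequence of tuples)
          self.transform_hooks = transform_hooks  # [(process fn, column(s)), ...]
          self.cache = cache
          self._cached = {}                       # example index -> transformed row

      def __getitem__(self, index: int) -> tuple:
          if self.cache and index in self._cached:
              return self._cached[index]          # transform runs at most once
          row = list(self.data[index])
          for fn, cols in self.transform_hooks:
              for c in (cols if isinstance(cols, list) else [cols]):
                  row[c] = fn(row[c])             # apply the field's transform
          example = tuple(row)
          if self.cache:
              self._cached[index] = example
          return example

      def __len__(self) -> int:
          return len(self.data)
  ```

  Because transforms run only inside __getitem__, constructing the view is cheap; the cache then guarantees each example is transformed at most once, which is why the singleton-per-split requirement matters.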
- class flambe.dataset.tabular.TabularDataset(train: Iterable[Iterable], val: Optional[Iterable[Iterable]] = None, test: Optional[Iterable[Iterable]] = None, cache: bool = True, named_columns: Optional[List[str]] = None, transform: Dict[str, Union[Field, Dict]] = None)[source]

  Bases: flambe.dataset.Dataset

  Loader for tabular data, usually in csv or tsv format.

  A TabularDataset can represent any data that can be organized in a table. Internally, all information is stored in a 2D numpy generic array. This object also behaves as a sequence over the whole dataset, chaining the training, validation, and test data, in that order. This is useful for creating vocabularies or loading embeddings over the full dataset.
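  The chained-sequence behavior can be illustrated with plain Python (a sketch of the semantics using itertools.chain; the toy data is made up):

  ```python
  from itertools import chain

  train = [("the cat", 0), ("a dog", 1)]
  val = [("the dog", 1)]
  test = [("a cat", 0)]

  # Iterating over the whole dataset chains train, val, and test, in that order.
  full = list(chain(train, val, test))

  # This makes it easy to build a vocabulary over the full dataset:
  vocab = sorted({tok for text, _ in full for tok in text.split()})
  ```

  A vocabulary built this way covers tokens that appear only in the validation or test splits, which is the use case the chaining is designed for.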
- train :np.ndarray[source]

  Returns the training data as a numpy ndarray.

- val :np.ndarray[source]

  Returns the validation data as a numpy ndarray.

- test :np.ndarray[source]

  Returns the test data as a numpy ndarray.
- _set_transforms(self, transform: Dict[str, Union[Field, Dict]])[source]

  Set transformation attributes and hooks on the data splits.

  This method adds an attribute for each field in the transform dict. It also adds a hook for the 'process' call of each field.

  ATTENTION: This method works with the hidden _train, _val and _test attributes, as it runs in the constructor and creates the hooks used when building the properties.
- classmethod from_path(cls, train_path: str, val_path: Optional[str] = None, test_path: Optional[str] = None, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)[source]

  Load a TabularDataset from the given file paths.

  Parameters:
  - train_path (str) – The path to the training data
  - val_path (str, optional) – The path to the optional validation data
  - test_path (str, optional) – The path to the optional test data
  - sep (str) – Separator to pass to the read_csv method
  - header (Optional[Union[str, int]]) – Use 0 for the first line, None for no header, and 'infer' to detect it automatically; defaults to 'infer'
  - columns (Optional[Union[List[str], List[int]]]) – List of columns to load; can be used to select a subset of columns or change their order at loading time
  - encoding (str) – The encoding format passed to the pandas reader
  - transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified by a name for easy linking.
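  The effect of the sep, header, and columns options can be illustrated with the standard-library csv reader. This is a sketch of the described semantics only (it supports header=0 or None, not 'infer'), not flambe's or pandas' implementation:

  ```python
  import csv
  import io

  def read_table(text, sep=",", header=0, columns=None):
      """Sketch: parse delimited text, honoring sep, header, and columns."""
      rows = list(csv.reader(io.StringIO(text), delimiter=sep))
      names = None
      if header == 0:                 # first line holds the column names
          names, rows = rows[0], rows[1:]
      if columns is not None:         # select / reorder columns by name or index
          idx = [names.index(c) if isinstance(c, str) else c for c in columns]
          rows = [[r[i] for i in idx] for r in rows]
          if names is not None:
              names = [names[i] for i in idx]
      return rows, names
  ```

  For example, passing columns=["b", "a"] both selects and reorders the columns at loading time, which mirrors the behavior documented for the columns parameter.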
- classmethod autogen_val_test(cls, data_path: str, seed: Optional[int] = None, test_ratio: Optional[float] = 0.2, val_ratio: Optional[float] = 0.2, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8', transform: Dict[str, Union[Field, Dict]] = None)[source]

  Generate a test and validation set from the given file path, then load a TabularDataset.

  Parameters:
  - data_path (str) – The path to the data
  - seed (Optional[int]) – Random seed used in test/val generation
  - test_ratio (Optional[float]) – The ratio of the test set relative to the whole dataset
  - val_ratio (Optional[float]) – The ratio of the validation set relative to the remaining training data (whole - test)
  - sep (str) – Separator to pass to the read_csv method
  - header (Optional[Union[str, int]]) – Use 0 for the first line, None for no header, and 'infer' to detect it automatically; defaults to 'infer'
  - columns (Optional[Union[List[str], List[int]]]) – List of columns to load; can be used to select a subset of columns or change their order at loading time
  - encoding (str) – The encoding format passed to the pandas reader
  - transform (Dict[str, Union[Field, Dict]]) – The fields to be applied to the columns. Each field is identified by a name for easy linking.
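  The split arithmetic described by test_ratio and val_ratio can be sketched as follows. Note the asymmetry: test_ratio is taken from the whole dataset, while val_ratio is taken from what remains after the test split. The function name and shuffle-based strategy are illustrative assumptions, not flambe's exact implementation:

  ```python
  import random

  def split_indices(n, test_ratio=0.2, val_ratio=0.2, seed=None):
      """Sketch of the test/val split arithmetic described above."""
      rng = random.Random(seed)
      indices = list(range(n))
      rng.shuffle(indices)                    # seed makes the split reproducible
      n_test = int(n * test_ratio)            # ratio of the whole dataset
      n_val = int((n - n_test) * val_ratio)   # ratio of the remainder after test
      test = indices[:n_test]
      val = indices[n_test:n_test + n_val]
      train = indices[n_test + n_val:]
      return train, val, test
  ```

  With the defaults and 100 examples, this yields 20 test, 16 validation (20% of the remaining 80), and 64 training examples.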
- classmethod _load_file(cls, path: str, sep: Optional[str] = '\t', header: Optional[str] = 'infer', columns: Optional[Union[List[str], List[int]]] = None, encoding: Optional[str] = 'utf-8')[source]

  Load data from the given path.

  The path may be either a single file or a directory. If it is a directory, each file is loaded according to the specified options and all the data is concatenated into a single list.

  Parameters:
  - path (str) – Path to the data; may be a directory or a file
  - sep (str) – Separator to pass to the read_csv method
  - header (Optional[Union[str, int]]) – Use 0 for the first line, None for no header, and 'infer' to detect it automatically; defaults to 'infer'
  - columns (Optional[Union[List[str], List[int]]]) – List of columns to load; can be used to select a subset of columns or change their order at loading time
  - encoding (str) – The encoding format passed to the pandas reader

  Returns: A tuple containing the list of examples (where each example is itself a list or tuple of entries in the dataset) and an optional list of named columns (one string per column in the dataset)

  Return type: Tuple[List[Tuple], Optional[List[str]]]
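  The file-or-directory dispatch can be sketched with the standard library (an illustration of the concatenation behavior only; the helper names and the sorted traversal order are assumptions):

  ```python
  import os

  def load_path(path, read_file):
      """Sketch: load one file, or every file in a directory, concatenated."""
      if os.path.isdir(path):
          examples = []
          for name in sorted(os.listdir(path)):   # deterministic traversal
              examples.extend(read_file(os.path.join(path, name)))
          return examples
      return read_file(path)
  ```

  Each file in the directory is parsed with the same options, and the resulting example lists are concatenated into one, as the description above specifies.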