Dataloader
The main aim of a dataloader is to generate batches of data with which a model can be run. It therefore has to return a dictionary with three keys: `inputs`, `targets` (optional) and `metadata` (optional).
As the names suggest, `inputs` are fed to the model to make predictions and `targets` can be used to train the model. The `metadata` field provides additional information about the samples (like a sample ID, or genomic ranges for DNA sequence-based models).
In a batch of data returned by the dataloader, all three fields can be further nested - i.e. `inputs` can be a list of numpy arrays or a dictionary of numpy arrays. The only restriction is that the leaf objects are numpy arrays and that the first axis (the batch dimension) is the same for all arrays.
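For illustration, a batch with nested `inputs` might look like this (all keys and shapes are hypothetical, not part of the Kipoi API):

```python
import numpy as np

batch = {
    "inputs": {
        "seq": np.zeros((32, 101, 4)),       # one-hot encoded DNA sequence
        "dist_to_tss": np.zeros((32, 1)),    # additional numeric features
    },
    "targets": np.zeros((32, 1)),
    "metadata": {
        "ranges": {
            "chr": np.array(["chr1"] * 32),
            "start": np.zeros((32,), dtype=int),
            "end": np.zeros((32,), dtype=int),
            "id": np.array([str(i) for i in range(32)]),
        }
    },
}
# all leaf objects are numpy arrays sharing the same first (batch) axis of 32
```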
Note that the `inputs` and `targets` have to be compatible with the model you are using. Keras, for instance, accepts all three options as inputs and targets: a single numpy array, a list of numpy arrays, or a dictionary of numpy arrays (note: to use a dictionary of numpy arrays as input you have to use the functional API and specify the `name` fields of the `keras.layers.Input` layers). Scikit-learn models, on the other hand, only allow the inputs and targets to be a single 2-dimensional numpy array.
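For the Keras case, here is a minimal sketch of a functional-API model with a named `Input` layer consuming such a dictionary (the architecture and shapes are made up for illustration):

```python
import numpy as np
from keras.layers import Input, Flatten, Dense
from keras.models import Model

# hypothetical model: the Input layer's name ("seq") has to match
# the key used in the dataloader's "inputs" dictionary
seq = Input(shape=(101, 4), name="seq")
out = Dense(1, activation="sigmoid")(Flatten()(seq))
model = Model(inputs=seq, outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy")

# a dictionary of numpy arrays keyed by the Input names is a valid model input
model.train_on_batch({"seq": np.zeros((32, 101, 4))}, np.zeros((32, 1)))
```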
Conceptually, there are three ways to write a dataloader. The dataloader can yield:

- individual samples
- batches of data
- the whole dataset
Note that when a dataloader returns individual samples, the returned numpy arrays shouldn't contain the batch axis; the batch axis gets added by Kipoi when it batches the samples. Also, the samples may contain scalar (non-numpy-array) types like `bool`, `float`, `int` or `str`. These will later get stacked into a one-dimensional numpy array.
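To make this concrete, a single sample corresponding to the hypothetical batch above would omit the batch axis and may use plain scalars:

```python
import numpy as np

sample = {
    "inputs": {
        "seq": np.zeros((101, 4)),     # no batch axis
        "dist_to_tss": np.zeros((1,)),
    },
    "targets": np.zeros((1,)),
    "metadata": {
        "ranges": {"chr": "chr1", "start": 0, "end": 101, "id": "0"},  # scalars
    },
}
# Kipoi stacks such samples along a new first axis when batching; scalar values
# like the strings and ints above end up as one-dimensional numpy arrays.
```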
Dataloader types
Specifically, a dataloader has to inherit from one of the following classes defined in `kipoi.data`:
- `PreloadedDataset`
    - Function that returns the whole dataset as a nested dictionary/list of numpy arrays
    - useful when: the dataset is expected to load quickly and fit into memory
- `Dataset`
    - Class that inherits from `kipoi.data.Dataset` and implements the `__len__` and `__getitem__` methods. `__getitem__` returns a single sample from the dataset.
    - useful when: the dataset length is easy to infer and there is no significant performance gain from reading the data off the disk in batches
- `BatchDataset`
    - Class that inherits from `kipoi.data.BatchDataset` and implements the `__len__` and `__getitem__` methods. `__getitem__` returns a single batch of samples from the dataset.
    - useful when: the dataset length is easy to infer and there is a significant performance gain from reading the data off the disk in batches
- `SampleIterator`
    - Class that inherits from `kipoi.data.SampleIterator` and implements `__iter__` and `__next__` (`next` in Python 2). `__next__` returns a single sample from the dataset or raises `StopIteration` once all samples have been returned.
    - useful when: the dataset length is not known in advance or is difficult to infer, and there is no significant performance gain from reading the data off the disk in batches
- `BatchIterator`
    - Class that inherits from `kipoi.data.BatchIterator` and implements `__iter__` and `__next__` (`next` in Python 2). `__next__` returns a single batch of samples from the dataset or raises `StopIteration` once all samples have been returned.
    - useful when: the dataset length is not known in advance or is difficult to infer, and there is a significant performance gain from reading the data off the disk in batches
- `SampleGenerator`
    - A generator function that yields a single sample from the dataset and returns when all the samples were yielded (see the sketch after the table below).
    - useful when: same as for `SampleIterator`, but can typically be implemented in fewer lines of code
- `BatchGenerator`
    - A generator function that yields a single batch of samples from the dataset and returns when all the batches were yielded.
    - useful when: same as for `BatchIterator`, but can typically be implemented in fewer lines of code
Here is a table showing the (recommended) requirements for each dataloader type:
| Dataloader type  | Length known? | Significant benefit from loading data in batches? | Fits into memory and loads quickly? |
|------------------|---------------|----------------------------------------------------|-------------------------------------|
| PreloadedDataset | yes           | yes                                                | yes                                 |
| Dataset          | yes           | no                                                 | no                                  |
| BatchDataset     | yes           | yes                                                | no                                  |
| SampleIterator   | no            | no                                                 | no                                  |
| BatchIterator    | no            | yes                                                | no                                  |
| SampleGenerator  | no            | no                                                 | no                                  |
| BatchGenerator   | no            | yes                                                | no                                  |
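To make the generator-based variants concrete, here is a minimal, hypothetical `SampleGenerator` analogous to the `Dataset` example below (the `FastaExtractor` import is an assumption and would need to match your environment):

```python
import numpy as np
from pybedtools import BedTool
from genomelake.extractors import FastaExtractor  # assumed one-hot sequence extractor
from kipoi.metadata import GenomicRanges


def generate_seq_samples(intervals_file, fasta_file):
    """Yield one-hot encoded sequences, one sample at a time."""
    fasta_extractor = FastaExtractor(fasta_file)
    for interval in BedTool(intervals_file):
        yield {
            "inputs": np.squeeze(fasta_extractor([interval]), axis=0),
            "metadata": {"ranges": GenomicRanges.from_interval(interval)}
        }
```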
Dataset example
Here is an example dataloader that takes a fasta file and a bed file as input and returns a one-hot encoded sequence (under `inputs`) along with the used genomic interval (under `metadata/ranges`).
```python
import numpy as np
from pybedtools import BedTool
# FastaExtractor extracts and one-hot encodes sequences for a list of intervals
# (assumed here to be provided by e.g. genomelake)
from genomelake.extractors import FastaExtractor

from kipoi.data import Dataset
from kipoi.metadata import GenomicRanges


class SeqDataset(Dataset):
    """
    Args:
        intervals_file: bed3 file containing intervals
        fasta_file: file path; Genome sequence
    """

    def __init__(self, intervals_file, fasta_file):
        self.bt = BedTool(intervals_file)
        self.fasta_file = fasta_file
        self.fasta_extractor = None

    def __len__(self):
        return len(self.bt)

    def __getitem__(self, idx):
        # lazily instantiate the extractor so that each worker process
        # opens its own file handle (see the note below)
        if self.fasta_extractor is None:
            self.fasta_extractor = FastaExtractor(self.fasta_file)
        interval = self.bt[idx]
        # one-hot encoded sequence of shape (interval length, 4)
        seq = np.squeeze(self.fasta_extractor([interval]), axis=0)
        return {
            "inputs": seq,
            # no targets provided by this dataloader
            "metadata": {
                "ranges": GenomicRanges.from_interval(interval)
            }
        }
```
Since `FastaExtractor` is not multi-processing safe, we initialize it on the first call of `__getitem__` instead of in `__init__`. The reason for this is that when we use parallel dataloading, each worker process gets a copy of the `SeqDataset(...)` object. Upon the first call of `__getitem__`, the `fasta_extractor` and hence the underlying file handle are set up for each worker independently.
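With that in place, the dataloader can be used with multiple worker processes, for example like this (a sketch assuming the `batch_iter` helper inherited from `kipoi.data.Dataset` and hypothetical file paths):

```python
dl = SeqDataset(intervals_file="intervals.bed", fasta_file="genome.fa")

# each of the 4 worker processes lazily creates its own FastaExtractor
for batch in dl.batch_iter(batch_size=32, num_workers=4):
    print(batch["inputs"].shape)
```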
Required static files
If your dataloader requires an external data file, as for example in tutorials/contributing_models, then the Kipoi way of automatically downloading and using that file is to add an argument to the dataloader implementation:
```python
from kipoi.data import Dataset


class SeqDataset(Dataset):
    """
    Args:
        intervals_file: bed3 file containing intervals
        fasta_file: file path; Genome sequence
    """

    def __init__(self, intervals_file, fasta_file, essential_other_file):
        fh = open(essential_other_file, "r")
        ...
```
Kipoi can automatically download the required file from a zenodo or figshare URL if the URL is defined as a `default` in the `dataloader.yaml` as follows:
```yaml
args:
    ...
    essential_other_file:
        default:
            url: https://zenodo.org/path/to/my/essential/other/file.xyz
            md5: 765sadf876a
```
Further examples
To see examples of other dataloaders, run `kipoi init` from the command line and choose a different `dataloader_type` each time.
```
$ kipoi init
INFO [kipoi.cli.main] Initializing a new Kipoi model
...
Select dataloader_type:
1 - Dataset
2 - PreloadedDataset
3 - BatchDataset
4 - SampleIterator
5 - SampleGenerator
6 - BatchIterator
7 - BatchGenerator
Choose from 1, 2, 3, 4, 5, 6, 7 [1]:
```
The generated model directory will contain a working implementation of a dataloader.