Dataloader

The main aim of a dataloader is to generate batches of data on which a model can be run. It therefore has to return a dictionary with the following keys:

  • inputs
  • targets (optional)
  • metadata (optional)

As the names suggest, the inputs are fed to the model to make predictions, and the targets can be used to train the model. The metadata field provides additional information about the samples (like the sample ID, or genomic ranges for DNA-sequence-based models).

In a batch of data returned by the dataloader, all three fields can be further nested - i.e. inputs can be a list of numpy arrays or a dictionary of numpy arrays. The only restriction is that the leaf objects are numpy arrays and that the first axis (batch dimension) is the same for all arrays.

Note that the inputs and targets have to be compatible with the model you are using. Keras, for instance, can accept as inputs and targets all three options: single numpy array, list of numpy arrays, dictionary of numpy arrays (note: to use as input a dictionary of numpy arrays you have to use the functional API and specify the name fields in the keras.layers.Input layer). On the other hand, the Scikit-learn models only allow the inputs and targets to be a single 2-dimensional numpy array.
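To make the batch structure concrete, here is a sketch of a valid (hypothetical) batch with nested inputs. The field names and array shapes are made up for illustration; the only hard requirements are that the leaf objects are numpy arrays and that they share the first (batch) axis:

```python
import numpy as np

# Hypothetical batch as a dataloader might return it (batch size 2).
batch = {
    "inputs": {
        "seq": np.zeros((2, 100, 4)),      # e.g. one-hot encoded DNA sequences
        "extra": np.zeros((2, 10)),        # e.g. additional features
    },
    "targets": np.zeros((2, 1)),           # optional
    "metadata": {"id": np.array(["a", "b"])}  # optional
}

# Check the batch-axis invariant: all leaf arrays share the same first dimension.
first_dims = [batch["inputs"]["seq"].shape[0],
              batch["inputs"]["extra"].shape[0],
              batch["targets"].shape[0],
              batch["metadata"]["id"].shape[0]]
assert len(set(first_dims)) == 1
```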

Conceptually, there are three ways to write a dataloader. The dataloader can yield:

  • individual samples
  • batches of data
  • the whole dataset

Note that when a dataloader returns individual samples, the returned numpy arrays shouldn't contain the batch axis. The batch axis will get generated by Kipoi when batching the samples. Also, the samples may contain non-numpy array scalar types like bool, float, int, str. These will later get stacked into a one-dimensional numpy array.
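The following sketch illustrates the point above: a single sample carries no batch axis and may contain plain scalars, and batching (done here with np.stack, as a stand-in for what Kipoi does internally) introduces the batch axis and turns the scalars into one-dimensional numpy arrays:

```python
import numpy as np

# A single sample: the array has no batch axis; scalar fields are allowed.
sample = {
    "inputs": np.zeros((100, 4)),                 # one sequence, no batch dim
    "metadata": {"id": "sample-0", "score": 0.5}  # plain str and float scalars
}

# Batching three such samples: arrays gain a batch axis, scalars get
# stacked into one-dimensional numpy arrays.
samples = [sample, sample, sample]
batched_inputs = np.stack([s["inputs"] for s in samples])   # shape (3, 100, 4)
batched_ids = np.stack([s["metadata"]["id"] for s in samples])  # shape (3,)
```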

Dataloader types

Specifically, a dataloader has to inherit from one of the following classes defined in kipoi.data:

  • PreloadedDataset

    • Function that returns the whole dataset as a nested dictionary/list of numpy arrays
    • useful when: the dataset is expected to load quickly and fit into memory
  • Dataset

    • Class that inherits from kipoi.data.Dataset and implements __len__ and __getitem__ methods. __getitem__ returns a single sample from the dataset.
    • useful when: the dataset length is easy to infer, and there is no significant performance gain from reading data off the disk in batches
  • BatchDataset

    • Class that inherits from kipoi.data.BatchDataset and implements __len__ and __getitem__ methods. __getitem__ returns a single batch of samples from the dataset.
    • useful when: the dataset length is easy to infer, and there is a significant performance gain from reading data off the disk in batches
  • SampleIterator

    • Class that inherits from kipoi.data.SampleIterator and implements __iter__ and __next__ (next in python 2). __next__ returns a single sample from the dataset or raises StopIteration if all the samples were already returned.
    • useful when: the dataset length is not known in advance or is difficult to infer, and there is no significant performance gain from reading data off the disk in batches
  • BatchIterator

    • Class that inherits from kipoi.data.BatchIterator and implements __iter__ and __next__ (next in python 2). __next__ returns a single batch of samples from the dataset or raises StopIteration if all the samples were already returned.
    • useful when: the dataset length is not known in advance or is difficult to infer, and there is a significant performance gain from reading data off the disk in batches
  • SampleGenerator

    • A generator function that yields a single sample from the dataset and returns when all the samples were yielded.
    • useful when: same as for SampleIterator, but can typically be implemented in fewer lines of code
  • BatchGenerator

    • A generator function that yields a single batch of samples from the dataset and returns when all the samples were yielded.
    • useful when: same as for BatchIterator, but can typically be implemented in fewer lines of code
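As a sketch of the generator-based variants, here is a minimal (hypothetical) SampleGenerator: a plain generator function that yields one sample dict at a time and simply returns once the data is exhausted. The sample contents and count are made up for illustration:

```python
import numpy as np

def my_sample_generator(n_samples=5):
    """Yield one sample dict at a time; returning ends the iteration,
    which plays the role of StopIteration in the iterator classes."""
    for i in range(n_samples):
        yield {
            "inputs": np.random.rand(10),       # one sample, no batch axis
            "metadata": {"id": np.array(i)}
        }

samples = list(my_sample_generator(3))
```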

Here is a table showing the (recommended) requirements for each dataloader type:

| Dataloader type  | Length known? | Significant benefit from loading data in batches? | Fits into memory and loads quickly? |
|------------------|---------------|---------------------------------------------------|-------------------------------------|
| PreloadedDataset | yes           | yes                                               | yes                                 |
| Dataset          | yes           | no                                                | no                                  |
| BatchDataset     | yes           | yes                                               | no                                  |
| SampleIterator   | no            | no                                                | no                                  |
| BatchIterator    | no            | yes                                               | no                                  |
| SampleGenerator  | no            | no                                                | no                                  |
| BatchGenerator   | no            | yes                                               | no                                  |

Dataset example

Here is an example dataloader that takes a fasta file and a bed file as input and returns a one-hot encoded sequence (under 'inputs') along with the used genomic interval (under 'metadata/ranges').

from __future__ import absolute_import, division, print_function
import numpy as np
from pybedtools import BedTool
from genomelake.extractors import FastaExtractor
from kipoi.data import Dataset
from kipoi.metadata import GenomicRanges

class SeqDataset(Dataset):
    """
    Args:
        intervals_file: bed3 file containing intervals
        fasta_file: file path; Genome sequence
    """

    def __init__(self, intervals_file, fasta_file):

        self.bt = BedTool(intervals_file)
        self.fasta_file = fasta_file
        self.fasta_extractor = None

    def __len__(self):
        return len(self.bt)

    def __getitem__(self, idx):
        if self.fasta_extractor is None:
            self.fasta_extractor = FastaExtractor(self.fasta_file)

        interval = self.bt[idx]

        seq = np.squeeze(self.fasta_extractor([interval]), axis=0)
        return {
            "inputs": seq,
            # lacks targets
            "metadata": {
                "ranges": GenomicRanges.from_interval(interval)
            }
        }

Since FastaExtractor is not multi-processing safe, we initialize it on the first call of __getitem__ instead of in __init__. The reason is that with parallel dataloading, each worker process gets a copy of the SeqDataset(...) object. Upon the first call of __getitem__, the fasta_extractor and hence the underlying file handle are set up in each worker independently.

Required static files

If your dataloader requires an external data file as for example in tutorials/contributing_models, then the Kipoi way of automatically downloading and using that file is by adding an argument to the dataloader implementation:

from __future__ import absolute_import, division, print_function
from kipoi.data import Dataset

class SeqDataset(Dataset):
    """
    Args:
        intervals_file: bed3 file containing intervals
        fasta_file: file path; Genome sequence
    """

    def __init__(self, intervals_file, fasta_file, essential_other_file):
        fh = open(essential_other_file, "r")
        ...

Kipoi can automatically download the required file from a Zenodo or Figshare URL if the URL is defined as the argument's default in dataloader.yaml as follows:

args:
   ...
   essential_other_file:
       default:
           url: https://zenodo.org/path/to/my/essential/other/file.xyz
           md5: 765sadf876a

Further examples

To see examples of other dataloaders, run kipoi init from the command-line and choose each time a different dataloader_type.

$ kipoi init
INFO [kipoi.cli.main] Initializing a new Kipoi model

...

Select dataloader_type:
1 - Dataset
2 - PreloadedDataset
3 - BatchDataset
4 - SampleIterator
5 - SampleGenerator
6 - BatchIterator
7 - BatchGenerator
Choose from 1, 2, 3, 4, 5, 6, 7 [1]:

The generated model directory will contain a working implementation of a dataloader.