Generated from notebooks/python-api.ipynb

Kipoi python API

Quick start

There are three basic building blocks in kipoi:

  • Source - provides Models and DataLoaders.
  • Model - makes the prediction given the numpy arrays.
  • Dataloader - loads the data from raw files and transforms them into a form that is directly consumable by the Model

img

List of main commands

Get/list sources - kipoi.list_sources() - kipoi.get_source()

List models/dataloaders - kipoi.list_models() - kipoi.list_dataloaders()

Get model/dataloader - kipoi.get_model() - kipoi.get_dataloader_factory()

Load only model/dataloader description from the yaml file without loading the model

  • kipoi.get_model_descr()
  • kipoi.get_dataloader_descr()

Install the dependencies - kipoi.install_model_dependencies() - kipoi.install_dataloader_dependencies()

import kipoi

Source

Available sources are specified in the config file located at: ~/.kipoi/config.yaml. Here is an example config file:

model_sources:
    kipoi: # default
        type: git-lfs # git repository with large file storage (git-lfs)
        remote_url: [email protected]:kipoi/models.git # git remote
        local_path: ~/.kipoi/models/ # local storage path
    gl:
        type: git-lfs  # custom model
        remote_url: https://i12g-gagneurweb.informatik.tu-muenchen.de/gitlab/gagneurlab/model-zoo.git
        local_path: /s/project/model-zoo

There are three different model sources possible:

  • git-lfs - git repository with source files tracked normally by git and all the binary files like model weights (located in files* directories) are tracked by git-lfs.
  • Requires git-lfs to be installed.
  • git - all the files including weights (not recommended)
  • local - local directory containing models defined in subdirectories

For git-lfs source type, larger files tracked by git-lfs will be downloaded into the specified directory local_path only after the model has been requested (when invoking kipoi.get_model()).

Note

A particular model/dataloader is defined by its source (say kipoi or my_git_models) and the relative path of the desired model directory from the model source root (say rbp/).

A directory is considered a model if it contains a model.yaml file.

import kipoi
import warnings
warnings.filterwarnings('ignore')

import logging
logging.disable(1000)
kipoi.list_sources()
source type location local_size n_models n_dataloaders
0 kipoi git-lfs /home/avsec/.kipoi/mo... 1,2G 780 780
s = kipoi.get_source("kipoi")
s
GitLFSSource(remote_url='[email protected]:kipoi/models.git', local_path='/home/avsec/.kipoi/models/')
kipoi.list_models().head()
source model version authors contributors doc type inputs targets postproc_score_variants license cite_as trained_on training_procedure tags
0 kipoi DeepSEAKeras 0.1 [Author(name='Jian Zh... [Author(name='Lara Ur... This CNN is based on ... keras seq TFBS_DHS_probs True MIT https://doi.org/10.10... ENCODE and Roadmap Ep... https://www.nature.co... [Histone modification...
1 kipoi extended_coda 0.1 [Author(name='Pang We... [Author(name='Johnny ... Single bp resolution ... keras [H3K27AC_subsampled] [H3K27ac] False MIT https://doi.org/10.10... Described in https://... Described in https://... [Histone modification]
2 kipoi DeepCpG_DNA/Hou2016_m... 1.0.4 [Author(name='Christo... [Author(name='Roman K... This is the extractio... keras [dna] [cpg/mESC1, cpg/mESC2... True MIT https://doi.org/10.11... scBS-seq and scRRBS-s... Described in https://... [DNA methylation]
3 kipoi DeepCpG_DNA/Smallwood... 1.0.4 [Author(name='Christo... [Author(name='Roman K... This is the extractio... keras [dna] [cpg/BS24_1_2I, cpg/B... True MIT https://doi.org/10.11... scBS-seq and scRRBS-s... Described in https://... [DNA methylation]
4 kipoi DeepCpG_DNA/Hou2016_H... 1.0.4 [Author(name='Christo... [Author(name='Roman K... This is the extractio... keras [dna] [cpg/HepG21, cpg/HepG... True MIT https://doi.org/10.11... scBS-seq and scRRBS-s... Described in https://... [DNA methylation]

Model

Let's choose to use the rbp_eclip/UPF1 model from kipoi

MODEL = "rbp_eclip/UPF1"

NOTE: If you are using python2, use a different model like MaxEntScan/3prime to following this example.

# Note. Install all the dependencies for that model:
# add --gpu flag to install gpu-compatible dependencies (e.g. installs tensorflow-gpu instead of tensorflow)
!kipoi env install {MODEL}
model = kipoi.get_model(MODEL)

Available fields:

Model

  • type
  • args
  • info
  • authors
  • name
  • version
  • tags
  • doc
  • schema
  • inputs
  • targets
  • default_dataloader - loaded dataloader class

  • predict_on_batch()

  • source
  • source_dir
  • pipeline
  • predict()
  • predict_example()
  • predict_generator()

Dataloader

  • type
  • defined_as
  • args
  • info (same as for the model)
  • output_schema
  • inputs
  • targets
  • metadata

  • source

  • source_dir
  • example_kwargs
  • init_example()
  • batch_iter()
  • batch_train_iter()
  • batch_predict_iter()
  • load_all()
model
<kipoi.model.KerasModel at 0x7f95b27af2b0>
model.type
'keras'

Info

model.info
ModelInfo(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='\'RBP binding model from Avsec et al: "Modeling positional effects of regulatory sequences with spline transformations increases prediction accuracy of deep neural networks". \'

', name=None, version='0.1', license='MIT', tags=['RNA binding'], contributors=[Author(name='Ziga Avsec', github='avsecz', email=None)], cite_as='https://doi.org/10.1093/bioinformatics/btx727', trained_on='RBP occupancy peaks measured by eCLIP-seq (Van Nostrand et al., 2016 - https://doi.org/10.1038/nmeth.3810), https://github.com/gagneurlab/Manuscript_Avsec_Bioinformatics_2017 ', training_procedure='Single task training with ADAM')

model.info.version
'0.1'

Schema

dict(model.schema.inputs)
{'dist_exon_intron': ArraySchema(shape=(1, 10), doc='Distance the nearest exon_intron (splice donor) site transformed with B-splines', name='dist_exon_intron', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_end': ArraySchema(shape=(1, 10), doc='Distance the nearest gene end transformed with B-splines', name='dist_gene_end', special_type=None, associated_metadata=[], column_labels=None),
 'dist_gene_start': ArraySchema(shape=(1, 10), doc='Distance the nearest gene start transformed with B-splines', name='dist_gene_start', special_type=None, associated_metadata=[], column_labels=None),
 'dist_intron_exon': ArraySchema(shape=(1, 10), doc='Distance the nearest intron_exon (splice acceptor) site transformed with B-splines', name='dist_intron_exon', special_type=None, associated_metadata=[], column_labels=None),
 'dist_polya': ArraySchema(shape=(1, 10), doc='Distance the nearest Poly-A site transformed with B-splines', name='dist_polya', special_type=None, associated_metadata=[], column_labels=None),
 'dist_start_codon': ArraySchema(shape=(1, 10), doc='Distance the nearest start codon transformed with B-splines', name='dist_start_codon', special_type=None, associated_metadata=[], column_labels=None),
 'dist_stop_codon': ArraySchema(shape=(1, 10), doc='Distance the nearest stop codon transformed with B-splines', name='dist_stop_codon', special_type=None, associated_metadata=[], column_labels=None),
 'dist_tss': ArraySchema(shape=(1, 10), doc='Distance the nearest TSS site transformed with B-splines', name='dist_tss', special_type=None, associated_metadata=[], column_labels=None),
 'seq': ArraySchema(shape=(101, 4), doc='One-hot encoded RNA sequence', name='seq', special_type=<ArraySpecialType.DNASeq: 'DNASeq'>, associated_metadata=[], column_labels=None)}
model.schema.targets
ArraySchema(shape=(1,), doc='Predicted binding strength', name=None, special_type=None, associated_metadata=[], column_labels=None)

Default dataloader

Model already has the default dataloder present. To use it, specify

model.source_dir
'/home/avsec/.kipoi/models/rbp_eclip/UPF1'
model.default_dataloader
dataloader.SeqDistDataset
model.default_dataloader.info
Info(authors=[Author(name='Ziga Avsec', github='avsecz', email=None)], doc='RBP binding model taking as input 101nt long sequence as well as 8 distances to nearest genomic landmarks -  tss, poly-A, exon-intron boundary, intron-exon boundary, start codon, stop codon, gene start, gene end

', name=None, version='0.1', license='MIT', tags=[])

Predict_on_batch

model.predict_on_batch
<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x7f95b27af2b0>>

Others

# Model source
model.source
GitLFSSource(remote_url='[email protected]:kipoi/models.git', local_path='/home/avsec/.kipoi/models/')
# model location directory
model.source_dir
'/home/avsec/.kipoi/models/rbp_eclip/UPF1'

DataLoader

DataLoader = kipoi.get_dataloader_factory(MODEL)
# same as DataLoader = model.default_dataloader

A dataloader will most likely require input arguments in which the input files are defined, for example input fasta files or bed files, based on which the model input is generated. There are several options where the dataloader input keyword arguments are displayed:

# Display information about the dataloader
print(DataLoader.__doc__)
    Args:
        intervals_file: file path; tsv file
            Assumes bed-like `chrom start end id score strand` format.
        fasta_file: file path; Genome sequence
        gtf_file: file path; Genome annotation GTF file.
        filter_protein_coding: Considering genomic landmarks only for protein coding genes
        preproc_transformer: file path; tranformer used for pre-processing.
        target_file: file path; path to the targets
        batch_size: int
# Alternatively the dataloader keyword arguments can be displayed using the function:
kipoi.print_dl_kwargs(DataLoader)
Keyword argument: `intervals_file`
    doc: bed6 file with `chrom start end id score strand` columns
    type: str
    optional: False
    example: example_files/intervals.bed
Keyword argument: `fasta_file`
    doc: Reference genome sequence
    type: str
    optional: False
    example: example_files/hg38_chr22.fa
Keyword argument: `gtf_file`
    doc: file path; Genome annotation GTF file
    type: str
    optional: False
    example: example_files/gencode.v24.annotation_chr22.gtf
Keyword argument: `filter_protein_coding`
    doc: Considering genomic landmarks only for protein coding genes when computing the distances to the nearest genomic landmark.
    type: str
    optional: True
    example: True
Keyword argument: `target_file`
    doc: path to the targets (txt) file
    type: str
    optional: True
    example: example_files/targets.tsv
Keyword argument: `use_linecache`
    doc: if True, use linecache https://docs.python.org/3/library/linecache.html to access bed file rows
    type: str
    optional: True
--------------------------------------------------------------------------------
Example keyword arguments are: {'intervals_file': 'example_files/intervals.bed', 'fasta_file': 'example_files/hg38_chr22.fa', 'gtf_file': 'example_files/gencode.v24.annotation_chr22.gtf', 'filter_protein_coding': True, 'target_file': 'example_files/targets.tsv'}

Run dataloader on some examples

# each dataloader already provides example files which can be used to illustrate its use:
DataLoader.example_kwargs
{'fasta_file': 'example_files/hg38_chr22.fa',
 'filter_protein_coding': True,
 'gtf_file': 'example_files/gencode.v24.annotation_chr22.gtf',
 'intervals_file': 'example_files/intervals.bed',
 'target_file': 'example_files/targets.tsv'}
import os
# cd into the source directory 
os.chdir(DataLoader.source_dir)
!tree
.
├── custom_keras_objects.py -> ../template/custom_keras_objects.py
├── dataloader_files
│   └── position_transformer.pkl
├── dataloader.py -> ../template/dataloader.py
├── dataloader.yaml -> ../template/dataloader.yaml
├── example_files -> ../template/example_files
├── model_files
│   └── model.h5
├── model.yaml -> ../template/model.yaml
└── __pycache__
    ├── custom_keras_objects.cpython-36.pyc
    └── dataloader.cpython-36.pyc

4 directories, 8 files
dl = DataLoader(**DataLoader.example_kwargs)
# could be also done with DataLoader.init_example()
# This particular dataloader is of type Dataset
# i.e. it implements the __getitem__ method:
dl[0].keys()
dict_keys(['inputs', 'targets', 'metadata'])
dl[0]["inputs"]["seq"][:5]
array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]], dtype=float32)
dl[0]["inputs"]["seq"][:5]
array([[0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25],
       [0.25, 0.25, 0.25, 0.25]], dtype=float32)
len(dl)
14

Get the whole dataset

whole_data = dl.load_all()
100%|██████████| 1/1 [00:00<00:00,  6.24it/s]
whole_data.keys()
dict_keys(['inputs', 'targets', 'metadata'])
whole_data["inputs"]["seq"].shape
(14, 101, 4)

Get the iterator to run predictions

it = dl.batch_iter(batch_size=1, shuffle=False, num_workers=0, drop_last=False)
next(it)["inputs"]["seq"].shape
(1, 101, 4)
model.predict_on_batch(next(it)["inputs"])
array([[0.1351]], dtype=float32)

Pipeline

Pipeline object will take the dataloader arguments and run the whole pipeline directly:

dataloader arguments --Dataloader-->  numpy arrays --Model--> prediction
example_kwargs = model.default_dataloader.example_kwargs
preds = model.pipeline.predict_example()
preds
100%|██████████| 1/1 [00:00<00:00,  6.78it/s]





array([[0.4208],
       [0.0005],
       [0.0005],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.0005],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208]], dtype=float32)
model.pipeline.predict(example_kwargs)
1it [00:01,  1.56s/it]





array([0.4208, 0.0005, 0.0005, 0.4208, 0.4208, 0.4208, 0.0005, 0.4208, 0.4208, 0.4208, 0.4208,
       0.4208, 0.4208, 0.4208], dtype=float32)
next(model.pipeline.predict_generator(example_kwargs, batch_size=2))
array([[0.4208],
       [0.0005]], dtype=float32)
from kipoi.data_utils import numpy_collate
numpy_collate_concat(list(model.pipeline.predict_generator(example_kwargs)))
array([[0.4208],
       [0.0005],
       [0.0005],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.0005],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208],
       [0.4208]], dtype=float32)

Re-train the Keras model

Keras model is stored under the .model attribute.

model.model.compile("adam", "binary_crossentropy")
train_it = dl.batch_train_iter(batch_size=2)
# model.model.summary()
model.model.fit_generator(train_it, steps_per_epoch=3, epochs=1)
Epoch 1/1
3/3 [==============================] - 1s 291ms/step - loss: 1.3592





<keras.callbacks.History at 0x7f95b0095fd0>