Authors: David Kelley

Version: 0.1

License: MIT

Contributed by: Ziga Avsec

Cite as:

Trained on:

Type: tensorflow

Postprocessing: variant_effects

Sequential regulatory activity predictions with deep convolutional neural networks.

Abstract: Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.

Create a new conda environment with all dependencies installed:

```bash
kipoi env create Basenji
source activate kipoi-Basenji
```

Install model dependencies into the current environment:

```bash
kipoi env install Basenji
```

Test the model:

```bash
kipoi test Basenji --source=kipoi
```

Make a prediction:

```bash
cd ~/.kipoi/models/Basenji
kipoi predict Basenji \
  --dataloader_args='{"intervals_file": "example_files/intervals.bed", "fasta_file": "example_files/hg38_chr22.fa"}' \
  -o '/tmp/Basenji.example_pred.tsv'
# check the results
head '/tmp/Basenji.example_pred.tsv'
```
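The TSV written by `kipoi predict` can be post-processed in Python. A minimal sketch with the standard `csv` module, using a mock two-line file since the exact metadata columns and the number of prediction columns depend on the dataloader output:

```python
import csv
import io

# Mock of the first lines of a Kipoi prediction TSV; column names here are
# illustrative, not the exact ones Basenji emits.
mock_tsv = (
    "metadata/ranges/chr\tmetadata/ranges/start\tpreds/0\tpreds/1\n"
    "chr22\t16000000\t0.12\t0.53\n"
)

reader = csv.DictReader(io.StringIO(mock_tsv), delimiter="\t")
rows = list(reader)
# keep only the prediction columns
pred_cols = [c for c in reader.fieldnames if c.startswith("preds/")]
first_preds = [float(rows[0][c]) for c in pred_cols]
print(pred_cols, first_preds)
```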
Get the model:

```python
import kipoi
model = kipoi.get_model('Basenji')
```

Make a prediction for example files:

```python
pred = model.pipeline.predict_example()
```

Use dataloader and model separately:

```python
import os; os.chdir(os.path.expanduser('~/.kipoi/models/Basenji'))
# set up the example dataloader kwargs
dl_kwargs = {'intervals_file': 'example_files/intervals.bed', 'fasta_file': 'example_files/hg38_chr22.fa'}
# get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator (note: this dataloader requires batch_size=2)
it = dl.batch_iter(batch_size=2)
# predict for a batch
batch = next(it)
pred = model.predict_on_batch(batch['inputs'])
```

Make predictions for custom files directly:

```python
pred = model.pipeline.predict(dl_kwargs, batch_size=2)
```
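The batch-iterator pattern above extends naturally to whole files: loop over the iterator and collect per-batch predictions. A self-contained sketch of that loop, with stub functions standing in for `dl.batch_iter(...)` and `model.predict_on_batch(...)` so it runs without the model:

```python
# Stub batches standing in for dl.batch_iter(batch_size=2); each batch dict
# mirrors the dataloader output with an 'inputs' key.
def fake_batch_iter():
    for start in (0, 2):
        yield {"inputs": [[start], [start + 1]]}  # two examples per batch

def fake_predict_on_batch(inputs):
    # stand-in for model.predict_on_batch: one prediction per example
    return [row[0] * 10 for row in inputs]

preds = []
for batch in fake_batch_iter():
    preds.extend(fake_predict_on_batch(batch["inputs"]))
print(preds)  # predictions for all examples, in batch order
```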
Get the model:

```R
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Basenji')
```

Make a prediction for example files:

```R
predictions <- model$pipeline$predict_example()
```

Use dataloader and model separately:

```R
# Get the dataloader
dl <- model$default_dataloader(intervals_file='example_files/intervals.bed', fasta_file='example_files/hg38_chr22.fa')
# get a batch iterator
it <- dl$batch_iter(batch_size=2)
# predict for a batch
batch <- iter_next(it)
pred <- model$predict_on_batch(batch$inputs)
```

Make predictions for custom files directly:

```R
dl_kwargs <- list(intervals_file='example_files/intervals.bed', fasta_file='example_files/hg38_chr22.fa')
pred <- model$pipeline$predict(dl_kwargs, batch_size=2)
```



Single numpy array

Name: seq

    Shape: (131072, 4)

    Doc:
      • one-hot encoded DNA sequence
      • 4096bp starting flank sequence
      • 122880bp core sequence (960 × 128), predicted by the model in 128bp bins
      • 4096bp end flank sequence
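The one-hot layout above can be sketched with a minimal encoder producing the `(sequence_length, 4)` shape; the A, C, G, T column order here is an assumption for illustration:

```python
BASES = "ACGT"  # assumed column order for the 4 channels

def one_hot(seq):
    """Encode a DNA string as a list of 4-element 0/1 rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq.upper()]

encoded = one_hot("ACGTN")  # unknown bases become all-zero rows
print(len(encoded), encoded[0], encoded[4])
```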


Single numpy array

Name: genomic_features

    Shape: (960, 4229) 

    Doc:
      • 960 bins corresponding to 128bp regions on the input sequence
      • 4229 different output tracks ordered according to
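Given the layout described for the input (4096bp start flank, 960 × 128bp core bins, 4096bp end flank), each output bin maps back to a 128bp genomic region. A small sketch of that arithmetic (the helper name is illustrative):

```python
# Layout constants from the schema above.
FLANK, BIN_SIZE, N_BINS = 4096, 128, 960

def bin_to_interval(interval_start, bin_idx):
    """Return (start, end) of the 128bp region covered by output bin bin_idx."""
    assert 0 <= bin_idx < N_BINS
    start = interval_start + FLANK + bin_idx * BIN_SIZE
    return start, start + BIN_SIZE

first = bin_to_interval(0, 0)
last = bin_to_interval(0, N_BINS - 1)
# flanks + core add up to the 131072bp input length
assert 2 * FLANK + N_BINS * BIN_SIZE == 131072
print(first, last)
```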


Relative path: .

Version: 0.1

Doc: Dataloader for the Basenji model. Note: the batch size must always be 2!

Authors: Ziga Avsec

Type: Dataset

License: MIT


intervals_file : BED6 file with columns `chrom start end id score strand`

fasta_file : Reference genome sequence

use_linecache (optional): if True, use linecache to access bed file rows

Model dependencies
  • python=3.5

  • tensorflow>=1.4.1

Dataloader dependencies
  • bioconda::genomelake
  • bioconda::pybedtools
  • python=3.5
  • numpy
  • cython