Authors: David R. Kelley , Jasper Snoek , John L. Rinn

License: MIT

Contributed by: Roman Kreuzhuber

Cite as:

Type: pytorch

Postprocessing: variant_effects

Trained on: 2,071,886 total sites, of which 71,886 were randomly reserved for testing and 70,000 for validation, leaving 1,930,000 for training.
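The split sizes stated above can be checked, and a comparable random partition sketched, in a few lines of Python. This is only an illustration: the use of a NumPy permutation, the seed, and the scaled-down site counts in the demonstration are assumptions, since the original sampling procedure is not described here.

```python
import numpy as np

# Split sizes as stated in the model description.
TOTAL, N_TEST, N_VALID = 2_071_886, 71_886, 70_000
N_TRAIN = TOTAL - N_TEST - N_VALID  # sites remaining for training


def random_split(n_total, n_test, n_valid, seed=0):
    """Randomly partition site indices into training, validation and test sets.

    Assumption: a uniform random permutation of site indices; the
    original sampling procedure is not documented here.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_total)
    test = order[:n_test]
    valid = order[n_test:n_test + n_valid]
    train = order[n_test + n_valid:]
    return train, valid, test


# Scaled-down demonstration with hypothetical sizes so it runs instantly.
train_idx, valid_idx, test_idx = random_split(1_000, 35, 34)
print(N_TRAIN, len(train_idx), len(valid_idx), len(test_idx))
```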

Source files

This is the Basset model published by David Kelley, converted to PyTorch by Roman Kreuzhuber. For each input region it predicts the probability of accessible chromatin in each of 164 cell types (ENCODE project and Roadmap Epigenomics Consortium). The training data were generated using DNase-seq. The model takes a 600 bp sequence as input. The input tensor must have the shape (N, 4, 600, 1) for N samples, 4 nucleotides and a 600 bp window. For each sample, 164 probabilities of accessible chromatin are predicted.
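To make the (N, 4, 600, 1) input layout concrete, here is a minimal NumPy sketch of one-hot encoding. It is not the dataloader's implementation: the ACGT channel order and the all-zero encoding of N or other ambiguity codes are assumptions made for illustration, and in normal use the default dataloader produces this array for you.

```python
import numpy as np

ALPHABET = "ACGT"  # assumed channel order
SEQ_LEN = 600      # window size stated in the model description


def one_hot_batch(seqs, seq_len=SEQ_LEN):
    """Encode DNA strings as a float32 array of shape (N, 4, seq_len, 1)."""
    index = {base: i for i, base in enumerate(ALPHABET)}
    out = np.zeros((len(seqs), len(ALPHABET), seq_len, 1), dtype=np.float32)
    for n, seq in enumerate(seqs):
        if len(seq) != seq_len:
            raise ValueError(
                f"sequence {n} has length {len(seq)}, expected {seq_len}"
            )
        for pos, base in enumerate(seq.upper()):
            channel = index.get(base)  # None for N or other ambiguity codes
            if channel is not None:
                out[n, channel, pos, 0] = 1.0
    return out


seqs = ["ACGT" * 150, "N" * 600]
x = one_hot_batch(seqs)
print(x.shape)  # (2, 4, 600, 1)
```

The trailing axis of size 1 is a dummy axis; it carries no information but is needed to match the expected shape.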

Create a new conda environment with all dependencies installed
kipoi env create Basset
source activate kipoi-Basset
Install model dependencies into current environment
kipoi env install Basset
Test the model
kipoi test Basset --source=kipoi
Make a prediction
kipoi get-example Basset -o example
kipoi predict Basset \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
  -o '/tmp/Basset.example_pred.tsv'
# check the results
head '/tmp/Basset.example_pred.tsv'
Get the model
import kipoi
model = kipoi.get_model('Basset')
Make a prediction for example files
pred = model.pipeline.predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
it = dl.batch_iter(batch_size=4)
# predict for a batch
batch = next(it)
model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Basset')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader and instantiate it (pass the kwargs as named arguments)
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
Get the docker image
docker pull haimasree/kipoi-docker:sharedpy3keras2
Get the activated conda environment inside the container
docker run -it haimasree/kipoi-docker:sharedpy3keras2
Test the model
docker run haimasree/kipoi-docker:sharedpy3keras2 kipoi test Basset --source=kipoi
Make prediction for custom files directly
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ haimasree/kipoi-docker:sharedpy3keras2 \
kipoi get-example Basset -o /app/example 
docker run -v $PWD/kipoi-example:/app/ haimasree/kipoi-docker:sharedpy3keras2 \
kipoi predict Basset \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file"}' \
-o '/app/Basset.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/Basset.example_pred.tsv



Inputs: single numpy array

Name: seq

    Shape: (4, 600, 1) 

    Doc: DNA sequence


Targets: single numpy array

Name: DHS_probs

    Shape: (164,) 

    Doc: Probability of accessible chromatin in 164 cell types
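The metadata lists variant_effects as postprocessing. As a rough illustration of what a variant effect score on the 164 output probabilities could look like, the sketch below compares predictions for a reference and an alternative allele. It does not use the Kipoi variant effect API: the two scores (plain difference and logit difference), the function names, and the random stand-in predictions are all assumptions.

```python
import numpy as np

N_CELL_TYPES = 164  # length of the DHS_probs output


def logit(p, eps=1e-7):
    """Log-odds of a probability, clipped to avoid infinities."""
    p = np.clip(p, eps, 1.0 - eps)
    return np.log(p) - np.log1p(-p)


def variant_effect(ref_probs, alt_probs):
    """Return per-cell-type (difference, logit difference) for alt vs ref."""
    ref_probs = np.asarray(ref_probs, dtype=np.float64)
    alt_probs = np.asarray(alt_probs, dtype=np.float64)
    return alt_probs - ref_probs, logit(alt_probs) - logit(ref_probs)


# Stand-in predictions (random numbers, not real model output).
rng = np.random.default_rng(0)
ref = rng.uniform(0.05, 0.95, size=N_CELL_TYPES)
alt = np.clip(ref + rng.normal(0.0, 0.02, size=N_CELL_TYPES), 0.01, 0.99)

diff, logit_diff = variant_effect(ref, alt)
# Indices of the five cell types with the largest absolute logit change.
top = np.argsort(-np.abs(logit_diff))[:5]
print(diff.shape, top)
```

In real use, ref and alt would be the model outputs for two 600 bp sequences that differ only at the variant position.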


Defined as: kipoiseq.dataloaders.SeqIntervalDl

Doc: Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited `intervals_file` and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: `alphabet_axis` and `dummy_axis`.

Authors: Ziga Avsec , Roman Kreuzhuber

Type: Dataset

License: MIT
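The dataloader described above reads regions from a tab-delimited intervals_file. The following sketch writes a bed3 file with 600 bp windows centered on positions of interest, using only the standard library. The helper names and the example positions are hypothetical, and the sketch does not check that a window stays within the chromosome end; verify against your reference genome before use.

```python
import csv
import os
import tempfile

WINDOW = 600  # input length expected by the model


def centered_interval(chrom, center, window=WINDOW):
    """Return a BED-style (chrom, start, end) of width `window` around `center`.

    BED coordinates are 0-based and half-open. A window that would start
    before position 0 is shifted right so that it keeps its full width.
    """
    start = max(0, center - window // 2)
    return chrom, start, start + window


def write_bed3(path, sites):
    """Write (chrom, center) sites as tab-delimited bed3 rows."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, center in sites:
            writer.writerow(centered_interval(chrom, center))


sites = [("chr1", 1_000_000), ("chr2", 100)]  # hypothetical positions
bed_path = os.path.join(tempfile.mkdtemp(), "intervals.bed")
write_bed3(bed_path, sites)

with open(bed_path) as handle:
    bed_text = handle.read()
print(bed_text)
```

The resulting file path can then be passed as the intervals_file argument together with a matching fasta_file.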


intervals_file : bed3+<columns> file path containing intervals + (optionally) labels

fasta_file : Reference genome FASTA file path.

num_chr_fasta (optional): if True, the dataloader will make sure that the chromosomes don't start with chr.

label_dtype (optional): None, datatype of the task labels taken from the intervals_file. Example: str, int, float, np.float32

use_strand (optional): reverse-complement fasta sequence if bed file defines negative strand. Requires a bed6 file

ignore_targets (optional): if True, don't return any target variables
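Two of the options above, use_strand and num_chr_fasta, describe simple transformations that can be sketched in plain Python. This is an illustration of the ideas, not the dataloader's code: the function names are hypothetical, and the one-hot version assumes an ACGT channel order on the first axis.

```python
import numpy as np

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse-complement a DNA string, as needed for negative-strand intervals."""
    return seq.translate(COMPLEMENT)[::-1]


def reverse_complement_one_hot(x):
    """Reverse-complement a (4, L, 1) one-hot array, assuming ACGT channel order.

    Flipping the channel axis swaps A with T and C with G; flipping the
    length axis reverses the sequence.
    """
    return x[::-1, ::-1, :]


def strip_chr(chrom):
    """Remove a leading 'chr' so names match a FASTA with numeric chromosomes."""
    return chrom[3:] if chrom.startswith("chr") else chrom


# Small worked example on a 4 bp sequence.
seq = "AACG"
x = np.zeros((4, len(seq), 1), dtype=np.float32)
for pos, base in enumerate(seq):
    x["ACGT".index(base), pos, 0] = 1.0

rc = reverse_complement_one_hot(x)
decoded = "".join("ACGT"[int(np.argmax(rc[:, pos, 0]))] for pos in range(len(seq)))
print(reverse_complement(seq), decoded, strip_chr("chr1"))  # CGTT CGTT 1
```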

Model dependencies
  • python=3.6
  • h5py=2.10.0
  • _pytorch_select=0.2=gpu_0
  • pytorch=1.3.1=cuda100py36h53c1284_0
  • pip=20.3.3
  • pysam=0.15.3

  • kipoiseq

Dataloader dependencies
  • bioconda::pybedtools
  • bioconda::pyfaidx
  • numpy
  • pandas

  • kipoiseq