Basset

Authors: David R. Kelley, Jasper Snoek, John L. Rinn

Version: 0.1.0

License: MIT

Contributed by: Roman Kreuzhuber

Cite as: https://doi.org/10.1101/gr.200535.115

Trained on:

Type: pytorch

Postprocessing: variant_effects

This is the Basset model published by David Kelley, converted to PyTorch by Roman Kreuzhuber. It predicts the probability of accessible chromatin in 164 cell types; the training data were generated with DNase-seq. The model takes 600 bp sequences as input: the input tensor must have shape (N, 4, 600, 1) for N samples, a 600 bp window, and 4 one-hot-encoded nucleotides. For each sample, 164 probabilities of accessible chromatin are predicted.
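The bundled dataloader produces this encoding for you, but the expected input layout can be illustrated with a small sketch. The helper below is hypothetical (not part of Kipoi), and the A/C/G/T channel order is an assumption:

```python
import numpy as np

# Hypothetical helper: one-hot encode DNA into the (N, 4, 600, 1) layout
# the model expects. Channel order A, C, G, T is an assumption here.
BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_600bp(seqs):
    """Encode a list of 600 bp strings as float32 of shape (N, 4, 600, 1)."""
    out = np.zeros((len(seqs), 4, 600, 1), dtype=np.float32)
    for n, seq in enumerate(seqs):
        assert len(seq) == 600, "Basset expects exactly 600 bp of input"
        for pos, base in enumerate(seq.upper()):
            idx = BASE_INDEX.get(base)  # unknown bases (e.g. N) stay all-zero
            if idx is not None:
                out[n, idx, pos, 0] = 1.0
    return out

batch = one_hot_600bp(["ACGT" * 150])  # one 600 bp sequence
print(batch.shape)  # (1, 4, 600, 1)
```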

Create a new conda environment with all dependencies installed
kipoi env create Basset
source activate kipoi-Basset
Install model dependencies into current environment
kipoi env install Basset
Test the model
kipoi test Basset --source=kipoi
Make a prediction
cd ~/.kipoi/models/Basset
kipoi predict Basset \
  --dataloader_args='{"intervals_file": "example_files/intervals.bed", "fasta_file": "example_files/hg38_chr22.fa"}' \
  -o '/tmp/Basset.example_pred.tsv'
# check the results
head '/tmp/Basset.example_pred.tsv'
Get the model
import kipoi
model = kipoi.get_model('Basset')
Make a prediction for example files
pred = model.pipeline.predict_example()
Use dataloader and model separately
# setup the example dataloader kwargs
dl_kwargs = {'intervals_file': 'example_files/intervals.bed', 'fasta_file': 'example_files/hg38_chr22.fa'}
import os; os.chdir(os.path.expanduser('~/.kipoi/models/Basset'))
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
it = dl.batch_iter(batch_size=4)
# predict for a batch
batch = next(it)
model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
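Both `predict` and `predict_on_batch` return an array of shape (N, 164) with one accessibility probability per cell type. A quick post-processing sketch, using a mock prediction array in place of real model output (cell-type names would come from the model's target annotation, which is not shown here):

```python
import numpy as np

# Mock predictions standing in for model output: shape (N, 164),
# one accessibility probability per cell type.
rng = np.random.default_rng(0)
pred = rng.uniform(size=(4, 164))

# For each input region, indices of the 5 cell types with the
# highest predicted accessibility.
top5 = np.argsort(pred, axis=1)[:, ::-1][:, :5]
print(top5.shape)  # (4, 5)

# Flag regions predicted accessible (prob > 0.5) in any cell type.
accessible_anywhere = (pred > 0.5).any(axis=1)
```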
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Basset')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Get the dataloader
setwd('~/.kipoi/models/Basset')
dl <- model$default_dataloader(intervals_file='example_files/intervals.bed', fasta_file='example_files/hg38_chr22.fa')
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
dl_kwargs <- list(intervals_file='example_files/intervals.bed', fasta_file='example_files/hg38_chr22.fa')
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)

Schema

Inputs

Single numpy array

Name: seq

    Shape: (4, 600, 1) 

    Doc: DNA sequence


Targets

Single numpy array

Name: DHS_probs

    Shape: (164,) 

    Doc: Probability of accessible chromatin in 164 cell types


Dataloader

Relative path: .

Version: 0.1

Doc: Dataloader for the Basset model.

Authors: Lara Urban, Ziga Avsec, Roman Kreuzhuber

Type: Dataset

License: MIT


Arguments

intervals_file : BED file with columns `chrom start end id score strand`

fasta_file : Reference genome sequence

target_file (optional): path to the targets (.tsv) file

use_linecache (optional): if True, use linecache (https://docs.python.org/3/library/linecache.html) to access BED file rows
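The `use_linecache` option speeds up repeated single-row lookups in a large intervals file. A minimal illustration of the underlying mechanism, using a throwaway BED file whose contents are made up for this example:

```python
import linecache
import os
import tempfile

# Write a tiny throwaway BED6 file (contents are illustrative only).
bed = tempfile.NamedTemporaryFile(mode="w", suffix=".bed", delete=False)
bed.write("chr22\t100\t700\tr1\t0\t+\n")
bed.write("chr22\t1000\t1600\tr2\t0\t-\n")
bed.close()

# linecache fetches a single 1-indexed line and caches the file,
# so random access to rows avoids re-reading from disk each time.
row = linecache.getline(bed.name, 2).rstrip("\n").split("\t")
chrom, start, end = row[0], int(row[1]), int(row[2])
print(chrom, start, end)  # chr22 1000 1600

os.unlink(bed.name)
```

Note that each interval spans 600 bp, matching the model's fixed input window.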


Model dependencies
conda:
  • python=3.5
  • h5py
  • pytorch::pytorch-cpu>=0.2.0

pip:

Dataloader dependencies
conda:
  • bioconda::genomelake
  • bioconda::pybedtools
  • python=3.5
  • numpy
  • pandas
  • cython

pip: