DeepSEA/beluga

Authors: Jian Zhou, Olga G. Troyanskaya

License: Non-commercial

Contributed by: Jian Zhou, Olga G. Troyanskaya

Cite as: https://doi.org/10.1038/s41588-018-0160-6

Type: None

Postprocessing: variant_effects

Trained on: Chromosomes 8 and 9 were excluded from training, and the remaining autosomes were used for training and validation. 4,000 samples on chromosome 7, spanning the genomic coordinates 30,508,751-35,296,850, were used as the validation set.
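
The split described above can be written as a small rule. The following is a minimal sketch, not the authors' code: the function name `assign_split` and the label strings are hypothetical, chromosome names are assumed to carry a `chr` prefix, and the coordinate range is assumed to be inclusive at both ends. The source states that 4,000 samples within the chromosome 7 range formed the validation set; this sketch simplifies that to "any position inside the range".

```python
# Hypothetical helper illustrating the chromosome-based split described above.
VALIDATION_CHROM = "chr7"
VALIDATION_START = 30_508_751   # assumed inclusive
VALIDATION_END = 35_296_850     # assumed inclusive
HELD_OUT = {"chr8", "chr9"}     # excluded from training
AUTOSOMES = {f"chr{i}" for i in range(1, 23)}


def assign_split(chrom, pos):
    """Return 'held_out', 'validation', 'training', or 'unused' for a position."""
    if chrom in HELD_OUT:
        return "held_out"
    if chrom not in AUTOSOMES:
        # Only autosomes are mentioned in the description.
        return "unused"
    if chrom == VALIDATION_CHROM and VALIDATION_START <= pos <= VALIDATION_END:
        return "validation"
    return "training"


print(assign_split("chr7", 31_000_000))
print(assign_split("chr8", 1_000_000))
```

A rule like this is mainly useful for checking whether a region you want to evaluate overlaps data the model may have seen during training.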

This model (DeepSEA Beluga) is part of the ExPecto model for predicting expression from sequence. The model itself is a deep convolutional network that predicts TF binding, DNase accessibility, and histone marks. Compared to DeepSEA, this model has twice as many convolution layers, takes a 2000 bp input, and expands the histone mark collection to the full Roadmap Epigenomics release.

Create a new conda environment with all dependencies installed
kipoi env create DeepSEA/beluga
source activate kipoi-DeepSEA__beluga
Test the model
kipoi test DeepSEA/beluga --source=kipoi
Make a prediction
kipoi get-example DeepSEA/beluga -o example
kipoi predict DeepSEA/beluga \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
  -o '/tmp/DeepSEA|beluga.example_pred.tsv'
# check the results
head '/tmp/DeepSEA|beluga.example_pred.tsv'
Create a new conda environment with all dependencies installed
kipoi env create DeepSEA/beluga
source activate kipoi-DeepSEA__beluga
Get the model
import kipoi
model = kipoi.get_model('DeepSEA/beluga')
Make a prediction for example files
pred = model.pipeline.predict_example(batch_size=4)
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('DeepSEA/beluga')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader
# do.call unpacks the named list into keyword arguments
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
Get the docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf2-slim
Get the full sized docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf2
Get the activated conda environment inside the container
docker run -it kipoi/kipoi-docker:sharedpy3keras2tf2-slim
Test the model
docker run kipoi/kipoi-docker:sharedpy3keras2tf2-slim kipoi test DeepSEA/beluga --source=kipoi
Make prediction for custom files directly
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf2-slim \
kipoi get-example DeepSEA/beluga -o /app/example 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf2-slim \
kipoi predict DeepSEA/beluga \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file"}' \
-o '/app/DeepSEA_beluga.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/DeepSEA_beluga.example_pred.tsv
Install apptainer
https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
Make prediction for custom files directly
kipoi get-example DeepSEA/beluga -o example
kipoi predict DeepSEA/beluga \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
-o 'DeepSEA_beluga.example_pred.tsv' \
--singularity 
# check the results
head DeepSEA_beluga.example_pred.tsv

Schema

Inputs

Single numpy array

Name: seq

    Shape: (4, 1, 2000) 

    Doc: DNA sequence
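
To make the `(4, 1, 2000)` input shape concrete, here is a minimal sketch of one-hot encoding a 2000 bp sequence by hand. It is an illustration, not the dataloader's implementation: the function name `one_hot_encode` is hypothetical, the channel order `A, C, G, T` is an assumption, and unknown bases such as `N` are simply left as all-zero columns here, whereas the real dataloader may fill in a different neutral value.

```python
import numpy as np

ALPHABET = "ACGT"   # assumed channel order
SEQ_LEN = 2000      # input length stated in the schema


def one_hot_encode(seq):
    """Encode a DNA string as a float array of shape (4, 1, SEQ_LEN)."""
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} bases, got {len(seq)}")
    encoded = np.zeros((len(ALPHABET), 1, SEQ_LEN), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = ALPHABET.find(base)
        if idx >= 0:  # bases outside the alphabet stay all-zero (assumption)
            encoded[idx, 0, pos] = 1.0
    return encoded


example = one_hot_encode("ACGT" * 500)
print(example.shape)
```

In normal use the dataloader produces this array for you; the sketch is only meant to show what the axes contain (alphabet, dummy axis, sequence position).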


Targets

Single numpy array

Name: TFBS_DHS_probs

    Shape: (2002,) 

    Doc: Probability of a specific epigenetic feature
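
Each prediction is therefore a vector of 2002 per-feature probabilities. As a minimal sketch of how such a vector might be inspected, the snippet below ranks features by predicted probability. The helper name `top_features` is hypothetical, and the input is a random placeholder standing in for one row of model output, not real predictions; mapping indices back to feature names is left out because the schema above does not list them.

```python
import numpy as np

N_FEATURES = 2002  # output length stated in the schema


def top_features(probs, k=5):
    """Return the k (index, probability) pairs with the highest probability."""
    probs = np.asarray(probs)
    if probs.shape != (N_FEATURES,):
        raise ValueError(f"expected shape ({N_FEATURES},), got {probs.shape}")
    order = np.argsort(probs)[::-1][:k]  # indices sorted by descending probability
    return [(int(i), float(probs[i])) for i in order]


# Placeholder values only; in practice use one row of the model's output.
rng = np.random.default_rng(0)
fake_probs = rng.random(N_FEATURES)
top = top_features(fake_probs, k=3)
print(top)
```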


Dataloader

Defined as: kipoiseq.dataloaders.SeqIntervalDl

Doc: Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited `intervals_file` and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: `alphabet_axis` and `dummy_axis`.

Authors: Ziga Avsec, Roman Kreuzhuber

Type: Dataset

License: MIT


Arguments

intervals_file : bed3+<columns> file path containing intervals + (optionally) labels

fasta_file : Reference genome FASTA file path.

num_chr_fasta (optional): if True, the dataloader will make sure that the chromosomes don't start with chr.

label_dtype (optional): None, datatype of the task labels taken from the intervals_file. Example: str, int, float, np.float32

use_strand (optional): reverse-complement fasta sequence if bed file defines negative strand. Requires a bed6 file

ignore_targets (optional): if True, don't return any target variables
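
For custom predictions, the `intervals_file` needs one interval per input sequence. The following is a minimal sketch of writing such a file, assuming that each interval should be exactly 2000 bp wide to match the model input and that coordinates follow the usual BED convention (0-based start, exclusive end). The helper names, the output file name, and the example positions are all hypothetical; chromosome names must match those in your FASTA file.

```python
import csv

WINDOW = 2000  # model input length in bp


def centered_interval(chrom, center, window=WINDOW):
    """Return a (chrom, start, end) BED interval of width `window` around `center`."""
    start = center - window // 2
    if start < 0:
        raise ValueError("window extends past the chromosome start")
    return chrom, start, start + window


def write_intervals(path, sites):
    """Write one tab-delimited bed3 line per (chrom, center) site."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        for chrom, center in sites:
            writer.writerow(centered_interval(chrom, center))


# Hypothetical positions of interest.
sites = [("chr1", 1_000_000), ("chr2", 2_500_000)]
write_intervals("custom_intervals.bed", sites)
```

The resulting file can then be passed as `intervals_file` alongside a matching `fasta_file` in the dataloader arguments.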


Model dependencies
conda:
  • python=3.8
  • pip=22.0.4
  • h5py=3.9.0
  • pytorch::pytorch=2.0.1
  • cython=3.0.0

pip:
  • kipoi
  • kipoiseq

Dataloader dependencies
conda:
  • bioconda::pybedtools
  • bioconda::pyfaidx
  • bioconda::pyranges
  • numpy
  • pandas

pip:
  • kipoiseq