D00279.001_RNAcompete_Rbm47

Authors: Babak Alipanahi, Andrew Delong, Matthew T Weirauch, Brendan J Frey

License: BSD 3-Clause

Contributed by: Johnny Israeli

Cite as: https://doi.org/10.1038/nbt.3300

Trained on: ?All chromosomes? Data from protein binding microarrays (Mukherjee et al., 2004), RNAcompete assays (Ray et al., 2009), ChIP-seq (Kharchenko et al., 2008), and HT-SELEX (Jolma et al., 2010)

Abstract: Knowing the sequence specificities of DNA- and RNA-binding proteins is essential for developing models of the regulatory processes in biological systems and for identifying causal disease variants. Here we show that sequence specificities can be ascertained from experimental data with 'deep learning' techniques, which offer a scalable, flexible and unified computational approach for pattern discovery. Using a diverse array of experimental data and evaluation metrics, we find that deep learning outperforms other state-of-the-art methods, even when training on in vitro data and testing on in vivo data. We call this approach DeepBind and have built a stand-alone software tool that is fully automatic and handles millions of sequences per experiment. Specificities determined by DeepBind are readily visualized as a weighted ensemble of position weight matrices or as a 'mutation map' that indicates how variations affect binding within a specific sequence.

Create a new conda environment with all dependencies installed

kipoi env create DeepBind
source activate kipoi-DeepBind

Test the model

kipoi test DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 --source=kipoi

Make a prediction

kipoi get-example DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 -o example
kipoi predict DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
  -o '/tmp/DeepBind|Xenopus_tropicalis|RBP|D00279.001_RNAcompete_Rbm47.example_pred.tsv'
# check the results
head '/tmp/DeepBind|Xenopus_tropicalis|RBP|D00279.001_RNAcompete_Rbm47.example_pred.tsv'

Get the model

import kipoi
model = kipoi.get_model('DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47')

Make a prediction for example files

pred = model.pipeline.predict_example(batch_size=4)

Use dataloader and model separately

# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
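
The loop above keeps only the prediction for the last batch. A minimal sketch of accumulating predictions across all batches is shown below. `collect_predictions` is a hypothetical helper, not part of Kipoi, and the stand-in batches and predict function exist only so the sketch runs without the model installed. It assumes each batch is a dict with an "inputs" entry and that the predict function returns one prediction per example.

```python
from typing import Any, Callable, Dict, Iterable, List


def collect_predictions(
    batch_iterator: Iterable[Dict[str, Any]],
    predict_on_batch: Callable[[Any], Iterable[Any]],
) -> List[Any]:
    """Run predict_on_batch on every batch and return all predictions in order."""
    predictions: List[Any] = []
    for batch in batch_iterator:
        # Each batch is assumed to be a dict with an "inputs" entry,
        # as in the loop above.
        predictions.extend(predict_on_batch(batch["inputs"]))
    return predictions


def fake_predict(inputs: List[int]) -> List[float]:
    """Stand-in for model.predict_on_batch (assumption, for illustration only)."""
    return [x * 0.5 for x in inputs]


# Stand-in for dl.batch_iter(batch_size=4): two batches, the last one partial.
fake_batches = [{"inputs": [1, 2, 3, 4]}, {"inputs": [5, 6]}]

all_preds = collect_predictions(fake_batches, fake_predict)
print(len(all_preds))
```

With the real objects, the call would presumably be `collect_predictions(dl.batch_iter(batch_size=4), model.predict_on_batch)`.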

Make predictions for custom files directly

pred = model.pipeline.predict(dl_kwargs, batch_size=4)

Get the model

library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47')

Make a prediction for example files

predictions <- model$pipeline$predict_example()

Use dataloader and model separately

# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader and instantiate it, passing the kwargs as named arguments
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)

Make predictions for custom files directly

pred <- model$pipeline$predict(dl_kwargs, batch_size=4)

Get the docker image

docker pull kipoi/kipoi-docker:sharedpy3keras2tf1-slim

Get the full sized docker image

docker pull kipoi/kipoi-docker:sharedpy3keras2tf1

Get the activated conda environment inside the container

docker run -it kipoi/kipoi-docker:sharedpy3keras2tf1-slim

Test the model

docker run kipoi/kipoi-docker:sharedpy3keras2tf1-slim kipoi test DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 --source=kipoi

Make predictions for custom files directly

# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi get-example DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 -o /app/example 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi predict DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file"}' \
-o '/app/DeepBind_Xenopus_tropicalis_RBP_D00279.001_RNAcompete_Rbm47.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/DeepBind_Xenopus_tropicalis_RBP_D00279.001_RNAcompete_Rbm47.example_pred.tsv

Install apptainer

https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps

Make predictions for custom files directly

kipoi get-example DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 -o example
kipoi predict DeepBind/Xenopus_tropicalis/RBP/D00279.001_RNAcompete_Rbm47 \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
-o 'DeepBind_Xenopus_tropicalis_RBP_D00279.001_RNAcompete_Rbm47.example_pred.tsv' \
--singularity 
# check the results
head DeepBind_Xenopus_tropicalis_RBP_D00279.001_RNAcompete_Rbm47.example_pred.tsv

Schema

Inputs

Single numpy array

Name: seq

Shape: (101, 4)

Doc: DNA sequence
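
To illustrate the (101, 4) input shape, here is a minimal one-hot encoding sketch in plain Python. The column order A, C, G, T and the all-zero row for unknown bases are assumptions made for illustration; the dataloader may encode ambiguous bases differently, and `one_hot_encode` is a hypothetical helper, not a Kipoi function.

```python
from typing import List

ALPHABET = "ACGT"  # assumed column order
SEQ_LEN = 101      # sequence length from the input shape (101, 4)


def one_hot_encode(seq: str) -> List[List[float]]:
    """Encode a 101-bp sequence as 101 rows of 4 values (one column per base)."""
    seq = seq.upper()
    if len(seq) != SEQ_LEN:
        raise ValueError(f"expected {SEQ_LEN} bases, got {len(seq)}")
    encoded = []
    for base in seq:
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:
            row[idx] = 1.0
        # Bases outside ACGT (e.g. N) stay all-zero in this sketch.
        encoded.append(row)
    return encoded


# 100 bases of repeated ACGT plus one unknown base gives 101 positions.
example = one_hot_encode("ACGT" * 25 + "N")
print(len(example), len(example[0]))
```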

Targets

Single numpy array

Name: binding_prob

Shape: (1,)

Doc: Protein binding probability

Dataloader

Defined as: kipoiseq.dataloaders.SeqIntervalDl

Doc: Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited `intervals_file` and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: `alphabet_axis` and `dummy_axis`.

Authors: Ziga Avsec, Roman Kreuzhuber

Type: Dataset

License: MIT

Arguments

intervals_file : path to a bed3+<columns> file containing intervals and, optionally, labels

fasta_file : Reference genome FASTA file path.

num_chr_fasta (optional): if True, the dataloader will make sure that the chromosome names don't start with chr.

label_dtype (optional): datatype of the task labels taken from the intervals_file (default: None). Example: str, int, float, np.float32

use_strand (optional): if True, reverse-complement the fasta sequence when the bed file defines a negative strand. Requires a bed6 file

ignore_targets (optional): if True, don't return any target variables
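
As a rough sketch of preparing input for the arguments above, the snippet below builds bed6 rows for 101-bp intervals and shows what reverse-complementing a negative-strand sequence means. It assumes intervals should match the 101-bp input length of the model. The helper names, chromosome names and positions are hypothetical, and the output is written to an in-memory buffer rather than a real `intervals_file`.

```python
import csv
import io
from typing import List, Tuple

WINDOW = 101  # sequence length taken from the model input shape (101, 4)

BedRow = Tuple[str, int, int, str, str, str]


def centered_interval(chrom: str, center: int, strand: str = "+") -> BedRow:
    """Return a bed6 row for a WINDOW-bp interval centered on a 0-based position."""
    start = center - WINDOW // 2
    if start < 0:
        raise ValueError("interval would start before the beginning of the sequence")
    # BED coordinates are 0-based and half-open, so end - start == WINDOW.
    return (chrom, start, start + WINDOW, ".", "0", strand)


def write_bed(rows: List[BedRow], handle) -> None:
    """Write rows as tab-delimited text, one interval per line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)


_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the order, as done for '-' strand intervals."""
    return seq.translate(_COMPLEMENT)[::-1]


# Hypothetical intervals, for illustration only.
rows = [centered_interval("chr1", 1000, "+"), centered_interval("chr2", 500, "-")]
buffer = io.StringIO()
write_bed(rows, buffer)
bed_text = buffer.getvalue()
print(bed_text, end="")
print(reverse_complement("AACG"))
```

To produce a file for `intervals_file`, the same `write_bed` call can be pointed at a handle opened with `open(path, "w", newline="")`.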

Model dependencies

conda:

h5py=2.10.0
tensorflow=2.7.0
keras=2.7.0
python=3.7
bioconda::pysam=0.18.0
pip=20.2.4

pip:

Dataloader dependencies

conda:

bioconda::pybedtools
bioconda::pyfaidx
bioconda::pyranges
numpy
pandas

pip:

kipoiseq