FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF

Authors: Daniel Quang, Xiaohui Xie

License: MIT

Contributed by: Ziga Avsec

Cite as: https://doi.org/10.1101/151274

Type: keras

Postprocessing: None

Trained on:


FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data

GitHub link: https://github.com/uci-cbcl/FactorNet

Abstract: Due to the large numbers of transcription factors (TFs) and cell types, querying binding profiles of all TF/cell type pairs is not experimentally feasible, owing to constraints in time and resources. To address this issue, we developed a convolutional-recurrent neural network model, called FactorNet, to computationally impute the missing binding data. FactorNet trains on binding data from reference cell types to make accurate predictions on testing cell types by leveraging a variety of features, including genomic sequences, genome annotations, gene expression, and single-nucleotide resolution sequential signals, such as DNase I cleavage. To the best of our knowledge, this is the first deep learning method to study the rules governing TF binding at such a fine resolution. With FactorNet, a researcher can perform a single sequencing assay, such as DNase-seq, on a cell type and computationally impute dozens of TF binding profiles. This is an integral step for reconstructing the complex networks underlying gene regulation. While neural networks can be computationally expensive to train, we introduce several novel strategies to significantly reduce the overhead. By visualizing the neural network models, we can interpret how the model predicts binding, which in turn reveals additional insights into regulatory grammar. We also investigate the variables that affect cross-cell type predictive performance to explain why the model performs better on some TF/cell types than others, and offer insights to improve upon this field. Our method ranked among the top four teams in the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge.

Create a new conda environment with all dependencies installed
kipoi env create FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF
source activate kipoi-FactorNet__GABPA__metaGENCODE_RNAseq_Unique35_DGF
Test the model
kipoi test FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF --source=kipoi
Make a prediction
kipoi get-example FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF -o example
kipoi predict FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "dnase_file": "example/dnase_file", "cell_line": "PC-3", "mappability_file": "example/mappability_file"}' \
  -o '/tmp/FactorNet|GABPA|metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv'
# check the results
head '/tmp/FactorNet|GABPA|metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv'
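The `--dataloader_args` value is a JSON object passed as a single shell argument, so quoting mistakes are easy to make. The following is a minimal sketch, using only the Python standard library, that builds the option from a dict and quotes it for a POSIX shell. The helper name `build_dataloader_args` is hypothetical and not part of Kipoi.

```python
import json
import shlex


def build_dataloader_args(kwargs):
    """Return a shell-safe --dataloader_args option for `kipoi predict`.

    Hypothetical helper (not part of Kipoi): serializes the keyword
    arguments as JSON and quotes the result for a POSIX shell.
    """
    return "--dataloader_args=" + shlex.quote(json.dumps(kwargs))


# Same values as in the command-line example above
example_kwargs = {
    "intervals_file": "example/intervals_file",
    "fasta_file": "example/fasta_file",
    "dnase_file": "example/dnase_file",
    "cell_line": "PC-3",
    "mappability_file": "example/mappability_file",
}

option = build_dataloader_args(example_kwargs)
print(option)
```

The printed string can be pasted after `kipoi predict <model>`; the JSON keeps its double quotes because the whole value is wrapped in single quotes.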
Create a new conda environment with all dependencies installed
kipoi env create FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF
source activate kipoi-FactorNet__GABPA__metaGENCODE_RNAseq_Unique35_DGF
Get the model
import kipoi
model = kipoi.get_model('FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF')
Make a prediction for example files
pred = model.pipeline.predict_example(batch_size=4)
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
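The batch loop above keeps only the last `batch_pred`. The following is a minimal sketch of gathering the outputs of every batch into one list. `collect_predictions`, `fake_batches` and `fake_predict` are hypothetical names, and the two stand-ins exist only so the example runs without Kipoi installed. With a real model one would presumably pass `dl.batch_iter(batch_size=4)` and `model.predict_on_batch`, and concatenate numpy arrays instead of extending a list.

```python
def collect_predictions(batch_iterator, predict_on_batch):
    """Run predict_on_batch on every batch and gather all rows in order."""
    all_preds = []
    for batch in batch_iterator:
        batch_pred = predict_on_batch(batch["inputs"])
        all_preds.extend(batch_pred)  # keep every batch, not just the last
    return all_preds


# Stand-in for a dataloader batch iterator (hypothetical)
def fake_batches(n_rows, batch_size):
    for start in range(0, n_rows, batch_size):
        rows = list(range(start, min(start + batch_size, n_rows)))
        yield {"inputs": rows}


# Stand-in for model.predict_on_batch (hypothetical): one score per input row
def fake_predict(inputs):
    return [[0.5] for _ in inputs]


preds = collect_predictions(fake_batches(10, 4), fake_predict)
print(len(preds))
```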
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader and instantiate it
# (do.call passes the named list as keyword arguments)
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
Get the docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf1-slim
Get the full sized docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf1
Get the activated conda environment inside the container
docker run -it kipoi/kipoi-docker:sharedpy3keras2tf1-slim
Test the model
docker run kipoi/kipoi-docker:sharedpy3keras2tf1-slim kipoi test FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF --source=kipoi
Make prediction for custom files directly
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi get-example FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF -o /app/example 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi predict FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file", "dnase_file": "/app/example/dnase_file", "cell_line": "PC-3", "mappability_file": "/app/example/mappability_file"}' \
-o '/app/FactorNet_GABPA_metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/FactorNet_GABPA_metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv
    
Install apptainer
https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
Make prediction for custom files directly
kipoi get-example FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF -o example
kipoi predict FactorNet/GABPA/metaGENCODE_RNAseq_Unique35_DGF \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "dnase_file": "example/dnase_file", "cell_line": "PC-3", "mappability_file": "example/mappability_file"}' \
-o 'FactorNet_GABPA_metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv' \
--singularity 
# check the results
head FactorNet_GABPA_metaGENCODE_RNAseq_Unique35_DGF.example_pred.tsv

Schema

Inputs

List of numpy arrays

Name: seq

    Shape: (1002, 6) 

    Doc: DNA sequence and other bigWig channels (mappability and DNase)

Name: seq_rc

    Shape: (1002, 6) 

    Doc: Reverse-complemented DNA sequence, with the other bigWig channels reversed

Name: meta_features

    Shape: (14,) 

    Doc: First 8 RNA-seq principal components for the tissue, plus 6 GENCODE feature counts: cpg, cds, intron, promoter, utr5, utr3
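How `seq_rc` relates to `seq` can be sketched as follows. The sketch assumes that the first four channels are one-hot DNA in A, C, G, T order, so that complementing equals reversing those four columns, and that the last two channels are the bigWig signals. This channel order is an assumption and is not stated on this page. Plain nested lists stand in for numpy arrays, and `reverse_complement_channels` is a hypothetical helper name.

```python
def reverse_complement_channels(seq, n_dna=4):
    """Reverse positions and complement the one-hot DNA channels.

    seq: list of rows, each row = n_dna one-hot values + signal values.
    Assumes the DNA channels are ordered A, C, G, T (an assumption).
    """
    out = []
    for row in reversed(seq):        # reverse along the position axis
        dna = row[:n_dna][::-1]      # A<->T and C<->G under ACGT order
        signals = row[n_dna:]        # signal channels change position only
        out.append(dna + signals)
    return out


# Toy 3-position example (A, C, G) with two signal channels per position
seq = [
    [1, 0, 0, 0, 0.1, 1.0],  # A
    [0, 1, 0, 0, 0.2, 2.0],  # C
    [0, 0, 1, 0, 0.3, 3.0],  # G
]
seq_rc = reverse_complement_channels(seq)  # encodes C, G, T
```

Applying the function twice returns the original input, which is a quick sanity check for any reverse-complement routine.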


Targets

Single numpy array

Name: is_binding_site

    Shape: (1,) 

    Doc: TF binding class


Dataloader

Defined as: .

Doc: Dataloader for the FactorNet model.

Authors: Ziga Avsec

Type: Dataset

License: MIT


Arguments

intervals_file : bed3 file with `chrom start end id score strand`

fasta_file : Reference genome sequence

dnase_file : DNase bigwig file

cell_line (optional): Cell type as a string.

RNAseq_PC_file (optional): File path to an RNA-seq PC file computed by https://github.com/davidaknowles/tf_net/blob/master/gene_expression_pca.R. See https://github.com/uci-cbcl/FactorNet/blob/master/data/README.md.

mappability_file (optional): UCSC mappability track - http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeMapability/wgEncodeDukeMapabilityUniqueness35bp.bigWig. By default, this file is provided with the dataloader and downloaded in the background.

GENCODE_dir (optional): Path to the already pre-processed gencode files directory to compute the gencode features

use_linecache (optional): If True, use linecache (https://docs.python.org/3/library/linecache.html) to access BED file rows
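The argument list above has three required arguments and five optional ones. The following is a minimal sketch of checking a kwargs dict against that list before handing it to the dataloader. `check_dataloader_kwargs` is a hypothetical helper, not part of Kipoi, and it checks argument names only, not file contents.

```python
# Argument names taken from the list above
REQUIRED = {"intervals_file", "fasta_file", "dnase_file"}
OPTIONAL = {
    "cell_line",
    "RNAseq_PC_file",
    "mappability_file",
    "GENCODE_dir",
    "use_linecache",
}


def check_dataloader_kwargs(kwargs):
    """Return (missing required names, unrecognized names), both sorted."""
    names = set(kwargs)
    missing = sorted(REQUIRED - names)
    unknown = sorted(names - REQUIRED - OPTIONAL)
    return missing, unknown


missing, unknown = check_dataloader_kwargs({
    "intervals_file": "example/intervals_file",
    "fasta_file": "example/fasta_file",
    "cell_lines": "PC-3",  # deliberate typo to show the check
})
print(missing, unknown)
```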


Model dependencies
conda:
  • python=3.7

pip:
  • tensorflow>=1.4.1,<2.0.0
  • keras>=2.0.4,<2.2.0
  • protobuf==3.20

Dataloader dependencies
conda:
  • bioconda::bedtools
  • bioconda::pybedtools
  • bioconda::genomelake==0.1.4
  • numpy
  • cython

pip: