Xpresso/human_K562

Authors: Vikram Agarwal

License: MIT

Contributed by: Vikram Agarwal

Cite as: https://doi.org/10.1016/j.celrep.2020.107663

Type: None

Postprocessing: None

Trained on: A random subset of ~17,000 protein-coding gene promoters

A model to predict RNA expression levels from a genomic sequence

Create a new conda environment with all dependencies installed
kipoi env create Xpresso
source activate kipoi-Xpresso
Test the model
kipoi test Xpresso/human_K562 --source=kipoi
Make a prediction
kipoi get-example Xpresso/human_K562 -o example
kipoi predict Xpresso/human_K562 \
  --dataloader_args='{"gtf_file": "example/gtf_file", "fasta_file": "example/fasta_file"}' \
  -o '/tmp/Xpresso|human_K562.example_pred.tsv'
# check the results
head '/tmp/Xpresso|human_K562.example_pred.tsv'
Get the model
import kipoi
model = kipoi.get_model('Xpresso/human_K562')
Make a prediction for example files
pred = model.pipeline.predict_example(batch_size=4)
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Xpresso/human_K562')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader and instantiate it, passing the kwargs as named arguments
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
Get the docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf2-slim
Get the full sized docker image
docker pull kipoi/kipoi-docker:sharedpy3keras2tf2
Get the activated conda environment inside the container
docker run -it kipoi/kipoi-docker:sharedpy3keras2tf2-slim
Test the model
docker run kipoi/kipoi-docker:sharedpy3keras2tf2-slim kipoi test Xpresso/human_K562 --source=kipoi
Make predictions for custom files directly
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example
# You can replace $PWD/kipoi-example with a different absolute path containing the data
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf2-slim \
kipoi get-example Xpresso/human_K562 -o /app/example
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf2-slim \
kipoi predict Xpresso/human_K562 \
--dataloader_args='{"gtf_file": "/app/example/gtf_file", "fasta_file": "/app/example/fasta_file"}' \
-o '/app/Xpresso_human_K562.example_pred.tsv'
# check the results
head $PWD/kipoi-example/Xpresso_human_K562.example_pred.tsv
    
Install apptainer
https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
Make prediction for custom files directly
kipoi get-example Xpresso/human_K562 -o example
kipoi predict Xpresso/human_K562 \
--dataloader_args='{"gtf_file": "example/gtf_file", "fasta_file": "example/fasta_file"}' \
-o 'Xpresso_human_K562.example_pred.tsv' \
--singularity 
# check the results
head Xpresso_human_K562.example_pred.tsv

Schema

Inputs

Single numpy array

Name: None

    Shape: (10500, 4) 

    Doc: input encoded DNA
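
As an illustration of the input shape, the sketch below one-hot encodes a DNA string into a length-by-4 matrix. The A, C, G, T column order and the all-zero row for unknown bases such as N are assumptions made for this example and are not stated on this page; in normal use the dataloader performs the encoding. Plain Python lists are used to keep the sketch free of dependencies.

```python
# Column order is an assumption for illustration only (A, C, G, T).
ALPHABET = "ACGT"

def one_hot_encode(seq):
    """Return one row per base, each row a 4-element one-hot list."""
    rows = []
    for base in seq.upper():
        row = [0.0] * len(ALPHABET)
        idx = ALPHABET.find(base)
        if idx >= 0:  # unknown bases such as N stay all-zero (assumption)
            row[idx] = 1.0
        rows.append(row)
    return rows

# A 10,500 bp placeholder sequence gives the (10500, 4) shape listed above.
encoded = one_hot_encode("ACGTN" * 2100)
shape = (len(encoded), len(encoded[0]))
print(shape)
```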


Targets

Single numpy array

Name: None

    Shape: (1,) 

    Doc: predicted log10(RNA expression level)
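
Because the target is on a log10 scale, a prediction can be mapped back to a linear scale by exponentiation. The sketch below shows only that inversion. The input values are hypothetical, and whether a pseudocount or further normalization was applied during training is not stated on this page, so treat the result as a rough guide rather than an exact expression value.

```python
def to_linear_scale(log10_pred):
    """Invert a plain log10 transform (any training pseudocount is ignored)."""
    return 10.0 ** log10_pred

# Hypothetical predictions, not real model output.
example_preds = [0.0, 1.0, 2.5]
linear = [to_linear_scale(p) for p in example_preds]
print(linear)
```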


Dataloader

Defined as: kipoiseq.dataloaders.AnchoredGTFDl

Doc: Dataloader for a combination of FASTA and GTF files. The dataloader extracts fixed-length regions around anchor points. Anchor points are extracted from the GTF based on the anchor parameter. The sequences corresponding to each region are then extracted from the FASTA file and optionally transformed using a function given by the transform parameter.
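
The idea of a fixed-length region around an anchor can be sketched as follows. This is not the dataloader's implementation: the function name, the parameter names `num_upstream` and `num_downstream`, and the values 7000 and 3500 are assumptions chosen only so that the window length matches the 10,500 bp input shape, and the coordinate handling on the minus strand is simplified.

```python
def window_around_anchor(anchor_pos, strand, num_upstream, num_downstream):
    """Return a half-open (start, end) interval around an anchor position.

    On the minus strand, upstream lies at higher genomic coordinates,
    so the two flank lengths are swapped.
    """
    if strand == "-":
        start = anchor_pos - num_downstream
        end = anchor_pos + num_upstream
    else:
        start = anchor_pos - num_upstream
        end = anchor_pos + num_downstream
    return start, end

# Hypothetical anchor and flank sizes; only their sum (10500) comes from the schema.
plus_window = window_around_anchor(100000, "+", 7000, 3500)
minus_window = window_around_anchor(100000, "-", 7000, 3500)
print(plus_window, minus_window)
```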

Authors: Alex Karollus

Type: Dataset

License: MIT


Arguments

gtf_file : Path to a gtf file (str)

fasta_file : Reference genome FASTA file path (str)

gtf_filter (optional): Filters the GTF before the anchor points are extracted. Can be str, callable or None. If str, it is passed as the argument to pandas .query(). If callable, it is treated as a function that takes a pandas dataframe and returns the filtered dataframe.

anchor (optional): Defines the anchor points. Can be str or callable. If it is a callable, it is treated as a function that takes a pandas dataframe and returns a modified version of the dataframe in which each row represents one anchor point, with its position stored in the column anchor_pos. If it is a string, a predefined function is loaded. Currently available are tss (anchor is the start of a gene), start_codon (anchor is the start of the start_codon), stop_codon (anchor is the position right after the stop_codon), and polya (anchor is the position right after the end of a gene).

transform (optional): Callable (or None) to transform the extracted sequence (e.g. one-hot)

interval_attrs (optional): Metadata to extract from the gtf, e.g. ["gene_id", "Strand"]

use_strand (optional): True or False
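
The arguments above are passed to the dataloader as keyword arguments, and on the command line as a JSON object in `--dataloader_args`. A minimal sketch of building that flag with an optional string `gtf_filter` follows. The query `gene_type == 'protein_coding'` is a hypothetical example: the column name depends on the attributes present in your GTF, and it is an assumption here that the optional arguments are accepted through the command line in the same way as the required ones.

```python
import json
import shlex

dataloader_args = {
    "gtf_file": "example/gtf_file",
    "fasta_file": "example/fasta_file",
    # Hypothetical pandas query string; adjust the column name to your GTF.
    "gtf_filter": "gene_type == 'protein_coding'",
}

# shlex.quote makes the JSON safe to paste into a POSIX shell command.
cli_flag = "--dataloader_args=" + shlex.quote(json.dumps(dataloader_args))
print(cli_flag)
```

Quoting the JSON programmatically avoids the nested-quote mistakes that are easy to make when writing the flag by hand.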


Model dependencies
conda:
  • python=3.8
  • h5py=2.10
  • numpy
  • pip=22.0.4
  • bioconda::pysam=0.17
  • cython
  • keras=2.4
  • tensorflow=2.4

pip:
  • kipoiseq
  • protobuf==3.20

Dataloader dependencies
conda:
  • bioconda::pybedtools
  • bioconda::pyfaidx
  • bioconda::pyranges
  • numpy
  • pandas

pip:
  • kipoiseq