Framepool

Authors: Alexander Karollus

License: MIT

Contributed by: Alexander Karollus

Cite as:

Type: None

Postprocessing: None

Trained on: Data from an MPRA experiment using an eGFP library (260k sequences) on HEK293T cells. 20,000 sequences were withheld for testing, and additional validation was performed on endogenous data.

Source files

This model predicts the log2 fold change in mean ribosome load (MRL) caused by introducing variants into the 5' UTR of a sequence. It additionally reports the log2 fold changes due to the same variants under the assumption that the reading frame of the sequence has been shifted. High log fold changes after such shifts can indicate that a new in-frame start has been created within the 5' UTR, lengthening the canonical protein. The model is adapted from Sample et al., "Human 5 prime UTR design and variant effect prediction from a massively parallel translation assay" (https://doi.org/10.1101/310375). Several modifications have been made to allow inputs of arbitrary length instead of a fixed size.
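
To make the reported quantity concrete, here is a minimal sketch of a log2 fold change between two mean ribosome load values. The function name, the example numbers, and the sign convention (variant over reference) are assumptions made for illustration; the model computes its outputs internally from the two sequences.

```python
import math


def log2_fold_change(ref_mrl: float, alt_mrl: float) -> float:
    """Log2 ratio of variant to reference mean ribosome load.

    Assumption: the fold change is taken as alt / ref, so a negative
    value means the variant lowers the predicted ribosome load.
    """
    if ref_mrl <= 0 or alt_mrl <= 0:
        raise ValueError("mean ribosome load values must be positive")
    return math.log2(alt_mrl / ref_mrl)


# Hypothetical MRL values, not model output
ref_mrl = 6.0
alt_mrl = 3.0
lfc = log2_fold_change(ref_mrl, alt_mrl)
print(lfc)  # a halved ribosome load gives -1.0
```

Under this convention a value of 0 means the variant leaves the predicted ribosome load unchanged.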

Create a new conda environment with all dependencies installed

kipoi env create Framepool
source activate kipoi-Framepool

Test the model

kipoi test Framepool --source=kipoi

Make a prediction

kipoi get-example Framepool -o example
kipoi predict Framepool \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "vcf_file_tbi": "example/vcf_file_tbi", "chr_order_file": "example/chr_order_file", "num_chr": true}' \
  -o '/tmp/Framepool.example_pred.tsv'
# check the results
head '/tmp/Framepool.example_pred.tsv'

Get the model

import kipoi
model = kipoi.get_model('Framepool')

Make a prediction for example files

pred = model.pipeline.predict_example(batch_size=4)

Use dataloader and model separately

# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])

Make predictions for custom files directly

pred = model.pipeline.predict(dl_kwargs, batch_size=4)

Get the model

library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Framepool')

Make a prediction for example files

predictions <- model$pipeline$predict_example()

Use dataloader and model separately

# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader
dl <- model$default_dataloader(dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)

Make predictions for custom files directly

pred <- model$pipeline$predict(dl_kwargs, batch_size=4)

Get the docker image

docker pull kipoi/kipoi-docker:framepool-slim

Get the full sized docker image

docker pull kipoi/kipoi-docker:framepool

Get the activated conda environment inside the container

docker run -it kipoi/kipoi-docker:framepool-slim

Test the model

docker run kipoi/kipoi-docker:framepool-slim kipoi test Framepool --source=kipoi

Make predictions for custom files directly

# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:framepool-slim \
kipoi get-example Framepool -o /app/example 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:framepool-slim \
kipoi predict Framepool \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file", "vcf_file": "/app/example/vcf_file", "vcf_file_tbi": "/app/example/vcf_file_tbi", "chr_order_file": "/app/example/chr_order_file", "num_chr": true}' \
-o '/app/Framepool.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/Framepool.example_pred.tsv

Install apptainer

https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps

Make predictions for custom files directly

kipoi get-example Framepool -o example
kipoi predict Framepool \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "vcf_file_tbi": "example/vcf_file_tbi", "chr_order_file": "example/chr_order_file", "num_chr": true}' \
-o 'Framepool.example_pred.tsv' \
--singularity 
# check the results
head Framepool.example_pred.tsv

Schema

Inputs

Dictionary of numpy arrays

Name: ref_seq

Shape: ()

Doc: reference sequence of 5' UTR

Name: alt_seq

Shape: ()

Doc: alternative sequence of 5' UTR

Targets

Dictionary of numpy arrays

Name: mrl_fold_change

Shape: (1,)

Doc: Log2 Fold Change in predicted mean ribosome load

Name: shift_1

Shape: (1,)

Doc: Log2 Fold Change in mrl if frame is shifted by 1

Name: shift_2

Shape: (1,)

Doc: Log2 Fold Change in mrl if frame is shifted by 2
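
As a sketch of how the three targets might be read together, the snippet below flags variants whose shifted-frame fold changes are high, which the model description suggests can hint at a newly created in-frame start. The function name, the threshold of 1.0, and the example values are hypothetical; only the dictionary keys come from the schema above.

```python
import numpy as np


def flag_frame_shift_candidates(pred, threshold=1.0):
    """Boolean mask, True where shift_1 or shift_2 exceeds the threshold.

    `threshold` is an arbitrary illustrative cutoff, not a recommended value.
    """
    shift_1 = np.asarray(pred["shift_1"]).reshape(-1)
    shift_2 = np.asarray(pred["shift_2"]).reshape(-1)
    return (shift_1 > threshold) | (shift_2 > threshold)


# Hypothetical predictions for three variants, each target with shape (N, 1)
pred = {
    "mrl_fold_change": np.array([[-0.1], [-1.2], [0.05]]),
    "shift_1": np.array([[0.0], [1.5], [0.1]]),
    "shift_2": np.array([[0.2], [0.3], [-0.4]]),
}
mask = flag_frame_shift_candidates(pred)
print(mask)  # only the second variant is flagged
```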

Dataloader

Defined as: .

Doc: This dataloader requires the following input files:

1. A bed3+ file in which one user-specified column (>3, 1-based) denotes the strand and another user-specified column (>3, 1-based) denotes the transcript id (or some other id that indicates which exons in the bed belong together to form one sequence). All columns of the bed other than the first three, the id and the strand are ignored.
2. A fasta file that provides the reference genome.
3. A bgzip-compressed (single-sample) vcf that provides the variants.
4. A chromosome order file (such as a fai file) that specifies the order of chromosomes (must be valid for all files).

The bed and vcf must both be sorted (by position), and a tabix index must be present for the vcf (it must lie in the same directory and have the same name + .tbi). The num_chr flag indicates whether chromosomes are listed numerically or with a chr prefix. This must be consistent across all input files!

The dataloader finds all intervals in the bed that contain at least one variant in the vcf. It then joins intervals belonging to the same transcript, as specified by the id, into a single sequence. For each of these sequences, it extracts the reference sequence from the fasta file, injects the applicable variants, and reverse complements according to the strand information. Because all applicable variants are injected together, the results will not be meaningful if a vcf mixes variants from more than one patient. In that case, split the vcf by patient and run the predictions separately!

Returns the reference sequence and variant sequence as np.array([reference_sequence, variant_sequence]). Region metadata is additionally provided.
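
The join, inject, and reverse-complement steps described above can be sketched in a few lines. This is a simplified illustration, not the dataloader's actual code: all function names are hypothetical, only single-base substitutions are handled, and the variant position is given as a 0-based offset into the already joined sequence.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inject_snv(seq, offset, ref, alt):
    """Substitute one base at a 0-based offset after checking the ref allele."""
    if seq[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {seq[offset]!r}")
    return seq[:offset] + alt + seq[offset + 1:]


def build_ref_alt(exon_seqs, snv_offset, ref, alt, strand):
    """Join exon sequences, inject the variant, and orient by strand."""
    joined = "".join(exon_seqs)
    variant = inject_snv(joined, snv_offset, ref, alt)
    if strand == "-":
        return reverse_complement(joined), reverse_complement(variant)
    return joined, variant


# Two hypothetical exons of one transcript on the minus strand
ref_seq, alt_seq = build_ref_alt(["ACGT", "GGA"], 1, "C", "T", "-")
print(ref_seq, alt_seq)  # TCCACGT TCCACAT
```

Injecting on the forward strand first and reverse complementing afterwards keeps the variant coordinates in the same orientation as the vcf.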

Authors:

Type: Dataset

License: MIT

Arguments

intervals_file : bed3+<columns> file path containing bed3 and at least one column specifying the strand and at least one column specifying the id. Additional columns are (currently) ignored. Must be sorted

fasta_file : Reference genome FASTA file path

vcf_file : bgzipped vcf file with the variants that are to be investigated. Must be sorted and tabix index present. Filter out any variants with non-DNA symbols!

vcf_file_tbi : tabix index of vcf (just to make kipoi tests work - leave as None in normal usage)

chr_order_file : file specifying the chromosome order (genome/faidx file). This must be consistent across the vcf and bed files (the fasta can deviate)

strand_column : the column (1-based) specifying the strand (column 6 in a standard bed file)

id_column : the column (1-based) where seq-id information can be found (column 4 in standard bed)

num_chr : Specify whether chromosome names are numeric or have chr prefix (true if numeric, false if with prefix). Must be consistent across all files!
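
For custom data, the arguments above can be collected into a keyword dictionary. The sketch below is a hedged example: the helper name and all file paths are hypothetical placeholders, and the column numbers simply follow the standard bed layout mentioned above (id in column 4, strand in column 6).

```python
def build_dl_kwargs(intervals_file, fasta_file, vcf_file, chr_order_file,
                    num_chr, strand_column=6, id_column=4):
    """Assemble dataloader keyword arguments using the documented names."""
    return {
        "intervals_file": intervals_file,
        "fasta_file": fasta_file,
        "vcf_file": vcf_file,
        # Left as None in normal usage, as noted in the argument list
        "vcf_file_tbi": None,
        "chr_order_file": chr_order_file,
        "strand_column": strand_column,
        "id_column": id_column,
        "num_chr": num_chr,
    }


# Hypothetical paths; replace with your own sorted bed, fasta, and bgzipped vcf
dl_kwargs = build_dl_kwargs(
    intervals_file="data/utr_exons.bed",
    fasta_file="data/genome.fa",
    vcf_file="data/patient1.vcf.gz",
    chr_order_file="data/genome.fa.fai",
    num_chr=False,  # chromosome names carry a chr prefix in this example
)
print(sorted(dl_kwargs))
```

The resulting dictionary can then be passed as in the Python example above: model.pipeline.predict(dl_kwargs, batch_size=4).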

Model dependencies

conda:

python=3.7
numpy>=1.16.2
tensorflow=1.13.1
keras=2.2.4
h5py=2.9.0
pip=20.2.4

pip:

kipoi

Dataloader dependencies

conda:

python=3.7
bioconda::pybedtools>=0.8.0
bioconda::biopython
bioconda::bedtools>=2.28.0
bioconda::cyvcf2>=0.10.10
numpy>=1.16.2
pandas>=0.24.2

pip:

kipoi
kipoiseq