Framepool

Authors: Alexander Karollus

License: MIT

Contributed by: Alexander Karollus

Cite as:

Type: None

Postprocessing: None

Trained on: MPRA data from an eGFP library (260k sequences) assayed in HEK293T cells. 20,000 sequences were withheld for testing, and additional validations were performed on endogenous data.

This model predicts the log2 fold change in mean ribosome load caused by introducing variants into the 5' UTR of a sequence. It additionally reports the log2 fold changes due to the variants under the assumption that the frame of the sequence has been shifted. High log fold changes after such shifts can indicate that a new in-frame start has been created within the 5' UTR, lengthening the canonical protein. The model is adapted from Sample et al., "Human 5 prime UTR design and variant effect prediction from a massively parallel translation assay" (https://doi.org/10.1101/310375). Several modifications have been made to allow inputs of arbitrary length instead of a fixed size.
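To make the output scale concrete, here is a minimal sketch of how a log2 fold change relates two mean ribosome load (MRL) values. It assumes the score is log2(variant MRL / reference MRL); the function name and the numbers are illustrative and are not taken from the model.

```python
import math


def log2_fold_change(ref_mrl: float, var_mrl: float) -> float:
    """Log2 ratio of variant to reference mean ribosome load (assumed definition)."""
    if ref_mrl <= 0 or var_mrl <= 0:
        raise ValueError("mean ribosome load must be positive")
    return math.log2(var_mrl / ref_mrl)


# Illustrative values only: a variant that halves the predicted load
halved = log2_fold_change(ref_mrl=6.0, var_mrl=3.0)
# A variant with no predicted effect
unchanged = log2_fold_change(ref_mrl=6.0, var_mrl=6.0)
```

Under this assumed definition, a score of -1 corresponds to a halving of the predicted load and a score of 0 to no change.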

Create a new conda environment with all dependencies installed
kipoi env create Framepool
source activate kipoi-Framepool
Install model dependencies into current environment
kipoi env install Framepool
Test the model
kipoi test Framepool --source=kipoi
Make a prediction
kipoi get-example Framepool -o example
kipoi predict Framepool \
  --dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "vcf_file_tbi": "example/vcf_file_tbi", "chr_order_file": "example/chr_order_file", "num_chr": true}' \
  -o '/tmp/Framepool.example_pred.tsv'
# check the results
head '/tmp/Framepool.example_pred.tsv'
Get the model (Python)
import kipoi
model = kipoi.get_model('Framepool')
Make a prediction for example files
pred = model.pipeline.predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
it = dl.batch_iter(batch_size=4)
# predict for a batch
batch = next(it)
model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model (R)
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Framepool')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader and instantiate it (do.call unpacks the kwargs, like ** in Python)
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)

Schema

Inputs

Single numpy array

Name: seq

    Shape: (2,) 

    Doc: Reference sequence and variant 5' UTR sequence, as strings


Targets

Dictionary of numpy arrays

Name: mrl_fold_change

    Shape: (1,) 

    Doc: Log2 Fold Change in predicted mean ribosome load

Name: shift_1

    Shape: (1,) 

    Doc: Log2 Fold Change in mrl if frame is shifted by 1

Name: shift_2

    Shape: (1,) 

    Doc: Log2 Fold Change in mrl if frame is shifted by 2
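
As a hedged illustration of how the three outputs might be read together, the sketch below flags shifted-frame scores that are large. The threshold, the function name, and the choice to treat "high" as large in magnitude are all assumptions made for illustration; none of them come from the model.

```python
def flag_shifted_frames(pred, threshold=1.0):
    """Return the shifted-frame outputs whose |log2 fold change| exceeds threshold.

    `threshold` is a hypothetical cutoff, not a value recommended by the model.
    """
    return [name for name in ("shift_1", "shift_2") if abs(pred[name]) > threshold]


# Made-up scores for one variant, keyed by the target names listed above
example_pred = {"mrl_fold_change": -0.1, "shift_1": 1.4, "shift_2": -0.2}
flagged = flag_shifted_frames(example_pred)
```

A flagged frame is only a hint that the variant may be worth a closer look for a newly created start in that frame.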


Dataloader

Defined as: .

Doc: This dataloader requires the following input files:
    1. A bed3+ file in which one user-specified column (>3, 1-based) denotes the strand and another user-specified column (>3, 1-based) denotes the transcript id (or some other id that indicates which exons in the bed belong together to form one sequence). All columns of the bed other than the first three, the id and the strand are ignored.
    2. A fasta file that provides the reference genome.
    3. A bgzip-compressed (single-sample) vcf that provides the variants.
    4. A chromosome order file (such as a fai file) that specifies the order of chromosomes (must be valid for all files).
The bed and vcf must both be sorted (by position) and a tabix index must be present (it must lie in the same directory and have the same name + .tbi). The num_chr flag indicates whether chromosomes are listed numerically or with a chr prefix. This must be consistent across all input files!
The dataloader finds all intervals in the bed which contain at least one variant in the vcf. It then joins intervals belonging to the same transcript, as specified by the id, into a single sequence. For these sequences, it extracts the reference sequence from the fasta file, injects the applicable variants and reverse complements according to the strand information. Because all applicable variants are injected together, the results will not be meaningful if a vcf mixes variants from more than one patient. In that case, split the vcf by patient and run the predictions separately!
Returns the reference sequence and variant sequence as np.array([reference_sequence, variant_sequence]). Region metadata is additionally provided.
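
The join, inject and reverse-complement steps described above can be sketched in plain Python. This is not the dataloader's actual code: it is a simplified illustration that assumes 0-based half-open exon coordinates, variants given as (position, ref, alt) tuples that do not overlap and lie fully within one exon, and an in-memory genome dictionary. All function and variable names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def inject_variants(ref_seq, seq_start, variants):
    """Apply (pos, ref, alt) variants to ref_seq, which starts at genomic seq_start."""
    seq = ref_seq
    # Work from right to left so that indels do not shift the offsets still to come
    for pos, ref, alt in sorted(variants, reverse=True):
        offset = pos - seq_start
        if seq[offset:offset + len(ref)] != ref:
            raise ValueError(f"reference allele mismatch at position {pos}")
        seq = seq[:offset] + alt + seq[offset + len(ref):]
    return seq


def build_pair(genome, chrom, exons, strand, variants):
    """Join exons of one transcript and return (reference, variant) sequences."""
    ref_parts, var_parts = [], []
    for start, end in sorted(exons):
        ref = genome[chrom][start:end]
        # Keep only variants that lie fully inside this exon (simplifying assumption)
        inside = [v for v in variants if start <= v[0] and v[0] + len(v[1]) <= end]
        ref_parts.append(ref)
        var_parts.append(inject_variants(ref, start, inside))
    ref_seq, var_seq = "".join(ref_parts), "".join(var_parts)
    if strand == "-":
        ref_seq, var_seq = reverse_complement(ref_seq), reverse_complement(var_seq)
    return ref_seq, var_seq


# Toy data: one chromosome, two exons of the same transcript, one SNV and one insertion
genome = {"1": "AAACCCGGGTTTACGT"}
exons = [(0, 6), (9, 16)]
variants = [(4, "C", "T"), (12, "A", "AGG")]
ref_plus, var_plus = build_pair(genome, "1", exons, "+", variants)
ref_minus, var_minus = build_pair(genome, "1", exons, "-", variants)
```

The resulting pair corresponds to the two strings that the dataloader returns as np.array([reference_sequence, variant_sequence]).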

Authors:

Type: Dataset

License: MIT


Arguments

intervals_file : bed3+<columns> file path containing bed3 and at least one column specifying the strand and at least one column specifying the id. Additional columns are (currently) ignored. Must be sorted.

fasta_file : Reference genome FASTA file path

vcf_file : bgzipped vcf file with the variants that are to be investigated. Must be sorted and tabix index present. Filter out any variants with non-DNA symbols!

vcf_file_tbi : tabix index of vcf (just to make kipoi tests work - leave as None in normal usage)

chr_order_file : file specifying the chromosome order (genome/faidx file). This must be consistent across the vcf and bed files (the fasta can deviate).

strand_column : the column (1-based) specifying the strand (column 6 in a standard bed file)

id_column : the column (1-based) where seq-id information can be found (column 4 in standard bed)

num_chr : Specify whether chromosome names are numeric or have chr prefix (true if numeric, false if with prefix). Must be consistent across all files!
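
As a rough sketch of how the arguments above fit together, the helper below assembles them into a dictionary and serialises it to the JSON string expected by `--dataloader_args`. The helper name and the file paths are hypothetical. The column values follow the standard bed positions mentioned above and are chosen for illustration; they are not the dataloader's documented defaults. `vcf_file_tbi` is omitted on the assumption that it can stay unset in normal usage.

```python
import json


def make_dataloader_args(intervals_file, fasta_file, vcf_file, chr_order_file,
                         strand_column=6, id_column=4, num_chr=False):
    """Collect the dataloader arguments into one dict (hypothetical helper)."""
    return {
        "intervals_file": intervals_file,
        "fasta_file": fasta_file,
        "vcf_file": vcf_file,
        "chr_order_file": chr_order_file,
        "strand_column": strand_column,  # 1-based column holding the strand
        "id_column": id_column,          # 1-based column holding the transcript id
        "num_chr": num_chr,              # True if chromosomes are named 1, 2, ... rather than chr1, chr2, ...
    }


# Hypothetical file names; replace with your own sorted bed, fasta, bgzipped vcf and fai
dl_args = make_dataloader_args(
    "utr.bed", "genome.fa", "sample.vcf.gz", "genome.fa.fai", num_chr=True
)
cli_arg = json.dumps(dl_args)  # string to pass after --dataloader_args=
```

The same dictionary can be passed to `model.pipeline.predict(dl_args, batch_size=4)` in place of the example kwargs.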


Model dependencies
conda:
  • python=3.6.7
  • numpy>=1.16.2
  • tensorflow=1.13.1
  • keras=2.2.4
  • h5py=2.9.0

pip:
  • kipoi

Dataloader dependencies
conda:
  • python=3.6.7
  • bioconda::pybedtools>=0.8.0
  • bioconda::bedtools>=2.28.0
  • bioconda::cyvcf2>=0.10.10
  • numpy>=1.16.2
  • pandas>=0.24.2

pip:
  • kipoi
  • kipoiseq