Framepool
Type: None
Postprocessing: None
Trained on: Data from an MPRA experiment using an eGFP library (260k sequences) in HEK293T cells. 20,000 sequences were withheld for testing, and additional validation was performed on endogenous data.
This model predicts the log2 fold change in mean ribosome load caused by introducing variants into the 5' UTR of a sequence. The model additionally reports the log2 fold changes due to the variants under the assumption that the reading frame of the sequence has been shifted. A high log fold change after such a shift can indicate that a new in-frame start has been created within the 5' UTR, lengthening the canonical protein. The model is adapted from Sample et al.: Human 5 prime UTR design and variant effect prediction from a massively parallel translation assay (https://doi.org/10.1101/310375). Several modifications were made to allow inputs of arbitrary length instead of a fixed size.
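The two quantities in the description above can be illustrated with a small, self-contained sketch. It is not part of the model: the function names and the numeric values are hypothetical and chosen only for illustration. The first function shows the arithmetic of a log2 fold change between a reference and a variant mean ribosome load; the second lists upstream ATGs in a 5' UTR together with their frame relative to a coding sequence assumed to start directly after the UTR. Stop codons between an upstream ATG and the coding sequence are ignored here.

```python
import math


def log2_fold_change(ref_mrl, alt_mrl):
    """Return log2(alt / ref); positive values mean a higher load for the variant."""
    if ref_mrl <= 0 or alt_mrl <= 0:
        raise ValueError("mean ribosome load values must be positive")
    return math.log2(alt_mrl / ref_mrl)


def upstream_atg_frames(utr):
    """Return (index, frame) for every ATG in a 5' UTR (0-based index).

    frame is the distance from the ATG to the end of the UTR modulo 3.
    Frame 0 means the ATG is in frame with a coding sequence that is
    assumed to start immediately after the UTR.
    """
    utr = utr.upper()
    return [
        (i, (len(utr) - i) % 3)
        for i in range(len(utr) - 2)
        if utr[i:i + 3] == "ATG"
    ]


# Hypothetical sequences: a single C>T substitution creates an upstream ATG
ref_utr = "GGCACGGCCACC"
alt_utr = "GGCATGGCCACC"

print(log2_fold_change(4.0, 8.0))
print(upstream_atg_frames(ref_utr))
print(upstream_atg_frames(alt_utr))
```

In this toy case the variant introduces an ATG at index 3 whose frame is 0, which is the kind of event the frame-shifted predictions are meant to flag.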
kipoi env create Framepool
source activate kipoi-Framepool
kipoi test Framepool --source=kipoi
kipoi get-example Framepool -o example
kipoi predict Framepool \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "vcf_file_tbi": "example/vcf_file_tbi", "chr_order_file": "example/chr_order_file", "num_chr": true}' \
-o '/tmp/Framepool.example_pred.tsv'
# check the results
head '/tmp/Framepool.example_pred.tsv'
kipoi env create Framepool
source activate kipoi-Framepool
import kipoi
model = kipoi.get_model('Framepool')
pred = model.pipeline.predict_example(batch_size=4)
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Framepool')
predictions <- model$pipeline$predict_example()
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader
dl <- model$default_dataloader(dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
docker pull kipoi/kipoi-docker:framepool-slim
docker pull kipoi/kipoi-docker:framepool
docker run -it kipoi/kipoi-docker:framepool-slim
docker run kipoi/kipoi-docker:framepool-slim kipoi test Framepool --source=kipoi
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example
# You can replace $PWD/kipoi-example with a different absolute path containing the data
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:framepool-slim \
kipoi get-example Framepool -o /app/example
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:framepool-slim \
kipoi predict Framepool \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file", "vcf_file": "/app/example/vcf_file", "vcf_file_tbi": "/app/example/vcf_file_tbi", "chr_order_file": "/app/example/chr_order_file", "num_chr": true}' \
-o '/app/Framepool.example_pred.tsv'
# check the results
head $PWD/kipoi-example/Framepool.example_pred.tsv
Install Apptainer: https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
kipoi get-example Framepool -o example
kipoi predict Framepool \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "vcf_file_tbi": "example/vcf_file_tbi", "chr_order_file": "example/chr_order_file", "num_chr": true}' \
-o 'Framepool.example_pred.tsv' \
--singularity
# check the results
head Framepool.example_pred.tsv
Inputs
Dictionary of numpy arrays
Name: ref_seq
Doc: reference sequence of 5' UTR
Name: alt_seq
Doc: alternative sequence of 5' UTR
Defined as: .
Doc: This dataloader requires the following input files:
1. A bed3+ file in which one user-specified column (>3, 1-based) denotes the strand and another user-specified column (>3, 1-based) denotes the transcript id (or some other id that indicates which exons in the bed belong together to form one sequence). All columns of the bed other than the first three, the id and the strand are ignored.
2. A fasta file that provides the reference genome.
3. A bgzip-compressed (single-sample) vcf that provides the variants.
4. A chromosome order file (such as a fai file) that specifies the order of chromosomes (must be valid for all files).
The bed and vcf must both be sorted (by position) and a tabix index must be present (it must lie in the same directory and have the same name + .tbi). The num_chr flag indicates whether chromosomes are listed numerically or with a chr prefix. This must be consistent across all input files!
The dataloader finds all intervals in the bed which contain at least one variant in the vcf. It then joins intervals belonging to the same transcript, as specified by the id, into a single sequence. For these sequences, it extracts the reference sequence from the fasta file, injects the applicable variants and reverse complements according to the strand information. Because all applicable variants are injected together, the results will not be meaningful if a vcf mixes variants from more than one patient. In that case, split the vcf by patient and run the predictions separately!
Returns the reference sequence and variant sequence as np.array([reference_sequence, variant_sequence]). Region metadata is additionally provided.
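The injection and reverse-complement step described above can be sketched in plain Python. This is a simplified illustration, not the dataloader's actual code: the function names are hypothetical, it handles a single interval rather than joined exons, and it assumes the variants do not overlap each other.

```python
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def inject_variants(ref_seq, seq_start, variants, strand="+"):
    """Return (reference, variant) sequences for one interval.

    ref_seq   : forward-strand reference bases of the interval
    seq_start : 1-based genomic position of ref_seq[0]
    variants  : iterable of (pos, ref, alt) with 1-based pos, as in a VCF
    strand    : "+" or "-"; "-" reverse complements both sequences
    """
    alt_seq = ref_seq
    # Apply from right to left so that indels do not shift the
    # coordinates of variants that are still waiting to be applied.
    for pos, ref, alt in sorted(variants, key=lambda v: v[0], reverse=True):
        offset = pos - seq_start
        if offset < 0 or offset + len(ref) > len(ref_seq):
            raise ValueError(f"variant at {pos} lies outside the interval")
        if alt_seq[offset:offset + len(ref)].upper() != ref.upper():
            raise ValueError(f"reference allele mismatch at {pos}")
        alt_seq = alt_seq[:offset] + alt + alt_seq[offset + len(ref):]
    if strand == "-":
        return reverse_complement(ref_seq), reverse_complement(alt_seq)
    return ref_seq, alt_seq


# Hypothetical interval starting at position 101 with one SNV and one insertion
variants = [(103, "G", "T"), (106, "C", "CAA")]
print(inject_variants("ACGTACGT", 101, variants, strand="+"))
print(inject_variants("ACGTACGT", 101, variants, strand="-"))
```

Because every variant in the list ends up in the same variant sequence, the sketch also shows why a vcf that mixes several patients would produce a sequence that no single patient carries.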
Authors:
Type: Dataset
License: MIT
Arguments
intervals_file : bed3+<columns> file path containing bed3 and at least one column specifying the strand and at least one column specifying the id. Additional columns are (currently) ignored. Must be sorted
fasta_file : Reference genome FASTA file path
vcf_file : bgzipped vcf file with the variants that are to be investigated. Must be sorted and tabix index present. Filter out any variants with non-DNA symbols!
vcf_file_tbi : tabix index of vcf (just to make kipoi tests work - leave as None in normal usage)
chr_order_file : file specifying the chromosome order (genome/faidx file). The order must be consistent across the vcf and bed files (the fasta can deviate)
strand_column : the column (1-based) specifying the strand (column 6 in a standard bed file)
id_column : the column (1-based) where seq-id information can be found (column 4 in standard bed)
num_chr : Specify whether chromosome names are numeric or have chr prefix (true if numeric, false if with prefix). Must be consistent across all files!
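One way to avoid quoting mistakes when passing these arguments on the command line is to build them as a Python dictionary and serialize them to JSON. The sketch below uses only the standard library. The paths are those of the downloaded example, and the strand_column and id_column values follow the standard-bed remark above; whether they match a particular bed file is an assumption. vcf_file_tbi is omitted, as recommended for normal usage.

```python
import json
import shlex

# Dataloader arguments as a plain dictionary (example paths, assumed columns)
dl_kwargs = {
    "intervals_file": "example/intervals_file",
    "fasta_file": "example/fasta_file",
    "vcf_file": "example/vcf_file",
    "chr_order_file": "example/chr_order_file",
    "strand_column": 6,
    "id_column": 4,
    "num_chr": True,
}

# json.dumps writes the Python boolean True as the JSON literal true
json_args = json.dumps(dl_kwargs)

# shlex.quote wraps the JSON in single quotes so the shell passes it as one argument
print("--dataloader_args=" + shlex.quote(json_args))
```

The same dictionary can be passed directly to model.pipeline.predict in Python, so the command-line and Python invocations stay in sync.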
- python=3.7
- numpy>=1.16.2
- tensorflow=1.13.1
- keras=2.2.4
- h5py=2.9.0
- pip=20.2.4
- kipoi
- python=3.7
- bioconda::pybedtools>=0.8.0
- bioconda::biopython
- bioconda::bedtools>=2.28.0
- bioconda::cyvcf2>=0.10.10
- numpy>=1.16.2
- pandas>=0.24.2
- kipoi
- kipoiseq