MMSplice/pathogenicity

Authors: Jun Cheng

License: MIT

Contributed by: Jun Cheng

Cite as:

Type: custom

Postprocessing: variant_effects

Trained on: MPRA (Rosenberg 2015), GENCODE annotation 24, and ClinVar (release 2018-04-29) variants labelled 'Pathogenic' or 'Benign' near the splice sites. Chromosomes 1 to 8 were provided as training data; the remaining chromosomes (9 to 22 and X) were held out.
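The chromosome-based split described above can be expressed as a small helper. This is a minimal sketch for illustration only: the function name `assign_split` and the "unused" label for chromosomes not mentioned above (such as Y) are assumptions, not part of the model's code.

```python
# Chromosome sets as stated in the "Trained on" description.
TRAIN_CHROMS = {str(i) for i in range(1, 9)}             # 1 to 8
HELD_OUT_CHROMS = {str(i) for i in range(9, 23)} | {"X"}  # 9 to 22 and X


def assign_split(chrom):
    """Hypothetical helper: map a chromosome name to its data split.

    Accepts names with or without a 'chr' prefix (e.g. 'chr1' or '1').
    """
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    name = name.upper()
    if name in TRAIN_CHROMS:
        return "train"
    if name in HELD_OUT_CHROMS:
        return "held_out"
    # Assumption: anything else (e.g. Y, MT) is not covered by the description.
    return "unused"
```

When benchmarking the model on your own variants, a check like this helps avoid scoring variants from the training chromosomes.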

Source files

Predict splicing variant effect from VCF

Predict variant pathogenicity. Returns one prediction per variant.

Create a new conda environment with all dependencies installed
kipoi env create MMSplice/pathogenicity
source activate kipoi-MMSplice__pathogenicity
Test the model
kipoi test MMSplice/pathogenicity --source=kipoi
Make a prediction
kipoi get-example MMSplice/pathogenicity -o example
kipoi predict MMSplice/pathogenicity \
  --dataloader_args='{"gtf": "example/gtf", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "exon_cut_l": 0, "exon_cut_r": 0, "acceptor_intron_cut": 6, "donor_intron_cut": 6, "acceptor_intron_len": 50, "acceptor_exon_len": 3, "donor_exon_len": 5, "donor_intron_len": 13}' \
  -o '/tmp/MMSplice|pathogenicity.example_pred.tsv'
# check the results
head '/tmp/MMSplice|pathogenicity.example_pred.tsv'
Create a new conda environment with all dependencies installed
kipoi env create MMSplice/pathogenicity
source activate kipoi-MMSplice__pathogenicity
Get the model
import kipoi
model = kipoi.get_model('MMSplice/pathogenicity')
Make a prediction for example files
pred = model.pipeline.predict_example(batch_size=4)
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=4)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
Make predictions for custom files directly
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
Get the model
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('MMSplice/pathogenicity')
Make a prediction for example files
predictions <- model$pipeline$predict_example()
Use dataloader and model separately
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader, passing the named list as keyword arguments
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
Make predictions for custom files directly
pred <- model$pipeline$predict(dl_kwargs, batch_size=4)
Get the docker image
docker pull kipoi/kipoi-docker:mmsplice-slim
Get the full sized docker image
docker pull kipoi/kipoi-docker:mmsplice
Get the activated conda environment inside the container
docker run -it kipoi/kipoi-docker:mmsplice-slim
Test the model
docker run kipoi/kipoi-docker:mmsplice-slim kipoi test MMSplice/pathogenicity --source=kipoi
Make prediction for custom files directly
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example 
# You can replace $PWD/kipoi-example with a different absolute path containing the data 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:mmsplice-slim \
kipoi get-example MMSplice/pathogenicity -o /app/example 
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:mmsplice-slim \
kipoi predict MMSplice/pathogenicity \
--dataloader_args='{"gtf": "/app/example/gtf", "fasta_file": "/app/example/fasta_file", "vcf_file": "/app/example/vcf_file", "exon_cut_l": 0, "exon_cut_r": 0, "acceptor_intron_cut": 6, "donor_intron_cut": 6, "acceptor_intron_len": 50, "acceptor_exon_len": 3, "donor_exon_len": 5, "donor_intron_len": 13}' \
-o '/app/MMSplice_pathogenicity.example_pred.tsv' 
# check the results
head $PWD/kipoi-example/MMSplice_pathogenicity.example_pred.tsv
    
Install apptainer
https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
Make prediction for custom files directly
kipoi get-example MMSplice/pathogenicity -o example
kipoi predict MMSplice/pathogenicity \
--dataloader_args='{"gtf": "example/gtf", "fasta_file": "example/fasta_file", "vcf_file": "example/vcf_file", "exon_cut_l": 0, "exon_cut_r": 0, "acceptor_intron_cut": 6, "donor_intron_cut": 6, "acceptor_intron_len": 50, "acceptor_exon_len": 3, "donor_exon_len": 5, "donor_intron_len": 13}' \
-o 'MMSplice_pathogenicity.example_pred.tsv' \
--singularity 
# check the results
head MMSplice_pathogenicity.example_pred.tsv
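The `--dataloader_args` value in the commands above is a JSON object, so keys and string values need double quotes and the whole object needs shell quoting. A minimal sketch of building that argument from Python follows; the helper name `build_dataloader_arg` is hypothetical, and the paths and values are the ones from the example commands.

```python
import json
import shlex

# Values copied from the example commands above; adjust the paths for your data.
dataloader_args = {
    "gtf": "example/gtf",
    "fasta_file": "example/fasta_file",
    "vcf_file": "example/vcf_file",
    "exon_cut_l": 0,
    "exon_cut_r": 0,
    "acceptor_intron_cut": 6,
    "donor_intron_cut": 6,
    "acceptor_intron_len": 50,
    "acceptor_exon_len": 3,
    "donor_exon_len": 5,
    "donor_intron_len": 13,
}


def build_dataloader_arg(args):
    """Hypothetical helper: return a shell-safe --dataloader_args=... token."""
    # json.dumps emits double-quoted keys and strings; shlex.quote wraps the
    # result in single quotes so the shell passes it through unchanged.
    return "--dataloader_args=" + shlex.quote(json.dumps(args))


print(build_dataloader_arg(dataloader_args))
```

Generating the token this way avoids nesting one kind of quote inside itself, which would make the shell strip the quotes and leave invalid JSON.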

Schema

Inputs

Dictionary of numpy arrays

Name: seq/acceptor_intron

    Shape: (None, 4) 

    Doc: reference sequence of acceptor intron

Name: seq/acceptor

    Shape: (None, 4) 

    Doc: reference sequence of acceptor

Name: seq/exon

    Shape: (None, 4) 

    Doc: reference sequence of exon

Name: seq/donor

    Shape: (None, 4) 

    Doc: reference sequence of donor

Name: seq/donor_intron

    Shape: (None, 4) 

    Doc: reference sequence of donor intron

Name: mut_seq/acceptor_intron

    Shape: (None, 4) 

    Doc: alternative sequence of acceptor intron

Name: mut_seq/acceptor

    Shape: (None, 4) 

    Doc: alternative sequence of acceptor

Name: mut_seq/exon

    Shape: (None, 4) 

    Doc: alternative sequence of exon

Name: mut_seq/donor

    Shape: (None, 4) 

    Doc: alternative sequence of donor

Name: mut_seq/donor_intron

    Shape: (None, 4) 

    Doc: alternative sequence of donor intron
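Each input above has shape (None, 4): a variable-length sequence with four channels per position, which corresponds to a one-hot encoded DNA sequence. The sketch below illustrates that layout in plain Python. The channel order A, C, G, T and the all-zero row for other characters such as N are assumptions made for illustration; the encoding actually used is the one implemented in the mmsplice package.

```python
# Assumed channel order; not confirmed by this page.
CHANNELS = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of rows, one row of 4 floats per base.

    The result has shape (len(seq), 4), matching the (None, 4) schema.
    Characters outside CHANNELS (e.g. 'N') become an all-zero row (assumption).
    """
    rows = []
    for base in seq.upper():
        rows.append([1.0 if base == channel else 0.0 for channel in CHANNELS])
    return rows


encoded = one_hot("ACGTN")
print(len(encoded), len(encoded[0]))  # number of positions, number of channels
```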


Targets

Single numpy array

Name: None

    Shape: (2,) 

    Doc: Pathogenicity score. Index 0 is the probability that the variant is benign; index 1 is the probability that it is pathogenic.
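Given the two-element target described above, the pathogenic probability is the value at index 1. A minimal sketch of reading a single prediction follows; the helper names and the 0.5 decision threshold are hypothetical choices for illustration, not values recommended by the model authors.

```python
BENIGN_INDEX = 0      # probability of being benign
PATHOGENIC_INDEX = 1  # probability of being pathogenic


def pathogenic_probability(pred):
    """Return the pathogenic probability from one (2,) prediction."""
    return float(pred[PATHOGENIC_INDEX])


def classify(pred, threshold=0.5):
    """Hypothetical helper: label a prediction using an assumed threshold."""
    if pathogenic_probability(pred) >= threshold:
        return "pathogenic"
    return "benign"


print(classify([0.2, 0.8]))
```

The same indexing works on a row of the numpy array returned by `model.predict_on_batch`, since each row is one two-element prediction.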


Dataloader

Defined as: ..

Doc: This model first predicts the effect of variants using 5 sub-modules (acceptor intron module, acceptor module, exon module, donor module, donor intron module), and then integrates those predictions using linear regression. The model has been trained to predict the change in PSI (delta PSI) caused by variants.
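The two-stage design described above (five module scores, then a linear combination) can be sketched as follows. This is an illustration of the general idea only: the function name `integrate`, the use of simple alternative-minus-reference differences as features, and any weights passed in are assumptions. The fitted coefficients and the exact features are defined in the mmsplice package, not here.

```python
MODULES = ("acceptor_intron", "acceptor", "exon", "donor", "donor_intron")


def integrate(ref_scores, alt_scores, weights, intercept=0.0):
    """Hypothetical linear integration of per-module variant effects.

    ref_scores / alt_scores: dicts mapping module name to that module's score
    for the reference and alternative sequence.
    weights: dict mapping module name to a coefficient (placeholder values).
    """
    total = intercept
    for module in MODULES:
        # Per-module effect of the variant: alternative minus reference.
        total += weights[module] * (alt_scores[module] - ref_scores[module])
    return total


# Placeholder inputs, not real model outputs or fitted coefficients.
ref = {m: 0.0 for m in MODULES}
alt = {m: 0.5 for m in MODULES}
unit_weights = {m: 1.0 for m in MODULES}
print(integrate(ref, alt, unit_weights))
```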

Authors: Jun Cheng

Type: SampleIterator

License: MIT


Arguments

gtf : path to the GTF file required by the models (Ensembl)

fasta_file : reference genome fasta file

vcf_file : Path to the input vcf file

split_seq (optional): whether to split the sequence in the dataloader

encode (optional): if the sequence is split, whether to one-hot encode it

exon_cut_l (optional): when extracting the exon feature, how many base pairs to cut from the beginning of an exon

exon_cut_r (optional): when extracting the exon feature, how many base pairs to cut from the end of an exon

acceptor_intron_cut (optional): how many bp to cut from the end of the acceptor intron because they are considered part of the acceptor site

donor_intron_cut (optional): how many bp to cut from the end of the donor intron because they are considered part of the donor site

acceptor_intron_len (optional): length of acceptor intron sequence to consider for the acceptor site model

acceptor_exon_len (optional): length of acceptor exon sequence to consider for the acceptor site model

donor_exon_len (optional): length of donor exon sequence to consider for the donor site model

donor_intron_len (optional): length of donor intron sequence to consider for the donor site model
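To make the cut and length arguments above concrete, here is a simplified sketch of how an exon with flanking intron sequence could be divided into the five regions. It is not the mmsplice implementation. The function name `split_regions`, the `intron_l` and `intron_r` parameters (flanking intron lengths), and the reading that `donor_intron_cut` removes the exon-adjacent bases of the donor intron are all assumptions; the defaults are the values from the example commands.

```python
def split_regions(seq, intron_l, intron_r,
                  exon_cut_l=0, exon_cut_r=0,
                  acceptor_intron_cut=6, donor_intron_cut=6,
                  acceptor_intron_len=50, acceptor_exon_len=3,
                  donor_exon_len=5, donor_intron_len=13):
    """Hypothetical helper: split intron + exon + intron into five regions.

    seq is laid out as: intron_l bases of upstream (acceptor-side) intron,
    then the exon, then intron_r bases of downstream (donor-side) intron.
    """
    if intron_l < acceptor_intron_len or intron_r < donor_intron_len:
        raise ValueError("flanking introns are shorter than the site windows")
    exon_start = intron_l             # index of the first exon base
    exon_end = len(seq) - intron_r    # index one past the last exon base
    return {
        # Upstream intron without the bases assigned to the acceptor site.
        "acceptor_intron": seq[:exon_start - acceptor_intron_cut],
        # Acceptor site window spanning the intron/exon boundary.
        "acceptor": seq[exon_start - acceptor_intron_len:
                        exon_start + acceptor_exon_len],
        # Exon body, optionally trimmed at both ends.
        "exon": seq[exon_start + exon_cut_l:exon_end - exon_cut_r],
        # Donor site window spanning the exon/intron boundary.
        "donor": seq[exon_end - donor_exon_len:exon_end + donor_intron_len],
        # Downstream intron without the bases assigned to the donor site.
        "donor_intron": seq[exon_end + donor_intron_cut:],
    }
```

With the default values, the acceptor window is 53 bp (50 intronic plus 3 exonic) and the donor window is 18 bp (5 exonic plus 13 intronic).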


Model dependencies
conda:
  • python=3.7
  • pip=21.0.1

pip:
  • h5py==2.10.0
  • mmsplice==1.0.3
  • protobuf==3.20

Dataloader dependencies
conda:
  • bioconda::cyvcf2=0.11.5
  • bioconda::pyranges=0.0.66
  • bioconda::pysam=0.15.3
  • python=3.7

pip:
  • mmsplice==1.0.3
  • protobuf==3.20