Using variant effect prediction
This chapter describes how to run variant prediction using a model in the zoo either using the python functionality or using the command line. A prerequesite is that the model is compatible with variant effect prediction (see: Variant effect prediction prerequesites for the model.yaml and dataloader.yaml)
Variant effect prediction in python
Using variant effect prediction within python allows more flexibility in the finegrain details compared to using the command line interface. The core function of variant effect prediction is score_variants
, which on the one hand requires a model with its dataloader as well as a valid VCF.
The easiest way to run variant effect prediction is the following:
from kipoi.postprocessing.variant_effects import score_variants
dataloader_arguments = {...}
score_variants(model = "my_model_name",
dl_args = dataloader_arguments,
input_vcf = "path/to/my_vcf.vcf",
output_vcf = "path/to/my_annotated_vcf.vcf",)
Where model
is a kipoi model - replace my_model_name
by a valid model name. dataloader_arguments
contains all the kwargs that are necessary to run the dataloader. The coordinates in the input_vcf
have to match the genome / assembly etc. of the raw input files used by the dataloader. The output of score_variants is an annotated VCF - output_vcf
. For more details please look at the detailed function description of score_variants
. For details on the different scoring methods please take a look at the detailed explanation of variant effect prediction or the API defintion.
The above code will run the dataloader based with dataloader_arguments
and try to overlap the input VCF with the sequences generated by the dataloader. If a model dataloader accepts bed files input to control the generated regions, then a temporary bed file with variant-centered regions will be generated. If the dataloader does not offer a bed file input then the inputs generated by the dataloader will automatically be overlapped with the positions in the VCF and only the overlapping regions / variants are tested. For more control over the region generation please use kipoi.postprocessing.variant_effects.predict_snvs
function's vcf_to_region
argument with SnvCenteredRg
, SnvPosRestrictedRg
, or None
. In the following section features of both functions score_variants
and predict_snvs
will be explained - please keep in mind that score_variants
is a wrapper function around predict_snvs
to cover most frequent use cases and reduce complexity of the user's code.
Test region generation based on VCFs:
Variant-centered effect prediction
In the above example the regions were defined by the dataloader arguments, but if the dataloader supports bed file input (see dataloader.yaml definition for variant effect prediction) then the SnvCenteredRg
class can generate a temporary bed file using a VCF and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor
instance model_info
:
from kipoi.postprocessing.variant_effects import SnvCenteredRg
vcf_to_region = SnvCenteredRg(model_info)
The resulting vcf_to_region
object can then be used as the vcf_to_region
argument when calling predict_snvs
.
Restricted variant-centered effect prediction
This funcionality is similar to variant-centered effect prediction - the only difference is that this function is designed for models that can't predict on arbitrary regions of the genome, but only in certain regions of the genome. If those regions can be defined in a bed file (further on called 'restriction-bed' file) then this approach can be used. Variant effect prediction will then intersect the VCF with the restriction-bed and generate another bed file that is then passed on to the dataloader. Regions in the restriction-bed file may be larger than the input sequence lenght, in that case the generated seuqence will be centered on the variant position as much as possible - restricted by what is defined in the restrictions-bed file. The SnvPosRestrictedRg
class can generate a temporary bed file using a VCF, the restrictions-bed file (restricted_regions_fpath
in the example below) and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor
instance model_info
:
from kipoi.postprocessing.variant_effects import SnvPosRestrictedRg
import pybedtools as pb
pbd = pb.BedTool(restricted_regions_fpath)
vcf_to_region = SnvPosRestrictedRg(model_info, pbd)
The resulting vcf_to_region
object can then be used as the vcf_to_region
argument when calling predict_snvs
.
Scoring functions
Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: Logit
, LogitAlt
, LogitRef
, Diff
, DeepSEA_effect
. These functions are described in more detail in the variant effect prediction pages.
These and custom scoring functions can be used in the score_variants
function by setting the scores
as a list of strings, for example:
["logit", "diff"]
. This list can contain strings of the implemented scoring functions ("diff"
, "ref"
, "alt"
, "logit"
, "logit_ref"
, "logit_alt"
, "deepsea_effect"
) or callables that are inherited from RCScore
.
Fine-tuning scoring functions
The default scoring functions (Logit
, LogitAlt
, LogitRef
, Diff
, DeepSEA_effect
) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging
argument which can be "min"
, "max"
, "mean"
, "median"
or "absmax"
. So when using the score_variants
function the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the scores_kargs
argument would be: [{"rc_merging": "max"}]
and scores
would be ["diff"]
. The default rc_merging
value is "mean"
.
Saving mutated sequence sets to a file
A specialised feature of the predict_snvs
function that is only available when using the python functions is to save the mutated sequence sets in a file. This can be useful for quality control or if a non-deeplearning model outside the model zoo should be run using the same data. For those cases instances of the SyncHdf5SeqWriter
can be used. If they are passed to predict_snvs
as the argument generated_seq_writer
then the respective sequences are written to a file. Keep in mind that when defining a generated_seq_writer
then no actual effect prediction is performed, but only the reference/alternative sequence sets are generated and saved.
Return predictions
By default effect predicions are not kept in memory, but only written to the output VCF to ensure a low memory profile. By setting the parameter return_predictions = True
in predict_snvs
or in score_variants
the effect predictions are accumulated in memory and the results are returned as a dictionary of DataFrames, where the keys are the labels of the used scoring functions and the DataFrames have the shape (number effect predictions, number of output tasks of the model).
Using the command line interface
Similar to kipoi predict
variant effect prediction can be run by executing:
kipoi postproc score_variants my_model_name \
--dataloader_args '{...}' \
--vcf_path path/to/my_vcf.vcf \
--out_vcf_fpath path/to/my_annotated_vcf.vcf
Exceptions are that if the dataloader of the model allows the definition of a bed input file, then the respective field in the --dataloader_args
JSON will be replaced by a bed file that consists in regions that are centered on the variant position. That is, if in the dataloader.yaml file of the respective model the bed_input
flag is set then the respective argument in the --dataloader_args
will be overwritten.
When using variant effect prediction from the command line and using --source dir
, keep in mind that whatever the path is that you put where my_model_name
stands in the above command is treated as your model name. Since the annotated VCF INFO tags contain the model name as an identifier, executing kipoi postproc score_variants ./ --source dir ...
will result in an annotated VCF with the model name ".", which is most probably not desired. For those cases kipoi postproc score_variants ...
should be executed in at least one directory level higher than the one where the model.yaml file lies. Then the command will look similar to this kipoi postproc score_variants ./my_model --source dir ...
and the annotated VCF INFO tags will contain './my_model'.
Scoring functions
Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: logit
, logit_alt
, logit_ref
, diff
, deepsea_effect
. These functions are described in more detail in the variant effect prediction pages.
Given a model is compatible with said scoring functions one or more of those can be selected by using the --scoring
argument, e.g.: --scoring diff logit
. The model.yaml file defines which scoring functions are available for a model, with the exception that the diff
scoring function is available for all models. In the model.yaml also additional custom scoring functions can be defined, for details on please see the variant effect prediction pages. The labels by which the different scoring functions are made available can also be defined in the model.yaml file using the name
tag.
Fine-tuning scoring functions
Scoring functions may have or may even require arguments at instantiation. Those arguments can be passed as JSON dictionaries to scoring functions by using the --scoring_kwargs
argument. If --scoring_kwargs
is used then for every label set in --scoring
there must be a --scoring_kwargs
JSON in the exact same order. If the degault values should be used or no arguments are required then an empty dictionary ({}
) can be used. For example: --scoring diff my_scr --scoring_kwargs '{}' '{my_arg:2}'
will use diff
with the default parameters and will instantiate my_scr(my_arg=2)
.
The default scoring functions (logit
, logit_alt
, logit_ref
, diff
, deepsea_effect
) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging
argument which can be "min"
, "max"
, "mean"
, "median"
or "absmax"
. So if the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the command would be: --scoring diff --scoring_kwargs '{rc_merging:"max"}'
. By default rc_merging
is set to "mean"
.
Putting scoring functions into context with DeepSEA
The DeepSEA model was one of the first publications to define variant effect prediction scores for deep learning models.
Some of the effect scores of the kipoi-veff
package are inspired by those. Here we wan to give an explanation how the original DeepSEA
scores relate to the ones in kipoi-veff
:
- The
effect score
orlog2 fold change
on the DeepSEA website is the same as using thelogit
function ofkipoi-veff
onDeepSEA/variantEffects
and dividing all values by the constant:np.log(2)
. - The
e-value
of DeepSEA cannot be readily provided as it relies on a set of effect prediction of randomly selected variants. The file size is restrively big and hence was not added to the Kipoi model. The DeepSEA website uses a score to compare the tested variant against the set of random variants (See Q&A 2.), this score is available inkipoi-veff
asdeepsea_effect
. If you want to calculate the evalue locally then usedeepsea_effect
with theDeepSEA/variantEffects
and compare your result to the set of predictions of 1M variants provided with the DeepSEA model.