Using variant effect prediction

This chapter describes how to run variant prediction using a model in the zoo either using the python functionality or using the command line. A prerequesite is that the model is compatible with variant effect prediction (see: Variant effect prediction prerequesites for the model.yaml and dataloader.yaml)

Variant effect prediction in python

Using variant effect prediction within python allows more flexibility in the finegrain details compared to using the command line interface. The core function of variant effect prediction is score_variants, which on the one hand requires a model with its dataloader as well as a valid VCF.

The easiest way to run variant effect prediction is the following:

from kipoi.postprocessing.variant_effects import score_variants
dataloader_arguments = {...}
score_variants(model = "my_model_name",
               dl_args = dataloader_arguments,
               input_vcf = "path/to/my_vcf.vcf",
               output_vcf = "path/to/my_annotated_vcf.vcf",)

Where model is a kipoi model - replace my_model_name by a valid model name. dataloader_arguments contains all the kwargs that are necessary to run the dataloader. The coordinates in the input_vcf have to match the genome / assembly etc. of the raw input files used by the dataloader. The output of score_variants is an annotated VCF - output_vcf. For more details please look at the detailed function description of score_variants. For details on the different scoring methods please take a look at the detailed explanation of variant effect prediction or the API defintion.

The above code will run the dataloader based with dataloader_arguments and try to overlap the input VCF with the sequences generated by the dataloader. If a model dataloader accepts bed files input to control the generated regions, then a temporary bed file with variant-centered regions will be generated. If the dataloader does not offer a bed file input then the inputs generated by the dataloader will automatically be overlapped with the positions in the VCF and only the overlapping regions / variants are tested. For more control over the region generation please use kipoi.postprocessing.variant_effects.predict_snvs function's vcf_to_region argument with SnvCenteredRg, SnvPosRestrictedRg, or None. In the following section features of both functions score_variants and predict_snvs will be explained - please keep in mind that score_variants is a wrapper function around predict_snvs to cover most frequent use cases and reduce complexity of the user's code.

Test region generation based on VCFs:

Variant-centered effect prediction

In the above example the regions were defined by the dataloader arguments, but if the dataloader supports bed file input (see dataloader.yaml definition for variant effect prediction) then the SnvCenteredRg class can generate a temporary bed file using a VCF and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor instance model_info:

from kipoi.postprocessing.variant_effects import SnvCenteredRg
vcf_to_region = SnvCenteredRg(model_info)

The resulting vcf_to_region object can then be used as the vcf_to_region argument when calling predict_snvs.

Restricted variant-centered effect prediction

This funcionality is similar to variant-centered effect prediction - the only difference is that this function is designed for models that can't predict on arbitrary regions of the genome, but only in certain regions of the genome. If those regions can be defined in a bed file (further on called 'restriction-bed' file) then this approach can be used. Variant effect prediction will then intersect the VCF with the restriction-bed and generate another bed file that is then passed on to the dataloader. Regions in the restriction-bed file may be larger than the input sequence lenght, in that case the generated seuqence will be centered on the variant position as much as possible - restricted by what is defined in the restrictions-bed file. The SnvPosRestrictedRg class can generate a temporary bed file using a VCF, the restrictions-bed file (restricted_regions_fpath in the example below) and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor instance model_info:

from kipoi.postprocessing.variant_effects import SnvPosRestrictedRg
import pybedtools as pb

pbd = pb.BedTool(restricted_regions_fpath)
vcf_to_region = SnvPosRestrictedRg(model_info, pbd)

The resulting vcf_to_region object can then be used as the vcf_to_region argument when calling predict_snvs.

Scoring functions

Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: Logit, LogitAlt, LogitRef, Diff, DeepSEA_effect. These functions are described in more detail in the variant effect prediction pages.

These and custom scoring functions can be used in the score_variants function by setting the scores as a list of strings, for example: ["logit", "diff"]. This list can contain strings of the implemented scoring functions ("diff", "ref", "alt", "logit", "logit_ref", "logit_alt", "deepsea_effect") or callables that are inherited from RCScore.

Fine-tuning scoring functions

The default scoring functions (Logit, LogitAlt, LogitRef, Diff, DeepSEA_effect) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging argument which can be "min", "max", "mean", "median" or "absmax". So when using the score_variants function the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the scores_kargs argument would be: [{"rc_merging": "max"}] and scores would be ["diff"]. The default rc_merging value is "mean".

Saving mutated sequence sets to a file

A specialised feature of the predict_snvs function that is only available when using the python functions is to save the mutated sequence sets in a file. This can be useful for quality control or if a non-deeplearning model outside the model zoo should be run using the same data. For those cases instances of the SyncHdf5SeqWriter can be used. If they are passed to predict_snvs as the argument generated_seq_writer then the respective sequences are written to a file. Keep in mind that when defining a generated_seq_writer then no actual effect prediction is performed, but only the reference/alternative sequence sets are generated and saved.

Return predictions

By default effect predicions are not kept in memory, but only written to the output VCF to ensure a low memory profile. By setting the parameter return_predictions = True in predict_snvs or in score_variants the effect predictions are accumulated in memory and the results are returned as a dictionary of DataFrames, where the keys are the labels of the used scoring functions and the DataFrames have the shape (number effect predictions, number of output tasks of the model).

Using the command line interface

Similar to kipoi predict variant effect prediction can be run by executing:

kipoi postproc score_variants my_model_name \
    --dataloader_args '{...}' \
    --vcf_path path/to/my_vcf.vcf \
    --out_vcf_fpath path/to/my_annotated_vcf.vcf

Exceptions are that if the dataloader of the model allows the definition of a bed input file, then the respective field in the --dataloader_args JSON will be replaced by a bed file that consists in regions that are centered on the variant position. That is, if in the dataloader.yaml file of the respective model the bed_input flag is set then the respective argument in the --dataloader_args will be overwritten.

When using variant effect prediction from the command line and using --source dir, keep in mind that whatever the path is that you put where my_model_name stands in the above command is treated as your model name. Since the annotated VCF INFO tags contain the model name as an identifier, executing kipoi postproc score_variants ./ --source dir ... will result in an annotated VCF with the model name ".", which is most probably not desired. For those cases kipoi postproc score_variants ... should be executed in at least one directory level higher than the one where the model.yaml file lies. Then the command will look similar to this kipoi postproc score_variants ./my_model --source dir ... and the annotated VCF INFO tags will contain './my_model'.

Scoring functions

Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: logit, logit_alt, logit_ref, diff, deepsea_effect. These functions are described in more detail in the variant effect prediction pages.

Given a model is compatible with said scoring functions one or more of those can be selected by using the --scoring argument, e.g.: --scoring diff logit. The model.yaml file defines which scoring functions are available for a model, with the exception that the diff scoring function is available for all models. In the model.yaml also additional custom scoring functions can be defined, for details on please see the variant effect prediction pages. The labels by which the different scoring functions are made available can also be defined in the model.yaml file using the name tag.

Fine-tuning scoring functions

Scoring functions may have or may even require arguments at instantiation. Those arguments can be passed as JSON dictionaries to scoring functions by using the --scoring_kwargs argument. If --scoring_kwargs is used then for every label set in --scoring there must be a --scoring_kwargs JSON in the exact same order. If the degault values should be used or no arguments are required then an empty dictionary ({}) can be used. For example: --scoring diff my_scr --scoring_kwargs '{}' '{my_arg:2}' will use diff with the default parameters and will instantiate my_scr(my_arg=2).

The default scoring functions (logit, logit_alt, logit_ref, diff, deepsea_effect) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging argument which can be "min", "max", "mean", "median" or "absmax". So if the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the command would be: --scoring diff --scoring_kwargs '{rc_merging:"max"}'. By default rc_merging is set to "mean".

Putting scoring functions into context with DeepSEA

The DeepSEA model was one of the first publications to define variant effect prediction scores for deep learning models. Some of the effect scores of the kipoi-veff package are inspired by those. Here we wan to give an explanation how the original DeepSEA scores relate to the ones in kipoi-veff:

  • The effect score or log2 fold change on the DeepSEA website is the same as using the logit function of kipoi-veff on DeepSEA/variantEffects and dividing all values by the constant: np.log(2).
  • The e-value of DeepSEA cannot be readily provided as it relies on a set of effect prediction of randomly selected variants. The file size is restrively big and hence was not added to the Kipoi model. The DeepSEA website uses a score to compare the tested variant against the set of random variants (See Q&A 2.), this score is available in kipoi-veff as deepsea_effect. If you want to calculate the evalue locally then use deepsea_effect with the DeepSEA/variantEffects and compare your result to the set of predictions of 1M variants provided with the DeepSEA model.