Postprocessing

Kipoi offers a set of postprocessing tools that enable to calculate variant effects, create mutation maps, inspect activation of hidden model layers and to calculate the gradient of layer activation with respect to a given input.

Variant effect prediction and mutation map generation is available for all models where the variant_effects parameter in the model.yaml (and dataloader.yaml) is set (see here)[http://kipoi.org/docs/postprocessing/variant_effect_prediction]. Inspection of the activation of hidden model layers and calculation of gradients is available for all deep learning models: Currently supported are Keras, PyTorch and Tensorflow models. For a detailed description and examples of how to use tose features please take a look at:

  • (Variant effect prediction)[# Using variant effect prediction]
  • (Mutation maps)[# Mutation maps]
  • ((Intermediate) layer activation extraction)[# Layer activation extraction]
  • (Gradient calculation)[# Gradient calculation]

Using variant effect prediction

This chapter describes how to run variant prediction using a model in the zoo either using the python functionality or using the command line. A prerequesite is that the model is compatible with variant effect prediction (see: Variant effect prediction prerequesites for the model.yaml and dataloader.yaml)

Variant effect prediction in python

Using variant effect prediction within python allows more flexibility in the finegrain details compared to using the command line interface. The core function of variant effect prediction is score_variants, which on the one hand requires a model with its dataloader as well as a valid VCF.

The easiest way to run variant effect prediction is the following:

from kipoi.postprocessing.variant_effects import score_variants
dataloader_arguments = {...}
score_variants(model = "my_model_name",
               dl_args = dataloader_arguments,
               input_vcf = "path/to/my_vcf.vcf",
               output_vcf = "path/to/my_annotated_vcf.vcf",)

Where model is a kipoi model - replace my_model_name by a valid model name. dataloader_arguments contains all the kwargs that are necessary to run the dataloader. The coordinates in the input_vcf have to match the genome / assembly etc. of the raw input files used by the dataloader. The output of score_variants is an annotated VCF - output_vcf. For more details please look at the detailed function description of score_variants. For details on the different scoring methods please take a look at the detailed explanation of variant effect prediction or the API defintion.

The above code will run the dataloader based with dataloader_arguments and try to overlap the input VCF with the sequences generated by the dataloader. If a model dataloader accepts bed files input to control the generated regions, then a temporary bed file with variant-centered regions will be generated. If the dataloader does not offer a bed file input then the inputs generated by the dataloader will automatically be overlapped with the positions in the VCF and only the overlapping regions / variants are tested. For more control over the region generation please use kipoi.postprocessing.variant_effects.predict_snvs function's vcf_to_region argument with SnvCenteredRg, SnvPosRestrictedRg, or None. In the following section features of both functions score_variants and predict_snvs will be explained - please keep in mind that score_variants is a wrapper function around predict_snvs to cover most frequent use cases and reduce complexity of the user's code.

Test region generation based on VCFs:

Variant-centered effect prediction

In the above example the regions were defined by the dataloader arguments, but if the dataloader supports bed file input (see dataloader.yaml definition for variant effect prediction) then the SnvCenteredRg class can generate a temporary bed file using a VCF and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor instance model_info:

from kipoi.postprocessing.variant_effects import SnvCenteredRg
vcf_to_region = SnvCenteredRg(model_info)

The resulting vcf_to_region object can then be used as the vcf_to_region argument when calling predict_snvs.

Restricted variant-centered effect prediction

This funcionality is similar to variant-centered effect prediction - the only difference is that this function is designed for models that can't predict on arbitrary regions of the genome, but only in certain regions of the genome. If those regions can be defined in a bed file (further on called 'restriction-bed' file) then this approach can be used. Variant effect prediction will then intersect the VCF with the restriction-bed and generate another bed file that is then passed on to the dataloader. Regions in the restriction-bed file may be larger than the input sequence lenght, in that case the generated seuqence will be centered on the variant position as much as possible - restricted by what is defined in the restrictions-bed file. The SnvPosRestrictedRg class can generate a temporary bed file using a VCF, the restrictions-bed file (restricted_regions_fpath in the example below) and information on the required input sequence length from the model.yaml which is extracted by the ModelInfoExtractor instance model_info:

from kipoi.postprocessing.variant_effects import SnvPosRestrictedRg
import pybedtools as pb

pbd = pb.BedTool(restricted_regions_fpath)
vcf_to_region = SnvPosRestrictedRg(model_info, pbd)

The resulting vcf_to_region object can then be used as the vcf_to_region argument when calling predict_snvs.

Scoring functions

Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: Logit, LogitAlt, LogitRef, Diff, DeepSEA_effect. These functions are described in more detail in the variant effect prediction pages.

These and custom scoring functions can be used in the score_variants function by setting the scores as a list of strings, for example: ["logit", "diff"]. This list can contain strings of the implemented scoring functions ("diff", "ref", "alt", "logit", "logit_ref", "logit_alt", "deepsea_effect") or callables that are inherited from RCScore.

Fine-tuning scoring functions

The default scoring functions (Logit, LogitAlt, LogitRef, Diff, DeepSEA_effect) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging argument which can be "min", "max", "mean", "median" or "absmax". So when using the score_variants function the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the scores_kargs argument would be: [{"rc_merging": "max"}] and scores would be ["diff"]. The default rc_merging value is "mean".

Saving mutated sequence sets to a file

A specialised feature of the predict_snvs function that is only available when using the python functions is to save the mutated sequence sets in a file. This can be useful for quality control or if a non-deeplearning model outside the model zoo should be run using the same data. For those cases instances of the SyncHdf5SeqWriter can be used. If they are passed to predict_snvs as the argument generated_seq_writer then the respective sequences are written to a file. Keep in mind that when defining a generated_seq_writer then no actual effect prediction is performed, but only the reference/alternative sequence sets are generated and saved.

Return predictions

By default effect predicions are not kept in memory, but only written to the output VCF to ensure a low memory profile. By setting the parameter return_predictions = True in predict_snvs or in score_variants the effect predictions are accumulated in memory and the results are returned as a dictionary of DataFrames, where the keys are the labels of the used scoring functions and the DataFrames have the shape (number effect predictions, number of output tasks of the model).

Using the command line interface

Similar to kipoi predict variant effect prediction can be run by executing:

kipoi postproc score_variants my_model_name \
    --dataloader_args '{...}' \
    --vcf_path path/to/my_vcf.vcf \
    --out_vcf_fpath path/to/my_annotated_vcf.vcf

Exceptions are that if the dataloader of the model allows the definition of a bed input file, then the respective field in the --dataloader_args JSON will be replaced by a bed file that consists in regions that are centered on the variant position. That is, if in the dataloader.yaml file of the respective model the bed_input flag is set then the respective argument in the --dataloader_args will be overwritten.

When using variant effect prediction from the command line and using --source dir, keep in mind that whatever the path is that you put where my_model_name stands in the above command is treated as your model name. Since the annotated VCF INFO tags contain the model name as an identifier, executing kipoi postproc score_variants ./ --source dir ... will result in an annotated VCF with the model name ".", which is most probably not desired. For those cases kipoi postproc score_variants ... should be executed in at least one directory level higher than the one where the model.yaml file lies. Then the command will look similar to this kipoi postproc score_variants ./my_model --source dir ... and the annotated VCF INFO tags will contain './my_model'.

Scoring functions

Scoring functions perform calculations on the model predictions for the reference and alternative sequences. Default scoring functions are: logit, logit_alt, logit_ref, diff, deepsea_effect. These functions are described in more detail in the variant effect prediction pages.

Given a model is compatible with said scoring functions one or more of those can be selected by using the --scoring argument, e.g.: --scoring diff logit. The model.yaml file defines which scoring functions are available for a model, with the exception that the diff scoring function is available for all models. In the model.yaml also additional custom scoring functions can be defined, for details on please see the variant effect prediction pages. The labels by which the different scoring functions are made available can also be defined in the model.yaml file using the name tag.

Fine-tuning scoring functions

Scoring functions may have or may even require arguments at instantiation. Those arguments can be passed as JSON dictionaries to scoring functions by using the --scoring_kwargs argument. If --scoring_kwargs is used then for every label set in --scoring there must be a --scoring_kwargs JSON in the exact same order. If the degault values should be used or no arguments are required then an empty dictionary ({}) can be used. For example: --scoring diff my_scr --scoring_kwargs '{}' '{my_arg:2}' will use diff with the default parameters and will instantiate my_scr(my_arg=2).

The default scoring functions (logit, logit_alt, logit_ref, diff, deepsea_effect) offer different options on how the forward and the reverse complement sequences are merged together. They have an rc_merging argument which can be "min", "max", "mean", "median" or "absmax". So if the maximum between forward and reverse complement sequences for the alt-ref prediction differences should be returned, then the command would be: --scoring diff --scoring_kwargs '{rc_merging:"max"}'. By default rc_merging is set to "mean".

Mutation maps

Mutation maps are related to variant effect prediction discussed above. Mutation maps are the application of SNV variant effect prediction on every position of the input sequence with all three alternative alleles. Therefore mutation maps can only be generated for models that support variant effect prediction. Mutation maps can be used to give an overview over the effect scores in a selected region. This region may be centered on a variant of interest or any other region in the genome for which the model can produce a prediction. It is therefore complementary to the variant effect prediction functionality and is intended for use with less variants / regions of interest as the variant effect prediction itself. Typically a mutation map should be calculated for only a handful of regions or query variants (that each tag a region), because for every region many effect predictions have to calculated resulting in calculation time and memory requirements: For a single query variant / query region N = model_sequence_length * 3 * model_output_tasks * effect_scoring_functions effect predictions have to be performed.

The workflow is desinged in two steps: In a first step the aforementioned calculation is performed and results are stored in an hdf5 file with standardised format. These files can then be imported into the visualisation part of mutation maps. Both steps are available in python, R as well as the command line.

Calculating mutation maps

Python / R API

The core element of mutation maps is the MutationMap class that is instantiated with a Kipoi model object, a dataloader object and the dataloader arguments:

import kipoi
from kipoi.postprocessing.variant_effects import MutationMap

model = kipoi.get_model(<model_name>)
dataloader = model.default_dataloader
dataloader_arguments = {...}

mm = MutationMap(model, dataloader, dataloader_arguments)

MutationMap instances have the following methods to calculate mutation maps for the given query regions / variants: query_region, query_bed, query_vcf. All those functions return an instance of MutationMapPlotter, which can be stored as a hdf5 file or directly be used for generating mutation map plots.

query_region

The query_region command can be used to generate a mutation map for a selected genomic region:

mmp = mm.query_region("chr22", 25346, 25357)

The query region has to be transformed to match model input sequence length as well as it has to lie in a genomic region for which the model can produce prediction. All this is taken care of automatically just like in the score_variants function of kipoi.postprocessing.variant_effects. For more details please see below.

query_bed

The query_bed command can be used to generate mutation maps for genomic regions defined in the bed file:

mmp = mm.query_region("path/to/my/file.bed")

The query regions have to be transformed to match model input sequence length as well as they hasveto lie in a genomic region for which the model can produce prediction. All this is taken care of automatically just like in the score_variants function of kipoi.postprocessing.variant_effects. For more details please see below.

query_vcf

The query_vcf command can be used to generate mutation maps based on variants defined in the vcf file:

mmp = mm.query_vcf("path/to/my/file.vcf")

The regions for query variants are generated analogously to the score_variants function of kipoi.postprocessing.variant_effects. For more details please see below.

MutationMapPlotter

Instances of the MutationMapPlotter class are generated by the query_* methods of the MutationMap class. They contain all the effect predictions plus some additional meta data necessary to produce mutation map plots. Those objects can be stored in hdf5 files.

For plotting the plot_mutmap function can be used. The required arguments select the input sequence by a numerical index (input_entry), the name of the DNA sequence model input (model_seq_input), the name of the scoring function for which the results should be displayed (scoring_key) and finally the model output task (model_output). A combination of those four values directly link to one set of mutation map predictions.

input_entry

input_entry is a numerical index indicating which set of input data should be used. This relates back to how query regions are turned into model input data, see below. Since this link depends model sequence length as well as whether the model can only predict for a restricted subset of he genome, the meaning of an index value may vary from model to model. For a combination of models with highly different model input specifications it is therefore advisable to only query a single variant or region in order to avoid confusion.

model_seq_input

Many models will only have a single model input key, so this parameter might seem superfluous, but in general a model can have multiple DNA sequence inputs which are all being tested for variant effects.

scoring_key

The scoring key is one of the labels passed to the query_* function in the scores argument.

model_output

model_output is a model output task label.

Additional to the required arguments the plots can be generated for a subset of the model input sequence using the limit_region_genomic. The plot can be generated with reverse-complementation of the sequence by using rc_plot. There are additional features available for the python/R API which are described in the method definition, some of which are also used in the mutation_map.ipynb.

The CLI

In the CLI mutation maps can be calculated for bed files or for VCF files. Both file formats are accepted by the --regions_file argument of the CLI command:

kipoi postproc create_mutation_map <my_model_name> --dataloader_args '{...}' --regions_file path/to/my/file.vcf --output path/to/the/results/file.hdf5

Plotting

Plotting in the command line works analogously as using the python API:

kipoi postproc plot_mutation_map --input_file path/to/the/results/file.hdf5 --input_entry 0 --model_seq_input seq --scoring_key diff --model_output my_model_task --output path/to/the/plot/file.png

The meaning of the parameters is identical to the ones in the python API mentioned above. The plotting functionality in the CLI is limited to zooming into genomic region and reverse-complementation of sequences. For examaples please take a look at the mutation_map.ipynb.

Transformation of queries to model input

In order to perform a query on a model the query input must be transformed into genomic regions compatible with the model. Similar to variant effect prediction using the score_variants the automatically chosen region generation method will be chosen based on whether a dataloader offers a bed file input for postprocessing. dataloader.yaml > postprocessing > variant_effects > bed_input. By setting this value the mutation map method will automatically generate a temporary bed input file requesting model input for genomic regions. The path of this temporary bed file is then passed on to the dataloader by resetting the respective argument in the datalaoder_arguments. For some models it is not possible to freely define the genomic region for which model input data should be generated - in that case the dataloader.yaml does not have the dataloader.yaml > postprocessing > variant_effects > bed_input set. In those cases the datalaoder is executed without modifying the dataloader_arguments. The metadata generated alongside the model input is then used to identify model input that overlaps a query region / query variant.

For cases when genomic regions can be defined freely for a model, the input samples will always have to generated matching the model input sequence length. This means that for query variants a region of the length of the model input will be centered on the query variant position. For query regions (e.g.: bed input file) every region is overlapped with windows of length of the model input. The first of those regions will start at the same position as the selected query region. Regions of the length of the model input sequence length will then be generated consecutively in order to cover the full region defined by the respective query region - see this schematic:

variant effect prediction sketch

In the top bit of this schematic on can see the case in which the dataloader accepts a bed file as an input to generate model input data. This also requires the correct setup of the dataloader.yaml in postprocessing > variant_effects > bed_input as described in more detail (here)[../postprocessing/variant_effect_prediction]. When the dataloader doesn't support bed input files for region defintion then all the regions generated by the dataloader will be overlapped with the query regions and any overlapping data generated from the dataloader will be used for mutation maps. The same as for bed query files holds true for VCF query files.

As mentioned above all of those model input sequences are the subjected to variant effect prediction for every base and ever possible allele.

The integer numbers displayed in the green or orange boxes are te order in which model input data is processed by the mutation map calculation algorithm. The numbers represent the index by which the predictions can then be accessed for plotting (input_entry in plot_mutmap method of MutationMapPlotter or --input_entry CLI argument).

Layer activation extraction

Similar to model prediction on a batch of input data it is possible to predict the activation of intermediate layers. The layer can be selected using the layer argument and the value has to be the identifier of the respective layer.

python / R API

In python and R intermediate (hidden) layer activation can be calculated using the predict_activation_on_batch method of Kipoi models implementing the LayerActivationMixin. All deep learning frameworks supported by Kipoi implement this mixin:

import kipoi
# Get the model
model = kipoi.get_model("my_model_name")
dataloader_arguments = {...}
# Get the dataloader iterator
dl_iterator = model.default_dataloader(**dataloader_arguments).iterator()

# Get layer activation
res = model.predict_activation_on_batch(dl_iterator.__next__(), layer = "layer_of_interest")

CLI

The prediction of layer activation using the command line works by adding the --layer argument to the prediction command:

``` kipoi predict my_model_name --dataloader_arguments '{...}' --layer 'layer_of_interest' --output path/to/output/file.

Gradient calculation

The calculation of gradients of the activation of a layer with respect to input data has proven to be a powerful tool for model and data interpretation. This approach can explain which portions of the model input were important for a given model output or for a given activation of a hidden layer. This feature is known as saliency maps. A successor, which relates gradients to the respective model input is the gradinput_ score, which is - as its name implies - a multiplication of the first order gradient on the input data with the input data itself. _gradinput for genomics models is seen as an alterntiave / complement to perturbation-based approaches like variant effect prediction. The advantage of gradient-based approaches being that in a single calculation step the importance of all bases in an input DNA sqeuence is established at the same time. Additionally gradient based approaches are not limited to Kipoi models that support variant effect prediction, and therefore are usable on all deep learning models. Finally gradient-based approaches highlight the input feature importance for all model inputs at once, also taking interactions between input into account, which cannot be done in variant effect prediction (perturbation-based approaches).

Even though deep learning frameworks come with automatic gradient calculation as one of their core features, the accessibility of this functionality to the user is generally not straight forward and may vary from framework version to version. Kipoi comes with a consistent API to extract gradients for the Keras, PyTorch, and Tensorflow frameworks. The API takes advantage of the individual implementations of the automatic differentiation algorithm and in the frameworks, supporting multiple versions of those frameworks.

To offer a consistent design the gradient can only be calculated with respect to model input - not with respect to the input of a hidden layer. What is essentially needed in terms of parameters to perform the calculation is: the Kipoi model instance, the Kipoi model input data (generated by a dataloader), an identifier of the (hidden) layer from which (backwards) the gradient should be calculated (layer). The gradient can in most frameworks only be calculated on a scalar. To ensure compatibility Kipoi therefore uses averaging functions (avg_func) that average across the outputs of a layer. Amongst those the summation (sum), maximum (max), minimum (min) and the maximum of absolute values (absmax) are available. To allow more fine-grain control the user can subset outputs (filter_idx) of a selected layer that are then passed to the averaging function. The filter_idx arguments takes integers, slice objects, and any tuple consisting in combinations of integers and slice objects. What is passed to filter_idx will be used to select from the model layer output, it must therefore be compatible with the layer output shape.

To simplify the process for users who want to calculate the gradient starting from the model output (final layer) the final_layer argument can be used instead of explicitely selecting the layer by its ids with layer. This functionality has to be used with caution, as models may contain post-processing layers or non-linear activation functions as their final layer. Gradient calculation with respect to those values is generally not recommended. To overcome the problem of the last layer being a non-linear activation function another argument pre_nonlinearity was imlemented that tries to traverse the model graph from the selected layer towards model input in case the selected layer is a non-linear activation function. The pre_nonlinearity cannot be implemented for all models and using it may raise Exceptions in case the selected Kipoi model cannot support that feature. It is generally advisable to select a layer explicitely by using the (layer) to ensure that the desired calculations are being performed.

Gradient visualisation

To complete the model gradient calculation Kipoi comes with gradient visualisation tools, that can distinguish between 1-hot encoded DNA sequence model input and other model input. The default plotting function for 1-hot encoded DNA sequence is a seq-logo plot displaying the prediction output as letter height.

In case type of a model input is unknown or it is not DNA sequence a heatmap will be generated. In any case the visualisation tools are only available for one- or two-dimensional input data.

An additional feature is the generation of a bedgraph file instead of a heatmap/seqlogo plot. In the general case a seq_dim parameter has to be set defining the dimension in which every entry corresponds to one genomic position in a consecutive order.

The GradPlotter class which is repsonsible for gradient visualisations has a plot function that has only one required argument sample. sample is the integer index of dataloader sample for which the plot should be produced.

Python / R API

All of the arguments mentioned above are available within the grad_input method of Kipoi models that implement GradientMixin. Namely those are KerasModel, PyTorchModel, and TensorflowModel. For an example please refer to the grad_input.ipynb.

The returned output from the grad_input matches the structre of the inputs entry of the data batch generated by the model dataloader.

CLI

The command line interface is designed in analogy with the model prediction functionality plus the arguments exlpained above. The output can be stored in a tsv or an hdf5 file. In case a hdf5 output is chosen the gradient visualisation methods that come with Kipoi can be used. For an example please refer to the grad_input.ipynb.