predict_snvs

predict_snvs(model, dataloader, vcf_fpath, batch_size, num_workers=0, dataloader_args=None, vcf_to_region=None, vcf_id_generator_fn=default_vcf_id_gen, evaluation_function=analyse_model_preds, evaluation_function_kwargs={'diff_types': {'logit': <kipoi_veff.scores.Logit instance>}}, sync_pred_writer=None, use_dataloader_example_data=False, return_predictions=False, generated_seq_writer=None)

Predict the effect of SNVs

Predicts the effects of the SNVs listed in a VCF file. If desired, the VCF can be stored with the predicted values as annotations. For a detailed description of the requirements in the yaml files, see the core kipoi documentation on how to write a dataloader.yaml file, or the kipoi-veff documentation section overview/#model-and-dataloader-requirements.

The evaluation_function is called after the model predictions for the reference and alternative alleles have been made. By default the analyse_model_preds function is used, which executes the functions defined in its diff_types argument on the reference and alternative predictions of every sample. When using the default analyse_model_preds, evaluation_function_kwargs has to be set to {'diff_types': <dict>}, where <dict> is a dictionary of scoring functions (instances of subclasses of kipoi_veff.scores.Score). Its keys are used to label the annotations in the output VCF and in the DataFrame returned by predict_snvs.
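The relationship between diff_types and the resulting annotations can be illustrated with a small standalone sketch. It does not use kipoi_veff: the functions logit_diff, plain_diff and evaluate are hypothetical stand-ins, plain Python lists replace model output arrays, and the sign convention (alternative minus reference) is an assumption of this sketch rather than a statement about the library.

```python
import math


def logit(p):
    """Log-odds of a probability p in the open interval (0, 1)."""
    return math.log(p / (1.0 - p))


def logit_diff(ref_preds, alt_preds):
    """Per-variant difference of log-odds: logit(alt) - logit(ref)."""
    return [logit(a) - logit(r) for r, a in zip(ref_preds, alt_preds)]


def plain_diff(ref_preds, alt_preds):
    """Per-variant difference of raw predictions: alt - ref."""
    return [a - r for r, a in zip(ref_preds, alt_preds)]


def evaluate(ref_preds, alt_preds, diff_types):
    """Apply every scoring callable in diff_types to the ref/alt predictions.

    The dictionary keys become the labels of the results, mirroring how
    the keys of diff_types label the annotations.
    """
    return {name: score(ref_preds, alt_preds)
            for name, score in diff_types.items()}


# Two variants: the first leaves the prediction unchanged, the second raises it.
ref = [0.5, 0.2]
alt = [0.5, 0.8]
scores = evaluate(ref, alt, {"logit": logit_diff, "diff": plain_diff})
```

Here scores has one entry per key in diff_types, each holding one value per variant. The real Score subclasses operate on model outputs and may offer further options, so this is only meant to show the shape of the computation.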

Arguments

  • model: A kipoi model handle generated by e.g.: kipoi.get_model()
  • dataloader: Dataloader factory generated by e.g.: kipoi.get_dataloader_factory()
  • vcf_fpath: Path of the VCF defining the positions that shall be assessed. Only SNVs will be tested.
  • batch_size: Prediction batch size used for calling the data loader. Each batch is generated in 4 mutated states, so system RAM consumption is at least 4x the batch size.
  • num_workers: Number of parallel workers for loading the dataset.
  • dataloader_args: Arguments passed on to the dataloader for sequence generation. Arguments mentioned in dataloader.yaml > postprocessing > variant_effects > bed_input will be overwritten by predict_snvs.
  • vcf_to_region: Callable that generates a region compatible with dataloader/model from a cyvcf2 record
  • vcf_id_generator_fn: Callable that generates a unique ID from a cyvcf2 record
  • evaluation_function: Effect evaluation function. Default is analyse_model_preds, which receives the arguments defined in evaluation_function_kwargs.
  • evaluation_function_kwargs: Keyword arguments passed on to evaluation_function.
  • sync_pred_writer: Single writer or list of writer objects, such as instances of VcfWriter. Each writer will be called after effect prediction of a batch is done.
  • use_dataloader_example_data: Fill out the missing dataloader arguments with the example values given in the dataloader.yaml.
  • return_predictions: Return all variant effect predictions as a dictionary. Setting this to False will help maintain a low memory profile and is faster as it avoids concatenating batches after prediction.
  • generated_seq_writer: Single writer or list of writer objects, such as instances of SyncHdf5SeqWriter. Each writer will be called after the DNA sequence sets have been generated. If this parameter is not None, no prediction will be performed and only the DNA sequences will be written. This is relevant if you want to use predict_snvs to generate appropriate input DNA sequences for your model.
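The arguments above might be combined as in the following sketch, which writes an annotated VCF and also returns the predictions. The wrapper name annotate_vcf_with_effects is hypothetical. The import paths (kipoi_veff.snv_predict, kipoi_veff.VcfWriter), the VcfWriter constructor arguments and the argument-free Logit() call are assumptions that should be checked against the installed kipoi-veff version. The contents of dataloader_args depend entirely on the chosen model.

```python
def annotate_vcf_with_effects(model_name, vcf_path, out_vcf_path,
                              dataloader_args, batch_size=32):
    """Score the SNVs in vcf_path and write an annotated copy to out_vcf_path."""
    # Imported lazily so this function can be defined without kipoi installed.
    import kipoi
    from kipoi_veff import VcfWriter                 # assumed import path
    from kipoi_veff.scores import Logit
    from kipoi_veff.snv_predict import predict_snvs  # assumed import path

    model = kipoi.get_model(model_name)
    dataloader = kipoi.get_dataloader_factory(model_name)

    # Assumption: VcfWriter takes the model, the input VCF and the output path.
    writer = VcfWriter(model, vcf_path, out_vcf_path)

    return predict_snvs(
        model, dataloader, vcf_path,
        batch_size=batch_size,
        dataloader_args=dataloader_args,
        # The key "logit" labels the score in the VCF and in the returned data.
        evaluation_function_kwargs={"diff_types": {"logit": Logit()}},
        sync_pred_writer=writer,
        return_predictions=True,
    )
```

For large VCFs, leaving return_predictions at its default of False and relying on the writer alone keeps memory usage low, as described for that argument.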

Returns

dict: A dictionary containing a pandas DataFrame of the calculated values, with one column per model output (target) and one row per VCF SNV line. If return_predictions == False, returns None.
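Because the return value is None unless return_predictions is set, calling code may want to guard against that case before inspecting the result. A minimal sketch, assuming the returned dictionary maps score labels (the diff_types keys) to DataFrame-like objects exposing a shape attribute; the helper name describe_effects is hypothetical.

```python
def describe_effects(effects):
    """Summarise a predict_snvs-style result as {label: (n_rows, n_columns)}.

    Returns an empty dict when effects is None, which is what predict_snvs
    returns when return_predictions is False.
    """
    if effects is None:
        return {}
    return {label: tuple(frame.shape) for label, frame in effects.items()}
```

With a real result, each entry would report how many SNV lines and model outputs were scored under that label.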