predict_snvs
predict_snvs(model, dataloader, vcf_fpath, batch_size, num_workers=0, dataloader_args=None, vcf_to_region=None, vcf_id_generator_fn=default_vcf_id_gen, evaluation_function=analyse_model_preds, evaluation_function_kwargs={'diff_types': {'logit': kipoi_veff.scores.Logit()}}, sync_pred_writer=None, use_dataloader_example_data=False, return_predictions=False, generated_seq_writer=None)
Predict the effect of SNVs
Predicts the effects of SNVs listed in a VCF. If desired, the VCF can be stored with the predicted values as annotation. For a detailed description of the requirements in the yaml files, see the core kipoi documentation on how to write a dataloader.yaml file, or the kipoi-veff documentation, section overview/#model-and-dataloader-requirements.
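As a rough illustration of the paragraph above, the following sketch wires a model, its dataloader factory, and a VcfWriter into a single predict_snvs call so that the scores end up in an annotated VCF. The helper name annotate_vcf_with_effects, the model name, the file paths, and the keys of dataloader_args are hypothetical and depend on the model you use. The positional order assumed for VcfWriter (model, input VCF, output VCF) should be checked against the kipoi-veff documentation. The kipoi imports are placed inside the function so that the definition itself loads without kipoi installed.

```python
def annotate_vcf_with_effects(model_name, vcf_in, vcf_out, dataloader_args, batch_size=32):
    """Run predict_snvs and write the scores into an annotated VCF (sketch)."""
    # Imported lazily: these packages are only needed when the function is called.
    import kipoi
    from kipoi_veff import VcfWriter
    from kipoi_veff.snv_predict import predict_snvs
    from kipoi_veff.scores import Logit

    model = kipoi.get_model(model_name)
    dataloader = kipoi.get_dataloader_factory(model_name)

    # Assumption: VcfWriter takes the model handle, the input VCF path,
    # and the path of the annotated VCF to be written.
    writer = VcfWriter(model, vcf_in, vcf_out)

    return predict_snvs(
        model,
        dataloader,
        vcf_in,
        batch_size=batch_size,
        dataloader_args=dataloader_args,
        # Mirrors the default scoring setup shown in the signature.
        evaluation_function_kwargs={"diff_types": {"logit": Logit()}},
        sync_pred_writer=writer,
        return_predictions=True,
    )
```

A call might look like annotate_vcf_with_effects("MyModel", "variants.vcf", "variants.annotated.vcf", {"fasta_file": "genome.fa"}), where every argument value is a placeholder.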
The evaluation_function is called after the model predictions for the reference and alternative alleles have been made. By default the analyse_model_preds function is used, which executes the functions defined in its argument diff_types on the reference and alternative predictions of every sample. When using the default analyse_model_preds, evaluation_function_kwargs has to be set to {'diff_types': <dict>}, where <dict> is a dictionary of scoring functions (instances of subclasses of kipoi_veff.scores.Score). Its keys are used to annotate the VCF (and DataFrame) returned by predict_snvs.
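The idea of a diff_types dictionary can be sketched without kipoi at all. The classes below are hypothetical stand-ins, not the kipoi_veff.scores implementations, and apply_diff_types is a simplified illustration of the role described for analyse_model_preds: each scoring function receives the reference and alternative predictions, and its dictionary key labels the result. The real function may handle additional inputs that are ignored here.

```python
import math


def logit(p, eps=1e-7):
    """Log-odds of a probability, clipped away from 0 and 1."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


class LogitDiff:
    """Stand-in scoring function: logit(alt) - logit(ref) per sample."""

    def __call__(self, ref, alt):
        return [logit(a) - logit(r) for r, a in zip(ref, alt)]


class PlainDiff:
    """Stand-in scoring function: alt - ref per sample."""

    def __call__(self, ref, alt):
        return [a - r for r, a in zip(ref, alt)]


def apply_diff_types(ref_preds, alt_preds, diff_types):
    """Apply every scoring function and label its output with the dict key."""
    return {label: fn(ref_preds, alt_preds) for label, fn in diff_types.items()}


# Two samples: the first variant leaves the prediction unchanged,
# the second raises it from 0.2 to 0.8.
ref_preds = [0.5, 0.2]
alt_preds = [0.5, 0.8]
scores = apply_diff_types(
    ref_preds, alt_preds, {"logit": LogitDiff(), "diff": PlainDiff()}
)
```

The keys "logit" and "diff" play the role of the annotation labels mentioned above.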
Arguments
- model: A kipoi model handle, generated e.g. by kipoi.get_model()
- dataloader: Dataloader factory, generated e.g. by kipoi.get_dataloader_factory()
- vcf_fpath: Path of the VCF defining the positions that shall be assessed. Only SNVs will be tested.
- batch_size: Prediction batch size used for calling the data loader. Each batch will be generated in 4 mutated states yielding a system RAM consumption of >= 4x batch size.
- num_workers: Number of parallel workers for loading the dataset.
- dataloader_args: Arguments passed on to the dataloader for sequence generation. Arguments mentioned in dataloader.yaml > postprocessing > variant_effects > bed_input will be overwritten by the methods here.
- vcf_to_region: Callable that generates a region compatible with dataloader/model from a cyvcf2 record
- vcf_id_generator_fn: Callable that generates a unique ID from a cyvcf2 record
- evaluation_function: Effect evaluation function. Default is analyse_model_preds, which will get the arguments defined in evaluation_function_kwargs.
- evaluation_function_kwargs: kwargs passed on to evaluation_function.
- sync_pred_writer: Single writer or list of writer objects, such as instances of VcfWriter. This object will be called after effect prediction of a batch is done.
- use_dataloader_example_data: Fill out the missing dataloader arguments with the example values given in the dataloader.yaml.
- return_predictions: Return all variant effect predictions as a dictionary. Setting this to False will help maintain a low memory profile and is faster as it avoids concatenating batches after prediction.
- generated_seq_writer: Single writer or list of writer objects, such as instances of SyncHdf5SeqWriter. This object will be called after the DNA sequence sets have been generated. If this parameter is not None, no prediction will be performed and only the DNA sequences will be written! This is relevant if you want to use predict_snvs to generate appropriate input DNA sequences for your model.
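For the vcf_id_generator_fn argument, a custom callable could look like the sketch below. It is not the library's default_vcf_id_gen. The function name chrom_pos_ref_alt_id and the ID layout are hypothetical, and the only assumption about the input is that it exposes the CHROM, POS, REF and ALT attributes of a cyvcf2 record. A SimpleNamespace stands in for a real record so that the snippet runs without cyvcf2.

```python
from types import SimpleNamespace


def chrom_pos_ref_alt_id(record, delim=":"):
    """Build an ID such as 'chr1:12345:A:G' from a cyvcf2-style record."""
    # cyvcf2 exposes ALT as a list of alternative alleles; joining them keeps
    # the ID a single string even for multi-allelic records.
    alt = ",".join(record.ALT)
    return delim.join([str(record.CHROM), str(record.POS), record.REF, alt])


# Stand-in for a cyvcf2 record, used only to exercise the function here.
example_record = SimpleNamespace(CHROM="chr1", POS=12345, REF="A", ALT=["G"])
example_id = chrom_pos_ref_alt_id(example_record)
```

Such a function would be passed as vcf_id_generator_fn=chrom_pos_ref_alt_id. Whatever layout is chosen, the IDs need to be unique per variant.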
Returns
dict: A dictionary containing a pandas DataFrame of the calculated values, with one entry for each model output (target) column and each VCF SNV line. If return_predictions == False, returns None.
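Because return_predictions defaults to False, calling code has to cope with a None result. The helper below is a minimal sketch of that check. The name summarise_effects is hypothetical, and it assumes the returned dictionary is keyed by the diff_types labels with table-like values that expose a shape attribute, as pandas DataFrames do. A SimpleNamespace stands in for a DataFrame so that the snippet runs without pandas.

```python
from types import SimpleNamespace


def summarise_effects(result):
    """Map each score label to the shape of its table, or {} if nothing was returned."""
    if result is None:
        # predict_snvs was run with return_predictions=False.
        return {}
    return {label: table.shape for label, table in result.items()}


# Stand-in for a DataFrame with 3 variants and 2 model output columns.
fake_table = SimpleNamespace(shape=(3, 2))
summary = summarise_effects({"logit": fake_table})
empty_summary = summarise_effects(None)
```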