Generated from notebooks/mutation_map.ipynb

Generate a mutation map

This inotebook shall give an simple introduction in how to produce mutation maps for all models that are enabled for variant effect prediction. Let's use the DeepSEA/variantEffects model.

Variant selection

In this example we will first run the variant effect prediction code that is described in detail in the variant_effect_prediction_simple.ipynb. We will use these effect scores to select variants with the strongest effects to then visualise one of those variants in a mutation map.

# First let's select and setup the model:
import kipoi
model_name = "DeepSEA/variantEffects"
#kipoi.pipeline.install_model_requirements(model_name)
# The input vcf path
vcf_path = "example_data/clinvar_donor_acceptor_chr22.vcf"
# The output vcf path, based on the input file name    
out_vcf_fpath = vcf_path[:-4] + "%s.vcf"%model_name.replace("/", "_")
# The datalaoder keyword arguments
dataloader_arguments = {"fasta_file": "example_data/hg19_chr22.fa"}
# Now actually run the effect prediction using the logit difference:
import kipoi_veff.snv_predict as sp
sp.score_variants(model = model_name,
                  dl_args = dataloader_arguments,
                  scores = ["logit"],
                  input_vcf = vcf_path,
                  output_vcf = out_vcf_fpath)

100%|██████████| 14/14 [00:34<00:00,  2.43s/it]

Just like in the variant_effect_prediction_simple.ipynb we will now load the results from the generated VCF into a dataframe for easy data access:

from kipoi_veff.parsers import KipoiVCFParser
vcf_reader = KipoiVCFParser("example_data/clinvar_donor_acceptor_chr22DeepSEA_variantEffects.vcf")
import pandas as pd
entries = [el for el in vcf_reader]
entries_df = pd.DataFrame(entries)
entries_df.index = entries_df["variant_id"]

Now we can select a subset of variants that score with the strongest score across the most model output tasks:

# select the 5 variants with the most universal strongest predicted effect in chr22
logit_cols = entries_df.columns.str.contains("LOGIT")
top5_vars = entries_df.loc[:,logit_cols].abs().idxmax().value_counts().head()
top5_vars

12342    305
16376    194
8374     127
22158     65
9970      54
dtype: int64

Now make a subset dataframe:

top5_df = entries_df.loc[entries_df["variant_id"].isin(top5_vars.index)]
top5_df

	variant_id	variant_chr	variant_pos	variant_ref	variant_alt	KV_DeepSEA/variantEffects_LOGIT_8988T_DNase_None_0	KV_DeepSEA/variantEffects_LOGIT_AoSMC_DNase_None_1	KV_DeepSEA/variantEffects_LOGIT_Chorion_DNase_None_2	KV_DeepSEA/variantEffects_LOGIT_CLL_DNase_None_3	KV_DeepSEA/variantEffects_LOGIT_Fibrobl_DNase_None_4	...	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H2AZ_None_910	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K27ac_None_911	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K27me3_None_912	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K36me3_None_913	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me1_None_914	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me2_None_915	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me3_None_916	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K79me2_None_917	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K9me3_None_918	KV_DeepSEA/variantEffects_rID_unnamed_0_0
variant_id
16376	16376	chr22	29108009	G	C	-2.560048	-2.604959	-1.476766	-1.966083	-1.286860	...	0.068821	0.474027	0.227797	0.258411	0.544803	0.144552	0.175402	0.575531	0.055439	chr22:29108009:G:['C']
22158	22158	chr22	29121356	C	G	0.148644	2.256156	0.885380	-0.016387	0.423442	...	0.033477	0.043208	-0.120319	-0.324504	0.189594	0.147367	-0.087590	-0.388921	-0.023764	chr22:29121356:C:['G']
9970	9970	chr22	36701970	G	A	0.504692	1.658696	1.032887	-0.148108	0.914291	...	0.321597	0.299887	0.003522	-0.060838	0.473003	0.342598	0.247115	-0.056063	-0.063615	chr22:36701970:G:['A']
8374	8374	chr22	40750331	A	G	0.571587	0.248589	0.201321	1.890934	0.111852	...	0.421956	-0.020715	-0.156171	-0.151327	0.022475	0.192291	0.189058	-0.144645	-0.187582	chr22:40750331:A:['G']
12342	12342	chr22	50964903	C	G	-1.935331	-2.190516	-1.468465	-1.936842	-1.670801	...	0.236538	0.432350	0.886636	0.113601	0.007765	-0.198594	-0.245576	0.767322	-0.188115	chr22:50964903:C:['G']

5 rows × 925 columns

Based on that selection let's now generate an input VCF file that can then be used for mutation map calculation:

query_vcf = "example_data/clinvar_donor_acceptor_chr22_5vars.vcf"
variant_order = []
# now subset the VCF:
with open(query_vcf, "w") as ofh:
    with open(vcf_path, "r") as ifh:
        for l in ifh:
            els = l.split(" ")
            if l.startswith("#"):
                ofh.write(l)
            elif els[2] in top5_vars.index.tolist():
                variant_order.append(els[2])
                ofh.write(l)

! cat $query_vcf

##fileformat=VCFv4.0
##FILTER=<ID=PASS,Description="All filters passed">
##contig=<ID=chr1,length=249250621>
##contig=<ID=chr2,length=243199373>
##contig=<ID=chr3,length=198022430>
##contig=<ID=chr4,length=191154276>
##contig=<ID=chr5,length=180915260>
##contig=<ID=chr6,length=171115067>
##contig=<ID=chr7,length=159138663>
##contig=<ID=chr8,length=146364022>
##contig=<ID=chr9,length=141213431>
##contig=<ID=chr10,length=135534747>
##contig=<ID=chr11,length=135006516>
##contig=<ID=chr12,length=133851895>
##contig=<ID=chr13,length=115169878>
##contig=<ID=chr14,length=107349540>
##contig=<ID=chr15,length=102531392>
##contig=<ID=chr16,length=90354753>
##contig=<ID=chr17,length=81195210>
##contig=<ID=chr18,length=78077248>
##contig=<ID=chr19,length=59128983>
##contig=<ID=chr20,length=63025520>
##contig=<ID=chr21,length=48129895>
##contig=<ID=chr22,length=51304566>
##contig=<ID=chrX,length=155270560>
##contig=<ID=chrY,length=59373566>
##contig=<ID=chrMT,length=16569>
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
chr22   40750331    8374    A   G   .   .   .
chr22   36701970    9970    G   A   .   .   .
chr22   50964903    12342   C   G   .   .   .
chr22   29108009    16376   G   C   .   .   .
chr22   29121356    22158   C   G   .   .   .

Now we also want the dataframe to have the same order as the VCF:

top5_df = top5_df.loc[variant_order, :]
top5_df

	variant_id	variant_chr	variant_pos	variant_ref	variant_alt	KV_DeepSEA/variantEffects_LOGIT_8988T_DNase_None_0	KV_DeepSEA/variantEffects_LOGIT_AoSMC_DNase_None_1	KV_DeepSEA/variantEffects_LOGIT_Chorion_DNase_None_2	KV_DeepSEA/variantEffects_LOGIT_CLL_DNase_None_3	KV_DeepSEA/variantEffects_LOGIT_Fibrobl_DNase_None_4	...	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H2AZ_None_910	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K27ac_None_911	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K27me3_None_912	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K36me3_None_913	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me1_None_914	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me2_None_915	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K4me3_None_916	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K79me2_None_917	KV_DeepSEA/variantEffects_LOGIT_Osteoblasts_H3K9me3_None_918	KV_DeepSEA/variantEffects_rID_unnamed_0_0
variant_id
8374	8374	chr22	40750331	A	G	0.571587	0.248589	0.201321	1.890934	0.111852	...	0.421956	-0.020715	-0.156171	-0.151327	0.022475	0.192291	0.189058	-0.144645	-0.187582	chr22:40750331:A:['G']
9970	9970	chr22	36701970	G	A	0.504692	1.658696	1.032887	-0.148108	0.914291	...	0.321597	0.299887	0.003522	-0.060838	0.473003	0.342598	0.247115	-0.056063	-0.063615	chr22:36701970:G:['A']
12342	12342	chr22	50964903	C	G	-1.935331	-2.190516	-1.468465	-1.936842	-1.670801	...	0.236538	0.432350	0.886636	0.113601	0.007765	-0.198594	-0.245576	0.767322	-0.188115	chr22:50964903:C:['G']
16376	16376	chr22	29108009	G	C	-2.560048	-2.604959	-1.476766	-1.966083	-1.286860	...	0.068821	0.474027	0.227797	0.258411	0.544803	0.144552	0.175402	0.575531	0.055439	chr22:29108009:G:['C']
22158	22158	chr22	29121356	C	G	0.148644	2.256156	0.885380	-0.016387	0.423442	...	0.033477	0.043208	-0.120319	-0.324504	0.189594	0.147367	-0.087590	-0.388921	-0.023764	chr22:29121356:C:['G']

5 rows × 925 columns

Calculating the mutation map

Now we are set to generate a mutation map based on a VCF containing variants of interest:

import kipoi
model_name = "DeepSEA/variantEffects"
#kipoi.pipeline.install_model_requirements(model_name)

from kipoi_veff import MutationMap

Set up the mutation map object with the necessary information such as the model and dataloader objects as well as the dataloader arguments to actually perform the calculation

model = kipoi.get_model(model_name)
dataloader = model.default_dataloader
dataloader_arguments = {"fasta_file": "example_data/hg19_chr22.fa"}
mm = MutationMap(model, dataloader, dataloader_args = dataloader_arguments)

Mutation maps can be generated based on a VCF file (mm.query_vcf), based on a bed file (mm.query_bed) or in the python API also based on a region (mm.query_region) that is passed by arguments. Here we will be using the mm.query_vcf method:

mmp = mm.query_vcf(query_vcf, scores = ["logit"])

The object returned by mm.query_* methods are MutationMapPlotter objects that can either be stored in a hdf5 file or be used for plotting using their plot_mutmap method.

The data in the MutationMapPlotter is stored hierarchically by:

Line of the query input (file), which is an integer number. For bed files the index is for regions of length of the model input that overlapped the regions defined in the region definition bed file.
The name of the model (DNA sequence) input with regard to which the effect predictions were calculated.
The scoring function defined by the score argument
The model output task for which the effect prediction was calculated

Therefore also for plotting one must select the dataset that should be plotted. Here we will select the fourth variant in the VCF (id: 16376) and the model task which had the strongest absolute predicted effect in our initial variant selection.

# Fourth variant in the VCF
sel_line = 3
# Select the model output being affected the most by this variant:
model_task = top5_df.loc[:,logit_cols].iloc[sel_line].abs().idxmax().split("_LOGIT_")[1]
model_task

'SK-N-SH_RA_CTCF_None_385'

mmp.plot_mutmap(sel_line, "seq", "logit", model_task)

<matplotlib.axes._subplots.AxesSubplot at 0x2b0856e433c8>

png

As we can see we can't see much. Everything is a bit squished, but the variant is already annotated with the ID that was given in the VCF file: 16376.

By default the full input sequence length is displayed, which is 1000bp for the DeepSEA model. In order to zoom in let's find the variant position and then zoom down to a 80bp window surrounding the variant:

var_pos = top5_df["variant_pos"].iloc[sel_line]
var_pos

29108009

In order to zoom in one can use the limit_region_genomic argument which accepts a tuple of two values - start and end of the genomic region (0-based) which should be selected for plotting.

mmp.plot_mutmap(sel_line, "seq", "logit", model_task, limit_region_genomic=(var_pos -40, var_pos+40))

<matplotlib.axes._subplots.AxesSubplot at 0x2b0857c1b198>

png

Now since we are looking at CTCF track one might suspect that the variant lies in a CTCF binding motif, but it is somewhat not 100% clear. Let's see whether a motif on the reverse strand is actually affected:

mmp.plot_mutmap(sel_line, "seq", "logit", model_task, limit_region_genomic=(var_pos -40, var_pos+40), rc_plot=True)

<matplotlib.axes._subplots.AxesSubplot at 0x2b085741f400>

png

Add additional annotation (dbSNP)

Now that we know how to zoom and reverse-complement mutation maps let's also try and highlight further variants in the region. For that I have prepared a subset of variants of the dbSNP b151 GRCh37 All_20180423.vcf.gz in the proximity of the selected variant. We can use this VCF to highlight dbSNP variants in the region.

Let's first get the variant information into the right format:

import cyvcf2
vcf_obj = cyvcf2.VCF("example_data/dbsnp_chr22_29108009.vcf")
variants = {"chr":[], "pos":[], "id":[], "ref":[], "alt":[]}
for rec in vcf_obj:
    # skip indels
    if rec.is_indel:
        continue
    variants["chr"].append(rec.CHROM)
    variants["pos"].append(rec.POS)
    variants["id"].append(rec.ID)
    variants["ref"].append(rec.REF)
    variants["alt"].append(rec.ALT[0])

Now we can plug this dictionary into the plotting method and take a look at our annotated plot.

mmp.plot_mutmap(sel_line, "seq", "logit", model_task, limit_region_genomic=(var_pos -40, var_pos+40),
                rc_plot=True, annotation_variants = variants)

<matplotlib.axes._subplots.AxesSubplot at 0x2b08574d6e10>

png

Arguably it's not the most beautiful plot as labels start overlapping, so let's remove them and let's also exclude out seed variant, in order to highlight the difference we might actually also want to change the colour of the variant boxes to red:

mmp.plot_mutmap(sel_line, "seq", "logit", model_task, limit_region_genomic=(var_pos -40, var_pos+40),
                rc_plot=True, annotation_variants = variants, show_var_id=False,
                ignore_stored_var_annotation=True,var_box_color="red", )

<matplotlib.axes._subplots.AxesSubplot at 0x2b086f892f28>

png

The CLI

Mutation maps can be calculated using the command line interface analogously to the way presented above for the python API. Let's therefore collect all the input and output file names:

import json
model_name = "DeepSEA/variantEffects"
dataloader_arguments_str = "'%s'"%json.dumps({"fasta_file": "example_data/hg19_chr22.fa"})
query_vcf = "example_data/clinvar_donor_acceptor_chr22_5vars.vcf"
output_file = "example_data/clinvar_donor_acceptor_chr22_5vars_mutmap.hdf5"

Now we are ready to calculate the mutation map data:

! kipoi veff create_mutation_map $model_name --dataloader_args $dataloader_arguments_str \
   --scores logit --output $output_file --regions_file $query_vcf

The output of that was a hdf5 file that can either be loaded by the python API or by the CLI command plot_mutation_map as presented here:

plot_file = "example_data/clinvar_donor_acceptor_chr22_5vars_mmPlot.png"
region_start, region_end = var_pos -40, var_pos+40

! kipoi veff plot_mutation_map --input_file $output_file \
   --output $plot_file --input_entry 3 --model_seq_input seq \
    --scoring_key logit --model_output "SK-N-SH_RA_CTCF_None_385" \
    --limit_region_genomic $region_start $region_end \
    --rc_plot

[33mWARNING[0m [44m[kipoi.__main__][0m `kipoi postproc` has been deprecated. Please use kipoi <plugin> ...: 
    # - plugin commands:
    veff             Variant effect prediction
    interpret        Model interpretation using feature importance scores like ISM, grad*input or DeepLIFT.
[0m

And this is the result:

from IPython.display import Image
Image(filename=plot_file)

png