BedDataset

BedDataset(self, tsv_file, label_dtype=None, bed_columns=3, num_chr=False, ambiguous_mask=None, incl_chromosomes=None, excl_chromosomes=None, ignore_targets=False)

Reads a tsv file in the following format:

chr  start  stop  task1  task2 ...

Arguments

  • tsv_file: tsv file to read
  • bed_columns: number of columns corresponding to the bed file. All the columns after that will be parsed as targets
  • num_chr: if specified, 'chr' in the chromosome name will be dropped
  • label_dtype: specific data type for labels. Example: float or np.float32
  • ambiguous_mask: if specified, rows containing only ambiguous_mask values will be skipped
  • incl_chromosomes: exclusive list of chromosome names to include in the final dataset. If not None, only these will be present in the dataset.
  • excl_chromosomes: list of chromosome names to omit from the dataset.
  • ignore_targets: if True, target variables are ignored
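
The row handling described by these arguments can be sketched in plain Python. This is an illustration only, not the library's implementation: the function name read_bed_like and the EXAMPLE_TSV data are hypothetical, targets are parsed as floats for simplicity, and applying the chromosome filters after the 'chr' prefix is dropped is an assumption of this sketch.

```python
import csv
import io


def read_bed_like(handle, bed_columns=3, num_chr=False, ambiguous_mask=None,
                  incl_chromosomes=None, excl_chromosomes=None,
                  ignore_targets=False):
    """Yield (chrom, start, stop, targets) tuples from a tab-separated handle.

    Hypothetical helper that mirrors the documented arguments.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, start, stop = row[0], int(row[1]), int(row[2])
        if num_chr and chrom.startswith("chr"):
            chrom = chrom[len("chr"):]
        # Assumption: filters are applied to the (possibly stripped) name.
        if incl_chromosomes is not None and chrom not in incl_chromosomes:
            continue
        if excl_chromosomes is not None and chrom in excl_chromosomes:
            continue
        # Every column after the first `bed_columns` is treated as a target.
        targets = None if ignore_targets else [float(v) for v in row[bed_columns:]]
        # Skip rows whose targets are all equal to the ambiguous value.
        if targets and ambiguous_mask is not None and all(
                t == ambiguous_mask for t in targets):
            continue
        yield chrom, start, stop, targets


EXAMPLE_TSV = (
    "chr1\t0\t10\t1\t0\n"
    "chr1\t10\t20\t-1\t-1\n"   # only ambiguous values, skipped
    "chr2\t0\t10\t0\t1\n"
    "chrX\t5\t15\t1\t1\n"      # excluded chromosome
)

rows = list(read_bed_like(io.StringIO(EXAMPLE_TSV), num_chr=True,
                          ambiguous_mask=-1, excl_chromosomes=["X"]))
for row in rows:
    print(row)
```

With the example data, only the first and third rows remain, and their chromosome names are reported as '1' and '2'.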

StringSeqIntervalDl

StringSeqIntervalDl(self, intervals_file, fasta_file, num_chr_fasta=False, label_dtype=None, auto_resize_len=None, use_strand=False, force_upper=True, ignore_targets=False)

Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited intervals_file. Returned sequences are of the type np.array([str]).

Arguments

  • intervals_file: bed3+ file path containing intervals + (optionally) labels
  • fasta_file: Reference genome FASTA file path.
  • num_chr_fasta: if True, the dataloader will make sure that the chromosomes don't start with chr.
  • label_dtype: datatype of the task labels taken from the intervals_file. Example: str, int, float, np.float32. Default: None
  • auto_resize_len: required sequence length. Default: None
  • use_strand: reverse-complement fasta sequence if bed file defines negative strand. Requires a bed6 file
  • force_upper: Force uppercase output of sequences
  • ignore_targets: if True, don't return any target variables

Output schema

  • inputs:
    • shape=(), DNA sequence as string
  • targets:
    • shape=(None,), (optional) values following the bed-entries
  • metadata:
    • ranges:
      • Genomic ranges: chr, start, end, name, strand
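
The effect of use_strand and force_upper on a single interval can be sketched without the library. The helper extract_sequence and the in-memory genome dictionary are hypothetical stand-ins for the FASTA lookup, and the 0-based, half-open coordinates follow the usual BED convention, which this section does not state explicitly.

```python
# Translation table for complementing DNA while preserving case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_sequence(genome, chrom, start, end, strand=".",
                     use_strand=False, force_upper=True):
    """Return the sequence of one interval as a plain string.

    Hypothetical helper; `genome` maps chromosome names to sequences.
    """
    seq = genome[chrom][start:end]  # assumed 0-based, half-open (BED-like)
    if use_strand and strand == "-":
        # Reverse-complement intervals on the negative strand.
        seq = seq.translate(COMPLEMENT)[::-1]
    if force_upper:
        seq = seq.upper()
    return seq


genome = {"chr1": "ACGTacgtNNGGCC"}

forward = extract_sequence(genome, "chr1", 2, 8)
reverse = extract_sequence(genome, "chr1", 2, 8, strand="-", use_strand=True)
print(forward)  # GTACGT
print(reverse)  # ACGTAC
```

When use_strand is False, the strand column is ignored and the forward sequence is returned for every interval.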

SeqIntervalDl

SeqIntervalDl(self, intervals_file, fasta_file, num_chr_fasta=False, label_dtype=None, auto_resize_len=None, use_strand=False, alphabet_axis=1, dummy_axis=None, alphabet='ACGT', ignore_targets=False, dtype=None)

Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited intervals_file and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: alphabet_axis and dummy_axis.

Arguments

  • intervals_file: bed3+ file path containing intervals + (optionally) labels
  • fasta_file: Reference genome FASTA file path.
  • num_chr_fasta: if True, the dataloader will make sure that the chromosomes don't start with chr.
  • label_dtype: datatype of the task labels taken from the intervals_file. Example: str, int, float, np.float32. Default: None
  • auto_resize_len: required sequence length. Default: None
  • use_strand: reverse-complement fasta sequence if bed file defines negative strand. Requires a bed6 file
  • alphabet_axis: axis along which the alphabet runs (e.g. A,C,G,T for DNA)
  • dummy_axis: defines in which dimension a dummy axis should be added. None if no dummy axis is required.
  • alphabet: alphabet to use for the one-hot encoding. This defines the order of the one-hot encoding. Can either be a list or a string: 'ACGT' or ['A', 'C', 'G', 'T']. Default: 'ACGT'

  • dtype: defines the numpy dtype of the returned array. Example: int, np.int32, np.float32, float

  • ignore_targets: if True, don't return any target variables

Output schema

  • inputs:
    • shape=(None, 4), One-hot encoded DNA sequence
  • targets:
    • shape=(None,), (optional) values following the bed-entry - chr start end target1 target2 ....
  • metadata:
    • ranges:
      • Genomic ranges: chr, start, end, name, strand
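
How the alphabet argument fixes the column order of the encoding, and what moving the alphabet axis does to the shape, can be sketched in plain Python with nested lists instead of numpy arrays. The function one_hot is a hypothetical illustration, not the library's encoder, and it deliberately rejects characters outside the alphabet because their handling is not described in this section.

```python
def one_hot(seq, alphabet="ACGT"):
    """Encode `seq` as a list of rows, one row per position.

    Hypothetical helper. The result has shape (len(seq), len(alphabet)),
    matching the documented default output shape (None, 4).
    """
    index = {letter: i for i, letter in enumerate(alphabet)}
    encoded = []
    for base in seq:
        if base not in index:
            # Assumption of this sketch: unknown characters are an error.
            raise ValueError(f"character {base!r} is not in the alphabet")
        row = [0] * len(alphabet)
        row[index[base]] = 1
        encoded.append(row)
    return encoded


default_order = one_hot("AC")                   # columns ordered A, C, G, T
custom_order = one_hot("AC", alphabet="TGCA")   # columns ordered T, G, C, A

# Putting the alphabet on axis 0 instead of axis 1 amounts to a transpose,
# giving shape (len(alphabet), len(seq)).
alphabet_first = [list(column) for column in zip(*default_order)]

print(default_order)   # [[1, 0, 0, 0], [0, 1, 0, 0]]
print(custom_order)    # [[0, 0, 0, 1], [0, 0, 1, 0]]
print(alphabet_first)  # [[1, 0], [0, 1], [0, 0], [0, 0]]
```

A dummy axis would add one more dimension of size 1 at the requested position without changing the values.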

AnchoredGTFDl

AnchoredGTFDl(self, gtf_file, fasta_file, num_upstream, num_downstream, gtf_filter='gene_type == "protein_coding"', anchor='tss', transform=one_hot_dna, interval_attrs=['gene_id', 'Strand'], use_strand=True)

Dataloader for a combination of fasta and gtf files. The dataloader extracts fixed-length regions around anchor points. Anchor points are extracted from the gtf based on the anchor parameter. The sequences corresponding to the regions are then extracted from the fasta file and optionally transformed using a function given by the transform parameter.

Arguments

  • gtf_file: Path to a gtf file (str)
  • fasta_file: Reference genome FASTA file path (str)
  • num_upstream: Number of nt by which interval is extended upstream of the anchor point
  • num_downstream: Number of nt by which interval is extended downstream of the anchor point
  • gtf_filter: Filter applied to the gtf before the anchor points are extracted. Can be str, callable or None. If str, it is interpreted as the argument to pandas .query(). If callable, it is interpreted as a function that filters a pandas dataframe and returns the filtered df.

  • anchor: Defines the anchor points. Can be str or callable. If it is a callable, it is treated as a function that takes a pandas dataframe and returns a modified version of the dataframe where each row represents one anchor point, the position of which is stored in the column called anchor_pos. If it is a string, a predefined function is loaded. Currently available are tss (anchor is the start of a gene), start_codon (anchor is the start of the start_codon), stop_codon (anchor is the position right after the stop_codon), polya (anchor is the position right after the end of a gene).

  • transform: Callable (or None) to transform the extracted sequence (e.g. one-hot)

  • interval_attrs: Metadata to extract from the gtf, e.g. ["gene_id", "Strand"]
  • use_strand: True or False

Output schema

  • inputs:
    • shape=(None, 4), sequence of the region around the anchor point, one-hot encoded by the default transform
  • targets:

  • metadata:

    • gene_id:
      • gene_id
    • Strand:
      • Strand
    • ranges:
      • Genomic ranges: chr, start, end, name, strand
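
The relationship between an anchor point, the strand, and the extracted region can be sketched as follows. Both helper functions are hypothetical, and the exact coordinate convention (0-based, half-open, with the anchor as the first downstream position on the positive strand) is an assumption of this sketch; the library may differ by one position at the boundaries.

```python
def tss_anchor(gene_start, gene_end, strand):
    """Return the anchor for the 'tss' case: the start of the gene.

    Hypothetical helper. On the negative strand the gene starts at its
    highest coordinate.
    """
    return gene_end if strand == "-" else gene_start


def anchored_interval(anchor_pos, strand, num_upstream, num_downstream):
    """Return (start, end) of the region around an anchor point.

    Upstream and downstream are relative to the direction of transcription,
    so they swap sides on the negative strand.
    """
    if strand == "-":
        return anchor_pos - num_downstream, anchor_pos + num_upstream
    return anchor_pos - num_upstream, anchor_pos + num_downstream


plus = anchored_interval(tss_anchor(1000, 5000, "+"), "+", 100, 50)
minus = anchored_interval(tss_anchor(1000, 5000, "-"), "-", 100, 50)
print(plus)   # (900, 1050)
print(minus)  # (4950, 5100)
```

Both intervals have the same length, num_upstream + num_downstream, which is what makes the extracted regions fixed-length.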

MMSpliceDl

MMSpliceDl(self, gtf_file, fasta_file, intron5prime_len=100, intron3prime_len=100, transform=None, **kwargs)

Dataloader for splicing models. Given a gtf annotation file and a fasta file as inputs, each output is an exon sequence with flanking intronic sequences. The intronic sequence lengths are specified by the user. Returned sequences are of the type np.array([str])

Arguments

  • gtf_file: file path; Genome annotation GTF file
  • fasta_file: Reference Genome sequence in fasta format
  • intron5prime_len: 5' intronic sequence length to take.
  • intron3prime_len: 3' intronic sequence length to take.
  • transform: transformation operation applied to the returned sequence. It needs to take seq, intron5prime_len and intron3prime_len as arguments.

Output schema

  • inputs:
    • shape=(), exon sequence with flanking intronic sequence
  • targets:

  • metadata:

    • geneID:
      • geneID
    • transcriptID:
      • transcriptID
    • ranges:
      • Genomic ranges: chr, start, end, name, strand
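
A transform for this dataloader needs to accept seq, intron5prime_len and intron3prime_len. A minimal sketch of such a callable is shown below. The function split_exon_flanks is hypothetical, and it assumes the returned string is laid out as 5' intronic flank, then exon, then 3' intronic flank, which the description above implies but does not spell out.

```python
def split_exon_flanks(seq, intron5prime_len, intron3prime_len):
    """Split an exon-with-flanks string into its three parts.

    Hypothetical transform. Assumes the layout
    5' flank + exon + 3' flank.
    """
    exon_end = len(seq) - intron3prime_len
    return {
        "intron5prime": seq[:intron5prime_len],
        "exon": seq[intron5prime_len:exon_end],
        "intron3prime": seq[exon_end:],
    }


# Toy sequence: 2 nt of 5' intron, a 4 nt exon, 3 nt of 3' intron.
parts = split_exon_flanks("aaGGGGttt", intron5prime_len=2, intron3prime_len=3)
print(parts)
# {'intron5prime': 'aa', 'exon': 'GGGG', 'intron3prime': 'ttt'}
```

Because the function takes the three documented arguments, a callable of this shape could plausibly be passed as transform, though that usage is not confirmed by this section.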