one_hot2string

one_hot2string(arr, alphabet=('A', 'C', 'G', 'T'))

Convert a one-hot encoded array back to string
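
The decoding step can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name decode_one_hot is hypothetical, it works on nested lists rather than arrays, and decoding rows without a single 1 (for example all 0.25) to 'N' is an assumption.

```python
def decode_one_hot(rows, alphabet=("A", "C", "G", "T"), neutral="N"):
    """Hypothetical helper: map one-hot rows back to characters."""
    out = []
    for row in rows:
        # Index of the largest entry in this row
        best = max(range(len(row)), key=lambda i: row[i])
        # Assumption: a row whose maximum is not 1 decodes to `neutral`
        out.append(alphabet[best] if row[best] == 1 else neutral)
    return "".join(out)


decoded = decode_one_hot([[1, 0, 0, 0], [0, 0, 1, 0], [0.25] * 4])
```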

rc_dna

rc_dna(seq)

Reverse complement the DNA sequence

assert rc_dna("TATCG") == "CGATA"
assert rc_dna("tatcg") == "cgata"
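
For illustration, reverse complementing can be sketched with a translation table. The helper name reverse_complement_dna is hypothetical, and the sketch assumes only A, C, G, T and N (upper or lower case) need to be handled; it is not necessarily how rc_dna is implemented.

```python
# Assumption: only A/C/G/T/N in either case occur in the input
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_dna(seq):
    """Hypothetical helper: complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]
```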

rc_rna

rc_rna(seq)

Reverse complement the RNA sequence

assert rc_rna("UAUCG") == "CGAUA"

tokenize

tokenize(seq, alphabet=('A', 'C', 'G', 'T'), neutral_alphabet=['N'])

Convert sequence to integers

Arguments

  • seq: Sequence to encode
  • alphabet: Alphabet to use
  • neutral_alphabet: Neutral alphabet; characters in it are encoded as -1

Returns

List of length len(seq) with integers from -1 to len(alphabet) - 1
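
A minimal sketch of the mapping described above. The name tokenize_sketch is hypothetical, and raising KeyError for characters found in neither alphabet is an assumption, since that case is not specified here.

```python
def tokenize_sketch(seq, alphabet=("A", "C", "G", "T"), neutral_alphabet=("N",)):
    """Hypothetical helper: map each character to its index in `alphabet`."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    # Neutral characters are all encoded as -1
    for ch in neutral_alphabet:
        index[ch] = -1
    # Assumption: a character in neither alphabet raises KeyError
    return [index[ch] for ch in seq]


tokens = tokenize_sketch("ACGTN")
```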

token2one_hot

token2one_hot(tokens, alphabet_size=4, neutral_value=0.25, dtype=None)

Note: everything outside the alphabet is transformed into np.zeros(alphabet_size)

one_hot_dna

one_hot_dna(seq:str, alphabet:list=('A', 'C', 'G', 'T'), neutral_alphabet:str='N', neutral_value:Any=0.25, dtype=<class 'numpy.float32'>) -> numpy.ndarray

One-hot encode sequence.
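
The encoding can be sketched in plain Python as below. This is an illustration under stated assumptions, not the library's code: one_hot_sketch is a hypothetical name, it returns nested lists instead of a numpy.ndarray, neutral characters are assumed to become rows filled with neutral_value, and characters in neither alphabet are assumed to become all-zero rows.

```python
def one_hot_sketch(seq, alphabet=("A", "C", "G", "T"),
                   neutral_alphabet="N", neutral_value=0.25):
    """Hypothetical helper: one row per character, one column per letter."""
    rows = []
    for ch in seq:
        if ch in neutral_alphabet:
            # Assumption: neutral characters spread `neutral_value` over all columns
            rows.append([neutral_value] * len(alphabet))
        else:
            # Characters outside `alphabet` yield an all-zero row (assumption)
            rows.append([1.0 if ch == a else 0.0 for a in alphabet])
    return rows


encoded = one_hot_sketch("AN")
```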

fixed_len

fixed_len(seq, length, anchor='center', value='N')

Pad and/or trim a sequence to a fixed length. Procedure:

  1. Pad the sequence with N's or any other string or list element (value)
  2. Trim (subset) the sequence to the requested length

Note

See also: https://keras.io/preprocessing/sequence/
Also applicable to lists of characters.

Arguments

  • seq: list of chars or lists; sequence(s) that can have various lengths
  • value: Neutral element to pad the sequence with. Can be str or list.
  • length: int or None; final length of sequences. If None, length is set to the longest sequence length.
  • anchor: character; 'start', 'end' or 'center'. To which end to anchor the sequences when trimming/padding. See examples below.

Returns

Sequence(s) of the same class as seq

Example

    >>> sequence = 'CTTACTCAGA'
    >>> fixed_len(sequence, 10, anchor="start", value="N")
    'CTTACTCAGA'
    >>> fixed_len(sequence, 10, anchor="end", value="N")
    'CTTACTCAGA'
    >>> fixed_len(sequence, 4, anchor="center", value="N")
    'ACTC'

    >>> sequence = 'TCTTTA'
    >>> fixed_len(sequence, 10, anchor="start", value="N")
    'TCTTTANNNN'
    >>> fixed_len(sequence, 10, anchor="end", value="N")
    'NNNNTCTTTA'
    >>> fixed_len(sequence, 4, anchor="center", value="N")
    'CTTT'
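
The pad-then-trim behaviour shown in these examples can be sketched for plain strings as follows. The name fixed_len_sketch is hypothetical and the sketch is not the library's implementation; in particular, how an odd number of padded or trimmed characters is split for anchor='center' is an assumption chosen to reproduce the examples above.

```python
def fixed_len_sketch(seq, length, anchor="center", value="N"):
    """Hypothetical helper: pad or trim a string to exactly `length`."""
    if anchor not in ("start", "center", "end"):
        raise ValueError("anchor must be 'start', 'center' or 'end'")
    extra = length - len(seq)
    if extra > 0:
        # Padding: decide how many `value` characters go on the left
        left = {"start": 0, "end": extra, "center": extra // 2}[anchor]
        return value * left + seq + value * (extra - left)
    cut = -extra
    # Trimming: decide where the kept window begins
    # Assumption: for 'center', an odd surplus is removed from the left first
    begin = {"start": 0, "end": cut, "center": cut // 2 + cut % 2}[anchor]
    return seq[begin:begin + length]
```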

resize_interval

resize_interval(interval, width, anchor='center')

Resize the Interval. Returns new Interval instance with correct length.

Arguments

  • interval: pybedtools.Interval object or an object containing start and end attributes
  • width: desired width of the output interval
  • anchor (str): which part of the sequence should be anchored. Choices: 'start', 'center', or 'end'
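
A rough sketch of the resizing arithmetic, using a small stand-in class instead of pybedtools.Interval. SimpleInterval and resize_sketch are hypothetical names; flooring the offset for anchor='center' and ignoring strand are assumptions, not documented behaviour.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleInterval:
    """Hypothetical stand-in for any object with start and end attributes."""
    start: int
    end: int


def resize_sketch(interval, width, anchor="center"):
    """Return a new interval of exactly `width`, keeping the anchor fixed."""
    if anchor == "start":
        start = interval.start
    elif anchor == "end":
        start = interval.end - width
    elif anchor == "center":
        # Assumption: floor the offset when the size difference is odd
        start = interval.start + (interval.end - interval.start - width) // 2
    else:
        raise ValueError("anchor must be 'start', 'center' or 'end'")
    return replace(interval, start=start, end=start + width)


resized = resize_sketch(SimpleInterval(100, 200), 50)
```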

translate

translate(seq:str, hg38=False)

Translate the DNA/RNA sequence into an amino-acid (AA) sequence.

Note: translation stops when it encounters a stop codon

Arguments

  • seq: DNA/RNA sequence
  • stop_none: return None if a stop codon is encountered
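
As an illustration of translation that halts at a stop codon, here is a minimal sketch using the standard genetic code. The names CODON_TABLE and translate_sketch are hypothetical; the sketch does not model the hg38 or stop_none options, and ignoring trailing bases that do not form a full codon is an assumption.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order
_BASES = "TCAG"
_AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
          "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def translate_sketch(seq):
    """Hypothetical helper: translate DNA/RNA codon by codon."""
    # Treat RNA as DNA by mapping U to T
    seq = seq.upper().replace("U", "T")
    protein = []
    # Assumption: an incomplete trailing codon is ignored
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break  # stop codon reached: stop translating
        protein.append(aa)
    return "".join(protein)
```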