one_hot2string
one_hot2string(arr, alphabet=('A', 'C', 'G', 'T'))
Convert a one-hot encoded array back to string
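The decoding step can be illustrated with a minimal sketch. It is not the library's implementation: `one_hot2string_sketch` is a hypothetical name, the input is assumed to be a single 2-D array of shape (length, alphabet_size), and mapping rows that are not strictly one-hot to 'N' is an assumption made for the illustration.

```python
import numpy as np


def one_hot2string_sketch(arr, alphabet=("A", "C", "G", "T")):
    """Decode a (length, alphabet_size) one-hot array (hypothetical helper)."""
    arr = np.asarray(arr)
    letters = np.array(alphabet)
    # argmax finds the column with the largest value in each row
    idx = arr.argmax(axis=-1)
    # Assumption: rows that are not strictly one-hot (all zeros, all 0.25, ...)
    # are decoded as the neutral character 'N'
    is_one_hot = (arr.max(axis=-1) == 1) & (arr.sum(axis=-1) == 1)
    return "".join(np.where(is_one_hot, letters[idx], "N"))
```

For example, the rows `[1, 0, 0, 0]`, `[0, 0, 0, 1]` and `[0.25, 0.25, 0.25, 0.25]` decode to `"ATN"` under these assumptions.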
rc_dna
rc_dna(seq)
Reverse complement the DNA sequence
assert rc_dna("TATCG") == "CGATA"
assert rc_dna("tatcg") == "cgata"
rc_rna
rc_rna(seq)
Reverse complement the RNA sequence
assert rc_rna("UAUCG") == "CGAUA"
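Both reverse-complement helpers follow the same idea: complement each base, then reverse the string. A minimal sketch, assuming a case-preserving mapping and 'N' as its own complement (`reverse_complement_sketch` and its `rna` flag are hypothetical and not part of the documented API):

```python
def reverse_complement_sketch(seq, rna=False):
    """Reverse complement a DNA (default) or RNA string (hypothetical helper)."""
    if rna:
        # RNA pairs A with U instead of T
        table = str.maketrans("ACGUNacgun", "UGCANugcan")
    else:
        table = str.maketrans("ACGTNacgtn", "TGCANtgcan")
    # Complement every character, then reverse the result.
    # Characters missing from the table pass through unchanged.
    return seq.translate(table)[::-1]
```

Using `str.translate` keeps the mapping in one place and handles upper and lower case in a single pass.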
tokenize
tokenize(seq, alphabet=('A', 'C', 'G', 'T'), neutral_alphabet=['N'])
Convert sequence to integers
Arguments
- seq: Sequence to encode
- alphabet: Alphabet to use
- neutral_alphabet: Neutral characters; each of these is assigned the token -1
Returns
List of length len(seq) with integers from -1 to len(alphabet) - 1
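The mapping described above can be sketched as a dictionary lookup. This is an illustration only: `tokenize_sketch` is a hypothetical name, single-character alphabet entries are assumed, and raising `KeyError` for characters found in neither alphabet is an assumption about error handling.

```python
def tokenize_sketch(seq, alphabet=("A", "C", "G", "T"), neutral_alphabet=("N",)):
    """Map each character of seq to an integer token (hypothetical helper)."""
    # Alphabet letters get their position: A -> 0, C -> 1, G -> 2, T -> 3
    index = {letter: i for i, letter in enumerate(alphabet)}
    # Neutral letters all share the token -1
    index.update({letter: -1 for letter in neutral_alphabet})
    # Assumption: a character in neither alphabet raises KeyError
    return [index[ch] for ch in seq]
```

With the defaults, `tokenize_sketch("ACGTN")` gives `[0, 1, 2, 3, -1]`.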
token2one_hot
token2one_hot(tokens, alphabet_size=4, neutral_value=0.25, dtype=None)
Note: everything out of the alphabet is transformed into np.zeros(alphabet_size)
one_hot_dna
one_hot_dna(seq:str, alphabet:list=('A', 'C', 'G', 'T'), neutral_alphabet:str='N', neutral_value:Any=0.25, dtype=<class 'numpy.float32'>) -> numpy.ndarray
One-hot encode sequence.
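One-hot encoding can be viewed as tokenizing the sequence and then writing a 1 into the column given by each token. The sketch below combines both steps. It is not the library's code: `one_hot_sketch` is a hypothetical name, and treating every character outside the alphabet as neutral is a simplifying assumption. Setting `neutral_value=0` yields all-zero rows for those positions.

```python
import numpy as np


def one_hot_sketch(seq, alphabet=("A", "C", "G", "T"),
                   neutral_value=0.25, dtype=np.float32):
    """One-hot encode seq into a (len(seq), len(alphabet)) array (hypothetical)."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    # Assumption: any character outside the alphabet becomes the token -1
    tokens = np.array([index.get(ch, -1) for ch in seq], dtype=int)
    out = np.zeros((len(tokens), len(alphabet)), dtype=dtype)
    known = tokens >= 0
    # Set a single 1 per known position, in the column given by its token
    out[np.flatnonzero(known), tokens[known]] = 1
    # Neutral positions get neutral_value in every column
    out[~known] = neutral_value
    return out
```

For `"AN"` with the defaults, the first row is `[1, 0, 0, 0]` and the second row is `[0.25, 0.25, 0.25, 0.25]`.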
fixed_len
fixed_len(seq, length, anchor='center', value='N')
Pad and/or trim a sequence to a fixed length. Procedure:
- Pad the sequence with N's or any other string or list element (value)
- Subset the sequence
Note
See also: https://keras.io/preprocessing/sequence/
Also applicable to lists of characters.
Arguments
- seq: str or list of characters; the sequence to pad or trim
- value: Neutral element to pad the sequence with. Can be str or list.
- length: int or None; final length of the sequence. If None, length is set to the longest sequence length.
- anchor: str; 'start', 'end' or 'center'. To which end to anchor the sequence when trimming/padding. See examples below.
Returns
Sequence of the same class as seq
Example
>>> sequence = 'CTTACTCAGA'
>>> fixed_len(sequence, 10, anchor="start", value="N")
'CTTACTCAGA'
>>> fixed_len(sequence, 10, anchor="end", value="N")
'CTTACTCAGA'
>>> fixed_len(sequence, 4, anchor="center", value="N")
'ACTC'
>>> sequence = 'TCTTTA'
>>> fixed_len(sequence, 10, anchor="start", value="N")
'TCTTTANNNN'
>>> fixed_len(sequence, 10, anchor="end", value="N")
'NNNNTCTTTA'
>>> fixed_len(sequence, 4, anchor="center", value="N")
'CTTT'
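The pad-then-subset procedure can be sketched as follows. This is a minimal illustration, not the library's implementation: `fixed_len_sketch` is a hypothetical name, `value` is assumed to be a single character when `seq` is a string, and rounding the left share down for anchor='center' is an assumption chosen to match the examples above.

```python
def fixed_len_sketch(seq, length, anchor="center", value="N"):
    """Pad or trim seq (str or list) to exactly `length` items (hypothetical)."""
    if anchor not in ("start", "center", "end"):
        raise ValueError("anchor must be 'start', 'center' or 'end'")
    n = abs(length - len(seq))  # number of items to add or remove
    # How many of those n items fall on the left side of the sequence
    left = {"start": 0, "end": n, "center": n // 2}[anchor]
    if len(seq) < length:
        # Pad: strings are padded with characters, lists with list elements
        pad = value if isinstance(seq, str) else [value]
        return pad * left + seq + pad * (n - left)
    # Trim (or return unchanged when n == 0)
    return seq[left:left + length]
```

Anchoring at 'start' keeps the left edge fixed, so padding is appended and trimming removes items from the right; 'end' does the opposite.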
resize_interval
resize_interval(interval, width, anchor='center')
Resize the Interval. Returns new Interval instance with correct length.
Arguments
- interval: pybedtools.Interval object or an object containing start and end attributes
- width: desired width of the output interval
- anchor (str): which part of the sequence should be anchored. Choices: 'start', 'center', or 'end'
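The resizing arithmetic can be sketched with a small stand-in class. `SimpleInterval` and `resize_interval_sketch` are hypothetical names used only for illustration. Strand is ignored, and the midpoint is rounded down for anchor='center'; both are assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SimpleInterval:
    """Hypothetical stand-in for any object with start and end attributes."""
    chrom: str
    start: int
    end: int


def resize_interval_sketch(interval, width, anchor="center"):
    """Return a new interval of the given width (hypothetical helper)."""
    if anchor == "start":
        start = interval.start
    elif anchor == "end":
        start = interval.end - width
    elif anchor == "center":
        # Assumption: integer midpoint, rounded down
        center = (interval.start + interval.end) // 2
        start = center - width // 2
    else:
        raise ValueError("anchor must be 'start', 'center' or 'end'")
    # The input is left untouched; a modified copy is returned
    return replace(interval, start=start, end=start + width)
```

For an interval spanning 100 to 200, a width of 50 gives 125 to 175 with anchor='center', 100 to 150 with 'start', and 150 to 200 with 'end'.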
translate
translate(seq:str, hg38=False)
Translate the DNA/RNA sequence into AA.
Note: it stops after it encounters a stop codon
Arguments
- seq: DNA/RNA sequence
- stop_none: return None if a stop codon is encountered
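Translation with the stop-codon behaviour noted above can be sketched using the standard genetic code. This is an illustration under stated assumptions, not the library's implementation: `translate_sketch` is a hypothetical name, the `hg38` option is not modelled, trailing bases that do not form a full codon are ignored, and the stop codon itself is left out of the output.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_sketch(seq):
    """Translate DNA/RNA into amino acids up to the first stop codon."""
    # Treat RNA as DNA by mapping U to T
    seq = seq.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; leftover bases are ignored
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":  # stop codon: end translation here
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate_sketch("ATGGCCTAAGGG")` returns `"MA"`, because translation ends at the `TAA` stop codon.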