Basset
Authors: David R. Kelley, Jasper Snoek, John L. Rinn
License: MIT
Type: pytorch
Postprocessing: variant_effects
Trained on: Of 2,071,886 total sites, 71,886 were randomly reserved for testing and 70,000 for validation, leaving 1,930,000 for training.
This is the Basset model published by David Kelley, converted to PyTorch by Roman Kreuzhuber. It predicts the probability that a genomic region is accessible in each of 164 cell types (ENCODE project and Roadmap Epigenomics Consortium). The data was generated using DNase-seq. The model takes 600 bp sequences as input. The input tensor must have shape (N, 4, 600, 1) for N samples, 4 nucleotides and a 600 bp window. For each sample, the model predicts 164 probabilities of accessible chromatin.
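The input shape described above can be illustrated with a minimal one-hot encoding sketch. This is not the dataloader's own implementation. The channel order `ACGT`, the all-zero encoding of unknown bases, and the helper name `one_hot_batch` are assumptions made for illustration. In practice the default dataloader produces the encoded input for you.

```python
import numpy as np

# Assumed channel order; check the model's dataloader before relying on it.
ALPHABET = "ACGT"
SEQ_LEN = 600  # input window size stated in the model description


def one_hot_batch(seqs):
    """Encode DNA strings of length 600 as a float32 array of shape (N, 4, 600, 1)."""
    batch = np.zeros((len(seqs), len(ALPHABET), SEQ_LEN, 1), dtype=np.float32)
    for n, seq in enumerate(seqs):
        if len(seq) != SEQ_LEN:
            raise ValueError(f"expected {SEQ_LEN} bp, got {len(seq)}")
        for pos, base in enumerate(seq.upper()):
            idx = ALPHABET.find(base)
            if idx >= 0:  # bases outside ACGT (e.g. N) stay all-zero (assumption)
                batch[n, idx, pos, 0] = 1.0
    return batch


seqs = ["ACGT" * 150, "N" * 600]
x = one_hot_batch(seqs)
print(x.shape)  # (2, 4, 600, 1)
```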
kipoi env create Basset
source activate kipoi-Basset
kipoi env install Basset
kipoi test Basset --source=kipoi
kipoi get-example Basset -o example
kipoi predict Basset \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
-o '/tmp/Basset.example_pred.tsv'
# check the results
head '/tmp/Basset.example_pred.tsv'
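The prediction file can also be inspected from Python. The sketch below uses only the standard library and assumes the file is tab-delimited with a header on its first line. No column names are assumed, and `read_tsv` is a hypothetical helper, not part of kipoi.

```python
import csv
import os


def read_tsv(path):
    """Return (header, rows) from a tab-delimited file whose first line is a header."""
    with open(path, newline="") as fh:
        reader = csv.reader(fh, delimiter="\t")
        header = next(reader)  # assumed: the first line holds column names
        rows = [row for row in reader]
    return header, rows


path = "/tmp/Basset.example_pred.tsv"  # output path used in the command above
if os.path.exists(path):
    header, rows = read_tsv(path)
    print(len(rows), "rows,", len(header), "columns")
```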
import kipoi
model = kipoi.get_model('Basset')
pred = model.pipeline.predict_example()
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
it = dl.batch_iter(batch_size=4)
# predict for a batch
batch = next(it)
model.predict_on_batch(batch['inputs'])
pred = model.pipeline.predict(dl_kwargs, batch_size=4)
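The snippet below is a rough sketch of how the returned predictions might be summarized, assuming `pred` is a NumPy array of shape (N, 164) with one probability per cell type. A random array stands in for real model output. The 0.5 threshold and the helper name `summarize` are illustrative choices, not part of kipoi.

```python
import numpy as np

N_TASKS = 164  # one probability per cell type, as stated in the model description

# Stand-in for `pred`: random values in [0, 1) with the assumed (N, 164) shape.
rng = np.random.default_rng(0)
pred = rng.random((4, N_TASKS))


def summarize(pred, threshold=0.5, top_k=3):
    """Count tasks at or above `threshold` and list the `top_k` highest-scoring task indices per sample."""
    pred = np.asarray(pred)
    n_accessible = (pred >= threshold).sum(axis=1)
    top_tasks = np.argsort(-pred, axis=1)[:, :top_k]
    return n_accessible, top_tasks


n_accessible, top_tasks = summarize(pred)
print(n_accessible.shape, top_tasks.shape)  # (4,) (4, 3)
```

The task indices are column positions only. Mapping them to cell-type names requires the model's own task list.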
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Basset')
predictions <- model$pipeline$predict_example()
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=4L)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
pred <- model$pipeline$predict(dl_kwargs, batch_size=4L)
Defined as: kipoiseq.dataloaders.SeqIntervalDl
Doc: Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited `intervals_file` and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: `alphabet_axis` and `dummy_axis`.
Authors: Ziga Avsec, Roman Kreuzhuber
Type: Dataset
License: MIT
Arguments
intervals_file : bed3+<columns> file path containing intervals + (optionally) labels
fasta_file : Reference genome FASTA file path.
num_chr_fasta (optional): if True, the dataloader will make sure that the chromosome names don't start with chr.
label_dtype (optional): datatype of the task labels taken from the intervals_file (default: None). Example: str, int, float, np.float32
ignore_targets (optional): if True, don't return any target variables
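As a sketch of preparing the `intervals_file` argument, the code below builds 600 bp BED3 intervals centered on sites of interest and writes them as a tab-delimited file. It assumes the intervals should match the model's 600 bp window and use BED's 0-based, half-open coordinates. The helper names and the example sites are hypothetical.

```python
import csv

WINDOW = 600  # sequence length the model expects


def centered_intervals(sites, window=WINDOW):
    """Turn (chrom, center) pairs into (chrom, start, end) BED intervals of `window` bp."""
    half = window // 2
    intervals = []
    for chrom, center in sites:
        start = center - half
        if start < 0:
            continue  # skip windows that would run past the chromosome start
        intervals.append((chrom, start, start + window))
    return intervals


def write_bed3(intervals, path):
    """Write intervals as a tab-delimited BED3 file without a header."""
    with open(path, "w", newline="") as fh:
        csv.writer(fh, delimiter="\t", lineterminator="\n").writerows(intervals)


intervals = centered_intervals([("chr1", 10_000), ("chr2", 100)])
print(intervals)  # [('chr1', 9700, 10300)]
```

Calling `write_bed3(intervals, "my_intervals.bed")` would produce a file whose path can be passed as `intervals_file`. The sketch does not check windows against chromosome ends, which would require the chromosome lengths from the FASTA index.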
- python=3.5
- h5py
- pytorch::pytorch-cpu>=0.2.0
- kipoiseq
- bioconda::pybedtools
- bioconda::pyfaidx
- numpy
- pandas
- kipoiseq