Basenji
License: Apache License v2
Sequential regulatory activity predictions with deep convolutional neural networks.
GitHub link: https://github.com/calico/basenji
Abstract: Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine learning system to predict cell type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. Using convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.
kipoi env create Basenji
source activate kipoi-Basenji
kipoi test Basenji --batch_size=2 --source=kipoi
kipoi get-example Basenji -o example
kipoi predict Basenji \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
--batch_size=2 -o '/tmp/Basenji.example_pred.tsv'
# check the results
head '/tmp/Basenji.example_pred.tsv'
kipoi env create Basenji
source activate kipoi-Basenji
import kipoi
model = kipoi.get_model('Basenji')
pred = model.pipeline.predict_example(batch_size=2)
# Download example dataloader kwargs
dl_kwargs = model.default_dataloader.download_example('example')
# Get the dataloader and instantiate it
dl = model.default_dataloader(**dl_kwargs)
# get a batch iterator
batch_iterator = dl.batch_iter(batch_size=2)
for batch in batch_iterator:
    # predict for a batch
    batch_pred = model.predict_on_batch(batch['inputs'])
pred = model.pipeline.predict(dl_kwargs, batch_size=2)
library(reticulate)
kipoi <- import('kipoi')
model <- kipoi$get_model('Basenji')
predictions <- model$pipeline$predict_example()
# Download example dataloader kwargs
dl_kwargs <- model$default_dataloader$download_example('example')
# Get the dataloader, passing the example kwargs as named arguments
dl <- do.call(model$default_dataloader, dl_kwargs)
# get a batch iterator
it <- dl$batch_iter(batch_size=2)
# predict for a batch
batch <- iter_next(it)
model$predict_on_batch(batch$inputs)
pred <- model$pipeline$predict(dl_kwargs, batch_size=2)
docker pull kipoi/kipoi-docker:sharedpy3keras2tf1-slim
docker pull kipoi/kipoi-docker:sharedpy3keras2tf1
docker run -it kipoi/kipoi-docker:sharedpy3keras2tf1-slim
docker run kipoi/kipoi-docker:sharedpy3keras2tf1-slim kipoi test Basenji --batch_size=2 --source=kipoi
# Create an example directory containing the data
mkdir -p $PWD/kipoi-example
# You can replace $PWD/kipoi-example with a different absolute path containing the data
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi get-example Basenji -o /app/example
docker run -v $PWD/kipoi-example:/app/ kipoi/kipoi-docker:sharedpy3keras2tf1-slim \
kipoi predict Basenji \
--dataloader_args='{"intervals_file": "/app/example/intervals_file", "fasta_file": "/app/example/fasta_file"}' \
--batch_size=2 -o '/app/Basenji.example_pred.tsv'
# check the results
head $PWD/kipoi-example/Basenji.example_pred.tsv
Install Apptainer (formerly Singularity): https://apptainer.org/docs/user/main/quick_start.html#quick-installation-steps
kipoi get-example Basenji -o example
kipoi predict Basenji \
--dataloader_args='{"intervals_file": "example/intervals_file", "fasta_file": "example/fasta_file"}' \
--batch_size=2 -o 'Basenji.example_pred.tsv' \
--singularity
# check the results
head Basenji.example_pred.tsv
Inputs
Single numpy array
Name: seq
Doc:
* one-hot encoded DNA sequence
* 4096bp starting flank sequence
* 122880bp core sequence (960 * 128), predicted by the model in 128bp bins
* 4096bp end flank sequence
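The lengths in the input description add up to a single fixed sequence length, which the short sketch below works through. The constant names and the helper `bin_to_input_range` are hypothetical, introduced only for illustration. The sketch also assumes that the 960 bins tile the core region contiguously and in order, which the description suggests but does not state outright.

```python
# Input layout taken from the Doc above (all lengths in bp).
FLANK = 4096      # flank sequence on each side of the core
BIN_SIZE = 128    # width of one prediction bin
NUM_BINS = 960    # number of bins covering the core

CORE = NUM_BINS * BIN_SIZE        # length of the core sequence
SEQ_LEN = FLANK + CORE + FLANK    # full length of one input sequence


def bin_to_input_range(bin_index):
    """Return the half-open [start, end) offsets within the input
    sequence covered by prediction bin `bin_index` (0-based).

    Assumption: bins tile the core contiguously, in order.
    """
    if not 0 <= bin_index < NUM_BINS:
        raise IndexError("bin_index out of range")
    start = FLANK + bin_index * BIN_SIZE
    return start, start + BIN_SIZE


print(CORE, SEQ_LEN)                      # 122880 131072
print(bin_to_input_range(0))              # (4096, 4224)
print(bin_to_input_range(NUM_BINS - 1))   # (126848, 126976)
```

Under these assumptions, each interval passed to the model would need to span 131072bp, and the last bin ends exactly where the end flank begins.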
Defined as: kipoiseq.dataloaders.SeqIntervalDl
Doc: Dataloader for a combination of fasta and tab-delimited input files such as bed files. The dataloader extracts regions from the fasta file as defined in the tab-delimited `intervals_file` and converts them into one-hot encoded format. Returned sequences are of the type np.array with the shape inferred from the arguments: `alphabet_axis` and `dummy_axis`.
Authors: Ziga Avsec, Roman Kreuzhuber
Type: Dataset
License: MIT
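As an illustration of the tab-delimited `intervals_file` described above, the following sketch tiles a region into fixed-width windows and writes them as BED3 lines. The helpers `tile_region` and `write_bed3` are hypothetical and not part of kipoiseq. The window width of 131072bp is an assumption derived from the input description (4096 + 960 * 128 + 4096), and coordinates follow the usual BED convention of 0-based, half-open intervals.

```python
import io

# Assumed input width: 4096 + 960 * 128 + 4096 bp (from the input description).
SEQ_LEN = 131072


def tile_region(chrom, region_start, region_end, width=SEQ_LEN):
    """Yield non-overlapping (chrom, start, end) windows of exactly
    `width` bp that fit entirely inside [region_start, region_end)."""
    start = region_start
    while start + width <= region_end:
        yield chrom, start, start + width
        start += width


def write_bed3(handle, intervals):
    """Write intervals as tab-delimited BED3 lines to an open text handle."""
    for chrom, start, end in intervals:
        handle.write(f"{chrom}\t{start}\t{end}\n")


# Demo: write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_bed3(buffer, tile_region("chr1", 0, 300000))
print(buffer.getvalue(), end="")
```

For a 300000bp region this produces two windows; the remainder that is shorter than one window is dropped rather than padded, which is a choice of this sketch.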
Arguments
intervals_file : bed3+<columns> file path containing intervals + (optionally) labels
fasta_file : Reference genome FASTA file path.
num_chr_fasta (optional): if True, the dataloader will make sure that the chromosome names don't start with chr.
use_strand (optional): reverse-complement the fasta sequence if the bed file defines a negative strand. Requires a bed6 file.
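The arguments above can be pictured with a small conceptual sketch of interval extraction and one-hot encoding. This is not kipoiseq's implementation: every function name here is hypothetical, the genome is a plain dictionary standing in for a FASTA file, the `ACGT` column order is assumed, and encoding unknown bases as all zeros is a choice of this sketch only.

```python
ALPHABET = "ACGT"  # assumed column order of the one-hot encoding
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def strip_chr(chrom):
    """Drop a leading 'chr' so that 'chr1' matches a FASTA record named '1'."""
    return chrom[3:] if chrom.startswith("chr") else chrom


def reverse_complement(seq):
    """Reverse the sequence and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def one_hot(seq):
    """Encode a sequence as a list of rows, one row of len(ALPHABET) per base.
    Bases outside ALPHABET (e.g. N) become all-zero rows in this sketch."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET]
            for base in seq.upper()]


def encode_interval(genome, chrom, start, end, strand="+", num_chr=False):
    """Extract [start, end) from `genome` (a dict of name -> sequence),
    optionally strip 'chr' and reverse-complement, then one-hot encode."""
    if num_chr:
        chrom = strip_chr(chrom)
    seq = genome[chrom][start:end]
    if strand == "-":
        seq = reverse_complement(seq)
    return one_hot(seq)


# Toy genome whose record is named '1' rather than 'chr1'.
genome = {"1": "AACGN"}
encoded = encode_interval(genome, "chr1", 0, 4, strand="-", num_chr=True)
for row in encoded:
    print(row)
```

Here the extracted sequence `AACG` on the negative strand becomes `CGTT` before encoding, so the rows correspond to C, G, T and T.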
- python=3.7
- pip=20.2.4
- pysam=0.16.0.1
- cython=0.29.23
- tensorflow<2
- kipoiseq
- protobuf==3.20
- bioconda::pybedtools
- bioconda::pyfaidx
- bioconda::pyranges
- numpy
- pandas
- kipoiseq