Generated from notebooks/contributing_models.ipynb
Contributing a model to the Kipoi model repository
This notebook will show you how to contribute a model to the Kipoi model repository. For a simple 'model contribution checklist' see also http://kipoi.org/docs/contributing/01_Getting_started.
Kipoi basics
Contributing a model to Kipoi means adding a sub-folder with all the required files to the Kipoi model repository via a pull request.
The two main components of a model contribution are the model and the dataloader.
Model
The model takes numpy arrays as input and outputs numpy arrays. In practice, a model needs to implement the predict_on_batch(x) method, where x is a dictionary/list of numpy arrays. The model contributor needs to provide one of the following:
- a serialized Keras model
- a serialized Sklearn model
- a custom model inheriting from kipoi.model.BaseModel - all the required files (e.g. the weights) need to be loaded in its __init__
See http://kipoi.org/docs/contributing/02_Writing_model.yaml/ and http://kipoi.org/docs/contributing/05_Writing_model.py/ for more info.
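For illustration only, a minimal custom-model sketch (not part of this tutorial; the class name, weights format and prediction rule are hypothetical) could look roughly like this:

import numpy as np
from kipoi.model import BaseModel

class MyLinearModel(BaseModel):
    def __init__(self, weights_file):
        # everything required for prediction (here a weight matrix) is loaded in __init__
        self.weights = np.load(weights_file)

    def predict_on_batch(self, x):
        # x: numpy array of shape (batch, n_features); returns a numpy array
        return x @ self.weights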
Dataloader
The dataloader takes raw file paths or other parameters as arguments and outputs modelling-ready numpy arrays.
Before writing your own dataloader, take a look at our kipoiseq repository to see whether your use-case is covered by the available dataloaders.
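For example, many sequence-based models in the repository do not ship their own dataloader at all but point the default_dataloader field of model.yaml at a kipoiseq dataloader. A rough sketch, assuming kipoiseq.dataloaders.SeqIntervalDl and an illustrative default_args value (check the kipoiseq docs for the exact arguments):

default_dataloader:
    defined_as: kipoiseq.dataloaders.SeqIntervalDl
    default_args:
        auto_resize_len: 1000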
Writing your own dataloader
Technically, dataloading can be done through a generator (batch-by-batch or sample-by-sample) or by returning the whole dataset at once. The goal is to work directly with raw files (say fasta, bed, vcf, etc. in bioinformatics), as this allows making model predictions on new datasets without the burden of running custom pre-processing scripts. The model contributor needs to implement one of the following (a minimal generator-style sketch is shown after the list):
- PreloadedDataset
- Dataset
- BatchDataset
- SampleIterator
- BatchIterator
- SampleGenerator
- BatchGenerator
See http://kipoi.org/docs/contributing/04_Writing_dataloader.py/ for more info.
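As a minimal sketch of the generator-based style mentioned above (the function name and csv layout are hypothetical; it would be referenced from dataloader.yaml with type: SampleGenerator), a dataloader can simply be a Python generator that yields one sample dict at a time:

import pandas as pd

def my_sample_generator(features_file):
    # hypothetical example: yield one sample dict per csv row
    features = pd.read_csv(features_file)
    for idx, row in features.iterrows():
        yield {
            "inputs": {"features": row.values},
            "metadata": {"example_row_number": idx},
        }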
Folder layout
Here is an example folder structure of a Kipoi model:
MyModel
├── dataloader.py # implements the dataloader (only necessary if you wrote your own dataloader)
├── dataloader.yaml # describes the dataloader (only necessary if you wrote your own dataloader)
└── model.yaml # describes the model
The model.yaml and dataloader.yaml files provide a complete description of the model, the dataloader and the files they depend on.
Contributing a simple Iris-classifier
Details about the individual files will be revealed throughout the tutorial below. A simple Keras model will be trained to predict the Iris plant class from the well-known Iris dataset.
Outline
- Train the model
- Generate the model directory
- Store all data files required for the model and the dataloader in a temporary folder
- Write model.yaml
- Write dataloader.yaml
- Write dataloader.py
- Test the model with $ kipoi test .
- Publish data files on zenodo
- Update model.yaml and dataloader.yaml to contain the links
- Test again
- Commit, push and generate a pull request
1. Train the model
Load and pre-process the data
import pandas as pd
import os
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from sklearn import datasets
iris = datasets.load_iris()
# view more info about the dataset
# print(iris["DESCR"])
# Data pre-processing
y_transformer = LabelBinarizer().fit(iris["target"])
x_transformer = StandardScaler().fit(iris["data"])
x = x_transformer.transform(iris["data"])
y = y_transformer.transform(iris["target"])
x[:3]
array([[-0.90068117, 1.03205722, -1.3412724 , -1.31297673],
[-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
[-1.38535265, 0.33784833, -1.39813811, -1.31297673]])
y[:3]
array([[1, 0, 0],
[1, 0, 0],
[1, 0, 0]])
Train an example model
Let's train a simple linear model using Keras.
from keras.models import Model
import keras.layers as kl
inp = kl.Input(shape=(4, ), name="features")
out = kl.Dense(units=3)(inp)
model = Model(inp, out)
model.compile("adam", "categorical_crossentropy")
model.fit(x, y, verbose=0)
Using TensorFlow backend.
WARNING:tensorflow:From /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi_interpret/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:2857: calling reduce_sum (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
WARNING:tensorflow:From /nfs/research1/stegle/users/rkreuzhu/conda-envs/kipoi_interpret/lib/python3.6/site-packages/keras/backend/tensorflow_backend.py:1340: calling reduce_mean (from tensorflow.python.ops.math_ops) with keep_dims is deprecated and will be removed in a future version.
Instructions for updating:
keep_dims is deprecated, use keepdims instead
<keras.callbacks.History at 0x2ab58e8ba860>
2. Set the model directory up
In reality, you would also need to:
- Fork the kipoi/models repository
- Clone your repository fork, ignoring all the git-lfs files: $ git clone [email protected]:<your_username>/models.git
- Create a new folder <mynewmodel>
3. Store the files in a temporary directory
All the data files of the model will have to be published on zenodo or figshare before the pull request is made. While setting up the Kipoi model, it is handy to keep the files in a temporary directory inside the model folder, which we will delete prior to the pull request.
# create the model directory
!mkdir contribution_sample_model
# create the temporary directory where we will keep the files that should later be published in zenodo or figshare
!mkdir contribution_sample_model/tmp
Now we can change the current working directory to the model directory:
import os
os.chdir("contribution_sample_model")
3a. Static files for dataloader
In our case we need to write a new dataloader. The dataloader uses some trained transformer instances (here the LabelBinarizer and StandardScaler transformers from sklearn). These have to be uploaded with the model files and then referenced correctly in the dataloader.yaml file. We will store the required files in the temporary folder:
import pickle

with open("tmp/y_transformer.pkl", "wb") as f:
    pickle.dump(y_transformer, f, protocol=2)

with open("tmp/x_transformer.pkl", "wb") as f:
    pickle.dump(x_transformer, f, protocol=2)
! ls tmp
x_transformer.pkl y_transformer.pkl
3b. Model definition / weights
Now that we have the static files that are required by the dataloader, we also need to store the model architecture and weights:
# Architecture
with open("tmp/model.json", "w") as f:
    f.write(model.to_json())

# Weights
model.save_weights("tmp/weights.h5")
Alternatively, if we were using a scikit-learn model, we would save its pickle file:
# Alternatively, for the scikit-learn model we would save the pickle file
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
lr = OneVsRestClassifier(LogisticRegression())
lr.fit(x, y)
with open("tmp/sklearn_model.pkl", "wb") as f:
pickle.dump(lr, f, protocol=2)
3c. Example files for the dataloader
Every Kipoi dataloader has to provide a set of example files so that Kipoi can perform its automated tests and users can have an idea what the dataloader files have to look like. Again we will store the files in the temporary folder:
# select first 20 rows of the iris dataset
X = pd.DataFrame(iris["data"][:20], columns=iris["feature_names"])
y = pd.DataFrame({"class": iris["target"][:20]})
# store the model input features and targets as csv files with column names:
X.to_csv("tmp/example_features.csv", index=False)
y.to_csv("tmp/example_targets.csv", index=False)
4. Write the model.yaml
Now it is time to write the model.yaml in the model directory. Since we are in the testing stage we will be using local file paths in the args
field - those will be replaced by zenodo links once everything is ready for publication.
model_yaml = """
defined_as: kipoi.model.KerasModel # use `kipoi.model.KerasModel`
args: # arguments of `kipoi.model.KerasModel`
arch: tmp/model.json
weights: tmp/weights.h5
default_dataloader: . # path to the dataloader directory. Here it's defined in the same directory
info: # General information about the model
authors:
- name: Your Name
github: your_github_username
email: [email protected]
doc: Model predicting the Iris species
cite_as: https://doi.org:/... # preferably a doi url to the paper
trained_on: Iris species dataset (http://archive.ics.uci.edu/ml/datasets/Iris) # short dataset description
license: MIT # Software License - defaults to MIT
dependencies:
conda: # install via conda
- python=3.9
- h5py=3.6
- pip=21.2.4
- keras=2.8
- tensorflow=2.8
pip: # install via pip
- protobuf==3.20
schema: # Model schema
inputs:
features:
shape: (4,) # array shape of a single sample (omitting the batch dimension)
doc: "Features in cm: sepal length, sepal width, petal length, petal width."
targets:
shape: (3,)
doc: "One-hot encoded array of classes: setosa, versicolor, virginica."
"""
with open("model.yaml", "w") as ofh:
ofh.write(model_yaml)
5. and 6. Write the dataloader.yaml and dataloader.py
PLEASE REMEMBER: Before writing a dataloader yourself please check whether the same functionality can be achieved using a ready-made dataloader in kipoiseq and use those as explained in the Kipoi docs.
Now it is time to write the dataloader.yaml. Since we defined the default_dataloader field in model.yaml as ., Kipoi will expect that our dataloader.yaml file lies in the same directory. Since we are in the testing stage we will be using local file paths in the args field - those will be replaced by zenodo links once everything is ready for publication.
dataloader_yaml = """
type: Dataset
defined_as: dataloader.MyDataset
args:
features_file:
# descr: > allows multi-line fields
doc: >
Csv file of the Iris Plants Database from
http://archive.ics.uci.edu/ml/datasets/Iris features.
type: str
example: tmp/example_features.csv # example files
x_transformer:
default: tmp/x_transformer.pkl
#default:
# url: https://github.com/kipoi/kipoi/raw/57734d716b8dedaffe460855e7cfe8f37ec2d48d/example/models/sklearn_iris/dataloader_files/x_transformer.pkl
# md5: bc1bf3c61c418b2d07506a7d0521a893
y_transformer:
default: tmp/y_transformer.pkl
targets_file:
doc: >
Csv file of the Iris Plants Database targets.
Not required for making the prediction.
type: str
example: tmp/example_targets.csv
optional: True # if not present, the `targets` field will not be present in the dataloader output
info:
authors:
- name: Your Name
github: your_github_account
email: [email protected]
version: 0.1
doc: Model predicting the Iris species
dependencies:
conda:
- python=3.9
- pandas=1.4
- numpy=1.22
pip:
- sklearn==0.0
output_schema:
inputs:
features:
shape: (4,)
doc: "Features in cm: sepal length, sepal width, petal length, petal width."
targets:
shape: (3, )
doc: "One-hot encoded array of classes: setosa, versicolor, virginica."
metadata: # field providing additional information to the samples (not directly required by the model)
example_row_number:
doc: Just an example metadata column
"""
with open("dataloader.yaml", "w") as ofh:
ofh.write(dataloader_yaml)
Since we have referred to the dataloader as dataloader.MyDataset, Kipoi expects a dataloader.py file in the same directory as dataloader.yaml, and this file has to contain the dataloader class, here MyDataset.
Notice that the external static files are arguments of the __init__ function! Their paths were defined in the dataloader.yaml.
import pickle
from kipoi.data import Dataset
import pandas as pd
import numpy as np
def read_pickle(f):
    with open(f, "rb") as f:
        return pickle.load(f)

class MyDataset(Dataset):

    def __init__(self, features_file, targets_file=None, x_transformer=None, y_transformer=None):
        self.features_file = features_file
        self.targets_file = targets_file

        self.y_transformer = read_pickle(y_transformer)
        self.x_transformer = read_pickle(x_transformer)

        self.features = pd.read_csv(features_file)
        if targets_file is not None:
            self.targets = pd.read_csv(targets_file)
            assert len(self.targets) == len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x_features = np.ravel(self.x_transformer.transform(self.features.iloc[idx].values[np.newaxis]))
        if self.targets_file is None:
            y_class = {}
        else:
            y_class = np.ravel(self.y_transformer.transform(self.targets.iloc[idx].values[np.newaxis]))
        return {
            "inputs": {
                "features": x_features
            },
            "targets": y_class,
            "metadata": {
                "example_row_number": idx
            }
        }
In order to elucidate what the dataloader class does, let's make a few function calls that are usually performed by the Kipoi API in order to generate model input:
# instantiate the dataloader
ds = MyDataset("tmp/example_features.csv", "tmp/example_targets.csv", "tmp/x_transformer.pkl",
"tmp/y_transformer.pkl")
# call __getitem__
ds[5]
{'inputs': {'features': array([-0.53717756, 1.95766909, -1.17067529, -1.05003079])},
'targets': array([1, 0, 0]),
'metadata': {'example_row_number': 5}}
it = ds.batch_iter(batch_size=3, shuffle=False, num_workers=2)
next(it)
{'inputs': {'features': array([[-0.90068117, 1.03205722, -1.3412724 , -1.31297673],
[-1.14301691, -0.1249576 , -1.3412724 , -1.31297673],
[-1.38535265, 0.33784833, -1.39813811, -1.31297673]])},
'targets': array([[1, 0, 0],
[1, 0, 0],
[1, 0, 0]]),
'metadata': {'example_row_number': array([0, 1, 2])}}
I will now store the code from above in a file so that we can test it:
dataloader_py = """
import pickle
from kipoi.data import Dataset
import pandas as pd
import numpy as np
def read_pickle(f):
    with open(f, "rb") as f:
        return pickle.load(f)

class MyDataset(Dataset):

    def __init__(self, features_file, targets_file=None, x_transformer=None, y_transformer=None):
        self.features_file = features_file
        self.targets_file = targets_file

        self.y_transformer = read_pickle(y_transformer)
        self.x_transformer = read_pickle(x_transformer)

        self.features = pd.read_csv(features_file)
        if targets_file is not None:
            self.targets = pd.read_csv(targets_file)
            assert len(self.targets) == len(self.features)

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        x_features = np.ravel(self.x_transformer.transform(self.features.iloc[idx].values[np.newaxis]))
        if self.targets_file is None:
            y_class = {}
        else:
            y_class = np.ravel(self.y_transformer.transform(self.targets.iloc[idx].values[np.newaxis]))
        return {
            "inputs": {
                "features": x_features
            },
            "targets": y_class,
            "metadata": {
                "example_row_number": idx
            }
        }
"""
with open("dataloader.py", "w") as ofh:
    ofh.write(dataloader_py)
7. Test the model
Now it is time to test the model.
!kipoi test .
WARNING [kipoi.specs] doc empty for one of the dataloader `args` fields
WARNING [kipoi.specs] doc empty for one of the dataloader `args` fields
INFO [kipoi.data] successfully loaded the dataloader from /nfs/research1/stegle/users/rkreuzhu/opt/model-zoo/notebooks/contribution_sample_model/dataloader.MyDataset
Using TensorFlow backend.
2018-10-11 17:41:58.586759: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
INFO [kipoi.model] successfully loaded model architecture from <_io.TextIOWrapper name='tmp/model.json' mode='r' encoding='UTF-8'>
INFO [kipoi.model] successfully loaded model weights from tmp/weights.h5
INFO [kipoi.pipeline] dataloader.output_schema is compatible with model.schema
INFO [kipoi.pipeline] Initialized data generator. Running batches...
INFO [kipoi.pipeline] Returned data schema correct
100%|█████████████████████████████████████████████| 1/1 [00:00<00:00, 28.88it/s]
INFO [kipoi.pipeline] predict_example done!
INFO [kipoi.cli.main] Successfully ran test_predict
8. Publish data on zenodo or figshare
Now that the model works, it is time to upload the data files to zenodo or figshare. To do so, follow the instructions on the website. It might be necessary to remove file suffixes in order to be able to load the respective files.
9. Update model.yaml and dataloader.yaml
Now the local file paths in model.yaml and dataloader.yaml have to be replaced by the zenodo / figshare URLs in the following way.
The entry:
args:
    ...
    x_transformer:
        default: tmp/x_transformer.pkl
would be replaced by:
args:
    ...
    x_transformer:
        default:
            url: https://zenodo.org/path/to/example_files/x_transformer.pkl
            md5: 76a5sd76asd57
So every local path has to be replaced by a url and md5 combination, where md5 is the md5 checksum of the file. If you cannot find the md5 checksum on the zenodo / figshare website, you can for example run curl https://zenodo.org/.../x_transformer.pkl | md5sum to calculate it.
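Alternatively, you can compute the checksum in Python with the standard hashlib module; a small helper (not part of the Kipoi API) could look like this:

import hashlib

def md5sum(path, chunk_size=1 << 20):
    # read the file in chunks so that large files also fit into memory
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

print(md5sum("tmp/x_transformer.pkl"))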
After replacing all the file paths, test the setup again by running kipoi test . and then delete the tmp folder. The only file(s) remaining in the folder should now be model.yaml (and in this case also dataloader.py and dataloader.yaml).
10. Test again
Now that you have deleted the temporary files, rerun the test to make sure everything works fine.
11. Commit and push
Now commit the model.yaml and, if needed (like in this example), also the dataloader.py and dataloader.yaml files, e.g. by running git add model.yaml dataloader.py dataloader.yaml.
Now you can push back to your fork (git push) and submit a pull request to kipoi/models to request adding your model to the Kipoi models.
Accessing local models through kipoi
In Kipoi it is not necessary to publish your model. You can leverage the full functionality of Kipoi for local models as well. All you have to do is specify --source dir when using the CLI or set source="dir" in the python API. The model name is then the local path to the model folder.
import kipoi
m = kipoi.get_model(".", source="dir") # See also python-sdk.ipynb
m.pipeline.predict({"features_file": "tmp/example_features.csv", "targets_file": "tmp/example_targets.csv" })[:5]
0it [00:00, ?it/s]
1it [00:00, 19.03it/s]
array([[ 3.2324865 , -0.29753828, 0.62135816],
[ 2.8549244 , 0.4957999 , 0.6873083 ],
[ 3.2744825 , 0.40906954, 0.99161 ],
[ 3.1413555 , 0.58123374, 1.0272367 ],
[ 3.416262 , -0.34901416, 0.76257455]], dtype=float32)
m.info
ModelInfo(authors=[Author(name='Your Name', github='your_github_username', email='[email protected]')], doc='Model predicting the Iris species', name=None, version='0.1', license='MIT', tags=[], contributors=[], cite_as='https://doi.org:/...', trained_on='Iris species dataset (http://archive.ics.uci.edu/ml/datasets/Iris)', training_procedure=None)
m.default_dataloader
dataloader.MyDataset
m.model
<keras.engine.training.Model at 0x2ab5a3eff668>
m.predict_on_batch
<bound method KerasModel.predict_on_batch of <kipoi.model.KerasModel object at 0x2ab5a2d75160>>
Best practices
- Like all other types of virtual environments, conda environments are sensitive to changes in the ever-changing world of python dependencies. It is recommended that you pin the versions of the packages listed under dependencies to those used in your local setup. If you already have a conda environment, simply do the following:
conda env export --no-build > env.yml
cat env.yml | grep keras
- Try installing keras, tensorflow, h5py, numpy, pandas etc. from conda as opposed to pip. These packages depend on system libraries so installing them from pip is likely to lead to unintended inconsistencies.
- If you are using a specific conda channel for a particular dependency, you can specify them as channel::package=version such as bioconda::pysam=0.16
- During nightly tests, a conda environment is created for each model group from scratch. If there is a test template present in model.yaml, generated predictions are compared against predictions stored in a file. In general, kipoi maintainers will generate a test file and update model.yaml after your submission. However, you can optionally do this yourself. The steps are as follows:
  - Run kipoi test <model-name> --source=dir -o <model-name>.predictions.h5
  - Upload <model-name>.predictions.h5 to zenodo or a file hosting service of your choice and get a url and checksum
  - Add a snippet like this to model.yaml (a rough illustration is shown after this list). By default the desired precision is 7 decimal places. Feel free to adjust this.
- In some cases, you may submit a model group that contains multiple models with similar configuration. If you feel it is okay to just test a subset of them during kipoi repository's nightly tests - add a file in the top level called test_subset.txt and specify the name of the model like here. In this case, the above test snippet needs to be modified like so. However, this is optional.
- FYI: Your submitted models will be tested in our circleci infrastructure. The specifications are -
- OS: ubuntu-2004:current
- conda: latest version
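As a rough illustration of the test snippet mentioned above (the field names follow existing models in the repository but are written from memory, and the url and md5 are placeholders, so double-check against a current model.yaml in kipoi/models):

test:
    expect:
        url: https://zenodo.org/path/to/MyModel.predictions.h5
        md5: 1234567890abcdef1234567890abcdef
    precision_decimal: 7  # desired precision (number of decimal places)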
Recap
Congrats! You made it through the tutorial! Feel free to use this model as a template for your own model. Alternatively, you can use kipoi init to set up a model directory. Make sure you have read the getting started guide for contributing models.