A software package for rapid prototyping of data-centric software flows to derive machine learning and AI models, especially for chemical use cases. The abstraction of the objects is implemented in acheeve. https://www.chembee.eu



License: AGPL v3 Python Versions Style Black Documentation Status status-badge


Name                                                Stmts   Miss Branch BrPart  Cover   Missing
chembee/actions/benchmark_algorithms.py                95     30     38      4    64%   37, 50-55, 102->107, 113-122, 183->185, 245-259
chembee/actions/cross_validation.py                    54      1     12      3    94%   134->126, 136, 137->139
chembee/actions/evaluation.py                          96      2     14      0    98%   188-189
chembee/actions/feature_extraction.py                  22      1      4      1    92%   25
chembee/actions/save_model.py                           3      3      0      0     0%   1-3
chembee/actions/search.py                              30     10     12      1    60%   15-22, 57, 112
chembee/config/benchmark/grid_search_cv.py             11     11      4      0     0%   1-19
chembee/config/calibration/linear_regression.py        18     18      2      0     0%   1-28
chembee/config/calibration/restricted_bm.py             5      5      0      0     0%   1-9
chembee/config/calibration/spectral_clustering.py      16     16      2      0     0%   1-24
chembee/config/calibration/svc.py                      37     10      4      0    76%   13-16, 20-25
chembee/datasets/BioDegDataSet.py                      40      4      4      1    89%   82, 146-149, 168
chembee/datasets/DataSet.py                            11      4      2      0    69%   8, 11, 15, 19
chembee/plotting/compounds.py                           7      7      2      0     0%   1-31
chembee/plotting/evaluation.py                        215     28     38      2    87%   241-242, 383, 417-436, 470, 612-625
chembee/plotting/graphics.py                           59     15      6      1    72%   108, 249-258, 269-273
chembee/preparation/processing.py                      72     15     20      1    80%   149-157, 189-200, 204-207
chembee/utils/file_utils.py                            81     23     36     10    68%   34, 37->41, 39, 44, 61-71, 88, 107, 110, 133-134, 141-145, 164->exit, 169, 192-193
chembee/utils/utils.py                                 22      9      6      1    50%   28-33, 40-44
TOTAL                                                1270    212    274     25    82%


To accelerate the shift to a sustainable, lean, and demand-driven chemical industry, we at sail.black needed a package to draft microservices quickly and reliably. The chembee package abstracts the development of modules in the pipeline envisioned in 2020 for rapid-prototyping software that evaluates sustainable chemicals (compare: https://www.researchgate.net/project/Lean-Drug-Development).


Chembee is a modelling kit that automates the first step of the MLOps value pipeline for a given dataset. Its perspective is data-centric rather than algorithm-specific, in contrast to scikit-learn, which is algorithm-centric.

In the end, data creates value. The package helps find the best treatment for a given dataset quickly. By automating rapid prototyping for environmental degradation modelling and other endpoints, the package merges computer-aided drug design (CADD) and Environmental Sciences.
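The data-centric idea above (keep one dataset fixed and screen several algorithms against it) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the chembee API: the function names `fit_majority_class`, `fit_nearest_centroid`, `k_fold_accuracy`, and `screen` are hypothetical, the two toy classifiers stand in for real models, and the data is made up.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Row = Sequence[float]
Predictor = Callable[[Row], int]
Fitter = Callable[[List[Row], List[int]], Predictor]


def fit_majority_class(X: List[Row], y: List[int]) -> Predictor:
    """Baseline: always predict the most frequent training label."""
    # sorted() makes tie-breaking deterministic (the smallest label wins).
    label = max(sorted(set(y)), key=y.count)
    return lambda row: label


def fit_nearest_centroid(X: List[Row], y: List[int]) -> Predictor:
    """Predict the label whose class centroid is closest to the row."""
    centroids: Dict[int, List[float]] = {}
    for label in sorted(set(y)):
        rows = [x for x, target in zip(X, y) if target == label]
        centroids[label] = [sum(col) / len(col) for col in zip(*rows)]

    def predict(row: Row) -> int:
        def squared_distance(label: int) -> float:
            return sum((a - b) ** 2 for a, b in zip(row, centroids[label]))

        return min(centroids, key=squared_distance)

    return predict


def k_fold_accuracy(fit: Fitter, X: List[Row], y: List[int], k: int = 3) -> float:
    """Mean accuracy over k interleaved folds (every k-th sample is held out)."""
    fold_scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(X), k))
        X_train = [x for i, x in enumerate(X) if i not in test_idx]
        y_train = [t for i, t in enumerate(y) if i not in test_idx]
        predict = fit(X_train, y_train)
        hits = sum(predict(X[i]) == y[i] for i in test_idx)
        fold_scores.append(hits / len(test_idx))
    return sum(fold_scores) / k


def screen(
    candidates: Dict[str, Fitter], X: List[Row], y: List[int], k: int = 3
) -> List[Tuple[str, float]]:
    """Score every candidate on the same dataset and rank them, best first."""
    scores = {name: k_fold_accuracy(fit, X, y, k) for name, fit in candidates.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up toy data: two well-separated clusters with labels 0 and 1.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [5.0, 5.1], [5.2, 4.9], [4.8, 5.0]]
y = [0, 0, 0, 1, 1, 1]

ranking = screen(
    {"majority_class": fit_majority_class, "nearest_centroid": fit_nearest_centroid},
    X,
    y,
)
for name, accuracy in ranking:
    print(f"{name}: {accuracy:.2f}")
```

The dataset is the fixed input and the algorithms are interchangeable candidates, so adding another model only means adding one more entry to the dictionary passed to `screen`.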

Models crafted with and by chembee must follow the REACH and OECD guidelines for QSAR models that replace experiments for environmental and pharmaceutical endpoints. Therefore, the actions module provides functionality to comply with the REACH and OECD standards.

The goal of chembee is thus to provide methods for quickly creating explainable, compliant, and production-ready QSAR models for use in microservices.

Software Pattern

The software pattern is as follows:

SOLID Pattern

The pattern follows the SOLID principles. Although this is not yet proven in the field, data preparation might be seen as an action, too. Treating data preparation as part of the actions module would further abstract the software pattern and is worth considering for future releases. Do you have any ideas? Participate in our discussions!

Merging CADD and Environmental Sciences

A primer of what the synthesis of Environmental Sciences and CADD can achieve:

Distribution Dataset


pip install chembee

If the installation fails with an error related to file-utils, e.g. on macOS, install file-utils from source:

git clone https://codeberg.org/cap_jmk/file-utils.git
pip install -e file-utils/

Documentation and Tutorials

  • For more information read the documentation
  • For an introduction to the chembee package refer to the notebooks
  • For in-depth information about the project and how to develop with and on chembee refer to the Wiki
  • Check out the corresponding thesis
  • Join us


Get to know your data, especially with polar charts.
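As a rough sketch of how data could be prepared for such a polar chart, the snippet below places one spoke per descriptor at evenly spaced angles and scales each value by an upper bound. The helper `polar_points` and the sample values are hypothetical and are not taken from chembee; any plotting library with polar axes (for example matplotlib's `projection="polar"`) could then draw the resulting points.

```python
import math
from typing import List, Sequence, Tuple


def polar_points(
    values: Sequence[float], upper_bounds: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return one (angle in radians, scaled radius) pair per descriptor.

    Spokes are evenly spaced around the circle; each radius is the value
    divided by its upper bound, so a radius of 1.0 sits exactly on the bound.
    """
    if len(values) != len(upper_bounds):
        raise ValueError("values and upper_bounds must have the same length")
    n = len(values)
    return [
        (2 * math.pi * i / n, value / bound)
        for i, (value, bound) in enumerate(zip(values, upper_bounds))
    ]


# Made-up class means for four descriptors, scaled by made-up upper bounds.
points = polar_points([250.0, 2.5, 1.0, 5.0], [500.0, 5.0, 5.0, 10.0])
for angle, radius in points:
    print(f"angle={angle:.2f} rad, radius={radius:.2f}")
```

Scaling every descriptor by its own bound keeps descriptors with very different units comparable on a single chart.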

Example Biodegradability

Is Ready Biodegradable

Polar Chart

Is Not Ready Biodegradable

Polar Chart

Both charts indicate that the Lipinski rule of five plays a significant role in the ready biodegradability of a chemical compound according to OECD Guideline 301: ready biodegradable compounds follow the Lipinski rule of five more closely than compounds that are not ready biodegradable.
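For readers unfamiliar with the rule, the Lipinski rule of five can be checked with a few comparisons: molecular weight of at most 500 g/mol, logP of at most 5, at most 5 hydrogen bond donors, and at most 10 hydrogen bond acceptors. The sketch below assumes the descriptors have already been computed elsewhere; the `Descriptors` class, the function names, and the example values are hypothetical and not part of chembee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed molecular descriptors (hypothetical container)."""

    molecular_weight: float  # g/mol
    log_p: float  # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Lipinski criteria are violated."""
    checks = [
        d.molecular_weight > 500,
        d.log_p > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    ]
    return sum(checks)


def follows_rule_of_five(d: Descriptors, max_violations: int = 1) -> bool:
    """Common reading of the rule: at most one violation is tolerated."""
    return lipinski_violations(d) <= max_violations


# Made-up descriptor values, for illustration only.
small_polar = Descriptors(180.0, 1.2, 1, 4)
large_lipophilic = Descriptors(720.0, 6.5, 6, 12)

print(lipinski_violations(small_polar), follows_rule_of_five(small_polar))
print(lipinski_violations(large_lipophilic), follows_rule_of_five(large_lipophilic))
```

Counting violations instead of returning a single flag makes it possible to compare how closely two groups of compounds follow the rule, which is the comparison the charts above are meant to support.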


At the moment, pytest runs automated coverage tests as defined in the setup.cfg file.



How to cite

Until a publication is available, please cite the Git repository.