Contributing#
Development setup#
Prerequisites#
NVIDIA GPU with CUDA support
micromamba, conda/mamba, or uv
A RAPIDS environment (e.g., conda rapids-26.04 or pip-installed RAPIDS)
Clone and install#
git clone https://github.com/scverse/rapids_singlecell.git
cd rapids_singlecell
(uv) pip install -e ".[test]"
The editable install compiles the CUDA kernels for your local GPU architecture.
After the install, compiled .so modules and .pyi type stubs are placed in src/rapids_singlecell/_cuda/.
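One way to confirm that the editable install produced its artifacts is to list the compiled modules and stubs in that directory. The following is a minimal sketch; the helper name `list_built_artifacts` is hypothetical, and the path is the one given above, relative to the repository root.

```python
from pathlib import Path


def list_built_artifacts(cuda_dir):
    """Return sorted names of compiled modules (.so) and type stubs (.pyi)."""
    cuda_dir = Path(cuda_dir)
    if not cuda_dir.is_dir():
        return []
    # Path.suffix is the last extension, so "x.abi3.so" has suffix ".so".
    found = [p.name for p in cuda_dir.iterdir() if p.suffix in {".so", ".pyi"}]
    return sorted(found)


if __name__ == "__main__":
    # Path taken from the text above; run this from the repository root.
    for name in list_built_artifacts("src/rapids_singlecell/_cuda"):
        print(name)
```

An empty result most likely means the build did not run or the script was started from a different directory.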
Pre-commit hooks#
pip install pre-commit
pre-commit install
Run manually on all files:
pre-commit run --all-files
Project structure#
rapids_singlecell/
├── src/rapids_singlecell/ # Python source
│ ├── preprocessing/ # pp module (normalize, scale, HVG, etc.)
│ ├── tools/ # tl module (PCA, UMAP, clustering, etc.)
│ ├── squidpy_gpu/ # spatial analysis (co_occurrence, ligrec, etc.)
│ ├── pertpy_gpu/ # perturbation analysis (edistance, etc.)
│ ├── decoupler_gpu/ # pathway analysis
│ ├── get/ # CPU/GPU data transfer utilities
│ └── _cuda/ # Compiled CUDA kernels (nanobind)
│ ├── nb_types.h # Shared ndarray type aliases
│ ├── <module>/ # Each kernel module (e.g., wilcoxon/)
│ │ ├── <module>.cu # nanobind bindings + launch wrappers
│ │ └── kernels_*.cuh # CUDA kernel implementations
│ ├── *.abi3.so # Compiled modules (gitignored)
│ ├── *.pyi # Type stubs (gitignored, auto-generated)
│ └── py.typed # PEP 561 marker (gitignored, auto-generated)
├── tests/ # pytest test suite
├── docs/ # Sphinx documentation
├── docker/ # Docker and CI build images
├── conda/ # Conda environment files
├── CMakeLists.txt # CMake build for CUDA extensions
└── pyproject.toml # Project metadata and build config
Contributing GPU code#
All contributions are welcome, regardless of the GPU programming approach you use. You do not need to know C++ or nanobind to contribute GPU-accelerated functions.
We accept pull requests using any of the following:
Pure CuPy (array API, cupyx.scipy, etc.)
CuPy RawKernels
numba-cuda kernels
nanobind/CUDA C++ extensions
Please do not introduce JAX or PyTorch as dependencies. The project is built on the RAPIDS/CuPy stack and we want to keep the dependency footprint minimal.
The most important thing is a correct, tested implementation. Performance optimization and porting to nanobind C++ (if needed) can happen in follow-up PRs or directly on your branch by the maintainers. Don’t let unfamiliarity with the internal kernel system stop you from contributing — a working CuPy implementation is a great starting point.
Tip
When opening a pull request, please enable “Allow edits by maintainers” (the checkbox on the PR creation page). This lets us make small fixes, optimizations, or nanobind ports directly on your branch without extra back-and-forth.
CUDA kernel architecture (nanobind)#
Overview#
GPU-accelerated functions are implemented as nanobind C++ extensions compiled with CUDA.
Each kernel module lives in its own subdirectory under src/rapids_singlecell/_cuda/ and consists of:
A .cu file with nanobind bindings and kernel launch wrappers
One or more .cuh headers with the actual CUDA kernel implementations
The shared header nb_types.h provides type aliases used across all modules:
cuda_array<T> // no contiguity constraint
cuda_array_c<T> // C-contiguous (row-major)
cuda_array_f<T> // F-contiguous (column-major)
cuda_array_contig<T, Contig> // parameterized contiguity
Choose the appropriate alias based on how the kernel accesses data.
Use cuda_array_f for kernels that index column-by-column (e.g., data + col * n_rows), and cuda_array_c for row-major access.
nanobind will reject arrays with the wrong memory layout at runtime.
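The difference between the two layouts can be checked from Python before an array is handed to a kernel. The sketch below uses NumPy purely as a CPU stand-in (an assumption made so that it runs without a GPU); CuPy arrays expose the same `flags` attributes and `cupy.asfortranarray` plays the role of `np.asfortranarray`.

```python
import numpy as np

# A small 2 x 3 matrix; NumPy (like CuPy) creates C-contiguous arrays by default.
x = np.arange(6, dtype=np.float32).reshape(2, 3)
# Same values, but stored column-major (F-contiguous).
x_f = np.asfortranarray(x)

assert x.flags.c_contiguous and not x.flags.f_contiguous
assert x_f.flags.f_contiguous and not x_f.flags.c_contiguous

# In column-major storage, element (row, col) sits at offset col * n_rows + row,
# which is the indexing pattern mentioned in the text above.
n_rows = x.shape[0]
flat = x_f.ravel(order="F")  # elements in column-major order
row, col = 1, 2
assert flat[col * n_rows + row] == x[row, col]
```

Converting with `asfortranarray` copies the data when the input is not already column-major, so it is worth doing once up front rather than on every call.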
Adding a new kernel#
Create a directory under src/rapids_singlecell/_cuda/your_module/
Write the kernel header (kernels_your_module.cuh) and bindings (your_module.cu)
Include "../nb_types.h" for the shared type aliases
Register the module in CMakeLists.txt:
add_nb_cuda_module(_your_module_cuda src/rapids_singlecell/_cuda/your_module/your_module.cu)
Add the module name to the __all__ list in src/rapids_singlecell/_cuda/__init__.py:
__all__ = [
    ...,
    "_your_module_cuda",
]
This registers the module for lazy loading: imports return None instead of raising ImportError when the compiled extension is unavailable (e.g., docs builds without a GPU).
Rebuild:
uv pip install -e .
The add_nb_cuda_module helper automatically handles:
Stable ABI + LTO compilation
Linking against CUDA runtime
Installing the .so into the wheel
Generating .pyi type stubs (install-time for wheels, build-time for editable installs)
Copying the built module into the source tree for editable installs
Kernel conventions#
Each kernel launch wrapper is a static inline function in the .cu file
Use nb::kw_only() to separate data arguments from configuration arguments
Accept std::uintptr_t stream as the last parameter (default 0) to support stream-based execution
Keep kernel logic in .cuh headers, bindings in .cu files
Import _cuda modules via rapids_singlecell._cuda. The _cuda package uses lazy loading with automatic ImportError handling: if the compiled extension is unavailable (e.g., docs builds without a GPU), the import returns None instead of raising an error:
from rapids_singlecell._cuda import _my_module_cuda as _my

def my_function(adata):
    # _my is either the real module or None
    _my.kernel(...)
No try/except or lazy imports needed; the _cuda.__init__.py handles it for you.
Testing#
Hatch test environments#
The project uses hatch to manage test environments. The test matrix is defined in hatch.toml with two axes:
cuda (12 or 13): selects the matching RAPIDS/CuPy packages
deps (stable, dev, or rapids_prerelease): controls Python version and dependency sources
| deps | Python | Description |
|---|---|---|
| stable | 3.12 | Released versions of all dependencies |
| dev | 3.14 | Upstream |
| rapids_prerelease | 3.14 | RAPIDS nightly wheels |
To run the test suite against a specific matrix combination:
# Run stable tests with CUDA 13
(uvx) hatch run hatch-test.stable-13:run
# Run stable tests with CUDA 12
(uvx) hatch run hatch-test.stable-12:run
# Run dev tests (upstream anndata/scanpy) with CUDA 13
(uvx) hatch run hatch-test.dev-13:run
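The environment names in the commands above appear to follow the pattern `hatch-test.<deps>-<cuda>`. Assuming that pattern holds for every combination of the two axes (the text only shows three of them, so the remaining names are an assumption), the full list can be generated as follows. The helper `env_names` is hypothetical.

```python
from itertools import product

# Axis values as listed in the text above.
DEPS = ("stable", "dev", "rapids_prerelease")
CUDA = ("12", "13")


def env_names(deps=DEPS, cuda=CUDA):
    """Build one environment name per (deps, cuda) combination."""
    return [f"hatch-test.{d}-{c}" for d, c in product(deps, cuda)]


if __name__ == "__main__":
    for name in env_names():
        # Each name is meant to be used as: hatch run <name>:run
        print(name)
```

Checking the names against `hatch.toml` is still advisable before relying on them in scripts.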
Running individual tests#
For quick iteration during development, you can pass specific test paths:
# Run a specific test file
(uvx) hatch run hatch-test.stable-13:run tests/path/to/test.py -v
# Run a specific test
(uvx) hatch run hatch-test.stable-13:run tests/path/to/test.py::test_name -v
Important
Always set a timeout when running tests with new CUDA kernels, as they may hang on launch failures.
Tests have a default 120-second timeout configured in pyproject.toml.
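When a test command is launched from a script, an outer timeout offers a second safety net in case a process hangs. The sketch below uses only the standard library; `run_with_timeout` is a hypothetical helper and is independent of the pytest timeout configured in pyproject.toml.

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout_s):
    """Run cmd; return its exit code, or None if it exceeded timeout_s seconds."""
    try:
        # subprocess.run kills the child process when the timeout expires.
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return None


if __name__ == "__main__":
    # A trivial command stands in for a real test invocation here.
    code = run_with_timeout([sys.executable, "-c", "pass"], timeout_s=30)
    print("timed out" if code is None else f"exit code {code}")
```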
Test guidelines#
Never change test tolerances without understanding why a test is failing. If a tolerance change is needed, document the current tolerance, the actual error, the proposed tolerance, and the reason.
GPU shared memory limits vary across devices (e.g., T4 has 64 KB per block). Kernels should query device limits at runtime rather than using fixed parameters.
Use pytest.importorskip for optional dependencies in tests.
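To illustrate the shared memory guideline above, the following sketch derives a launch parameter from a memory limit instead of hard-coding it. `rows_per_block` is a hypothetical helper. The commented CuPy query is an assumption about how the limit could be obtained on a real device and is left out of the executable part so the example runs without a GPU.

```python
def rows_per_block(shared_mem_limit_bytes, n_cols, itemsize=4):
    """Largest number of rows of n_cols elements that fit in shared memory."""
    bytes_per_row = n_cols * itemsize
    if bytes_per_row <= 0:
        raise ValueError("n_cols and itemsize must be positive")
    return shared_mem_limit_bytes // bytes_per_row


# On a real device the limit would be queried at runtime, for example
# (assumed CuPy usage, not executed here):
#   import cupy as cp
#   limit = cp.cuda.Device().attributes["MaxSharedMemoryPerBlock"]

# The same kernel configuration adapts to different limits.
assert rows_per_block(64 * 1024, n_cols=128) == 128  # 64 KB, float32 rows
assert rows_per_block(48 * 1024, n_cols=128) == 96   # 48 KB, float32 rows
```

A result of zero means a single row does not fit, in which case the kernel needs a fallback path, not a larger fixed buffer.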
Building documentation#
(uvx) hatch run docs:build
To build without compiling CUDA extensions (e.g., on a machine without a GPU):
CMAKE_ARGS="-DRSC_BUILD_EXTENSIONS=OFF" (uvx) hatch run docs:build
The built docs are in docs/_build/html/.
Distribution and packaging#
Package layout on PyPI#
The project publishes three separate packages:
| Package | Contents | For whom |
|---|---|---|
| | Prebuilt wheels (CUDA 12) | Most users |
| | Prebuilt wheels (CUDA 13) | Most users |
| rapids-singlecell | Source distribution | Self-compilation |
Wheel builds#
Wheels are built via cibuildwheel in GitHub Actions using custom manylinux Docker images with CUDA toolkit pre-installed.
The CI renames the package and adjusts optional dependencies per CUDA version using an inline Python script in publish.yml.
Each wheel contains:
Compiled .abi3.so modules (stable ABI, one wheel per platform for all Python 3.12+ versions)
.pyi type stubs for IDE support
py.typed PEP 561 marker
Source files (.cu, .cuh, .h) are excluded from wheels via wheel.exclude in pyproject.toml.
They are included in the source distribution for self-compilation.
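Since a wheel is a zip archive, the exclusion described above can be checked with the standard library. This is a minimal sketch; `source_files_in_wheel` is a hypothetical helper, and the suffix list is taken from the text.

```python
import zipfile

# Source suffixes that, per the text above, should not appear in a wheel.
SOURCE_SUFFIXES = (".cu", ".cuh", ".h")


def source_files_in_wheel(wheel_path):
    """Return the archive members whose names end in a source suffix."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return [n for n in wheel.namelist() if n.endswith(SOURCE_SUFFIXES)]
```

An empty list is the expected result for a correctly built wheel. Any returned names point to files that slipped past the exclude rule.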
CUDA architectures#
CUDA 12 wheels target: 75 (Turing), 80 (Ampere), 86 (Ampere), 89 (Ada), 90 (Hopper + PTX for forward compatibility).
CUDA 13 wheels target: 75 (Turing), 80 (Ampere), 86 (Ampere), 89 (Ada), 90 (Hopper), 100 (Blackwell), 120 (Blackwell).
Source builds (pip install rapids-singlecell) compile for the local GPU architecture by default (CMAKE_CUDA_ARCHITECTURES=native).
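The architecture lists above can be written down as a small lookup table, for instance to check whether a given GPU has a directly matching build target. This sketch only tests for an exact match against the listed targets. It deliberately does not model PTX forward compatibility or binary compatibility between architectures, and `has_matching_target` is a hypothetical helper.

```python
# Targets per CUDA major version, copied from the lists above.
WHEEL_ARCHS = {
    12: (75, 80, 86, 89, 90),
    13: (75, 80, 86, 89, 90, 100, 120),
}


def has_matching_target(compute_capability, cuda_major):
    """True if (major, minor) is one of the listed wheel targets."""
    major, minor = compute_capability
    return major * 10 + minor in WHEEL_ARCHS[cuda_major]


# Compute capability 7.5 (Turing) is listed for both CUDA versions.
assert has_matching_target((7, 5), 12)
# Compute capability 10.0 (Blackwell) is listed only for CUDA 13.
assert not has_matching_target((10, 0), 12)
assert has_matching_target((10, 0), 13)
```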
Docker containers#
The docker/ directory contains two types of Dockerfiles:
User-facing containers (for running rapids-singlecell):
| File | Purpose |
|---|---|
| | Base image with conda RAPIDS environment + pip dependencies. Uses |
| | Final image that builds on |
These are built by docker-push.sh, which strips the rapids-singlecell pip line from the conda environment file and builds both images in sequence.
CI manylinux images (for building PyPI wheels):
| File | Purpose |
|---|---|
| | x86_64 build image with CUDA 12.2 toolkit |
| | aarch64 build image with CUDA 12.2 toolkit |
| | x86_64 build image with CUDA 13.0 toolkit |
| | aarch64 build image with CUDA 13.0 toolkit |
These are based on quay.io/pypa/manylinux_2_28 and only install the CUDA toolkit packages needed for compilation (nvcc, cudart, cublas, cusparse).
They are used by cibuildwheel in publish.yml to produce portable wheels.
Release process#
Tag the release: git tag v0.X.Y (or v0.X.Yrc1 for release candidates)
Create a GitHub release from the tag
The publish.yml workflow builds wheels + sdist and uploads to PyPI via trusted publishing
Pre-releases (rc, beta, alpha) are automatically recognized by PyPI; users must opt in with pip install --pre
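Before pushing a tag, it can help to check whether it will be treated as a pre-release. The sketch below is a simplified check for tags of the form shown above (`v0.X.Y` with an optional `rc`, `a`/`alpha`, or `b`/`beta` suffix). It is not a full PEP 440 parser, and `is_prerelease_tag` is a hypothetical helper.

```python
import re

# v<major>.<minor>.<patch> with an optional pre-release suffix such as rc1.
_TAG = re.compile(r"^v(\d+)\.(\d+)\.(\d+)((?:alpha|beta|a|b|rc)\d+)?$")


def is_prerelease_tag(tag):
    """Return True for pre-release tags, False for final releases."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unexpected tag format: {tag!r}")
    return match.group(4) is not None


assert is_prerelease_tag("v0.1.0rc1")
assert not is_prerelease_tag("v0.1.0")
```

Tags that do not match the expected shape raise an error, which catches typos such as a missing `v` prefix before a release is created.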