Contributing#

Development setup#

Prerequisites#

  • NVIDIA GPU with CUDA support

  • micromamba, conda/mamba, or uv

  • A RAPIDS environment (e.g., conda rapids-26.04 or pip-installed RAPIDS)

Clone and install#

git clone https://github.com/scverse/rapids_singlecell.git
cd rapids_singlecell
(uv) pip install -e ".[test]"

The editable install compiles the CUDA kernels for your local GPU architecture. After the install, compiled .so modules and .pyi type stubs are placed in src/rapids_singlecell/_cuda/.

Pre-commit hooks#

pip install pre-commit
pre-commit install

Run manually on all files:

pre-commit run --all-files

Project structure#

rapids_singlecell/
├── src/rapids_singlecell/       # Python source
│   ├── preprocessing/           # pp module (normalize, scale, HVG, etc.)
│   ├── tools/                   # tl module (PCA, UMAP, clustering, etc.)
│   ├── squidpy_gpu/             # spatial analysis (co_occurrence, ligrec, etc.)
│   ├── pertpy_gpu/              # perturbation analysis (edistance, etc.)
│   ├── decoupler_gpu/           # pathway analysis
│   ├── get/                     # CPU/GPU data transfer utilities
│   └── _cuda/                   # Compiled CUDA kernels (nanobind)
│       ├── nb_types.h           # Shared ndarray type aliases
│       ├── <module>/            # Each kernel module (e.g., wilcoxon/)
│       │   ├── <module>.cu      # nanobind bindings + launch wrappers
│       │   └── kernels_*.cuh    # CUDA kernel implementations
│       ├── *.abi3.so            # Compiled modules (gitignored)
│       ├── *.pyi                # Type stubs (gitignored, auto-generated)
│       └── py.typed             # PEP 561 marker (gitignored, auto-generated)
├── tests/                       # pytest test suite
├── docs/                        # Sphinx documentation
├── docker/                      # Docker and CI build images
├── conda/                       # Conda environment files
├── CMakeLists.txt               # CMake build for CUDA extensions
└── pyproject.toml               # Project metadata and build config

Contributing GPU code#

All contributions are welcome, regardless of the GPU programming approach you use. You do not need to know C++ or nanobind to contribute GPU-accelerated functions.

We accept pull requests using any of the following:

  • Pure CuPy (array API, cupyx.scipy, etc.)

  • CuPy RawKernels

  • numba-cuda kernels

  • nanobind/CUDA C++ extensions

Please do not introduce JAX or PyTorch as dependencies. The project is built on the RAPIDS/CuPy stack and we want to keep the dependency footprint “minimal”.

The most important thing is a correct, tested implementation. Performance optimization and porting to nanobind C++ (if needed) can happen in follow-up PRs or directly on your branch by the maintainers. Don’t let unfamiliarity with the internal kernel system stop you from contributing — a working CuPy implementation is a great starting point.

Tip

When opening a pull request, please enable “Allow edits by maintainers” (the checkbox on the PR creation page). This lets us make small fixes, optimizations, or nanobind ports directly on your branch without extra back-and-forth.

CUDA kernel architecture (nanobind)#

Overview#

GPU-accelerated functions are implemented as nanobind C++ extensions compiled with CUDA. Each kernel module lives in its own subdirectory under src/rapids_singlecell/_cuda/ and consists of:

  • A .cu file with nanobind bindings and kernel launch wrappers

  • One or more .cuh headers with the actual CUDA kernel implementations

The shared header nb_types.h provides type aliases used across all modules:

cuda_array<T>                  // no contiguity constraint
cuda_array_c<T>                // C-contiguous (row-major)
cuda_array_f<T>                // F-contiguous (column-major)
cuda_array_contig<T, Contig>   // parameterized contiguity

Choose the appropriate alias based on how the kernel accesses data. Use cuda_array_f for kernels that index column-by-column (e.g., data + col * n_rows), and cuda_array_c for row-major access. nanobind will reject arrays with the wrong memory layout at runtime.

Adding a new kernel#

  1. Create a directory under src/rapids_singlecell/_cuda/your_module/

  2. Write the kernel header (kernels_your_module.cuh) and bindings (your_module.cu)

  3. Include "../nb_types.h" for the shared type aliases

  4. Register the module in CMakeLists.txt:

    add_nb_cuda_module(_your_module_cuda src/rapids_singlecell/_cuda/your_module/your_module.cu)
    
  5. Add the module name to the __all__ list in src/rapids_singlecell/_cuda/__init__.py:

    __all__ = [
        ...,
        "_your_module_cuda",
    ]
    

    This registers the module for lazy loading — imports return None instead of raising ImportError when the compiled extension is unavailable (e.g., docs builds without a GPU).

  6. Rebuild: uv pip install -e .

The add_nb_cuda_module helper automatically handles:

  • Stable ABI + LTO compilation

  • Linking against CUDA runtime

  • Installing the .so into the wheel

  • Generating .pyi type stubs (install-time for wheels, build-time for editable installs)

  • Copying the built module into the source tree for editable installs

Kernel conventions#

  • Each kernel launch wrapper is a static inline function in the .cu file

  • Use nb::kw_only() to separate data arguments from configuration arguments

  • Accept std::uintptr_t stream as the last parameter (default 0) to support stream-based execution

  • Keep kernel logic in .cuh headers, bindings in .cu files

  • Import _cuda modules via rapids_singlecell._cuda. The _cuda package uses lazy loading with automatic ImportError handling — if the compiled extension is unavailable (e.g., docs builds without a GPU), the import returns None instead of raising an error:

    from rapids_singlecell._cuda import _my_module_cuda as _my
    
    def my_function(adata):
        # _my is either the real module or None
        _my.kernel(...)
    

    No try/except or lazy imports needed — the _cuda.__init__.py handles it for you.

Testing#

Hatch test environments#

The project uses hatch to manage test environments. The test matrix is defined in hatch.toml with two axes:

  • cuda: 12 or 13 — selects the matching RAPIDS/CuPy packages

  • deps: stable, dev, or rapids_prerelease — controls Python version and dependency sources

deps

Python

Description

stable

3.12

Released versions of all dependencies

dev

3.14

Upstream main branches of anndata and scanpy

rapids_prerelease

3.14

RAPIDS nightly wheels

To run the test suite against a specific matrix combination:

# Run stable tests with CUDA 13
(uvx) hatch run hatch-test.stable-13:run

# Run stable tests with CUDA 12
(uvx) hatch run hatch-test.stable-12:run

# Run dev tests (upstream anndata/scanpy) with CUDA 13
(uvx) hatch run hatch-test.dev-13:run

Running individual tests#

For quick iteration during development, you can pass specific test paths:

# Run a specific test file
(uvx) hatch run hatch-test.stable-13:run tests/path/to/test.py -v

# Run a specific test
(uvx) hatch run hatch-test.stable-13:run tests/path/to/test.py::test_name -v

Important

Always set a timeout when running tests with new CUDA kernels, as they may hang on launch failures. Tests have a default 120-second timeout configured in pyproject.toml.

Test guidelines#

  • Never change test tolerances without understanding why a test is failing. If a tolerance change is needed, document the current tolerance, the actual error, the proposed tolerance, and the reason.

  • GPU shared memory limits vary across devices (e.g., T4 has 64KB per block). Kernels should query device limits at runtime rather than using fixed parameters.

  • Use pytest.importorskip for optional dependencies in tests.

Building documentation#

(uvx) hatch run docs:build

To build without compiling CUDA extensions (e.g., on a machine without a GPU):

CMAKE_ARGS="-DRSC_BUILD_EXTENSIONS=OFF" (uvx) hatch run docs:build

The built docs are in docs/_build/html/.

Distribution and packaging#

Package layout on PyPI#

The project publishes three separate packages:

Package

Contents

For whom

rapids-singlecell-cu12

Prebuilt wheels (CUDA 12)

Most users

rapids-singlecell-cu13

Prebuilt wheels (CUDA 13)

Most users

rapids-singlecell

Source distribution

Self-compilation

Wheel builds#

Wheels are built via cibuildwheel in GitHub Actions using custom manylinux Docker images with CUDA toolkit pre-installed. The CI renames the package and adjusts optional dependencies per CUDA version using an inline Python script in publish.yml.

Each wheel contains:

  • Compiled .abi3.so modules (stable ABI, one wheel per platform for all Python 3.12+ versions)

  • .pyi type stubs for IDE support

  • py.typed PEP 561 marker

Source files (.cu, .cuh, .h) are excluded from wheels via wheel.exclude in pyproject.toml. They are included in the source distribution for self-compilation.

CUDA architectures#

CUDA 12 wheels target: 75 (Turing), 80 (Ampere), 86 (Ampere), 89 (Ada), 90 (Hopper + PTX for forward compatibility). CUDA 13 wheels target: 75 (Turing), 80 (Ampere), 86 (Ampere), 89 (Ada), 90 (Hopper), 100 (Blackwell), 120 (Blackwell).

Source builds (pip install rapids-singlecell) compile for the local GPU architecture by default (CMAKE_CUDA_ARCHITECTURES=native).

Docker containers#

The docker/ directory contains two types of Dockerfiles:

User-facing containers (for running rapids-singlecell):

File

Purpose

Dockerfile.deps

Base image with conda RAPIDS environment + pip dependencies. Uses nvidia/cuda:*-devel for CUDA toolkit access.

Dockerfile

Final image that builds on rapids-singlecell-deps and compiles rapids-singlecell from source for all supported GPU architectures.

These are built by docker-push.sh, which strips the rapids-singlecell pip line from the conda environment file and builds both images in sequence.

CI manylinux images (for building PyPI wheels):

File

Purpose

manylinux_2_28_x86_64_cuda12.2.Dockerfile

x86_64 build image with CUDA 12.2 toolkit

manylinux_2_28_aarch64_cuda12.2.Dockerfile

aarch64 build image with CUDA 12.2 toolkit

manylinux_2_28_x86_64_cuda13.0.Dockerfile

x86_64 build image with CUDA 13.0 toolkit

manylinux_2_28_aarch64_cuda13.0.Dockerfile

aarch64 build image with CUDA 13.0 toolkit

These are based on quay.io/pypa/manylinux_2_28 and only install the CUDA toolkit packages needed for compilation (nvcc, cudart, cublas, cusparse). They are used by cibuildwheel in publish.yml to produce portable wheels.

Release process#

  1. Tag the release: git tag v0.X.Y (or v0.X.Yrc1 for release candidates)

  2. Create a GitHub release from the tag

  3. The publish.yml workflow builds wheels + sdist and uploads to PyPI via trusted publishing

  4. Pre-releases (rc, beta, alpha) are automatically recognized by PyPI – users must opt in with pip install --pre