rapids_singlecell.ptg.Distance#

class rapids_singlecell.ptg.Distance(metric='edistance', layer_key=None, obsm_key=None, **kwargs)[source]#

GPU-accelerated distance computation between groups of cells.

API compatible with pertpy’s Distance class.

Currently supported metrics:

  • "edistance": Energy distance (default).

    Twice the mean pairwise distance between cells of two groups minus the mean pairwise distance between cells within each group. See Peidli et al. (2023). Accepts dense embeddings (e.g. obsm_key="X_pca") or sparse CSR expression data (a sparse layer or layer_key="X"), which is densified inside the kernel rather than on the host.

  • "euclidean" and "root_mean_squared_error": Euclidean distance

    between group mean vectors.

  • "mse": Mean squared distance between group mean vectors.

  • "mean_absolute_error": Mean absolute distance between group mean

    vectors.

  • "pearson_distance": Pearson distance between group mean vectors.

  • "cosine_distance": Cosine distance between group mean vectors.

  • "r2_distance": One minus the coefficient of determination between

    group mean vectors.

  • "wasserstein": Entropy-regularized 2-Wasserstein via Sinkhorn.

    Squared-Euclidean ground cost; per-pair auto-epsilon defaulting to 0.05 * std(C) to match OTT-JAX. Returns OTT’s reg_ot_cost value.

Parameters:
metric Literal['edistance', 'euclidean', 'root_mean_squared_error', 'mse', 'mean_absolute_error', 'pearson_distance', 'cosine_distance', 'r2_distance', 'wasserstein'] (default: 'edistance')

Distance metric to use.

layer_key str | None (default: None)

Key in adata.layers for cell data, or "X" to use adata.X. Mutually exclusive with obsm_key.

obsm_key str | None (default: None)

Key in adata.obsm for embeddings. Mutually exclusive with layer_key. Defaults to "X_pca" if neither is specified.

Notes

The edistance bootstrap implementation differs from pertpy: rather than precomputing an n×n cell distance matrix and sampling from it, this implementation resamples cells and recomputes distances from scratch each iteration. This scales better for large datasets (O(n) vs O(n²) memory) and leverages multi-GPU parallelism for each bootstrap iteration.

"edistance" and "wasserstein" use multi-GPU (pairs are split across devices). Pseudobulk metrics aggregate cells into K group-mean vectors before computing distances, and the resulting K×K kernel is cheap enough on a single GPU that distributing it is not worth the cost. Passing multi_gpu=True for those metrics falls back to a single device with a warning.

Examples

>>> import rapids_singlecell as rsc
>>> distance = rsc.ptg.Distance(metric='edistance')
>>> result = distance.pairwise(adata, groupby='perturbation')
>>> # Direct computation on arrays
>>> d = distance(X, Y)

Methods table#

bootstrap(X, Y, *[, n_bootstrap, random_state])

Compute bootstrap mean and variance for distance between two arrays.

contrast_distances(adata, contrasts, *[, ...])

Compute distances for contrasts.

create_contrasts(adata, groupby, ...[, ...])

Build a contrasts DataFrame for use with contrast_distances().

onesided_distances(adata, groupby, ...[, ...])

Compute distances from one selected group to all other groups.

pairwise(adata, groupby, *[, groups, ...])

Compute pairwise distances between all cell groups.

validate_contrasts(adata, contrasts)

Validate a contrasts DataFrame against an AnnData object.

Methods#

bootstrap#

Distance.bootstrap(X, Y, *, n_bootstrap=100, random_state=0)[source]#

Compute bootstrap mean and variance for distance between two arrays.

This provides pertpy-compatible API for bootstrap computation directly on arrays without requiring an AnnData object.

Parameters:
X np.ndarray | cp.ndarray

First array of shape (n_samples_x, n_features)

Y np.ndarray | cp.ndarray

Second array of shape (n_samples_y, n_features)

n_bootstrap int (default: 100)

Number of bootstrap iterations

random_state int (default: 0)

Random seed for reproducibility

Return type:

MeanVar

Returns:

result Named tuple containing mean and variance of bootstrapped distances

Examples

>>> distance = Distance(metric='edistance')
>>> X = adata.obsm["X_pca"][adata.obs["group"] == "A"]
>>> Y = adata.obsm["X_pca"][adata.obs["group"] == "B"]
>>> result = distance.bootstrap(X, Y, n_bootstrap=100)
>>> print(f"Distance: {result.mean:.3f} ± {result.variance**0.5:.3f}")

contrast_distances#

Distance.contrast_distances(adata, contrasts, *, multi_gpu=None)[source]#

Compute distances for contrasts.

Accepts a DataFrame (from create_contrasts() or constructed manually) with the following layout:

  • First column: the groupby column (target values to compare)

  • ``reference`` column: the control value in the groupby column

  • Other columns: split-by filters (e.g., cell type)

Parameters:
adata AnnData

Annotated data matrix

contrasts DataFrame

DataFrame with a groupby column, a reference column, and optional split columns.

multi_gpu bool | list[int] | str | None (default: None)

GPU selection: - None: Use all GPUs if metric supports it, else GPU 0 (default) - True: Use all available GPUs - False: Use only GPU 0 - list[int]: Use specific GPU IDs (e.g., [0, 2]) - str: Comma-separated GPU IDs (e.g., “0,2”)

Return type:

DataFrame

Returns:

pd.DataFrame Copy of the input DataFrame with an added distance column.

Examples

>>> distance = Distance(metric='edistance')
>>> # Using create_contrasts helper
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target",
...     split_by="group_name",
... )
>>> result = distance.contrast_distances(adata, contrasts=contrasts)
>>> # Manual DataFrame construction
>>> import pandas as pd
>>> contrasts = pd.DataFrame({
...     "target_gene": ["Irf7", "Ski"],
...     "reference": ["Non_target", "Non_target"],
...     "group_name": ["CD4", "CD4"],
... })
>>> result = distance.contrast_distances(adata, contrasts)

create_contrasts#

static Distance.create_contrasts(adata, groupby, selected_group, *, groups=None, split_by=None)[source]#

Build a contrasts DataFrame for use with contrast_distances().

Each row represents one contrast: comparing a group against the reference, optionally within each level of split_by columns. The resulting DataFrame can be filtered or modified before passing to contrast_distances().

The output layout is:

  • First column (groupby): the target values to compare

  • ``reference`` column: the control value in the groupby column

  • Remaining columns (split_by): stratification filters

Parameters:
adata AnnData

Annotated data matrix

groupby str

Column in adata.obs whose levels are compared against selected_group

selected_group str | Sequence[str]

The reference (control) value(s) in the groupby column. When a sequence is passed, each target is compared against every reference, producing one row per (target, reference) combination.

groups Sequence[str] | None (default: None)

Specific groups to include. If None, all non-reference groups are included.

split_by str | Sequence[str] | None (default: None)

Column(s) in adata.obs to stratify by. If provided, contrasts are computed within each unique combination of these columns. Only combinations where the reference group exists are included.

Return type:

DataFrame

Returns:

pd.DataFrame One row per contrast. First column is groupby, then reference, then any split_by columns.

Examples

>>> # All targets vs control, ignoring celltype
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target"
... )
>>> # Multiple references
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene",
...     selected_group=["Non_target", "Scramble"],
... )
>>> # Stratified by celltype
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target",
...     split_by="group_name",
... )
>>> # Filter before computing
>>> contrasts = contrasts[contrasts["group_name"] != "rare_type"]
>>> result = distance.contrast_distances(adata, contrasts=contrasts)
>>> # Manual construction (no helper needed)
>>> import pandas as pd
>>> contrasts = pd.DataFrame({
...     "target_gene": ["Irf7", "Ski"],
...     "reference": ["Non_target", "Non_target"],
...     "group_name": ["CD4", "CD4"],
... })

onesided_distances#

Distance.onesided_distances(adata, groupby, selected_group, *, groups=None, bootstrap=False, n_bootstrap=100, random_state=0, multi_gpu=None)[source]#

Compute distances from one selected group to all other groups.

Parameters:
adata AnnData

Annotated data matrix

groupby str

Key in adata.obs for grouping cells

selected_group Sequence[str] | str

Reference group to compute distances from

groups Sequence[str] | None (default: None)

Specific groups to compute distances to (if None, use all)

bootstrap bool (default: False)

Whether to compute bootstrap variance estimates

n_bootstrap int (default: 100)

Number of bootstrap iterations (if bootstrap=True)

random_state int (default: 0)

Random seed for reproducibility

multi_gpu bool | list[int] | str | None (default: None)

GPU selection: - None: Use all GPUs if metric supports it, else GPU 0 (default) - True: Use all available GPUs - False: Use only GPU 0 - list[int]: Use specific GPU IDs (e.g., [0, 2]) - str: Comma-separated GPU IDs (e.g., “0,2”)

Return type:

Series | DataFrame | tuple[Series, Series] | tuple[DataFrame, DataFrame]

Returns:

distances Series containing distances from selected_group to all other groups. If bootstrap=True, returns tuple of (distances, distances_var).

Examples

>>> distance = Distance(metric='edistance')
>>> distances = distance.onesided_distances(
...     adata, groupby='condition', selected_group='control'
... )

pairwise#

Distance.pairwise(adata, groupby, *, groups=None, bootstrap=False, n_bootstrap=100, random_state=0, multi_gpu=None)[source]#

Compute pairwise distances between all cell groups.

Parameters:
adata AnnData

Annotated data matrix

groupby str

Key in adata.obs for grouping cells

groups Sequence[str] | None (default: None)

Specific groups to compute (if None, use all)

bootstrap bool (default: False)

Whether to compute bootstrap variance estimates

n_bootstrap int (default: 100)

Number of bootstrap iterations (if bootstrap=True)

random_state int (default: 0)

Random seed for reproducibility

multi_gpu bool | list[int] | str | None (default: None)

GPU selection: - None: Use all GPUs if metric supports it, else GPU 0 (default) - True: Use all available GPUs - False: Use only GPU 0 - list[int]: Use specific GPU IDs (e.g., [0, 2]) - str: Comma-separated GPU IDs (e.g., “0,2”)

Returns:

result DataFrame with pairwise distances. If bootstrap=True, returns tuple of (distances, distances_var) DataFrames.

Examples

>>> distance = Distance(metric='edistance')
>>> result = distance.pairwise(adata, groupby='condition')

validate_contrasts#

static Distance.validate_contrasts(adata, contrasts)[source]#

Validate a contrasts DataFrame against an AnnData object.

Expects the DataFrame layout produced by create_contrasts(): first column is the groupby column, reference column contains the control value, remaining columns are split-by filters.

Parameters:
adata AnnData

Annotated data matrix

contrasts DataFrame

DataFrame to validate

Raises:

ValueError – If validation fails.

Return type:

None