pertpy-GPU: ptg

pertpy provides tools for perturbation analysis [HJM+25]. rapids_singlecell.ptg accelerates some of these methods.

Distance

Distance([metric, layer_key, obsm_key])

GPU-accelerated distance computation between groups of cells.

class rapids_singlecell.ptg.Distance(metric='edistance', layer_key=None, obsm_key=None, **kwargs)

GPU-accelerated distance computation between groups of cells.

API-compatible with pertpy’s Distance class.

Currently supported metrics:

  • "edistance": Energy distance (default).

    Twice the mean pairwise distance between cells of two groups minus the mean pairwise distance between cells within each group. See Peidli et al. (2023).

  • "euclidean" and "root_mean_squared_error": Euclidean distance between group mean vectors.

  • "mse": Mean squared distance between group mean vectors.

  • "mean_absolute_error": Mean absolute distance between group mean vectors.

  • "pearson_distance": Pearson distance between group mean vectors.

  • "cosine_distance": Cosine distance between group mean vectors.

  • "r2_distance": One minus the coefficient of determination between group mean vectors.

  • "wasserstein": Entropy-regularized 2-Wasserstein via Sinkhorn.

    Squared-Euclidean ground cost; per-pair auto-epsilon defaulting to 0.05 * std(C) to match OTT-JAX. Returns OTT’s reg_ot_cost value.
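The metric definitions above can be illustrated with a small CPU-only NumPy sketch. It is not the library's GPU implementation. The helper names are hypothetical, and the sketch assumes a plain Euclidean cell-level distance with within-group means taken over all ordered pairs (zero diagonal included); the library may use a different cell-level distance or normalization.

```python
import numpy as np


def pairwise_euclidean(A, B):
    """Dense matrix of Euclidean distances between rows of A and rows of B."""
    diff = A[:, None, :] - B[None, :, :]
    return np.sqrt((diff**2).sum(axis=-1))


def energy_distance(X, Y):
    """2 * mean d(X, Y) - mean d(X, X) - mean d(Y, Y).

    Assumption: Euclidean cell-level distance, means over all ordered pairs.
    """
    delta = pairwise_euclidean(X, Y).mean()
    sigma_x = pairwise_euclidean(X, X).mean()
    sigma_y = pairwise_euclidean(Y, Y).mean()
    return 2.0 * delta - sigma_x - sigma_y


def mean_vector_metrics(X, Y):
    """A few of the pseudobulk metrics, computed on the two group means."""
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    diff = mx - my
    cosine = 1.0 - (mx @ my) / (np.linalg.norm(mx) * np.linalg.norm(my))
    return {
        "euclidean": float(np.linalg.norm(diff)),
        "mse": float(np.mean(diff**2)),
        "mean_absolute_error": float(np.mean(np.abs(diff))),
        "cosine_distance": float(cosine),
    }


# Synthetic stand-ins for two groups of cells in a 5-dimensional embedding.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(50, 5))
Y = rng.normal(2.0, 1.0, size=(60, 5))

d_xy = energy_distance(X, Y)  # positive for well-separated groups
d_xx = energy_distance(X, X)  # zero for identical groups
metrics = mean_vector_metrics(X, Y)
```

The dense pairwise matrix is only practical for small groups; it is used here to keep the definition readable.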

Parameters:
metric Literal['edistance', 'euclidean', 'root_mean_squared_error', 'mse', 'mean_absolute_error', 'pearson_distance', 'cosine_distance', 'r2_distance', 'wasserstein'] (default: 'edistance')

Distance metric to use.

layer_key str | None (default: None)

Key in adata.layers for cell data. Mutually exclusive with obsm_key.

obsm_key str | None (default: None)

Key in adata.obsm for embeddings. Mutually exclusive with layer_key. Defaults to "X_pca" if neither is specified.

Notes

The edistance bootstrap implementation differs from pertpy: rather than precomputing an n×n cell distance matrix and sampling from it, this implementation resamples cells and recomputes distances from scratch each iteration. This scales better for large datasets (O(n) vs O(n²) memory) and leverages multi-GPU parallelism for each bootstrap iteration.
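The resample-and-recompute strategy described above can be sketched on the CPU as follows. This is a minimal illustration, not the library's code: `bootstrap_distance` and `mean_distance` are hypothetical names, the placeholder metric is the Euclidean distance between group means, and the variance is the population variance (ddof=0), which may differ from what the library reports.

```python
import numpy as np


def mean_distance(X, Y):
    """Placeholder metric: Euclidean distance between group means."""
    return float(np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0)))


def bootstrap_distance(X, Y, dist_fn, n_bootstrap=100, random_state=0):
    """Resample cells with replacement and recompute the distance each time.

    No n x n distance matrix is cached; only the resampled rows are held.
    """
    rng = np.random.default_rng(random_state)
    values = np.empty(n_bootstrap)
    for i in range(n_bootstrap):
        xb = X[rng.integers(0, len(X), size=len(X))]
        yb = Y[rng.integers(0, len(Y), size=len(Y))]
        values[i] = dist_fn(xb, yb)
    return values.mean(), values.var()


rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(200, 4))
Y = rng.normal(1.0, 1.0, size=(150, 4))
boot_mean, boot_var = bootstrap_distance(X, Y, mean_distance, n_bootstrap=50)
```

Because each iteration is independent, the loop body is the natural unit to distribute across devices.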

"edistance" and "wasserstein" use multi-GPU (pairs are split across devices). Pseudobulk metrics aggregate cells into K group-mean vectors before computing distances, and the resulting K×K kernel is cheap enough on a single GPU that distributing it is not worth the cost. Passing multi_gpu=True for those metrics falls back to a single device with a warning.

Examples

>>> import rapids_singlecell as rsc
>>> distance = rsc.ptg.Distance(metric='edistance')
>>> result = distance.pairwise(adata, groupby='perturbation')
>>> # Direct computation on arrays
>>> d = distance(X, Y)

Methods

pairwise(adata, groupby, *[, groups, ...])

Compute pairwise distances between all cell groups.

onesided_distances(adata, groupby, ...[, ...])

Compute distances from one selected group to all other groups.

contrast_distances(adata, contrasts, *[, ...])

Compute distances for contrasts.

create_contrasts(adata, groupby, ...[, ...])

Build a contrasts DataFrame for use with contrast_distances().

bootstrap(X, Y, *[, n_bootstrap, random_state])

Compute bootstrap mean and variance for distance between two arrays.

__call__(X, Y)

Compute distance between two cell groups directly from arrays.

This provides a pertpy-compatible API for direct distance computation.

Parameters:
X np.ndarray | cp.ndarray

First array of shape (n_samples_x, n_features)

Y np.ndarray | cp.ndarray

Second array of shape (n_samples_y, n_features)

Return type:

float

Returns:

float Distance between X and Y

Examples

>>> distance = Distance(metric='edistance')
>>> X = adata.obsm["X_pca"][adata.obs["group"] == "A"]
>>> Y = adata.obsm["X_pca"][adata.obs["group"] == "B"]
>>> d = distance(X, Y)

pairwise(adata, groupby, *, groups=None, bootstrap=False, n_bootstrap=100, random_state=0, multi_gpu=None)

Compute pairwise distances between all cell groups.

Parameters:
adata AnnData

Annotated data matrix

groupby str

Key in adata.obs for grouping cells

groups Sequence[str] | None (default: None)

Specific groups to compute (if None, use all)

bootstrap bool (default: False)

Whether to compute bootstrap variance estimates

n_bootstrap int (default: 100)

Number of bootstrap iterations (if bootstrap=True)

random_state int (default: 0)

Random seed for reproducibility

multi_gpu bool | list[int] | str | None (default: None)

GPU selection:

  • None: Use all GPUs if metric supports it, else GPU 0 (default)

  • True: Use all available GPUs

  • False: Use only GPU 0

  • list[int]: Use specific GPU IDs (e.g., [0, 2])

  • str: Comma-separated GPU IDs (e.g., "0,2")
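As a rough illustration of how the accepted multi_gpu values map to device IDs, the sketch below uses a hypothetical helper, `resolve_gpu_ids`, that is not part of rapids_singlecell. It assumes a metric that supports multi-GPU execution and takes the number of visible devices as an explicit argument instead of querying the GPU runtime.

```python
def resolve_gpu_ids(multi_gpu, n_available):
    """Hypothetical helper: turn a multi_gpu argument into a list of device IDs."""
    all_ids = list(range(n_available))
    if multi_gpu is None or multi_gpu is True:
        return all_ids  # assumes the metric supports multi-GPU
    if multi_gpu is False:
        return [0]
    if isinstance(multi_gpu, str):
        # Comma-separated form, e.g. "0,2"
        ids = [int(part) for part in multi_gpu.split(",") if part.strip()]
    else:
        ids = [int(i) for i in multi_gpu]
    unknown = [i for i in ids if i not in all_ids]
    if unknown:
        raise ValueError(f"Unknown GPU IDs: {unknown}")
    return ids


ids_default = resolve_gpu_ids(None, 4)
ids_single = resolve_gpu_ids(False, 4)
ids_from_str = resolve_gpu_ids("0,2", 4)
ids_from_list = resolve_gpu_ids([0, 2], 4)
```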

Returns:

result DataFrame with pairwise distances. If bootstrap=True, returns a tuple of (distances, distances_var) DataFrames.

Examples

>>> distance = Distance(metric='edistance')
>>> result = distance.pairwise(adata, groupby='condition')

onesided_distances(adata, groupby, selected_group, *, groups=None, bootstrap=False, n_bootstrap=100, random_state=0, multi_gpu=None)

Compute distances from one selected group to all other groups.

Parameters:
adata AnnData

Annotated data matrix

groupby str

Key in adata.obs for grouping cells

selected_group Sequence[str] | str

Reference group to compute distances from

groups Sequence[str] | None (default: None)

Specific groups to compute distances to (if None, use all)

bootstrap bool (default: False)

Whether to compute bootstrap variance estimates

n_bootstrap int (default: 100)

Number of bootstrap iterations (if bootstrap=True)

random_state int (default: 0)

Random seed for reproducibility

multi_gpu bool | list[int] | str | None (default: None)

GPU selection:

  • None: Use all GPUs if metric supports it, else GPU 0 (default)

  • True: Use all available GPUs

  • False: Use only GPU 0

  • list[int]: Use specific GPU IDs (e.g., [0, 2])

  • str: Comma-separated GPU IDs (e.g., "0,2")

Return type:

Series | DataFrame | tuple[Series, Series] | tuple[DataFrame, DataFrame]

Returns:

distances Series containing distances from selected_group to all other groups. If bootstrap=True, returns a tuple of (distances, distances_var).

Examples

>>> distance = Distance(metric='edistance')
>>> distances = distance.onesided_distances(
...     adata, groupby='condition', selected_group='control'
... )

contrast_distances(adata, contrasts, *, multi_gpu=None)

Compute distances for contrasts.

Accepts a DataFrame (from create_contrasts() or constructed manually) with the following layout:

  • First column: the groupby column (target values to compare)

  • reference column: the control value in the groupby column

  • Other columns: split-by filters (e.g., cell type)

Parameters:
adata AnnData

Annotated data matrix

contrasts DataFrame

DataFrame with a groupby column, a reference column, and optional split columns.

multi_gpu bool | list[int] | str | None (default: None)

GPU selection:

  • None: Use all GPUs if metric supports it, else GPU 0 (default)

  • True: Use all available GPUs

  • False: Use only GPU 0

  • list[int]: Use specific GPU IDs (e.g., [0, 2])

  • str: Comma-separated GPU IDs (e.g., "0,2")

Return type:

DataFrame

Returns:

pd.DataFrame Copy of the input DataFrame with an added distance column.

Examples

>>> distance = Distance(metric='edistance')
>>> # Using create_contrasts helper
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target",
...     split_by="group_name",
... )
>>> result = distance.contrast_distances(adata, contrasts=contrasts)
>>> # Manual DataFrame construction
>>> import pandas as pd
>>> contrasts = pd.DataFrame({
...     "target_gene": ["Irf7", "Ski"],
...     "reference": ["Non_target", "Non_target"],
...     "group_name": ["CD4", "CD4"],
... })
>>> result = distance.contrast_distances(adata, contrasts)

static create_contrasts(adata, groupby, selected_group, *, groups=None, split_by=None)

Build a contrasts DataFrame for use with contrast_distances().

Each row represents one contrast: comparing a group against the reference, optionally within each level of split_by columns. The resulting DataFrame can be filtered or modified before passing to contrast_distances().

The output layout is:

  • First column (groupby): the target values to compare

  • reference column: the control value in the groupby column

  • Remaining columns (split_by): stratification filters

Parameters:
adata AnnData

Annotated data matrix

groupby str

Column in adata.obs whose levels are compared against selected_group

selected_group str | Sequence[str]

The reference (control) value(s) in the groupby column. When a sequence is passed, each target is compared against every reference, producing one row per (target, reference) combination.

groups Sequence[str] | None (default: None)

Specific groups to include. If None, all non-reference groups are included.

split_by str | Sequence[str] | None (default: None)

Column(s) in adata.obs to stratify by. If provided, contrasts are computed within each unique combination of these columns. Only combinations where the reference group exists are included.

Return type:

DataFrame

Returns:

pd.DataFrame One row per contrast. First column is groupby, then reference, then any split_by columns.
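To make the documented layout concrete, the following pandas sketch builds a contrasts table from a plain obs-like DataFrame. `build_contrasts` is a hypothetical stand-in, not the library's create_contrasts(). It assumes that, within each stratum, only targets actually observed there are listed; the real method may treat missing targets differently.

```python
import pandas as pd


def build_contrasts(obs, groupby, selected_group, split_by=None):
    """Hypothetical re-creation of the documented contrasts layout."""
    if isinstance(selected_group, str):
        references = [selected_group]
    else:
        references = list(selected_group)
    if split_by is None:
        split_cols = []
    elif isinstance(split_by, str):
        split_cols = [split_by]
    else:
        split_cols = list(split_by)

    # One stratum per unique split_by combination, or a single stratum overall.
    if split_cols:
        strata = [
            (key if isinstance(key, tuple) else (key,), sub)
            for key, sub in obs.groupby(split_cols, observed=True)
        ]
    else:
        strata = [((), obs)]

    rows = []
    for key, sub in strata:
        present = set(sub[groupby])
        for ref in references:
            if ref not in present:
                continue  # skip strata where the reference group is absent
            for target in sorted(present - set(references)):
                rows.append((target, ref, *key))
    return pd.DataFrame(rows, columns=[groupby, "reference", *split_cols])


obs = pd.DataFrame({
    "target_gene": ["Non_target", "Irf7", "Ski", "Irf7", "Ski"],
    "group_name": ["CD4", "CD4", "CD4", "CD8", "CD8"],
})
contrasts = build_contrasts(
    obs, groupby="target_gene", selected_group="Non_target", split_by="group_name"
)
```

In this toy table the CD8 stratum has no Non_target cells, so it contributes no rows.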

Examples

>>> # All targets vs control, ignoring celltype
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target"
... )
>>> # Multiple references
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene",
...     selected_group=["Non_target", "Scramble"],
... )
>>> # Stratified by celltype
>>> contrasts = Distance.create_contrasts(
...     adata, groupby="target_gene", selected_group="Non_target",
...     split_by="group_name",
... )
>>> # Filter before computing
>>> contrasts = contrasts[contrasts["group_name"] != "rare_type"]
>>> distance = Distance(metric='edistance')
>>> result = distance.contrast_distances(adata, contrasts=contrasts)
>>> # Manual construction (no helper needed)
>>> import pandas as pd
>>> contrasts = pd.DataFrame({
...     "target_gene": ["Irf7", "Ski"],
...     "reference": ["Non_target", "Non_target"],
...     "group_name": ["CD4", "CD4"],
... })

bootstrap(X, Y, *, n_bootstrap=100, random_state=0)

Compute bootstrap mean and variance for distance between two arrays.

This provides a pertpy-compatible API for bootstrap computation directly on arrays without requiring an AnnData object.

Parameters:
X np.ndarray | cp.ndarray

First array of shape (n_samples_x, n_features)

Y np.ndarray | cp.ndarray

Second array of shape (n_samples_y, n_features)

n_bootstrap int (default: 100)

Number of bootstrap iterations

random_state int (default: 0)

Random seed for reproducibility

Return type:

MeanVar

Returns:

result Named tuple containing mean and variance of bootstrapped distances

Examples

>>> distance = Distance(metric='edistance')
>>> X = adata.obsm["X_pca"][adata.obs["group"] == "A"]
>>> Y = adata.obsm["X_pca"][adata.obs["group"] == "B"]
>>> result = distance.bootstrap(X, Y, n_bootstrap=100)
>>> print(f"Distance: {result.mean:.3f} ± {result.variance**0.5:.3f}")

GuideAssignment

GuideAssignment()

GPU-accelerated guide RNA assignment.

class rapids_singlecell.ptg.GuideAssignment

GPU-accelerated guide RNA assignment.

Provides threshold-based and mixture-model-based methods for assigning cells to guide RNAs, compatible with pertpy’s GuideAssignment API. The mixture model follows crispat’s Poisson-Gaussian assignment rule while using batched EM on GPU instead of per-guide Pyro SVI, yielding orders-of-magnitude speedup.

Methods

assign_by_threshold(adata, *, ...[, layer, ...])

Assign cells to gRNAs exceeding a count threshold.

assign_to_max_guide(adata, *, ...[, layer, ...])

Assign each cell to its most expressed gRNA.

assign_mixture_model(adata, *[, layer, ...])

Assign gRNAs using a GPU-accelerated Poisson–Gaussian mixture model.

assign_by_threshold(adata, *, assignment_threshold, layer=None, output_layer='assigned_guides')

Assign cells to gRNAs exceeding a count threshold.

Each cell is assigned to every gRNA with at least assignment_threshold counts. Expects unnormalized count data.
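The thresholding rule can be shown on a small dense count matrix. This is a NumPy illustration with made-up counts, not the library's GPU code, which also has to handle sparse input and store the result in adata.layers.

```python
import numpy as np

# Rows are cells, columns are gRNAs; values are raw counts (made-up data).
counts = np.array([
    [0, 7, 1],
    [5, 5, 0],
    [2, 0, 0],
])
assignment_threshold = 5

# A cell is assigned to every gRNA whose count reaches the threshold.
assigned = (counts >= assignment_threshold).astype(np.int8)
```

The second cell is assigned to two guides, and the third cell to none.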

Parameters:
adata AnnData

Annotated data matrix of shape n_obs x n_vars.

assignment_threshold float

Minimum count for a viable assignment.

layer str | None (default: None)

Layer with raw counts. Uses adata.X if None.

output_layer str (default: 'assigned_guides')

Key under which the binary assignment matrix is stored in adata.layers.

Return type:

None

assign_to_max_guide(adata, *, assignment_threshold, layer=None, obs_key='assigned_guide', no_grna_assigned_key='Negative')

Assign each cell to its most expressed gRNA.

Each cell is assigned to the gRNA with the highest count, provided that count is at least assignment_threshold. Expects unnormalized count data.
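A NumPy illustration of this rule on made-up counts is shown below. The guide names are hypothetical, and ties are resolved by taking the first maximum, which is an assumption about tie handling and not documented behavior.

```python
import numpy as np

guide_names = np.array(["gRNA_A", "gRNA_B", "gRNA_C"])
# Rows are cells, columns are gRNAs; values are raw counts (made-up data).
counts = np.array([
    [0, 7, 1],
    [5, 6, 0],
    [2, 0, 0],
])
assignment_threshold = 5

best = counts.argmax(axis=1)  # index of the most expressed gRNA per cell
best_count = counts[np.arange(counts.shape[0]), best]

# Keep the top guide only if its count reaches the threshold.
labels = np.where(best_count >= assignment_threshold, guide_names[best], "Negative")
```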

Parameters:
adata AnnData

Annotated data matrix of shape n_obs x n_vars.

assignment_threshold float

Minimum count for a viable assignment.

layer str | None (default: None)

Layer with raw counts. Uses adata.X if None.

obs_key str (default: 'assigned_guide')

Column in adata.obs where the assignment is stored.

no_grna_assigned_key str (default: 'Negative')

Label for cells with no guide above threshold.

Return type:

None

assign_mixture_model(adata, *, layer=None, assigned_guides_key='assigned_guide', no_grna_assigned_key='negative', max_assignments_per_cell=5, multiple_grna_assigned_key='multiple', multiple_grna_assignment_string='+', only_return_results=False, max_iter=90, tol=0.0001, posterior_threshold=0.5)

Assign gRNAs using a GPU-accelerated Poisson–Gaussian mixture model.

Fits a two-component mixture (Poisson background + Gaussian signal) to the log₂-transformed non-zero counts of each guide simultaneously using batched Expectation-Maximization on GPU. Like crispat’s Poisson-Gaussian assignment, the fitted model is converted to an integer raw-count threshold. The default posterior cutoff matches pertpy’s crispat-style threshold rule.

Parameters:
adata AnnData

Annotated data matrix with guide RNA counts.

layer str | None (default: None)

Layer with raw counts. Uses adata.X if None.

assigned_guides_key str (default: 'assigned_guide')

Key in adata.obs for storing the assignment result.

no_grna_assigned_key str (default: 'negative')

Label for cells negative for all gRNAs.

max_assignments_per_cell int (default: 5)

Maximum number of gRNAs a cell can be assigned to.

multiple_grna_assigned_key str (default: 'multiple')

Label for cells exceeding max_assignments_per_cell.

multiple_grna_assignment_string str (default: '+')

Delimiter for joining multiple guide names.

only_return_results bool (default: False)

If True, return assignments without modifying adata.

max_iter int (default: 90)

Maximum number of EM iterations.

tol float (default: 0.0001)

Convergence tolerance on parameter changes.

posterior_threshold float (default: 0.5)

Minimum posterior probability of the Gaussian component required for a raw UMI count to define the assignment threshold.

Return type:

ndarray | None

Returns:

If only_return_results is True, returns an array of assignments. Otherwise modifies adata in-place and returns None.
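The labelling parameters above can be illustrated independently of the mixture fit. The sketch below starts from a binary cell-by-guide matrix, such as one obtained by applying per-guide count thresholds, and builds the per-cell labels. `build_labels` is a hypothetical helper, and the exact label formatting in the library may differ.

```python
import numpy as np


def build_labels(
    assigned,
    guide_names,
    *,
    no_grna_assigned_key="negative",
    max_assignments_per_cell=5,
    multiple_grna_assigned_key="multiple",
    multiple_grna_assignment_string="+",
):
    """Hypothetical helper: turn a binary assignment matrix into per-cell labels."""
    labels = []
    for row in np.asarray(assigned, dtype=bool):
        names = [guide_names[j] for j in np.flatnonzero(row)]
        if not names:
            labels.append(no_grna_assigned_key)
        elif len(names) > max_assignments_per_cell:
            labels.append(multiple_grna_assigned_key)
        else:
            labels.append(multiple_grna_assignment_string.join(names))
    return np.array(labels)


guide_names = ["g1", "g2", "g3"]
assigned = np.array([
    [0, 0, 0],  # no guide
    [1, 0, 0],  # one guide
    [1, 0, 1],  # two guides, joined with "+"
    [1, 1, 1],  # more than max_assignments_per_cell
])
labels = build_labels(assigned, guide_names, max_assignments_per_cell=2)
```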