rapids_singlecell.tl.rank_genes_groups

rapids_singlecell.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, method=None, corr_method='benjamini-hochberg', tie_correct=False, use_continuity=False, return_u_values=False, layer=None, chunk_size=None, n_bins=None, bin_range=None, skip_empty_groups=False, **kwds)

Rank genes for characterizing groups using GPU acceleration.

Log-normalized (log1p) data is expected for biologically meaningful log fold changes. For 'wilcoxon', in-memory sparse inputs that contain explicit negative values are ranked through a sign-safe dense path: the CUDA sparse streamers materialize bounded dense tiles inside the nanobind path. Dense inputs are ranked directly and support any sign. 'wilcoxon_binned' rejects negative Dask sparse input, which it cannot bin correctly.

Note

Dask support: 't-test', 't-test_overestim_var', 'wilcoxon_binned', and 'logreg' support Dask arrays. The 'wilcoxon' method does not support Dask arrays.
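The Dask rule above can be written as a small pre-flight check. This is only a sketch: method_for_backend and the is_dask flag are hypothetical names introduced for illustration and are not part of rapids_singlecell; the set of method names is taken from the note.

```python
# Methods that the note above lists as supporting Dask arrays.
DASK_METHODS = frozenset({"t-test", "t-test_overestim_var", "wilcoxon_binned", "logreg"})


def method_for_backend(method: str, is_dask: bool) -> str:
    """Hypothetical helper: return `method` unchanged if it fits the array backend.

    Raises ValueError for a Dask-backed matrix combined with a method that
    the note above does not list as Dask-capable (currently 'wilcoxon').
    """
    if is_dask and method not in DASK_METHODS:
        raise ValueError(
            f"method={method!r} does not support Dask arrays; "
            f"choose one of {sorted(DASK_METHODS)}"
        )
    return method
```

The helper raises instead of silently switching 'wilcoxon' to 'wilcoxon_binned', because the binned variant is an approximation and the choice should stay with the caller.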

Note

Wilcoxon ranking precision: 'wilcoxon' and 'wilcoxon_binned' rank values in float32 on every code path, while means and log fold changes are computed in float64. This only diverges from Scanpy when the preprocessing itself ran in float64, i.e. when normalization/log1p produced values carrying sub-float32 precision. If preprocessing was done in float32 (the common case), the values are float32-exact and ranking is bit-identical to Scanpy (~1e-13), even if they are afterward stored as float64. For a fully float64 pipeline the rank-derived scores and p-values still match Scanpy-on-float64 to ~1e-4 on log-normalized data, which is below any significance threshold and changes no DE calls, because the rank-sum normal approximation is insensitive to sub-float32 tie jitter. If exact float64 ranking matters for your workflow, please open an issue at scverse/rapids_singlecell#issues.
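The effect described in this note can be reproduced with a few lines of plain Python. The sketch below is not the library's ranking code; to_float32 and midranks are hypothetical helpers that only illustrate why float32-exact values rank identically, while values that differ below float32 precision collapse into a tie.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def midranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# Values that are exactly representable in float32: ranks are unchanged.
exact = [0.0, 0.5, 1.25, 2.0]
same_ranks = midranks(exact) == midranks([to_float32(v) for v in exact])  # True

# Values that differ only below float32 precision: the cast creates a tie.
jitter = [1.0, 1.0 + 1e-10, 2.0]
ranks64 = midranks(jitter)                            # [1.0, 2.0, 3.0]
ranks32 = midranks([to_float32(v) for v in jitter])   # [1.5, 1.5, 3.0]
```

The rank sum of the first two values is 3.0 in both cases; only its split between the two cells changes, which is why the resulting scores move so little.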

Parameters:
adata AnnData

Annotated data matrix.

groupby str

The key of the observations grouping to consider.

mask_var ndarray[tuple[Any, ...], dtype[bool]] | str | None (default: None)

Select subset of genes to use in statistical tests. Can be a boolean array of shape (n_vars,) or a key in adata.var.

use_raw bool | None (default: None)

Use raw attribute of adata if present.

groups Union[Literal['all'], Iterable[str]] (default: 'all')

Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups.

reference str (default: 'rest')

If 'rest', compare each group to the union of the rest of the groups. If a group identifier, compare with respect to this group.

n_genes int | None (default: None)

The number of genes that appear in the returned tables. Defaults to all genes.

rankby_abs bool (default: False)

Rank genes by the absolute value of the score, not by the score. The returned scores are never the absolute values.

pts bool (default: False)

Compute the fraction of cells expressing the genes.

key_added str | None (default: None)

The key in adata.uns under which the results are saved. Defaults to 'rank_genes_groups'.

method Literal['logreg', 't-test', 't-test_overestim_var', 'wilcoxon', 'wilcoxon_binned'] | None (default: None)

't-test' uses Welch’s t-test (default), 't-test_overestim_var' overestimates variance of each group, 'wilcoxon' uses Wilcoxon rank-sum, 'wilcoxon_binned' uses histogram-based approximate Wilcoxon rank-sum (faster for large datasets, supports Dask arrays), 'logreg' uses logistic regression.

corr_method Literal['benjamini-hochberg', 'bonferroni'] (default: 'benjamini-hochberg')

p-value correction method. Used only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.

tie_correct bool (default: False)

Use tie correction for 'wilcoxon' and 'wilcoxon_binned' scores. Adjusts the variance of the rank-sum statistic for tied values. For 'wilcoxon_binned', each histogram bin acts as a tie group and the correction is derived from the bin counts.

use_continuity bool (default: False)

Apply continuity correction to 'wilcoxon' and 'wilcoxon_binned' z-scores. Subtracts 0.5 from |R - E[R]| before dividing by the standard deviation, matching scipy.stats.mannwhitneyu() default behavior.

return_u_values bool (default: False)

For 'wilcoxon', store Mann-Whitney U statistics in scores instead of z-scores. P-values are still computed from the z-score normal approximation using the selected tie and continuity settings.

layer str | None (default: None)

Key in adata.layers whose values are used for the tests.

chunk_size int | None (default: None)

Number of genes to process at once for 'wilcoxon' and 'wilcoxon_binned'. Default is 512 for 'wilcoxon'. For 'wilcoxon_binned' the default is sized dynamically based on n_groups and n_bins to keep histogram memory stable.

n_bins int | None (default: None)

Number of histogram bins for 'wilcoxon_binned'. Higher values give a better approximation at slightly increased cost. Default is 1000 for in-memory arrays and 200 for Dask arrays.

bin_range Optional[Literal['log1p', 'auto']] (default: None)

How to determine the histogram bin range for 'wilcoxon_binned'. None (default) uses 'auto' for in-memory arrays and 'log1p' for Dask arrays (to avoid a costly data scan). 'log1p' uses a fixed [0, 15] range suitable for most log1p-normalized data. 'auto' computes the actual data range; use it for nonnegative expression data that falls outside the fixed log1p range.

skip_empty_groups bool (default: False)

Skip selected groups with fewer than two observations after filtering. This is useful for perturbation workflows where a per-cell-type slice keeps categories that are empty or singleton in that slice.

**kwds

Additional arguments passed to the method. For 'logreg', these are passed to cuml.linear_model.LogisticRegression.
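How tie_correct, use_continuity, and return_u_values relate to each other can be shown on a toy example. The following is a minimal CPU sketch of the textbook rank-sum normal approximation, not the GPU implementation: rank_sum_z is a hypothetical helper, and clamping the continuity-corrected difference at zero is an assumption of this sketch rather than documented behavior.

```python
import math
from collections import Counter


def rank_sum_z(group, rest, *, tie_correct=False, use_continuity=False):
    """Return (z, U) for `group` versus `rest` under the normal approximation."""
    n1, n2 = len(group), len(rest)
    n = n1 + n2
    counts = Counter(list(group) + list(rest))

    # Midrank of every distinct value: tied values share their mean position.
    rank_of, seen = {}, 0
    for value in sorted(counts):
        t = counts[value]
        rank_of[value] = seen + (t + 1) / 2
        seen += t

    rank_sum = sum(rank_of[v] for v in group)   # R
    expected = n1 * (n + 1) / 2                 # E[R]
    variance = n1 * n2 * (n + 1) / 12           # Var[R] without ties

    if tie_correct:
        # Shrink the variance by the standard tie term sum(t^3 - t) / (n^3 - n).
        tie_term = sum(t**3 - t for t in counts.values())
        variance *= 1 - tie_term / (n**3 - n)

    diff = rank_sum - expected
    if use_continuity:
        # Subtract 0.5 from |R - E[R]|; the clamp at 0 is an assumption here.
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)

    z = diff / math.sqrt(variance)
    u = rank_sum - n1 * (n1 + 1) / 2            # Mann-Whitney U of `group`
    return z, u


z_plain, u_plain = rank_sum_z([3, 4, 5], [1, 2])
z_cont, _ = rank_sum_z([3, 4, 5], [1, 2], use_continuity=True)
z_tied, _ = rank_sum_z([1, 1, 2], [1, 2], tie_correct=True)
```

For the first call the rank sum is 12 against an expectation of 9 with variance 3, so z is sqrt(3) and U is 6; the continuity correction lowers the numerator from 3 to 2.5. With return_u_values=True the function described above stores U in scores, while p-values still come from z.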

Return type:

None

Returns:

Updates adata with the following fields. Rank result fields are Scanpy-compatible structured arrays.

adata.uns['rank_genes_groups' | key_added]['names']

Structured array to be indexed by group id storing the gene names. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['scores']

Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group, or the Mann-Whitney U statistic when return_u_values=True. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['logfoldchanges']

Structured array to be indexed by group id storing the log2 fold change for each gene for each group.

adata.uns['rank_genes_groups' | key_added]['pvals']

p-values. Only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.

adata.uns['rank_genes_groups' | key_added]['pvals_adj']

Corrected p-values. Only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.

adata.uns['rank_genes_groups' | key_added]['pts']

Fraction of cells expressing genes per group. Only if pts=True.

adata.uns['rank_genes_groups' | key_added]['pts_rest']

Fraction of cells expressing genes in rest. Only if pts=True and reference='rest'.
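A typical call and a way to read the result back might look as follows. This is a hedged sketch: the 'leiden' grouping column, the 'de_wilcoxon' key, and the rank_clusters helper are hypothetical choices for illustration, while the keyword arguments come from the signature above. The import happens inside the function because rapids_singlecell needs a CUDA-capable GPU.

```python
# Keyword arguments taken from the documented signature.
WILCOXON_KWARGS = {
    "method": "wilcoxon",
    "tie_correct": True,
    "use_continuity": True,
    "corr_method": "benjamini-hochberg",
    "pts": True,
    "key_added": "de_wilcoxon",  # hypothetical key name
}


def rank_clusters(adata, groupby="leiden", n_top=5):
    """Run the test in place and return the top gene names for each group.

    `groupby` must name a column in adata.obs; 'leiden' is only an example.
    """
    import rapids_singlecell as rsc  # lazy import: requires a GPU environment

    rsc.tl.rank_genes_groups(adata, groupby, **WILCOXON_KWARGS)

    # The function returns None; results live in adata.uns under key_added.
    result = adata.uns[WILCOXON_KWARGS["key_added"]]
    names = result["names"]  # structured array indexed by group id
    return {group: list(names[group][:n_top]) for group in names.dtype.names}
```

Because the result fields are Scanpy-compatible structured arrays, the same indexing pattern applies to 'scores', 'logfoldchanges', 'pvals', and 'pvals_adj'.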