rapids_singlecell.tl.rank_genes_groups

rapids_singlecell.tl.rank_genes_groups(adata, groupby, *, mask_var=None, use_raw=None, groups='all', reference='rest', n_genes=None, rankby_abs=False, pts=False, key_added=None, method=None, corr_method='benjamini-hochberg', tie_correct=False, use_continuity=False, layer=None, chunk_size=None, pre_load=False, n_bins=None, bin_range=None, **kwds)

Rank genes for characterizing groups using GPU acceleration.

Expects logarithmized data.

Note

Dask support: 't-test', 't-test_overestim_var', and 'wilcoxon_binned' support Dask arrays. The 'wilcoxon' and 'logreg' methods do not support Dask arrays.

Parameters:
adata AnnData

Annotated data matrix.

groupby str

The key of the observations grouping to consider.

mask_var ndarray[tuple[Any, ...], dtype[bool]] | str | None (default: None)

Select subset of genes to use in statistical tests. Can be a boolean array of shape (n_vars,) or a key in adata.var.

use_raw bool | None (default: None)

Use the raw attribute of adata if present.

groups Union[Literal['all'], Iterable[str]] (default: 'all')

Subset of groups, e.g. ['g1', 'g2', 'g3'], to which comparison shall be restricted, or 'all' (default), for all groups.

reference str (default: 'rest')

If 'rest', compare each group to the union of all other groups. If a group identifier, compare each group with respect to that group.

n_genes int | None (default: None)

The number of genes that appear in the returned tables. Defaults to all genes.

rankby_abs bool (default: False)

Rank genes by the absolute value of the score, not by the score. The returned scores are never the absolute values.
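The interaction between rankby_abs and n_genes can be sketched in plain Python. This is an illustrative sketch only, not the library's GPU code: the helper name rank_genes is hypothetical, and descending order is an assumption based on the results being "ordered according to scores".

```python
def rank_genes(scores, gene_names, n_genes=None, rankby_abs=False):
    """Return (names, scores) ordered from best to worst rank.

    Hypothetical helper: ordering uses |score| when rankby_abs is True,
    but the returned scores always keep their sign.
    """
    if rankby_abs:
        key = lambda i: abs(scores[i])
    else:
        key = lambda i: scores[i]
    # Assumption: a higher (absolute) score ranks first.
    order = sorted(range(len(scores)), key=key, reverse=True)
    if n_genes is not None:
        order = order[:n_genes]
    return [gene_names[i] for i in order], [scores[i] for i in order]


names_abs, scores_abs = rank_genes([2.0, -5.0, 0.5], ["A", "B", "C"], rankby_abs=True)
names_raw, scores_raw = rank_genes([2.0, -5.0, 0.5], ["A", "B", "C"], n_genes=2)
```

With rankby_abs=True the strongly negative gene B moves to the top of the table while its score stays negative.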

pts bool (default: False)

Compute the fraction of cells expressing the genes.

key_added str | None (default: None)

The key in adata.uns under which the results are saved.

method Literal['logreg', 't-test', 't-test_overestim_var', 'wilcoxon', 'wilcoxon_binned'] | None (default: None)

't-test' uses Welch’s t-test (default), 't-test_overestim_var' overestimates variance of each group, 'wilcoxon' uses Wilcoxon rank-sum, 'wilcoxon_binned' uses histogram-based approximate Wilcoxon rank-sum (faster for large datasets, supports Dask arrays), 'logreg' uses logistic regression.

corr_method Literal['benjamini-hochberg', 'bonferroni'] (default: 'benjamini-hochberg')

p-value correction method. Used only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.
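For readers unfamiliar with the two corrections, the following is a minimal pure-Python sketch of the textbook Bonferroni and Benjamini-Hochberg adjustments. The function names are hypothetical and the code makes no claim about how the library implements them on the GPU.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    n = len(pvals)
    return [min(p * n, 1.0) for p in pvals]


def benjamini_hochberg(pvals):
    """Textbook Benjamini-Hochberg step-up adjustment."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.5]
bonf = bonferroni(pvals)
bh = benjamini_hochberg(pvals)
```

Bonferroni controls the family-wise error rate and is the more conservative of the two; Benjamini-Hochberg controls the false discovery rate.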

tie_correct bool (default: False)

Use tie correction for 'wilcoxon' and 'wilcoxon_binned' scores. Adjusts the variance of the rank-sum statistic for tied values. For 'wilcoxon_binned', each histogram bin acts as a tie group and the correction is derived from the bin counts.
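The variance adjustment described above can be sketched with the standard rank-sum formula, where each tie group of size t reduces the variance by a term proportional to t^3 - t. This is a plain-Python illustration under that textbook formula; the helper name rank_sum_variance is hypothetical and is not part of the library.

```python
from collections import Counter


def rank_sum_variance(values_a, values_b, tie_correct=False):
    """Variance of the rank sum of group A under the null hypothesis."""
    n1, n2 = len(values_a), len(values_b)
    n = n1 + n2
    variance = n1 * n2 * (n + 1) / 12.0
    if tie_correct:
        # Each distinct value forms a tie group; for 'wilcoxon_binned' the
        # group sizes would be the histogram bin counts instead.
        tie_sizes = Counter(list(values_a) + list(values_b)).values()
        tie_term = sum(t**3 - t for t in tie_sizes)
        variance = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    return variance


plain = rank_sum_variance([0, 0, 1], [0, 2, 3])
corrected = rank_sum_variance([0, 0, 1], [0, 2, 3], tie_correct=True)
```

Sparse expression data contains many tied zeros, so the corrected variance is typically smaller and the resulting z-scores larger in magnitude.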

use_continuity bool (default: False)

Apply continuity correction to 'wilcoxon' and 'wilcoxon_binned' z-scores. Subtracts 0.5 from |R - E[R]| before dividing by the standard deviation, matching scipy.stats.mannwhitneyu() default behavior.
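A small sketch of the continuity correction, assuming the rank sum R, its expectation E[R] = n1 (N + 1) / 2, and its standard deviation are already known. The helper name is hypothetical; keeping the sign of R - E[R] and clamping the shrunken difference at zero are assumptions of this sketch, not documented library behavior.

```python
import math


def rank_sum_z(rank_sum, expected, std, use_continuity=False):
    """z-score of a rank sum, optionally with continuity correction."""
    diff = rank_sum - expected
    if use_continuity:
        # Subtract 0.5 from |R - E[R]|; sign kept and clamped at 0 (assumption).
        diff = math.copysign(max(abs(diff) - 0.5, 0.0), diff)
    return diff / std


# Example: n1 = 3 cells in the group, N = 6 cells overall, so E[R] = 10.5.
z_plain = rank_sum_z(15.0, 10.5, 2.0)
z_corrected = rank_sum_z(15.0, 10.5, 2.0, use_continuity=True)
z_negative = rank_sum_z(6.0, 10.5, 2.0, use_continuity=True)
```

The correction always moves the z-score toward zero, which matters most for small groups.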

layer str | None (default: None)

Key in adata.layers whose values are used for the tests.

chunk_size int | None (default: None)

Number of genes to process at once for 'wilcoxon' and 'wilcoxon_binned'. Default is 128 for 'wilcoxon'. For 'wilcoxon_binned' the default is sized dynamically based on n_groups and n_bins to keep histogram memory stable.

pre_load bool (default: False)

Pre-load the data into GPU memory. Used only for 'wilcoxon'.

n_bins int | None (default: None)

Number of histogram bins for 'wilcoxon_binned'. Higher values give a better approximation at slightly increased cost. Default is 1000 for in-memory arrays and 200 for Dask arrays.

bin_range Optional[Literal['log1p', 'auto']] (default: None)

How to determine the histogram bin range for 'wilcoxon_binned'. None (default) uses 'auto' for in-memory arrays and 'log1p' for Dask arrays (to avoid a costly data scan). 'log1p' uses a fixed [0, 15] range suitable for most log1p-normalized data. 'auto' computes the actual data range; use it for z-scored or unnormalized data.
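To make the histogram idea behind 'wilcoxon_binned' concrete, here is a rough pure-Python sketch of a binned rank sum for a single gene. It treats every bin as one tie group and assigns the bin's midrank to all values in it. The function name, the clamping of out-of-range values into the edge bins, and the loop structure are assumptions made for illustration; the library's actual GPU kernels may differ.

```python
def binned_rank_sum(group, rest, n_bins=1000, bin_range="auto"):
    """Approximate the rank sum of `group` using histogram midranks."""
    values = list(group) + list(rest)
    if bin_range == "log1p":
        lo, hi = 0.0, 15.0  # fixed range, no scan over the data needed
    else:
        lo, hi = min(values), max(values)  # 'auto': actual data range
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range

    def bin_of(x):
        # Assumption: out-of-range values are clamped into the edge bins.
        return min(max(int((x - lo) / width), 0), n_bins - 1)

    total = [0] * n_bins
    in_group = [0] * n_bins
    for x in values:
        total[bin_of(x)] += 1
    for x in group:
        in_group[bin_of(x)] += 1

    rank_sum, below = 0.0, 0
    for b in range(n_bins):
        if total[b]:
            # All values in a bin share the average of the ranks they span.
            midrank = below + (total[b] + 1) / 2.0
            rank_sum += in_group[b] * midrank
            below += total[b]
    return rank_sum


fine = binned_rank_sum([0.5, 3.0], [1.0, 2.0, 4.0], bin_range="log1p")
coarse = binned_rank_sum([0.5, 3.0], [1.0, 2.0, 4.0], n_bins=2)
```

With enough bins every value lands in its own bin and the result matches the exact rank sum; with only two bins the values collapse into large tie groups and the rank sum becomes an approximation.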

**kwds

Additional arguments passed to the method. For 'logreg', these are passed to cuml.linear_model.LogisticRegression.

Return type:

None

Returns:

Updates adata with the following fields:

adata.uns['rank_genes_groups' | key_added]['names']

Structured array to be indexed by group id storing the gene names. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['scores']

Structured array to be indexed by group id storing the z-score underlying the computation of a p-value for each gene for each group. Ordered according to scores.

adata.uns['rank_genes_groups' | key_added]['logfoldchanges']

Structured array to be indexed by group id storing the log2 fold change for each gene for each group.
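Because the function expects logarithmized data, a log2 fold change has to be derived from means of log1p-transformed values. The sketch below follows the convention used by Scanpy (undo the log1p with expm1, add a small constant, take log2 of the ratio). Whether rapids_singlecell uses exactly this formula is an assumption here, and the helper name and eps value are illustrative.

```python
import math


def log2_fold_change(mean_group, mean_rest, eps=1e-9):
    """log2 fold change from means of log1p-transformed expression.

    Assumption: mirrors the Scanpy convention; eps avoids division by zero.
    """
    return math.log2((math.expm1(mean_group) + eps) / (math.expm1(mean_rest) + eps))


# A group mean of log1p(3) against a rest mean of log1p(1): roughly log2(3).
lfc = log2_fold_change(math.log1p(3.0), math.log1p(1.0))
```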

adata.uns['rank_genes_groups' | key_added]['pvals']

p-values. Only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.

adata.uns['rank_genes_groups' | key_added]['pvals_adj']

Corrected p-values. Only for 't-test', 't-test_overestim_var', 'wilcoxon', and 'wilcoxon_binned'.

adata.uns['rank_genes_groups' | key_added]['pts']

Fraction of cells expressing genes per group. Only if pts=True.

adata.uns['rank_genes_groups' | key_added]['pts_rest']

Fraction of cells expressing genes in rest. Only if pts=True and reference='rest'.
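A hedged usage sketch tying the parameters and returned fields together. It assumes a CUDA-capable environment with rapids_singlecell installed and an AnnData object whose adata.obs contains a categorical column named 'leiden'. The column name, the key 'rank_wilcoxon', and the wrapper function rank_marker_genes are hypothetical choices for this example.

```python
def rank_marker_genes(adata, groupby="leiden", key_added="rank_wilcoxon"):
    """Run the ranking and return the top gene names per group."""
    # Imported lazily because the package needs a GPU environment.
    import rapids_singlecell as rsc

    rsc.tl.rank_genes_groups(
        adata,
        groupby,
        method="wilcoxon",
        tie_correct=True,
        pts=True,
        n_genes=50,
        key_added=key_added,
    )
    result = adata.uns[key_added]
    # 'names' is a structured array; its field names are the group ids.
    return {
        group: list(result["names"][group])
        for group in result["names"].dtype.names
    }
```

The other result fields ('scores', 'logfoldchanges', 'pvals', 'pvals_adj') can be read from the same dictionary entry in the same way.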