rapids_singlecell.pp.neighbors

Contents

rapids_singlecell.pp.neighbors#

rapids_singlecell.pp.neighbors(adata, n_neighbors=15, n_pcs=None, *, use_rep=None, random_state=0, algorithm='brute', metric='euclidean', metric_kwds=mappingproxy({}), algorithm_kwds=mappingproxy({}), method='umap', key_added=None, copy=False)[source]#

Compute a neighborhood graph of observations [ONN+24].

The neighbor search efficiency of this heavily relies on cuVS, which also provides a method for estimating connectivities of data points - the connectivity of the manifold.

Parameters:
adata AnnData

Annotated data matrix.

n_neighbors int (default: 15)

The size of local neighborhood (in terms of number of neighboring data points) used for manifold approximation. Larger values result in more global views of the manifold, while smaller values result in more local data being preserved. In general values should be in the range 2 to 100.

n_pcs int | None (default: None)

Use this many PCs. If n_pcs==0 use .X if use_rep is None.

use_rep str | None (default: None)

Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: For .n_vars < 50, .X is used, otherwise 'X_pca' is used. If 'X_pca' is not present, it’s computed with default parameters or n_pcs if present.

random_state None | int | RandomState (default: 0)

A numpy random seed.

algorithm Literal['brute', 'cagra', 'ivfflat', 'ivfpq', 'nn_descent', 'all_neighbors', 'mg_ivfflat', 'mg_ivfpq'] (default: 'brute')

The query algorithm to use. Valid options are:
'brute'

Brute-force search that computes distances to all data points, guaranteeing exact results.

'ivfflat'

Uses inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.

'ivfpq'

Combines inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.

'cagra'

Employs the Compressed, Accurate Graph-based search to quickly find nearest neighbors by traversing a graph structure.

'nn_descent'

Uses the NN-descent algorithm to approximate the k-nearest neighbors. Note: Performance may be degraded when Dask is active.

'all_neighbors'

Uses the all-neighbors algorithm to approximate the k-nearest neighbors. Note: Performance may be degraded when Dask is active and algorithm is nn_descent.

'mg_ivfflat'

Uses the Multi-GPU inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.

'mg_ivfpq'

Combines Multi-GPU inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.

Please ensure that the chosen algorithm is compatible with your dataset and the specific requirements of your search problem.

metric Union[Literal['l2', 'chebyshev', 'manhattan', 'taxicab', 'correlation', 'inner_product', 'euclidean', 'canberra', 'lp', 'minkowski', 'cosine', 'jensenshannon', 'linf', 'cityblock', 'l1', 'haversine', 'sqeuclidean'], Literal['canberra', 'chebyshev', 'cityblock', 'cosine', 'euclidean', 'hellinger', 'inner_product', 'jaccard', 'l1', 'l2', 'linf', 'lp', 'manhattan', 'minkowski', 'taxicab']] (default: 'euclidean')

A known metric’s name or a callable that returns a distance.

metric_kwds Mapping[str, Any] (default: mappingproxy({}))

Options for the metric.

method Literal['umap', 'gauss', 'jaccard'] (default: 'umap')

Method for computing connectivities.

'umap'

UMAP fuzzy simplicial set (default).

'gauss'

Adaptive Gaussian kernel.

'jaccard'

PhenoGraph Jaccard index.

algorithm_kwds Mapping[str, Any] (default: mappingproxy({}))

Options for the algorithm. For ivfflat and ivfpq algorithms, the following parameters can be specified:

  • ’n_lists’: Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).

  • ’n_probes’: Number of lists to probe during search. Default is 20. Higher values increase accuracy but reduce speed.

For nn_descent algorithm, the following parameters can be specified:

  • ’intermediate_graph_degree’: The degree of the intermediate graph. Default is None. It is recommended to set it to >= 1.5 * n_neighbors.

For all_neighbors algorithm, the following parameters can be specified:

  • ’algo’: The algorithm to use. Valid options are: ‘ivf_pq’ and ‘nn_descent’. Default is ‘nn_descent’.

  • ’n_clusters’: Number of clusters/batches to partition the dataset into (> overlap_factor). Default is number of GPUs.

  • ’overlap_factor’: Number of clusters each point is assigned to (must be < n_clusters). Default is 1.

  • ’n_lists’: Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)). Only available for ivf_pq algorithm.

  • ’intermediate_graph_degree’: The degree of the intermediate graph. Default is None. It is recommended to set it to >= 1.5 * n_neighbors. Only available for nn_descent algorithm.

For mg_ivfflat and mg_ivfpq algorithms, the following parameters can be specified:

  • ’distribution_mode’: The distribution mode to use. Valid options are: ‘replicated’ and ‘shared’. Default is ‘replicated’.

  • ’n_lists’: Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).

  • ’n_probes’: Number of lists to probe during search. Default is 20. Higher values increase accuracy but reduce speed.

key_added str | None (default: None)

If not specified, the neighbors data is stored in .uns[‘neighbors’], distances and connectivities are stored in .obsp[‘distances’] and .obsp[‘connectivities’] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[key_added+’_distances’] and connectivities in .obsp[key_added+’_connectivities’].

copy bool (default: False)

Return a copy instead of writing to adata.

Return type:

AnnData | None

Returns:

Depending on copy, updates or returns adata with the following:

See key_added parameter description for the storage path of connectivities and distances.

connectivitiessparse matrix of dtype float32.

Weighted adjacency matrix of the neighborhood graph of data points. Weights should be interpreted as connectivities.

distancessparse matrix of dtype float32.

Instead of decaying weights, this stores distances for each pair of neighbors.