rapids_singlecell.pp.bbknn

rapids_singlecell.pp.bbknn(adata, neighbors_within_batch=3, n_pcs=None, *, batch_key=None, use_rep=None, random_state=0, algorithm='brute', metric='euclidean', metric_kwds=mappingproxy({}), algorithm_kwds=mappingproxy({}), trim=None, key_added=None, copy=False)

Batch balanced KNN [PYM+19], altering the KNN procedure to identify each cell’s top neighbours in each batch separately instead of the entire cell pool with no accounting for batch. The nearest neighbours for each batch are then merged to create a final list of neighbours for the cell.
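The per-batch search described above can be illustrated with a minimal pure-Python sketch. This is not the library's GPU implementation: the helper name `batch_balanced_neighbors` and the toy data are hypothetical, and a brute-force Euclidean search is assumed for clarity.

```python
import math


def batch_balanced_neighbors(points, batches, neighbors_within_batch=3):
    """For each cell, find its nearest neighbours in every batch separately,
    then merge the per-batch lists into one neighbour list.

    Illustrative brute-force sketch only (hypothetical helper).
    """
    # Group cell indices by batch label, preserving first-seen batch order.
    by_batch = {}
    for idx, label in enumerate(batches):
        by_batch.setdefault(label, []).append(idx)

    merged = []
    for query in points:
        neighbours = []
        for members in by_batch.values():
            # Rank the cells of this batch by Euclidean distance to the query.
            # The query cell itself ranks first within its own batch.
            ranked = sorted(members, key=lambda j: math.dist(query, points[j]))
            neighbours.extend(ranked[:neighbors_within_batch])
        merged.append(neighbours)
    return merged


# Toy data: four cells in batch "a" and four in batch "b".
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (5.0, 5.0),
          (0.0, 1.0), (0.1, 1.0), (5.0, 6.0), (5.1, 6.0)]
batches = ["a", "a", "a", "a", "b", "b", "b", "b"]

result = batch_balanced_neighbors(points, batches, neighbors_within_batch=2)
```

Every cell ends up with `neighbors_within_batch` neighbours from each batch, so the merged list has `neighbors_within_batch` times the number of batches entries regardless of how the batches are distributed in space.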

Parameters:
adata AnnData

Annotated data matrix.

neighbors_within_batch int (default: 3)

How many top neighbours to report for each batch; total number of neighbours in the initial k-nearest-neighbours computation will be this number times the number of batches. This then serves as the basis for the construction of a symmetrical matrix of connectivities.

n_pcs int | None (default: None)

Use this many PCs. If n_pcs==0 and use_rep is None, use .X.

use_rep str | None (default: None)

Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: for .n_vars < 50, .X is used; otherwise 'X_pca' is used. If 'X_pca' is not present, it is computed with default parameters, or with n_pcs components if n_pcs is given.
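The selection rules for n_pcs and use_rep can be summarised in a short sketch. The helper `choose_representation` is hypothetical and only mirrors the rules as documented here; it does not call the library, and the returned flag simply indicates whether 'X_pca' would have to be computed first.

```python
def choose_representation(n_vars, obsm_keys, use_rep=None, n_pcs=None):
    """Return (representation name, whether PCA must be computed first).

    "X" stands for adata.X; any other name is a key of adata.obsm.
    Hypothetical helper that follows the documented rules only.
    """
    if use_rep is not None:
        # An explicit choice must be "X" or an existing .obsm key.
        if use_rep != "X" and use_rep not in obsm_keys:
            raise KeyError(f"{use_rep!r} is neither 'X' nor a key of .obsm")
        return use_rep, False
    # n_pcs == 0 or a small feature space both fall back to .X.
    if n_pcs == 0 or n_vars < 50:
        return "X", False
    # Otherwise 'X_pca' is used and computed on demand when missing.
    return "X_pca", "X_pca" not in obsm_keys
```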

random_state None | int | RandomState (default: 0)

A numpy random seed.

algorithm Literal['brute', 'cagra', 'ivfflat', 'ivfpq', 'mg_ivfflat', 'mg_ivfpq'] (default: 'brute')

The query algorithm to use. Valid options are:

'brute'

Brute-force search that computes distances to all data points, guaranteeing exact results.

'ivfflat'

Uses inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.

'ivfpq'

Combines inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.

'cagra'

Uses CAGRA, a GPU graph-based approximate nearest-neighbor search, to find nearest neighbors quickly by traversing a graph structure.

'mg_ivfflat'

Uses multi-GPU inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.

'mg_ivfpq'

Combines Multi-GPU inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.

Please ensure that the chosen algorithm is compatible with your dataset and the specific requirements of your search problem.

metric Union[Literal['l2', 'chebyshev', 'manhattan', 'taxicab', 'correlation', 'inner_product', 'euclidean', 'canberra', 'lp', 'minkowski', 'cosine', 'jensenshannon', 'linf', 'cityblock', 'l1', 'haversine', 'sqeuclidean'], Literal['canberra', 'chebyshev', 'cityblock', 'cosine', 'euclidean', 'hellinger', 'inner_product', 'jaccard', 'l1', 'l2', 'linf', 'lp', 'manhattan', 'minkowski', 'taxicab']] (default: 'euclidean')

A known metric’s name or a callable that returns a distance.

metric_kwds Mapping[str, Any] (default: mappingproxy({}))

Options for the metric.

algorithm_kwds Mapping[str, Any] (default: mappingproxy({}))

Options for the algorithm. For ivfflat and ivfpq algorithms, the following parameters can be specified:

  • 'n_lists': Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).

  • 'nprobes': Number of lists to probe during search. Default is 1. Higher values increase accuracy but reduce speed.

For mg_ivfflat and mg_ivfpq algorithms, the following parameters can be specified:

  • 'distribution_mode': The distribution mode to use. Valid options are 'replicated' and 'shared'. Default is 'replicated'.

  • 'n_lists': Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).

  • 'n_probes': Number of lists to probe during search. Default is 20. Higher values increase accuracy but reduce speed.
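As a worked example of the n_lists default, the sketch below evaluates 2 * next_power_of_2(sqrt(n_samples)). It assumes that next_power_of_2 means the smallest power of two greater than or equal to its argument; both helper functions are hypothetical and are not part of the library.

```python
import math


def next_power_of_2(x):
    """Smallest power of two >= x (assumed meaning of the documented formula)."""
    n = max(1, math.ceil(x))
    return 1 << (n - 1).bit_length()


def default_n_lists(n_samples):
    """Documented default: 2 * next_power_of_2(sqrt(n_samples))."""
    return 2 * next_power_of_2(math.sqrt(n_samples))


# For 10,000 cells: sqrt = 100, next power of two = 128, default = 256.
n_lists = default_n_lists(10_000)

# An explicit override would be passed through algorithm_kwds.
algorithm_kwds = {"n_lists": n_lists}
```

Under that assumption, the default grows roughly with the square root of the number of cells, so a 100-fold larger dataset gets about 10 times as many inverted lists.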

trim int | None (default: None)

Trim the neighbours of each cell to this many top connectivities. May help with population independence and improve the tidiness of clustering. The lower the value, the more independent the individual populations, at the cost of a more conserved batch effect. If None, the value is set automatically to 10 times neighbors_within_batch times the number of batches. Set to 0 to skip trimming.
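The automatic trim value and the effect of trimming one cell's connectivities can be sketched as follows. Both helpers are hypothetical illustrations of the description above; the library's actual trimming operates on the sparse connectivity matrix and may differ in detail.

```python
def resolve_trim(trim, neighbors_within_batch, n_batches):
    """Return the effective trim value; None selects the documented default."""
    if trim is None:
        return 10 * neighbors_within_batch * n_batches
    return trim


def trim_row(connectivities, trim):
    """Keep only the `trim` strongest connectivities of one cell.

    `connectivities` maps neighbour index -> connectivity weight.
    A trim of 0 skips trimming.
    """
    if trim == 0 or len(connectivities) <= trim:
        return dict(connectivities)
    strongest = sorted(connectivities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(strongest[:trim])


# With 3 neighbours per batch and 4 batches, the default is 10 * 3 * 4 = 120.
effective_trim = resolve_trim(None, neighbors_within_batch=3, n_batches=4)

# Keeping the two strongest of three connectivities drops the weakest one.
trimmed = trim_row({1: 0.9, 2: 0.1, 3: 0.5}, trim=2)
```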

key_added str | None (default: None)

If not specified, the neighbors data is stored in .uns['neighbors'], distances and connectivities are stored in .obsp['distances'] and .obsp['connectivities'] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[f'{key_added}_distances'] and connectivities in .obsp[f'{key_added}_connectivities'].
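The storage locations described for key_added can be written out as a small lookup. The helper `neighbor_storage_keys` is hypothetical and only restates the naming rule; it does not touch an AnnData object.

```python
def neighbor_storage_keys(key_added=None):
    """Return the .uns and .obsp keys under which the results are stored."""
    if key_added is None:
        return {
            "uns": "neighbors",
            "distances": "distances",
            "connectivities": "connectivities",
        }
    return {
        "uns": key_added,
        "distances": f"{key_added}_distances",
        "connectivities": f"{key_added}_connectivities",
    }


default_keys = neighbor_storage_keys()
custom_keys = neighbor_storage_keys("bbknn")
```

Choosing a non-default key_added keeps an existing neighbor graph in .uns['neighbors'] untouched, which is useful when comparing corrected and uncorrected graphs on the same object.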

copy bool (default: False)

Return a copy instead of writing to adata.

Return type:

AnnData | None

Returns:

Depending on copy, updates or returns adata with the following:

connectivities sparse matrix of dtype float32.

Weighted adjacency matrix of the neighborhood graph of data points. Weights should be interpreted as connectivities.

distances sparse matrix of dtype float32.

Instead of decaying weights, this stores distances for each pair of neighbors.

See key_added parameter description for the storage path of connectivities and distances.
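A minimal usage sketch, built only from the signature and storage paths documented above. It assumes an AnnData object whose .obs contains a batch column named "batch" (a hypothetical name) and a CUDA-capable environment with rapids_singlecell installed; the wrapper `run_bbknn` is illustrative and not part of the library.

```python
# Keyword arguments taken from the documented signature.
BBKNN_KWARGS = {
    "batch_key": "batch",          # hypothetical .obs column holding batch labels
    "neighbors_within_batch": 3,   # documented default
    "algorithm": "brute",          # exact search
    "metric": "euclidean",
}


def run_bbknn(adata):
    """Run BBKNN in place and return the stored graphs (illustrative wrapper)."""
    # Imported lazily because rapids_singlecell requires a GPU environment.
    import rapids_singlecell as rsc

    # With copy=False (the default) the result is written to adata.
    rsc.pp.bbknn(adata, **BBKNN_KWARGS)

    # With key_added=None the graphs are stored under these .obsp keys.
    return adata.obsp["connectivities"], adata.obsp["distances"]
```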