rapids_singlecell.pp.bbknn
- rapids_singlecell.pp.bbknn(adata, neighbors_within_batch=3, n_pcs=None, *, batch_key=None, use_rep=None, random_state=0, algorithm='brute', metric='euclidean', metric_kwds=mappingproxy({}), algorithm_kwds=mappingproxy({}), trim=None, key_added=None, copy=False)
Batch balanced KNN [PYM+19], altering the KNN procedure to identify each cell’s top neighbours in each batch separately instead of the entire cell pool with no accounting for batch. The nearest neighbours for each batch are then merged to create a final list of neighbours for the cell.
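The merging idea described above can be illustrated with a small pure-Python sketch. It is not the library's GPU implementation: the helper name `bbknn_neighbors` is hypothetical, distances are plain Euclidean, and a cell is excluded from its own neighbour list as a simplification.

```python
import math
from collections import defaultdict


def bbknn_neighbors(points, batches, neighbors_within_batch=3):
    """Collect each cell's nearest neighbours from every batch separately.

    points:  list of coordinate tuples, one per cell
    batches: list of batch labels, same length as points
    Returns a list whose entry i holds (distance, neighbour_index) pairs for cell i.
    """
    # Group cell indices by batch label.
    by_batch = defaultdict(list)
    for idx, label in enumerate(batches):
        by_batch[label].append(idx)

    merged = []
    for i, p in enumerate(points):
        neighbours = []
        for label in sorted(by_batch):
            # Rank the cells of this batch by Euclidean distance to cell i
            # (the cell itself is skipped; this is a simplification).
            candidates = sorted(
                (math.dist(p, points[j]), j) for j in by_batch[label] if j != i
            )
            neighbours.extend(candidates[:neighbors_within_batch])
        # Merge the per-batch lists into one list ordered by distance.
        merged.append(sorted(neighbours))
    return merged


# Toy data: five cells in batch "a" close together, three cells in batch "b" far away.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (10.0,), (11.0,), (12.0,)]
batches = ["a"] * 5 + ["b"] * 3
result = bbknn_neighbors(points, batches, neighbors_within_batch=2)
print([j for _, j in result[0]])  # [1, 2, 5, 6]
```

An ordinary 4-nearest-neighbour search for the first cell would return only batch "a" cells (indices 1 to 4); the per-batch search forces two neighbours from batch "b" into the list.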
- Parameters:
- adata
AnnData Annotated data matrix.
- neighbors_within_batch
int (default:3) How many top neighbours to report for each batch; the total number of neighbours in the initial k-nearest-neighbours computation will be this number times the number of batches. This then serves as the basis for the construction of a symmetrical matrix of connectivities.
- n_pcs
int|None (default:None) Use this many PCs. If n_pcs==0 and use_rep is None, use .X.
- use_rep
str|None (default:None) Use the indicated representation. 'X' or any key for .obsm is valid. If None, the representation is chosen automatically: for .n_vars < 50, .X is used, otherwise 'X_pca' is used. If 'X_pca' is not present, it is computed with default parameters or n_pcs if present.
- random_state
None|int|RandomState (default:0) A NumPy random seed.
- algorithm
Literal['brute','cagra','ivfflat','ivfpq','mg_ivfflat','mg_ivfpq'] (default:'brute') The query algorithm to use. Valid options are:
'brute': Brute-force search that computes distances to all data points, guaranteeing exact results.
'ivfflat': Uses inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.
'ivfpq': Combines inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.
'cagra': Employs CAGRA, a GPU graph-based approximate nearest-neighbor method, to find neighbors quickly by traversing a graph structure.
'mg_ivfflat': Uses multi-GPU inverted file indexing to partition the dataset into coarse quantizer cells and performs the search within the relevant cells.
'mg_ivfpq': Combines multi-GPU inverted file indexing with product quantization to encode sub-vectors of the dataset, facilitating faster distance computation.
Please ensure that the chosen algorithm is compatible with your dataset and the specific requirements of your search problem.
- metric
Union[Literal['l2','chebyshev','manhattan','taxicab','correlation','inner_product','euclidean','canberra','lp','minkowski','cosine','jensenshannon','linf','cityblock','l1','haversine','sqeuclidean'],Literal['canberra','chebyshev','cityblock','cosine','euclidean','hellinger','inner_product','jaccard','l1','l2','linf','lp','manhattan','minkowski','taxicab']] (default:'euclidean') A known metric’s name or a callable that returns a distance.
- metric_kwds
Mapping[str,Any] (default:mappingproxy({})) Options for the metric.
- algorithm_kwds
Mapping[str,Any] (default:mappingproxy({})) Options for the algorithm.
For the ivfflat and ivfpq algorithms, the following parameters can be specified:
'n_lists': Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).
'nprobes': Number of lists to probe during search. Default is 1. Higher values increase accuracy but reduce speed.
For the mg_ivfflat and mg_ivfpq algorithms, the following parameters can be specified:
'distribution_mode': The distribution mode to use. Valid options are 'replicated' and 'shared'. Default is 'replicated'.
'n_lists': Number of inverted lists for IVF indexing. Default is 2 * next_power_of_2(sqrt(n_samples)).
'n_probes': Number of lists to probe during search. Default is 20. Higher values increase accuracy but reduce speed.
- trim
int|None (default:None) Trim the neighbours of each cell to this many top connectivities. May help with population independence and improve the tidiness of clustering. The lower the value, the more independent the individual populations, at the cost of more conserved batch effect. If None, sets the parameter value automatically to 10 times neighbors_within_batch times the number of batches. Set to 0 to skip.
- key_added
str|None (default:None) If not specified, the neighbors data is stored in .uns['neighbors'], and distances and connectivities are stored in .obsp['distances'] and .obsp['connectivities'] respectively. If specified, the neighbors data is added to .uns[key_added], distances are stored in .obsp[f'{key_added}_distances'] and connectivities in .obsp[f'{key_added}_connectivities'].
- copy
bool (default:False) Return a copy instead of writing to adata.
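Several of the defaults above are derived from the data rather than fixed. The following sketch works through that arithmetic; the helper names `next_power_of_2` and `documented_defaults` are hypothetical, and `next_power_of_2` is assumed to mean the smallest power of two greater than or equal to its argument, which this page does not spell out.

```python
import math


def next_power_of_2(x):
    """Smallest power of two >= x (assumed meaning of next_power_of_2)."""
    power = 1
    while power < x:
        power *= 2
    return power


def documented_defaults(n_cells, n_batches, neighbors_within_batch=3, trim=None):
    """Reproduce the derived values described in the parameter list."""
    # The initial kNN search reports this many neighbours per cell.
    initial_neighbors = neighbors_within_batch * n_batches
    # trim=None falls back to 10 * neighbors_within_batch * n_batches;
    # trim=0 is kept as 0, which means trimming is skipped.
    if trim is None:
        trim = 10 * neighbors_within_batch * n_batches
    # Default number of inverted lists for the IVF-based algorithms.
    n_lists = 2 * next_power_of_2(math.sqrt(n_cells))
    return {
        "initial_neighbors": initial_neighbors,
        "trim": trim,
        "n_lists": n_lists,
    }


defaults = documented_defaults(n_cells=10_000, n_batches=4)
print(defaults)  # {'initial_neighbors': 12, 'trim': 120, 'n_lists': 256}
```

Under these assumptions, a call such as `rsc.pp.bbknn(adata, batch_key="batch")` on 10,000 cells in 4 batches (with the package imported as `rsc`) would start from 12 neighbours per cell and trim to 120 connectivities.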
- Return type:
- Returns:
Depending on copy, updates or returns adata with the following:
- connectivities
sparse matrix of dtype float32. Weighted adjacency matrix of the neighborhood graph of data points. Weights should be interpreted as connectivities.
- distances
sparse matrix of dtype float32. Instead of decaying weights, this stores distances for each pair of neighbors.
See the key_added parameter description for the storage path of connectivities and distances.
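The storage paths depend only on key_added, so they can be summarised in a few lines. The helper below is a hypothetical illustration of the rule stated in the key_added description, not part of the library.

```python
def neighbor_storage_keys(key_added=None):
    """Return the .uns and .obsp keys described for the given key_added value."""
    if key_added is None:
        # Default locations when key_added is not specified.
        return {
            "uns": "neighbors",
            "distances": "distances",
            "connectivities": "connectivities",
        }
    # Custom locations are prefixed with key_added.
    return {
        "uns": key_added,
        "distances": f"{key_added}_distances",
        "connectivities": f"{key_added}_connectivities",
    }


keys = neighbor_storage_keys("bbknn")
print(keys["connectivities"])  # bbknn_connectivities
```

With an AnnData object, the connectivities matrix would then be read as `adata.obsp[keys["connectivities"]]`.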