rapids_singlecell.pp.harmony_integrate

rapids_singlecell.pp.harmony_integrate(adata, key, *, basis='X_pca', adjusted_basis='X_pca_harmony', dtype=<class 'numpy.float64'>, flavor='harmony2', n_clusters=None, max_iter_harmony=10, max_iter_clustering=200, tol_harmony=0.0001, tol_clustering=1e-05, sigma=0.1, theta=2.0, tau=0, ridge_lambda=1.0, alpha=0.2, batch_prune_threshold=1e-05, correction_method=None, colsum_algo=None, block_proportion=0.05, random_state=0, verbose=False)

Integrate different experiments using the Harmony algorithm [KMF+19, PYM+26].

This GPU-accelerated implementation was originally based on the harmony-pytorch package. Multiple batch variables now follow the per-covariate formulation described in the Harmony papers: each key is modeled separately instead of combining all keys into one joint category. As Harmony works by adjusting the principal components, this function should be run after performing PCA but before computing the neighbor graph.
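The ordering described above (PCA first, then Harmony, then the neighbor graph) might look like the following sketch. It assumes rapids_singlecell is installed with a working GPU, that adata is already normalized and log-transformed, and that the batch labels live in a hypothetical adata.obs["batch"] column. run_harmony_workflow is an illustrative helper, not part of the library.

```python
def run_harmony_workflow(adata, batch_key="batch"):
    """PCA -> Harmony -> neighbor graph, in that order.

    `batch_key` is a hypothetical column name in adata.obs.
    """
    # Imported lazily so the helper can be defined without a GPU stack.
    import rapids_singlecell as rsc

    rsc.get.anndata_to_GPU(adata)                    # move adata.X to the GPU
    rsc.pp.pca(adata, n_comps=50)                    # writes adata.obsm["X_pca"]
    rsc.pp.harmony_integrate(adata, key=batch_key)   # writes adata.obsm["X_pca_harmony"]
    # Build the neighbor graph on the corrected embedding, not on X_pca.
    rsc.pp.neighbors(adata, use_rep="X_pca_harmony")
    return adata
```

Downstream steps such as clustering or UMAP then work from the neighbor graph built on the adjusted embedding.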

By default, the Harmony2 algorithm is used, which includes a stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning. To revert to the original Harmony behavior:

import rapids_singlecell as rsc

rsc.pp.harmony_integrate(adata, key, flavor="harmony1")

Parameters:

adata AnnData: The annotated data matrix.
key str | list[str]: The key(s) of the column(s) in adata.obs that differentiate(s) among experiments/batches. Multiple keys are modeled as separate batch variables, with one active categorical level per variable and cell. To retain the joint-combination behavior of earlier releases, combine the desired columns into one categorical column and pass that single key.
basis str (default: 'X_pca'): The name of the field in adata.obsm where the PCA table is stored.
adjusted_basis str (default: 'X_pca_harmony'): The name of the field in adata.obsm where the adjusted PCA table will be stored.
dtype type (default: <class 'numpy.float64'>): The data type to use for Harmony computation. Using a 32-bit type may cause numerical instability.
flavor Literal['harmony2', 'harmony1'] (default: 'harmony2'): Which version of the Harmony algorithm to use. "harmony2" (default) enables the stabilized diversity penalty, dynamic per-cluster-per-batch ridge regularization, and automatic batch pruning from [PYM+26]. "harmony1" uses the original algorithm from [KMF+19].
n_clusters int | None (default: None): Number of clusters used for soft k-means in the Harmony algorithm. If None, uses min(100, N / 30), where N is the number of cells. More clusters capture finer-grained structure but increase computation time.
max_iter_harmony int (default: 10): Maximum number of outer Harmony iterations (each consisting of a clustering step followed by a correction step).
max_iter_clustering int (default: 200): Maximum iterations for the clustering step within each Harmony iteration.
tol_harmony float (default: 0.0001): Convergence tolerance for the Harmony objective function. The algorithm stops when the relative change in objective falls below this value.
tol_clustering float (default: 1e-05): Convergence tolerance for the clustering step within each Harmony iteration.
sigma float (default: 0.1): Width of the soft-clustering kernel. Controls the entropy of cluster assignments: smaller values produce harder assignments (cells assigned to fewer clusters), while larger values produce softer assignments (cells spread across more clusters).
theta float | int | list[float] | np.ndarray | cp.ndarray (default: 2.0): Diversity penalty weight per batch variable. Controls how strongly Harmony encourages each cluster to contain a balanced representation of all batches. Higher values (e.g. 4) produce more aggressive mixing; lower values (e.g. 0.5) allow more batch-specific clusters. Set to 0 to disable the diversity penalty for a batch variable. A scalar is applied to every key. A sequence may contain one value per key, expanded over that key’s categorical levels, or one value per categorical level across all keys.
tau int (default: 0): Discounting factor on theta. When tau > 0, the diversity penalty is down-weighted for batches with fewer cells, preventing over-correction of small batches. By default (0), there is no discounting.
ridge_lambda float (default: 1.0): Ridge regression regularization for the correction step. Larger values produce more conservative (smaller) corrections, preventing over-fitting. Must be finite and greater than zero. Only used with flavor="harmony1".
alpha float (default: 0.2): Scaling factor for the dynamic per-cluster-per-batch ridge regularization. The effective regularization for each cluster-batch pair is alpha * E_kb where E_kb is the expected number of cells. Larger values produce more conservative corrections. Only used with flavor="harmony2".
batch_prune_threshold float | None (default: 1e-05): Fraction threshold below which a batch-cluster pair is pruned (correction suppressed). When the fraction of a batch’s cells assigned to a cluster (O_kb / N_b) falls below this threshold, that batch-cluster pair receives no correction, preventing spurious adjustments. Only used with flavor="harmony2". Set to None to disable pruning.
correction_method Literal['fast', 'batched'] | None (default: None): Method for the correction step. "fast" uses a precomputed factorization that avoids the full inversion, which can be faster for datasets with many batches. "batched" processes all clusters simultaneously (fastest but requires more memory). With one key, None automatically selects "batched" unless its workspace would exceed 1 GiB, in which case "fast" is used. Multiple keys always use the exact general-design solve because the arrowhead optimization applies only to one batch variable; clusters are processed in workspace-bounded chunks when needed. For multiple keys, use None or "batched"; passing "fast" emits a warning and uses the exact solve.
colsum_algo COLSUM_ALGO | None (default: None): Algorithm for column sums. If None, chosen automatically. If "benchmark", benchmarks all algorithms.
block_proportion float (default: 0.05): Proportion of cells updated per clustering sub-iteration. Smaller values produce more stochastic updates. Larger values are faster but may converge to different solutions.
random_state int (default: 0): Random seed for reproducibility.
verbose bool (default: False): Whether to print benchmarking and convergence information.
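The n_clusters default described above, min(100, N / 30), can be sketched in a few lines. The rounding to the nearest integer and the lower bound of one cluster are assumptions made here for illustration; the documentation only states the min rule. default_n_clusters is a hypothetical helper.

```python
def default_n_clusters(n_cells: int) -> int:
    """Default cluster count min(100, N / 30), with N the number of cells."""
    # Assumption: N / 30 is rounded to the nearest integer and floored at 1;
    # the documentation only states the min(100, N / 30) rule.
    return max(1, min(100, round(n_cells / 30)))


print(default_n_clusters(600))     # 20
print(default_n_clusters(50_000))  # 100
```

Under this rule the cap of 100 applies to any dataset with roughly 3,000 cells or more.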
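The effect of sigma on the softness of cluster assignments can be illustrated with a toy kernel. This sketch assumes memberships proportional to exp(-distance / sigma) and ignores the diversity penalty; soft_assign and the distances are hypothetical and not taken from the library.

```python
import math


def soft_assign(distances, sigma=0.1):
    """Turn cell-to-centroid distances into soft cluster memberships."""
    # Subtract the smallest distance first so exp() cannot underflow to all zeros.
    d_min = min(distances)
    weights = [math.exp(-(d - d_min) / sigma) for d in distances]
    total = sum(weights)
    return [w / total for w in weights]


distances = [0.10, 0.20, 0.40]   # hypothetical distances to three centroids
hard = soft_assign(distances, sigma=0.05)
soft = soft_assign(distances, sigma=1.0)
print([round(p, 3) for p in hard])
print([round(p, 3) for p in soft])
```

With the smaller sigma most of the membership mass lands on the nearest centroid, while the larger sigma spreads it more evenly across all three.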
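The three accepted shapes of theta (a scalar, one value per key, or one value per categorical level) can be sketched as an expansion to one weight per level. expand_theta is a hypothetical helper written for illustration, and the error handling is an assumption rather than the library's behavior.

```python
def expand_theta(theta, levels_per_key):
    """Expand theta to one value per categorical level across all keys.

    levels_per_key: number of categories for each batch key, in key order.
    """
    n_keys = len(levels_per_key)
    n_levels = sum(levels_per_key)
    if isinstance(theta, (int, float)):
        return [float(theta)] * n_levels            # scalar: same weight everywhere
    values = [float(t) for t in theta]
    if len(values) == n_keys:
        # One value per key, repeated over that key's levels.
        return [v for v, n in zip(values, levels_per_key) for _ in range(n)]
    if len(values) == n_levels:
        return values                               # already one value per level
    raise ValueError(
        f"theta needs 1, {n_keys}, or {n_levels} values, got {len(values)}"
    )


# Two keys: "donor" with 3 levels and "chemistry" with 2 levels (hypothetical).
print(expand_theta(2.0, [3, 2]))         # [2.0, 2.0, 2.0, 2.0, 2.0]
print(expand_theta([1.0, 4.0], [3, 2]))  # [1.0, 1.0, 1.0, 4.0, 4.0]
```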
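The tau discounting of theta can be sketched as follows. The formula is the one used by the reference Harmony implementations; whether this function applies exactly the same form is an assumption. discount_theta and the batch sizes are hypothetical.

```python
import math


def discount_theta(theta, batch_sizes, n_clusters, tau=0):
    """Down-weight theta for small batches; tau=0 leaves theta unchanged."""
    if tau <= 0:
        return [float(theta)] * len(batch_sizes)
    # Assumed form, taken from the reference Harmony implementations:
    # theta_b = theta * (1 - exp(-(N_b / (n_clusters * tau))**2))
    return [
        theta * (1.0 - math.exp(-((n_b / (n_clusters * tau)) ** 2)))
        for n_b in batch_sizes
    ]


sizes = [50, 5000]   # hypothetical cells per batch
print(discount_theta(2.0, sizes, n_clusters=100, tau=0))
print(discount_theta(2.0, sizes, n_clusters=100, tau=5))
```

With tau=5 the 50-cell batch ends up with a penalty weight close to zero, while the 5,000-cell batch keeps essentially the full theta.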
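For the Harmony2-only parameters alpha and batch_prune_threshold, the per-pair ridge strength alpha * E_kb and the pruning test O_kb / N_b can be sketched together. This assumes E_kb is the cluster size times the batch frequency, which is the count expected under perfect mixing; ridge_and_prune and the toy counts are hypothetical.

```python
def ridge_and_prune(O, batch_sizes, alpha=0.2, batch_prune_threshold=1e-5):
    """Per cluster-batch ridge strengths and a keep/prune mask.

    O[k][b] is the (soft) number of cells from batch b in cluster k.
    """
    n_total = sum(batch_sizes)
    lam, keep = [], []
    for row in O:
        cluster_size = sum(row)
        # Assumption: E_kb = cluster size * batch frequency (expected count
        # if batches were perfectly mixed).
        expected = [cluster_size * n_b / n_total for n_b in batch_sizes]
        lam.append([alpha * e for e in expected])
        # A pair is kept only if the batch places enough of its cells here;
        # a threshold of None disables pruning.
        keep.append([
            batch_prune_threshold is None or o / n_b >= batch_prune_threshold
            for o, n_b in zip(row, batch_sizes)
        ])
    return lam, keep


O = [[90.0, 10.0], [10.0, 0.0]]   # hypothetical counts: 2 clusters x 2 batches
lam, keep = ridge_and_prune(O, batch_sizes=[100, 10])
print(lam)
print(keep)   # [[True, True], [True, False]]
```

The second batch has no cells in the second cluster, so that pair is the only one pruned and would receive no correction.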
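The role of block_proportion in the clustering step can be pictured as splitting a shuffled cell order into blocks that are updated one at a time. This is a minimal sketch under the assumption that each block holds about block_proportion * N cells; make_update_blocks is hypothetical and does not mirror the GPU implementation.

```python
import random


def make_update_blocks(n_cells, block_proportion=0.05, seed=0):
    """Shuffle cell indices and split them into blocks for sub-iterations."""
    order = list(range(n_cells))
    random.Random(seed).shuffle(order)      # seeded for reproducibility
    # Assumption: each block holds about block_proportion * n_cells cells.
    block_size = max(1, round(n_cells * block_proportion))
    return [order[i:i + block_size] for i in range(0, n_cells, block_size)]


blocks = make_update_blocks(1000, block_proportion=0.05)
print(len(blocks), len(blocks[0]))   # 20 50
```

A smaller proportion yields more, smaller blocks per pass, which matches the description of more stochastic updates.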

Return type:

None

Returns:

Updates adata with the field adata.obsm[adjusted_basis], containing principal components adjusted by Harmony such that different experiments are integrated.
