rapids_singlecell.pp.pca

rapids_singlecell.pp.pca(data, n_comps=None, *, layer=None, zero_center=True, svd_solver=None, random_state=0, mask_var=<object object>, use_highly_variable=None, dtype='float32', chunked=False, chunk_size=None, key_added=None, return_info=False, copy=False, **kwargs)

Principal component analysis using GPU acceleration [HMT09, TQOA24].

Uses the following implementations based on data type (defaults for svd_solver in parentheses):

	Dense	Sparse	Dask
`zero_center=True`	cuML PCA (`'full'`)	Custom (`'lanczos'` if n_vars > 8k, else `'covariance_eigh'`)	Custom (`'covariance_eigh'`)
`zero_center=False`	cuML TruncatedSVD (`'full'`)	Custom (`'lanczos'` if n_vars > 8k, else `'covariance_eigh'`)	Custom (`'covariance_eigh'`)
`chunked=True`	cuML IncrementalPCA	cuML IncrementalPCA	Not supported
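
The table can be read as a small decision function. The sketch below only illustrates that reading and is not the library's dispatch code: `default_solver`, the `kind` labels, and the exact cutoff of 8000 (the table says "8k") are assumptions made for this example.

```python
def default_solver(kind, n_vars, chunked=False):
    """Return the default svd_solver suggested by the table above.

    kind is "dense", "sparse" or "dask" (hypothetical labels for this sketch).
    The table lists the same defaults for zero_center=True and
    zero_center=False, so zero_center is not an input here.
    """
    if chunked:
        if kind == "dask":
            raise ValueError("chunked=True is not supported for Dask arrays")
        return None  # incremental PCA is used and svd_solver is ignored
    if kind == "dense":
        return "full"
    if kind == "sparse":
        # Assumed cutoff: "8k" in the table is taken as 8000 features.
        return "lanczos" if n_vars > 8000 else "covariance_eigh"
    if kind == "dask":
        return "covariance_eigh"
    raise ValueError(f"unknown kind: {kind!r}")
```

For instance, a sparse matrix with 20,000 genes would map to `'lanczos'` under this reading, while the same matrix restricted to 2,000 genes would map to `'covariance_eigh'`.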

Parameters:

data Union[AnnData, ndarray, csc_matrix, csr_matrix, Array]

The (annotated) data matrix of shape n_obs × n_vars. Rows correspond to cells and columns to genes. If a matrix is passed instead of an AnnData object, the PCA representation is returned.

n_comps int | None (default: None)

Number of principal components to compute. Defaults to 50, or to one less than the smallest dimension of the selected representation when that dimension is 50 or smaller.
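
A minimal sketch of how the default could be resolved, assuming the rule is "use 50 unless the smaller matrix dimension is 50 or less, in which case use that dimension minus 1". The helper name `default_n_comps` is hypothetical and not part of the library.

```python
def default_n_comps(n_obs, n_vars, default=50):
    """Resolve n_comps=None under the assumed rule described above."""
    min_dim = min(n_obs, n_vars)
    # PCA cannot return more informative components than min_dim - 1
    # once the data are centered, so small inputs cap the default.
    return min_dim - 1 if default >= min_dim else default
```

Under this assumption, a 3000 × 2000 matrix gets 50 components and a 30 × 2000 matrix gets 29.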

layer str (default: None)

If provided, use adata.layers[layer] for expression values instead of adata.X.

zero_center bool (default: True)

If True, compute standard PCA from covariance matrix. If False, omit zero-centering variables (truncated SVD).

svd_solver str | None (default: None)

SVD solver to use. See table above for which implementation is used based on data type, as well as the default solver when svd_solver=None.

None: Choose automatically based on data type (see table above).
'covariance_eigh': Eigendecomposition of the covariance matrix. Fast for sparse matrices with fewer than ~8,000 features. Works with Dask arrays.
'lanczos': Lanczos bidiagonalization with implicit restarts. Memory efficient for large sparse matrices (>8,000 features). Best singular value accuracy. Does not support Dask arrays.
'randomized': Randomized SVD (Halko et al. 2009) with CholeskyQR2 orthogonalization (Tomás et al. 2024). Faster than Lanczos but approximate. Does not support Dask arrays.
'full': Full eigendecomposition of the covariance matrix (cuML). For dense arrays only.
'jacobi': Jacobi iterative solver (cuML). Faster but less accurate. For dense arrays only.
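
To make the `'covariance_eigh'` idea concrete, here is a CPU-only NumPy sketch of PCA through an eigendecomposition of the covariance matrix. It is an illustration under stated assumptions, not the library's GPU implementation: the function name `pca_covariance_eigh` and the dense float64 input are choices made for this example.

```python
import numpy as np


def pca_covariance_eigh(X, n_comps):
    """PCA via eigendecomposition of the covariance matrix (CPU sketch)."""
    X = np.asarray(X, dtype=np.float64)
    Xc = X - X.mean(axis=0)                 # zero-center each variable
    cov = Xc.T @ Xc / (X.shape[0] - 1)      # n_vars x n_vars covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_comps]
    variance = eigvals[order]               # explained variance per component
    components = eigvecs[:, order].T        # n_comps x n_vars loadings
    X_pca = Xc @ components.T               # n_obs x n_comps embedding
    variance_ratio = variance / eigvals.sum()
    return X_pca, components, variance, variance_ratio


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))
X_pca, components, variance, variance_ratio = pca_covariance_eigh(X, n_comps=5)
```

The covariance matrix has shape n_vars × n_vars, so its size grows quadratically with the number of features. This is consistent with the guidance above to prefer this approach when the feature count is moderate.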

random_state int | None (default: 0)

Random state for initialization.

mask_var ndarray[tuple[Any, ...], dtype[bool]] | str | None (default: <object object>)

Mask to use for the PCA computation. If None, all variables are used. If np.ndarray, use the provided mask. If str, use the mask stored in adata.var[mask_var].
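
The three documented cases can be sketched as follows. This is a CPU illustration with NumPy; `resolve_mask` is a hypothetical helper, `var_columns` is a plain dict standing in for `adata.var`, and the behaviour of the default sentinel value is not covered.

```python
import numpy as np


def resolve_mask(mask_var, var_columns, n_vars):
    """Turn a mask_var argument into a boolean vector over variables."""
    if mask_var is None:
        return np.ones(n_vars, dtype=bool)       # use all variables
    if isinstance(mask_var, str):
        mask = np.asarray(var_columns[mask_var], dtype=bool)  # named column
    else:
        mask = np.asarray(mask_var, dtype=bool)  # explicit boolean array
    if mask.shape != (n_vars,):
        raise ValueError("mask must have one entry per variable")
    return mask


var_columns = {"highly_variable": [True, False, True, True]}
X = np.arange(12).reshape(3, 4)
X_masked = X[:, resolve_mask("highly_variable", var_columns, n_vars=4)]
```

Here `X_masked` keeps only the three columns flagged in the stand-in `highly_variable` column.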

use_highly_variable bool | None (default: None)

Whether to use highly variable genes only, stored in .var['highly_variable']. By default uses them if they have been determined beforehand.

dtype str (default: 'float32')

NumPy data type string to which the result is converted.

chunked bool (default: False)

If True, perform an incremental PCA on segments of chunk_size observations. The incremental PCA always zero-centers the data and ignores the random_state and svd_solver settings. If False, perform a full PCA.

chunk_size int (default: None)

Number of observations to include in each chunk. Required if chunked=True was passed.
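
As a rough picture of what chunking over observations means, the sketch below yields row ranges of at most `chunk_size` observations. How the library actually partitions the data is not documented here, so `chunk_bounds` is a hypothetical helper for illustration only.

```python
def chunk_bounds(n_obs, chunk_size):
    """Yield (start, stop) row ranges that together cover n_obs rows."""
    if chunk_size is None or chunk_size <= 0:
        raise ValueError("chunk_size must be a positive integer")
    for start in range(0, n_obs, chunk_size):
        # The last chunk may be shorter than chunk_size.
        yield start, min(start + chunk_size, n_obs)


bounds = list(chunk_bounds(n_obs=10, chunk_size=4))
```

Each range would correspond to one batch passed to the incremental fit, which is why only one chunk needs to be processed at a time.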

key_added str | None (default: None)

If not specified, the embedding is stored as obsm['X_pca'], the loadings as varm['PCs'], and the parameters in uns['pca']. If specified, the embedding is stored as obsm[key_added], the loadings as varm[key_added], and the parameters in uns[key_added].

return_info bool (default: False)

Only relevant when passing a matrix instead of an AnnData: see “Returns”.

copy bool (default: False)

Whether to return a copy or update data. Only applies to AnnData input.

**kwargs

Additional arguments for specific SVD solvers. For svd_solver='randomized':

n_oversamples: Extra random vectors for better approximation. Higher values improve accuracy. Default is 10.
n_iter: Number of power iterations. Higher values improve accuracy for matrices with slowly decaying singular values. Default is 2.
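
To show where `n_oversamples` and `n_iter` enter, here is a CPU-only NumPy sketch of a randomized SVD in the style of Halko et al. It is not the library's implementation: it uses NumPy's Householder QR in place of CholeskyQR2, and the function name `randomized_svd` and the `seed` argument are assumptions for this example.

```python
import numpy as np


def randomized_svd(A, n_comps, n_oversamples=10, n_iter=2, seed=0):
    """Approximate the top n_comps singular triplets of A (CPU sketch)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = A.shape
    # Sample a slightly larger subspace than requested for accuracy.
    k = min(n_comps + n_oversamples, n_rows, n_cols)
    Q, _ = np.linalg.qr(A @ rng.normal(size=(n_cols, k)))
    # Power iterations sharpen the subspace when singular values decay slowly.
    for _ in range(n_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)
    B = Q.T @ A                              # small k x n_cols problem
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :n_comps], s[:n_comps], Vt[:n_comps]


rng = np.random.default_rng(1)
A = rng.normal(size=(300, 5)) @ rng.normal(size=(5, 40))  # exactly rank 5
U, s, Vt = randomized_svd(A, n_comps=5)
exact = np.linalg.svd(A, compute_uv=False)[:5]
```

On this exactly rank-5 test matrix the sampled subspace captures the full column space, so `s` agrees with the exact singular values to numerical precision. On real data the result is approximate, and larger `n_oversamples` or `n_iter` trade run time for accuracy.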

Return type:

Union[None, AnnData, ndarray, csc_matrix, csr_matrix, Array]

Returns:

If a matrix is passed and return_info=False, the PCA representation is returned. If a matrix is passed and return_info=True, a tuple of (X_pca, components, variance_ratio, variance) is returned.

If an AnnData object is passed, adds fields to adata:

.obsm['X_pca' | key_added]
PCA representation of data.

.varm['PCs' | key_added]
The principal components containing the loadings.

.uns['pca' | key_added]['variance_ratio']
Ratio of explained variance.

.uns['pca' | key_added]['variance']
Explained variance, equivalent to the eigenvalues of the covariance matrix.
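
A hedged usage sketch that collects the fields listed above. `result_keys` and `run_pca` are hypothetical helpers written for this example. The call itself needs a CUDA-capable environment with rapids_singlecell installed, and `adata` is assumed to be an AnnData object in a form the function accepts.

```python
def result_keys(key_added=None):
    """Return the obsm/varm/uns keys used for storage, as documented above."""
    if key_added is None:
        return {"obsm": "X_pca", "varm": "PCs", "uns": "pca"}
    return {"obsm": key_added, "varm": key_added, "uns": key_added}


def run_pca(adata, n_comps=50, key_added=None):
    """Run PCA in place on adata and gather the documented result fields."""
    import rapids_singlecell as rsc  # imported lazily: requires a GPU setup

    rsc.pp.pca(adata, n_comps=n_comps, key_added=key_added)
    keys = result_keys(key_added)
    return {
        "embedding": adata.obsm[keys["obsm"]],
        "loadings": adata.varm[keys["varm"]],
        "variance_ratio": adata.uns[keys["uns"]]["variance_ratio"],
        "variance": adata.uns[keys["uns"]]["variance"],
    }
```

Passing `key_added` keeps several PCA runs side by side in the same object, for example one on all genes and one on a masked subset.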
