rapids_singlecell.pp.highly_variable_genes

rapids_singlecell.pp.highly_variable_genes(adata, *, layer=None, min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=inf, n_top_genes=None, flavor='seurat', n_bins=20, span=0.3, check_values=True, theta=100, clip=None, chunksize=1000, n_samples=10000, batch_key=None)

Annotate highly variable genes [AH19, LBK21, SFG+15, SBH+19, ZTB+17].

Expects logarithmized data, except when flavor is 'seurat_v3', 'seurat_v3_paper', 'pearson_residuals', or 'poisson_gene_selection', in which case count data is expected.

Reimplementation of scanpy’s function. Depending on flavor, this reproduces the R-implementations of Seurat, Cell Ranger, Seurat v3 and Pearson Residuals. Flavor poisson_gene_selection calculates analytical Poisson gene selection based on M3Drop using CuPy with CUDA kernels.

For the dispersion-based methods (flavor='seurat' and flavor='cell_ranger'), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions of the genes that fall into the same mean-expression bin. Highly variable genes are therefore selected relative to other genes with similar mean expression.
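The per-bin scaling described above can be sketched in plain Python. This is a minimal illustration rather than the library's GPU implementation: the function name normalize_dispersions and the toy inputs are hypothetical, the sample standard deviation is an assumption, and the single-gene rule follows the n_bins description on this page.

```python
from collections import defaultdict
from statistics import mean, stdev


def normalize_dispersions(dispersions, bins):
    """Z-score each gene's dispersion within its mean-expression bin.

    Hypothetical helper for illustration only. Assumes the dispersions
    inside a multi-gene bin are not all identical (non-zero spread).
    """
    by_bin = defaultdict(list)
    for disp, b in zip(dispersions, bins):
        by_bin[b].append(disp)

    normalized = []
    for disp, b in zip(dispersions, bins):
        members = by_bin[b]
        if len(members) < 2:
            # A bin with a single gene has no spread; the normalized
            # dispersion is set to 1, as stated for n_bins.
            normalized.append(1.0)
        else:
            normalized.append((disp - mean(members)) / stdev(members))
    return normalized


# Toy input: six genes that were already assigned to three bins.
toy_dispersions = [1.0, 2.0, 3.0, 10.0, 14.0, 5.0]
toy_bins = [0, 0, 0, 1, 1, 2]
toy_normalized = normalize_dispersions(toy_dispersions, toy_bins)
```

A gene with a raw dispersion of 3.0 in a low-dispersion bin can thus score higher than a gene with a raw dispersion of 10.0 in a high-dispersion bin.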

For flavor='seurat_v3'/'seurat_v3_paper', a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance. The two flavors differ only if batch_key is not None: for flavor='seurat_v3', genes are first sorted by the median (across batches) rank, with ties broken by the number of batches in which a gene is an HVG. For flavor='seurat_v3_paper', genes are first sorted by the number of batches in which a gene is an HVG, with ties broken by the median (across batches) rank.
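The difference between the two batch-aware orderings can be illustrated with two sort keys. The gene names and numbers below are made up, and the sketch assumes that a lower rank means a more variable gene and that more batches is better.

```python
# Hypothetical per-gene summaries across batches:
# (gene, median rank across batches, number of batches in which it is an HVG)
genes = [
    ("geneA", 2.0, 1),
    ("geneB", 5.0, 3),
    ("geneC", 2.0, 2),
    ("geneD", 8.0, 3),
]

# flavor='seurat_v3': median rank first, ties broken by number of batches.
seurat_v3 = sorted(genes, key=lambda g: (g[1], -g[2]))

# flavor='seurat_v3_paper': number of batches first, ties broken by median rank.
seurat_v3_paper = sorted(genes, key=lambda g: (-g[2], g[1]))

order_v3 = [g[0] for g in seurat_v3]
order_paper = [g[0] for g in seurat_v3_paper]
print(order_v3)     # ['geneC', 'geneA', 'geneB', 'geneD']
print(order_paper)  # ['geneB', 'geneD', 'geneC', 'geneA']
```

With a small n_top_genes, the two flavors can therefore select different genes from the same per-batch results.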

The following may help when comparing to Seurat’s naming: If batch_key=None and flavor='seurat', this mimics Seurat’s FindVariableFeatures(..., method='mean.var.plot'). If batch_key=None and flavor='seurat_v3'/flavor='seurat_v3_paper', this mimics Seurat’s FindVariableFeatures(..., method='vst'). If batch_key is not None and flavor='seurat_v3_paper', this mimics Seurat’s SelectIntegrationFeatures.

Parameters:

adata AnnData

AnnData object

layer str (default: None)

If provided, use adata.layers[layer] for expression values instead of adata.X.

min_mean float (default: 0.0125)

Lower cutoff for the gene means. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

max_mean float (default: 3)

Upper cutoff for the gene means. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

min_disp float (default: 0.5)

Lower cutoff for the normalized dispersions. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

max_disp float (default: inf)

Upper cutoff for the normalized dispersions. If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

n_top_genes int (default: None)

Number of highly-variable genes to keep.
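How n_top_genes interacts with the four cutoffs can be sketched as follows for the dispersion-based flavors. The helper select_hvg is hypothetical, and the use of strict inequalities is an assumption of this sketch, not a statement about the library's code.

```python
import math


def select_hvg(means, disp_norm, n_top_genes=None,
               min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=math.inf):
    """Return one boolean per gene (hypothetical helper, illustration only)."""
    if n_top_genes is not None:
        # Cutoffs are ignored: keep the genes with the largest
        # normalized dispersion.
        order = sorted(range(len(disp_norm)),
                       key=lambda i: disp_norm[i], reverse=True)
        top = set(order[:n_top_genes])
        return [i in top for i in range(len(disp_norm))]
    # Otherwise a gene must pass both the mean and the dispersion cutoffs.
    return [
        min_mean < m < max_mean and min_disp < d < max_disp
        for m, d in zip(means, disp_norm)
    ]


toy_means = [0.001, 0.5, 1.0, 5.0]
toy_disp_norm = [2.0, 0.1, 1.5, 3.0]

by_cutoffs = select_hvg(toy_means, toy_disp_norm)
by_top_n = select_hvg(toy_means, toy_disp_norm, n_top_genes=2)
print(by_cutoffs)  # [False, False, True, False]
print(by_top_n)    # [True, False, False, True]
```

In the toy data, the first and last genes fail the mean cutoffs but are still selected once n_top_genes is set.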

flavor Literal['seurat', 'cell_ranger', 'seurat_v3', 'seurat_v3_paper', 'pearson_residuals', 'poisson_gene_selection'] (default: 'seurat')

Choose the flavor for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.

n_bins int (default: 20)

Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1.

span float (default: 0.3)

The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.

check_values bool (default: True)

Check whether the counts in the selected layer are integers. If set to True, a warning is issued when they are not. Only used if flavor='seurat_v3' or 'pearson_residuals'.

theta int (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
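A short numeric check of the variance formula shows how theta controls overdispersion. The function name nb_variance is a hypothetical helper for illustration.

```python
import math


def nb_variance(mean, theta):
    """Negative binomial variance: var = mean + mean^2 / theta."""
    return mean + mean ** 2 / theta


var_theta_10 = nb_variance(2.0, 10)          # strong overdispersion
var_theta_100 = nb_variance(2.0, 100)        # the default theta
var_poisson = nb_variance(2.0, math.inf)     # mean^2 / inf is 0.0

print(var_theta_10, var_theta_100, var_poisson)
```

As theta grows, the quadratic term vanishes and the variance approaches the mean, which is the Poisson case.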

clip bool (default: None)

Only used if flavor='pearson_residuals'. Determines if and how residuals are clipped:

If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
If a scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
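The two clipping modes can be sketched in a few lines. The helper clip_residuals and the toy values are hypothetical and only mirror the rules listed above.

```python
import math


def clip_residuals(residuals, n_obs, clip=None):
    """Clip residuals to [-c, c]; c defaults to sqrt(n_obs) when clip is None."""
    c = math.sqrt(n_obs) if clip is None else clip
    return [max(-c, min(c, r)) for r in residuals]


toy_residuals = [-50.0, 3.0, 50.0]

default_clip = clip_residuals(toy_residuals, n_obs=100)               # c = 10
custom_clip = clip_residuals(toy_residuals, n_obs=100, clip=5.0)      # c = 5
no_clip = clip_residuals(toy_residuals, n_obs=100, clip=math.inf)     # unchanged

print(default_clip)  # [-10.0, 3.0, 10.0]
print(custom_clip)   # [-5.0, 3.0, 5.0]
print(no_clip)       # [-50.0, 3.0, 50.0]
```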

chunksize int (default: 1000)

If flavor='poisson_gene_selection', this determines how many genes are processed at once. Choosing a smaller value reduces the required memory.

n_samples int (default: 10000)

The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene (only for flavor='poisson_gene_selection').

batch_key str | None (default: None)

If specified, highly-variable genes are selected within each batch separately and merged.

Return type:

None

Returns:

Updates adata.var with the following fields:

highly_variable: bool: boolean indicator of highly-variable genes
means: float: means per gene
dispersions: float: For dispersion-based flavors, dispersions per gene
dispersions_norm: float: For dispersion-based flavors, normalized dispersions per gene
variances: float: For flavor='seurat_v3', 'pearson_residuals', variance per gene
variances_norm: float: For flavor='seurat_v3', normalized variance per gene, averaged in the case of multiple batches
residual_variances: float: For flavor='pearson_residuals', residual variance per gene, averaged in the case of multiple batches
highly_variable_rank: float: For flavor='seurat_v3', 'pearson_residuals', rank of the gene according to normalized variance, median rank in the case of multiple batches
highly_variable_nbatches: int: If batch_key is given, this denotes in how many batches genes are detected as HVG
highly_variable_intersection: bool: If batch_key is given, this denotes the genes that are highly variable in all batches
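The two batch-related fields can be illustrated with plain sets. The batch names, gene names, and per-batch HVG calls below are invented for the example, and the sketch only mirrors the field descriptions above, not the library's internals.

```python
# Hypothetical per-batch results: the genes called highly variable in each batch.
per_batch_hvg = {
    "batch1": {"geneA", "geneB", "geneC"},
    "batch2": {"geneB", "geneC"},
    "batch3": {"geneC", "geneD"},
}
all_genes = ["geneA", "geneB", "geneC", "geneD"]

# highly_variable_nbatches: in how many batches each gene is an HVG.
nbatches = {
    gene: sum(gene in hvgs for hvgs in per_batch_hvg.values())
    for gene in all_genes
}

# highly_variable_intersection: True only if the gene is an HVG in all batches.
intersection = {
    gene: count == len(per_batch_hvg) for gene, count in nbatches.items()
}

print(nbatches)      # {'geneA': 1, 'geneB': 2, 'geneC': 3, 'geneD': 1}
print(intersection)  # {'geneA': False, 'geneB': False, 'geneC': True, 'geneD': False}
```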
