rapids_singlecell.pp.highly_variable_genes

rapids_singlecell.pp.highly_variable_genes(adata, *, layer=None, min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=inf, n_top_genes=None, flavor='seurat', n_bins=20, span=0.3, check_values=True, theta=100, clip=None, chunksize=1000, n_samples=10000, batch_key=None)

Annotate highly variable genes. Expects logarithmized data, except when flavor is 'seurat_v3', 'pearson_residuals', or 'poisson_gene_selection', in which case count data is expected.
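The input expectation above can be captured in a small helper. This is only a sketch: `hvg_kwargs` and the layer name "counts" are hypothetical and not part of rapids_singlecell; only the `flavor` and `layer` parameter names come from the signature above.

```python
# Flavors that, per the description above, expect raw count data.
COUNT_FLAVORS = {"seurat_v3", "pearson_residuals", "poisson_gene_selection"}
# Dispersion-based flavors, which expect logarithmized data.
LOG_FLAVORS = {"seurat", "cell_ranger"}


def hvg_kwargs(flavor, counts_layer="counts"):
    """Build keyword arguments for the chosen flavor (hypothetical helper).

    `counts_layer` is an assumed name for a layer holding raw counts.
    """
    if flavor in COUNT_FLAVORS:
        # Point the function at the raw counts instead of adata.X.
        return {"flavor": flavor, "layer": counts_layer}
    if flavor in LOG_FLAVORS:
        # adata.X is assumed to already hold logarithmized data.
        return {"flavor": flavor, "layer": None}
    raise ValueError(f"unknown flavor: {flavor!r}")


kwargs = hvg_kwargs("seurat_v3")
```

The resulting dictionary could then be unpacked into the call, for example `highly_variable_genes(adata, **kwargs)`, assuming the raw counts were stored under that layer name beforehand.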

Reimplementation of scanpy’s function. Depending on flavor, this reproduces the R implementations of Seurat, Cell Ranger, Seurat v3, and Pearson residuals. The poisson_gene_selection flavor follows the scvi implementation, which is based on M3Drop; it requires GPU-accelerated PyTorch to be installed.

For the dispersion-based methods (flavor='seurat' and flavor='cell_ranger'), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions of the genes that fall into the same mean-expression bin. This means that highly variable genes are selected within each bin of mean expression.
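The per-bin scaling described above can be sketched in plain Python. This is an illustration, not the library's GPU code: the bin assignments are supplied by hand, the use of the sample standard deviation is an assumption, and the value 1 for single-gene bins follows the n_bins description in the parameter list.

```python
from statistics import mean, stdev


def normalize_dispersions(dispersions, bins):
    """Z-score each gene's dispersion within its mean-expression bin."""
    # Group dispersions by bin label.
    by_bin = {}
    for d, b in zip(dispersions, bins):
        by_bin.setdefault(b, []).append(d)

    normalized = []
    for d, b in zip(dispersions, bins):
        group = by_bin[b]
        if len(group) == 1:
            # A single-gene bin has no spread; the value is set to 1.
            normalized.append(1.0)
            continue
        sd = stdev(group)  # sample standard deviation (assumption)
        # Guard against identical dispersions in a bin (assumption: use 0).
        normalized.append((d - mean(group)) / sd if sd > 0 else 0.0)
    return normalized


# Toy input: five genes assigned by hand to three bins.
dispersions = [1.0, 2.0, 3.0, 5.0, 0.7]
bins = [0, 0, 0, 1, 2]
result = normalize_dispersions(dispersions, bins)
```

In this toy input, the three genes in bin 0 are scaled relative to each other, while the genes that sit alone in bins 1 and 2 receive the fixed value.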

For Seurat v3, a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance.
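A minimal sketch of the last two steps (standardize, then take the variance) follows. It assumes the regularized standard deviation is already known for each gene; in the real method it comes from a model fit that is not reproduced here. The function name and the toy counts are illustrative only.

```python
from statistics import mean


def normalized_variance(counts, regularized_std):
    """Variance of one gene after z-scoring with a supplied regularized std."""
    mu = mean(counts)
    z = [(x - mu) / regularized_std for x in counts]
    # The z-scores are centered on zero, so the sample variance is the
    # sum of squares divided by (n - 1).
    return sum(v * v for v in z) / (len(z) - 1)


# Toy data: (counts across three cells, assumed regularized std) per gene.
genes = {
    "gene_a": ([0, 2, 4], 2.0),
    "gene_b": ([0, 0, 6], 2.0),
}
scores = {name: normalized_variance(c, s) for name, (c, s) in genes.items()}
# Rank genes from highest to lowest normalized variance.
ranking = sorted(scores, key=scores.get, reverse=True)
```

Both toy genes have the same mean and the same assumed regularized standard deviation, so the gene whose counts are concentrated in one cell ends up with the higher normalized variance and ranks first.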

Parameters:
adata AnnData

AnnData object

layer str (default: None)

If provided, use adata.layers[layer] for expression values instead of adata.X.

min_mean float (default: 0.0125)

If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

max_mean float (default: 3)

If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

min_disp float (default: 0.5)

If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

max_disp float (default: inf)

If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.

n_top_genes int (default: None)

Number of highly-variable genes to keep.
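The interplay between n_top_genes and the four cutoffs can be sketched as follows. This is a simplified illustration: `select_genes` is a hypothetical name, the strict inequalities are an assumption, and ranking by normalized dispersion applies to the dispersion-based flavors only.

```python
def select_genes(means, disp_norm, n_top_genes=None,
                 min_mean=0.0125, max_mean=3,
                 min_disp=0.5, max_disp=float("inf")):
    """Return one boolean per gene marking it as highly variable."""
    if n_top_genes is not None:
        # Cutoffs are ignored: keep the n genes with the largest
        # normalized dispersion.
        order = sorted(range(len(disp_norm)),
                       key=lambda i: disp_norm[i], reverse=True)
        top = set(order[:n_top_genes])
        return [i in top for i in range(len(disp_norm))]
    # Otherwise apply the mean and dispersion cutoffs
    # (strict bounds are an assumption).
    return [
        min_mean < m < max_mean and min_disp < d < max_disp
        for m, d in zip(means, disp_norm)
    ]


means = [0.01, 1.0, 2.0, 5.0]
disp_norm = [2.0, 0.8, 0.2, 3.0]
by_cutoffs = select_genes(means, disp_norm)
by_top_n = select_genes(means, disp_norm, n_top_genes=2)
```

With the default cutoffs only the second toy gene passes, whereas n_top_genes=2 keeps the first and last genes regardless of their means.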

flavor str (default: 'seurat')

Choose the flavor (seurat, cell_ranger, seurat_v3, pearson_residuals, poisson_gene_selection) for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.

n_bins int (default: 20)

Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1.

span float (default: 0.3)

The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.

check_values bool (default: True)

Check whether the counts in the selected layer are integers. If True, a warning is issued when non-integer values are found. Only used if flavor='seurat_v3' or flavor='pearson_residuals'.

theta int (default: 100)

The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
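The variance formula above can be checked numerically with a small sketch. `nb_variance` is an illustrative name, not a library function.

```python
import math


def nb_variance(mean, theta):
    """Negative binomial variance: mean + mean^2 / theta."""
    # Dividing by an infinite theta gives 0.0, which leaves the
    # Poisson variance (equal to the mean).
    return mean + mean**2 / theta


default_var = nb_variance(10, 100)        # moderate overdispersion
poisson_var = nb_variance(10, math.inf)   # Poisson limit
```

For a mean of 10, the default theta=100 adds one unit of variance on top of the Poisson value, and the extra term vanishes as theta grows.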

clip bool (default: None)

Only used if flavor='pearson_residuals'. Determines if and how residuals are clipped:
  • If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).

  • If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
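The two clipping rules can be sketched for a single residual. `clip_residual` is a hypothetical helper written for illustration; the library applies the equivalent operation to whole arrays.

```python
import math


def clip_residual(residual, n_obs, clip=None):
    """Clip one Pearson residual following the rules listed above."""
    if clip is None:
        # Default: bound by the square root of the number of cells.
        clip = math.sqrt(n_obs)
    return max(-clip, min(clip, residual))


default_clipped = clip_residual(15.0, n_obs=100)            # bound is 10
custom_clipped = clip_residual(15.0, n_obs=100, clip=5.0)   # bound is 5
unclipped = clip_residual(15.0, n_obs=100, clip=math.inf)   # no clipping
```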

chunksize int (default: 1000)

If flavor='poisson_gene_selection', this determines how many genes are processed at once. Choosing a smaller value reduces the required memory.
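As a rough illustration of chunked processing, the sketch below splits a gene axis into consecutive index ranges. How the library actually iterates over genes is an assumption here; `gene_chunks` is a hypothetical name.

```python
def gene_chunks(n_genes, chunksize=1000):
    """Yield (start, stop) index pairs covering all genes in order."""
    for start in range(0, n_genes, chunksize):
        # The last chunk may be shorter than chunksize.
        yield start, min(start + chunksize, n_genes)


chunks = list(gene_chunks(2500))
```

Only one chunk of genes needs to be held in working memory at a time, which is why a smaller chunksize lowers peak memory at the cost of more iterations.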

n_samples int (default: 10000)

The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene (only for flavor='poisson_gene_selection').

batch_key str (default: None)

If specified, highly-variable genes are selected within each batch separately and merged.

Return type:

None

Returns:

Updates adata.var with the following fields:

highly_variable: bool

boolean indicator of highly-variable genes

means: float

means per gene

dispersions: float

For dispersion-based flavors, dispersions per gene

dispersions_norm: float

For dispersion-based flavors, normalized dispersions per gene

variances: float

For flavor='seurat_v3' and flavor='pearson_residuals', variance per gene

variances_norm: float

For flavor='seurat_v3', normalized variance per gene, averaged in the case of multiple batches

residual_variances: float

For flavor='pearson_residuals', residual variance per gene. Averaged in the case of multiple batches.

highly_variable_rank: float

For flavor='seurat_v3' and flavor='pearson_residuals', rank of the gene according to normalized variance; median rank in the case of multiple batches

highly_variable_nbatches: int

If batch_key is given, the number of batches in which each gene is detected as highly variable

highly_variable_intersection: bool

If batch_key is given, this denotes the genes that are highly variable in all batches
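The two batch-related fields can be derived from per-batch selections, as in this plain-Python sketch. The batch names and flags are toy values, and `merge_batches` is a hypothetical helper rather than the library's implementation.

```python
def merge_batches(per_batch_flags):
    """Combine per-batch boolean selections into per-gene summaries.

    `per_batch_flags` maps a batch name to one boolean per gene.
    """
    flags = list(per_batch_flags.values())
    n_batches = len(flags)
    n_genes = len(flags[0])
    # Number of batches in which each gene was selected.
    nbatches = [sum(f[g] for f in flags) for g in range(n_genes)]
    # Genes selected in every batch.
    intersection = [count == n_batches for count in nbatches]
    return nbatches, intersection


nbatches, intersection = merge_batches({
    "batch_1": [True, True, False],
    "batch_2": [True, False, False],
})
```

Here the first gene is selected in both batches and therefore belongs to the intersection, the second is selected in only one batch, and the third in none.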