rapids_singlecell.pp.highly_variable_genes
- rapids_singlecell.pp.highly_variable_genes(adata, *, layer=None, min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=inf, n_top_genes=None, flavor='seurat', n_bins=20, span=0.3, check_values=True, theta=100, clip=None, chunksize=1000, n_samples=10000, batch_key=None)
Annotate highly variable genes [AH19, LBK21, SFG+15, SBH+19, ZTB+17].
Expects logarithmized data, except when flavor is 'seurat_v3', 'seurat_v3_paper', 'pearson_residuals', or 'poisson_gene_selection', in which case count data is expected.
Reimplementation of scanpy’s function. Depending on flavor, this reproduces the R implementations of Seurat, Cell Ranger, Seurat v3, and Pearson residuals. Flavor 'poisson_gene_selection' calculates analytical Poisson gene selection based on M3Drop, using CuPy with CUDA kernels.
For the dispersion-based flavors ('seurat' and 'cell_ranger'), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that highly variable genes are selected within each bin of mean expression.
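The per-bin scaling described above can be illustrated with a minimal pure-Python sketch. It is not the library's GPU implementation: the helper name `normalized_dispersions`, the equal-width binning, and the handling of zero-spread bins are assumptions made for illustration. Single-gene bins are set to 1, as documented for n_bins.

```python
import statistics


def normalized_dispersions(means, dispersions, n_bins=20):
    """Scale each gene's dispersion by the mean and std of its mean-expression bin.

    Hypothetical helper: equal-width bins are an assumption of this sketch.
    """
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # guard against all-equal means

    # Assign every gene to a bin of mean expression.
    members = {}
    for gene, m in enumerate(means):
        b = min(int((m - lo) / width), n_bins - 1)
        members.setdefault(b, []).append(gene)

    result = [0.0] * len(means)
    for genes in members.values():
        if len(genes) == 1:
            # A bin with a single gene gets a normalized dispersion of 1.
            result[genes[0]] = 1.0
            continue
        values = [dispersions[g] for g in genes]
        mu = statistics.mean(values)
        sd = statistics.stdev(values)
        for g in genes:
            # Assumption: a bin with zero spread maps to 0 instead of dividing by 0.
            result[g] = (dispersions[g] - mu) / sd if sd > 0 else 0.0
    return result
```

With means [0.1, 0.2, 0.3, 5.0], dispersions [1.0, 2.0, 3.0, 9.0], and n_bins=2, the first three genes share a bin and are z-scored against each other, while the fourth gene sits alone in its bin and receives 1.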
For flavor='seurat_v3' / 'seurat_v3_paper', a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance. The two flavors differ only if batch_key is not None: for flavor='seurat_v3', genes are first sorted by the median (across batches) rank, with ties broken by the number of batches in which a gene is an HVG. For flavor='seurat_v3_paper', genes are first sorted by the number of batches in which a gene is an HVG, with ties broken by the median (across batches) rank.
The following may help when comparing to Seurat’s naming: if batch_key=None and flavor='seurat', this mimics Seurat’s FindVariableFeatures(..., method='mean.var.plot'). If batch_key=None and flavor='seurat_v3' / 'seurat_v3_paper', this mimics Seurat’s FindVariableFeatures(..., method='vst'). If batch_key is not None and flavor='seurat_v3_paper', this mimics Seurat’s SelectIntegrationFeatures.
- Parameters:
- adata
AnnData AnnData object
- layer
str(default:None) If provided, use adata.layers[layer] for expression values instead of adata.X.
- min_mean
float(default:0.0125) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- max_mean
float(default:3) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- min_disp
float(default:0.5) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- max_disp
float(default:inf) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- n_top_genes
int(default:None) Number of highly-variable genes to keep.
- flavor
Literal['seurat','cell_ranger','seurat_v3','seurat_v3_paper','pearson_residuals','poisson_gene_selection'] (default:'seurat') Flavor used to identify highly variable genes. In their default workflows for the dispersion-based methods, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
- n_bins
int(default:20) Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1.
- span
float(default:0.3) The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.
- check_values
bool(default:True) Check if counts in the selected layer are integers. If True, a warning is issued when non-integer values are found. Only used if flavor='seurat_v3' or 'pearson_residuals'.
- theta
int(default:100) The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
- clip
bool(default:None) Only used if flavor='pearson_residuals'. Determines if and how residuals are clipped. If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior). If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
- chunksize
int(default:1000) If flavor='poisson_gene_selection', this determines how many genes are processed at once. Choosing a smaller value will reduce the required memory.
- n_samples
int(default:10000) The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene (only for flavor='poisson_gene_selection').
- batch_key
str|None(default:None) If specified, highly-variable genes are selected within each batch separately and merged.
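To make the roles of theta and clip concrete, here is a rough pure-Python sketch of a per-gene residual variance under the negative binomial variance model stated for theta (var = mean + mean^2/theta). The helper name `pearson_residual_variances` is hypothetical, and the expected count per entry (cell total times gene total over grand total) is an assumption of this sketch, not a statement about the library's internals.

```python
import math


def pearson_residual_variances(counts, theta=100.0, clip=None):
    """Variance of clipped Pearson residuals per gene (illustrative sketch).

    counts: list of rows (cells), each a list of raw counts per gene.
    """
    n_obs = len(counts)
    n_vars = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[j] for row in counts) for j in range(n_vars)]
    total = sum(cell_totals)

    # Default clipping interval: [-sqrt(n_obs), sqrt(n_obs)].
    if clip is None:
        clip = math.sqrt(n_obs)

    variances = []
    for j in range(n_vars):
        residuals = []
        for i in range(n_obs):
            # Assumed expected count for cell i, gene j.
            mu = cell_totals[i] * gene_totals[j] / total
            # Negative binomial variance; theta=math.inf reduces to Poisson.
            var = mu + mu * mu / theta
            r = (counts[i][j] - mu) / math.sqrt(var) if var > 0 else 0.0
            residuals.append(max(-clip, min(clip, r)))
        mean_r = sum(residuals) / n_obs
        variances.append(sum((r - mean_r) ** 2 for r in residuals) / n_obs)
    return variances
```

Passing theta=math.inf gives Poisson residuals, and clip=math.inf disables clipping, mirroring the parameter descriptions above.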
- Return type:
- Returns:
Updates adata.var with the following fields:
- highly_variable (bool): Boolean indicator of highly variable genes.
- means (float): Means per gene.
- dispersions (float): For dispersion-based flavors, dispersions per gene.
- dispersions_norm (float): For dispersion-based flavors, normalized dispersions per gene.
- variances (float): For flavor='seurat_v3' and 'pearson_residuals', variance per gene.
- variances_norm (float): For flavor='seurat_v3', normalized variance per gene, averaged in the case of multiple batches.
- residual_variances (float): For flavor='pearson_residuals', residual variance per gene, averaged in the case of multiple batches.
- highly_variable_rank (float): For flavor='seurat_v3' and 'pearson_residuals', rank of the gene according to normalized variance; median rank in the case of multiple batches.
- highly_variable_nbatches (int): If batch_key is given, the number of batches in which the gene is detected as an HVG.
- highly_variable_intersection (bool): If batch_key is given, marks the genes that are highly variable in all batches.
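The difference between 'seurat_v3' and 'seurat_v3_paper' when batch_key is set comes down to the order of the two sort keys. The following sketch shows only that ordering rule on precomputed per-gene statistics. The helper `rank_genes` and the input layout (gene mapped to median rank and number of batches) are hypothetical and chosen for illustration.

```python
def rank_genes(stats, flavor="seurat_v3"):
    """Order genes from most to least preferred.

    stats: dict mapping gene name -> (median_rank, n_batches), where a lower
    median rank is better and n_batches counts the batches in which the gene
    was an HVG.
    """
    if flavor == "seurat_v3":
        # Median rank first; ties broken by number of batches (more is better).
        def key(gene):
            return (stats[gene][0], -stats[gene][1])
    elif flavor == "seurat_v3_paper":
        # Number of batches first; ties broken by median rank.
        def key(gene):
            return (-stats[gene][1], stats[gene][0])
    else:
        raise ValueError(f"unsupported flavor: {flavor!r}")
    return sorted(stats, key=key)


stats = {"A": (1.0, 1), "B": (2.0, 3), "C": (2.0, 2)}
print(rank_genes(stats, "seurat_v3"))        # ['A', 'B', 'C']
print(rank_genes(stats, "seurat_v3_paper"))  # ['B', 'C', 'A']
```

Gene A has the best median rank but is an HVG in only one batch, so it leads under 'seurat_v3' and comes last under 'seurat_v3_paper'.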