rapids_singlecell.pp.highly_variable_genes
- rapids_singlecell.pp.highly_variable_genes(adata, *, layer=None, min_mean=0.0125, max_mean=3, min_disp=0.5, max_disp=inf, n_top_genes=None, flavor='seurat', n_bins=20, span=0.3, check_values=True, theta=100, clip=None, chunksize=1000, n_samples=10000, batch_key=None)
Annotate highly variable genes. Expects logarithmized data, except when flavor is 'seurat_v3', 'pearson_residuals', or 'poisson_gene_selection', in which case count data is expected.
Reimplementation of scanpy's function. Depending on flavor, this reproduces the R implementations of Seurat, Cell Ranger, Seurat v3, and Pearson residuals. The poisson_gene_selection flavor follows the implementation in scvi, which is based on M3Drop. It requires GPU-accelerated PyTorch to be installed.
For the dispersion-based methods (flavor='seurat' and 'cell_ranger'), the normalized dispersion is obtained by scaling with the mean and standard deviation of the dispersions for genes falling into a given bin for mean expression of genes. This means that for each bin of mean expression, highly variable genes are selected.
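As a rough illustration of the per-bin scaling described above, the following pure-Python sketch assigns genes to equal-width bins of mean expression and z-scores each dispersion within its bin. The function name `normalized_dispersions`, the equal-width binning, and the zero-spread fallback are assumptions made for illustration and are not the library's GPU implementation; the single-gene rule follows the n_bins parameter description.

```python
from statistics import mean, stdev


def normalized_dispersions(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its bin of mean expression."""
    lo, hi = min(means), max(means)
    # Equal-width bins (an assumption); fall back to width 1.0 if all means are equal.
    width = (hi - lo) / n_bins or 1.0
    bins = [min(int((m - lo) / width), n_bins - 1) for m in means]

    # Collect the dispersions that fall into each bin.
    groups = {}
    for b, d in zip(bins, dispersions):
        groups.setdefault(b, []).append(d)

    out = []
    for b, d in zip(bins, dispersions):
        members = groups[b]
        if len(members) == 1:
            # A bin holding a single gene gets a normalized dispersion of 1.
            out.append(1.0)
        else:
            sd = stdev(members)
            # Hypothetical fallback: no spread in the bin means no excess dispersion.
            out.append((d - mean(members)) / sd if sd > 0 else 0.0)
    return out
```

Because the scaling is done per bin, a gene is compared only against genes of similar mean expression, which is why highly variable genes are selected within each bin.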
For Seurat v3, a normalized variance for each gene is computed. First, the data are standardized (i.e., z-score normalization per feature) with a regularized standard deviation. Next, the normalized variance is computed as the variance of each gene after the transformation. Genes are ranked by the normalized variance.
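The Seurat v3 steps above can be sketched in a few lines of pure Python. This is a minimal sketch under stated assumptions: the regularized standard deviation per gene is taken as an input (in the actual method it comes from a loess fit), the data are laid out gene by gene, and the name `normalized_variance` is hypothetical. Any further steps the real implementation performs are omitted.

```python
from statistics import mean, variance


def normalized_variance(counts_by_gene, regularized_std):
    """Return per-gene normalized variances and gene indices ranked by them.

    counts_by_gene[g] holds the counts of gene g across all cells;
    regularized_std[g] is an assumed, precomputed regularized standard deviation.
    """
    norm_var = []
    for counts, sd in zip(counts_by_gene, regularized_std):
        mu = mean(counts)
        # Standardize each gene with its regularized standard deviation.
        z = [(x - mu) / sd for x in counts]
        # The normalized variance is the variance of the standardized values.
        norm_var.append(variance(z))
    # Rank genes from highest to lowest normalized variance.
    order = sorted(range(len(norm_var)), key=lambda g: norm_var[g], reverse=True)
    return norm_var, order
```

A gene whose observed spread exceeds what its regularized standard deviation predicts ends up with a normalized variance above 1 and therefore ranks higher.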
- Parameters:
- adata
AnnData
AnnData object
- layer
str
(default: None) If provided, use adata.layers[layer] for expression values instead of adata.X.
- min_mean
float
(default: 0.0125) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- max_mean
float
(default: 3) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- min_disp
float
(default: 0.5) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- max_disp
float
(default: inf) If n_top_genes is not None, this and all other cutoffs for the means and the normalized dispersions are ignored.
- n_top_genes
int
(default: None) Number of highly-variable genes to keep.
- flavor
str
(default: 'seurat') Choose the flavor (seurat, cell_ranger, seurat_v3, pearson_residuals, poisson_gene_selection) for identifying highly variable genes. For the dispersion-based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
- n_bins
int
(default: 20) Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1.
- span
float
(default: 0.3) The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor='seurat_v3'.
- check_values
bool
(default: True) Check whether the counts in the selected layer are integers. If True, a warning is issued when non-integer values are found. Only used if flavor='seurat_v3' or 'pearson_residuals'.
- theta
int
(default: 100) The negative binomial overdispersion parameter theta for Pearson residuals. Higher values correspond to less overdispersion (var = mean + mean^2/theta), and theta=np.inf corresponds to a Poisson model.
- clip
bool
(default: None) Only used if flavor='pearson_residuals'. Determines if and how residuals are clipped:
  - If None, residuals are clipped to the interval [-sqrt(n_obs), sqrt(n_obs)], where n_obs is the number of cells in the dataset (default behavior).
  - If any scalar c, residuals are clipped to the interval [-c, c]. Set clip=np.inf for no clipping.
- chunksize
int
(default: 1000) If flavor='poisson_gene_selection', this determines how many genes are processed at once. Choosing a smaller value will reduce the required memory.
- n_samples
int
(default: 10000) The number of Binomial samples to use to estimate posterior probability of enrichment of zeros for each gene (only for flavor='poisson_gene_selection').
- batch_key
str
(default: None) If specified, highly-variable genes are selected within each batch separately and merged.
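To make the roles of theta and clip concrete, here is a small pure-Python sketch of Pearson residuals for a count matrix. It uses the variance model var = mean + mean^2/theta and the clipping rules from the parameter descriptions. The expected count per entry (cell total times gene total, divided by the grand total) and the function name `pearson_residuals` are assumptions for illustration, not a copy of the library's GPU code.

```python
import math


def pearson_residuals(counts, theta=100.0, clip=None):
    """Compute clipped Pearson residuals for a cells-by-genes list of lists."""
    n_obs = len(counts)
    n_vars = len(counts[0])
    cell_totals = [sum(row) for row in counts]
    gene_totals = [sum(row[j] for row in counts) for j in range(n_vars)]
    total = sum(cell_totals)

    # Default clipping interval is [-sqrt(n_obs), sqrt(n_obs)].
    if clip is None:
        clip = math.sqrt(n_obs)

    residuals = []
    for i, row in enumerate(counts):
        out_row = []
        for j, x in enumerate(row):
            # Assumed expected count under the null model.
            mu = cell_totals[i] * gene_totals[j] / total
            # Negative binomial variance; theta = inf reduces to Poisson (var = mu).
            var = mu + mu * mu / theta
            r = (x - mu) / math.sqrt(var) if var > 0 else 0.0
            # Clip to [-clip, clip]; clip = inf leaves the residual unchanged.
            out_row.append(max(-clip, min(clip, r)))
        residuals.append(out_row)
    return residuals
```

For example, `pearson_residuals([[10, 0], [0, 10]], theta=math.inf, clip=math.inf)` uses a Poisson variance with no clipping, while leaving `clip=None` bounds every residual by the square root of the number of cells.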
- Return type:
- Returns:
Updates adata.var with the following fields:
- highly_variable : bool
Boolean indicator of highly-variable genes.
- means : float
Means per gene.
- dispersions : float
For dispersion-based flavors, dispersions per gene.
- dispersions_norm : float
For dispersion-based flavors, normalized dispersions per gene.
- variances : float
For flavor='seurat_v3' and 'pearson_residuals', variance per gene.
- variances_norm : float
For flavor='seurat_v3', normalized variance per gene, averaged in the case of multiple batches.
- residual_variances : float
For flavor='pearson_residuals', residual variance per gene, averaged in the case of multiple batches.
- highly_variable_rank : float
For flavor='seurat_v3' and 'pearson_residuals', rank of the gene according to normalized variance, median rank in the case of multiple batches.
- highly_variable_nbatches : int
If batch_key is given, this denotes in how many batches genes are detected as HVG.
- highly_variable_intersection : bool
If batch_key is given, this denotes the genes that are highly variable in all batches.
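The two batch-related fields can be understood as simple summaries of per-batch selections. The sketch below assumes one boolean mask per batch, all over the same genes, and derives both fields from them. The function name `merge_batch_flags` is hypothetical, and how the library combines batches into the final highly_variable column is not shown here.

```python
def merge_batch_flags(per_batch_flags):
    """Summarize per-batch highly-variable masks.

    per_batch_flags[b][g] is True if gene g was selected in batch b.
    Returns (nbatches, intersection) as lists with one entry per gene.
    """
    n_batches = len(per_batch_flags)
    n_genes = len(per_batch_flags[0])
    # Number of batches in which each gene was detected as highly variable.
    nbatches = [sum(flags[g] for flags in per_batch_flags) for g in range(n_genes)]
    # A gene is in the intersection only if every batch selected it.
    intersection = [n == n_batches for n in nbatches]
    return nbatches, intersection
```

With two batches, a gene flagged in both gets a count of 2 and is part of the intersection, while a gene flagged in only one batch gets a count of 1 and is not.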