Basic Demo Workflow#

Author: Severin Dicks Copyright scverse

This tutorial demonstrates how rapids-singlecell utilizes GPU-acceleration to streamline single-cell workflows. By offloading computationally intensive tasks to the GPU, we transform traditionally day-long bottlenecks into near-instant analyses without increasing code complexity.

import warnings

warnings.filterwarnings("ignore")

import anndata as ad
import scanpy as sc
import cupy as cp

import time
import rapids_singlecell as rsc
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.reinitialize(
    managed_memory=False,  # Allows oversubscription
    pool_allocator=False,  # default is False
    devices=0,  # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm_cupy_allocator)

Load and Prepare Data#

First, we load the AnnData object which contains a sparse count matrix using AnnData. The sparse count matrix will then be placed on the GPU.

data_load_start = time.time()
%%time
import pooch
path = pooch.retrieve(
    url="https://exampledata.scverse.org/rapids-singlecell/dli_census.h5ad",
    fname="dli_census.h5ad",
)
adata = ad.read(path)
%%time
rsc.get.anndata_to_GPU(adata)
CPU times: user 152 ms, sys: 2.2 s, total: 2.35 s
Wall time: 2.35 s
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time - data_load_start))
Total data load and format time: 4.077548503875732

Preprocessing#

preprocess_start = time.time()
adata.var_names = adata.var.feature_name

Quality Control#

We calculate quality control (QC) metrics to assess cell and gene quality. These include:

  • Per cell metrics:

    • Total counts per cell (library size)

    • Number of detected genes per cell

    • Percentage of counts from mitochondrial (MT) and ribosomal (RIBO) genes

  • Per gene metrics (gene space):

    • Total counts per gene

    • Number of cells expressing each gene

These metrics help identify low-quality or stressed cells and ensure a meaningful feature set for downstream analysis.

%%time
rsc.pp.flag_gene_family(adata, gene_family_name="MT", gene_family_prefix="MT")
CPU times: user 6.75 ms, sys: 0 ns, total: 6.75 ms
Wall time: 6.71 ms
%%time
rsc.pp.flag_gene_family(adata, gene_family_name="RIBO", gene_family_prefix="RPS")
CPU times: user 6.3 ms, sys: 0 ns, total: 6.3 ms
Wall time: 6.22 ms
%%time
rsc.pp.calculate_qc_metrics(adata, qc_vars=["MT", "RIBO"])
CPU times: user 66.7 ms, sys: 7.9 ms, total: 74.6 ms
Wall time: 74.4 ms

To visualize the quality control (QC) metrics, we generate the following plots:

  1. Scatter plot: Total counts vs. Mitochondrial percentage

    • This plot shows the relationship between the total UMI counts per cell and the percentage of mitochondrial (MT) gene expression.

    • Cells with high mitochondrial percentages may indicate stressed or dying cells.

  2. Scatter plot: Total counts vs. Number of detected genes

    • This plot displays the correlation between the total UMI counts per cell and the number of detected genes.

    • A strong correlation is expected, but outliers with low gene counts might indicate empty droplets or dead cells.

  3. Violin plot: Number of detected genes per cell

    • This violin plot visualizes the distribution of the number of detected genes per cell.

    • It helps identify cells with abnormally low or high gene counts, which could be filtered out.

  4. Violin plot: Total counts per cell

    • This plot shows the distribution of total counts per cell, indicating library size variation.

    • Extreme values may suggest low-quality or overly dominant cells.

  5. Violin plot: Percentage of mitochondrial counts per cell

    • This plot illustrates the distribution of mitochondrial gene expression across all cells.

    • High mitochondrial content could be a sign of cell stress or apoptosis.

These visualizations help assess dataset quality and guide decisions on filtering low-quality cells.

sc.pl.scatter(adata, x="total_counts", y="pct_counts_MT")
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts")
../_images/0fec43720500881ef04fbdcb48a55be3b6664bc25320c67c1a9b4d376f336e38.png ../_images/6fbc470dfc267a4aec36cb9f80c6d9ea9ac09d32645f7400f61182939aaa037b.png
sc.pl.violin(adata, "n_genes_by_counts", jitter=0.4, groupby="assay")
sc.pl.violin(adata, "total_counts", jitter=0.4, groupby="assay")
sc.pl.violin(adata, "pct_counts_MT", jitter=0.4, groupby="assay")
../_images/e61a53f1ab5b84caa0e95f9fe48c0a3f43d5541e0980c12d4cf63b98e3a49587.png ../_images/d672b5c4c59546d4293f85156433b8ab2f0c7e6261f86ce4a9a4c6e8f6a81689.png ../_images/141ef7ed9ee9db3e1e74f1a89a133a77b2f752a963a34c76e91085878e1a7509.png

Filter#

We filter the count matrix to remove cells with an extreme number of genes expressed. We also filter out cells with a mitchondrial countent of more than 20%.

%%time
adata = adata[adata.obs["n_genes_by_counts"] < 5000]
adata = adata[adata.obs["pct_counts_MT"] < 20]
CPU times: user 125 ms, sys: 16.1 ms, total: 141 ms
Wall time: 141 ms

We also filter out genes that are expressed in less than 3 cells.

%%time
rsc.pp.filter_genes(adata, min_cells=3)
filtered out 33823 genes that are detected in less than 3 cells
CPU times: user 411 ms, sys: 108 ms, total: 519 ms
Wall time: 749 ms

The size of our count matrix is now reduced.

adata.shape
(213082, 28065)

Normalize#

We normalize the count matrix so that the total counts in each cell sum to 1e4.

%%time
rsc.pp.normalize_total(adata, target_sum=1e4)
CPU times: user 744 μs, sys: 86 μs, total: 830 μs
Wall time: 682 μs

Next, we data transform the count matrix.

%%time
rsc.pp.log1p(adata)
CPU times: user 9.37 ms, sys: 20.4 ms, total: 29.7 ms
Wall time: 92.2 ms

Select Most Variable Genes#

We now search for highly variable genes. This function supports the flavors cell_ranger, seurat, seurat_v3, and pearson_residuals.
As in Scanpy, you can either filter based on variance cutoffs or select the n_top_genes. Additionally, you can use a batch_key to correct for batch effects.

In this example, we use the cell_ranger method, which selects highly variable genes based on the log-normalized counts stored in .X.

%%time
rsc.pp.highly_variable_genes(adata, n_top_genes=5000, flavor="cell_ranger")
CPU times: user 204 ms, sys: 16.2 ms, total: 220 ms
Wall time: 218 ms

Now we safe this version of the AnnData as adata.raw.

%%time
adata.raw = adata
CPU times: user 231 ms, sys: 146 ms, total: 377 ms
Wall time: 376 ms

We now restrict our AnnData object to only the highly variable genes.
This step reduces the number of features, focusing the analysis on the most informative genes while improving computational efficiency.

%%time
rsc.pp.filter_highly_variable(adata)
CPU times: user 493 ms, sys: 318 ms, total: 811 ms
Wall time: 810 ms

Next, we regress out the effects of total counts per cell and mitochondrial content.
This helps remove technical variation that could bias downstream analyses.

As in Scanpy, you can regress out any numerical column from .obs, allowing for the correction of batch effects, sequencing depth, or other confounding factors.

%%time
rsc.pp.regress_out(adata, keys=["total_counts", "pct_counts_MT"])
CPU times: user 702 ms, sys: 808 ms, total: 1.51 s
Wall time: 1.51 s

Scale#

Finally, we scale the count matrix to obtain a z-score transformation, standardizing gene expression across cells.
To prevent extreme values from dominating the analysis, we apply a cutoff of 10 standard deviations.
This ensures that highly expressed genes do not disproportionately influence downstream computations.

%%time
rsc.pp.scale(adata, max_value=10)
CPU times: user 58.9 ms, sys: 852 μs, total: 59.8 ms
Wall time: 59.1 ms

Now we move .X out of the GPU.

preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time - preprocess_start))
Total Preprocessing time: 8.972412824630737

Clustering and Visualization#

Principal component analysis#

We use Principal Component Analysis (PCA) to reduce the dimensionality of the gene expression matrix to its top 100 principal components.
This step captures the most important sources of variation in the data while reducing computational complexity.

We use the GPU-accelerated PCA implementation from cuML, which significantly speeds up computation compared to CPU-based methods.

%%time
rsc.tl.pca(adata, n_comps=100)
CPU times: user 1.26 s, sys: 145 ms, total: 1.4 s
Wall time: 1.61 s

We can use scanpy pca_variance_ratio plot to inspect the contribution of single PCs to the total variance in the data.

sc.pl.pca_variance_ratio(adata, log=True, n_pcs=100)
../_images/26399926bb16af7ebf41ff3258c46ba1ef40a29deb7d154b0d2a3d39e37a11f8.png

Now, we transfer the .X matrix back to host (CPU) memory to free up GPU resources.

Since some downstream analyses or visualizations may not require GPU acceleration, this step helps optimize memory usage,
preventing unnecessary GPU load while still keeping the processed data accessible for further analysis.

rsc.get.anndata_to_CPU(adata)

Computing the neighborhood graph and UMAP#

Next, we compute the neighborhood graph using rsc.

Scanpy’s CPU-based implementation of nearest neighbor search uses an approximation, while the GPU-accelerated RAPIDS implementation computes the exact graph.
Both methods are valid, but small differences in results can occur.

The following approximate nearest neighbor (ANN) algorithms are also supported:

  • ivfflat: Uses an inverted file index for fast approximate search, suitable for very large datasets.

  • ivfpq: A variant of ivfflat that compresses data for even more efficient memory usage, trading off some accuracy.

  • cagra: A graph-based ANN method optimized for fast, high-accuracy queries on GPUs.

  • nn_descent: A graph-based method that incrementally refines the nearest neighbor search and works well for large, high-dimensional datasets.

Since our dataset is relatively small, we use brute-force (brute) search to ensure exact results.

%%time
rsc.pp.neighbors(adata, n_neighbors=15, n_pcs=40)
CPU times: user 1.15 s, sys: 148 ms, total: 1.3 s
Wall time: 1.29 s

Next, we calculate the UMAP embedding using RAPIDS.

UMAP (Uniform Manifold Approximation and Projection) is a dimensionality reduction technique that preserves local and global structure in the data.
The RAPIDS implementation accelerates UMAP computation on GPUs, making it significantly faster than the standard CPU version.

%%time
rsc.tl.umap(adata)
CPU times: user 287 ms, sys: 77.1 ms, total: 364 ms
Wall time: 362 ms

Clustering#

Next, we use the Leiden algorithm for graph-based clustering with RAPIDS.

Leiden detects communities (clusters) in the neighborhood graph by optimizing modularity, while guaranteeing well-connected, stable clusters.

Using RAPIDS accelerates the clustering process on GPUs, making it significantly faster than traditional CPU implementations.

%%time
rsc.tl.leiden(adata, resolution=0.6)
CPU times: user 697 ms, sys: 1.13 s, total: 1.83 s
Wall time: 1.82 s
%%time
sc.pl.umap(adata, color="leiden", legend_loc="on data")
sc.pl.umap(adata, color=["cell_type", "assay"],ncols=1,)
../_images/b6a80d582ecac46e4b56fb1a96725c5c00a28f7d8e84ceaab44db37e3cb26c3d.png

Batch Correction#

In the previous UMAP, we observed strong batch effects between assay types.
To correct for this, we apply Harmony, a method that aligns different batches while preserving biological variation.

After applying Harmony, we will redo:

  • Neighborhood search to recompute the k-nearest neighbor graph

  • UMAP to visualize the corrected embedding

  • Graph-based clustering to identify cell populations without batch-driven artifacts

This ensures that batch effects do not drive clustering results while retaining meaningful biological structure.

import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.reinitialize(
    managed_memory=False,  # Allows oversubscription
    pool_allocator=True,  # default is False
    devices=0,  # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm_cupy_allocator)
%%time
rsc.pp.harmony_integrate(adata, key="assay", dtype=cp.float32)
CPU times: user 4.41 s, sys: 117 ms, total: 4.53 s
Wall time: 4.53 s
%%time
rsc.pp.neighbors(adata, n_neighbors=15, n_pcs=40,use_rep="X_pca_harmony",key_added="harmony")
rsc.tl.umap(adata,neighbors_key="harmony",key_added="X_umap_harmony")
rsc.tl.leiden(adata, resolution=0.6,neighbors_key="harmony",key_added="leiden_harmony")
%%time
sc.pl.embedding(adata,basis="X_umap_harmony", color="leiden_harmony", legend_loc="on data")
sc.pl.embedding(adata,basis="X_umap_harmony", color=["cell_type", "assay"], ncols=1)
../_images/5ba253bcba2b4c23ce01a94b11bfbf31acc94b399656d4ce23ea99730b3abb84.png

TSNE#

Next, we use t-distributed Stochastic Neighbor Embedding (t-SNE) to visualize cells in two dimensions.
t-SNE is a non-linear dimensionality reduction technique that preserves local structure and reveals complex patterns in the data.

We leverage the RAPIDS GPU implementation of t-SNE, which is significantly faster than CPU-based methods.
This allows us to efficiently process large datasets while maintaining high-quality embeddings.

%%time
rsc.tl.tsne(adata, n_pcs=40, perplexity=30, early_exaggeration=12, learning_rate=200, use_rep="X_pca_harmony")
[W] [16:13:09.501768] # of Nearest Neighbors should be at least 3 * perplexity. Your results might be a bit strange...
CPU times: user 2.31 s, sys: 2.29 s, total: 4.6 s
Wall time: 4.57 s
sc.pl.tsne(adata, color=["cell_type", "assay"], ncols=1)
../_images/9cd088ef31f46c8b287ae9a36b489bfe8090e33aaed03b8cdd71da250a102f86.png

Differential expression analysis#

To identify key marker genes for each cell type, we use logistic regression to compute a ranking of highly differential genes.
Unlike traditional statistical tests, logistic regression models the probability of a gene being highly expressed in a given cluster
while accounting for all genes simultaneously, making it more robust for complex datasets.

We rank the top 50 genes that best distinguish each cell type, helping to characterize biological differences between clusters.

%%time
rsc.tl.rank_genes_groups_logreg(adata, groupby="cell_type", use_raw=False)
[W] [16:13:33.249025] L-BFGS: max iterations reached
[W] [16:13:33.249241] Maximum iterations reached before solver is converged. To increase model accuracy you can increase the number of iterations (max_iter) or improve the scaling of the input data.
CPU times: user 11.1 s, sys: 6.64 s, total: 17.7 s
Wall time: 17.7 s
sc.pl.rank_genes_groups(adata)
../_images/55241a6efed02d1a5ee2b0a6c3f768629a7f541252042c01472b9606978430d5.png
post_time = time.time()
print("Total Postprocessing time: %s" % (post_time - preprocess_time))
Total Postprocessing time: 46.12144637107849

Diffusion Map Embedding#

Next, we compute Diffusion Maps, a nonlinear dimensionality reduction method that preserves the continuous structure of the data.
Unlike UMAP and t-SNE, Diffusion Maps are well-suited for capturing trajectory-like relationships in single-cell data.

We use the batch-corrected neighborhood graph (neighbors_key="harmony") to ensure that the embedding is not influenced by batch effects.

%%time
rsc.tl.diffmap(adata,neighbors_key="harmony")
#Due to an issue with plotting in Scanpy, we need to shift the first component of the diffusion map by removing the first column:
adata.obsm["X_diffmap"] = adata.obsm["X_diffmap"][:, 1:]
CPU times: user 1.54 s, sys: 261 μs, total: 1.54 s
Wall time: 1.54 s
sc.pl.diffmap(adata, color="cell_type")
../_images/b4cc2f87dc4019a25aafac3419fc02e0f50e42f736fcb729e008e94d819e2c2d.png
print("Total Processing time: %s" % (time.time() - preprocess_start))
Total Processing time: 57.584962129592896
adata.write("h5/dli_decoupler.h5ad")