Pearson Residuals Example#

Author: Severin Dicks

To run this notebook, please make sure you have a working RAPIDS environment with all necessary dependencies installed. Run the data_downloader notebook first to create the AnnData object we are working with. In this example workflow, we'll be looking at a dataset of 500,000 brain cells from NVIDIA.

import scanpy as sc
import cupy as cp

import time
import rapids_singlecell as rsc

import warnings

warnings.filterwarnings("ignore")
import rmm
from rmm.allocators.cupy import rmm_cupy_allocator

rmm.reinitialize(
    managed_memory=False,  # set to True to allow memory oversubscription
    pool_allocator=False,  # default is False
    devices=0,  # GPU device IDs to register. By default registers only GPU 0.
)
cp.cuda.set_allocator(rmm_cupy_allocator)
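
If the dataset were too large for GPU memory, RMM could instead be initialized with managed memory, which allows oversubscription of the GPU at some performance cost. A minimal sketch of this alternative configuration:

rmm.reinitialize(
    managed_memory=True,  # enable oversubscription via CUDA unified memory
    pool_allocator=False,
    devices=0,
)
cp.cuda.set_allocator(rmm_cupy_allocator)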

Load and Prepare Data#

We load the sparse count matrix from an h5ad file using Scanpy and then place it on the GPU.

data_load_start = time.time()
%%time
adata = sc.read("h5/500000.h5ad")
adata.var_names_make_unique()
CPU times: user 564 ms, sys: 2.52 s, total: 3.09 s
Wall time: 10.3 s
%%time
rsc.get.anndata_to_GPU(adata)
CPU times: user 414 ms, sys: 1.62 s, total: 2.04 s
Wall time: 2.04 s

Verify the shape of the resulting sparse matrix:

adata.shape
(500000, 27998)

And the number of non-zero values in the matrix:

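adata.X.nnz  # stored (non-zero) entries of the sparse GPU matrix

Finally, we record the total time for loading and formatting the data:
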
data_load_time = time.time()
print("Total data load and format time: %s" % (data_load_time - data_load_start))
Total data load and format time: 12.361832857131958

Preprocessing#

preprocess_start = time.time()

Quality Control#

We perform basic quality control and plot the results.

%%time
rsc.pp.flag_gene_family(adata, gene_family_name="MT", gene_family_prefix="mt-")
CPU times: user 5.41 ms, sys: 0 ns, total: 5.41 ms
Wall time: 5.29 ms
%%time
rsc.pp.calculate_qc_metrics(adata, qc_vars=["MT"])
CPU times: user 148 ms, sys: 9.6 ms, total: 158 ms
Wall time: 233 ms
sc.pl.scatter(adata, "total_counts", "pct_counts_MT")
sc.pl.scatter(adata, "total_counts", "n_genes_by_counts")
sc.pl.violin(adata, keys="n_genes_by_counts")
sc.pl.violin(adata, keys="total_counts")
sc.pl.violin(adata, keys="pct_counts_MT")

Filter#

We filter the count matrix to remove cells with an extreme number of genes expressed. We also filter out cells with a mitochondrial content of more than 20%.

This reduces the size of our count matrix.

%%time
rsc.pp.filter_genes(adata, min_count=10)
filtered out 7663 genes based on n_cells_by_counts
CPU times: user 378 ms, sys: 42.2 ms, total: 421 ms
Wall time: 439 ms
%%time
rsc.pp.filter_cells(adata, qc_var="n_genes_by_counts", min_count=500, max_count=4000)
rsc.pp.filter_cells(adata, qc_var="total_counts", max_count=20000)
adata = adata[adata.obs["pct_counts_MT"] < 20]
filtered out 15844 cells
filtered out 5 cells
CPU times: user 139 ms, sys: 62.4 ms, total: 201 ms
Wall time: 765 ms

We copy the raw counts into the counts layer.

adata.layers["counts"] = adata.X.copy()

Log-Normalize counts#

We normalize the count matrix so that the total counts in each cell sum to 1e4.

%%time
rsc.pp.normalize_total(adata, target_sum=1e4)
CPU times: user 1.86 ms, sys: 692 µs, total: 2.55 ms
Wall time: 8.99 ms

Next, we log transform the count matrix.

%%time
rsc.pp.log1p(adata)
CPU times: user 30.2 ms, sys: 3.87 ms, total: 34.1 ms
Wall time: 50.9 ms

Now we save this version of the AnnData object as adata.raw.

%%time
adata.raw = adata
CPU times: user 666 ms, sys: 456 ms, total: 1.12 s
Wall time: 1.12 s

Select Most Variable Genes#

Now we search for highly variable genes. This function supports the flavors cell_ranger, seurat, seurat_v3, and pearson_residuals. As in Scanpy, you can filter based on cutoffs or select the top n genes. You can also use a batch_key to reduce batch effects.

In this example, we use the pearson_residuals flavor to select highly variable genes based on .layers["counts"].

%%time
rsc.pp.highly_variable_genes(
    adata, n_top_genes=5000, flavor="pearson_residuals", layer="counts"
)
CPU times: user 3.01 s, sys: 67.7 ms, total: 3.08 s
Wall time: 3.15 s
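
If your dataset contains multiple batches, highly variable genes can also be selected per batch, as mentioned above. A minimal sketch, assuming a hypothetical .obs column named "batch":

rsc.pp.highly_variable_genes(
    adata,
    n_top_genes=5000,
    flavor="pearson_residuals",
    layer="counts",
    batch_key="batch",  # hypothetical obs column identifying the batches
)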

Now we restrict our AnnData object to the highly variable genes.

%%time
rsc.pp.filter_highly_variable(adata)
CPU times: user 1.32 s, sys: 978 ms, total: 2.3 s
Wall time: 2.33 s
adata.shape
(483870, 5000)

Normalize#

We normalize the raw count matrix with Pearson residuals.

%%time
adata.layers["pearson_residuals"] = rsc.pp.normalize_pearson_residuals(
    adata, layer="counts", inplace=False
)
CPU times: user 974 ms, sys: 12.9 ms, total: 987 ms
Wall time: 1 s
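
Like Scanpy's experimental implementation, this function exposes the overdispersion parameter theta and a clipping value. A hedged sketch, assuming these keyword arguments are available in your rapids-singlecell version:

adata.layers["pearson_residuals"] = rsc.pp.normalize_pearson_residuals(
    adata,
    layer="counts",
    theta=100,  # assumed NB overdispersion parameter (the common default)
    clip=None,  # assumed: None clips residuals to +/- sqrt(n_obs)
    inplace=False,
)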

Principal component analysis#

We use PCA to reduce the dimensionality of the matrix to its top 150 principal components. We use the PCA implementation from cuML.

We can use Scanpy's pca_variance_ratio plot to inspect the contribution of individual PCs to the total variance in the data.

%%time
rsc.pp.pca(adata, n_comps=150, layer="pearson_residuals")
CPU times: user 2.58 s, sys: 110 ms, total: 2.69 s
Wall time: 3.28 s
sc.pl.pca_variance_ratio(adata, log=True, n_pcs=150)

Now we move .X and .layers off the GPU and back to host memory.

%%time
rsc.get.anndata_to_CPU(adata, convert_all=True)
CPU times: user 1.07 s, sys: 988 ms, total: 2.06 s
Wall time: 2.06 s
preprocess_time = time.time()
print("Total Preprocessing time: %s" % (preprocess_time - preprocess_start))
Total Preprocessing time: 24.27028512954712

We have now finished the preprocessing of the data.

Clustering and Visualization#

Computing the neighborhood graph and UMAP#

Next we compute the neighborhood graph using rsc.

Scanpy's CPU implementation of nearest neighbors uses an approximation, while the GPU version computes the exact graph. Both methods are valid, but you might see differences.

%%time
rsc.pp.neighbors(adata, n_neighbors=15, n_pcs=60)
CPU times: user 4.92 s, sys: 94 ms, total: 5.02 s
Wall time: 5.04 s

Next we calculate the UMAP embedding using RAPIDS.

%%time
rsc.tl.umap(adata)
CPU times: user 1.22 s, sys: 71.9 ms, total: 1.29 s
Wall time: 1.29 s
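
As in Scanpy, the embedding can be tuned; a minimal sketch, assuming the usual UMAP keyword arguments are supported:

rsc.tl.umap(adata, min_dist=0.3, spread=1.0)  # smaller min_dist yields tighter clusters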

Clustering#

Next, we use the Louvain and Leiden algorithms for graph-based clustering.

%%time
rsc.tl.louvain(adata, resolution=1.0)
CPU times: user 4.85 s, sys: 4.38 s, total: 9.24 s
Wall time: 11.1 s
%%time
rsc.tl.leiden(adata, resolution=1.0)
CPU times: user 1.51 s, sys: 3.62 s, total: 5.13 s
Wall time: 5.13 s
%%time
sc.pl.umap(adata, color=["louvain", "leiden"], legend_loc="on data")
CPU times: user 2.42 s, sys: 163 ms, total: 2.58 s
Wall time: 2.41 s
%%time
sc.pl.umap(adata, color=["Stmn2", "Hes1"], legend_loc="on data", cmap="Reds")
CPU times: user 3.33 s, sys: 162 ms, total: 3.49 s
Wall time: 3.32 s
%%time
rsc.tl.diffmap(adata)
CPU times: user 4.24 s, sys: 14.7 s, total: 18.9 s
Wall time: 1.7 s
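# Drop the first diffusion component, which is approximately constant, before plotting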
adata.obsm["X_diffmap_"] = adata.obsm["X_diffmap"][:, 1:]
sc.pl.embedding(adata, "diffmap_", color="louvain")
%%time
rsc.tl.rank_genes_groups_logreg(adata, groupby="louvain", use_raw=False)
[W] [17:00:21.909123] L-BFGS: max iterations reached
[W] [17:00:21.910102] Maximum iterations reached before solver is converged. To increase model accuracy you can increase the number of iterations (max_iter) or improve the scaling of the input data.
CPU times: user 1min 5s, sys: 3.42 s, total: 1min 9s
Wall time: 1min 9s
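
The solver warning above suggests raising max_iter. A hedged sketch, assuming extra keyword arguments are forwarded to cuML's logistic regression:

rsc.tl.rank_genes_groups_logreg(
    adata, groupby="louvain", use_raw=False, max_iter=2000
)
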
%%time
sc.pl.rank_genes_groups(adata, n_genes=20)
CPU times: user 2.8 s, sys: 225 ms, total: 3.03 s
Wall time: 2.85 s
print("Total Processing time: %s" % (time.time() - preprocess_start))
Total Processing time: 128.05585312843323